Because the LinkWad link I posted earlier is not very self-documenting, here is a brief attempt at explaining what is going on.
I started out by collecting small functional words for the languages I am interested in at this point. For languages which I don't speak at all, the Wikipedia pages in that language were useful for picking up some apparent functional words (for example, Albanian and Catalan got this treatment).
I want functional words which are frequently used and which ideally don't clash with those in other languages, although that is of course very hard to guarantee. At the very least, it would be good to have distinctive words for closely related languages, so that I don't get them mixed up.
Finally, in order to avoid misses or other accidents due to problematic, erroneous, or nonstandard encodings, I wish to use only words which are pure 7-bit ASCII. This turned out to be problematic for Albanian, for example, where most of the candidate words have diacritics.
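The 7-bit ASCII requirement is easy to check mechanically. Here is a minimal sketch in Python; the function names are my own, and the candidate list is only illustrative (the first four words are the Albanian words used in this post, the last three are examples of words that get rejected).

```python
def is_pure_ascii(word: str) -> bool:
    """Return True if every character of word fits in 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in word)


def ascii_only(candidates):
    """Keep only the candidate words that are pure 7-bit ASCII."""
    return [word for word in candidates if is_pure_ascii(word)]


if __name__ == "__main__":
    # Illustrative candidate list, not the author's actual working list.
    candidates = ["dhe", "nga", "me", "tij", "është", "për", "një"]
    print(ascii_only(candidates))
```

Anything with a diacritic is dropped, which is exactly why the Albanian list ended up so short.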
To these functional words, I added some search terms in order to focus the search on mailing lists. Specifically, I want "Subject" and "Message-Id", as in email headers, in the hits. Moreover, I added inurl:pipermail in order to require that the hits have the word "pipermail" in their URL. As documented earlier, I try to make this as simple and standard as possible for myself at this point, so I want hits from a widely used, standard mailing list archive manager which offers a simple, downloadable text-only format (Berkeley mbox files).
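Putting a query together from these pieces can be sketched as follows. This is only an illustration: the order of the terms, the lack of quoting, and the search URL are my assumptions here, not a record of the exact queries I ran.

```python
from urllib.parse import urlencode

# Terms shared by every query: two email header names plus the URL filter.
FIXED_TERMS = ["Subject", "Message-Id", "inurl:pipermail"]


def build_query(words):
    """Join the functional words and the fixed terms into one query string."""
    return " ".join(list(words) + FIXED_TERMS)


def search_url(words, base="https://www.google.com/search"):
    """Return a search URL for the query (the base URL is an assumption)."""
    return base + "?" + urlencode({"q": build_query(words)})


if __name__ == "__main__":
    german = ["aus", "bei", "sind", "zu"]
    print(build_query(german))
    print(search_url(german))
```

Only the functional words change from one language to the next; everything else stays fixed.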
Here are the resulting searches. I only list the functional words I picked for each language; the other search terms are the same in every query.
- English: there the with who
- German: aus bei sind zu
- French: mais sont les elle dont
- Italian: su sei della nel
- Spanish: el que y las
- Catalan: ens us fer i
- Portuguese: um uma em na
- Swedish: av från på med (improved: av till och ett finds more lists too, 783 hits -- apparently precisely because of character encoding problems. I should have known!)
- Danish: af fra til (should have one or two more?)
- Finnish: oliko mutta saa voi (not very functional, mostly verbs -- update?); improved: ei joka on mutta (246 hits)
- Estonian: siit mis mida ka
- Albanian: dhe nga me tij
I also tried to peruse the Icelandic and Faroese Wikipedias in order to come up with a good list of minimal word pairs which would distinguish them from each other, but I failed -- the results I got for my Faroese search also included some Icelandic, and apparently some Old Norse as well. Besides, there don't seem to be any Pipermail archives in Google with these search words in them.
I am also noting here for possible future use:
- Icelandic: sem eru um af (106 hits)
- Polish: w do tej od z (763 hits)
- Letzebuergisch: vu vun vum zu op der (8 hits) -- I had to add more words because "vun" seems to be popular in some base64 attachments ...
- Afrikaans: by deur oor sy (39 hits)
- Basque: eta bat ez da bi (319 hits)
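Since I want distinctive words for each language, it is worth checking whether any word ended up on more than one list. A rough sketch in Python, using the lists from this post (for Swedish and Finnish I picked the "improved" variants; the function name is made up for this example):

```python
from collections import defaultdict

# Word lists as given in this post.
WORDS = {
    "English": "there the with who",
    "German": "aus bei sind zu",
    "French": "mais sont les elle dont",
    "Italian": "su sei della nel",
    "Spanish": "el que y las",
    "Catalan": "ens us fer i",
    "Portuguese": "um uma em na",
    "Swedish": "av till och ett",
    "Danish": "af fra til",
    "Finnish": "ei joka on mutta",
    "Estonian": "siit mis mida ka",
    "Albanian": "dhe nga me tij",
    "Icelandic": "sem eru um af",
    "Polish": "w do tej od z",
    "Letzebuergisch": "vu vun vum zu op der",
    "Afrikaans": "by deur oor sy",
    "Basque": "eta bat ez da bi",
}


def find_clashes(word_lists):
    """Map each word used by more than one language to those languages."""
    seen = defaultdict(list)
    for language, words in word_lists.items():
        for word in words.split():
            seen[word].append(language)
    return {word: langs for word, langs in seen.items() if len(langs) > 1}


if __name__ == "__main__":
    for word, languages in find_clashes(WORDS).items():
        print(word, "->", ", ".join(languages))
```

On these lists it flags "zu" (German and Letzebuergisch), "um" (Portuguese and Icelandic) and "af" (Danish and Icelandic). It only compares my own lists with each other, so it says nothing about words which merely exist in another language without being on that language's list.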
For languages with their own writing systems, I imagine I will need to take a different approach. First of all, I'll need to establish what the possible encodings are, and perhaps roughly their relative ratios of real-world use. Perhaps then it would be enough to collect material in UTF-8 and use recode to convert it to the other possible encodings. And that probably still won't cut it for languages with a Cyrillic writing system.
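To get a feel for what the conversion step would produce, here is a small sketch which uses Python's built-in codecs rather than recode. The list of candidate encodings and the sample word are assumptions for illustration only; establishing which encodings are actually in use is the open question.

```python
def encode_variants(text, encodings):
    """Encode text in each candidate encoding, skipping those that cannot represent it."""
    variants = {}
    for name in encodings:
        try:
            variants[name] = text.encode(name)
        except UnicodeEncodeError:
            # This encoding has no representation for some character; skip it.
            continue
    return variants


# Assumed candidates; latin-1 is included to show an encoding being skipped.
CANDIDATE_ENCODINGS = ["utf-8", "koi8_r", "cp1251", "iso8859_5", "latin-1"]

if __name__ == "__main__":
    word = "что"  # sample Russian word, chosen only as an example
    for name, raw in encode_variants(word, CANDIDATE_ENCODINGS).items():
        print(name, raw.hex())
```

The same word comes out as a different byte sequence in every encoding, so a search term collected in UTF-8 would need one variant per encoding before it could match raw archive text.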
For the longer term, Wikipedia currently has pages in 250+ languages. One could assume that there will be a rough correspondence between Wikipedia activity in a language and the availability of public mailing list archives in that language.
Exercise: find search words for Greenlandic -- it's a heavily agglutinating language, so there aren't any small particles to latch onto. (Actually Navajo appears to be even harder, not least because it seems impossible to type on a standard keyboard, but then, let's not even get into various exotic ideographic scripts.)