Re: Multilingual Email Corpus Project
I saved some canned searches to a public LinkWad. The results right now are as follows:
- English -- 109,000 hits
- Swedish -- 489 hits
- Danish -- 457 hits, perhaps too wide a search
- Finnish -- 39 hits, rather few; too narrow a search?
- Estonian -- 7 hits; a clueless selection of search words? ("omitted results included" yields ... 8)
- German -- 5,810 hits
- French -- 1,220 hits
- Italian -- 2,480 hits
- Portuguese -- 7,710 hits, perhaps not the best search words
Update: I think I also have these covered now, in spite of what I thought earlier:
- Spanish -- 11,900 hits
- Catalan -- 620 hits
In the worst case, I will simply have to exclude Finnish and Estonian from this phase. The idea is to focus on the really low-hanging fruit, although I also want Far Eastern languages in the set for ... particular reasons. I'll need to get back to those as well.
For now, I'll probably work exclusively on Pipermail archives, because they are fairly common and structured in a standard way, so they should be easy to collect. Later I may want to expand the toolchain to cope with other types of archives (MHonArc? Mailman? Listserv? Yahoo Groups? Google Groups (and NNTP!)? Gmane?).
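To illustrate why the standard structure should make collection easy, here is a minimal sketch that pulls the monthly download links out of a Pipermail index page. It assumes the usual layout in which each month links to a file ending in .txt or .txt.gz; the sample HTML, the list URL, and the names ArchiveLinkParser and monthly_archive_urls are all made up for this example, and a real index page may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArchiveLinkParser(HTMLParser):
    """Collect hrefs that look like monthly archive downloads."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: monthly archives end in .txt or .txt.gz
        if href and href.endswith((".txt", ".txt.gz")):
            self.links.append(href)


def monthly_archive_urls(index_html, base_url):
    """Return absolute URLs of the monthly archives linked from an index page."""
    parser = ArchiveLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical index page fragment, not taken from a real list
SAMPLE_INDEX = """
<table>
<tr><td>January 2009:</td>
<td><a href="2009-January/thread.html">[ Thread ]</a></td>
<td><a href="2009-January.txt.gz">[ Gzip'd Text 4 KB ]</a></td></tr>
<tr><td>December 2008:</td>
<td><a href="2008-December/thread.html">[ Thread ]</a></td>
<td><a href="2008-December.txt">[ Text 2 KB ]</a></td></tr>
</table>
"""

urls = monthly_archive_urls(
    SAMPLE_INDEX, "http://lists.example.org/pipermail/some-list/"
)
for url in urls:
    print(url)
```

Fetching the pages and unpacking the files is left out on purpose; the point is only that one small parser could probably cover every list that uses this layout.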
I want a fairly broad selection of lists in order to avoid the obvious biases: many messages from the same people, or on the same (probably computing / Internet / open source-related) topics. It would of course also be nice to have different levels of formality and familiarity in the corpus, but that might be harder to achieve with public, open mailing lists as the sole source.
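One possible way to limit the "same people" bias, purely as a sketch and not something I have settled on, would be to cap the number of messages kept per sender. The function name cap_per_sender, the default cap, and the sample messages are hypothetical; it also assumes From headers in the ordinary "Name <address>" form, so archives that obfuscate addresses would need a different key.

```python
from collections import defaultdict
from email.utils import parseaddr


def cap_per_sender(messages, max_per_sender=2):
    """Keep at most max_per_sender messages from each sender, in original order."""
    counts = defaultdict(int)
    kept = []
    for msg in messages:
        # Works for dicts and email.message.Message objects alike
        _, addr = parseaddr(msg.get("From", ""))
        key = addr.lower()
        if counts[key] < max_per_sender:
            counts[key] += 1
            kept.append(msg)
    return kept


# Hypothetical messages, reduced to the headers that matter here
sample = [
    {"From": "Alice <alice@example.org>", "Subject": "a"},
    {"From": "alice@example.org", "Subject": "b"},
    {"From": "Bob <bob@example.org>", "Subject": "c"},
    {"From": "ALICE@example.org", "Subject": "d"},
]

kept = cap_per_sender(sample, max_per_sender=2)
print([m["Subject"] for m in kept])
```

This does nothing about topic bias or register, which would have to come from the choice of lists in the first place.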