As indicated in my initial posting, the plan for this blog was mainly to announce to the world what I'm up to, in order to attain some focus and perhaps some peer pressure from my $dmr-zillion alert and supportive readers.
To my utter surprise, signing up for this 14-day trial account and familiarizing $self with Typepad has already consumed the better part of my Saturday. Maybe I can book this time under "cost of doing business" but seriously, now would be a good time for some officious declarations.
I obtained my Bachelor's degree in 2000, and have been meaning to finish a Master's before I turn 40. For six and a half years I have had an agreement with my benevolent current employer that time would be allocated for this "at some point, as projects permit". In case you are not entirely up to speed with corporate BS, that means "never". But now the old alma mater is exerting some pressure, because they are in the final phases of a transition to a completely new exam structure, and so they are demanding that you finish your degree now (concretely, before June) or deal with the consequences. In the worst case, we are told, those might include the abyss of a freshman course in using computers for web surfing, email, and word processing. And in fact, April would be nice, they say, because they expect quite the real-world equivalent of a Slashdot phenomenon as hordes of old $bleeps (attempt to) get their acts together.
OK, enough background. My commitments for January are:
- Finish the synopsis for my thesis which I have on this very laptop, freshly restored from the old dying hard drive; and
- Start compiling a multilingual email corpus from public mailing list archives
The corpus is just a sidetrack from the real thesis work, but it's a prerequisite for the actual research I make believe I will perform. I even have some age-old notes about how this might have something to do with spam filtering, but the way things look now, the spam topic will not really be the focus of the thesis. (Incidentally, note to self: I should also look at this stuff, especially the enigmatic reference to "Kranig".)
Because I cannot reliably identify all the, what, 47? languages in TextCat, I will have to start with a smaller selection, but I believe I should be able to cope with at least the following:
- English
- Swedish
- Danish
- Finnish
- Estonian
- German
- French
- Italian
- Portuguese
- Japanese
- Korean
- Chinese
And if I have the time, and find suitable mailing lists:
- Polish
- Hungarian
- Romanian
- Albanian
- Turkish
- Greek
- Basque
- Esperanto
- Vietnamese
(The majority of these are manageable because they are easy to identify from the word shapes or the writing system used, not because I am a language wonderchild. Quite frankly, as a linguist, I am ashamed that I am not able to handle more than the three and a half languages which were compulsory at school.)
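To make the "writing system" part concrete, here is a minimal Python sketch of my own; it is not how TextCat works, and not necessarily what I will end up using. The helper names `script_of` and `dominant_script` are made up for illustration. The idea is simply to take the first word of each letter's Unicode character name as a rough script label and count which label is most frequent.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a coarse script label for one character, or None.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "HANGUL") is a good enough proxy
    for the writing system. This is a heuristic, not a standard API.
    """
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None  # character has no name in the database


def dominant_script(text):
    """Return the most frequent script label in text, or None."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    for sample in ["Hello world", "Привет мир", "Γειά σου", "안녕하세요"]:
        print(sample, "->", dominant_script(sample))
```

This only separates languages that have a script more or less to themselves (Greek, Korean). Everything written in Latin letters still needs the word-shape approach, and Japanese versus Chinese needs at least a check for kana mixed in with the ideographs.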
Slightly more problematic:
- Norwegian generic vs Nynorsk vs Bokmål?
- Dutch vs Afrikaans?
- Icelandic vs Faroese -- probably not hard at all with a bit of background study
- Latvian vs Lithuanian -- ditto
- Spanish: Catalan vs Castilian -- ditto
- Hebrew vs Yiddish -- ditto
- Czech vs Slovak? Probably easy to mix up with some other Slavic languages too
- Russian vs Ukrainian vs other languages with Cyrillic writing, some of which I have never heard of?
- Arabic vs Farsi vs other languages which use Arabic writing
Some of these are problematic only because I'm ignorant of the differences, but that doesn't help. I don't think I could tackle many more even if my legions of readers were able to help with some of these. And I apologize already in advance to everyone whose favorite language is not on any list.
It would be fun to add something like Dzongkha if I can find adequate resources, but maybe that's too optimistic ...
I'll try to post a follow-up soon with some method notes. (Update: here.)