NLP class

Mar. 13th, 2012 06:03 pm
The NLP class that [personal profile] vlion and I are taking has started. I've watched all the videos for the first week, and just tackled the first-week problem set. It took a bit of help from #dreamwidth people to figure out how to run a Python script, and a bit of help (mostly from Wikipedia) to figure out how to use the Unix command line tools, but I got all 5 questions on the problem set right on the first try. (We're allowed to submit the problem sets up to 5 times, and only the best score is kept.)
[personal profile] vlion suggested a programming practice project for me: writing a Perl program that produces the data for a histogram of how many times each word appears in a text. I'm counting word forms, not lexemes, because getting from word forms to lexemes is really hard.

This is what I think the conceptual components of the task are:
1. Chunk the text into word tokens - this involves breaking the text at word boundaries, keeping word-internal punctuation (especially apostrophes, maybe also hyphens), and throwing away word-external punctuation (including periods, commas, question marks, quotation marks...). Since apostrophes and single quotation marks are generally the same character, it takes a bit of work to distinguish them. You can mostly tell them apart by "apostrophes are generally between two letters, single quotation marks are not"; this doesn't catch word-final apostrophes like at the end of plural possessive cats', but I'm going to consider it close enough for a practice project. Anyways, chunking the text into word tokens should leave each word token stored in its own "slot".

(Edit:
1a: Convert all capitals to lowercase; this will lose distinctive capitals in proper names, but the gain of being able to combine sentence-initial and non-sentence-initial words will be worth it in most circumstances.)

2. Group identical word tokens. (I.e., create word type groups, where all tokens of a word type are together.)

3. For each word type group, do two things:
a. Add one instance of the word type to a list of word type names, to serve as the name of that group
b. Count the tokens in the group, and add that count to a second list, kept in the same order as the word type list

4. Associate the list of word type names with the list of token counts

(Note: I can imagine stages 3 and 4 being done either sequentially the way I've written it, or together, where for each word type group, you'd find the name, count the tokens, and associate them, before moving on to the next word type group.)

5. Produce output ordered in the desired order, e.g. alphabetical order.
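
The steps above can be sketched roughly as follows. This is a minimal sketch in Python rather than the Perl program the project calls for, and the names `WORD_RE`, `tokenize`, `count_word_forms`, and `histogram_rows` are made up for illustration. It assumes English text with unaccented letters only, and it uses the "apostrophes sit between two letters" rule from step 1, so a word-final apostrophe (as in cats') is dropped. Steps 2 to 4 are done together by counting tokens into a dictionary-like `Counter`.

```python
import re
from collections import Counter

# A word token: a run of letters, optionally followed by more runs of letters
# that are each joined on by one apostrophe (straight or curly) or hyphen.
# Assumption: only the unaccented letters a-z count as letters.
WORD_RE = re.compile(r"[a-z]+(?:['’-][a-z]+)*")


def tokenize(text):
    """Steps 1 and 1a: lowercase the text and split it into word tokens."""
    return WORD_RE.findall(text.lower())


def count_word_forms(text):
    """Steps 2 to 4: group identical tokens and count the tokens of each type."""
    return Counter(tokenize(text))


def histogram_rows(text):
    """Step 5: (word type, count) pairs in alphabetical order."""
    return sorted(count_word_forms(text).items())


if __name__ == "__main__":
    sample = "The cat's toy isn't the cats' toy. 'Well-made,' she said."
    for word, count in histogram_rows(sample):
        print(f"{word}\t{count}")
```

Because the pattern only accepts an apostrophe or hyphen when letters follow it, opening and closing single quotation marks fall outside every token without needing a separate cleanup pass.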

I have some idea of basic coding tools that I could use to complete these conceptual tasks, but I don't have time to sketch that right now.
I have signed up for the online Natural Language Processing class offered by Dan Jurafsky and Christopher Manning. I think it'll be a stretch for me, but I'm hoping I can still learn and do productive things in it.
I'm now here. This is mostly a placeholder entry, since I'm not sure what to say here yet.

Profile

Clifstan
