The NLP class that vlion and I are taking has started. I've watched all the videos for the first week, and just tackled the first-week problem set. It took a bit of help from #dreamwidth people to figure out how to run a Python script, and a bit of help from mostly-Wikipedia to figure out how to use the Unix command line tools, but I got all 5 questions on the problem set right on the first try. (We're allowed to submit the problem sets up to 5 times, and only the best score is kept.)
Word count histograms
Dec. 9th, 2011 10:08 am
This is what I think the conceptual components of the task are:
1. Chunk the text into word tokens - this involves breaking the text at word boundaries, keeping word-internal punctuation (especially apostrophes, maybe also hyphens), and throwing away word-external punctuation (including periods, commas, question marks, quotation marks...). Since apostrophes and single quotation marks are generally the same character, it takes a bit of work to distinguish them. You can mostly tell them apart by "apostrophes are generally between two letters, single quotation marks are not"; this doesn't catch word-final apostrophes like at the end of plural possessive cats', but I'm going to consider it close enough for a practice project. Anyways, chunking the text into word tokens should leave each word token stored in its own "slot".
(Edit:
1a: Convert all capitals to lowercase; this will lose distinctive capitals in proper names, but the gain of being able to combine sentence-initial and non-sentence-initial words will be worth it in most circumstances.)
2. Group identical word tokens. (I.e., create word type groups, where all tokens of a word type are together.)
3. For each word type group, do two things:
a. Create a list with only one instance of each word type, as the name of that word type group
b. Count the number of tokens of each word type, and put those counts in a list, ordered in the same order as the word type list
4. Associate the list of word type names with the list of token counts
(Note: I can imagine stages 3 and 4 being done either sequentially the way I've written it, or together, where for each word type group, you'd find the name, count the tokens, and associate them, before moving on to the next word type group.)
5. Produce output ordered in the desired order, e.g. alphabetical order.
I have some idea of basic coding tools that I could use to complete these conceptual tasks, but I don't have time to sketch that right now.
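For reference, here is a minimal sketch of how steps 1 through 5 might look in Python, using only the standard library. This is just one possible approach, not anything from the class. The function names (`tokenize`, `word_counts`, `histogram_lines`) and the regular expression are assumptions made up for illustration. The pattern only handles plain ASCII letters and the straight apostrophe, and it uses the "between two letters" rule from step 1, so a word-final apostrophe like the one in cats' gets dropped.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally continued by an
# apostrophe or hyphen that sits between two letters (step 1).
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Steps 1 and 1a: lowercase the text, then pull out the word tokens."""
    return WORD_RE.findall(text.lower())


def word_counts(text):
    """Steps 2-4: group identical tokens and count the tokens of each type."""
    return Counter(tokenize(text))


def histogram_lines(text):
    """Step 5: one 'word count' line per word type, in alphabetical order."""
    counts = word_counts(text)
    return [f"{word} {counts[word]}" for word in sorted(counts)]


if __name__ == "__main__":
    sample = "The cat's toy. 'The cats' toys,' she said; the well-known cat sat."
    for line in histogram_lines(sample):
        print(line)
```

Here `Counter` does steps 2 through 4 together (the second option in the note above): it keeps each word type as a key and its token count as the value, so the names and counts are never stored in two separate lists that have to be kept in the same order.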
(no subject)
Dec. 9th, 2011 12:25 am
I have signed up for the online Natural Language Processing class offered by Dan Jurafsky and Christopher Manning. I think it'll be a stretch for me, but I'm hoping I can still learn and do productive things in it.
(no subject)
Oct. 27th, 2011 05:12 pm
I'm now here. This is mostly a placeholder entry, since I'm not sure what to say here yet.