clifstan

vlion proposed a programming practice project for me: write a Perl program that provides the data for a histogram of how many times each word appears in a text. I'm treating "word" as meaning word-forms, not lexemes, because getting from word forms to lexemes is really hard.

This is what I think the conceptual components of the task are:
1. Chunk the text into word tokens - this involves breaking the text at word boundaries, keeping word-internal punctuation (especially apostrophes, maybe also hyphens), and throwing away word-external punctuation (including periods, commas, question marks, quotation marks...). Since apostrophes and single quotation marks are generally the same character, it takes a bit of work to distinguish them. You can mostly tell them apart by "apostrophes are generally between two letters, single quotation marks are not"; this doesn't catch word-final apostrophes like at the end of plural possessive cats', but I'm going to consider it close enough for a practice project. Anyways, chunking the text into word tokens should leave each word token stored in its own "slot".

(Edit:
1a: Convert all capitals to lowercase; this will lose distinctive capitals in proper names, but the gain of being able to combine sentence-initial and non-sentence-initial words will be worth it in most circumstances.)
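
Here is a rough sketch of stages 1 and 1a, written in Python rather than Perl just to make the idea concrete. The names `WORD` and `tokenize` are made up for illustration, and the pattern assumes plain unaccented English letters. It applies the "between two letters" rule from above, so a word-final apostrophe (as in cats') is dropped along with the quotation marks.

```python
import re

# One or more letters, optionally followed by groups of
# (apostrophe or hyphen, then more letters). An apostrophe or hyphen
# is kept only when it sits between two letters.
WORD = re.compile(r"[a-z]+(?:['’-][a-z]+)*")

def tokenize(text):
    """Lowercase the text (stage 1a) and return its word tokens in order (stage 1)."""
    return WORD.findall(text.lower())

print(tokenize("'Don't go,' said the well-known cat. The cats' toys?"))
# ['don't', 'go', 'said', 'the', 'well-known', 'cat', 'the', 'cats', 'toys']
```

Each token ends up in its own slot of the returned list, which is all the later stages need.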

2. Group identical word tokens. (I.e., create word type groups, where all tokens of a word type are together.)

3. For each word type group, do two things:
a. Add one instance of the word type to a list of word type names, to serve as the name of that word type group
b. Count the tokens in the group, and put that count in a second list, at the same position as the group's name in the first list

4. Associate the list of word type names with the list of token counts

(Note: I can imagine stages 3 and 4 being done either sequentially the way I've written it, or together, where for each word type group, you'd find the name, count the tokens, and associate them, before moving on to the next word type group.)
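
A minimal sketch of the "together" version of stages 2 through 4, again in Python and only as an illustration. It assumes the tokens are already in a list, and `count_types` is a made-up name. A dictionary keyed by word type does the grouping, naming, and counting in one pass; a Perl hash would play the same role.

```python
def count_types(tokens):
    """Map each word type to the number of tokens of that type."""
    counts = {}
    for token in tokens:
        # First sighting starts the count at 0, then every sighting adds 1.
        counts[token] = counts.get(token, 0) + 1
    return counts

tokens = ["the", "cat", "saw", "the", "other", "cat"]
counts = count_types(tokens)
print(counts)  # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}

# The "sequential" version, with two parallel lists, can be recovered from it:
names = list(counts)                       # one instance of each word type
totals = [counts[name] for name in names]  # counts in the same order
print(list(zip(names, totals)))
```

The dictionary keeps each name and its count associated from the start, so there is no risk of the two lists getting out of step.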

5. Produce output ordered in the desired order, e.g. alphabetical order.
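
For stage 5, a small Python sketch under the assumption that the counts are stored as a mapping from word type to count. The function name `histogram_rows` and its `by` parameter are hypothetical, chosen only for this example. It offers alphabetical order, plus a most-frequent-first order as one other plausible "desired order".

```python
def histogram_rows(counts, by="alpha"):
    """Return (word, count) pairs in the requested order."""
    if by == "alpha":
        # Pairs sort by word first, and the words are unique.
        return sorted(counts.items())
    # Otherwise: highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

counts = {"the": 2, "cat": 2, "saw": 1, "other": 1}
for word, n in histogram_rows(counts):
    print(f"{word}\t{n}")
```

The tab-separated lines are meant as input for whatever draws the histogram; the program itself only provides the data.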

I have some idea of basic coding tools that I could use to complete these conceptual tasks, but I don't have time to sketch that right now.
