It's been a while since my last update; my apologies. Here's where things stand with Eostric.

As it turns out, Wordnet was a bit of a boon. I've put together a pile of routines that can take arbitrary text pasted in from the web (or print) and break the text up into chunks that a machine can digest.

Take for instance the text:

Edgar Allen Poe was born in Boston in 1809 but grew up in Richmond, Virginia and attended the University of Virginia.

The code will take that and break it up as follows:

Edgar Allen Poe               pos n proper_noun 1
was                           stem be pos v wordid 12254 morph 1
born                          stem bear pos v wordid 12366 morph 1
in                            pos preposition
Boston                        pos n proper_noun 1
1809                          pos n proper_noun 1
but                           pos preposition
grew                          stem grow pos v wordid 60126 morph 1
up                            pos preposition
Richmond , Virginia           pos n proper_noun 1
and                           pos conjunction
attended                      pos {s a} all_pos {s a} wordid 9501
the University of Virginia    pos n proper_noun 1
.                             pos {}

Essentially the code tries to lump nouns into noun phrases and identify the part of speech for each word or phrase. It's not perfect yet, obviously, but it is already doing a lot of cleanup that I wasn't even aware it would need to do before I started the implementation.
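
To give a rough idea of the technique, here is a minimal sketch in Python using NLTK's WordNet interface. This is not the Eostric code; the tokenizer, the capitalization heuristic for lumping proper nouns, and the output format are all my own simplifications for illustration.

    # Sketch of the general approach, not the Eostric implementation.
    # Assumes NLTK with the WordNet corpus installed (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def tag_tokens(text):
        """Lump runs of capitalized tokens into proper-noun phrases and
        look everything else up in WordNet."""
        tokens = text.replace(',', ' , ').split()
        out, i = [], 0
        while i < len(tokens):
            # Crude heuristic: consecutive capitalized words (optionally joined
            # by a comma, as in "Richmond , Virginia") form one proper noun.
            if tokens[i][:1].isupper():
                j = i + 1
                while j < len(tokens):
                    if tokens[j][:1].isupper():
                        j += 1
                    elif tokens[j] == ',' and j + 1 < len(tokens) and tokens[j + 1][:1].isupper():
                        j += 1
                    else:
                        break
                out.append((' '.join(tokens[i:j]), 'pos n proper_noun'))
                i = j
                continue
            word = tokens[i]
            stem = wn.morphy(word) or word        # reduce inflected forms to a stem
            synsets = wn.synsets(stem)
            if synsets:
                pos_tags = sorted({s.pos() for s in synsets})
                out.append((word, 'stem %s pos %s' % (stem, ' '.join(pos_tags))))
            else:
                # Numbers and closed-class words need separate rules;
                # WordNet has nothing to say about them.
                out.append((word, 'pos unknown'))
            i += 1
        return out

    for word, tag in tag_tokens("Edgar Allen Poe was born in Boston in 1809 but grew up in Richmond, Virginia"):
        print('%-28s %s' % (word, tag))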

Of all things, apostrophes and quote marks have turned out to be a challenge. Thanks to some curious decisions in both ASCII and keyboard design, how quotes are actually typed and stored varies widely. On top of that, one can never be sure whether the ' character marks the end of a single-quoted phrase or is an apostrophe without digging into the context.
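
As an illustration only (this is not Eostric's cleanup code, and the heuristic is my own assumption), the kind of thing involved looks roughly like this: fold the assorted Unicode quote characters down to plain ASCII, then decide whether a remaining ' is a contraction apostrophe or a quote mark by looking at its neighbors.

    # Sketch of quote cleanup: normalize curly quotes, then classify each
    # remaining straight quote by context. A rough heuristic, not Eostric's rules.
    QUOTE_MAP = str.maketrans({
        '\u2018': "'",   # left single quote
        '\u2019': "'",   # right single quote / typographic apostrophe
        '\u201c': '"',   # left double quote
        '\u201d': '"',   # right double quote
    })

    def classify_quotes(text):
        text = text.translate(QUOTE_MAP)
        tags = []
        for i, ch in enumerate(text):
            if ch != "'":
                continue
            prev_alpha = i > 0 and text[i - 1].isalpha()
            next_alpha = i + 1 < len(text) and text[i + 1].isalpha()
            if prev_alpha and next_alpha:
                tags.append((i, 'apostrophe'))    # e.g. don't, Poe's
            elif next_alpha and not prev_alpha:
                tags.append((i, 'open quote'))
            else:
                tags.append((i, 'close quote'))   # still ambiguous: Poes' vs end of quote
        return text, tags

    print(classify_quotes("\u2018Poe\u2019s house,\u2019 she said."))

Even this leaves real ambiguities (a trailing ' after an s, for instance), which is exactly why context matters.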

You will see that each word or phrase is marked with a part of speech (pos) and, if the word could be located in Wordnet, with the Wordnet id for that word and the stem word that Wordnet tracks. Wordnet only tracks nouns, verbs, adverbs, and adjectives, so the "prepositions" and "conjunctions" you see listed are sorted out by Eostric itself. Eostric also has rules for detecting and lumping together proper nouns.
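
Since Wordnet has nothing to say about those closed-class words, the obvious fallback is a small hand-built lookup table, something along these lines (the word lists below are my own illustration, not Eostric's actual tables):

    # Closed-class fallback for words WordNet does not track.
    # Illustrative lists only; real rules have to cope with words like "up",
    # which can be a preposition, a verb particle, or an adverb in context.
    PREPOSITIONS = {'in', 'on', 'at', 'by', 'of', 'to', 'up', 'with', 'from'}
    CONJUNCTIONS = {'and', 'but', 'or', 'nor', 'for', 'so', 'yet'}

    def closed_class_pos(word):
        """Return a tag for closed-class words, or None for open-class words."""
        w = word.lower()
        if w in PREPOSITIONS:
            return 'preposition'
        if w in CONJUNCTIONS:
            return 'conjunction'
        return None

    print([(w, closed_class_pos(w)) for w in ['up', 'and', 'grew']])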

I didn't think I'd get this far this fast, and frankly I'm scratching my head on how to take this software to the next level. But that's what I've been up to the last few weeks. Stay tuned for updates.

I have posted the code so far on chiselapp if you would like to experiment with it. I don't have the Wordnet database included in the code just yet. It's about 500 MB of data, and I'm not entirely convinced I need all of it, or even most of it. In a lot of my sample exercises (including the one above) Wordnet is only really useful for a small chunk of the data processing I actually need, and usually for generic words with special rules. Wordnet contains a lot of encyclopedic entries of somewhat dubious usefulness, things like "Battle of Jericho" and "Albert Einstein." It's nice that it recognizes them; it's just not clear it does anything useful with them.
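
If I wanted to get a feel for how much of the database falls into that encyclopedic bucket, something like the following would give a rough census (again a Python/NLTK sketch of the idea, not anything in the repository, and treating every multi-word lemma as "encyclopedic" is a crude assumption):

    # Rough census: how many noun lemmas are multi-word collocations
    # (a crude stand-in for encyclopedic entries) versus single words?
    from nltk.corpus import wordnet as wn

    single, multi = 0, 0
    for lemma in wn.all_lemma_names(pos=wn.NOUN):
        if '_' in lemma:      # NLTK joins multi-word lemmas with underscores
            multi += 1
        else:
            single += 1

    print('single-word noun lemmas:', single)
    print('multi-word noun lemmas: ', multi)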

The link to the code is: https://chiselapp.com/user/hypnotoad/repository/eostric