Worser Grammar

By Joseph Kibe on 24 June 2009 9:48 PM

A few days ago I mentioned my effort to write some methods for automated sentence parsing. Specifically, my goal was to create parsing algorithms that still do a reasonably good job even when the author omits common grammatical structures.

It turns out, writing such a set of algorithms is no walk in the park. In particular, I now see why the software I had been using failed so miserably when I threw it sentences without some key parts of speech.

At least in English, many words can serve as more than one part of speech. Take the word "cue." It could be a noun (an implement for playing pool, a signal, a hint), but it could also be a verb, as in "He cued the tape for playback," depending, of course, upon the context.

The context, as I understand it, plays a particularly important role in some algorithms. This class of parsing methods looks at all the word pairs in a sentence and assigns a part of speech accordingly. For instance, given the phrase "a yellow duck," the parser would figure out that "yellow" cannot modify the verb "duck" (as in "duck and cover"), so it's likely that "duck" is a noun and "yellow" an adjective.
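To make the word-pair idea concrete, here is a minimal sketch. It is not the code from the parsers I tried; the tiny lexicon and the set of allowed tag pairs are both invented for illustration, and "yellow" is simplified to a single reading.

```python
from itertools import product

# Hypothetical lexicon: the tags each word is allowed to take.
# (Simplified: "yellow" is listed only as an adjective.)
LEXICON = {
    "a": {"DET"},
    "yellow": {"ADJ"},
    "duck": {"NOUN", "VERB"},
}

# Hypothetical set of tag pairs that may appear side by side.
ALLOWED_PAIRS = {
    ("DET", "ADJ"),
    ("DET", "NOUN"),
    ("ADJ", "NOUN"),
    ("NOUN", "VERB"),
}

def consistent_taggings(words):
    """Return every tag sequence whose adjacent pairs are all allowed."""
    options = [sorted(LEXICON[w]) for w in words]
    return [
        tags
        for tags in product(*options)
        if all(pair in ALLOWED_PAIRS for pair in zip(tags, tags[1:]))
    ]

print(consistent_taggings(["a", "yellow", "duck"]))
# [('DET', 'ADJ', 'NOUN')]
```

The verb reading of "duck" drops out because the pair (ADJ, VERB) is not in the allowed set, which leaves the noun reading as the only consistent one.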

Of course, this approach also failed rather miserably when I subjected it to real-world inputs. The two algorithms I tried depended in many cases upon the presence of determiners to act as a sort of "reference point," since, for example, "the" is only ever a determiner. This enables the algorithms to make good assumptions about the location of nouns, which in turn forces other words to be verbs, and everything more or less falls into place quite nicely.
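A rough sketch of why missing determiners hurt, under my own assumptions: the lexicon and the single rule below (a word directly after a determiner is not a verb) are made up for illustration and are far cruder than what the real algorithms do.

```python
DETERMINERS = {"the", "a", "an"}

# Hypothetical lexicon of possible tags per word.
POSSIBLE_TAGS = {
    "cue": {"NOUN", "VERB"},
    "tape": {"NOUN", "VERB"},
}

def narrow_with_determiners(words):
    """Drop the VERB reading from a word that directly follows a determiner."""
    result = []
    previous = None
    for word in words:
        lower = word.lower()
        if lower in DETERMINERS:
            tags = {"DET"}
        else:
            tags = set(POSSIBLE_TAGS.get(lower, {"UNKNOWN"}))
            # Only narrow when another reading would remain.
            if previous in DETERMINERS and len(tags) > 1:
                tags.discard("VERB")
        result.append((word, tags))
        previous = lower
    return result

# With the determiner, "tape" is pinned down as a noun.
print(narrow_with_determiners(["cue", "the", "tape"]))
# Written note-style, without it, both words stay ambiguous.
print(narrow_with_determiners(["cue", "tape"]))
```

In the first call "tape" is narrowed to a noun; in the second nothing anchors either word, which is roughly the situation with hastily written notes.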

But as I said, that didn't work. People don't write notes with sufficiently polished grammar to make such approaches work. (Though if I ever need to parse well-written work, I have code that does a pretty good job.)

So I'm trying my own heuristically motivated approach using word frequency data. While I'm working on some fancy probabilistic mumbo jumbo that involves a lot of math, at its core the approach is quite simple.

Take the word "young" as an example. I suspect most English speakers would immediately classify "young" as an adjective, which is true — most of the time. "Young" can also be a noun, as in, "The young were spared the worst of the battle's ravages." But, by analyzing a whole bunch of English writing, it quickly becomes clear that "young" is used far more frequently as an adjective than as a noun.

Thus, my algorithm takes that data and makes some initial assumptions when it looks at the words and phrases in a sentence. My hope is that, with these reference points in place, it will become possible to make good guesses about the parts of speech of the remaining words.
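At its simplest, the frequency idea boils down to a "most common tag" first guess. The sketch below is only that baseline, not the probabilistic version I'm working on, and the counts are invented for illustration; real numbers would come from a tagged body of English text.

```python
from collections import Counter

# Hypothetical tag counts, invented for illustration only.
TAG_COUNTS = {
    "young": Counter({"ADJ": 95, "NOUN": 5}),
    "the": Counter({"DET": 100}),
    "duck": Counter({"NOUN": 70, "VERB": 30}),
}

def initial_guess(word):
    """Return (most common tag, its relative frequency), or (None, 0.0) if unseen."""
    counts = TAG_COUNTS.get(word.lower())
    if not counts:
        return None, 0.0
    tag, n = counts.most_common(1)[0]
    return tag, n / sum(counts.values())

def tag_sentence(words):
    """Attach a first-guess tag and a confidence to every word."""
    return [(word, *initial_guess(word)) for word in words]

for word, tag, confidence in tag_sentence(["The", "young", "duck", "quacked"]):
    print(word, tag, round(confidence, 2))
```

Words with a lopsided distribution, like "young" here, can then serve as reference points, while low-confidence or unseen words are left for a later pass to resolve.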

I also broke down and ordered a copy of Foundations of Statistical Natural Language Processing just to give myself a touch more background on the use of probabilistic methods in natural language parsing.

It's exciting, interesting and — most of all — incredibly frustrating.
