Monday, January 30, 2006

A few utterly useless problems in NLP

Natural Language Processing is very difficult to do. For me, most of the difficulty arises from linguistic limitations. To do NLP well, you need a firm grasp of the English language and its grammar, and you really need to be a linguistics expert. This is certainly not my specialty. Nonetheless, I find NLP to be a very interesting field of computer science, and I like to read and think about it whenever time allows.

I want to propose two utterly useless problems in NLP that I've been considering this weekend. In the spirit of the Columbia University Computer Science Department (whose motto should really be "the idea is the important thing") I won't actually code anything, but merely discuss the theory of the problem.

Consider this example first: Suppose that I want to find out how many of my blog posts reference Mr. Tom Lehrer. Well, the first thing that we could do is just search all the text for the string "Tom Lehrer" or, better yet, just "Lehrer." However, that won't pick up the posts where I quote some line from Mr. Lehrer's records without actually writing "Lehrer." Finding those posts is a much more difficult task. The first obstacle is to define what we really mean by "reference." Is a reference a single line from a song? A complete sentence from an intro? Will we count fragments of a sentence or lyric? And does a fragment have to be comma-delimited, or will we define it otherwise? How about when I say "it seems to me" - should that be excluded? What about relatively common phrases like "we're just as close as we can be" that I might use without intending a reference? There are clearly a lot of special cases - a programmer's worst nightmare (oh no, a special case). So if I were actually coding this, I would define a reference as either the string "Lehrer," a complete line from a song, or a complete sentence from an intro. Even after doing some nifty text processing, I would still have a really bad program. For example, what if I ever write a post about Jim Lehrer, or some other Lehrer? I haven't written any exceptions that would catch that. Not to mention the lack of provisions for sentence and lyric fragments, a completely unreasonable runtime, and so on.
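For anyone who does want to code it, here is a minimal Python sketch of the naive definition above: a post counts as a reference if it contains "Lehrer" or quotes a known line in full. Everything named here is an assumption made for illustration: the KNOWN_LINES list is a one-entry placeholder (the phrase quoted above) standing in for a real collection of lyrics and intros, and normalize and references_lehrer are made-up helper names. It ignores fragments entirely and makes no attempt to tell one Lehrer from another.

```python
import re

# Hypothetical placeholder data: a real version would load every line of
# every song and intro. This single entry is the phrase quoted in the post.
KNOWN_LINES = [
    "we're just as close as we can be",
]

def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces,
    and collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9'\s]", " ", text.lower())
    return " ".join(text.split())

def references_lehrer(post, known_lines=KNOWN_LINES):
    """Return True if the post names Lehrer or quotes a known line in full."""
    norm = normalize(post)
    # Rule 1: the literal name. This also fires on Jim Lehrer.
    if re.search(r"\blehrer\b", norm):
        return True
    # Rule 2: a complete known line, matched on whole-word boundaries.
    padded = f" {norm} "
    return any(f" {normalize(line)} " in padded for line in known_lines)
```

Calling references_lehrer on a post about Jim Lehrer returns True, which is exactly the false positive described above. Each post is also scanned once per known line, so the runtime grows with the size of the lyric collection.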

Now, consider a more complicated example: every intelligent person with a sense of humor hates Mark Russell. On his webpage, the self-proclaimed (because nobody else would ever proclaim such nonsense) "master of political satire" posts a few dull jokes every month. The archives go back to 2003, so there's quite a bit of unfunny data to work with. Suppose that we would like to systematically prove what everyone already knows - that all of Mark Russell's jokes are pretty much the same. That is a very complicated problem in Natural Language Processing, because it requires not only a deep understanding of the structure of English sentences, but also attention to various aspects of content. We can go about proving the similarity of his "jokes" in various ways. We can do content analysis, and show that a certain high percentage of jokes are about Iraq, or Bush, or Congress, etc. This is easy to program; however, it doesn't do nearly enough justice to just how uncreative his material really is. If we want to be fancy, we can create some sort of reference file for each month (a manual process) that contains key words for the major current events of that period (like: abortion, El Nino, Monica, etc.), and match it against the jokes for that month. Another way of proving similarity is to analyze the sentence structure of his jokes and show that a lot of them are formulated in the same way. (For example, if you do such analysis on Yakov Smirnoff's material, you will find a lot of sentences of the form: In Soviet Russia noun, transitive verb, YOU!) For Mr. Russell's jokes, though, this is a ridiculously intricate task, and it really represents the crux of the difficulty in analyzing and processing a natural language.
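The two approaches above can be roughed out in Python, with heavy caveats. The sketch below assumes the jokes for a month are already available as a list of strings and that the monthly key words come from a hand-made list; topic_share, template_share, and SOVIET_TEMPLATE are hypothetical names invented for this example. The "sentence structure" half is only a surface regular expression for the Yakov Smirnoff form, not real parsing: it cannot tell a noun from a transitive verb, which is precisely the hard part.

```python
import re

def _normalize(text):
    """Lowercase and reduce text to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def topic_share(jokes, keywords):
    """Fraction of jokes that mention each key word or phrase at least once."""
    if not jokes:
        return {}
    padded = [f" {_normalize(joke)} " for joke in jokes]
    share = {}
    for kw in keywords:
        needle = f" {_normalize(kw)} "  # whole-word match, phrases allowed
        hits = sum(1 for joke in padded if needle in joke)
        share[kw] = hits / len(jokes)
    return share

# Crude surface pattern: "In Soviet Russia, <anything> <one word> you!"
# An assumption for illustration only; it does no grammatical analysis.
SOVIET_TEMPLATE = re.compile(
    r"^in soviet russia,?\s+.+\s+\w+\s+you\W*$", re.IGNORECASE
)

def template_share(jokes, pattern=SOVIET_TEMPLATE):
    """Fraction of jokes whose full text matches the given pattern."""
    if not jokes:
        return 0.0
    return sum(1 for joke in jokes if pattern.match(joke.strip())) / len(jokes)
```

Feeding topic_share one month of jokes together with that month's key word list gives the per-topic percentages. A template as simple as SOVIET_TEMPLATE only works because that particular formula is so rigid; anything looser would need an actual parser or part-of-speech tagger.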

Creative thoughts on the topic, as well as code samples, are always appreciated, and I will certainly indulge any intelligent discussion. However, if you spam me with "I love Mark Russell you suck Irina," don't expect a response. Also, unless you know me intimately, please don't send me anything that is already compiled.
