Histogram Resources

[OFFLINE]

10/23

Source files and xmlparser scripts

Extension of Previous Work

·        This work extends the current utilities to interact with WordNet in the hope of pruning out the excessive modifiers. Although we had hoped to also process a new corpus, I did not have enough time to rerun the extraction stage, so all tests were run on the corpus from last week.

·        Restructured the files into a cascaded makefile subproject structure; modules can now be added and worked on independently, but they are still built whenever the whole library is made.

·        Added support for WordNet development and created a utility that transforms a text file into morphological similars and their polysemy from WordNet. The new program accepts input from STDIN or takes its input and output as command-line arguments.

·        Restructured the help system to better reach all commands in the FIMS utility; to reduce confusion, a new command array and a clustering of commands have been introduced.

·        Added support for multi-word fetches. Previously, searches for “blood pressure” would not work correctly and the CUI had to be used instead.

·        Fixed a bug in the lookup of “one” to “1” similars for fact merging; the bug generated garbage data in the database.

·        Added support for parsing WordNet-created CSV files by modifying the set of files that we recognize as inputs to parse.

·        Moved the modifiers to a single lexicon so that we can create the polysemy counts and links to other modifiers within the lexicon when handed a WordNet (or other) processed file. Currently, we only modify existing entries and do not create new words that WordNet has provided, so that the lexicon is not overpopulated with entries that were not found in the corpus.

·        A listing of all currently available commands can be seen in this simple output trace.

The Addition of WordNet

WordNet is a very powerful, if a little too complex, tool that gives us detailed morphological and sense information about a word. I chose WordNet over simple frequency in another corpus for selecting our target modifiers because WordNet also offers morphological similars, which can be used when we merge the frequency histograms of multiple modifiers. WordNet also provides a direct programmatic interface, which was used to produce exactly the input/output patterns that the program could easily adapt to. Using WordNet required keeping a central lexicon of modifiers instead of localizing this information to each idea. It should be noted that this step is closer to the original proposal, which used a graph-type approach to weigh and link the relevance of ideas, facts, and modifiers.

An example output of the lexicon itself can be found here, and the processed output of the WordNet component can be found here. Currently, the export/import process must be done with two simple command-line executions, but eventually the central program will be able to execute this conversion itself. It is also important to note that we would want to collect and scan the entire corpus for relevant words before subjecting the modifiers to the WordNet lookup. This is because, to reduce computation time, a lexeme that already has an associated polysemy value is not exported again in a lexicon dump.
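The export and import rules described above (skip lexemes that already have a polysemy value on export, and only update existing entries on import) can be sketched roughly as follows. This is a minimal Python illustration and not the FIMS code: the dictionary layout, the function names `export_for_wordnet` and `import_polysemy`, and all numeric values are assumptions invented for the example.

```python
# Hypothetical in-memory lexicon: modifier -> record. The field names and
# the numbers are assumptions for illustration, not the FIMS data layout.
lexicon = {
    "high": {"frequency": 12, "polysemy": None},
    "low": {"frequency": 9, "polysemy": 9},
    "elevated": {"frequency": 4, "polysemy": None},
}


def export_for_wordnet(lexicon):
    """Return the lexemes that still need a WordNet lookup.

    Lexemes that already carry a polysemy value are skipped, which is why
    the whole corpus should be scanned before the dump is made.
    """
    return sorted(w for w, rec in lexicon.items() if rec["polysemy"] is None)


def import_polysemy(lexicon, processed):
    """Apply polysemy values from a processed file to existing entries only."""
    for word, polysemy in processed.items():
        if word in lexicon:  # never add words that the corpus did not contain
            lexicon[word]["polysemy"] = polysemy
    return lexicon


pending = export_for_wordnet(lexicon)
# "lofty" stands in for a word WordNet returned that is not in the corpus.
import_polysemy(lexicon, {"high": 18, "elevated": 3, "lofty": 5})
```

Here `pending` lists only the two lexemes without a polysemy value, and the extra word in the processed data is ignored rather than added to the lexicon.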

Polysemy Histogram Analysis

With the addition of polysemy values, the frequency histograms of relevant modifiers now show promising results. Although we cannot verify this against the fact data (this part of the project has not been reworked yet), it is promising nonetheless. There is an inherent danger in using the polysemy values, though: we cannot guarantee that the specific type of modifiers that we want (i.e. numerical sense describers) are the only high-polysemy words. Instead, we must rely on both the polysemy value and the frequency of the corresponding word in the corpus to get the best automated histogram results. Figures 1 and 2 demonstrate minimum polysemy values that admit too many and too few of the terms of interest, respectively. As the histograms demonstrate, admitting too many modifiers drowns out the target modifier words, which is just as bad as pruning them. Instead, an above-average, but not high, minimum polysemy value of 7 is chosen and displayed in Figure 3. With a larger corpus, it is likely that the primary words that we are interested in will only show a higher frequency.
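To make the thresholding concrete, the following is a minimal Python sketch of filtering modifiers by a minimum polysemy value and then ranking the survivors by corpus frequency. The function name `histogram`, the `min_frequency` parameter, and every word and number in the sample data are assumptions made for illustration; they are not taken from the corpus or from WordNet.

```python
def histogram(modifiers, min_polysemy, min_frequency=1):
    """Keep modifiers that pass both thresholds, most frequent first.

    `modifiers` is a list of (word, corpus_frequency, polysemy) tuples.
    Words without a polysemy value (None) are dropped.
    """
    kept = [
        (word, freq)
        for word, freq, polysemy in modifiers
        if polysemy is not None
        and polysemy >= min_polysemy
        and freq >= min_frequency
    ]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# (modifier, corpus frequency, polysemy): invented values for illustration.
modifiers = [
    ("high", 12, 18),
    ("low", 9, 9),
    ("mean", 7, 7),
    ("mild", 5, 4),
    ("systolic", 6, 1),
]

for threshold in (4, 7, 9):
    print(threshold, histogram(modifiers, threshold))
```

Raising the threshold shrinks the list and lowering it admits more words, which mirrors the trade-off between the three figures.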

Figure 1: Corpus Modifiers with Minimum Polysemy 4

Figure 2: Corpus Modifiers with Minimum Polysemy 9

Figure 3: Corpus Modifiers with Minimum Polysemy 7

Proof of Relevance of Polysemy

As a test, the results from last week are compared against those of this week, which have the additional WordNet polysemy sense integrated. In each pair, the graph on the left depicts last week’s results and the graph on the right depicts this week’s. The images of last week’s results are recycled directly from last week’s output. Although there are still some stray words that do not give the numerical sense that we desire, the frequencies usually give our target words more relevance. After we expand the corpus to include more articles and add a bidirectional idea and value match, I believe that these graphs will look ideal.

Figure 4: Modifier Frequency of "blood pressure" CUI

Figure 5: Modifier Frequency of C0005823 with Minimum Polysemy 7

 

Figure 6: Modifier Frequency of "heart rate" CUI

Figure 7: Modifier Frequency of C0018810 with Minimum Polysemy 7

Figure 8: Modifier Frequency of "LVEF" CUI

Figure 9: Modifier Frequency of C0428772 with Minimum Polysemy 7

Reduction of Related Modifiers

As previously mentioned, an additional benefit of using WordNet to filter our modifiers is that we also get morphological and word-form similars. Although I have not performed any initial tests for convergence, I would not be surprised if the use of these similar forms also improved modifier frequencies. Unfortunately, this is not a straightforward process, because we must carefully choose which modifier will receive the frequency that previously resided on the less frequent morphological similars. This is a task that we leave for consideration at the end of the project.
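As a rough illustration of what such a merge could look like, the Python sketch below folds the counts of each group of similars into its most frequent member. This is only one possible rule, chosen here for simplicity, and not something the project has adopted or tested; the function name `merge_similars`, the grouping format, and the sample counts are all assumptions.

```python
def merge_similars(frequencies, groups):
    """Fold each group's counts into its most frequent member.

    `frequencies` maps modifier -> corpus frequency.
    `groups` is a list of lists of morphological similars.
    Ties are broken alphabetically; this is an arbitrary choice.
    """
    merged = dict(frequencies)
    for group in groups:
        present = [word for word in group if word in merged]
        if len(present) < 2:
            continue  # nothing to merge for this group
        # Highest frequency wins; alphabetical order breaks ties.
        target = min(present, key=lambda word: (-merged[word], word))
        total = sum(merged.pop(word) for word in present)
        merged[target] = total
    return merged


# Invented counts for illustration only.
frequencies = {"elevated": 4, "elevate": 1, "elevating": 2, "high": 12}
groups = [["elevate", "elevated", "elevating"]]

result = merge_similars(frequencies, groups)
```

With these sample counts, the three forms collapse onto "elevated" with a combined frequency of 7, while "high" is left untouched.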

Summary

Continued Direction

 

Document Last Edited:

Thursday, October 30, 2003, 5:50 PM