>>Knowledge Representation of Unstructured Data. ^M00:00:03[ Pause ]^M00:00:06I mentioned previously that the way a patient's disease, for example, is coded into the medical record is by a human. A human looks through the doctor's note, sees "patient has diabetes mellitus," and assigns the appropriate ICD-9 code... ^M00:00:23[ Pause ]^M00:00:26...and/or the provider actually enters the code him or herself. But what often happens is that neither enters the correct code, and it's well known that providers either enter no code when there should be a code, or they enter a slightly incorrect code or maybe even a very incorrect code. And as I also mentioned, the coding is often for billing. So if a patient has several conditions for which the system is not being billed, say some chronic condition which doesn't require any immediate care, it would be nice to have codes for those illnesses as well. So we're going to talk about how coding can be automated from the notes in the medical record. Let's take our imaginary machine again and look at the patients with diabetes mellitus. As mentioned, we can have a human do it, but could a machine look at the patient's note? In the note, the doctor may have written, "Patient has diabetes mellitus." Could a machine look at those words, figure out that the patient has diabetes mellitus, and then assign the 250 code? ^M00:01:47[ Pause ]^M00:01:50There are such methods; they're quite sophisticated, incredibly powerful tools under the general category of natural language processing. A simple example of this is how Google figures out that you've misspelled something. There are very sophisticated algorithms which involve all sorts of strategies such as word matching, parts of speech, syntax, statistics, and what words are in the neighborhood. All these strategies are used to help disambiguate; to figure out from a potentially ambiguous set of symbols, if you will, what is going on with a patient. 
So let's look at our diabetes mellitus example. A machine could be programmed to look for literally the letters "diabetes mellitus," and if in a note the doctor wrote "diabetes mellitus," it would be very easy for the NLP tool to figure out that a patient has diabetes mellitus, and it could be programmed to assign the code of 250. Now what happens when it's ambiguous? The word "diabetes" could be by itself. For example the physician might say, "The patient has diabetes and is being treated, etc., etc." Well, unfortunately there are two kinds of diabetes; there's diabetes mellitus, which is far more common and the one that we've been discussing today, but there's also diabetes insipidus, which has really nothing to do with diabetes mellitus. So the computer would not be able to know if the patient had diabetes mellitus or diabetes insipidus, and there'd have to be some human intervention, by changing the program if you will, to disambiguate. It could be far more complicated than "diabetes" alone; it could be "DM2." What if the physician was very busy and wrote "DM2" or "Diabet," which is not uncommon either? So it's not an insurmountable task, but you can see that it would require some sophisticated programming for the machine to be able to recognize "diabetes mellitus," "diabedes" (misspelled), "DM," "DM2"; all these different ways of writing diabetes mellitus. But yes, that's the challenge, and that is in fact how computer scientists and biomedical informaticians have approached this. We just slog through, and we can train the machine automatically, which we may discuss later on in the course, to develop a library, if you will, of terms and abbreviations that represent diabetes. But at the end of the day the point here is a simple one: using sophisticated computer science methods, a machine could theoretically, and in fact does in certain circumstances, figure out that the patient has diabetes mellitus and then assign the correct code. 
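As a rough illustration of the "library of terms" idea above, here is a minimal Python sketch. It is not the method of any particular NLP tool. The abbreviation list, the `code_note` function, the `NEEDS_REVIEW` marker, and the similarity cutoff of 0.85 are all assumptions made for this example; only the variants themselves and the ICD-9 code 250 come from the lecture. Misspellings are caught with the standard library's `difflib`, which is a simple stand-in for the far more sophisticated methods real systems use.

```python
import difflib
import re

ICD9_DIABETES_MELLITUS = "250"   # ICD-9 code for diabetes mellitus used in the lecture
NEEDS_REVIEW = "AMBIGUOUS"       # hypothetical marker: a human has to decide

# Hypothetical hand-built list of abbreviations treated as unambiguous.
UNAMBIGUOUS_ABBREVIATIONS = {"dm2"}

def code_note(note, cutoff=0.85):
    """Return "250", NEEDS_REVIEW, or None for a free-text note."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

    # Check two-word phrases first so that "diabetes mellitus" (even when
    # slightly misspelled) wins over the bare, ambiguous word "diabetes".
    for phrase in bigrams:
        if difflib.get_close_matches(phrase, ["diabetes mellitus"], n=1, cutoff=cutoff):
            return ICD9_DIABETES_MELLITUS

    for token in tokens:
        if token in UNAMBIGUOUS_ABBREVIATIONS:
            return ICD9_DIABETES_MELLITUS
        if token == "diabetes":
            # Could be diabetes mellitus or diabetes insipidus.
            return NEEDS_REVIEW
    return None
```

With this sketch, a note containing "diabedes mellitus" or "DM2" is assigned 250, while a note that says only "diabetes" is flagged for a human rather than coded, which mirrors the human intervention described above.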
What is important, I think, about the approach of natural language processing is that it can help us look for facts about a patient that are not likely to be coded. I mentioned that chronic diseases and syndromes are often not coded because there's no billing for them and... ^M00:05:27[ Pause ]^M00:05:30...there are no ICD-9 codes assigned to these chronic diseases or syndromes. How about symptoms? A patient comes in to the doctor and says, "I have some muscle aches; I have some muscle pains." It's very possible that the patient is going to discuss this, but it will not be coded as an ICD-9 code because there was no definitive diagnosis made; there was no definitive treatment. Elements of the past medical history are very important to the patient record; again, not likely to have a code. Family history, the same. A natural language processing tool can read through the medical record and find all these important pieces of information about a patient which would more than likely not be coded. ^M00:06:20[ Pause ]^M00:06:23One question you might ask is, "How does the tool figure out what those conditions are and how does it give them a name? If it sees various signs and symptoms (myopathy, myalgia), can we use a translation table? Can the machine read through the note and accurately pick out concepts and then assign a code?" And the answer, of course, is "yes," thanks to the UMLS, which we've discussed previously. It's a huge table that has all sorts of important relationships and can be consulted in order to help translate findings from an unstructured text. So for example we could look at the table and find synonyms of various conditions; we could get specific codes for some of the signs and symptoms. Let's look at a specific example to make this a little more understandable. Let's take the patient with diabetes mellitus for whom we want to assign an ICD-9 code; can we use the UMLS to translate that? And the answer is "yes." 
The machine reads through the note: the patient is a 65 year old male with diabetes mellitus who was treated with metformin, diet, etc. The term "diabetes mellitus" is recognized by the NLP machine, which goes to the UMLS table and finds that diabetes mellitus in the UMLS has a specific CUI (concept unique identifier), which we discussed previously, which is C0011849. And we can drill down into the UMLS tables and find that that CUI can be matched to an ICD-9 code of, in this case, 250, which is the general code for diabetes mellitus. ^M00:08:30[ Pause ]^M00:08:33And if we look at the output from the UMLS (we looked briefly at this before), we see that there are all sorts of interesting pieces of information. We type in "diabetes mellitus" and we get a list of useful information. For example, at the top we see the listing for the ICD-9 code of 250; this is from having typed in "diabetes mellitus." And we even see the SNOMED code down below; we discussed that briefly. So we can use the UMLS to translate. This is the goal I mentioned previously; you read through the note, and you want to say, "I see diabetes mellitus, now what do I do with it? I need an ICD-9 code, let's go to the UMLS." But as mentioned in the prior example, what if the doctor has only said "diabetes"? This is a 65 year old man with diabetes who we know overwhelmingly has diabetes mellitus, especially because of the treatment with metformin, but the computer has to be told this. The computer is not going to be able to figure this out because it cannot map to a unique UMLS term; in the UMLS, if one puts in "diabetes," the engine which searches the UMLS figures out that it's actually ambiguous. It could be plain "diabetes" (you can see at the bottom of the list that the CUI for "diabetes" is C0011847) or various other kinds of diabetes mellitus, and the point is that the UMLS is stumped at this point; it really doesn't know what diabetes we're talking about. 
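The lookup just described can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two dictionaries are tiny hand-copied stand-ins for the UMLS tables, the function name `map_term_to_icd9` is hypothetical, and a real system would query the UMLS itself rather than a hard-coded table. C0011849 is the UMLS CUI for diabetes mellitus and C0011847 is the CUI for plain "diabetes."

```python
# Toy stand-in for UMLS tables (an assumption for illustration only;
# the real Metathesaurus holds millions of terms).
TERM_TO_CUIS = {
    "diabetes mellitus": ["C0011849"],
    # Plain "diabetes" returns more than one candidate concept.
    "diabetes": ["C0011847", "C0011849"],
}

# Toy CUI-to-ICD-9 mapping; 250 is the general code for diabetes mellitus.
CUI_TO_ICD9 = {"C0011849": "250"}

def map_term_to_icd9(term):
    """Return an ICD-9 code only when the term maps to exactly one CUI."""
    cuis = TERM_TO_CUIS.get(term.lower().strip(), [])
    if len(cuis) != 1:
        # Unknown term, or ambiguous term: no unique mapping is possible.
        return None
    return CUI_TO_ICD9.get(cuis[0])
```

Here `map_term_to_icd9("diabetes mellitus")` yields the 250 code, while `map_term_to_icd9("diabetes")` yields nothing, because the term does not resolve to a single concept.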
And the point is that it can't map at that point. A human intervention is going to have to take place in the form of programming. We can see here also that if the patient is labeled as having "DM" (and again, anybody on metformin is overwhelmingly likely to have diabetes mellitus), the translation of that in the UMLS is extremely ambiguous and really unresolvable: the NLP engine extracts "DM" and it cannot map it to a unique UMLS code. You see here all the different terms that the UMLS thinks "DM" could represent; diabetes mellitus of course is one of them, but also dexamethasone, demographics domain, double minutes, myotonic dystrophy. It's really hopeless at this point. And so the UMLS has its strengths, which we've discussed, but it also has its weaknesses. And this is a perfect example: the UMLS cannot resolve "DM." It's going to take a human intervention to resolve that. And this just gives me an opportunity to reiterate that the UMLS is not a terminology. Unfortunately it has examples of polysemy, which means that one term has two or more meanings. "Diabetes" could mean diabetes mellitus or diabetes insipidus, so "diabetes" is an ambiguous, unresolvable term. "DM" has many different meanings, as we've just seen. It also has synonymy, where one concept is associated with two or more terms. We've got diabetes mellitus; we've also got non-insulin-dependent diabetes mellitus. We have all sorts of other ways of referring to the same concept, diabetes mellitus. And not to worry: though sophisticated methods are required to resolve these issues, we have many, many superb, brilliant computer scientists working on this, but it is built brick by brick. And this is with something as common as diabetes mellitus, which is quite interesting to me. Diabetes mellitus is a very important disease; there is an epidemic of it. 
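The human intervention needed for "DM" might look something like the following Python sketch, which narrows the candidate meanings using other words in the note. The list of senses is the one read out above, and the metformin cue comes from the lecture. The `CONTEXT_CUES` table, the "myotonia" cue, and the `resolve_dm` function are hypothetical and chosen only for illustration; real disambiguation methods are considerably more sophisticated.

```python
import re

# Candidate meanings of "DM" as listed in the lecture.
DM_SENSES = [
    "diabetes mellitus",
    "dexamethasone",
    "demographics domain",
    "double minutes",
    "myotonic dystrophy",
]

# Hypothetical hand-written context cues. "metformin" comes from the lecture;
# "myotonia" is an assumed cue added purely for illustration.
CONTEXT_CUES = {
    "diabetes mellitus": {"metformin"},
    "myotonic dystrophy": {"myotonia"},
}

def resolve_dm(note):
    """Return the senses of "DM" that remain plausible for this note."""
    words = set(re.findall(r"[a-z]+", note.lower()))
    if "dm" not in words:
        return []
    # Keep only senses that have a context cue present in the note.
    supported = [s for s in DM_SENSES if CONTEXT_CUES.get(s, set()) & words]
    # With no supporting cue, every sense remains possible (polysemy).
    return supported if supported else list(DM_SENSES)
```

For a note such as "65 year old male with DM treated with metformin," only diabetes mellitus remains; for a note that says just "Patient has DM," all five meanings remain and the term stays unresolved.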
And there are so many different ways of saying "diabetes mellitus" that it can really stump the computer programs, and a main goal of many computer scientists is to figure out how to resolve this automatically. So this ends part 3 of our talk and we will proceed to part 4 which is the representation of... ^E00:13:13