Annals
Established in 1927 by the American College of Physicians

ACADEMIA AND CLINIC

Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing

George Hripcsak; Carol Friedman; Philip O. Alderson; William DuMouchel; Stephen B. Johnson; and Paul D. Clayton

1 May 1995 | Volume 122 Issue 9 | Pages 681-688

Objective: To evaluate the automated detection of clinical conditions described in narrative reports.

Design: Automated methods and human experts detected the presence or absence of six clinical conditions in 200 admission chest radiograph reports.

Study Subjects: A computerized, general-purpose natural language processor; 6 internists; 6 radiologists; 6 lay persons; and 3 other computer methods.

Main Outcome Measures: Intersubject disagreement was quantified by "distance" (the average number of clinical conditions per report on which two subjects disagreed) and by sensitivity and specificity with respect to the physicians.

Results: Using a majority vote, physicians detected 101 conditions in the 200 reports (0.51 per report); the most common condition was acute bacterial pneumonia (prevalence, 0.14), and the least common was chronic obstructive pulmonary disease (prevalence, 0.03). Pairs of physicians disagreed on the presence of at least 1 condition for an average of 20% of reports. The average intersubject distance among physicians was 0.24 (95% CI, 0.19 to 0.29) out of a maximum possible distance of 6. No physician had a significantly greater distance than the average. The average distance of the natural language processor from the physicians was 0.26 (CI, 0.21 to 0.32; not significantly greater than the average among physicians). Lay persons and alternative computer methods had significantly greater distance from the physicians (all >0.5). The natural language processor had a sensitivity of 81% (CI, 73% to 87%) and a specificity of 98% (CI, 97% to 99%); physicians had an average sensitivity of 85% and an average specificity of 98%.

Conclusions: Physicians disagreed on the interpretation of narrative reports, but this was not caused by outlier physicians or a consistent difference in the way internists and radiologists read reports. The natural language processor was not distinguishable from the physicians and was superior to all other comparison subjects. Although the domain of this study was restricted (six clinical conditions in chest radiographs), natural language processing seems to have the potential to extract clinical information from narrative reports in a manner that will support automated decision-support and clinical research.


The use of automated systems and electronic databases to enhance the quality, reduce the cost, and improve the management of health care has become common. Recent examples include using these systems to prevent adverse drug events [1, 2] and to encourage efficient treatment [3]. To function properly, automated systems require accurate, complete data. Although laboratory results are routinely available in electronic form, the most important clinical information—symptoms, signs, and assessments—remains largely inaccessible to automated systems. Investigators have attempted to use data from nonclinical sources to fill in the gaps, but such data have been found to be unreliable [4].

Much clinical data is locked up in departmental word-processor files, clinical databases, and research databases in the form of narrative reports such as discharge summaries, radiology reports, pathology reports, admission histories, and reports of physical examinations. Untold volumes of data are deleted every day after word-processor files are printed for the paper chart and for the mailing of reports. Exploiting this information is not trivial, however. Sentences that are easy for a person to understand are difficult for a computer to sort out. Problems include the many ways in which the same concept can be expressed (for example, "heart failure," "congestive heart failure," "CHF," and so forth); ambiguities in interpreting grammatical constructs ("possible worsening infiltrate" may refer to a definite infiltrate that may be worsening or to an uncertain infiltrate that, if present, is worsening); and negation ("lung fields are unremarkable" implies a lack of infiltrate). To be accurate, automated systems require "coded" data: The concepts must come from a well-defined, finite vocabulary, and the relations among the concepts must be expressed in an unambiguous, formal structure.

How do we unlock the contents of narrative reports? Human coders can be trained to read and manually structure reports [5]. Few institutions have been willing to invest in the personnel necessary for manual coding (other than for billing purposes), and the human coders can introduce an additional delay in obtaining coded data. The producers of reports (for example, radiologists for radiology reports) can be trained to directly create coded reports. Unfortunately, because manual coding systems do not match the speed and simplicity of dictating narrative reports, this approach has not attained widespread use. It also does not address the large number of reports already available in institutions.

Natural language processing offers an automated solution [6-11]. The processor converts narrative reports that are available in electronic form—either through word processors or electronic scanning—to coded descriptions that are appropriate for automated systems. The promise of efficient, accurate extraction of coded clinical data from narrative reports is certainly enticing. The question is whether natural language processors are up to the task—just how efficient and accurate are they, and how easy is it to use their coded output?


Methods

We evaluated a general-purpose processor [12] that is intended to cover various clinical reports. To be used in a particular domain (for example, radiology) and subdomain (chest radiograph), the processor must have initial programming under the supervision of an appropriate expert (radiologist). This programming process involves enumerating the vocabulary of the domain (for example, "patchy infiltrate") and formulating the grammar rules that are specific to the domain.

The natural language processor works as follows. The narrative report is fed into a preprocessor, which uses its vocabulary to recognize words and phrases in the report (for example, "lungs," "CHF"), map them to standard terms ("lung," "congestive heart failure"), and classify them into semantic categories ("bodylocation," "finding"). The parser then matches sequences of semantic categories in the report to structures defined in the grammar. For example, if the original report read, "infiltrate in lung," then the phrase might match this structure: "finding," "in," "bodylocation." Far more complex semantic structures are also supported through the grammar.
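The two stages described above can be illustrated with a minimal sketch. It is not the processor's actual implementation: the tiny vocabulary, the single grammar structure, and the names `VOCABULARY`, `GRAMMAR`, `preprocess`, and `parse` are all hypothetical, chosen only to reproduce the "infiltrate in lung" example.

```python
# Hypothetical miniature vocabulary: surface word -> (standard term, semantic category).
VOCABULARY = {
    "lungs": ("lung", "bodylocation"),
    "lung": ("lung", "bodylocation"),
    "chf": ("congestive heart failure", "finding"),
    "infiltrate": ("infiltrate", "finding"),
    "in": ("in", "in"),
}

# Hypothetical grammar with one structure, keyed by a sequence of semantic categories.
GRAMMAR = {
    ("finding", "in", "bodylocation"): "finding_at_location",
}


def preprocess(text):
    """Map each recognized word to its (standard term, semantic category) pair."""
    tokens = []
    for word in text.lower().split():
        if word in VOCABULARY:
            tokens.append(VOCABULARY[word])
    return tokens


def parse(tokens):
    """Return the matched grammar structure with its terms, or None if nothing matches."""
    categories = tuple(category for _, category in tokens)
    structure = GRAMMAR.get(categories)
    if structure is None:
        return None
    terms = [term for term, category in tokens if category != "in"]
    return {"structure": structure, "finding": terms[0], "bodylocation": terms[1]}


result = parse(preprocess("infiltrate in lungs"))
```

A real grammar would hold many structures and handle multiword phrases; this sketch matches only whole-sentence category sequences.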

This structure is then mapped to the processor's result: a set of findings, each of which is associated with its own descriptive modifiers, such as certainty, status, location, quantity, degree, and change. For example, the following is an excerpt from a narrative report: "Probable mild pulmonary vascular congestion with new left pleural effusion, question mild congestive changes".

From this report, the natural language processor generated the following three coded findings:

Pulmonary vascular congestion
    certainty: high
    degree: low

Pleural effusion
    region: left
    status: new

Congestive changes
    certainty: moderate
    degree: low

The processor attempts to encode all clinical information available in reports, including the clinical indication, description, and impression. These findings are stored in a clinical database, where they can be exploited for automated decision-support and clinical research.
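One way to picture the coded output is as a list of findings, each carrying its own modifiers. The sketch below is an assumed in-memory representation (the `Finding` class and `modifier_of` helper are hypothetical, not the processor's storage format), populated with the three findings from the example report.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """A coded finding with its descriptive modifiers (hypothetical representation)."""
    name: str
    modifiers: dict = field(default_factory=dict)


findings = [
    Finding("pulmonary vascular congestion", {"certainty": "high", "degree": "low"}),
    Finding("pleural effusion", {"region": "left", "status": "new"}),
    Finding("congestive changes", {"certainty": "moderate", "degree": "low"}),
]


def modifier_of(findings, name, modifier):
    """Look up one modifier of the first finding with the given name, or None."""
    for finding in findings:
        if finding.name == name:
            return finding.modifiers.get(modifier)
    return None
```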

At Columbia-Presbyterian Medical Center, New York, New York, the processor has been trained to handle chest radiograph and mammogram reports. In normal operation, the radiologist dictates a report, which is then transcribed by a clerk with a word processor. The word-processor files are printed for the paper chart, stored in the clinical database in their narrative form for on-line review by clinicians, and transmitted to the natural language processor for coding.

The coded data produced by the processor are exploited for automated decision-support by the use of a computer program called a clinical event monitor [13]. The event monitor generates alerts, reminders, and interpretations that are based on the Arden Syntax for Medical Logic Modules [14]. The event monitor follows all clinical events (for example, admissions and laboratory results) in the medical center that can be tracked by computer. Whenever a clinically important situation is detected, the event monitor sends a message to the health care provider. For example, the storage of a low serum potassium level prompts the monitor to check whether the patient is receiving digoxin; if so, the monitor warns the health care provider that the hypokalemia may potentiate cardiac arrhythmias.
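The potassium and digoxin example can be sketched as a single rule that fires when a result is stored. The threshold of 3.5 mmol/L, the function name `on_potassium_stored`, and the message wording are assumptions made for illustration; the source does not state the monitor's actual cutoff or interface.

```python
# Assumed cutoff for a "low" serum potassium level, for illustration only.
LOW_POTASSIUM_MMOL_PER_L = 3.5


def on_potassium_stored(potassium_mmol_per_l, active_medications):
    """Return a warning message if the stored level is low and the patient is on digoxin."""
    is_low = potassium_mmol_per_l < LOW_POTASSIUM_MMOL_PER_L
    on_digoxin = "digoxin" in active_medications
    if is_low and on_digoxin:
        return "Hypokalemia may potentiate cardiac arrhythmias in a patient receiving digoxin."
    return None  # no clinically important situation detected
```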

Our study was designed and conducted by an evaluation team that was separate from the development team responsible for the natural language processor. At the time of the evaluation, members of the evaluation team had no knowledge of the operation of the processor or of its strengths and weaknesses. They knew that the processor accepted chest radiograph reports and produced some coded result.

Two hundred admission chest radiograph reports were randomly selected from among those of all adult patients discharged from the inpatient service of Columbia-Presbyterian Medical Center during a particular week. An admission chest radiograph was defined as the first chest radiograph obtained during the hospital stay, even if it was not obtained on the first day. Chest radiographs were chosen because they display a broad range of disease, vocabulary, and grammatical variation. To better assess true performance, no corrections were made to reports, despite misspellings and even the inclusion of other types of reports in the same electronic files as the chest radiograph reports.

Study subjects (humans and automated methods) detected the presence or absence of six clinical conditions (Table 1). To ensure that the conditions were reasonable candidates for automated decision-support, they were selected from an independent published list of automated protocols that exploited chest radiographs [15]. An internist on the evaluation team selected the six conditions, thus ensuring that the conditions were common enough to be reasonably expected to appear several times in a set of 200 reports and that overlap would be minimized.


Table 1. Conditions

 

The 200 reports were processed by the natural language processor, and the resulting coded data were fed into the clinical event monitor. For each clinical condition, the monitor had a rule expressed as a Medical Logic Module [14] to detect the condition on the basis of the processor's coded output. The Medical Logic Modules concluded true (present) or false (absent). For example, the Medical Logic Module that detected pneumothorax was the simplest and used the following logic:

if finding is in ("pneumothorax"; "hydropneumothorax")
    and certainty-modifier is not in ("no"; "rule out"; "cannot evaluate")
    and status-modifier is not in ("resolved")
then
    conclude true;
endif;

The Medical Logic Module looks for reports with appropriate findings but eliminates reports that are actually stating that the finding is absent, unknown, or resolved. The Medical Logic Modules were written by a member of the evaluation team who was given access to the six condition definitions (Table 1), a sample of the natural language processor's output based on an independent set of chest radiographs, and a complete list of all vocabulary terms that the processor could generate in its output. No changes were made to the natural language processor, its grammar, or its vocabulary for the entire duration of the study (including the design phase). Once written, Medical Logic Modules were also held constant.
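For readers unfamiliar with the Arden Syntax, the pneumothorax logic can be transcribed into Python roughly as follows. This is a sketch under assumptions: each coded finding is represented as a dictionary with optional `certainty` and `status` keys, a missing modifier is treated as not excluded, and the module concludes false when no finding qualifies.

```python
PNEUMOTHORAX_FINDINGS = {"pneumothorax", "hydropneumothorax"}
EXCLUDED_CERTAINTY = {"no", "rule out", "cannot evaluate"}
EXCLUDED_STATUS = {"resolved"}


def pneumothorax_present(coded_findings):
    """Conclude True if any finding names a pneumothorax that is not absent, unknown, or resolved."""
    for finding in coded_findings:
        if (
            finding["finding"] in PNEUMOTHORAX_FINDINGS
            and finding.get("certainty") not in EXCLUDED_CERTAINTY
            and finding.get("status") not in EXCLUDED_STATUS
        ):
            return True
    return False
```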

Human participants were recruited as follows. Six board-certified radiologists and six board-certified internists were selected as experts. All 12 physicians actively practice medicine in their respective fields at Columbia-Presbyterian Medical Center. Six professional lay persons without experience in the practice of medicine were selected as additional controls. Each human participant analyzed 100 reports; the time required to analyze all 200 reports (about 4 hours) would have been a disincentive to participate in the study and might have led participants to hand in unfinished forms. Reports were assigned so that every participant analyzed at least 40 reports in common with every other participant; each report was read by six physicians and three lay persons.

Participants were given instructions that included a two-sentence description of each condition (Table 1). They were asked to select the conditions that applied to the patient on the basis of reading the chest radiograph report. For each report, a participant could select zero, one, or more conditions.

In addition to the natural language processor and the human participants, three additional computer algorithms were chosen for comparison. The first subject was a simple keyword search that tested for the presence of search phrases (such as "pneumonia") that implied one of the conditions within reports. Absence of a condition was implied by failure to find a pertinent phrase. Seventy-five search phrases were used for the six diagnoses. The second subject was a more complex keyword search. A total of 240 search phrases were used; they included explicit searches for the absence of a condition ("no evidence of infiltrate") and additional search phrases and abbreviations (obtained by testing the simple search on an independent training set). To ensure that differences in outcome would not be caused by differences in clinical acumen, both keyword searches were written by the same person who wrote the Medical Logic Modules. The final control subject simply generated "absent" to every condition for every report; it was included to verify the distance metric in the setting of low prevalence.
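The difference between the two keyword searches can be sketched for a single condition. The phrase lists below are hypothetical stand-ins for the 75 and 240 phrases actually used, and letting an absence phrase override a presence phrase is an assumed simplification of the complex search's logic.

```python
# Hypothetical phrase lists for one condition (acute bacterial pneumonia).
PRESENCE_PHRASES = ["pneumonia", "infiltrate"]
ABSENCE_PHRASES = ["no evidence of infiltrate", "infiltrates not seen"]


def simple_search(report):
    """Condition is present if any presence phrase occurs; absent otherwise."""
    text = report.lower()
    return any(phrase in text for phrase in PRESENCE_PHRASES)


def complex_search(report):
    """Like simple_search, but an explicit absence phrase overrides a presence match."""
    text = report.lower()
    if any(phrase in text for phrase in ABSENCE_PHRASES):
        return False
    return any(phrase in text for phrase in PRESENCE_PHRASES)
```

On a report reading "New infiltrates not seen", the simple search reports the condition as present because it matches "infiltrate", whereas the complex search reports it as absent.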

The main outcome measure in our study was the "distance" between pairs of subjects, where distance quantified the intersubject disagreement. The distance between two subjects for a given radiology report was defined as the number of conditions (0 to 6) on which the subjects disagreed. The average distance between two subjects was the simple average of distances over all reports that were analyzed in common. For each physician, we calculated the average distance from the other 11 physicians, and for each nonphysician (lay persons and computer methods), we calculated the average distance from all 12 physicians. The null hypothesis was that each subject was no more distant from the physicians than the physicians were from each other. The sample variances for each of the results were derived from the full covariance matrix for all the intersubject distances (Appendix).
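The distance definition above translates directly into a short computation. In this sketch each subject's answers are assumed to be stored as a dictionary from report identifier to a tuple of six 0/1 values; the function names are illustrative, and the variance calculation from the Appendix is not reproduced.

```python
def report_distance(answers_a, answers_b):
    """Number of conditions on which two subjects disagree for one report."""
    return sum(a != b for a, b in zip(answers_a, answers_b))


def average_distance(ratings_a, ratings_b):
    """Simple average of report distances over the reports both subjects analyzed."""
    common_reports = ratings_a.keys() & ratings_b.keys()
    if not common_reports:
        raise ValueError("the two subjects have no reports in common")
    total = sum(report_distance(ratings_a[r], ratings_b[r]) for r in common_reports)
    return total / len(common_reports)


# Toy data: the subjects share reports 2 and 3 and disagree on one condition in report 2.
subject_a = {1: (0, 0, 0, 0, 0, 0), 2: (1, 0, 0, 0, 0, 0), 3: (0, 1, 0, 0, 0, 0)}
subject_b = {2: (1, 1, 0, 0, 0, 0), 3: (0, 1, 0, 0, 0, 0), 4: (0, 0, 0, 0, 0, 1)}
toy_distance = average_distance(subject_a, subject_b)
```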

We calculated sensitivity and specificity for each subject using the majority physician opinion as the reference standard. The criterion was based on how many physicians answered that a condition was present:

≥ 4 out of 6 = condition present

3 out of 6 = condition randomly assigned to present or absent with 0.5 probability

≤ 2 out of 6 = condition absent

If the subject was a physician, then using his or her own answers to help determine the reference standard would bias the study in favor of the physicians. Therefore, if the subject was a physician, his or her data were removed from the reference standard, and the criterion was adjusted as follows:

≥ 3 out of 5 = condition present

≤ 2 out of 5 = condition absent
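Both criteria amount to a majority vote with a coin flip on ties, which can only occur with six voters. A minimal sketch, in which the function name and the injectable random source are assumptions:

```python
import random


def reference_standard(votes_present, n_physicians, rng=random):
    """Majority vote on one condition; an exact tie is assigned at random with probability 0.5."""
    if 2 * votes_present > n_physicians:
        return True   # e.g. >= 4 of 6, or >= 3 of 5
    if 2 * votes_present < n_physicians:
        return False  # e.g. <= 2 of 6, or <= 2 of 5
    return rng.random() < 0.5  # 3 of 6: random assignment
```

With five physicians (the subject's own answers removed) a tie cannot occur, so the random branch is never reached.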

We defined sensitivity as the number of positive answers in common between the subject and the reference standard divided by the number of positive answers in the reference standard. We defined specificity as the number of negative answers in common between the subject and the reference standard divided by the number of negative answers in the reference standard. We plotted the sensitivities and specificities using receiver-operator characteristic curve axes [16].

To assess whether internists and radiologists interpret radiology reports differently, the average distance of a physician to another physician of the same specialty was compared with the average distance to a physician of the opposite specialty. The null hypothesis was that the internists and radiologists agreed and that the average differences of the distances were zero.

To corroborate the estimates of distance and variance, we did two additional analyses. We used chance-corrected inter-rater agreement [17] as an alternative distance metric and used bootstrapping [18] to estimate variance directly from the data. Bootstrapping was also used to estimate the variance of sensitivity and specificity.
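The sensitivity and specificity definitions, together with a bootstrap estimate of their variance, can be sketched as follows. The answers are assumed to be flat sequences of 0/1 values aligned with the reference standard; resampling individual answers (rather than whole reports) and the choice of 1000 resamples are simplifying assumptions, not details taken from the study.

```python
import random
import statistics


def sensitivity(subject, reference):
    """Positive answers in common divided by positive answers in the reference standard."""
    on_positives = [s for s, r in zip(subject, reference) if r]
    return sum(on_positives) / len(on_positives)


def specificity(subject, reference):
    """Negative answers in common divided by negative answers in the reference standard."""
    on_negatives = [s for s, r in zip(subject, reference) if not r]
    return sum(1 for s in on_negatives if not s) / len(on_negatives)


def bootstrap_variance(statistic, subject, reference, n_resamples=1000, seed=0):
    """Variance of a statistic over resamples drawn with replacement from the answer pairs."""
    rng = random.Random(seed)
    pairs = list(zip(subject, reference))
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(pairs) for _ in pairs]
        resampled_subject, resampled_reference = zip(*resample)
        try:
            estimates.append(statistic(resampled_subject, resampled_reference))
        except ZeroDivisionError:
            continue  # resample lacked the reference answers this statistic needs
    return statistics.variance(estimates)


toy_subject = [1, 1, 0, 0, 1, 0, 0, 0]
toy_reference = [1, 1, 1, 0, 0, 0, 0, 0]
```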


Results

During the 7 days from 1 July to 7 July 1993, 402 of the 1061 patients discharged from Columbia-Presbyterian Medical Center had admission chest radiographs. Two hundred of these reports were selected randomly. The human participants required an average of 70 seconds to read a single report and choose among the six conditions, whereas the natural language processor (running on an IBM RS/6000 Model 550L, 42-MHz processor, 256-MB memory) required an average of 2 seconds.

The distribution of conditions in the reports is shown in Table 2. The distribution is expressed as the number of reports (out of 200) for which a given number of physicians said a condition was present. The last row of Table 2 shows the total over all six conditions (out of 1200 possible affirmative answers). The last two columns show the number and percentage of reports for which a majority of physicians (four or more out of six) said the condition was present; the last column can be interpreted as the prevalence of the conditions. On the basis of majority vote, physicians detected 101 conditions in the 200 reports, for an average prevalence per condition of 8.4%. The prevalences ranged from 3.0% (for chronic obstructive pulmonary disease) to 13.5% (for acute bacterial pneumonia).
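The two summary figures follow from the counts reported above, as this short check shows (the variable names are illustrative):

```python
conditions_detected = 101   # majority-vote positives over all six conditions
reports = 200
conditions = 6

# Conditions detected per report: 101 / 200
per_report = conditions_detected / reports

# Average prevalence per condition: 101 / (200 * 6) = 101 / 1200
average_prevalence = conditions_detected / (reports * conditions)
average_prevalence_percent = round(average_prevalence * 100, 1)
```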


Table 2. Distribution of Conditions (Number of Reports out of 200)

 

The main outcome measure, the average distance of each subject from the physicians, is shown in Figure 1. The average distance of physicians from each other was 0.24 (95% CI, 0.19 to 0.29). Physicians differed on the interpretation of reports for at least one of the six conditions about 20% of the time. The average distance of the natural language processor from the physicians was 0.26 (CI, 0.21 to 0.32). The performance of the subjects and the average performance of the physicians are shown in Table 3. Positive numbers imply worse performance (more unlike the average physician). No physicians differed significantly from other physicians. The fifth internist had the greatest deviation among the physicians, with an uncorrected P value of 0.0092, but when this is corrected by a Bonferroni multiplier for multiple hypotheses (either 12 for physicians or 22 for all subjects), the deviation becomes nonsignificant. The natural language processor did not differ significantly from the physicians. All lay persons and all comparison automated systems differed from the physicians with highly significant corrected P values.
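The Bonferroni correction for the fifth internist can be worked through directly. The 0.05 significance threshold is an assumption here (it matches the 95% CIs reported but is not stated explicitly), and the helper name is illustrative.

```python
def bonferroni(p_value, n_hypotheses):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_hypotheses)


uncorrected_p = 0.0092
corrected_physicians_only = bonferroni(uncorrected_p, 12)  # 0.0092 * 12, about 0.11
corrected_all_subjects = bonferroni(uncorrected_p, 22)     # 0.0092 * 22, about 0.20

ASSUMED_ALPHA = 0.05  # assumed significance threshold
```

Either multiplier raises the corrected value above the assumed threshold, which is consistent with the deviation being reported as nonsignificant.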



Figure 1. Average distance of subjects from physicians. The average distance and 95% CIs from each of the subjects to the physicians are shown. A greater distance implies worse performance (further from physician consensus).

 

Table 3. Average Subject Distance Minus Average Physician Distance

 

The sensitivity and specificity for each subject are listed in Table 4 and plotted in Figure 2. Overall, the physicians had an average sensitivity of 85% and an average specificity of 98%. Of all the nonphysicians, only the natural language processor had both a sensitivity and a specificity similar to that of the physicians (that is, both CIs included the average physician value). Although in Figure 2 the complex keyword search appears to be somewhat near the physicians, it is in a position of significantly lower performance on the receiver-operator characteristic curve.


Table 4. Sensitivity and Specificity

 


Figure 2. Sensitivity and specificity plotted on receiver-operator characteristic curve axes (specificity is listed in reverse order). Ideal performance is in the upper left corner of both graphs. The first graph (top) shows the full receiver-operator characteristic curve, whereas the second graph (bottom) is an expansion of the area near ideal performance (specificity has been expanded five times as much as sensitivity).

 

The average distance of a physician from another physician of the same specialty (0.245) was slightly greater than the average distance of a physician from physicians of the opposite specialty (0.239). The difference (0.006) was not significant (P > 0.2).

The use of chance-corrected inter-rater agreement as an alternate distance metric led to the same conclusions as the linear distance (only the scale changed). Bootstrap estimates of variance were within several percentage points of covariance matrix-based estimates for distance results.


Discussion

The performance of the natural language processor, in conjunction with the automated decision-support system, was indistinguishable from that of the physicians and was superior to the performance of the lay persons and the alternative automated methods. There was a highly significant difference between the physicians and all other subjects except the processor. Therefore, the study design appears to be powerful enough to detect differences in skill levels, and the performance of the processor is truly "physician-like" for the studied conditions. The conditions were selected and the event monitor modules were created without the help or knowledge of the processor's development team. The achievement of physician-like performance without the help of the developers supports the claim that it is a general-purpose processor; the selection of a different set of conditions would probably lead to similar results.

Although the 12 physicians had similar performance within the error of the experiment (Table 3), and although they appear well-clustered when analyzed in terms of sensitivity and specificity (Figure 2), the percentage of time that they disagree with each other (20%) is rather high. Therefore, although there is a significant level of disagreement among the physicians, no single physician stands out as significantly different from the others. This seems to be the normal level of disagreement that can be expected for the clinical interpretation of radiograph reports. The source of disagreement may be at one of several levels. Whether a finding is actually present on the chest radiograph may be unclear, leading the radiologist who originally generated the report to convey the ambiguity in the report itself. On the other hand, the finding may be clear, but the interpretation of its significance with respect to the conditions may differ among physicians. The physicians may also disagree on what the condition definitions mean. The physicians probably respond in different ways to phrases such as "reasonably likely"; these different responses result in different sensitivities and specificities. Inadvertent errors may also occur when physicians read reports. It is interesting that in a related area of research, the interpretation of the chest radiographs themselves, the same magnitude of inconsistency was found: a 30% disagreement when two radiologists read the same radiograph and a 22% disagreement when the same radiologist reread the same radiograph [19].

The natural language processor did as well as the physicians, both in terms of its distance and its sensitivity and specificity. We did not directly assess whether the cases on which the processor disagreed with physicians were clinically more significant than the cases on which physicians disagreed with each other. For example, disagreeing on what to call a borderline case of heart failure is less important than disagreeing on a clear case of a pulmonary nodule that requires follow-up. Fortunately, the distance metric indirectly accounts for clinical importance. In a case about which physicians fully agree (six out of six), picking the opposite conclusion leads to a large increase in distance. In a case for which there is split opinion (three out of six), all subjects are equally penalized by their answer, regardless of which conclusion they choose. Therefore, it is critical to agree with the physicians on those cases about which the physicians themselves agree. Alternate distance metrics, in which a simple majority physician vote determines a criterion standard, do not have this property.
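The property described above can be checked on a single condition by counting how many of the six physicians a subject disagrees with. The function name is illustrative; the counts follow from the distance definition.

```python
def total_disagreement(subject_answer, physician_answers):
    """Number of physicians whose answer on one condition differs from the subject's."""
    return sum(subject_answer != answer for answer in physician_answers)


unanimous = [1, 1, 1, 1, 1, 1]  # six out of six say the condition is present
split = [1, 1, 1, 0, 0, 0]      # three out of six

# Unanimous case: agreeing costs nothing, disagreeing costs the maximum.
agree_with_unanimous = total_disagreement(1, unanimous)     # 0
oppose_unanimous = total_disagreement(0, unanimous)         # 6

# Split case: either answer incurs the same penalty.
say_present_on_split = total_disagreement(1, split)         # 3
say_absent_on_split = total_disagreement(0, split)          # 3
```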

Figure 2 shows that the lay persons are clustered in an area of lower sensitivity than that of the physicians but, for the most part, in an area of specificity similar to that of the physicians. The lower sensitivity implies that lay persons did not recognize conditions that were recognized by physicians. The condition definitions were technical, aimed at defining the condition rather than teaching the medical background. Lay persons could recognize words in reports that matched words in the condition definitions, but they lacked the training to recognize alternate indications or even understand the vocabulary. More interesting is the fact that the specificities of most of the lay persons were close to those of the physicians. Because the lay persons could understand the English grammar (even though they did not understand the vocabulary), they were not fooled into thinking a condition was present when the report was actually stating that the condition was absent.

Sensitivity matched the level of the physicians for the complex keyword search and was a little lower for the simple keyword search; the difference between them was due to the training of the complex search. Both keyword searches had worse specificities than that of physicians. Because the simple search did not recognize negation, reports that said "new infiltrates not seen" were counted as having possible acute bacterial pneumonia. The complex search achieved better specificity because two thirds of its search phrases looked for the many ways of saying that an indication was absent. Nevertheless, even with 240 search phrases, complex Boolean logic, and training, it did not achieve the specificity of the physicians or the natural language processor. Our study does not show that other automated methods will not succeed but rather that a straightforward approach such as a keyword search is not sufficient.

Our study also indicates that internists and radiologists interpret reports similarly. If internists and radiologists consistently differed, the distance of a physician from a physician of the opposite type would be expected to exceed the distance to a physician of the same type. In fact, these average distances were almost identical. Figure 2 shows the same result. The internists and radiologists are intermixed without separate clusters.

The design of our study is appropriate for natural language processing in several ways. The use of several physicians provides a solid standard against which natural language processing can be judged, and it allows calculation of the variance among the physicians so that the significance of a difference between the processor and the physicians can be interpreted. The statistical methods are designed to accommodate studies in which physicians do not read each report; this will be critical for larger studies. By placing the processor in its intended environment—one in which real, uncorrected reports are used and one that directly feeds an automated decision-support system—one can judge how it will actually perform and whether its coded output is really usable.

One disadvantage of the approach is that if the system does not perform as expected, one must determine whether the natural language processor or the automated decision-support system is responsible. Without a criterion standard for the intermediate coded output, assigning responsibility is subjective. Our study does not replace rigorous measurements of the effect of automated systems on patient outcomes. Instead, it is a necessary step to ascertain that the potential for effect is real and that resources should be put into clinical trials of systems that exploit natural language processors.

Other natural language processors have achieved similar levels of performance. A special-purpose processor designed to detect neoplasms in chest radiograph reports [7] achieved a sensitivity of 98% and a specificity of 88% with respect to one physician's interpretation. A preliminary study of another special-purpose processor [8] resulted in a sensitivity of 87% (specificity was not measured). A general-purpose processor for discharge summaries [6] used to assess asthma indicators achieved a sensitivity of 84% (specificity was not measured). The processor in our study achieved a lower sensitivity (81%) than the others but achieved a higher specificity (98%). The physicians themselves showed sensitivities and specificities that more closely matched the processor in our study than the special-purpose neoplasm processor. Depending on the context, sensitivity or specificity may be more important. Automated alerting systems generally require high specificity so that false-positive alerts that undermine confidence in an automated system are avoided.

The significance of natural language processing lies in the vast quantity of clinical data that remains locked in narrative reports and in the automated tools that could exploit these data if they were coded. Automated alerts and interpretations [1, 2, 13-15, 20] require coded clinical data to do an intelligent analysis of the patient's condition. For example, given chest radiograph data, an automated system can warn a health care provider when a patient with a potential neoplasm has not had follow-up. Health care providers can directly query coded clinical data. In a patient with a long, complex history, a covering physician can ask whether the patient has ever had a pleural effusion; the answer that is based on natural language processing is more likely to be accurate than one based on sifting through volumes of paper records. Departmental quality assurance systems can exploit coded clinical data. For example, radiologic findings can be automatically correlated with subsequent pathology findings to estimate the rates of false-positive and false-negative radiology findings. Natural language processing may meet clinical researchers' need for clinical data, a need that has not been met by administrative data sources [4]. For example, queries to coded data can supply a list of potential study candidates and give estimates of disease prevalence.

The effort required to exploit natural language processing comprises several parts. Building the natural language processor computer program took 1 year (assume one full-time equivalent effort for all estimates); this program is reusable across domains. Assembling the grammar for chest radiographs took 6 months, and coding the vocabulary for the radiographs took 18 months. Adding mammography took only 1 month because the grammar changed little and many words were already present. The estimated time to add abdominal radiographs is 3 months because most of the findings in abdominal radiographs are occasionally found in chest radiographs (some "chest" radiographs cover most of the abdomen), and these have already been coded. Writing the six Medical Logic Modules for this study required less than 1 week. The ability to transfer the grammar and vocabulary to other institutions must still be tested. Extending the processor to handle reports similar to those of radiology (such as pathology) will require similar effort; its performance on reports with much less structure (such as discharge summaries) may be worse.


Conclusion

Natural language processing has the potential to unlock valuable clinical data from countless narrative reports and to support automated decision-support systems and clinical research. Evaluations are showing positive results. Our study has shown that, at least for six clinical conditions in chest radiograph reports, the performance of a general-purpose natural language processor did not differ significantly from that of physicians and was significantly superior to that of lay persons and alternative automated methods.


Appendix

The method used to calculate the performance of the natural language processor is described first. The method is then extended to apply to the other subjects.

Let $n$ be the number of independent reports to be assessed. Each is rated by the natural language processor (denoted as subject 0) and by some set of $J$ physicians (denoted as subjects $j = 1, 2, \ldots, J$). Let $X_{ij}$ be a $C$-length vector of scores $(x_{ij1}, \ldots, x_{ijc}, \ldots, x_{ijC})$ assigned to report $i$ by subject $j$ (where $C$ is the number of conditions). Item $x_{ijc}$ of $X_{ij}$ corresponds to the score for condition $c$, and its value may be 0 (absent) or 1 (present). Because the experiment did not call on every subject to rate every report, many of the $X_{ij}$ were missing at the time of the analysis. We use subscripted $n$ to denote how many reports each subject or combination of subjects has rated. For example, $n_j$ denotes the number of reports scored by subject $j$. Similarly, $n_{jk}$ denotes the number of reports rated by both subject $j$ and subject $k$, and $n_{jklm}$ denotes the number of reports rated by all of the subjects $j$, $k$, $l$, and $m$. There might be duplicates in the subscripts, in which case, for example, $n_{jkjm}$ is interpreted as $n_{jkm}$. In this study, $n$ was 200, $J$ was 12, and $C$ was 6.

Let $d_{ijk}$ be the distance between the rating scores, $X_{ij}$ and $X_{ik}$, that subjects $j$ and $k$ assign to report $i$. In this analysis, the number of conditions for which the two subjects differed (for report $i$) was used as the distance metric:

$$d_{ijk} = \sum_{c} |x_{ijc} - x_{ikc}|$$
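As a small illustration of this metric, the sketch below counts the conditions on which two subjects disagree for a single report. The function name and the two score vectors are hypothetical; only the metric itself comes from the text.

```python
def report_distance(x_j, x_k):
    """Number of conditions on which two 0/1 score vectors differ."""
    if len(x_j) != len(x_k):
        raise ValueError("score vectors must cover the same conditions")
    return sum(abs(a - b) for a, b in zip(x_j, x_k))


# Hypothetical scores for one report with C = 6 conditions.
x_j = [1, 0, 0, 0, 1, 0]
x_k = [1, 0, 1, 0, 0, 0]
d = report_distance(x_j, x_k)
print(d)  # the subjects disagree on the third and fifth conditions
```

With six conditions, the distance for a single report ranges from 0 (full agreement) to 6.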

Other metrics may also be used without affecting the variance calculations that follow. By convention, we set $d_{ijk} = 0$ if either of subjects $j$ or $k$ did not rate report $i$. Let $\bar{d}_{jk}$ be the average distance from subject $j$ to subject $k$:

$$\bar{d}_{jk} = \sum_{i} d_{ijk} / n_{jk}$$

$$\mathrm{Var}(\bar{d}_{jk}) = \mathrm{Cov}(\bar{d}_{jk}, \bar{d}_{jk})$$

$$\mathrm{Cov}(\bar{d}_{jk}, \bar{d}_{lm}) = \Bigl[\sum_{i} d_{ijk} d_{ilm} - n_{jklm} \bar{d}_{jk} \bar{d}_{lm}\Bigr] / (n_{jk} n_{lm})$$

For matrix calculations, we can define the column vector $\bar{d} = (\bar{d}_{01}, \ldots, \bar{d}_{J-1,J})^t$ of $J(J+1)/2$ mean distance scores, and the $J(J+1)/2$ by $J(J+1)/2$ covariance matrix $V$, whose elements are defined above. The estimated variance of any linear combination $b^t \bar{d}$, where $b = (b_{01}, \ldots, b_{J-1,J})^t$ is a vector of coefficients, is $\mathrm{Var}(b^t \bar{d}) = b^t V b$.
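The following sketch shows one way these quantities could be computed when some ratings are missing. It is an illustration under stated assumptions, not the authors' program: the data layout (a dictionary mapping each subject to a list of score vectors, with `None` for an unrated report), the function names, and the toy ratings are all hypothetical.

```python
from itertools import combinations


def pair_distances(ratings, j, k):
    """Per-report distances d_ijk (0 where either rating is missing) and n_jk."""
    distances, n_jk = [], 0
    for x_j, x_k in zip(ratings[j], ratings[k]):
        if x_j is None or x_k is None:
            distances.append(0)  # convention: unrated reports contribute 0
        else:
            distances.append(sum(abs(a - b) for a, b in zip(x_j, x_k)))
            n_jk += 1
    return distances, n_jk


def mean_distance(ratings, j, k):
    """Average distance between subjects j and k over jointly rated reports."""
    distances, n_jk = pair_distances(ratings, j, k)
    return sum(distances) / n_jk  # assumes j and k share at least one report


def covariance(ratings, j, k, l, m):
    """Estimated covariance of the mean distances for pairs (j, k) and (l, m)."""
    d_jk, n_jk = pair_distances(ratings, j, k)
    d_lm, n_lm = pair_distances(ratings, l, m)
    subjects = {j, k, l, m}  # duplicate subscripts collapse, as for n_jkjm
    n_jklm = sum(
        1
        for i in range(len(ratings[j]))
        if all(ratings[s][i] is not None for s in subjects)
    )
    cross = sum(a * b for a, b in zip(d_jk, d_lm))
    mean_jk = sum(d_jk) / n_jk
    mean_lm = sum(d_lm) / n_lm
    return (cross - n_jklm * mean_jk * mean_lm) / (n_jk * n_lm)


def variance_of_combination(b, V):
    """Quadratic form b^t V b for coefficient list b and square matrix V."""
    size = len(b)
    return sum(b[r] * V[r][c] * b[c] for r in range(size) for c in range(size))


# Hypothetical toy data: subject 0 plus two physicians, four reports, C = 2.
ratings = {
    0: [[1, 0], [0, 0], [1, 1], [0, 1]],
    1: [[1, 0], [0, 1], [1, 1], None],  # physician 1 did not rate report 4
    2: [[0, 0], [0, 0], [1, 1], [0, 1]],
}
pairs = list(combinations(sorted(ratings), 2))  # (0, 1), (0, 2), (1, 2)
V = [[covariance(ratings, j, k, l, m) for (l, m) in pairs] for (j, k) in pairs]
b = [0.5, 0.5, 0.0]  # weights that average the two distances from subject 0
var_b = variance_of_combination(b, V)
print(mean_distance(ratings, 0, 1), var_b)
```

Because unrated reports contribute a distance of 0, the cross-product sum runs over all reports but picks up nonzero terms only from reports rated by every subject involved.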

The previous theory enabled us to compute estimates and standard errors (and, assuming approximate normality, confidence intervals) for measures of intersubject difference. The following were used:

Average distance of the natural language processor from physicians:

$$\delta_0 = \sum_{k>0} \bar{d}_{0k} / J$$

Average distance of physician $j$ from all other physicians:

$$\delta_j = \sum_{0<k,\; k \neq j} \bar{d}_{jk} / (J-1); \quad j = 1, 2, \ldots, J$$

Overall average interphysician distance:

$$\Delta = 2 \sum_{0<k<j} \bar{d}_{jk} / [J(J-1)]$$

Comparison of the processor's mean distance with that of physician $j$:

$$\theta_j = \delta_0 - \delta_j; \quad j = 1, 2, \ldots, J$$

Comparison of the processor's mean distance with average of all physicians:

$$\sum_{j} \theta_j / J$$

Because these measures are all linear combinations of $\bar{d}$, their standard errors can be calculated as the square root of $b^t V b$. Because they represent the averages of many points, they are approximately normally distributed, and confidence intervals can be calculated.
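To illustrate how each measure maps to a coefficient vector $b$, the sketch below builds the vectors for the processor's average distance, a physician's average distance, the overall interphysician distance, and their difference, then forms a normal-approximation interval. The function names, the mean distances, the diagonal covariance matrix, and the multiplier 1.96 for a 95% interval are assumptions chosen for illustration; none of the numbers are study data.

```python
from itertools import combinations
from math import sqrt


def pair_index(J):
    """Map each subject pair (j, k), j < k, among subjects 0..J to a position."""
    return {p: r for r, p in enumerate(combinations(range(J + 1), 2))}


def coef_delta0(J, index):
    """Coefficients for the processor's average distance from the physicians."""
    b = [0.0] * len(index)
    for k in range(1, J + 1):
        b[index[(0, k)]] += 1.0 / J
    return b


def coef_delta_j(j, J, index):
    """Coefficients for physician j's average distance from other physicians."""
    b = [0.0] * len(index)
    for k in range(1, J + 1):
        if k != j:
            b[index[(min(j, k), max(j, k))]] += 1.0 / (J - 1)
    return b


def coef_big_delta(J, index):
    """Coefficients for the overall average interphysician distance."""
    b = [0.0] * len(index)
    for pair in combinations(range(1, J + 1), 2):
        b[index[pair]] += 2.0 / (J * (J - 1))
    return b


def coef_theta_j(j, J, index):
    """Coefficients for theta_j = delta_0 - delta_j."""
    return [a - c for a, c in zip(coef_delta0(J, index), coef_delta_j(j, J, index))]


def estimate_with_ci(b, d_bar, V, z=1.96):
    """Estimate b^t d_bar with a normal-approximation confidence interval."""
    estimate = sum(bi * di for bi, di in zip(b, d_bar))
    size = len(b)
    variance = sum(b[r] * V[r][c] * b[c] for r in range(size) for c in range(size))
    half_width = z * sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width


# Hypothetical inputs for J = 2 physicians; pairs are (0, 1), (0, 2), (1, 2).
J = 2
index = pair_index(J)
d_bar = [0.30, 0.20, 0.10]
V = [[0.0004, 0.0, 0.0], [0.0, 0.0004, 0.0], [0.0, 0.0, 0.0001]]  # assumed diagonal

delta0, delta0_lo, delta0_hi = estimate_with_ci(coef_delta0(J, index), d_bar, V)
big_delta, _, _ = estimate_with_ci(coef_big_delta(J, index), d_bar, V)
theta1, theta1_lo, theta1_hi = estimate_with_ci(coef_theta_j(1, J, index), d_bar, V)
print(delta0, big_delta, theta1, (theta1_lo, theta1_hi))
```

Averaging $\delta_j$ over all physicians gives $\Delta$, so the average of the $\theta_j$ equals $\delta_0 - \Delta$; an interval for it that excludes 0 would indicate that the processor's distance differs from the interphysician average.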

Lay persons and the automated methods were analyzed in a similar manner; their results were substituted for those of the processor. Each physician was analyzed by removing his or her results from the group of physicians, letting J = 11, and treating that physician like the processor in the above equations.


Author and Article Information

From Columbia-Presbyterian Medical Center, New York, New York, and Queens College, Flushing, New York.
Requests for Reprints: George Hripcsak, MD, Department of Medical Informatics, Columbia-Presbyterian Medical Center, 161 Fort Washington Avenue, AP-1310, New York, NY 10032.
Grant Support: National Library of Medicine grants LM04419, LM05397, and LM05627; grant #6-61483 from the Research Foundation of City University of New York.


References

1. Pestotnik SL, Evans RS, Burke JP, Gardner RM, Classen DC. Therapeutic antibiotic monitoring: surveillance using a computerized expert system. Am J Med. 1990; 88:43-8.

2. Rind DM, Safran C, Phillips RS, Wang Q, Calkins DR, Delbanco TL, et al. Effect of computer-based alerts on the treatment and outcomes of hospitalized patients. Arch Intern Med. 1994; 154:1511-7.

3. Tierney WM, Miller ME, Overhage JM, McDonald CJ. Physician inpatient order writing on microcomputer workstations. Effects on resource utilization. JAMA. 1993; 269:379-83.

4. Jollis JG, Ancukiewicz M, DeLong ER, Pryor DB, Muhlbaier LH, Mark DB. Discordance of databases designed for claims payment versus clinical information systems. Implications for outcomes research. Ann Intern Med. 1993; 119:844-50.

5. McDonald CJ, Tierney WM, Overhage JM, Martin DK, Wilson GA. The Regenstrief Medical Record System: 20 years of experience in hospitals, clinics, and neighborhood health centers. MD Comput. 1992; 9:206-17.

6. Sager N, Lyman M, Tick LJ, Nhan NT, Bucknall CE. Natural language processing of asthma discharge summaries for the monitoring of patient care. In: Safran C, ed. Proceedings of the Seventeenth Annual Symposium on Computer Applications in Medical Care; 1993 Oct 30-Nov 3; Washington, D.C. New York: McGraw-Hill; 1994:265-8.

7. Zingmond D, Lenert LA. Monitoring free-text data using medical language processing. Comput Biomed Res. 1993; 26:467-81.

8. Haug PJ, Ranum DL, Frederick PR. Computerized extraction of coded findings from free-text radiologic reports. Work in progress. Radiology. 1990; 174:543-8.

9. Chinchor N, Hirschman L, Lewis DD. Evaluating message understanding systems: an analysis of the third message understanding conference (MUC-3). Comput Linguist. 1993; 19:409-47.

10. Vries JK, Marshalek B, D'Abarno JC, Yount RJ, Dunner LL. An automated indexing system utilizing semantic net expansion. Comput Biomed Res. 1992; 25:153-67.

11. Gabrieli ER. Computer-assisted assessment of patient care in the hospital. J Med Syst. 1988; 12:135-46.

12. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc. 1994; 1:161-74.

13. Hripcsak G, Clayton PD, Cimino JJ, Johnson SB, Friedman C. Medical decision support at Columbia-Presbyterian Medical Center. In: Timmers T, Blum BI, eds. Software Engineering in Medical Informatics. Amsterdam: North-Holland; 1991:471-9.

14. Hripcsak G, Ludemann P, Pryor TA, Wigertz OB, Clayton PD. Rationale for the Arden Syntax. Comput Biomed Res. 1994; 27:291-324.

15. McDonald CJ. Action-Oriented Decisions in Ambulatory Medicine. Chicago: Year Book; 1981.

16. Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978; 8:283-98.

17. Dunn G. Design and Analysis of Reliability Studies. New York: Oxford Univ Pr; 1989.

18. Sprent P. Applied Nonparametric Statistical Methods. 2d ed. London: Chapman and Hall; 1993.

19. Yerushalmy J. The statistical assessment of the variability in observer perception and description of roentgenographic pulmonary shadows. Radiol Clin North Am. 1969; 7:381-92.

20. Johnston ME, Langton KB, Haynes RB, Mathieu A. Effects of computer-based clinical decision support systems on clinician performance and patient outcome. A critical appraisal of research. Ann Intern Med. 1994; 120:135-42.

Related articles in Annals:

Editorials
Toward Electronic Medical Records That Improve Care
William M. Tierney, J. Marc Overhage, and Clement J. McDonald
Annals 1995; 122: 725-726.



This article has been cited by other articles:


P. A. Dang, M. K. Kalra, M. A. Blake, T. J. Schultz, E. F. Halpern, and K. J. Dreyer
Extraction of Recommendation Features in Radiology with Natural Language Processing: Exploratory Study
Am. J. Roentgenol., August 1, 2008; 191(2): 313-320.

L. Zhou, S. Parsons, and G. Hripcsak
The Evaluation of a Temporal Reasoning System in Processing Clinical Discharge Summaries
J. Am. Med. Inform. Assoc., January 1, 2008; 15(1): 99-106.

Y. A. Lussier and Y. Liu
Computational Approaches to Phenotyping: High-Throughput Phenomics
Proceedings of the ATS, January 1, 2007; 4(1): 18-25.

J. H. Thrall
Reinventing Radiology in the Digital Age: Part II. New Directions and New Stakeholder Value
Radiology, October 1, 2005; 237(1): 15-18.

B. Hazlehurst, H. R. Frost, D. F. Sittig, and V. J. Stevens
MediClass: A System for Detecting and Classifying Encounter-based Clinical Events in Any Electronic Medical Record
J. Am. Med. Inform. Assoc., September 1, 2005; 12(5): 517-529.

G. B. Melton and G. Hripcsak
Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries
J. Am. Med. Inform. Assoc., July 1, 2005; 12(4): 448-457.

P. R. O. Payne and J. B. Starren
Quantifying Visual Similarity in Clinical Iconic Graphics
J. Am. Med. Inform. Assoc., May 1, 2005; 12(3): 338-345.

B. J. Thomas, H. Ouellette, E. F. Halpern, and D. I. Rosenthal
Automated Computer-Assisted Categorization of Radiology Reports
Am. J. Roentgenol., February 1, 2005; 184(2): 687-690.

T. S. Field, J. H. Gurwitz, L. R. Harrold, J. M. Rothschild, K. Debellis, A. C. Seger, L. S. Fish, L. Garber, M. Kelleher, and D. W. Bates
Strategies for Detecting Adverse Drug Events among Older Persons in the Ambulatory Setting
J. Am. Med. Inform. Assoc., November 1, 2004; 11(6): 492-498.

C. Friedman, L. Shagina, Y. Lussier, and G. Hripcsak
Automated Encoding of Clinical Documents Based on Natural Language Processing
J. Am. Med. Inform. Assoc., September 1, 2004; 11(5): 392-402.

D. Aronsky, E. Kasworm, J. A. Jacobson, P. J. Haug, and N. C. Dean
Electronic Screening of Dictated Reports to Identify Patients with Do-Not-Resuscitate Status
J. Am. Med. Inform. Assoc., September 1, 2004; 11(5): 403-409.

W. W. Chapman, G. F. Cooper, P. Hanbury, B. E. Chapman, L. H. Harrison, and M. M. Wagner
Creating a Text Classifier to Detect Radiology Reports Describing Mediastinal Findings Associated with Inhalational Anthrax and Other Disorders
J. Am. Med. Inform. Assoc., September 1, 2003; 10(5): 494-503.

A. B. Wilcox and G. Hripcsak
The Role of Domain Knowledge in Automating Medical Text Report Classification
J. Am. Med. Inform. Assoc., July 1, 2003; 10(4): 330-338.

H. J. Murff, A. J. Forster, J. F. Peterson, J. M. Fiskio, H. L. Heiman, and D. W. Bates
Electronically Screening Discharge Summaries for Adverse Medical Events
J. Am. Med. Inform. Assoc., July 1, 2003; 10(4): 339-350.

D. W. Bates, R. S. Evans, H. Murff, P. D. Stetson, L. Pizziferri, and G. Hripcsak
Detecting Adverse Events Using Information Technology
J. Am. Med. Inform. Assoc., March 1, 2003; 10(2): 115-128.

G. Hripcsak, J. H. M. Austin, P. O. Alderson, and C. Friedman
Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports
Radiology, July 1, 2002; 224(1): 157-163.

H. Yu, G. Hripcsak, and C. Friedman
Mapping Abbreviations to Full Forms in Biomedical Articles
J. Am. Med. Inform. Assoc., May 1, 2002; 9(3): 262-272.

G. Hripcsak and A. Wilcox
Reference Standards, Judges, and Comparison Subjects: Roles for Experts in Evaluating System Performance
J. Am. Med. Inform. Assoc., January 1, 2002; 9(1): 1-15.

R. K. Taira, S. G. Soderland, and R. M. Jakobovits
Automatic Structuring of Radiology Free-Text Reports
RadioGraphics, January 1, 2001; 21(1): 237-245.

M. Fiszman, W. W. Chapman, D. Aronsky, R. S. Evans, and P. J. Haug
Automatic Detection of Acute Bacterial Pneumonia from Chest X-ray Reports
J. Am. Med. Inform. Assoc., November 1, 2000; 7(6): 593-604.

W. W. Stead, R. A. Miller, M. A. Musen, and W. R. Hersh
Integration and Beyond: Linking Information from Disparate Sources and into Workflow
J. Am. Med. Inform. Assoc., March 1, 2000; 7(2): 135-145.
