1 May 1995 | Volume 122 Issue 9 | Pages 681-688
Objective: To evaluate the automated detection of clinical conditions described in narrative reports.
Design: Automated methods and human experts detected the presence or absence of six clinical conditions in 200 admission chest radiograph reports.
Study Subjects: A computerized, general-purpose natural language processor; 6 internists; 6 radiologists; 6 lay persons; and 3 other computer methods.
Main Outcome Measures: Intersubject disagreement was quantified by "distance" (the average number of clinical conditions per report on which two subjects disagreed) and by sensitivity and specificity with respect to the physicians.
Results: Using a majority vote, physicians detected 101 conditions in the 200 reports (0.51 per report); the most common condition was acute bacterial pneumonia (prevalence, 0.14), and the least common was chronic obstructive pulmonary disease (prevalence, 0.03). Pairs of physicians disagreed on the presence of at least 1 condition for an average of 20% of reports. The average intersubject distance among physicians was 0.24 (95% CI, 0.19 to 0.29) out of a maximum possible distance of 6. No physician had a significantly greater distance than the average. The average distance of the natural language processor from the physicians was 0.26 (CI, 0.21 to 0.32; not significantly greater than the average among physicians). Lay persons and alternative computer methods had significantly greater distance from the physicians (all >0.5). The natural language processor had a sensitivity of 81% (CI, 73% to 87%) and a specificity of 98% (CI, 97% to 99%); physicians had an average sensitivity of 85% and an average specificity of 98%.
Conclusions: Physicians disagreed on the interpretation of narrative reports, but this was not caused by outlier physicians or a consistent difference in the way internists and radiologists read reports. The natural language processor was not distinguishable from the physicians and was superior to all other comparison subjects. Although the domain of this study was restricted (six clinical conditions in chest radiographs), natural language processing seems to have the potential to extract clinical information from narrative reports in a manner that will support automated decision-support and clinical research.
Much clinical data is locked up in departmental word-processor files, clinical databases, and research databases in the form of narrative reports such as discharge summaries, radiology reports, pathology reports, admission histories, and reports of physical examinations. Untold volumes of data are deleted every day after word-processor files are printed for the paper chart and for the mailing of reports. Exploiting this information is not trivial, however. Sentences that are easy for a person to understand are difficult for a computer to sort out. Problems include the many ways in which the same concept can be expressed (for example, "heart failure," "congestive heart failure," "CHF," and so forth); ambiguities in interpreting grammatical constructs ("possible worsening infiltrate" may refer to a definite infiltrate that may be worsening or to an uncertain infiltrate that, if present, is worsening); and negation ("lung fields are unremarkable" implies a lack of infiltrate). To be accurate, automated systems require "coded" data: The concepts must come from a well-defined, finite vocabulary, and the relations among the concepts must be expressed in an unambiguous, formal structure.
How do we unlock the contents of narrative reports? Human coders can be trained to read and manually structure reports [5]. Few institutions have been willing to invest in the personnel necessary for manual coding (other than for billing purposes), and the human coders can introduce an additional delay in obtaining coded data. The producers of reports (for example, radiologists for radiology reports) can be trained to directly create coded reports. Unfortunately, because manual coding systems do not match the speed and simplicity of dictating narrative reports, this approach has not attained widespread use. It also does not address the large number of reports already available in institutions.
Natural language processing offers an automated solution [6-11]. The processor converts narrative reports that are available in electronic form (either through word processors or electronic scanning) to coded descriptions that are appropriate for automated systems. The promise of efficient, accurate extraction of coded clinical data from narrative reports is certainly enticing. The question is whether natural language processors are up to the task: just how efficient and accurate are they, and how easy is it to use their coded output?
The natural language processor works as follows. The narrative report is fed into a preprocessor, which uses its vocabulary to recognize words and phrases in the report (for example, "lungs," "CHF"), map them to standard terms ("lung," "congestive heart failure"), and classify them into semantic categories ("bodylocation," "finding"). The parser then matches sequences of semantic categories in the report to structures defined in the grammar. For example, if the original report read, "infiltrate in lung," then the phrase might match this structure: "finding," "in," "bodylocation." Far more complex semantic structures are also supported through the grammar.
This structure is then mapped to the processor's result: a set of findings, each of which is associated with its own descriptive modifiers, such as certainty, status, location, quantity, degree, and change. For example, the following is an excerpt from a narrative report: "Probable mild pulmonary vascular congestion with new left pleural effusion, question mild congestive changes".
From this report, the natural language processor generated the following three coded findings:
Pulmonary vascular congestion
    certainty: high
    degree: low
Pleural effusion
    region: left
    status: new
Congestive changes
    certainty: moderate
    degree: low
The processor attempts to encode all clinical information available in reports, including the clinical indication, description, and impression. These findings are stored in a clinical database, where they can be exploited for automated decision-support and clinical research.
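The preprocessing and parsing steps described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the `VOCABULARY` entries, the single `GRAMMAR` structure, and the function names `preprocess` and `parse` are hypothetical and are not taken from the processor evaluated in this study, whose vocabulary and grammar are far larger and also produce modifiers such as certainty and status.

```python
# Toy vocabulary: surface word -> (standard term, semantic category).
# These entries are illustrative assumptions, not the processor's real lexicon.
VOCABULARY = {
    "lungs": ("lung", "bodylocation"),
    "lung": ("lung", "bodylocation"),
    "chf": ("congestive heart failure", "finding"),
    "infiltrate": ("infiltrate", "finding"),
    "in": ("in", "in"),
}

# One toy grammar structure: a finding, the word "in", then a body location.
GRAMMAR = [("finding", "in", "bodylocation")]


def preprocess(report):
    """Recognize known words and map them to (standard term, category)."""
    tokens = report.lower().replace(",", " ").replace(".", " ").split()
    return [VOCABULARY[t] for t in tokens if t in VOCABULARY]


def parse(tagged):
    """Return a coded finding for each span that matches a grammar structure."""
    results = []
    categories = [category for _, category in tagged]
    for pattern in GRAMMAR:
        width = len(pattern)
        for start in range(len(categories) - width + 1):
            if tuple(categories[start:start + width]) == pattern:
                # For this one structure, the first element is the finding
                # and the last element is its body location.
                results.append({
                    "finding": tagged[start][0],
                    "location": tagged[start + width - 1][0],
                })
    return results


coded = parse(preprocess("Infiltrate in lungs."))
print(coded)
```

The sketch only shows the flow from words to standard terms to a matched structure; it does not attempt synonyms spanning several words, negation, or ambiguity, which are the hard parts discussed earlier.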
At Columbia-Presbyterian Medical Center, New York, New York, the processor has been trained to handle chest radiograph and mammogram reports. In normal operation, the radiologist dictates a report, which is then transcribed by a clerk with a word processor. The word-processor files are printed for the paper chart, stored in the clinical database in their narrative form for on-line review by clinicians, and transmitted to the natural language processor for coding.
The coded data produced by the processor are exploited for automated decision-support by the use of a computer program called a clinical event monitor [13]. The event monitor generates alerts, reminders, and interpretations that are based on the Arden Syntax for Medical Logic Modules [14]. The event monitor follows all clinical events (for example, admissions and laboratory results) in the medical center that can be tracked by computer. Whenever a clinically important situation is detected, the event monitor sends a message to the health care provider. For example, the storage of a low serum potassium level prompts the monitor to check whether the patient is receiving digoxin; if so, the monitor warns the health care provider that the hypokalemia may potentiate cardiac arrhythmias.
Our study was designed and conducted by an evaluation team that was separate from the development team responsible for the natural language processor. At the time of the evaluation, members of the evaluation team had no knowledge of the operation of the processor or of its strengths and weaknesses. They knew that the processor accepted chest radiograph reports and produced some coded result.
Two hundred admission chest radiograph reports were randomly selected from among those of all adult patients discharged from the inpatient service of Columbia-Presbyterian Medical Center during a particular week. An admission chest radiograph was defined as the first chest radiograph obtained during the hospital stay, even if it was not obtained on the first day. Chest radiographs were chosen because they display a broad range of disease, vocabulary, and grammatical variation. To better assess true performance, no corrections were made to reports, despite misspellings and even the inclusion of other types of reports in the same electronic files as the chest radiograph reports.
Study subjects (humans and automated methods) detected the presence or absence of six clinical conditions (Table 1). To ensure that the conditions were reasonable candidates for automated decision-support, they were selected from an independent published list of automated protocols that exploited chest radiographs [15]. An internist on the evaluation team selected the six conditions, thus ensuring that the conditions were common enough to be reasonably expected to appear several times in a set of 200 reports and that overlap would be minimized.
ACADEMIA AND CLINIC
Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing
The use of automated systems and electronic databases to enhance the quality, reduce the cost, and improve the management of health care has become common. Recent examples include using these systems to prevent adverse drug events [1, 2] and to encourage efficient treatment [3]. To function properly, automated systems require accurate, complete data. Although laboratory results are routinely available in electronic form, the most important clinical information (symptoms, signs, and assessments) remains largely inaccessible to automated systems. Investigators have attempted to use data from nonclinical sources to fill in the gaps, but such data have been found to be unreliable [4].
Methods
We evaluated a general-purpose processor [12] that is intended to cover various clinical reports. To be used in a particular domain (for example, radiology) and subdomain (chest radiograph), the processor must have initial programming under the supervision of an appropriate expert (radiologist). This programming process involves enumerating the vocabulary of the domain (for example, "patchy infiltrate") and formulating the grammar rules that are specific to the domain.
The 200 reports were processed by the natural language processor, and the resulting coded data were fed into the clinical event monitor. For each clinical condition, the monitor had a rule expressed as a Medical Logic Module [14] to detect the condition on the basis of the processor's coded output. The Medical Logic Modules concluded true (present) or false (absent). For example, the Medical Logic Module that detected pneumothorax was the simplest and used the following logic:
    if finding is in ("pneumothorax"; "hydropneumothorax")
        and certainty-modifier is not in
            ("no"; "rule out"; "cannot evaluate")
        and status-modifier is not in ("resolved")
    then
        conclude true;
    endif;
The Medical Logic Module looks for reports with appropriate findings but eliminates reports that are actually stating that the finding is absent, unknown, or resolved. The Medical Logic Modules were written by a member of the evaluation team who was given access to the six condition definitions (Table 1), a sample of the natural language processor's output based on an independent set of chest radiographs, and a complete list of all vocabulary terms that the processor could generate in its output. No changes were made to the natural language processor, its grammar, or its vocabulary for the entire duration of the study (including the design phase). Once written, Medical Logic Modules were also held constant.
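For readers unfamiliar with the Arden Syntax, the pneumothorax logic above can be restated in Python. This is a sketch under stated assumptions, not the study's implementation: the `Finding` class and the function `detects_pneumothorax` are hypothetical names, and the sketch assumes that the rule is evaluated once per coded finding and that a report is positive if any finding satisfies it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """One coded finding with the two modifiers the rule inspects (hypothetical)."""
    name: str
    certainty: Optional[str] = None
    status: Optional[str] = None


PNEUMOTHORAX_TERMS = {"pneumothorax", "hydropneumothorax"}
EXCLUDED_CERTAINTY = {"no", "rule out", "cannot evaluate"}
EXCLUDED_STATUS = {"resolved"}


def detects_pneumothorax(findings):
    """Conclude true if any finding names the condition and is not
    negated, unknown, or resolved; otherwise conclude false."""
    return any(
        f.name in PNEUMOTHORAX_TERMS
        and f.certainty not in EXCLUDED_CERTAINTY
        and f.status not in EXCLUDED_STATUS
        for f in findings
    )


positive_report = [Finding("pneumothorax", certainty="high", status="new")]
negated_report = [Finding("pneumothorax", certainty="no")]
resolved_report = [Finding("hydropneumothorax", status="resolved")]
print(detects_pneumothorax(positive_report),
      detects_pneumothorax(negated_report),
      detects_pneumothorax(resolved_report))
```

A finding with no certainty or status modifier passes both exclusion tests here, which mirrors the "is not in" wording of the original logic.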
Human participants were recruited as follows. Six board-certified radiologists and six board-certified internists were selected as experts. All 12 physicians actively practice medicine in their respective fields at Columbia-Presbyterian Medical Center. Six professional lay persons without experience in the practice of medicine were selected as additional controls. Each human participant analyzed 100 reports; the time required to analyze all 200 reports (about 4 hours) would have been a disincentive to participate in the study and might have led participants to hand in unfinished forms. Reports were assigned so that every participant analyzed at least 40 reports in common with every other participant; each report was read by six physicians and three lay persons.
Participants were given instructions that included a two-sentence description of each condition (Table 1). They were asked to select the conditions that applied to the patient on the basis of reading the chest radiograph report. For each report, a participant could select zero, one, or more conditions.
In addition to the natural language processor and the human participants, three additional computer algorithms were chosen for comparison. The first subject was a simple keyword search that tested for the presence of search phrases (such as "pneumonia") that implied one of the conditions within reports. Absence of a condition was implied by failure to find a pertinent phrase. Seventy-five search phrases were used for the six diagnoses. The second subject was a more complex keyword search. A total of 240 search phrases were used; they included explicit searches for the absence of a condition ("no evidence of infiltrate") and additional search phrases and abbreviations (obtained by testing the simple search on an independent training set). To ensure that differences in outcome would not be caused by differences in clinical acumen, both keyword searches were written by the same person who wrote the Medical Logic Modules. The final control subject simply generated "absent" to every condition for every report; it was included to verify the distance metric in the setting of low prevalence.
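The difference between the two keyword searches can be shown with a small sketch. The phrase lists below are illustrative assumptions (the study's 75 and 240 search phrases are not reproduced here), and stripping absence phrases before searching is only one plausible way to combine them; the study describes its complex search as using Boolean logic without giving details.

```python
# Illustrative phrase lists; the study's actual search phrases are not shown here.
POSITIVE_PHRASES = ["pneumonia", "infiltrate"]
ABSENCE_PHRASES = [
    "no evidence of infiltrate",
    "infiltrates not seen",
    "no infiltrate",
]


def simple_search(report):
    """Condition is present if any positive phrase occurs anywhere."""
    text = report.lower()
    return any(phrase in text for phrase in POSITIVE_PHRASES)


def complex_search(report):
    """Remove explicit statements of absence, then look for positive phrases."""
    text = report.lower()
    for phrase in ABSENCE_PHRASES:
        text = text.replace(phrase, " ")
    return any(phrase in text for phrase in POSITIVE_PHRASES)


negated = "New infiltrates not seen."
affirmed = "Right lower lobe infiltrate."
print(simple_search(negated), complex_search(negated))
print(simple_search(affirmed), complex_search(affirmed))
```

The simple search flags the negated sentence because it sees only the word, whereas the second search catches that particular negation but would still miss any phrasing absent from its list.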
The main outcome measure in our study was the "distance" between pairs of subjects, where distance quantified the intersubject disagreement. The distance between two subjects for a given radiology report was defined as the number of conditions (0 to 6) on which the subjects disagreed. The average distance between two subjects was the simple average of distances over all reports that were analyzed in common. For each physician, we calculated the average distance from the other 11 physicians, and for each nonphysician (lay persons and computer methods), we calculated the average distance from all 12 physicians. The null hypothesis was that each subject was no more distant from the physicians than the physicians were from each other. The sample variances for each of the results were derived from the full covariance matrix for all the intersubject distances (Appendix).
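The distance measure can be written out in a few lines. This is a minimal sketch, assuming each subject's answers are stored as a dictionary from a report identifier to a tuple of six 0/1 values (one per condition); the data layout and the names `report_distance` and `average_distance` are hypothetical.

```python
def report_distance(answers_a, answers_b):
    """Number of conditions on which two subjects disagree for one report."""
    return sum(1 for a, b in zip(answers_a, answers_b) if a != b)


def average_distance(ratings_j, ratings_k):
    """Simple average of distances over the reports both subjects analyzed."""
    common = sorted(set(ratings_j) & set(ratings_k))
    if not common:
        raise ValueError("the two subjects have no reports in common")
    total = sum(report_distance(ratings_j[r], ratings_k[r]) for r in common)
    return total / len(common)


# Made-up example data: report id -> answers for the six conditions.
subject_a = {
    1: (1, 0, 0, 0, 0, 0),
    2: (0, 0, 0, 0, 0, 0),
    3: (0, 1, 0, 0, 0, 0),
}
subject_b = {
    1: (1, 0, 0, 0, 0, 0),
    2: (0, 0, 1, 0, 0, 0),
    4: (0, 0, 0, 0, 0, 1),
}

# Reports 1 and 2 are shared; the subjects disagree on one condition in report 2.
example_distance = average_distance(subject_a, subject_b)
print(example_distance)
```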
We calculated sensitivity and specificity for each subject using the majority physician opinion as the reference standard. The criterion was based on how many physicians answered that a condition was present:
≥4 out of 6 = condition present
3 out of 6 = condition randomly assigned to present or absent with 0.5 probability
≤2 out of 6 = condition absent
If the subject was a physician, then using his or her own answers to help determine the reference standard would bias the study in favor of the physicians. Therefore, if the subject was a physician, his or her data were removed from the reference standard, and the criterion was adjusted as follows:
≥3 out of 5 = condition present
≤2 out of 5 = condition absent
We defined sensitivity as the number of positive answers in common between the subject and the reference standard divided by the number of positive answers in the reference standard. We defined specificity as the number of negative answers in common between the subject and the reference standard divided by the number of negative answers in the reference standard. We plotted the sensitivities and specificities using receiver-operator characteristic curve axes [16].
To assess whether internists and radiologists interpret radiology reports differently, the average distance of a physician to another physician of the same specialty was compared with the average distance to a physician of the opposite specialty. The null hypothesis was that the internists and radiologists agreed and that the average differences of the distances were zero.
To corroborate the estimates of distance and variance, we did two additional analyses. We used chance-corrected inter-rater agreement [17] as an alternative distance metric and used bootstrapping [18] to estimate variance directly from the data. Bootstrapping was also used to estimate the variance of sensitivity and specificity.
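The majority-vote reference standard and the definitions of sensitivity and specificity can be sketched as follows. The function names and the flat list layout (one 0/1 answer per report-condition pair) are assumptions for illustration; the sketch also assumes that the reference standard contains at least one positive and one negative answer.

```python
import random


def reference_standard(votes, rng):
    """Majority vote over physicians' 0/1 answers for one report and condition.

    With six votes a 3-3 tie is broken at random with probability 0.5; with
    five votes (the subject's own answer removed) no tie can occur.
    """
    present = sum(votes)
    absent = len(votes) - present
    if present > absent:
        return 1
    if present < absent:
        return 0
    return rng.choice([0, 1])


def sensitivity_specificity(subject, reference):
    """Agreement on positives and on negatives, each divided by the
    corresponding count in the reference standard."""
    true_pos = sum(1 for s, r in zip(subject, reference) if s == 1 and r == 1)
    true_neg = sum(1 for s, r in zip(subject, reference) if s == 0 and r == 0)
    positives = sum(reference)
    negatives = len(reference) - positives
    return true_pos / positives, true_neg / negatives


rng = random.Random(0)  # fixed seed so tie-breaking is reproducible
reference = [1, 1, 0, 0, 0]
subject = [1, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(subject, reference)
print(sens, spec)
```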
Results
The distribution of conditions in the reports is shown in Table 2. The distribution is expressed as the number of reports (out of 200) for which a given number of physicians said a condition was present. The last row of Table 2 shows the total over all six conditions (out of 1200 possible affirmative answers). The last two columns show the number and percentage of reports for which a majority of physicians (four or more out of six) said the condition was present; the last column can be interpreted as the prevalence of the conditions. On the basis of majority vote, physicians detected 101 conditions in the 200 reports, for an average prevalence per condition of 8.4%. The prevalences ranged from 3.0% (for chronic obstructive pulmonary disease) to 13.5% (for acute bacterial pneumonia).
The main outcome measure, the average distance of each subject from the physicians, is shown in Figure 1. The average distance of physicians from each other was 0.24 (95% CI, 0.19 to 0.29). Physicians differed on the interpretation of reports for at least one of the six conditions about 20% of the time. The average distance of the natural language processor from the physicians was 0.26 (CI, 0.21 to 0.32). The performance of the subjects and the average performance of the physicians are shown in Table 3. Positive numbers imply worse performance (more unlike the average physician). No physicians differed significantly from other physicians. The fifth internist had the greatest deviation among the physicians, with an uncorrected P value of 0.0092, but when this is corrected by a Bonferroni multiplier for multiple hypotheses (either 12 for physicians or 22 for all subjects), the deviation becomes nonsignificant. The natural language processor did not differ significantly from the physicians. All lay persons and all comparison automated systems differed from the physicians with highly significant corrected P values.
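The Bonferroni correction mentioned above is simple arithmetic: the uncorrected P value is multiplied by the number of hypotheses tested. The short sketch below works through the fifth internist's value; the function name is hypothetical, and the 0.05 threshold used in the comparison is an assumption, chosen to match the 95% confidence intervals reported elsewhere.

```python
def bonferroni(p_value, n_hypotheses):
    """Multiply the P value by the number of hypotheses, capped at 1."""
    return min(1.0, p_value * n_hypotheses)


uncorrected = 0.0092
corrected_physicians = bonferroni(uncorrected, 12)  # 12 physicians
corrected_all = bonferroni(uncorrected, 22)         # all 22 subjects

ALPHA = 0.05  # assumed significance threshold
print(corrected_physicians, corrected_physicians < ALPHA)
print(corrected_all, corrected_all < ALPHA)
```

With either multiplier the corrected value exceeds the assumed threshold, which agrees with the statement that the deviation becomes nonsignificant.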
The sensitivity and specificity for each subject are listed in Table 4 and plotted in Figure 2. Overall, the physicians had an average sensitivity of 85% and an average specificity of 98%. Of all the nonphysicians, only the natural language processor had both a sensitivity and a specificity similar to that of the physicians (that is, both CIs included the average physician value). Although in Figure 2 the complex keyword search appears to be somewhat near the physicians, it is in a position of significantly lower performance on the receiver-operator characteristic curve.
The average distance of a physician from another physician of the same specialty (0.245) was slightly greater than the average distance of a physician from physicians of the opposite specialty (0.239). The difference (0.006) was not significant (P > 0.2).
The use of chance-corrected inter-rater agreement as an alternate distance metric led to the same conclusions as the linear distance (only the scale changed). Bootstrap estimates of variance were within several percentage points of covariance matrix-based estimates for distance results.
Discussion
Although the 12 physicians had similar performance within the error of the experiment (Table 3), and although they appear well-clustered when analyzed in terms of sensitivity and specificity (Figure 2), the percentage of time that they disagree with each other (20%) is rather high. Therefore, although there is a significant level of disagreement among the physicians, no single physician stands out as significantly different from the others. This seems to be the normal level of disagreement that can be expected for the clinical interpretation of radiograph reports. The source of disagreement may be at one of several levels. Whether a finding is actually present on the chest radiograph may be unclear, leading the radiologist who originally generated the report to convey the ambiguity in the report itself. On the other hand, the finding may be clear, but the interpretation of its significance with respect to the conditions may differ among physicians. The physicians may also disagree on what the condition definitions mean. The physicians probably respond in different ways to phrases such as "reasonably likely"; these different responses result in different sensitivities and specificities. Inadvertent errors may also occur when physicians read reports. It is interesting that in a related area of research (the interpretation of the chest radiographs themselves) the same magnitude of inconsistency was found: a 30% disagreement when two radiologists read the same radiograph and a 22% disagreement when the same radiologist reread the same radiograph [19].
The natural language processor did as well as the physicians, both in terms of its distance and its sensitivity and specificity. We did not directly assess whether the cases on which the processor disagreed with physicians were clinically more significant than the cases on which physicians disagreed with each other. For example, disagreeing on what to call a borderline case of heart failure is less important than disagreeing on a clear case of a pulmonary nodule that requires follow-up. Fortunately, the distance metric indirectly accounts for clinical importance. In a case about which physicians fully agree (six out of six), picking the opposite conclusion leads to a large increase in distance. In a case for which there is split opinion (three out of six), all subjects are equally penalized by their answer, regardless of which conclusion they choose. Therefore, it is critical to agree with the physicians on those cases about which the physicians themselves agree. Alternate distance metrics, in which a simple majority physician vote determines a criterion standard, do not have this property.
Figure 2 shows that the lay persons are clustered in an area of lower sensitivity than that of the physicians but, for the most part, in an area of specificity similar to that of the physicians. The lower sensitivity implies that lay persons did not recognize conditions that were recognized by physicians. The condition definitions were technical, aimed at defining the condition rather than teaching the medical background. Lay persons could recognize words in reports that matched words in the condition definitions, but they lacked the training to recognize alternate indications or even understand the vocabulary. More interesting is the fact that the specificities of most of the lay persons were close to those of the physicians. Because the lay persons could understand the English grammar (even though they did not understand the vocabulary), they were not fooled into thinking a condition was present when the report was actually stating that the condition was absent.
Sensitivity matched the level of the physicians for the complex keyword search and was a little lower for the simple keyword search; the difference between them was due to the training of the complex search. Both keyword searches had worse specificities than that of physicians. Because the simple search did not recognize negation, reports that said "new infiltrates not seen" were counted as having possible acute bacterial pneumonia. The complex search achieved better specificity because two thirds of its search phrases looked for the many ways of saying that an indication was absent. Nevertheless, even with 240 search phrases, complex Boolean logic, and training, it did not achieve the specificity of the physicians or the natural language processor. Our study does not show that other automated methods will not succeed but rather that a straightforward approach such as a keyword search is not sufficient.
Our study also indicates that internists and radiologists interpret reports similarly. If internists and radiologists consistently differed, the distance of a physician from a physician of the opposite type would be expected to exceed the distance to a physician of the same type. In fact, these average distances were almost identical. Figure 2 shows the same result. The internists and radiologists are intermixed without separate clusters.
The design of our study is appropriate for natural language processing in several ways. The use of several physicians provides a solid standard against which natural language processing can be judged, and it allows calculation of the variance among the physicians so that the significance of a difference between the processor and the physicians can be interpreted. The statistical methods are designed to accommodate studies in which physicians do not read each report; this will be critical for larger studies. By placing the processor in its intended environment (one in which real, uncorrected reports are used and one that directly feeds an automated decision-support system), one can judge how it will actually perform and whether its coded output is really usable.
One disadvantage of the approach is that if the system does not perform as expected, one must determine whether the natural language processor or the automated decision-support system is responsible. Without a criterion standard for the intermediate coded output, assigning responsibility is subjective. Our study does not replace rigorous measurements of the effect of automated systems on patient outcomes. Instead, it is a necessary step to ascertain that the potential for effect is real and that resources should be put into clinical trials of systems that exploit natural language processors.
Other natural language processors have achieved similar levels of performance. A special-purpose processor designed to detect neoplasms in chest radiograph reports [7] achieved a sensitivity of 98% and a specificity of 88% with respect to one physician's interpretation. A preliminary study of another special-purpose processor [8] resulted in a sensitivity of 87% (specificity was not measured). A general-purpose processor for discharge summaries [6] used to assess asthma indicators achieved a sensitivity of 84% (specificity was not measured). The processor in our study achieved a lower sensitivity (81%) than the others but achieved a higher specificity (98%). The physicians themselves showed sensitivities and specificities that more closely matched the processor in our study than the special-purpose neoplasm processor. Depending on the context, sensitivity or specificity may be more important. Automated alerting systems generally require high specificity so that false-positive alerts that undermine confidence in an automated system are avoided.
The significance of natural language processing lies in the vast quantity of clinical data that remains locked in narrative reports and in the automated tools that could exploit these data if they were coded. Automated alerts and interpretations [1, 2, 13-15, 20] require coded clinical data to do an intelligent analysis of the patient's condition. For example, given chest radiograph data, an automated system can warn a health care provider when a patient with a potential neoplasm has not had follow-up. Health care providers can directly query coded clinical data. In a patient with a long, complex history, a covering physician can ask whether the patient has ever had a pleural effusion; the answer that is based on natural language processing is more likely to be accurate than one based on sifting through volumes of paper records. Departmental quality assurance systems can exploit coded clinical data. For example, radiologic findings can be automatically correlated with subsequent pathology findings to estimate the rates of false-positive and false-negative radiology findings. Natural language processing may meet clinical researchers' need for clinical data, a need that has not been met by administrative data sources [4]. For example, queries to coded data can supply a list of potential study candidates and give estimates of disease prevalence.
The effort required to exploit natural language processing comprises several parts. Building the natural language processor computer program took 1 year (assume one full-time equivalent effort for all estimates); this program is reusable across domains. Assembling the grammar for chest radiographs took 6 months, and coding the vocabulary for the radiographs took 18 months. Adding mammography took only 1 month because the grammar changed little and many words were already present. The estimated time to add abdominal radiographs is 3 months because most of the findings in abdominal radiographs are occasionally found in chest radiographs (some "chest" radiographs cover most of the abdomen), and these have already been coded. Writing the six Medical Logic Modules for this study required less than 1 week. The ability to transfer the grammar and vocabulary to other institutions must still be tested. Extending the processor to handle reports similar to those of radiology (such as pathology) will require similar effort; its performance on reports with much less structure (such as discharge summaries) may be worse.
Conclusion
Appendix
Let n be the number of independent reports to be assessed. Each is rated by the natural language processor (denoted as subject 0) and by some set of J physicians (denoted as subjects j = 1, 2,...J). Let X ij be a C-length vector of scores (x ij1 , ... x ijc ...xijC) assigned to report i by subject j (where C is the number of conditions). Item x ijc of X ij corresponds to the score for condition c, and its value may be 0 (absent) or 1 (present). Because the experiment did not call on every subject to rate every report, many of the X( ij ) were missing at the time of the analysis. We use subscripted n to denote how many reports each subject or combination of subjects has rated. For example, n j denotes the number of reports scored by subject j. Similarly, n jk denotes the number of reports rated by both subject j and subject k, and n jklm denotes the number of reports rated by all of the subjects j, k, l, and m. There might be duplicates in the subscripts, in which case, for example, n jkjm is interpreted as n jkm . In this study, n was 200, J was 12, and C was 6.
Let d ijk be the distance between the rating scores, X ij and X ik , that subjects j and k assign to report i. In this analysis, the number of conditions for which the two subjects differed (for report i) was used as the distance metric:
d ijk =
c |x ijc x ikc |
Other metrics may also be used without affecting the variance calculations that follow. By convention, we set d ijk = 0 if either of subjects j or k did not rate report i. Let \#396; jk be the average distance from subject j to subject k:
\#396; jk =
i \#396; ijk /n jk
Var(\#396; jk ) = Cov(\#396; jk , \#396; jk )
Cov(\#396; jk \#396; lm ) = [
i \#396; ijk \#396; ilm n jklm \#396; jk \#396; lm ]/n jk n lm
For matrix calculations, we can define the column vector d = (d01,..., d J-1,J )t of J(J+1)/2 mean distance scores, and the J(J+1)/2 by J(J+1)/2 convariance matrix V, whose elements are defined above. The estimated variance of any linear combination b t \#396;, where b = (b01,...b J-1,J )t is a vector of coefficients, is Var(b t \#396;) = b t Vb.
The previous theory enabled us to compute estimates and standard errors (and, assuming approximate normality, confidence intervals) for measures of intersubject difference. The following were used:
Average distance of the natural language processor from physicians:
$$\bar{D}_0 = \sum_{k>0} \bar{d}_{0k} / J$$
Average distance of physician $j$ from all other physicians:
$$\bar{D}_j = \sum_{0<k,\; k \neq j} \bar{d}_{jk} / (J-1); \quad j = 1, 2, \ldots, J$$
Overall average interphysician distance:
$$\bar{D} = 2 \sum_{0<k<j} \bar{d}_{jk} / [J(J-1)]$$
Comparison of the processor's mean distance with that of physician $j$:
$$\Delta_j = \bar{D}_0 - \bar{D}_j; \quad j = 1, 2, \ldots, J$$
Comparison of the processor's mean distance with average of all physicians:
$$\Delta = \sum_{j} \Delta_j / J$$
Because these measures are all linear combinations of $\bar{d}$, their standard errors can be calculated from $b^t V b$. Because they represent the averages of many points, they are approximately normally distributed, and confidence intervals can be calculated.
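To make the linear-combination view concrete, the sketch below builds a coefficient vector for each measure and turns an estimate into an approximate 95% confidence interval. It is a minimal illustration, not the authors' code: the function names, the ordering of subject pairs (lexicographic, from pair (0, 1) to pair (J-1, J)), and the multiplier 1.96 for a normal interval are assumptions made here.

```python
import math
from itertools import combinations


def pair_positions(J):
    """Map each subject pair (j, k), 0 <= j < k <= J, to its position in the vector."""
    return {pair: pos for pos, pair in enumerate(combinations(range(J + 1), 2))}


def processor_coefficients(J):
    """Coefficients for the processor's average distance from the J physicians."""
    pos = pair_positions(J)
    b = [0.0] * len(pos)
    for k in range(1, J + 1):
        b[pos[(0, k)]] = 1.0 / J
    return b


def physician_coefficients(J, j):
    """Coefficients for physician j's average distance from the other physicians."""
    pos = pair_positions(J)
    b = [0.0] * len(pos)
    for k in range(1, J + 1):
        if k != j:
            b[pos[(min(j, k), max(j, k))]] = 1.0 / (J - 1)
    return b


def overall_coefficients(J):
    """Coefficients for the overall average interphysician distance."""
    pos = pair_positions(J)
    b = [0.0] * len(pos)
    for j, k in combinations(range(1, J + 1), 2):
        b[pos[(j, k)]] = 2.0 / (J * (J - 1))
    return b


def difference(b1, b2):
    """Coefficients for the difference of two linear combinations."""
    return [u - v for u, v in zip(b1, b2)]


def estimate_with_interval(b, mean_distances, V, z=1.96):
    """Return (estimate, lower, upper) for b^t d-bar, with variance b^t V b."""
    size = len(b)
    estimate = sum(w * x for w, x in zip(b, mean_distances))
    variance = sum(b[r] * V[r][c] * b[c] for r in range(size) for c in range(size))
    half_width = z * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width
```

For example, `difference(processor_coefficients(J), overall_coefficients(J))` gives the coefficients for comparing the processor with the average of all physicians; its entries sum to zero, so the comparison is unaffected by a constant shift in all mean distances.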
Lay persons and the automated methods were analyzed in a similar manner; their results were substituted for those of the processor. Each physician was analyzed by removing his or her results from the group of physicians, letting J = 11, and treating that physician like the processor in the above equations.
References
1. Pestotnik SL, Evans RS, Burke JP, Gardner RM, Classen DC. Therapeutic antibiotic monitoring: surveillance using a computerized expert system. Am J Med. 1990; 88:43-8.
2. Rind DM, Safran C, Phillips RS, Wang Q, Calkins DR, Delbanco TL, et al. Effect of computer-based alerts on the treatment and outcomes of hospitalized patients. Arch Intern Med. 1994; 154:1511-7.
3. Tierney WM, Miller ME, Overhage JM, McDonald CJ. Physician inpatient order writing on microcomputer workstations. Effects on resource utilization. JAMA. 1993; 269:379-83.
4. Jollis JG, Ancukiewicz M, DeLong ER, Pryor DB, Muhlbaier LH, Mark DB. Discordance of databases designed for claims payment versus clinical information systems. Implications for outcomes research. Ann Intern Med. 1993; 119:844-50.
5. McDonald CJ, Tierney WM, Overhage JM, Martin DK, Wilson GA. The Regenstrief Medical Record System: 20 years of experience in hospitals, clinics, and neighborhood health centers. MD Comput. 1992; 9:206-17.
6. Sager N, Lyman M, Tick LJ, Nhan NT, Bucknall CE. Natural language processing of asthma discharge summaries for the monitoring of patient care. In: Safran C, ed. Proceedings of the Seventeenth Annual Symposium on Computer Applications in Medical Care; 1993 Oct 30-Nov 3; Washington, D.C. New York: McGraw-Hill; 1994:265-8.
7. Zingmond D, Lenert LA. Monitoring free-text data using medical language processing. Comput Biomed Res. 1993; 26:467-81.
8. Haug PJ, Ranum DL, Frederick PR. Computerized extraction of coded findings from free-text radiologic reports. Work in progress. Radiology. 1990; 174:543-8.
9. Chinchor N, Hirschman L, Lewis DD. Evaluating message understanding systems: an analysis of the third message understanding conference (MUC-3). Computational Linguistics. 1993; 19:409-47.
10. Vries JK, Marshalek B, D'Abarno JC, Yount RJ, Dunner LL. An automated indexing system utilizing semantic net expansion. Comput Biomed Res. 1992; 25:153-67.
11. Gabrieli ER. Computer-assisted assessment of patient care in the hospital. J Med Syst. 1988; 12:135-46.
12. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association. 1994; 1:161-74.
13. Hripcsak G, Clayton PD, Cimino JJ, Johnson SB, Friedman C. Medical decision support at Columbia-Presbyterian Medical Center. In: Timmers T, Blum BI, eds. Software Engineering in Medical Informatics. Amsterdam: North-Holland; 1991:471-9.
14. Hripcsak G, Ludemann P, Pryor TA, Wigertz OB, Clayton PD. Rationale for the Arden Syntax. Comput Biomed Res. 1994; 27:291-324.
15. McDonald CJ. Action-Oriented Decisions in Ambulatory Medicine. Chicago: Year Book; 1981.
16. Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978; 8:283-98.
17. Dunn G. Design and Analysis of Reliability Studies. New York: Oxford Univ Pr; 1989.
18. Sprent P. Applied Nonparametric Statistical Methods. 2d ed. London: Chapman and Hall; 1993.
19. Yerushalmy J. The statistical assessment of the variability in observer perception and description of roentgenographic pulmonary shadows. Radiol Clin North Am. 1969; 7:381-92.
20. Johnston ME, Langton KB, Haynes RB, Mathieu A. Effects of computer-based clinical decision support systems on clinician performance and patient outcome. A critical appraisal of research. Ann Intern Med. 1994; 120:135-42.