Annals
Established in 1927 by the American College of Physicians

REVIEW

Guidelines for Meta-analyses Evaluating Diagnostic Tests

Les Irwig; Anna N. A. Tosteson; Constantine Gatsonis; Joseph Lau; Graham Colditz; Thomas C. Chalmers; and Frederick Mosteller

15 April 1994 | Volume 120 Issue 8 | Pages 667-676

Objectives: To introduce guidelines for the conduct, reporting, and critical appraisal of meta-analyses evaluating diagnostic tests and to apply these guidelines to recently published meta-analyses of diagnostic tests.

Data Sources: The guidelines are based on current concepts of how to assess diagnostic tests and conduct meta-analyses. They are applied to all meta-analyses evaluating diagnostic tests published in English-language journals from January 1990 through December 1991, identified through MEDLINE searching and by experts in the field.

Study Selection: Meta-analyses were included if at least two of three independent readers regarded their main purpose as the evaluation of diagnostic tests against a concurrent reference standard.

Data Extraction: Three independent readers assessed the extent to which each meta-analysis fulfilled each guideline, with consensus defined as agreement by at least two readers.

Data Synthesis: The guidelines are concerned with determining the objective of the meta-analysis, identifying the relevant literature and extracting the data, estimating diagnostic accuracy, and identifying the extent to which variability is explained by study design characteristics and characteristics of the patients and diagnostic test. In general, the guidelines were only partially fulfilled.

Conclusion: Meta-analysis is potentially important in the assessment of diagnostic tests. Those reading meta-analyses evaluating diagnostic tests should critically appraise them; those doing meta-analyses should apply recently developed methods. The conduct and reporting of primary studies on which meta-analyses are based require improvement.


Clinicians must decide whether to use a diagnostic test in a patient and how to interpret the result [1]. Policy-makers must assess the overall value of a test, compare it to alternatives, and decide whether the test should be made available. Both clinical and policy decisions should be based on a thorough evaluation of the test [2]. A crucial step in evaluation is the assessment of diagnostic accuracy, that is, the ability of the test to determine correctly the presence or absence of the disease of interest. This is done by comparing test results with those obtained using a reference ("gold," criterion, or comparison) standard. Estimates of the diagnostic accuracy of a test may differ among studies. Each study may have included too few patients to give precise estimates or too selected a population to allow general applicability. Therefore, meta-analysis, the critical review and statistical combination of results of previous research [3-6], is potentially useful for assessing diagnostic accuracy. Using meta-analysis, we can 1) provide an overall summary of diagnostic accuracy; 2) determine whether estimates of diagnostic accuracy depend on the study design characteristics (study validity) of the primary studies; 3) determine whether diagnostic accuracy differs in subgroups defined by the characteristics of the patients and test; and 4) identify areas for further research. New hypotheses may be generated, or the attempt to meta-analyze data may highlight deficits that need to be addressed in future primary studies before a useful meta-analysis can be done.

To help researchers do and readers assess meta-analyses of diagnostic accuracy, we suggest guidelines on how they should be conducted, reported, and critically appraised. Our guidelines are based on current concepts of how to assess diagnostic tests and conduct meta-analyses (Table 1). Because other guidelines exist for meta-analysis in general [3-8], we emphasize those issues that are particular to assessing diagnostic accuracy. For each step, the guidelines are used to review 11 journal articles published from January 1990 through December 1991 whose primary purpose was to assess the accuracy of a diagnostic test against a concurrent reference standard using meta-analysis [9-19]. The 11 articles are all the meta-analyses published in this 2-year period that we could identify through various search procedures. The guidelines were applied to each article independently by three of the authors, and the majority view was accepted. (See Appendix 1 for details of search and review procedures.)


Table 1. Steps in Conducting a Meta-analysis of a Diagnostic Test and Summary of Guidelines

 


Determine the Objective and Scope of the Meta-analysis

Because the type of meta-analysis we are assessing compares a diagnostic test against a reference standard for the disease of interest, a clear statement about what diagnostic test is being evaluated and how the reference standard is defined should be provided. Diagnostic tests are frequently used against the background of clinical information that is available before testing. A test may appear useful if only the association between the test and the reference standard is examined. However, it may not be useful if its information is carried by data that are already available to the clinician or are easy to obtain without ordering the test [20]. Test evaluation should examine the incremental or marginal value of the test. For example, consider an evaluation of exercise thallium scans against the reference standard of coronary angiography in patients with symptoms suggestive of coronary artery disease. An electrocardiogram would have been obtained routinely at the time of exercise. Therefore, the relevant question is whether the thallium scan has any incremental value over the information already available from the electrocardiogram. Similarly, a test may have poorer diagnostic accuracy in a tertiary care facility than in a primary care setting because the patient population has already been selected on the results of other tests [21]. Thus, it is imperative that meta-analysts clearly state the clinical circumstances for which they wish to evaluate diagnostic accuracy.

Not all primary studies identified will have used the meta-analyst's ideal reference standard or cover the exact tests or clinical context desired by the meta-analyst. This can be dealt with by using only the subset of papers that do so or by examining the extent to which deviation from the ideal changes the findings. In general, we suggest the latter approach and describe it further when we discuss how to assess the effect of variation in study validity and in the characteristics of patients and test. Often, the comparative value of several tests is being assessed. As when only a single test is being evaluated, this comparison should be viewed against the background of previous information, which should be equivalent for the tests being compared.

Review. The reference standard was stated in 10 and the test of interest in all of the 11 meta-analyses. Two meta-analyses gave a clear statement about the clinical background against which the tests' incremental value was being evaluated. Some authors, for example, Pinson and colleagues [18], stated that this was a problem in the primary studies, pointing to the need for improving their quality. Seven meta-analyses compared tests, whereas 4 evaluated an individual test.


Retrieve the Relevant Literature

Literature searching should be done using various methods, including searching computerized bibliographic systems, consulting content experts, searching appropriate journals, and identifying further studies and search strategies using the reference lists of studies already obtained [3, 8, 20]. The final MEDLINE search strategy should be reported in sufficient detail to allow others to repeat it if they wish to update the meta-analysis after publication.

The relevance of all papers, except those clearly outside the scope of the meta-analysis, should be judged by applying inclusion and exclusion criteria. This could be accomplished by using the same methods outlined in the next section for the extraction of data. Reporting the reason for exclusion of potentially eligible papers helps readers understand how the criteria were applied.

Publication bias may arise because of the selection for publication of papers with more extreme results because they are "more interesting" or "statistically significant" [22, 23]. Methods for dealing with publication bias have been developed [24, 25], but their applicability to diagnostic test assessment has not been explored.

Review. Literature retrieval methods were described in 7 of the 11 meta-analyses, all of which used MEDLINE searching. Three articles gave their search terms, and 2 of these explained how these were linked [10, 14]. Criteria for including or excluding papers were given in 8 articles. Four articles gave information about excluded studies. Publication bias was discussed in 3 of the 11 articles, although no estimates could be made of its likely effect.


Extract and Display the Data

Assessing study relevance and extracting data require the use of judgment. Various procedures have been suggested to reduce the possibility of obtaining biased estimates of diagnostic accuracy, including independent review by two readers and resolution of differences by a third reader or by discussion between the original two readers [3, 8]. Sometimes, readers are blinded to details of authorship and study results [8]. Although blinding is theoretically sound, no evidence exists that it decreases bias. Until such evidence is obtained, the decision on whether to use blinding depends on the resources available. Whatever the decision, authors should state how the judgments were made.

Publishing a full list of diagnostic accuracy and study characteristics (for example, design features, patient characteristics) for each primary study allows other researchers to decide if they agree with the judgments made and enables re-analysis applying different analytic techniques, using a subset of studies, or adding studies published after the meta-analysis was done [26].

Review. Two meta-analyses stated that each of the primary studies was assessed by two or more readers [13, 19]. One of the articles [13] also mentioned that disagreements were settled by consensus or a third party. Gianrossi and colleagues [11] offer an example of extensive display of data about diagnostic accuracy and study characteristics.


Estimate Diagnostic Accuracy

Before discussing the estimation of diagnostic accuracy in meta-analyses, some key concepts of test assessment need to be addressed [20, 27, 28]. Most measures of diagnostic accuracy are based on the comparison of the test with a reference standard that determines the presence or absence of the disease of interest. An ideal diagnostic test discriminates between diseased and nondiseased individuals without error. Test error can be characterized by various measures. The two most commonly reported measures are sensitivity, the probability that a test result is positive in patients with the disease of interest, and specificity, the probability that a test result is negative in patients without the disease of interest.
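The two definitions above can be made concrete with a small calculation. The following is a minimal sketch only; the function name and the cell counts are hypothetical and are not taken from any of the studies reviewed here.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table.

    tp, fn: diseased patients (by the reference standard) who test
            positive and negative, respectively.
    fp, tn: nondiseased patients who test positive and negative.
    """
    sensitivity = tp / (tp + fn)  # P(test positive | disease present)
    specificity = tn / (tn + fp)  # P(test negative | disease absent)
    return sensitivity, specificity


# Hypothetical counts, chosen only for illustration.
sens, spec = sensitivity_specificity(tp=80, fn=20, fp=10, tn=90)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Both quantities are conditional on the reference-standard result, which is why neither depends on the prevalence of disease in the sample.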

These measures rely on a single threshold (cut-point or positivity criterion) for classifying a test result as positive. Changing the threshold to increase sensitivity decreases specificity and vice versa. This trade-off between sensitivity and specificity makes it imperative that they be considered jointly. When studies use different criteria to define positive and negative test results, as in our example, they differ in their explicit threshold. Even when studies use the same explicit threshold, their implicit thresholds may differ, especially if interpretation of the test requires judgment. For example, radiologists may agree to use the same words to describe imaging test results but still differ in what they regard as the boundary between "abnormal" and "probably abnormal."

An alternative to reporting a single pair of sensitivity and specificity estimates is to report a range of pairs, which is obtained as the threshold criterion is varied. Such a range of pairs is often reported as a receiver operating characteristic (ROC) curve, which is a graph of the sensitivity (true-positive rate) on the vertical axis against the false-positive rate (1 – specificity) on the horizontal axis [29]. One overall measure of the test's accuracy is the area under the ROC curve, where a value of 0.5 is obtained if the test does no better than chance and a value of 1 is obtained if the test is perfect [29, 30].
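To illustrate the area under the ROC curve, the sketch below joins the observed (false-positive rate, true-positive rate) pairs with straight lines and sums the trapezoids beneath them. This trapezoidal rule is one simple choice among several and is an assumption of the sketch, not necessarily the estimator used in references 29 and 30; the function name and the example points are hypothetical.

```python
def roc_auc(points):
    """Trapezoidal area under an empirical ROC curve.

    points: iterable of (false_positive_rate, true_positive_rate) pairs,
    one pair per threshold. The corners (0, 0) and (1, 1) are added
    automatically so that the curve spans the whole horizontal axis.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # area of one trapezoid
    return area


chance = roc_auc([(0.5, 0.5)])    # a point on the diagonal: no better than chance
perfect = roc_auc([(0.0, 1.0)])   # sensitivity 1 with no false positives
example = roc_auc([(0.1, 0.6), (0.3, 0.9)])  # hypothetical two-threshold test
print(chance, perfect, round(example, 3))
```

The two limiting cases reproduce the values quoted above: 0.5 for a test on the chance diagonal and 1 for a perfect test.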

Another measure of test performance is the likelihood ratio, defined as the ratio of the probability of a particular test result in people with disease to the probability of the same test result in people without disease. A result-specific likelihood ratio can be estimated at each possible value of a multi-category or continuous test, thus avoiding the need to decide on a single threshold for dichotomizing a test as positive or negative. More importantly, it avoids the loss of information that the dichotomy causes, whereby a test result just above the threshold is not differentiated from a test result well above the threshold [20]. Likelihood ratios can be used to compute the post-test odds of disease for a patient with known or estimated pretest odds by using a version of Bayes' theorem:

Post-test odds of disease = (pretest odds of disease) × (likelihood ratio)
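The calculation can be sketched as follows for a test reported in three ordered categories. The counts, the pretest probability, and all function names are hypothetical. The likelihood ratios are computed here as simple ratios of proportions, which is an assumption of the sketch rather than the modeling approach of any cited study.

```python
def probability_to_odds(p):
    return p / (1.0 - p)


def odds_to_probability(odds):
    return odds / (1.0 + odds)


def result_specific_likelihood_ratios(diseased_counts, nondiseased_counts):
    """Likelihood ratio for each test-result category.

    Each list holds the number of patients per category, in the same
    category order. A category with no nondiseased patients would give a
    division by zero; that case is not handled in this sketch.
    """
    total_d = sum(diseased_counts)
    total_n = sum(nondiseased_counts)
    return [(d / total_d) / (n / total_n)
            for d, n in zip(diseased_counts, nondiseased_counts)]


def post_test_odds(pretest_odds, likelihood_ratio):
    # Odds form of Bayes' theorem.
    return pretest_odds * likelihood_ratio


# Hypothetical counts for the categories "low", "intermediate", "high".
diseased = [10, 30, 60]
nondiseased = [70, 20, 10]
lrs = result_specific_likelihood_ratios(diseased, nondiseased)

pretest = probability_to_odds(0.20)  # assumed pretest probability of 20%
for label, lr in zip(["low", "intermediate", "high"], lrs):
    post = post_test_odds(pretest, lr)
    print(f"{label}: LR = {lr:.2f}, post-test probability = "
          f"{odds_to_probability(post):.2f}")
```

Working in odds keeps the update a single multiplication; converting back to a probability is needed only for presentation.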

Methods for obtaining a summary estimate of diagnostic accuracy in a meta-analysis are now described, first for when test results are available only as a dichotomy and then for when test results are available in more than two categories.


Test Results Are Available Only as a Dichotomy

If primary studies provide only sufficient information to estimate sensitivity and specificity, the mean sensitivity and the mean specificity can be estimated, possibly weighted in some way for the sample size of each study. However, this technique is inappropriate because it is likely that different studies use different explicit or implicit thresholds, so that a primary study with a high sensitivity may have a low specificity and vice versa. The problem of estimating mean values separately for sensitivity and specificity is shown by considering the following hypothetical data: Three studies of equivalent size have sensitivity and specificity rates of 100% and 0%, 99% and 99%, and 0% and 100%, respectively. The mean sensitivity is 66% and the mean specificity, 66%. Yet, if the true-positive rate (sensitivity) is plotted against the false-positive rate (1 – specificity), that is, using the axes that are also used for an ROC curve, it is evident that diagnostic accuracy is high with the area under the ROC curve close to 1. In general, estimating mean sensitivity and specificity separately underestimates test accuracy. Therefore, a reasonable first approach in a meta-analysis is to plot a scattergram of the true-positive rate against the false-positive rate. The scattergram allows a visual impression of the variability in both measures and how they are related. Recently, techniques to fit a model to such data have been described by Kardaun and Kardaun [15] and by Moses and colleagues [31]. This method is referred to here as a summary receiver operating characteristic (SROC) curve. An example of a scatterplot and a superimposed SROC curve is shown in Figure 1 and its derivation is detailed in Appendix 2.
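As a rough sketch of how an SROC curve might be fitted, the code below uses one commonly described formulation: for each study, the difference D and the sum S of the logits of the true-positive and false-positive rates are computed, and D is regressed on S by unweighted least squares. Because Appendix 2 is not reproduced here, this should be read as an illustration under those assumptions and not as the exact procedure used for Figure 1. The study counts, the function names, and the 0.5 continuity correction are all assumptions of the sketch.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def fit_sroc(studies):
    """Fit D = a + b*S by unweighted least squares.

    studies: list of (tp, fn, fp, tn) counts, one tuple per primary study.
    D = logit(TPR) - logit(FPR) is the log odds ratio for the study;
    S = logit(TPR) + logit(FPR) is a proxy for its test threshold.
    """
    d_vals, s_vals = [], []
    for tp, fn, fp, tn in studies:
        # Assumption: add 0.5 to each cell so the logit stays finite
        # when a study reports a rate of 0% or 100%.
        tpr = (tp + 0.5) / (tp + fn + 1.0)
        fpr = (fp + 0.5) / (fp + tn + 1.0)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(studies)
    mean_s = sum(s_vals) / n
    mean_d = sum(d_vals) / n
    sxx = sum((s - mean_s) ** 2 for s in s_vals)
    sxy = sum((s - mean_s) * (d - mean_d) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    a = mean_d - b * mean_s
    return a, b


def sroc_tpr(fpr, a, b):
    """True-positive rate on the fitted summary curve at a given FPR.

    Solving D = a + b*S for logit(TPR) gives
    logit(TPR) = (a + (1 + b) * logit(FPR)) / (1 - b).
    """
    logit_tpr = (a + (1.0 + b) * logit(fpr)) / (1.0 - b)
    return 1.0 / (1.0 + math.exp(-logit_tpr))


# Hypothetical primary studies given as (tp, fn, fp, tn).
studies = [(90, 10, 40, 60), (80, 20, 20, 80), (60, 40, 5, 95), (70, 30, 10, 90)]
a, b = fit_sroc(studies)
for fpr in (0.05, 0.10, 0.20, 0.40):
    print(f"FPR = {fpr:.2f} -> summary TPR = {sroc_tpr(fpr, a, b):.2f}")
```

Unlike separate averages of sensitivity and specificity, the fitted curve treats each study as one point on a shared trade-off, so studies that differ only in threshold are not penalized.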



Figure 1. Plot of true-positive rate on false-positive rate for thallium scintigrams to detect angiographic coronary artery disease. Studies using computerized or semi-computerized reading techniques are shown as open circles and those using visual techniques as solid squares.

 

Test Results Are Available in More than Two Categories

If test results are measured as a continuum (for example, biochemical values such as creatine kinase) or as responses on an ordinal categorical scale (for example, degrees of suspicion of abnormality in the interpretation of radiologic images), then several other analytic techniques can be used. If no threshold or scaling differences between primary studies exist and test comparison is not an objective, then result-specific likelihood ratios can be obtained from logistic modeling procedures [32, 33].

If scale or threshold differences between primary studies are likely, as is probable for tests involving judgment, then likelihood ratios cannot be obtained on pooled data. One approach is to dichotomize the test result for each primary study and to use SROC methods as described above. Better use is made of the data if an ROC curve is constructed for each primary study using several thresholds, and an overall ROC curve is derived using ordinal regression techniques that have been applied previously to the diagnostic setting as a means of controlling for covariates [34]. Summary measures such as the area under the ROC curve can be obtained for the entire curve [35] or a clinically relevant range [36-38]. Difficulties with this approach may arise when studies have different numbers of cut-off values.

Review. All 11 meta-analyses were based on estimates of sensitivity and specificity from the primary studies. Seven articles gave mean estimates separately for sensitivity and specificity and did not examine their interdependence. Two studies estimated SROC curves [14, 15], one of which also estimated an odds ratio as a measure of diagnostic accuracy [15]. Two papers did not give a summary measure of diagnostic accuracy. Hoffman and colleagues [13] plotted sensitivity against the false-positive rate and decided not to pool results because of heterogeneity among the estimates of diagnostic accuracy from the primary studies. One meta-analysis gave a statistical test of association without any summary measure of diagnostic accuracy [19], whereas another based part of its analysis on a prevalence-dependent measure, the proportion of individuals overall correctly classified by the test [10].

Ordinal regression techniques have not yet been used to pool ROC data from primary studies. Data from primary studies that are amenable to the ordinal regression approach are uncommon. One meta-analysis did, however, note that some primary studies had test data at more than one threshold [15].

An example of the use of continuous data is the meta-analysis by Guyatt and colleagues [39], who obtained individual data points from 55 publications on the value of serum ferritin levels in the diagnosis of iron-deficiency anemia. Logistic regression was used on the pooled data to obtain continuous graphs of the likelihood ratio by serum ferritin result.


Assess the Effect of Variation in Study Validity on Estimates of Diagnostic Accuracy

Ideally, the estimate of diagnostic accuracy should be based on studies of the highest scientific validity, that is, studies that are most likely to be free of bias [20]. Before we consider guidelines on how to deal with study validity in meta-analyses of diagnostic tests, we describe the most important potential sources of bias in assessing diagnostic accuracy [40].


Appropriate Reference Standard

The reference standard must be clearly defined and should be the best available method of assessing the presence or absence of the disease of interest.


Independence of Observations

Test results may be biased if the test requires judgment and this judgment is made by someone who has knowledge of the result of the reference standard. In general, diagnostic accuracy would be expected to be overestimated. Therefore, those involved in assessing test results should be blind to the result of the reference standard. Likewise, assessors of the reference standard should be blind to the test result.


Verification Bias

Diagnostic accuracy should be assessed in consecutive patients who present with the clinical problem of interest. Verification bias may occur when the reference standard has been assessed on patients sampled differentially in the categories of test results [41, 42]. For example, the hypothetical data based on consecutive patients in Table 2 show a sensitivity of 80% and a specificity of 90%. If establishing the reference standard required invasive procedures, patients who test negative may be less likely to have the reference test. Suppose that only a random sample of 10% of the test-negatives have the reference test. In that case (ignoring sampling variability), the estimated sensitivity would be 98% and the specificity, 47% (Table 3).


Table 2. Hypothetical Comparison of a Test and a Reference Standard on Consecutive Patients

Table 3. Verification Bias: Cross-classification Based on Table 2 in Which Only a 10% Random Sample of Test-Negatives Has Been Verified by the Reference Standard*

The investigators can adjust for the bias if the sampling was done randomly and with known proportions within strata defined by test results. For example, Table 2 can be reformulated from Table 3 if it is known that Table 3 includes only a 10% random sample of test-negatives. Methods also exist for the estimation of confidence intervals for the corrected sensitivity and specificity [41]. Usually the situation is more complex, with other clinical information, such as age or other symptoms, influencing the selection of patients who are assessed by the reference standard. Again, the bias can be adjusted for as long as sampling has been random within the categories defined by test result and known clinical information. However, the choice of patients for verification by the reference standard is commonly not random. In this case, estimates of diagnostic accuracy are biased in unpredictable ways and no adjustment is possible.
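The arithmetic behind the verification-bias example, and the adjustment for a known random sampling fraction, can be sketched as follows. The cell counts are hypothetical and were chosen only to give a sensitivity of 80% and a specificity of 90%; they are not the cell values of Table 2, which are not reproduced here. The function names are likewise illustrative.

```python
def accuracy(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical consecutive series: sensitivity 80%, specificity 90%.
full_sens, full_spec = accuracy(tp=80, fn=20, fp=10, tn=90)

# Only a 10% random sample of test-negatives is verified, so the
# test-negative cells (fn, tn) shrink while the test-positive cells
# (tp, fp) are unchanged. Sampling variability is ignored.
verified_fraction = 0.10
observed = {"tp": 80, "fn": 2, "fp": 10, "tn": 9}
biased_sens, biased_spec = accuracy(**observed)


def correct_for_sampling(table, fraction):
    """Scale the test-negative cells back up by the known sampling fraction."""
    corrected = dict(table)
    corrected["fn"] = table["fn"] / fraction
    corrected["tn"] = table["tn"] / fraction
    return corrected


corrected_sens, corrected_spec = accuracy(
    **correct_for_sampling(observed, verified_fraction))

print(f"all verified:     {full_sens:.0%} / {full_spec:.0%}")
print(f"partial (biased): {biased_sens:.0%} / {biased_spec:.0%}")
print(f"corrected:        {corrected_sens:.0%} / {corrected_spec:.0%}")
```

The biased estimates round to the 98% and 47% quoted in the text, and the correction recovers the original values. The correction is valid only because the sampling was random with a known fraction; it does not apply when selection for verification is not random.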


Issues in Comparative Meta-analyses

A comparison of the relative accuracy of several diagnostic tests should ideally be based on applying all the tests to each of the patients or randomly assigning tests to patients in each primary study. Obtaining diagnostic accuracy information for different tests from different primary studies is a weak design; differences in diagnostic accuracy may reflect differences in study characteristics that are unlikely to be adequately controlled by attempting to adjust for them in the meta-analysis.

We next address how to incorporate variation in study validity into the meta-analysis. One approach is to exclude studies that do not meet standards for scientific validity [20, 43]. This method avoids bias at the expense of decreasing precision, that is, widening the confidence intervals. For meta-analyses of clinical trials, empiric evidence supports the theory that inclusion of nonrandomized trials can bias the results [44, 45], although little empiric evidence supports the importance of other deviations from the theoretically ideal design [46]. For meta-analyses evaluating diagnostic tests, little empiric evidence has been accumulated to determine the practical importance of the potential sources of bias discussed above. Moreover, primary studies often do not report sufficient data for judging the potential for bias. We suggest that the initial analysis should assess the effect of reported study design flaws on estimates of diagnostic accuracy. This goal can be accomplished by doing the meta-analysis separately for studies with and without a particular design flaw or by including the presence or absence of the flaw in regression models as outlined in Appendix 3. Primary studies with a particular design flaw may give a different SROC curve from primary studies without that flaw. In that case, reliance should be placed only on those studies that do not have the flaw. Alternatively, the design flaw may not give a different SROC curve, in which case all studies can be used in the meta-analysis. Because different design flaws are likely to cause different biases, we suggest assessing the effect of each separately rather than summarizing them into an overall "validity score."
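A much simplified sketch of the stratified approach is given below. It summarizes each study by its log diagnostic odds ratio, a single measure that combines sensitivity and specificity, and compares the unweighted mean in studies with and without a given design flaw. This is not the regression of Appendix 3, which is not reproduced here. It further assumes that the odds ratio does not change with threshold, and the study counts, the flaw indicator, and the function names are hypothetical.

```python
import math


def log_diagnostic_odds_ratio(tp, fn, fp, tn):
    # Assumption: add 0.5 to each cell so that empty cells do not
    # produce a zero or infinite odds ratio.
    return math.log(((tp + 0.5) * (tn + 0.5)) / ((fn + 0.5) * (fp + 0.5)))


def mean_log_dor_by_flaw(studies):
    """Unweighted mean log odds ratio, split by presence of a design flaw.

    studies: list of (tp, fn, fp, tn, has_flaw) tuples.
    Returns a dict keyed by True/False for strata that contain studies.
    """
    groups = {True: [], False: []}
    for tp, fn, fp, tn, has_flaw in studies:
        groups[has_flaw].append(log_diagnostic_odds_ratio(tp, fn, fp, tn))
    return {flag: sum(vals) / len(vals) for flag, vals in groups.items() if vals}


# Hypothetical studies; has_flaw might mark, for example, unblinded reading.
studies = [
    (80, 20, 20, 80, False),
    (70, 30, 30, 70, False),
    (95, 5, 10, 90, True),
    (90, 10, 5, 95, True),
]
summary = mean_log_dor_by_flaw(studies)
for flag, value in summary.items():
    print(f"design flaw present = {flag}: mean log odds ratio = {value:.2f}")
```

Because the comparison uses a joint measure, a difference between strata points to a difference in accuracy and not merely to a shift in threshold, which is the distinction that separate analyses of sensitivity and specificity cannot make.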

Review. Six of the 11 meta-analyses discussed variability in the choice of reference standard between the primary studies. Three articles examined how this variability affected diagnostic accuracy. Seven of the 11 meta-analyses mentioned other study design characteristics. Pinson and colleagues [18] provide a good example of their assessment. One meta-analysis excluded studies because they were prone to verification bias [15]. Five meta-analyses showed data on the variability of study design characteristics between studies, and 5 did analyses to determine how they predicted diagnostic accuracy. All but 1 of these analyses explored the relation between study design characteristics separately for sensitivity and specificity. Therefore, they are unable to assess whether the study design characteristic altered diagnostic accuracy rather than just reflecting differences in the threshold for test positivity between the primary studies.

Of the seven comparative diagnostic test evaluations, two were based on the tests both being applied to the same patients within each primary study [9, 14]. The remaining five meta-analyses used the much weaker design of obtaining estimates of diagnostic accuracy for the different tests at least in part from different studies.

Assess the Effects of Variation in the Characteristics of the Patients and Test on Estimates of Diagnostic Accuracy (Generalizability)

Valid estimates of diagnostic accuracy may not be generalizable (applicable) to the setting in which the reader works. Readers of a meta-analysis will want to know if they can apply the meta-analyzed estimate of diagnostic accuracy to the clinical or policy decision they confront. Although evidence exists for at least one condition that the combination of multiple tests is generalizable between settings [47], there is still reason for concern about the applicability of diagnostic accuracy assessments from a meta-analysis to other medical settings [21, 48]. Readers of meta-analyses may decide that the summary estimate of diagnostic accuracy is applicable to their decision making for any of the following reasons: 1) Characteristics of the patients and test are similar in the meta-analysis and in their target population; 2) characteristics are not associated with diagnostic accuracy; or 3) a particular characteristic (for example, sex) affects diagnostic accuracy and estimates are provided separately for groups defined by this characteristic. The reader can then apply them separately to each group.

Relevant characteristics depend on the topic and will be limited by reporting in the primary studies. The major patient characteristics are concerned with the clinical spectrum under consideration. For example, a test may be very accurate at differentiating patients with advanced cancer from persons in perfect health but much less accurate at differentiating patients with early cancers from those whose symptoms are caused by a range of other diseases [49, 50]. This example illustrates two factors that can influence the estimate of test accuracy: the extent of cancer in the "diseased" group and the occurrence of other medical conditions in the "nondiseased" group. The implication of this phenomenon is that measures of diagnostic accuracy are generalizable only to settings that have a similar spectrum of patients, defined by the type and extent of disease in patients with the disease of interest and the type and extent of differential diagnoses in the controls. This spectrum is likely to vary in different practice settings [20, 47]. Other commonly included patient characteristics are age, sex, presenting complaints, comorbid conditions, and the findings of other diagnostic tests that have been done.

The technical details of tests may also vary from one setting to another and limit generalizability. Variation among studies may be due to different diagnostic accuracy of different test methods used in the primary studies. On the other hand, different test methods may simply vary in their threshold. For example, the diagnostic accuracy of computerized techniques of reading thallium scintigrams for the diagnosis of coronary artery disease can be shown to be no better than visual reading. However, the threshold for computerized reading is at a level that results in a higher sensitivity at a lower specificity (see Figure 1 and Appendix 3).

Review. Nine of the 11 meta-analyses mentioned variability in at least one patient characteristic among the primary studies. Five articles gave information about the distribution of characteristics, and 7 included them in analysis as predictors of variability of diagnostic accuracy. The most commonly considered patient characteristic was the type and extent of disease in patients with the disease of interest. Seven meta-analyses discussed variability in the test, of which 4 examined how this variability affected diagnostic accuracy. Four meta-analyses examined how publication year affected diagnostic accuracy. As discussed in the review of how meta-analyses dealt with study validity, studies generally examined the effect of characteristics on sensitivity and specificity separately and therefore did not assess whether characteristics affected diagnostic accuracy rather than just causing a shift in threshold.


Conclusion

Evaluating and summarizing diagnostic test accuracy from literature articles is a complex task with many methodologic pitfalls and biases. We have outlined steps that improve the validity of meta-analyses of diagnostic tests. Attention to these guidelines can greatly improve our ability to synthesize information in the growing literature on diagnostic test evaluation. Meta-analysis can potentially provide a better understanding by examining the variability in estimates of diagnostic accuracy from primary studies, exploring whether this variability is explained by differences in study validity, and determining whether diagnostic accuracy based on the most valid studies varies for different clinical subgroups or categories of patients. As the conduct and reporting of primary studies improve, meta-analysis will also become a more valuable technique.


Appendix 1. Procedure for Identifying and Evaluating Meta-analyses

Literature Retrieval Methods

Meta-analyses published between January 1990 and December 1991 were identified through searching MEDLINE, by consulting experts in the field, and examining bibliographies of papers already retrieved. MEDLINE searching was done independently by two groups of authors. We then examined the index terms used in MEDLINE for all relevant papers obtained by any means and devised a final search strategy, which was as follows:

(explode diagnosis OR any of the following subheadings: diagnosis, radionuclide imaging, ultrasonography OR explode "sensitivity and specificity") AND (Meta-analysis OR the text words "meta" and any word starting with "analy").

Additional searches linking terms for diagnosis to "overview" and words starting with "pool" as text words had a negligible yield and were not pursued.


Review Procedure

The abstracts of all articles were reviewed by one of the authors to decide if the article was possibly a meta-analysis evaluating a diagnostic test against a concurrent reference standard. All papers considered possibly eligible were reviewed independently by two of the authors, who assessed whether the paper was concerned with the association between a test and a concurrent reference standard and whether there was an attempt to combine data from primary studies in a quantitative way. A study was considered eligible if both reviewers answered yes to both questions. Disagreements were resolved through independent review by a third reader and by accepting the majority opinion. Note that our first criterion requires that the reference standard be concurrent. Therefore, meta-analyses that require follow-up to establish outcome (for example, reference 51) were excluded.

All eligible papers were then reviewed independently by three of the authors to assess whether meta-analysis evaluating a diagnostic test was the main purpose of the paper or whether it was secondary (for example, to the reporting of new study results) and whether the meta-analysis addressed each of the issues outlined in our guidelines.

For each item assessed, the majority view was taken as the final response. Responses about whether our guidelines were addressed are given only for those 11 papers in which meta-analysis was the main purpose. Those papers in which meta-analysis was a secondary purpose (for example, done in the discussion section of the report of new research findings, such as [52, 53]) provided negligible information about how or whether the issues in our guidelines were addressed.


Appendix 2. Methods for Estimating Summary Receiver Operating Characteristic Curves

The analytic method for SROC curves is based on the principle that the ROC curve is conveniently represented as a straight line when logit TPR is plotted against logit FPR [31], where TPR = true-positive rate or sensitivity, FPR = false-positive rate or (1 – specificity), logit TPR = log (TPR/[1 – TPR]), and logit FPR = log (FPR/[1 – FPR]).

For statistical reasons, it is advisable to model logit TPR – logit FPR as a linear function of logit TPR + logit FPR. Thus, to estimate an SROC curve, we use the following model:

D = a + bS

where D = logit TPR – logit FPR, S = logit TPR + logit FPR, a = intercept term, and b = regression coefficient for S.
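As a minimal sketch of these definitions, the following Python computes D and S for a single primary study from its true-positive and false-positive rates. The function names and the example rates are hypothetical and chosen only for illustration; natural logarithms are assumed.

```python
import math


def logit(p):
    """Natural-log odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sroc_coordinates(tpr, fpr):
    """Return (D, S) for one primary study.

    D = logit TPR - logit FPR
    S = logit TPR + logit FPR
    """
    d = logit(tpr) - logit(fpr)
    s = logit(tpr) + logit(fpr)
    return d, s


# Hypothetical study: sensitivity 0.80 and specificity 0.80 (so FPR = 0.20).
d, s = sroc_coordinates(0.80, 0.20)
print(d, s)
```

Because sensitivity equals specificity in this hypothetical study, S comes out at zero (up to rounding error) and D equals log 16.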

This model can be fit using conventional least-squares methods, either unweighted or weighted by the inverse of the variance of (logit TPR – logit FPR), within available statistical packages. Robust techniques can also be used [31]. Regression lines should be drawn only over the range of the data. The final model can be converted back to the conventional ROC axes of TPR against FPR.
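The fitting step might be sketched as follows, assuming NumPy is available. This is an illustration and not the authors' own software: the function names `fit_sroc` and `sroc_curve`, the optional `weights` argument, and the five example studies are all hypothetical. The curve is evaluated only between the smallest and largest observed FPR, in keeping with the advice to draw the line only over the range of the data.

```python
import numpy as np


def fit_sroc(tpr, fpr, weights=None):
    """Least-squares fit of D = a + b*S; returns (a, b).

    weights, if given, are per-study weights (for example, inverse variances of D).
    """
    tpr = np.asarray(tpr, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    logit_tpr = np.log(tpr / (1.0 - tpr))
    logit_fpr = np.log(fpr / (1.0 - fpr))
    d = logit_tpr - logit_fpr
    s = logit_tpr + logit_fpr
    design = np.column_stack([np.ones_like(s), s])
    if weights is not None:
        # Weighted least squares: scale each row by the square root of its weight.
        root_w = np.sqrt(np.asarray(weights, dtype=float))
        design = design * root_w[:, None]
        d = d * root_w
    (a, b), *_ = np.linalg.lstsq(design, d, rcond=None)
    return a, b


def sroc_curve(a, b, fpr_grid):
    """Convert the fitted line back to TPR as a function of FPR (requires b != 1)."""
    fpr_grid = np.asarray(fpr_grid, dtype=float)
    logit_fpr = np.log(fpr_grid / (1.0 - fpr_grid))
    logit_tpr = (a + (1.0 + b) * logit_fpr) / (1.0 - b)
    return 1.0 / (1.0 + np.exp(-logit_tpr))


# Hypothetical rates from five primary studies.
tpr = [0.70, 0.80, 0.85, 0.90, 0.95]
fpr = [0.05, 0.10, 0.15, 0.25, 0.40]
a, b = fit_sroc(tpr, fpr)

# Draw the curve only over the observed range of FPR.
grid = np.linspace(min(fpr), max(fpr), 50)
curve = sroc_curve(a, b, grid)
```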

The quantities D and S in the SROC formulation have convenient interpretations. D is easily shown to be the log odds ratio, which is a common measure of association in epidemiologic studies. Here, the odds ratio represents the odds of a positive test result among diseased persons relative to the odds of a positive test result among nondiseased persons. S is a measure of the threshold for classifying a test as positive, which has a value of 0 when sensitivity equals specificity. It becomes positive when a threshold is used that increases sensitivity (and decreases specificity) and becomes negative when a threshold is used that decreases sensitivity (and increases specificity). The intercept of the model (a) is therefore a log odds ratio (the value of D when S = 0), and the regression coefficient (b) examines the extent to which the log odds ratio depends on the threshold used. If the regression coefficient is near zero and not statistically significant, test accuracy for each primary study can be summarized as the odds ratio and these odds ratios can be combined using various techniques [54, 55].
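The two interpretations can be checked directly from the definitions of the logits. The short derivation below uses only symbols defined in this appendix, with OR denoting the odds ratio.

```latex
\begin{aligned}
D &= \operatorname{logit}\mathrm{TPR} - \operatorname{logit}\mathrm{FPR}
   = \log\frac{\mathrm{TPR}}{1-\mathrm{TPR}} - \log\frac{\mathrm{FPR}}{1-\mathrm{FPR}}
   = \log\frac{\mathrm{TPR}/(1-\mathrm{TPR})}{\mathrm{FPR}/(1-\mathrm{FPR})}
   = \log \mathrm{OR}, \\[4pt]
\text{sensitivity} = \text{specificity}
  &\;\Longrightarrow\; \mathrm{TPR} = 1-\mathrm{FPR}
   \;\Longrightarrow\; \operatorname{logit}\mathrm{TPR}
   = \log\frac{1-\mathrm{FPR}}{\mathrm{FPR}}
   = -\operatorname{logit}\mathrm{FPR}
   \;\Longrightarrow\; S = 0 .
\end{aligned}
```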

The SROC method deals with the problem of different thresholds among studies and is useful for comparing the overall diagnostic accuracy of different tests or the extent to which accuracy depends on study characteristics. However, it does not directly provide an exclusive estimate of sensitivity and specificity. To do so requires fixing a value for either sensitivity or specificity and reading the corresponding value for the other off the SROC curve. The fixed value could be the median or mean of those found on meta-analysis or based on local experience.
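Reading a value off the SROC curve can also be done in closed form by rearranging the fitted model D = a + bS. The following sketch of that algebra assumes that b is not equal to 1 and uses only the symbols defined in this appendix.

```latex
\begin{aligned}
\operatorname{logit}\mathrm{TPR} - \operatorname{logit}\mathrm{FPR}
  &= a + b\,(\operatorname{logit}\mathrm{TPR} + \operatorname{logit}\mathrm{FPR}) \\
(1-b)\,\operatorname{logit}\mathrm{TPR}
  &= a + (1+b)\,\operatorname{logit}\mathrm{FPR} \\
\operatorname{logit}\mathrm{TPR}
  &= \frac{a + (1+b)\,\operatorname{logit}\mathrm{FPR}}{1-b} \\
\mathrm{TPR}
  &= \frac{1}{1 + \exp\!\left(-\dfrac{a + (1+b)\,\operatorname{logit}\mathrm{FPR}}{1-b}\right)}
\end{aligned}
```

Fixing specificity fixes FPR = 1 – specificity, and the last line then gives the corresponding sensitivity.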

To illustrate the SROC approach, we use data from a meta-analysis of over 50 primary studies on the exercise thallium scintigram as a test for angiographic coronary artery disease [56]. Although this paper was published before the years of our formal review, it is used because it gives extensive tabulation of the data from each primary study. This meta-analysis examined sensitivity and specificity separately. The range of sensitivity among the primary studies was from 0.63 to 0.98 (mean, 0.840) and the range of specificity from 0.43 to 1.00 (mean, 0.844). The regression equation for the plot of D on S was obtained after adding 0.5 to the numerator and 1 to the denominator of both the TPR and FPR for each study so that any zero cells did not result in undefined transformations [31]. The intercept of the unweighted model is 3.631 (95% CI, 3.354 to 3.907) and the regression coefficient for S is –0.294 (CI, –0.503 to –0.085). The regression coefficient differs significantly from zero, showing that the odds ratio for the association between test and reference standard is dependent on the threshold used. The plot of TPR on FPR is shown (Figure 1). At a mean specificity of 0.844, the sensitivity is estimated at 0.868, which is only slightly higher than the value that was reported when the sensitivity and specificity were examined separately. The difference is small because the correlation between logit TPR and logit FPR is modest in this example (Pearson r = 0.19, P = 0.16).
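The arithmetic in this example can be checked with a short Python sketch. It assumes the unweighted coefficients quoted above (intercept 3.631, coefficient for S of –0.294) and natural logarithms. The function names are hypothetical, and the study with no false-positive results among 20 nondiseased persons is invented purely to show the zero-cell correction.

```python
import math


def corrected_rate(events, total):
    """Proportion with 0.5 added to the numerator and 1 to the denominator,
    so that a zero cell still gives a finite logit."""
    return (events + 0.5) / (total + 1.0)


def sensitivity_at_specificity(a, b, specificity):
    """Read sensitivity off the fitted SROC line D = a + b*S at a fixed specificity.

    Rearranging D = a + b*S gives
    logit TPR = (a + (1 + b) * logit FPR) / (1 - b).
    """
    fpr = 1.0 - specificity
    logit_fpr = math.log(fpr / (1.0 - fpr))
    logit_tpr = (a + (1.0 + b) * logit_fpr) / (1.0 - b)
    return 1.0 / (1.0 + math.exp(-logit_tpr))


# Hypothetical study: 0 false positives among 20 nondiseased persons.
fpr_corrected = corrected_rate(0, 20)  # 0.5 / 21 rather than 0

# Coefficients of the unweighted model reported in the text.
sens = sensitivity_at_specificity(3.631, -0.294, 0.844)
print(round(sens, 3))  # 0.868
```

Under these assumptions the calculation returns the sensitivity of 0.868 reported in the text at the mean specificity of 0.844.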

Appendix 3. Methods for Assessing the Effect on Diagnostic Accuracy of Variation in Study Validity and Characteristics of the Patients and Test

Estimates of diagnostic accuracy may vary by study design characteristics (study validity) or characteristics of the patients or test (generalizability). Differences in diagnostic accuracy by characteristics are more likely to be real rather than caused by the play of chance if the analyses fulfill criteria such as being based on a previous hypothesis and showing large statistically significant differences [57]. Most variability in diagnostic accuracy will probably not be explained by reported characteristics. Formal methods (random-effects models) exist for taking account of heterogeneity between primary studies and estimating summary measures with appropriate confidence intervals in meta-analyses of randomized trials [54, 55] but have not as yet been published for most measures of diagnostic accuracy.

Experience with modeling the effect of characteristics on diagnostic accuracy is limited at present. Because most primary studies only provide test data around a single threshold and methods for exploring the effect of other variables are not well developed, we restrict our comments to examining whether a single variable predicts diagnostic accuracy using SROC. Assessing whether characteristics affect test accuracy requires a method that identifies whether accuracy is better in certain groups or if there is only a shift in threshold along the same ROC. To assess accuracy, the two groups being compared can be shown graphically using different symbols. The magnitude of the difference between groups and its statistical significance can be obtained by including the group variable in the SROC model. Alternatively, one can model the SROC curve based on all the data and compare the residuals around this model for the two groups using an unpaired t-test [14]. We now explore the first method using the example from Appendix 2.

We may wish to know if diagnostic accuracy is improved by having thallium scintigrams read using computerized techniques rather than by visual examination. The data are derived from Tables 2 and 3 of the meta-analysis by Detrano and colleagues [56]. The analysis compares studies that used visual techniques to read the thallium scintigrams with those that used computer or semi-quantitative techniques. The sensitivity appears to improve with computerized reading (Table 4). However, specificity deteriorates. This result could be explained by a threshold difference between the two reading techniques rather than a difference in accuracy of the two techniques. In Figure 1, estimates of diagnostic accuracy for visual reading are more common at lower true- and false-positive rates and computerized reading at higher true- and false-positive rates. This finding could be caused by a shift in threshold and could be confirmed by comparing the means of S for the two techniques. Means are –0.45 (CI, –0.79 to –0.12) for visual readings and 0.80 (CI, 0.08 to 1.53) for computerized readings, suggesting that there is a significant (P = 0.005) difference in threshold. Reading technique does not have a statistically significant coefficient if included in the SROC model (Table 5). Setting specificity at 0.880 in the model gives a sensitivity of 0.846 for visual reading and 0.858 for computerized reading. In addition to not being statistically significant, the difference is clinically unimportant. In summary, the difference between reading by computer or visual methods does not change accuracy; it only shifts the threshold to increase the sensitivity by the amount one would expect from the reduction in specificity.
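Including a group variable in the SROC model might be sketched as follows, assuming NumPy. The model D = a + bS + cG, with G a 0/1 indicator (for example, 0 for visual and 1 for computerized reading), is one straightforward way to add such a term. The function names and the six studies below are hypothetical and are not the data of Detrano and colleagues. The sketch returns point estimates only; confidence intervals and P values would come from a statistical package.

```python
import numpy as np


def sroc_d_and_s(tpr, fpr):
    """Return arrays (D, S) from arrays of true- and false-positive rates."""
    tpr = np.asarray(tpr, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    logit_tpr = np.log(tpr / (1.0 - tpr))
    logit_fpr = np.log(fpr / (1.0 - fpr))
    return logit_tpr - logit_fpr, logit_tpr + logit_fpr


def fit_sroc_with_group(tpr, fpr, group):
    """Unweighted least-squares fit of D = a + b*S + c*G; returns (a, b, c)."""
    d, s = sroc_d_and_s(tpr, fpr)
    g = np.asarray(group, dtype=float)
    design = np.column_stack([np.ones_like(s), s, g])
    coef, *_ = np.linalg.lstsq(design, d, rcond=None)
    return tuple(coef)


def mean_s_by_group(tpr, fpr, group):
    """Mean of the threshold measure S within each group (0 and 1)."""
    _, s = sroc_d_and_s(tpr, fpr)
    g = np.asarray(group)
    return s[g == 0].mean(), s[g == 1].mean()


# Hypothetical studies: group 0 = visual reading, group 1 = computerized reading.
tpr = [0.75, 0.80, 0.82, 0.88, 0.90, 0.93]
fpr = [0.06, 0.08, 0.12, 0.15, 0.20, 0.30]
group = [0, 0, 0, 1, 1, 1]

a, b, c = fit_sroc_with_group(tpr, fpr, group)
mean_s_visual, mean_s_computer = mean_s_by_group(tpr, fpr, group)
```

A coefficient c near zero alongside clearly different group means of S would correspond to the pattern described above: a shift in threshold rather than a difference in accuracy.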


Table 4. Mean Sensitivity and Specificity of Thallium Scintigrams by Reading Technique*

 

Table 5. Summary Receiver Operating Characteristic Curve Models with and without a Term for Reading Technique

 


Author and Article Information

From the University of Sydney, Sydney, Australia; Dartmouth Medical School, Hanover, New Hampshire; Harvard School of Public Health, Harvard Medical School, and the New England Medical Center, Boston, Massachusetts.
Requests for Reprints: Les Irwig, MBBCH, PhD, Department of Public Health, Building A27, University of Sydney, New South Wales, Australia 2006.
Acknowledgments: The authors thank Colin Begg, Gordon Guyatt, and David Sackett for review of the manuscript; Catherine Chock for assistance with data analysis; and Bruce Kupelnick and Clarence Zachery for assistance with literature searching and retrieval.
Grant Support: In part by grant HS05936 from the Agency for Health Care Policy and Research.


References

1. Panzer RJ, Black ER, Griner PF, eds. Diagnostic Strategies for Common Medical Problems. Philadelphia: American College of Physicians; 1991.

2. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB, Drummond M. A framework for clinical evaluation of diagnostic technologies. Can Med Assoc J. 1986; 134:587-94.

3. L'Abbe KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987; 107:224-33.

4. Jenicek M. Meta-analysis in medicine. Where we are and where we want to go. J Clin Epidemiol. 1989; 42:35-44.

5. Fleiss JL, Gross AJ. Meta-analysis in epidemiology, with special reference to studies of the association between exposure to environmental tobacco smoke and lung cancer: a critique. J Clin Epidemiol. 1991; 44:127-39.

6. Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol Rev. 1987; 9:1-30.

7. Oxman AD, Guyatt GH. Validation of an index of the quality of review articles. J Clin Epidemiol. 1991; 44:1271-8.

8. Sacks HS, Berrier J, Reitman D, Ancona-Berk VA, Chalmers TC. Meta-analyses of randomized controlled trials. N Engl J Med. 1987; 316:450-5.

9. Berman DS, Kiat H, Van Train KF, Friedman J, Garcia EV, Maddahi J. Comparison of SPECT using technetium-99m agents and thallium-201 and PET for the assessment of myocardial perfusion and viability. Am J Cardiol. 1990; 66:72E-79E.

10. Dales RE, Stark RM, Raman S. Computed tomography to stage lung cancer. Approaching a controversy using meta-analysis. Am Rev Respir Dis. 1990; 141:1096-101.

11. Gianrossi R, Detrano R, Columbo A, Froehlicher V. Cardiac fluoroscopy for the diagnosis of coronary artery disease: a meta analytic review. Am Heart J. 1990; 120:1179-88.

12. Goris ML, Basso LV, Keeling C. Parathyroid imaging. J Nucl Med. 1991; 32:887-9.

13. Hoffman RM, Kent DL, Deyo RA. Diagnostic accuracy and clinical utility of thermography for lumbar radiculopathy. A meta-analysis. Spine. 1991; 16:623-8.

14. Hurlbut TA 3d, Littenberg B. The diagnostic accuracy of rapid dipstick tests to predict urinary tract infection. Am J Clin Pathol. 1991; 96:582-8.

15. Kardaun JW, Kardaun OJ. Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation. Methods Inf Med. 1990; 29:12-22.

16. Mezger J, Lamerz R, Permanetter W. Diagnostic significance of carcinoembryonic antigen in the differential diagnosis of malignant mesothelioma. J Thorac Cardiovasc Surg. 1990; 100:860-6.

17. Phillips KA. The use of meta-analysis in technology assessment: a meta-analysis of the enzyme immunosorbent assay human immunodeficiency virus antibody test. J Clin Epidemiol. 1991; 44:925-31.

18. Pinson AG, Becker DM, Philbrick JT, Parekh JS. Technetium-99m-RBC venography in the diagnosis of deep venous thrombosis of the lower extremity: a systematic review of the literature. J Nucl Med. 1991; 32:2324-8.

19. Reed JF 3d. Meta-analysis of the reliability of noninvasive carotid studies. Biomed Instrum Technol. 1991; 25:465-71.

20. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2d ed. Boston: Little Brown; 1991.

21. Knottnerus JA, Leffers P. The influence of referral patterns on the characteristics of diagnostic tests. J Clin Epidemiol. 1992; 45:1143-54.

22. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991; 337:867-72.

23. Clermont RJ, Chalmers TC. The transaminase tests in liver disease. Medicine (Baltimore). 1967; 46:197-207.

24. Chalmers TC, Frank CS, Reitman D. Minimizing the three stages of publication bias. JAMA. 1990; 263:1392-5.

25. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl Cancer Inst. 1989; 81:107-15.

26. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med. 1992; 327:248-54.

27. Sox HC Jr, Blatt MA, Higgins MC, Marton KI. Medical Decision Making. Boston: Butterworths; 1988.

28. Fletcher RH, Fletcher SW, Wagner EH. Clinical Epidemiology: The Essentials. 2d ed. Baltimore: Williams & Wilkins; 1988.

29. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging. 1989; 29:307-35.

30. Centor RM, Schwartz JS. An evaluation of methods for estimating the area under the receiver operating characteristic (ROC) curve. Med Decis Making. 1985; 5:149-56.

31. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med. 1993; 12:1293-316.

32. Albert A. On the use and computation of likelihood ratios in clinical chemistry. Clin Chem. 1982; 28:1113-9.

33. Irwig L. Modelling result-specific likelihood ratios (Letter). J Clin Epidemiol. 1992; 45:1335-8.

34. Tosteson AN, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making. 1988; 8:204-15.

35. Hunink MG, Richardson DK, Doubilet PM, Begg CB. Testing for pulmonary maturity: ROC analysis involving covariates, verification bias, and combination testing. Med Decis Making. 1990; 10:201-11.

36. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making. 1989; 9:190-5.

37. McClish DK. Determining a range of false-positive rates for which ROC curves differ. Med Decis Making. 1990; 10:283-7.

38. Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989; 76:585-92.

39. Guyatt GH, Oxman AD, Ali M, Willan A, McIlroy W, Patterson C. Laboratory diagnosis of iron-deficiency anemia: an overview. J Gen Intern Med. 1992; 7:145-53.

40. Begg CB. Biases in the assessment of diagnostic tests. Stat Med. 1987; 6:411-23.

41. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics. 1983; 39:207-15.

42. Gray R, Begg CB, Greenes RA. Construction of receiver operating characteristic curves when disease verification is subject to selection bias. Med Decis Making. 1984; 4:151-64.

43. Mulrow CD, Linn WD, Gaul MK, Pugh JA. Assessing quality of a diagnostic test evaluation. J Gen Intern Med. 1989; 4:288-95.

44. Colditz GA, Miller JN, Mosteller F. How study design affects outcomes in comparisons of therapy. I: Medical. Stat Med. 1989; 8:441-54.

45. Miller JN, Colditz GA, Mosteller F. How study design affects outcomes in comparisons of therapy. II: Surgical. Stat Med. 1989; 8:455-66.

46. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Control Clin Trials. 1990; 11:339-52.

47. Bernelot Moens HJ, Hirshberg AJ, Claessens AA. Data-source effects on the sensitivities and specificities of clinical features in the diagnosis of rheumatoid arthritis: the relevance of multiple sources of knowledge for a decision-support system. Med Decis Making. 1992; 12:250-8.

48. Lachs MS, Nachamkin I, Edelstein PH, Goldman J, Feinstein AR, Schwartz JS. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med. 1992; 117:135-40.

49. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978; 299:926-30.

50. Hlatky MA, Pryor DB, Harrell FE Jr, Califf RM, Mark DB, Rosati RA. Factors affecting sensitivity and specificity of exercise electrocardiography. Am J Med. 1984; 77:64-71.

51. Ng PC, Dear PR. The predictive value of a normal ultrasound scan in the preterm baby—a meta-analysis. Acta Paediatr Scand. 1990; 79:286-91.

52. Banales JL, Pineda PR, Fitzgerald JM, Rubio H, Selman M, Salazar-Lezama M. Adenosine deaminase in the diagnosis of tuberculous pleural effusions. A report of 218 patients and review of the literature. Chest. 1991; 99:355-7.

53. Rosen Y, Rosenblatt P, Saltzman E. Intraoperative pathologic diagnosis of thyroid neoplasms. Report on experience with 504 specimens. Cancer. 1990; 66:2001-6.

54. Laird NM, Mosteller F. Some statistical methods for combining experimental results. Int J Technol Assess Health Care. 1990; 6:5-30.

55. Berlin JA, Laird NM, Sacks HS, Chalmers TC. A comparison of statistical methods for combining event rates from clinical trials. Stat Med. 1989; 8:141-51.

56. Detrano R, Janosi A, Lyons KP, Marcondes G, Abbassi N, Froelicher VF. Factors affecting sensitivity and specificity of a diagnostic test: the exercise thallium scintigram. Am J Med. 1988; 84:699-710.

57. Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann Intern Med. 1992; 116:78-84.

