Guidelines for Meta-analyses Evaluating Diagnostic Tests

  1. Les Irwig, MBBCh, PhD;
  2. Anna N. A. Tosteson, ScD;
  3. Constantine Gatsonis, PhD;
  4. Joseph Lau, MD;
  5. Graham Colditz, MD, DrPH;
  6. Thomas C. Chalmers, MD; and
  7. Frederick Mosteller, PhD
  From the University of Sydney, Sydney, Australia; Dartmouth Medical School, Hanover, New Hampshire; Harvard School of Public Health, Harvard Medical School, and the New England Medical Center, Boston, Massachusetts. Requests for Reprints: Les Irwig, MBBCh, PhD, Department of Public Health, Building A27, University of Sydney, New South Wales, Australia 2006. Acknowledgments: The authors thank Colin Begg, Gordon Guyatt, and David Sackett for review of the manuscript; Catherine Chock for assistance with data analysis; and Bruce Kupelnick and Clarence Zachery for assistance with literature searching and retrieval. Grant Support: In part by grant HS05936 from the Agency for Health Care Policy and Research.

    Abstract

    Objectives: To introduce guidelines for the conduct, reporting, and critical appraisal of meta-analyses evaluating diagnostic tests and to apply these guidelines to recently published meta-analyses of diagnostic tests.

    Data Sources: The guidelines are based on current concepts of how to assess diagnostic tests and conduct meta-analyses. They are applied to all meta-analyses evaluating diagnostic tests published in English-language journals from January 1990 through December 1991, identified through MEDLINE searching and by experts in the field.

    Study Selection: Meta-analyses were included if at least two of three independent readers regarded their main purpose as the evaluation of diagnostic tests against a concurrent reference standard.

    Data Extraction: By three independent readers on the extent to which meta-analyses fulfilled each guideline, with consensus defined as agreement by at least two readers.

    Data Synthesis: The guidelines are concerned with determining the objective of the meta-analysis, identifying the relevant literature and extracting the data, estimating diagnostic accuracy, and identifying the extent to which variability is explained by study design characteristics and characteristics of the patients and diagnostic test. In general, the guidelines were only partially fulfilled.

    Conclusion: Meta-analysis is potentially important in the assessment of diagnostic tests. Those reading meta-analyses evaluating diagnostic tests should critically appraise them; those doing meta-analyses should apply recently developed methods. The conduct and reporting of primary studies on which meta-analyses are based require improvement.

    Clinicians must decide whether to use a diagnostic test in a patient and how to interpret the result [1]. Policy-makers must assess the overall value of a test, compare it to alternatives, and decide whether the test should be made available. Both clinical and policy decisions should be based on a thorough evaluation of the test [2]. A crucial step in evaluation is the assessment of diagnostic accuracy—the ability of the test to determine correctly the presence or absence of the disease of interest. This is done by comparing test results with those obtained using a reference (“gold,” criterion, or comparison) standard. Estimates of the diagnostic accuracy of a test may differ among studies. Each study may have included too few patients to give precise estimates or too selected a population to allow general applicability. Therefore, meta-analysis, the critical review and statistical combination of results of previous research [3-6], is potentially useful for assessing diagnostic accuracy. Using meta-analysis, we can 1) provide an overall summary of diagnostic accuracy; 2) determine whether estimates of diagnostic accuracy depend on the study design characteristics [study validity] of the primary studies; 3) determine whether diagnostic accuracy differs in subgroups defined by the characteristics of the patients and test; and 4) identify areas for further research. New hypotheses may be generated or the attempt to meta-analyze data may highlight deficits that need to be addressed in future primary studies before a useful meta-analysis can be done.

    To help researchers do and readers assess meta-analyses of diagnostic accuracy, we suggest guidelines on how they should be conducted, reported, and critically appraised. Our guidelines are based on current concepts of how to assess diagnostic tests and conduct meta-analyses (Table 1). Because other guidelines exist for meta-analysis in general [3-8], we emphasize those issues that are particular to assessing diagnostic accuracy. For each step, the guidelines are used to review 11 journal articles published from January 1990 through December 1991 whose primary purpose was to assess the accuracy of a diagnostic test against a concurrent reference standard using meta-analysis [9-19]. The 11 articles are all the meta-analyses published in this 2-year period that we could identify through various search procedures. The guidelines were applied to each article independently by three of the authors, and the majority view was accepted. (See Appendix 1 for details of search and review procedures.)

    Table 1. Steps in Conducting a Meta-analysis of a Diagnostic Test and Summary of Guidelines

    Determine the Objective and Scope of the Meta-analysis

    Because the type of meta-analysis we are assessing compares a diagnostic test against a reference standard for the disease of interest, a clear statement about what diagnostic test is being evaluated and how the reference standard is defined should be provided. Diagnostic tests are frequently used against the background of clinical information that is available before testing. A test may appear useful if only the association between the test and the reference standard is examined. However, it may not be useful if its information is carried by data that are already available to the clinician or are easy to obtain without ordering the test [20]. Test evaluation should examine the incremental or marginal value of the test. For example, consider an evaluation of exercise thallium scans against the reference standard of coronary angiography in patients with symptoms suggestive of coronary artery disease. An electrocardiogram would have been obtained routinely at the time of exercise. Therefore, the relevant question is whether the thallium scan has any incremental value over the information already available from the electrocardiogram. Similarly, a test may have poorer diagnostic accuracy in a tertiary care facility than in a primary care setting because the patient population has already been selected on the results of other tests [21]. Thus, it is imperative that meta-analysts clearly state the clinical circumstances for which they wish to evaluate diagnostic accuracy.

    Not all primary studies identified will have used the meta-analyst's ideal reference standard or cover the exact tests or clinical context desired by the meta-analyst. This can be dealt with by using only the subset of papers that do so or by examining the extent to which deviation from the ideal changes the findings. In general, we suggest the latter approach and describe it further when we discuss how to assess the effect of variation in study validity and in the characteristics of patients and test. Often, the comparative value of several tests is being assessed. As when only a single test is being evaluated, this comparison should be viewed against the background of previous information, which should be equivalent for the tests being compared.

    Review. The reference standard was stated in 10 and the test of interest in all of the 11 meta-analyses. Two meta-analyses gave a clear statement about the clinical background against which the tests' incremental value was being evaluated. Some authors, for example, Pinson and colleagues [18], stated that this was a problem in the primary studies, pointing to the need for improving their quality. Seven meta-analyses compared tests, whereas 4 evaluated an individual test.

    Retrieve the Relevant Literature

    Literature searching should be done using various methods, including searching computerized bibliographic systems, consulting content experts, searching appropriate journals, and identifying further studies and search strategies using the reference lists of studies already obtained [3, 8, 20]. The final MEDLINE search strategy should be reported in sufficient detail to allow others to repeat it if they wish to update the meta-analysis after publication.

    Except for those clearly outside the scope of the meta-analysis, the relevance of all other papers should be judged by applying inclusion and exclusion criteria. This result could be accomplished by using the same methods outlined in the next section for the extraction of data. Reporting the reason for exclusion of potentially eligible papers helps readers understand how the criteria were applied.

    Publication bias may arise because of the selection for publication of papers with more extreme results because they are “more interesting” or “statistically significant” [22, 23]. Methods for dealing with publication bias have been developed [24, 25], but their applicability to diagnostic test assessment has not been explored.

    Review. Literature retrieval methods were described in 7 of the 11 meta-analyses, all of which used MEDLINE searching. Three articles gave their search terms, and 2 of these explained how these were linked [10, 14]. Criteria for including or excluding papers were given in 8 articles. Four articles gave information about excluded studies. Publication bias was discussed in 3 of the 11 articles, although no estimates could be made of its likely effect.

    Extract and Display the Data

    Assessing study relevance and extracting data require the use of judgment. Various procedures have been suggested to reduce the possibility of obtaining biased estimates of diagnostic accuracy, including independent review by two readers and resolution of differences by a third reader or by discussion between the original two readers [3, 8]. Sometimes, readers are blinded to details of authorship and study results [8]. Although theoretically sound, no evidence exists that blinding results in a decrease in bias. Until such evidence is obtained, the decision on whether to use blinding depends on the resources available. Whatever the decision, authors should state how the judgments were made.

    Publishing a full list of diagnostic accuracy and study characteristics (for example, design features, patient characteristics) for each primary study allows other researchers to decide if they agree with the judgments made and enables re-analysis applying different analytic techniques, using a subset of studies, or adding studies published after the meta-analysis was done [26].

    Review. Two meta-analyses stated that each of the primary studies was assessed by two or more readers [13, 19]. One of the articles [13] also mentioned that disagreements were settled by consensus or a third party. Gianrossi and colleagues [11] offer an example of extensive display of data about diagnostic accuracy and study characteristics.

    Estimate Diagnostic Accuracy

    Before discussing the estimation of diagnostic accuracy in meta-analyses, some key concepts of test assessment need to be addressed [20, 27, 28]. Most measures of diagnostic accuracy are based on the comparison of the test with a reference standard that determines the presence or absence of the disease of interest. An ideal diagnostic test discriminates between diseased and nondiseased individuals without error. Test error can be characterized by various measures. The two most commonly reported measures are sensitivity, the probability that a test result is positive in patients with the disease of interest, and specificity, the probability that a test result is negative in patients without the disease of interest.
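
    As an illustration of the two definitions above, the following minimal sketch computes sensitivity and specificity from the four cells of a cross-classification of test result against reference standard. The function name and the counts are hypothetical and chosen only for illustration.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table.

    tp, fn: diseased patients with a positive / negative test result
    fp, tn: nondiseased patients with a positive / negative test result
    """
    sensitivity = tp / (tp + fn)  # P(test positive | disease present)
    specificity = tn / (tn + fp)  # P(test negative | disease absent)
    return sensitivity, specificity


# Hypothetical counts, not taken from any study discussed here.
sens, spec = sensitivity_specificity(tp=80, fn=20, fp=10, tn=90)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

    Both quantities are conditional on true disease status, so neither depends on how common the disease is in the study sample.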

    These measures rely on a single threshold (cut-point or positivity criterion) for classifying a test result as positive. Changing the threshold to increase sensitivity decreases specificity and vice versa. This trade-off between sensitivity and specificity makes it imperative that they be considered jointly. When studies use different criteria to define positive and negative test results, as in our example, they differ in their explicit threshold. Even when studies use the same explicit threshold, their implicit thresholds may differ, especially if interpretation of the test requires judgment. For example, radiologists may agree to use the same words to describe imaging test results but still differ in what they regard as the boundary between “abnormal” and “probably abnormal.”

    An alternative to reporting a single pair of sensitivity and specificity estimates is to report a range of pairs, which is obtained as the threshold criterion is varied. Such a range of pairs is often reported as a receiver operating characteristic (ROC) curve, which is a graph of the sensitivity (true-positive rate) on the vertical axis against the false-positive rate (1 − specificity) on the horizontal axis [29]. One overall measure of the test's accuracy is the area under the ROC curve, where a value of 0.5 is obtained if the test does no better than chance and a value of 1 is obtained if the test is perfect [29, 30].
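
    The construction of an ROC curve can be sketched as follows, assuming test results on an ordinal scale where higher values are more suspicious. The function names and the data are hypothetical, and the trapezoidal rule used for the area is one simple choice among several.

```python
def roc_points(diseased, nondiseased):
    """Return (FPR, TPR) pairs as the positivity threshold is varied.

    A result is called positive when it is >= the threshold.
    """
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every result: nothing is positive
    for t in thresholds:
        tpr = sum(x >= t for x in diseased) / len(diseased)
        fpr = sum(x >= t for x in nondiseased) / len(nondiseased)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under (FPR, TPR) points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ratings on a 1-5 scale of suspicion.
diseased = [3, 4, 4, 5, 5]
nondiseased = [1, 1, 2, 3, 4]
curve = roc_points(diseased, nondiseased)
print(curve)
print(f"area under the ROC curve = {area_under_curve(curve):.2f}")
```

    Lowering the threshold moves along the curve toward the upper right, raising sensitivity at the cost of specificity, which is the trade-off described above.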

    Another measure of test performance is the likelihood ratio, defined as the ratio of the probability of a particular test result in people with disease to the probability of the same test result in people without disease. A likelihood ratio can be estimated at each possible value of a multi-category or continuous test, giving a result-specific likelihood ratio and avoiding the need to decide on a single threshold for dichotomizing a test as positive or negative. More importantly, this approach avoids the loss of information that the dichotomy causes, whereby a test result just above the threshold is not differentiated from a test result well above the threshold [20]. Likelihood ratios can be used to compute the post-test odds of disease for a patient with known or estimated pretest odds by using a version of Bayes' theorem:

    Post-test odds of disease = (pre-test odds of disease) × (likelihood ratio)
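
    The odds form of Bayes' theorem can be sketched in a few lines. The function names and the numeric values (a pretest probability of 20%, sensitivity of 80%, and specificity of 90%) are assumptions made for illustration; for a dichotomous test, the likelihood ratios of a positive and a negative result follow directly from the definition given above.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    # P(positive | disease) / P(positive | no disease)
    return sensitivity / (1 - specificity)


def negative_likelihood_ratio(sensitivity, specificity):
    # P(negative | disease) / P(negative | no disease)
    return (1 - sensitivity) / specificity


def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical values for illustration only.
lr_pos = positive_likelihood_ratio(0.80, 0.90)
lr_neg = negative_likelihood_ratio(0.80, 0.90)
print(f"after a positive result: {post_test_probability(0.20, lr_pos):.3f}")
print(f"after a negative result: {post_test_probability(0.20, lr_neg):.3f}")
```

    The same `post_test_probability` function applies unchanged to result-specific likelihood ratios from a multi-category or continuous test.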

    Methods for obtaining a summary estimate of diagnostic accuracy in a meta-analysis are now described, first for when test results are available only as a dichotomy and then for when test results are available in more than two categories.

    Test Results Are Available Only as a Dichotomy

    If primary studies provide only sufficient information to estimate sensitivity and specificity, the mean sensitivity and the mean specificity can be estimated, possibly weighted in some way for the sample size of each study. However, this technique is inappropriate because it is likely that different studies use different explicit or implicit thresholds, so that a primary study with a high sensitivity may have a low specificity and vice versa. The problem of estimating mean values separately for sensitivity and specificity is shown by considering the following hypothetical data: Three studies of equivalent size have sensitivity and specificity rates of 100% and 0%, 99% and 99%, and 0% and 100%, respectively. The mean sensitivity is 66% and the mean specificity, 66%. Yet, if the true-positive rate (sensitivity) is plotted against the false-positive rate (1 − specificity), that is, using the axes that are also used for an ROC curve, it is evident that diagnostic accuracy is high with the area under the ROC curve close to 1. In general, estimating mean sensitivity and specificity separately underestimates test accuracy. Therefore, a reasonable first approach in a meta-analysis is to plot a scattergram of the true-positive rate against the false-positive rate. The scattergram allows a visual impression of the variability in both measures and how they are related. Recently, techniques to fit a model to such data have been described by Kardaun and Kardaun [15] and by Moses and colleagues [31]. This method is referred to here as a summary receiver operating characteristic (SROC) curve. An example of a scatterplot and a superimposed SROC curve is shown in Figure 1 and its derivation is detailed in Appendix 2.
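
    The three-study hypothetical example can be checked numerically. In the sketch below, the three points are joined by straight lines and the area is computed with the trapezoidal rule; this is a crude stand-in used only to make the point and is not the SROC model itself.

```python
# (sensitivity, specificity) for the three hypothetical studies in the text.
studies = [(1.00, 0.00), (0.99, 0.99), (0.00, 1.00)]

# Averaging each measure separately, as criticized in the text.
mean_sens = sum(sens for sens, _ in studies) / len(studies)
mean_spec = sum(spec for _, spec in studies) / len(studies)

# The same studies as (false-positive rate, true-positive rate) points.
points = sorted((1 - spec, sens) for sens, spec in studies)

# Trapezoidal area under the line segments joining the points.
area = 0.0
for (x0, y0), (x1, y1) in zip(points, points[1:]):
    area += (x1 - x0) * (y0 + y1) / 2

print(f"mean sensitivity = {mean_sens:.2f}, mean specificity = {mean_spec:.2f}")
print(f"area under the joined points = {area:.2f}")
```

    The separate means suggest a mediocre test, whereas the points viewed jointly lie on a curve that hugs the upper left corner of the plot.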

    Figure 1. Plot of true-positive rate on false-positive rate for thallium scintigrams to detect angiographic coronary artery disease. Studies using computerized or semi-computerized reading techniques are shown as open circles and those using visual techniques as solid squares.

    Test Results Are Available in More than Two Categories

    If test results are measured as a continuum (for example, biochemical values such as creatine kinase) or as responses on an ordinal categorical scale (for example, degrees of suspicion of abnormality in the interpretation of radiologic images), then several other analytic techniques can be used. If no threshold or scaling differences between primary studies exist and test comparison is not an objective, then result-specific likelihood ratios can be obtained from logistic modeling procedures [32, 33].

    If scale or threshold differences between primary studies are likely, as is probable for tests involving judgment, then likelihood ratios cannot be obtained on pooled data. One approach is to dichotomize the test result for each primary study and to use SROC methods as described above. Better use is made of the data if an ROC curve is constructed for each primary study using several thresholds, and an overall ROC curve is derived using ordinal regression techniques that have been applied previously to the diagnostic setting as a means of controlling for covariates [34]. Summary measures such as the area under the ROC curve can be obtained for the entire curve [35] or a clinically relevant range [36-38]. Difficulties with this approach may arise when studies have different numbers of cut-off values.

    Review. All 11 meta-analyses were based on estimates of sensitivity and specificity from the primary studies. Seven articles gave mean estimates separately for sensitivity and specificity and did not examine their interdependence. Two studies estimated SROC curves [14, 15], one of which also estimated an odds ratio as a measure of diagnostic accuracy [15]. Two papers did not give a summary measure of diagnostic accuracy. Hoffman and colleagues [13] plotted sensitivity against the false-positive rate and decided not to pool results because of heterogeneity among the estimates of diagnostic accuracy from the primary studies. One meta-analysis gave a statistical test of association without any summary measure of diagnostic accuracy [19], whereas another based part of its analysis on a prevalence-dependent measure, the proportion of individuals overall correctly classified by the test [10].

    Ordinal regression techniques have not yet been used to pool ROC data from primary studies. Data from primary studies that are amenable to the ordinal regression approach are uncommon. One meta-analysis did, however, note that some primary studies had test data at more than one threshold [15].

    An example of the use of continuous data is the meta-analysis by Guyatt and colleagues [39], who obtained individual data points from 55 publications on the value of serum ferritin levels in the diagnosis of iron-deficiency anemia. Logistic regression was used on the pooled data to obtain continuous graphs of the likelihood ratio by serum ferritin result.

    Assess the Effect of Variation in Study Validity on Estimates of Diagnostic Accuracy

    Ideally, the estimate of diagnostic accuracy should be based on studies of the highest scientific validity, that is, studies that are most likely to be free of bias [20]. Before we consider guidelines on how to deal with study validity in meta-analyses of diagnostic tests, we describe the most important potential sources of bias in assessing diagnostic accuracy [40].

    Appropriate Reference Standard

    The reference standard must be clearly defined and should be the best available method of assessing the presence or absence of the disease of interest.

    Independence of Observations

    Test results may be biased if the test requires judgment and this judgment is made by someone who has knowledge of the result of the reference standard. In general, diagnostic accuracy would be expected to be overestimated. Therefore, those involved in assessing test results should be blind to the result of the reference standard. Likewise, assessors of the reference standard should be blind to the test result.

    Verification Bias

    Diagnostic accuracy should be assessed in consecutive patients who present with the clinical problem of interest. Verification bias may occur when the reference standard has been assessed on patients sampled differentially in the categories of test results [41, 42]. For example, the hypothetical data based on consecutive patients in Table 2 show a sensitivity of 80% and a specificity of 90%. If establishing the reference standard required invasive procedures, patients who test negative may be less likely to have the reference test. Suppose that only a random sample of 10% of the test-negatives have the reference test. In that case (ignoring sampling variability), the estimated sensitivity would be 98% and the specificity, 47% (Table 3).

    Table 2. Hypothetical Comparison of a Test and a Reference Standard on Consecutive Patients
    Table 3. Verification Bias: Cross-classification Based on Table 2 in Which Only a 10% Random Sample of Test-Negatives Has Been Verified by the Reference Standard*

    The investigators can adjust for the bias if the sampling was done randomly and with known proportions within strata defined by test results. For example, Table 2 can be reformulated from Table 3 if it is known that Table 3 includes only a 10% random sample of test-negatives. Methods also exist for the estimation of confidence intervals for the corrected sensitivity and specificity [41]. Usually the situation is more complex, with other clinical information, such as age or other symptoms, influencing the selection of patients who are assessed by the reference standard. Again, the bias can be adjusted for as long as sampling has been random within the categories defined by test result and known clinical information. However, the choice of patients for verification by the reference standard is commonly not random. In this case, estimates of diagnostic accuracy are biased in unpredictable ways and no adjustment is possible.
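
    The arithmetic of verification bias, and of its correction when the sampling fraction is known, can be sketched as follows. The cell counts are assumptions chosen to give a sensitivity of 80% and a specificity of 90%, because the cells of Table 2 are not reproduced here; the resulting percentages do not depend on the number of diseased and nondiseased patients assumed.

```python
def accuracy(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Assumed counts for consecutive patients (sensitivity 80%, specificity 90%).
tp, fn, fp, tn = 80, 20, 10, 90

# Only a random 10% of test-negatives are verified by the reference standard;
# sampling variability is ignored, as in the text.
fraction_verified = 0.10
fn_verified = fn * fraction_verified
tn_verified = tn * fraction_verified

biased_sens, biased_spec = accuracy(tp, fn_verified, fp, tn_verified)
print(f"biased:    sensitivity = {biased_sens:.0%}, specificity = {biased_spec:.0%}")

# Correction: weight each verified test-negative by the inverse of the known
# sampling fraction to reconstruct the full cross-classification.
corrected_sens, corrected_spec = accuracy(
    tp, fn_verified / fraction_verified, fp, tn_verified / fraction_verified
)
print(f"corrected: sensitivity = {corrected_sens:.0%}, specificity = {corrected_spec:.0%}")
```

    The correction works only because the sampling fraction is known and the sampling is random within test-result categories; when the choice of patients for verification is not random, no such reweighting is available.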

    Issues in Comparative Meta-analyses

    A comparison of the relative accuracy of several diagnostic tests should ideally be based on applying all the tests to each of the patients or randomly assigning tests to patients in each primary study. Obtaining diagnostic accuracy information for different tests from different primary studies is a weak design; differences in diagnostic accuracy may reflect differences in study characteristics that are unlikely to be adequately controlled by attempting to adjust for them in the meta-analysis.

    We next address how to incorporate variation in study validity into the meta-analysis. One approach is to exclude studies that do not meet standards for scientific validity [20, 43]. This method avoids bias at the expense of decreasing precision, that is, widening the confidence intervals. For meta-analyses of clinical trials, empiric evidence supports the theory that inclusion of nonrandomized trials can bias the results [44, 45], although little empiric evidence supports the importance of other deviations from the theoretically ideal design [46]. For meta-analyses evaluating diagnostic tests, little empiric evidence has been accumulated to determine the practical importance of the potential sources of bias discussed above. Moreover, primary studies often do not report sufficient data for judging the potential for bias. We suggest that the initial analysis should assess the effect of reported study design flaws on estimates of diagnostic accuracy. This goal can be accomplished by doing the meta-analysis separately for studies with and without a particular design flaw or by including the presence or absence of the flaw in regression models as outlined in Appendix 3. Primary studies with a particular design flaw may give a different SROC curve to primary studies without that flaw. In that case, reliance should be placed only on those studies that do not have the flaw. Alternatively, the design flaw may not give a different SROC curve, in which case all studies can be used in the meta-analysis. Because different design flaws are likely to cause different biases, we suggest assessing the effect of each separately rather than summarizing them into an overall “validity score.”

    Review. Six of the 11 meta-analyses discussed variability in the choice of reference standard between the primary studies. Three articles examined how this variability affected diagnostic accuracy. Seven of the 11 meta-analyses mentioned other study design characteristics. Pinson and colleagues [18] provide a good example of their assessment. One meta-analysis excluded studies because they were prone to verification bias [15]. Five meta-analyses showed data on the variability of study design characteristics between studies, and 5 did analyses to determine how they predicted diagnostic accuracy. All but 1 of these analyses explored the relation between study design characteristics separately for sensitivity and specificity. Therefore, they are unable to assess whether the study design characteristic altered diagnostic accuracy rather than just reflecting differences in the threshold for test positivity between the primary studies.

    Of the seven comparative diagnostic test evaluations, two were based on the tests both being applied to the same patients within each primary study [9, 14]. The remaining five meta-analyses used the much weaker design of obtaining estimates of diagnostic accuracy for the different tests at least in part from different studies.

    Assess the Effects of Variation in the Characteristics of the Patients and Test on Estimates of Diagnostic Accuracy (Generalizability)

    Valid estimates of diagnostic accuracy may not be generalizable (applicable) to the setting in which the reader works. Readers of a meta-analysis will want to know if they can apply the meta-analyzed estimate of diagnostic accuracy to the clinical or policy decision they confront. Although evidence exists for at least one condition that the combination of multiple tests is generalizable between settings [47], there is still reason for concern about the applicability of diagnostic accuracy assessments from a meta-analysis to other medical settings [21, 48]. Readers of meta-analyses may decide that the summary estimate of diagnostic accuracy is applicable to their decision making for any of the following reasons: 1) Characteristics of the patients and test are similar in the meta-analysis and in their target population; 2) characteristics are not associated with diagnostic accuracy; or 3) a particular characteristic (for example, sex) affects diagnostic accuracy and estimates are provided separately for groups defined by this characteristic. The reader can then apply them separately to each group.

    Relevant characteristics depend on the topic and will be limited by reporting in the primary studies. The major patient characteristics are concerned with the clinical spectrum under consideration. For example, a test may be very accurate at differentiating patients with advanced cancer from persons in perfect health but much less accurate at differentiating patients with early cancers from those whose symptoms are caused by a range of other diseases [49, 50]. This example illustrates two factors that can influence the estimate of test accuracy: the extent of cancer in the “diseased” group and the occurrence of other medical conditions in the “nondiseased” group. The implication of this phenomenon is that measures of diagnostic accuracy are generalizable only to settings that have a similar spectrum of patients, defined by the type and extent of disease in patients with the disease of interest and the type and extent of differential diagnoses in the controls. This spectrum is likely to vary in different practice settings [20, 47]. Other commonly included patient characteristics are age, sex, presenting complaints, comorbid conditions, and the findings of other diagnostic tests that have been done.

    The technical details of tests may also vary from one setting to another and limit generalizability. Variation among studies may be due to different diagnostic accuracy of different test methods used in the primary studies. On the other hand, different test methods may simply vary in their threshold. For example, the diagnostic accuracy of computerized techniques of reading thallium scintigrams for the diagnosis of coronary artery disease can be shown to be no better than visual reading. However, the threshold for computerized reading is at a level that results in a higher sensitivity at a lower specificity (see Figure 1 and Appendix 3).

    Review. Nine of the 11 meta-analyses mentioned variability in at least one patient characteristic among the primary studies. Five articles gave information about the distribution of characteristics, and 7 included them in analysis as predictors of variability of diagnostic accuracy. The most commonly considered patient characteristic was the type and extent of disease in patients with the disease of interest. Seven meta-analyses discussed variability in the test, of which 4 examined how this variability affected diagnostic accuracy. Four meta-analyses examined how publication year affected diagnostic accuracy. As discussed in the review of how meta-analyses dealt with study validity, studies generally examined the effect of characteristics on sensitivity and specificity separately and therefore did not assess whether characteristics affected diagnostic accuracy rather than just causing a shift in threshold.

    Conclusion

    Evaluating and summarizing diagnostic test accuracy from literature articles is a complex task with many methodologic pitfalls and biases. We have outlined steps that improve the validity of meta-analyses of diagnostic tests. Attention to these guidelines can greatly improve our ability to synthesize information in the growing literature on diagnostic test evaluation. Meta-analysis can potentially provide a better understanding by examining the variability in estimates of diagnostic accuracy from primary studies, exploring whether this variability is explained by differences in study validity, and determining whether diagnostic accuracy based on the most valid studies varies for different clinical subgroups or categories of patients. As the conduct and reporting of primary studies improve, meta-analysis will also become a more valuable technique.

    Appendix 1. Procedure for Identifying and Evaluating Meta-analyses

    Literature Retrieval Methods

    Meta-analyses published between January 1990 and December 1991 were identified through searching MEDLINE, by consulting experts in the field, and examining bibliographies of papers already retrieved. MEDLINE searching was done independently by two groups of authors. We then examined the index terms used in MEDLINE for all relevant papers obtained by any means and devised a final search strategy, which was as follows:

    (explode diagnosis OR any of the following subheadings: diagnosis, radionuclide imaging, ultrasonography OR explode “sensitivity and specificity”) AND (Meta-analysis OR the text words “meta” and any word starting with “analy”).

    Additional searches linking terms for diagnosis to “overview” and words starting with “pool” as text words had a negligible yield and were not pursued.

    Review Procedure

    The abstracts of all articles were reviewed by one of the authors to decide whether the article was possibly a meta-analysis evaluating a diagnostic test against a concurrent reference standard. All papers considered possibly eligible were reviewed independently by two of the authors, who assessed whether the paper was concerned with the association between a test and a concurrent reference standard and whether there was an attempt to combine data from primary studies in a quantitative way. A study was considered eligible if both reviewers answered yes to both questions. Disagreements were resolved through independent review by a third reader and by accepting the majority opinion. Note that our first criterion requires that the reference standard be concurrent. Therefore, meta-analyses that require follow-up to establish outcome (for example, reference 51) were excluded.

    All eligible papers were then reviewed independently by three of the authors to assess whether meta-analysis evaluating a diagnostic test was the main purpose of the paper or whether it was secondary (for example, to the reporting of new study results) and whether the meta-analysis addressed each of the issues outlined in our guidelines.

    For each item assessed, the majority view was taken as the final response. Responses about whether our guidelines were addressed are given only for those 11 papers in which meta-analysis was the main purpose. Those papers in which meta-analysis was a secondary purpose (for example, done in the discussion section of the report of new research findings, such as [52, 53]) provided negligible information about how or whether the issues in our guidelines were addressed.

    Appendix 2. Methods for Estimating Summary Receiver Operating Characteristic Curves

    The analytic method for SROC curves is based on the principle that the ROC curve is conveniently represented as a straight line when logit TPR is plotted against logit FPR [31], where TPR = true-positive rate or sensitivity, FPR = false-positive rate or (1 − specificity), logit TPR = log (TPR/[1 − TPR]), and logit FPR = log (FPR/[1 − FPR]).

    For statistical reasons, it is advisable to model logit TPR − logit FPR as a linear function of logit TPR + logit FPR. Thus, to estimate an SROC curve, we use the following model:

    D = a + bS

    where D = logit TPR − logit FPR, S = logit TPR + logit FPR, a = intercept term, and b = regression coefficient for S.

    This model can be fit within available statistical packages using conventional least-squares methods, either unweighted or weighted by the inverse of the variance of (logit TPR − logit FPR). Robust techniques can also be used [31]. Regression lines should be drawn only over the range of the data. The final model can be converted back to the conventional ROC axes of TPR against FPR.
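
    The unweighted fit and the conversion back to ROC axes can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the analyses in this article: the function names and the study counts are hypothetical, only the unweighted fit is shown, and the zero-cell correction (adding 0.5 to the numerator and 1 to the denominator of each rate) is the one described with the worked example later in this appendix. The back-conversion rearranges D = a + bS to logit TPR = (a + [1 + b] logit FPR)/(1 − b).

```python
import math

def logit_rates(tp, fn, fp, tn):
    """Return (logit TPR, logit FPR), adding 0.5 to each numerator
    and 1 to each denominator so that zero cells stay defined."""
    tpr = (tp + 0.5) / (tp + fn + 1.0)
    fpr = (fp + 0.5) / (fp + tn + 1.0)
    return math.log(tpr / (1.0 - tpr)), math.log(fpr / (1.0 - fpr))

def fit_line(xs, ys):
    """Unweighted least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_sroc(studies):
    """Fit D = a + b*S to a list of (tp, fn, fp, tn) tuples."""
    d_values, s_values = [], []
    for tp, fn, fp, tn in studies:
        u, v = logit_rates(tp, fn, fp, tn)
        d_values.append(u - v)  # D = logit TPR - logit FPR
        s_values.append(u + v)  # S = logit TPR + logit FPR
    return fit_line(s_values, d_values)

def sroc_tpr(a, b, fpr):
    """TPR on the fitted SROC curve at a given FPR (requires b != 1)."""
    v = math.log(fpr / (1.0 - fpr))
    u = (a + (1.0 + b) * v) / (1.0 - b)
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical counts (tp, fn, fp, tn), for illustration only.
studies = [(45, 5, 10, 40), (30, 10, 4, 56), (80, 20, 15, 85), (25, 0, 12, 38)]
a, b = fit_sroc(studies)
print(a, b, sroc_tpr(a, b, 0.2))
```

    In keeping with the advice above, the fitted curve should be evaluated only at FPR values within the range observed in the primary studies.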

    In the SROC formulation, D and S have convenient interpretations. D is easily shown to be the log odds ratio, which is a common measure of association in epidemiologic studies. Here, the odds ratio represents the odds of a positive test result among diseased persons relative to the odds of a positive test result among nondiseased persons. S is a measure of the threshold for classifying a test as positive, which has a value of 0 when sensitivity equals specificity. It becomes positive when a threshold is used that increases sensitivity (and decreases specificity) and becomes negative when a threshold is used that decreases sensitivity (and increases specificity). The intercept of the model (a) is therefore a log odds ratio (the value of D when S = 0), and the regression coefficient (b) examines the extent to which the odds ratio is dependent on the threshold used. If the regression coefficient is near zero and not statistically significant, test accuracy for each primary study can be summarized as the odds ratio and these odds ratios can be combined using various techniques [54, 55].
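
    These interpretations can be checked numerically. The short sketch below uses a hypothetical 2 × 2 table (90 of 100 diseased persons and 20 of 100 nondiseased persons test positive) and hypothetical function names; it confirms that D equals the log odds ratio and that S is zero when sensitivity equals specificity.

```python
import math

def d_and_s(sensitivity, specificity):
    """Return (D, S) for a single sensitivity/specificity pair."""
    tpr, fpr = sensitivity, 1.0 - specificity
    logit_tpr = math.log(tpr / (1.0 - tpr))
    logit_fpr = math.log(fpr / (1.0 - fpr))
    return logit_tpr - logit_fpr, logit_tpr + logit_fpr

def log_odds_ratio(tp, fn, fp, tn):
    """Log of the odds ratio (tp * tn) / (fp * fn) from a 2 x 2 table."""
    return math.log((tp * tn) / (fp * fn))

# Hypothetical table: sensitivity 0.90, specificity 0.80.
d, s = d_and_s(0.90, 0.80)
print(d, log_odds_ratio(90, 10, 20, 80))  # the two values agree

# S is zero when sensitivity equals specificity.
print(d_and_s(0.85, 0.85)[1])
```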

    The SROC method deals with the problem of different thresholds among studies and is useful for comparing the overall diagnostic accuracy of different tests or the extent to which accuracy depends on study characteristics. However, it does not directly provide an exclusive estimate of sensitivity and specificity. To do so requires fixing a value for either sensitivity or specificity and reading the corresponding value for the other off the SROC curve. The fixed value could be the median or mean of those found on meta-analysis or based on local experience.

    To illustrate the SROC approach, we use data from a meta-analysis of over 50 primary studies on the exercise thallium scintigram as a test for angiographic coronary artery disease [56]. Although this paper was published before the years of our formal review, it is used because it gives extensive tabulation of the data from each primary study. This meta-analysis examined sensitivity and specificity separately. The range of sensitivity among the primary studies was from 0.63 to 0.98 (mean, 0.840) and the range of specificity from 0.43 to 1.00 (mean, 0.844). The regression equation for the plot of D on S was obtained after adding 0.5 to the numerator and 1 to the denominator of both the TPR and FPR for each study so that any zero cells did not result in undefined transformations [31]. The intercept of the unweighted model is 3.631 (95% CI, 3.354 to 3.907) and the regression coefficient for S is −0.294 (CI, −0.503 to −0.085). The regression coefficient differs significantly from zero, showing that the odds ratio for the association between test and reference standard is dependent on the threshold used. The plot of TPR on FPR is shown (Figure 1). At a mean specificity of 0.844, the sensitivity is estimated at 0.868, which is only slightly higher than the value (0.840) that was reported when the sensitivity and specificity were examined separately. The difference is small because the correlation between logit TPR and logit FPR is modest in this example (Pearson r = 0.19, P = 0.16).
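
    The step of reading sensitivity off the fitted curve can be reproduced from the reported coefficients. The sketch below assumes only the model D = a + bS defined earlier in this appendix, rearranged to logit TPR = (a + [1 + b] logit FPR)/(1 − b); the function name is hypothetical.

```python
import math

def sroc_sensitivity(a, b, specificity):
    """Sensitivity on the SROC curve D = a + b*S at a fixed specificity."""
    fpr = 1.0 - specificity
    v = math.log(fpr / (1.0 - fpr))       # logit FPR
    u = (a + (1.0 + b) * v) / (1.0 - b)   # logit TPR
    return 1.0 / (1.0 + math.exp(-u))

# Intercept and slope of the unweighted model reported above.
sens = sroc_sensitivity(a=3.631, b=-0.294, specificity=0.844)
print(round(sens, 3))  # 0.868
```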

    Appendix 3. Methods for Assessing the Effect on Diagnostic Accuracy of Variation in Study Validity and Characteristics of the Patients and Test

    Estimates of diagnostic accuracy may vary by study design characteristics (study validity) or characteristics of the patients or test (generalizability). Differences in diagnostic accuracy by characteristics are more likely to be real rather than caused by the play of chance if the analyses fulfill criteria such as being based on a previous hypothesis and showing large statistically significant differences [57]. Most variability in diagnostic accuracy will probably not be explained by reported characteristics. Formal methods (random-effects models) exist for taking account of heterogeneity between primary studies and estimating summary measures with appropriate confidence intervals in meta-analyses of randomized trials [54, 55] but have not as yet been published for most measures of diagnostic accuracy.

    Experience with modeling the effect of characteristics on diagnostic accuracy is limited at present. Because most primary studies only provide test data around a single threshold and methods for exploring the effect of other variables are not well developed, we restrict our comments to examining whether a single variable predicts diagnostic accuracy using SROC. Assessing whether characteristics affect test accuracy requires a method that identifies whether accuracy is better in certain groups or if there is only a shift in threshold along the same ROC. To assess accuracy, the two groups being compared can be shown graphically using different symbols. The magnitude of the difference between groups and its statistical significance can be obtained by including the group variable in the SROC model. Alternatively, one can model the SROC curve based on all the data and compare the residuals around this model for the two groups using an unpaired t-test [14]. We now explore the first method using the example from Appendix 2.

    We may wish to know if diagnostic accuracy is improved by having thallium scintigrams read using computerized techniques rather than by visual examination. The data are derived from Tables 2 and 3 of the meta-analysis by Detrano and colleagues [56]. The analysis compares studies that used visual techniques to read the thallium scintigrams with those that used computer or semi-quantitative techniques. The sensitivity appears to improve with computerized reading (Table 4). However, specificity deteriorates. This result could be explained by a threshold difference between the two reading techniques rather than a difference in accuracy of the two techniques. In Figure 1, estimates of diagnostic accuracy for visual reading are more common at lower true- and false-positive rates and computerized reading at higher true- and false-positive rates. This finding could be caused by a shift in threshold and could be confirmed by comparing the means of S for the two techniques. Means are −0.45 (CI, −0.79 to −0.12) for visual readings and 0.80 (CI, 0.08 to 1.53) for computerized readings, suggesting that there is a significant (P = 0.005) difference in threshold. Reading technique does not have a statistically significant coefficient if included in the SROC model (Table 5). Setting specificity at 0.880 in the model gives a sensitivity of 0.846 for visual reading and 0.858 for computerized reading. In addition to not being statistically significant, the difference is clinically unimportant. In summary, the difference between reading by computer or visual methods does not change accuracy; it only shifts the threshold to increase the sensitivity by the amount one would expect from the reduction in specificity.
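
    The two comparisons described above (mean S by group, and a group term in the SROC model) can be sketched as follows. This is an illustration only: the data are hypothetical and are not taken from the thallium meta-analysis, the names are our own, and the group term is coded as G = 0 for one technique and G = 1 for the other in the assumed model D = a + bS + cG. The values of D are generated with no group effect, so the fitted c should be close to zero even though the group means of S differ. Standard errors and significance tests are omitted; a statistical package would normally supply them.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_sroc_with_group(d_values, s_values, groups):
    """Least-squares fit of D = a + b*S + c*G via the normal equations."""
    rows = [[1.0, s, float(g)] for s, g in zip(s_values, groups)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * d for r, d in zip(rows, d_values)) for i in range(3)]
    return solve(xtx, xty)

def mean_s_by_group(s_values, groups):
    """Mean of S (the threshold measure) within each group."""
    means = {}
    for g in set(groups):
        vals = [s for s, gi in zip(s_values, groups) if gi == g]
        means[g] = sum(vals) / len(vals)
    return means

# Hypothetical studies: group 1 operates at higher S (a different threshold),
# but D depends on S only, so accuracy does not differ between groups.
s_values = [-1.2, -0.6, -0.3, 0.1, 0.4, 0.9, 1.3, 1.6]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
d_values = [3.6 - 0.3 * s for s in s_values]

a, b, c = fit_sroc_with_group(d_values, s_values, groups)
means = mean_s_by_group(s_values, groups)
print(means, a, b, c)
```

    A difference in the group means of S together with a coefficient c near zero is the pattern described above: a shift in threshold along the same curve rather than a difference in accuracy.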

    Table 4. Mean Sensitivity and Specificity of Thallium Scintigrams by Reading Technique*
    Table 5. Summary Receiver Operating Characteristic Curve Models with and without a Term for Reading Technique

    References

    1. 1.
    2. 2.
    3. 3.
    4. 4.
    5. 5.
    6. 6.
    7. 7.
    8. 8.
    9. 9.
    10. 10.
    11. 11.
    12. 12.
    13. 13.
    14. 14.
    15. 15.
    16. 16.
    17. 17.
    18. 18.
    19. 19.
    20. 20.
    21. 21.
    22. 22.
    23. 23.
    24. 24.
    25. 25.
    26. 26.
    27. 27.
    28. 28.
    29. 29.
    30. 30.
    31. 31.
    32. 32.
    33. 33.
    34. 34.
    35. 35.
    36. 36.
    37. 37.
    38. 38.
    39. 39.
    40. 40.
    41. 41.
    42. 42.
    43. 43.
    44. 44.
    45. 45.
    46. 46.
    47. 47.
    48. 48.
    49. 49.
    50. 50.
    51. 51.
    52. 52.
    53. 53.
    54. 54.
    55. 55.
    56. 56.
    57. 57.