Integrating Heterogeneous Pieces of Evidence in Systematic Reviews

  1. Cynthia Mulrow, MD, MSc;
  2. Peter Langhorne, PhD, MRCP; and
  3. Jeremy Grimshaw, MBChB, MRCGP
  1. From the University of Texas Health Science Center at San Antonio and Audie L. Murphy Veterans Affairs Hospital, San Antonio, Texas; Royal Infirmary, Glasgow, Scotland; and the University of Aberdeen, Aberdeen, Scotland. Acknowledgments: The authors thank Drs. Robert Fletcher and Brian Haynes for their critical reading of the manuscript. They also thank the clinical reviewer, Norman J. Wilder. Requests for Reprints: Cynthia D. Mulrow, MD, MSc, Audie L. Murphy Memorial Veterans Hospital, 7400 Merton Minter Boulevard (11C6), San Antonio, TX 78284. Current Author Addresses: Dr. Mulrow: Audie L. Murphy Memorial Veterans Hospital, 7400 Merton Minter Boulevard (11C6), San Antonio, TX 78284.

    Abstract

    Researchers preparing systematic reviews often encounter various types of evidence, which can generally be categorized as direct or indirect. The former directly relates an exposure, diagnostic strategy, or therapeutic intervention to the occurrence of a principal health outcome. Evidence is indirect if two or more bodies of evidence are required to relate the exposure, diagnostic strategy, or intervention to the principal health outcome.

    Heterogeneity of data sources complicates integration of both direct and indirect evidence. Participants in different studies may have a wide spectrum of baseline risk and sociodemographic and cultural characteristics. A variety of formulations and intensities of exposures, diagnostic strategies, and interventions, as well as diversity in the selection and definition of control groups, may be encountered. Outcome measures may be different, and similar outcomes may be measured or reported differently. Heterogeneity of study designs and of methodologic features and quality within a given design may be found. The effective integration of direct and indirect evidence requires development of explicit models that serve as analytic frameworks for linking the important pieces of evidence. A model can be viewed as a series of subquestions, with each important subquestion warranting a systematic review. Several subjective and quantitative methods can then be used to integrate the evidence. Tabular displays of major findings and strength of evidence for each subquestion can help reviewers, patients, and providers to integrate the differing research findings and draw reasonable conclusions. Various quantitative techniques, such as decision analysis and the confidence profile method, are also available. No single integration approach is clearly superior, none obviates uncertainty, and all underscore the role of careful judgment in integrating evidence.

    Previous articles in this series described systematic reviews and how to find them [1, 2], discussed their role in practice and educational settings [3-6], and outlined important aspects of their conduct [7-9]. This article addresses a particularly challenging problem in conducting systematic reviews: integration of different types of evidence within a single review. We present strategies for integrating evidence from various primary studies that were conducted with different objectives, protocols, and designs. These strategies may be useful in a variety of situations in which heterogeneous evidence is used (for example, clinical decisions, decision analysis, economic analysis, practice guidelines, and health policy formulations). The strategies are intended primarily for reviewers who address broad questions (for example, reviewers interested in producing evidence-based guidelines). The strategies may also help reviewers with focused questions when the data available on a topic are particularly heterogeneous.

    We consider the following specific questions: 1) How can reviewers classify and structure heterogeneous research evidence? 2) What factors complicate integration of heterogeneous research evidence? 3) What strategies help integrate heterogeneous research evidence?

    Classifying and Structuring Research Evidence

    Just as any scientific inquiry moves from concrete observations to abstract concepts, reviewers must move from samples of data (individual pieces of evidence) to more general conclusions [10]. This process involves drawing together multiple pieces of evidence into a unified whole by categorizing and ordering data [11]. An important first step is to classify evidence as direct or indirect.

    Direct evidence directly relates an exposure, diagnostic strategy, or therapeutic intervention to the occurrence of a principal health outcome [12]. Principal health outcomes are those relevant to the patient, such as symptoms, loss of function, and death [13]. Whether a study provides direct evidence depends on methodologic design and the outcomes studied. For example, some randomized, controlled trials that compared diuretic and β-blocker regimens with no therapy or placebo in hypertensive adults have directly shown that these therapies decrease cardiovascular morbidity and mortality [14]. Other trials comparing antihypertensive agents have shown decreases in blood pressure (a surrogate outcome) but have not directly demonstrated effects on cardiovascular morbidity and mortality (the principal outcomes).

    Evidence is indirect if two or more bodies of evidence are required to relate the exposure or intervention of interest to the principal health outcome [12]. Thus, one body of evidence may relate exercise (the intervention) to lower-extremity strength (an intermediate outcome), and another may relate lower-extremity strength to the probability of falls (the health outcome of principal interest); neither one alone directly relates exercise to falls. Other examples are intervention strategies with several substitutes, particularly when the various substitutes have been evaluated in different types of studies. For example, pravastatin, a 3-hydroxy-3-methylglutaryl-coenzyme A reductase inhibitor, has been shown to decrease low-density lipoprotein cholesterol levels and cardiovascular morbidity and mortality in men with moderate hypercholesterolemia and no history of myocardial infarction (primary prevention) [15]. Several of these inhibitors, including pravastatin, simvastatin, and fluvastatin, have been shown to decrease cardiovascular morbidity or mortality in persons with hypercholesterolemia and history of myocardial infarction (secondary prevention) ([16, 17], Oral presentation of the Lipoprotein and Coronary Atherosclerosis Study at American Heart Association Meeting, Anaheim, California, November 1996). Taken together, these data provide indirect rather than direct evidence that all 3-hydroxy-3-methylglutaryl-coenzyme A reductase inhibitors are effective in the primary prevention of cardiovascular disease.

    Much evidence about health care is indirect. The synthesis of indirect evidence or of pieces of direct evidence requires the creation of models that relate exposures or diagnostic or intervention strategies to principal health outcomes. Conceptually, this involves 1) identifying links that connect the exposures, diagnostic strategies, or interventions to principal health outcomes; 2) analyzing the evidence that pertains to each link; and 3) combining the links [12]. In essence, this approach breaks a complex problem into a series of smaller problems and formally theorizes the relations among those problems. Such models are often called evidence models. They provide reviewers with an analytic framework that clarifies the cause or natural history of a health problem, the sequence of intermediate effects that an exposure or diagnostic or intervention strategy must pass through to reach certain primary outcomes, and the range of potential adverse effects that need consideration [18, 19].
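    The three conceptual steps above (identify links, analyze the evidence for each link, combine the links) can be sketched as a small data structure. The following Python sketch is illustrative only: the `Link` class, the `grade` field and its scale, and the idea of reporting the weakest grade along a chain are assumptions introduced here, not part of the cited frameworks. The example chain uses the exercise and falls illustration given earlier, with invented grades.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One subquestion linking two nodes of a hypothetical evidence model."""
    source: str
    target: str
    grade: int  # illustrative strength-of-evidence score (higher = stronger)


def shortest_chain(links, start, goal):
    """Breadth-first search for the fewest links joining start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for link in links:
            if link.source == node and link.target not in seen:
                seen.add(link.target)
                queue.append((link.target, path + [link]))
    return None  # no body of evidence connects the two nodes


def classify(chain):
    """Direct evidence needs one link; indirect evidence needs two or more."""
    if not chain:
        return "no evidence"
    return "direct" if len(chain) == 1 else "indirect"


# Example from the text: exercise -> strength -> falls (grades are invented).
links = [
    Link("exercise", "lower-extremity strength", 3),
    Link("lower-extremity strength", "falls", 2),
]
chain = shortest_chain(links, "exercise", "falls")
kind = classify(chain)
weakest_grade = min(link.grade for link in chain)
print(kind, weakest_grade)
```

    Reporting the weakest grade is only one possible way to summarize a chain; it reflects the intuition that a conclusion resting on several links is limited by the least well-supported one.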

    Factors Complicating Integration of Evidence

    Regardless of whether reviewers are synthesizing direct or indirect evidence, many factors can modify etiologic and prognostic associations, diagnostic accuracy, and therapeutic effectiveness. Study participants are often drawn from various settings and have a wide spectrum of baseline risk, disease severity, and sociodemographic and cultural characteristics. Exposures, diagnostic strategies, interventions, and comparison groups have varying formulations and intensities. Different outcome measures are used in different studies, and similar outcomes are measured or reported differently. Various study designs are used (Table 1), and heterogeneity of methodologic features occurs within a given design (Table 2).

    Table 1. Potential Sources of Research Evidence
    Table 2. Examples of Important Methodologic Features Associated with Different Types of Studies

    Heterogeneity of research evidence may concurrently exist at one or more levels. A review of several randomized, controlled trials that tested whether a particular drug class resulted in improved survival in similar groups of patients may be complicated only by judgments on the degree of homogeneity of the different drugs within the class. In practice, heterogeneity of only one factor in a given study or group of studies (for example, different drugs within a class) is relatively rare; several sources of complexity are usually present. For example, a recent systematic review of the effectiveness of stroke units included evidence from both randomized and nonrandomized controlled clinical trials; these trials evaluated different models of stroke units, used different patient inclusion criteria, and had various outcome measures [20]. Systematic reviews that examined studies of methods for implementing clinical guidelines have included multifaceted management interventions directed toward different clinical conditions; systematic reviews of the efficacy of continuing medical education examined studies of a variety of educational activities among different groups of health care professionals working in different health care settings [21, 22]. A comprehensive review evaluating the association between cigarette smoking and lung cancer might integrate evidence from laboratory studies of genetic mutations with evidence from case–control and prospective studies of cancer in animals and humans.

    Heterogeneity is a double-edged sword. On the positive side, it may allow reviewers to examine consistency of findings across studies of various types and their applicability in a variety of patients and settings (that is, it may increase generalizability). It may also allow a more comprehensive picture of feasibility, acceptability, benefits, and harms associated with particular formulations of a diagnostic or therapeutic strategy. On the negative side, it may introduce ambiguity into the synthesis of evidence. Researchers conducting systematic reviews may be required to make judgments about the relevance of the heterogeneity, the legitimacy and relative uncertainty of particular pieces of evidence, the importance of missing evidence, the soundness of the model for linking the evidence, and the appropriateness of conducting a quantitative summary.

    Strategies for Integrating Heterogeneous Evidence

    Linking Multiple Pieces of Evidence

    Reviewers addressing broad questions that involve linkages among multiple bodies of both indirect and direct evidence need to use explicitly defined models. An example of a model that was used to guide a systematic review of screening for hearing impairment in elderly persons is given in Figure 1 [23]. The model was based on preset criteria for evaluating screening programs [24]. Frameworks for constructing models of causality, prognosis, effectiveness of diagnostic and intervention strategies, and specific relationships between surrogate and clinically meaningful outcomes are also available [12, 13, 18, 25, 26]. An example of a complex framework for assessing benefits and harms of a particular therapy is given in Figure 2.

    Figure 1. Model examining rationale for screening for hearing impairment. The following are focused questions associated with some of the linkages shown. Linkage 2: What is the accuracy of screening tests (whispered voice, tuning fork, finger rub, questionnaires, audioscope) for identifying elderly patients with hearing impairments? Linkage 3: What adverse effects from mislabeling result from measuring hearing impairment in elderly patients with previously undetected hearing impairment? Linkage 4: Does treating hearing-impaired elderly patients by using hearing aids improve the acuity of hearing?
    Figure 2. An evidence model for the pharmacologic treatment of obesity. The following are focused questions associated with some of the linkages shown. Linkage 7: Does the effect of pharmacologic agents on abdominal or visceral fat lead to improved lipoprotein levels? Linkage 8: Does the effect of pharmacologic agents on abdominal or visceral fat affect control of blood sugar? Linkage 9: Does the effect of pharmacologic agents on abdominal or visceral fat affect control of blood pressure? Linkage 10: Does pharmacologic treatment of obesity affect control of blood pressure independently of its effect on weight or abdominal or visceral fat?

    Each link in a model represents a subquestion for which a systematic review could be conducted. In some instances, direct evidence that obviates the need to address certain intermediate linkages may be available. Reviewers select important linkages and perform a series of pertinent systematic reviews, each with a well-formulated question, specified inclusion criteria, explicit searching and selection techniques, and method of critical appraisal. Evidence tables for each subquestion can be developed (Table 3, Table 4). These can be accompanied by narrative summaries that identify the direction, magnitude, significance, and uncertainty of effects and highlight major issues affecting the applicability and validity of data. For some subquestions, meta-analyses may be possible. Likewise, for some subquestions on prognosis, diagnosis, or therapy, the strength or level of available evidence may be ranked by using criteria that emphasize methodologic rigor and avoidance of bias [27-32].

    Table 3. Example of Items To Include in Evidence Table
    Table 4. Table 3. Continued

    The techniques for integrating and interpreting multiple types and units of evidence are evolving. Current methods include subjective as well as quantitative approaches [33]. One subjective approach is to create a tabular display or balance sheet that lists the major findings (such as the direction, magnitude, and uncertainty of effects) and strength of evidence for each subquestion. The goal is to condense important information into a display that can be grasped both visually and mentally [34]. Reviewers then use the tabular displays as structures for integrating a mixture of research findings and for drawing conclusions. Patients and their providers can also use the displays to integrate evidence and make their own personalized decisions. Several potential pitfalls need to be considered, however, when global interpretations and judgments are made on the basis of balance sheets. These pitfalls include overrelying on single outcomes; using statistical significance as a proxy for the clinical impact (effect size) of an outcome; ignoring the actual magnitude of effects and the degree of uncertainty associated with those effects; failing to differentiate surrogate from clinically meaningful outcomes; and retreating to such generalities as “cancer is bad, so any intervention that combats it is worthwhile” [18].

    Another subjective yet explicit approach is to base integration and conclusions on a limited number of important variables. The U.S. Preventive Services Task Force, for example, subjectively integrated research on preventive care strategies on the basis of three criteria: burden of suffering from the target condition; characteristics of the prevention strategy, such as feasibility; and demonstrated effectiveness of the strategy determined by considering the rigor of available evidence. By using this three-pronged approach, the Task Force concluded that there was good evidence (grade A) to recommend screening for cervical cancer with Papanicolaou testing even though no data from randomized, controlled trials directly show the clinical benefits of screening with this technique [35].

    More singular emphasis on the methodologic strength or level of evidence can be used to draw conclusions [29]. An important pitfall to avoid in this approach is confusing lack of high-level evidence with evidence against a particular strategy. Absence of proof is not proof of absence. Moreover, a single item of high-level evidence may be available for a particular diagnostic strategy or therapeutic intervention; although no high-level evidence exists for alternative strategies, many pieces of indirect evidence may at the same time suggest the superiority of the alternative strategies.

    A variety of quantitative models is available for linking intermediate events and several pieces of evidence together in sequence. Formal decision analyses are quantitative models that use explicit paths to connect decisions to intermediate and final outcomes. The paths represent a series of actions and events, beginning with an initial choice node and ending in outcomes that can be weighted to reflect patient preferences or utilities [36-38]. Probabilities of all possible outcomes, which are ideally estimated by using individual systematic reviews, are combined to determine the optimal course of action. Advanced stochastic modeling techniques, such as Markov chains, state-transition models, and difference equations, can be used to analyze particularly complex multidirectional relations [39]. A relatively new technique, the confidence profile method, allows analysis of evidence involving mixed comparisons (for example, drawing conclusions about A compared with B on the basis of evidence about A compared with C and C compared with B) [40]. Adjustments for prior probabilities, biases, and relative uncertainties that are pertinent to particular pieces of evidence can be incorporated into these models. All of these quantitative techniques have limitations because they rely on many assumptions; require special computational tools, software, and statistical expertise; and usually are not transparent to users of evidence.
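    The way a formal decision analysis combines probabilities and utilities along explicit paths can be illustrated with a minimal sketch. The Python code below rolls back a small decision tree by expected utility. The tuple-based node representation, the option names, and every probability and utility are hypothetical values chosen for illustration; in a real analysis the probabilities would ideally come from the individual systematic reviews.

```python
def expected_utility(node):
    """Roll back a tree whose nodes are (kind, payload) tuples.

    ("outcome", utility)                      -> terminal utility
    ("chance", [(probability, child), ...])   -> probability-weighted average
    ("decision", {"option name": child, ...}) -> value of the best option
    """
    kind, payload = node
    if kind == "outcome":
        return payload
    if kind == "chance":
        if abs(sum(p for p, _ in payload) - 1.0) > 1e-9:
            raise ValueError("chance-node probabilities must sum to 1")
        return sum(p * expected_utility(child) for p, child in payload)
    if kind == "decision":
        return max(expected_utility(child) for child in payload.values())
    raise ValueError(f"unknown node kind: {kind!r}")


def best_option(decision_node):
    """Return the option with the highest expected utility and all values."""
    kind, options = decision_node
    if kind != "decision":
        raise ValueError("best_option expects a decision node")
    values = {name: expected_utility(child) for name, child in options.items()}
    return max(values, key=values.get), values


# Hypothetical initial choice node; all numbers are invented.
tree = ("decision", {
    "treat": ("chance", [
        (0.8, ("outcome", 0.95)),  # responds, minor adverse effects
        (0.2, ("outcome", 0.50)),  # does not respond
    ]),
    "no treatment": ("chance", [
        (0.6, ("outcome", 1.00)),  # remains well
        (0.4, ("outcome", 0.40)),  # disease progresses
    ]),
})
best, values = best_option(tree)
print(best, values)
```

    Varying the invented probabilities and utilities over plausible ranges shows how strongly the preferred option depends on the assumptions, which is one reason such models are not always transparent to users of evidence.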

    Addressing Heterogeneity in Single Bodies of Evidence

    Even reviewers who address focused questions are challenged by heterogeneity. One way of dealing with this problem is to use narrow inclusion criteria. For example, reviewers may review only studies that report a particular outcome. This approach ensures a more uniform data set that may limit ambiguity. The drawbacks of such narrow focus include the risk for losing valuable information and the possibility of introducing bias in favor of studies that report outcomes in a particular way [41]. Some reviewers restrict reviews to particular study designs or, even more severely, to particular study designs with certain methodologic characteristics (for example, double-blind, randomized, controlled trials rather than randomized, controlled trials). Although these approaches may limit bias, we still do not fully understand all of the factors that influence the validity of most study designs; thus, we may receive a false sense of assurance from the use of such restrictive techniques. Furthermore, the quality of studies is a continuum, and evidence from well-conducted studies with “weaker” designs may be more robust than evidence from poorly conducted studies with more “rigorous” designs.

    Quantitative methods for coping with heterogeneity include sensitivity analyses that explore the effect of grouping data in a variety of ways [9]. A good example is a recent meta-analysis of oral contraceptives and breast cancer that included a series of sensitivity analyses [42]. These analyses revealed a small increase in the risk for breast cancer with the use of oral contraceptives that was independent of methodologic factors (design of the primary studies), study sample factors (age, ethnicity, and educational and reproductive background), context of the primary study (national setting), and drug factors (type and duration of drug therapy used). Subgroup analyses must be used with caution, however, because they are subject to many recognized limitations, including spurious associations that may be suggested by such “data dredging” [43]. Finally, special types of meta-analysis that use individual patient data obtained from primary investigators may allow adjustment for heterogeneity and confounding by multiple factors [9, 44, 45]. This approach is labor and resource intensive and, although potentially powerful, may not be possible in many circumstances.
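    A sensitivity analysis that regroups the same studies in different ways can be sketched as follows. This Python sketch assumes fixed-effect, inverse-variance pooling of log odds ratios, which is one common choice and not necessarily the method used in the cited meta-analysis. The study records, field names, and numbers are invented for illustration.

```python
import math


def pool_fixed_effect(studies):
    """Inverse-variance fixed-effect pooled log odds ratio and its SE."""
    weights = [1.0 / s["se"] ** 2 for s in studies]
    total = sum(weights)
    pooled = sum(w * s["log_or"] for w, s in zip(weights, studies)) / total
    return pooled, math.sqrt(1.0 / total)


def sensitivity_by(studies, key):
    """Re-pool the studies separately within each level of one factor."""
    groups = {}
    for study in studies:
        groups.setdefault(study[key], []).append(study)
    return {level: pool_fixed_effect(members)
            for level, members in groups.items()}


# Hypothetical studies; log_or is the log odds ratio, se its standard error.
studies = [
    {"name": "A", "design": "case-control", "log_or": 0.10, "se": 0.10},
    {"name": "B", "design": "case-control", "log_or": 0.30, "se": 0.10},
    {"name": "C", "design": "cohort", "log_or": 0.20, "se": 0.20},
]
overall = pool_fixed_effect(studies)
by_design = sensitivity_by(studies, "design")
for level, (estimate, se) in by_design.items():
    print(level, round(math.exp(estimate), 2), round(se, 3))
```

    If the pooled estimates are similar across groupings, as in this invented data set, the finding appears robust to that factor; the more groupings that are examined, the greater the risk of spurious subgroup findings.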

    Studies addressing similar questions often report different outcomes. For instance, controlled trials assessing the effect of interventions to reduce alcohol consumption may include biochemical markers, professional reports, or self-reports of abstinence as outcomes. In such circumstances, it may be possible to conduct separate meta-analyses for each key end point. As an alternative, standardized effect sizes or scale-free weighted mean differences can be used; a standardized effect size is the ratio of the difference between means in the treatment and control groups to the SD in the control group [46]. Standardized effect sizes help estimate whether an intervention has a consistent effect in a group of related outcomes. Limitations of the use of standardized effect sizes include the following: 1) all outcomes are given equal weight regardless of clinical significance, 2) misleading results may occur if unrelated outcomes are combined or if major differences in effects of the intervention on the different outcomes exist, and 3) bias may result if investigators have selectively reported their most positive results. Another approach is to derive a standardized definition for outcomes. For example, reviewers evaluating comprehensive geriatric assessment obtained unpublished information from the original investigators on outcomes, such as functional status, that could be standardized across studies [47]. A review of stroke unit trials used a standardized description of stroke services and a preestablished definition of disability that could be determined from several different disability scales [20]. Similarly, standardized data were used in a review of adverse gastrointestinal effects associated with nonsteroidal anti-inflammatory drugs [48].
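    The standardized effect size described above (difference between treatment and control means divided by the control-group SD) can be computed directly. The Python sketch below is a minimal illustration; the function name and the sample values are invented, and the sample SD (n - 1 denominator) is an assumption because the text does not specify which SD estimator is meant.

```python
from statistics import mean, stdev


def standardized_effect_size(treatment, control):
    """(mean of treatment - mean of control) / sample SD of control."""
    sd_control = stdev(control)  # sample SD; requires at least two values
    if sd_control == 0:
        raise ValueError("control-group SD is zero; effect size undefined")
    return (mean(treatment) - mean(control)) / sd_control


# Two invented outcomes measured on different scales in the same trial.
outcome_small_scale = standardized_effect_size([13, 15, 17], [10, 12, 14])
outcome_large_scale = standardized_effect_size([130, 150, 170], [100, 120, 140])
print(outcome_small_scale, outcome_large_scale)
```

    Because both invented outcomes differ only by a factor of 10 in their measurement scale, they yield the same standardized value, which is what makes the measure scale-free.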

    Conclusions

    Reviewers interested in integrating many pieces of evidence face a kaleidoscope of research data that are fragmented by heterogeneity among study populations, exposures, diagnostic or intervention strategies, comparison groups, outcomes, study design, and quality. Although such heterogeneity may stimulate confidence by allowing assessment of general consistency and applicability, it may also increase uncertainty. Reviewers must therefore make judgments about the relevance of the heterogeneity, the legitimacy and relative uncertainty of particular pieces of evidence, the importance of missing pieces, the soundness of models for linking various pieces, and the appropriateness of conducting a quantitative summary. Integration of multiple pieces of disparate evidence is therefore a challenging and complex task that demands skill, humility, and skepticism. Initial steps involve recognizing the distinction between direct and indirect evidence and developing explicit models that break a complex problem into a series of smaller subproblems and hypothesize linkages among those subproblems. Focused systematic reviews for the important subproblems should be performed. Then, any of a variety of subjective or quantitative methods can be used to help integrate data into a unified whole. No single integrative approach is clearly superior, and some require specialized techniques. None obviates uncertainty, all involve assumptions, and all ultimately underscore the role of careful judgment in integrating evidence.

    Key Points To Remember

    Direct evidence relates an exposure, diagnostic strategy, or therapeutic intervention directly to the occurrence of a principal health outcome. Evidence is indirect if two or more bodies of evidence are required to relate the exposure, diagnostic strategy, or intervention to the principal health outcome.

    Explicit models provide analytic frameworks for viewing many pieces of evidence; they break a complex problem into a series of smaller subproblems and formally theorize linkages between those subproblems.

    Multiple factors, including heterogeneity of study populations, exposures or diagnostic or intervention strategies, comparison groups, outcomes, and study design and quality, contribute to the complexity of integrating direct and indirect evidence.

    Current methods for integrating heterogeneous evidence include a variety of evolving subjective and quantitative approaches.

    Appendix

    The concepts and examples outlined in this article are drawn from a wide variety of sources. We used the Cochrane Methodology Database for the core of information. We also used informal approaches, such as discussion with colleagues and workshops on complexity within systematic reviews that have been conducted at several international research symposia.

    Dr. Langhorne: Section of Geriatric Medicine, 3rd Floor Centre Block, Royal Infirmary, Glasgow G4 0SF, United Kingdom.

    Dr. Grimshaw: Health Services Research Unit, Department of Public Health, Drew Kay Wing, Polwarth Building, Fosterhill, Aberdeen AB9 2ZD, United Kingdom.

    References

    1. 1.
    2. 2.
    3. 3.
    4. 4.
    5. 5.
    6. 6.
    7. 7.
    8. 8.
    9. 9.
    10. 10.
    11. 11.
    12. 12.
    13. 13.
    14. 14.
    15. 15.
    16. 16.
    17. 17.
    18. 18.
    19. 19.
    20. 20.
    21. 21.
    22. 22.
    23. 23.
    24. 24.
    25. 25.
    26. 26.
    27. 27.
    28. 28.
    29. 29.
    30. 30.
    31. 31.
    32. 32.
    33. 33.
    34. 34.
    35. 35.
    36. 36.
    37. 37.
    38. 38.
    39. 39.
    40. 40.
    41. 41.
    42. 42.
    43. 43.
    44. 44.
    45. 45.
    46. 46.
    47. 47.
    48. 48.