We thank Dr. Stern for his thoughts on improving methods for the evaluation of risk prediction models.
We agree wholeheartedly that the distribution of risks predicted by the risk prediction model is key for evaluating model performance. This distribution has been called the predictiveness curve in the statistical literature, and we have advocated strongly for its use (Pepe et al, 2008; Huang, Pepe and Feng, 2007). In fact, the margins of a risk stratification table display exactly this: they show the population distribution of risk according to the two models, albeit using discrete categories. Since the main goal of our paper is to emphasize that one should focus on the margins of the risk stratification table rather than on the interior cells, our paper in fact concurs with Dr. Stern’s point of view.
The AUC or c-statistic can indeed be viewed as a measure of the dispersion of the risk distribution. However, it seems to be a measure that lacks clinical relevance (Pepe, Janes and Gu, 2007; Pepe and Janes, 2008). In addition, dissatisfaction with the ROC curve stems in part from the fact that it does not display risk thresholds. We advocate instead displaying risk distributions for events and non-events separately as a way to directly view the true positive rates (for events) and false positive rates (for non-events) associated with specific risk thresholds (Pepe et al, 2008). Although this display is mathematically equivalent to reporting the ROC curve and the overall event rate (Huang and Pepe, in press), the risk distributions are much easier to interpret. Again, the margins of the risk stratification table show these distributions in categories.
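As an illustration of reading true and false positive rates directly from the two risk distributions, the following is a minimal sketch. The function name, the risks, the outcomes, and the thresholds are all hypothetical and chosen only for demonstration; they are not taken from the paper.

```python
def rates_at_threshold(risks, events, threshold):
    """Return (TPR, FPR): the fraction of events and of non-events
    whose predicted risk is at or above the threshold."""
    event_risks = [r for r, e in zip(risks, events) if e == 1]
    nonevent_risks = [r for r, e in zip(risks, events) if e == 0]
    tpr = sum(r >= threshold for r in event_risks) / len(event_risks)
    fpr = sum(r >= threshold for r in nonevent_risks) / len(nonevent_risks)
    return tpr, fpr


# Hypothetical predicted risks and observed outcomes (1 = event).
risks = [0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.60, 0.80]
events = [0, 0, 0, 1, 0, 0, 1, 1]

for threshold in (0.10, 0.20, 0.50):
    tpr, fpr = rates_at_threshold(risks, events, threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each printed row ties a named risk threshold to its classification accuracy, which is the information an ROC curve leaves implicit.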
We demonstrate that the amount of reclassification shown in a risk stratification table is simply a consequence of the extent of correlation between the risks calculated from the two models. Knowing the correlation in risks between two models is of little use; rather, the calibration, capacity for risk stratification, and classification accuracy should be used as metrics for model comparison, all of which can be viewed from the margins of the risk stratification table. When risk categories are not defined in advance, we agree with Dr. Stern that plots can be used to display this information (Pepe, Feng and Gu, 2008).
References
1. Pepe MS, Janes H, Gu JW. Letter to the editor regarding “Special Report, Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction.” Circulation 116:e132, 2007.
2. Huang Y, Pepe MS, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics 63:1181-1188, 2007.
3. Pepe MS, Feng Z, Gu JW. Invited commentary on “Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond.” Statistics in Medicine 27:173-181, 2008.
4. Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. American Journal of Epidemiology 167(3):362-368, 2008.
5. Pepe MS, Janes HE. Gauging the performance of SNPs, biomarkers and clinical factors for predicting risk of breast cancer. Journal of the National Cancer Institute 100(14):978-9, 2008.
6. Huang Y, Pepe MS. A parametric ROC model based approach for evaluating the predictiveness of continuous markers in case-control studies. Biometrics (in press).
None declared
The article by Janes, Pepe, and Gu advocates using risk stratification tables, not using the ROC curve AUC or c-statistic, and valuing models that assign a wide range of risks to individuals assigned a narrow range of risks by another model. A recent review of this topic reached different conclusions (1).
Simply providing a risk distribution curve (frequency vs risk in the population) is a more informative way to present the results of a risk stratification model (1), while a plot of predicted versus observed risk for each decile of risk is a more informative way to present an assessment of a model’s calibration.
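The decile-based calibration assessment mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the data are simulated so that the model is calibrated by construction, and the output is the table of points that would be plotted rather than the plot itself.

```python
import random


def calibration_by_group(risks, events, n_groups=10):
    """Sort individuals by predicted risk, split them into n_groups
    near-equal groups, and return one (mean predicted risk, observed
    event rate) pair per non-empty group."""
    pairs = sorted(zip(risks, events))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        mean_pred = sum(r for r, _ in group) / len(group)
        obs_rate = sum(e for _, e in group) / len(group)
        rows.append((mean_pred, obs_rate))
    return rows


# Simulated (hypothetical) cohort: each event occurs with probability
# equal to the predicted risk, so the model is calibrated by design.
rng = random.Random(0)
sim_risks = [rng.random() for _ in range(1000)]
sim_events = [1 if rng.random() < r else 0 for r in sim_risks]

for pred, obs in calibration_by_group(sim_risks, sim_events):
    print(f"predicted={pred:.2f}  observed={obs:.2f}")
```

For a well-calibrated model, the predicted and observed values in each decile should lie close to one another.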
By choosing only one way to view the ROC curve, the authors are led to reject the ROC curve AUC or c-statistic as a clinically important measure of risk stratification. The authors correctly point out that better models place more participants at the extremes of the risk distribution curve. But it is known that the risk distribution curve determines the ROC curve (2) and the ROC curve AUC is a measure of the dispersion of the risk distribution curve (1,3). From this perspective, the ROC curve AUC is a valid measure of risk stratification.
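The link between the dispersion of the risk distribution and the AUC can be made concrete with a small sketch. It assumes perfect calibration, so that a person with predicted risk r becomes an event with probability r. Under that assumption the risk distribution among events is weighted by r and among non-events by 1 - r, and the AUC is the probability that a random event has a higher risk than a random non-event, counting ties as one half. The function name and the two example distributions are hypothetical.

```python
def auc_from_risk_distribution(risks):
    """AUC implied by a population risk distribution, assuming the
    risks are perfectly calibrated."""
    numerator = 0.0
    for ri in risks:              # candidate event, weight ri
        for rj in risks:          # candidate non-event, weight 1 - rj
            weight = ri * (1 - rj)
            if ri > rj:
                numerator += weight
            elif ri == rj:
                numerator += 0.5 * weight   # ties count one half
    denominator = sum(risks) * sum(1 - r for r in risks)
    return numerator / denominator


# Two hypothetical populations with the same mean risk (0.5) but
# different dispersion.
narrow = [0.4, 0.6]
wide = [0.1, 0.9]

print(f"narrow: AUC={auc_from_risk_distribution(narrow):.2f}")  # 0.60
print(f"wide:   AUC={auc_from_risk_distribution(wide):.2f}")    # 0.90
```

With the mean risk held fixed, spreading the risks toward the extremes raises the implied AUC, and a population in which everyone has the same risk gives 0.5.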
The authors make an important contribution by demonstrating that redistribution in risk stratification tables results from lack of correlation between the risks calculated from two different models. But this calls into question the current interpretation of redistribution, which is that identification of individuals at low, medium, and high risk by one method from a subgroup of individuals identified at medium risk by a different method proves superior risk stratification. When the methods are equivalent (as shown by the margins of the risk stratification table or measures of calibration and discrimination), this redistribution merely reflects the fact that different methods provide different risk estimates for the same individual. When the models differ by a single risk factor, high correlation between estimates may be observed (4). However, when models differ by many risk factors, the lack of correlation between estimates can be dramatic (5).
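A small simulation can illustrate how redistribution follows from imperfect correlation even when two models are equivalent. In this sketch, which is not the authors' analysis, the two models' linear predictors are standard normal with correlation rho, so both models have identical risk distributions (identical table margins). The logistic intercept of -2, the category cutoffs of 0.10 and 0.20, and the sample size are hypothetical choices.

```python
import math
import random


def category(risk, cutoffs=(0.10, 0.20)):
    """Return 0 (low), 1 (medium) or 2 (high) for hypothetical cutoffs."""
    return sum(risk >= c for c in cutoffs)


def reclassified_fraction(rho, n=5000, seed=1):
    """Fraction of individuals placed in different risk categories by
    two models whose linear predictors have correlation rho."""
    rng = random.Random(seed)
    moved = 0
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # z2 has the same N(0, 1) distribution as z1, correlation rho.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        r1 = 1 / (1 + math.exp(-(z1 - 2)))
        r2 = 1 / (1 + math.exp(-(z2 - 2)))
        moved += category(r1) != category(r2)
    return moved / n


for rho in (1.0, 0.95, 0.8, 0.5):
    print(f"rho={rho:.2f}  reclassified={reclassified_fraction(rho):.3f}")
```

Because the two simulated models share the same marginal risk distribution, any reclassification in this sketch reflects only the disagreement between individual estimates, not a difference in risk stratification.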
References
1. Stern RH. Evaluating New Cardiovascular Risk Factors for Risk Stratification. J Clin Hypertension. 2008;10:485-488.
2. Diamond GA. What price perfection? Calibration and Discrimination of Clinical Prediction Models. J Clin Epidemiol. 1992;45:85-89.
3. Cook NR. Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007;115:928-935.
4. McGeechan KM, Macaskill P, Irwig L, Liew G, Wong TY. Assessing New Biomarkers and Predictive Models for Use in Clinical Practice: A Clinician’s Guide. Arch Intern Med. 2008;168:2304-2310.
5. Lemeshow S, Klar J, Teres D. Outcome prediction for individual intensive care patients: useful, misused, or abused? Intensive Care Med. 1995;21:770-776.
None declared