- © 2002 by American Society of Clinical Oncology
Validation Study of the Accuracy of a Postoperative Nomogram for Recurrence After Radical Prostatectomy for Localized Prostate Cancer
- Markus Graefen,
- Pierre I. Karakiewicz,
- Ilias Cagiannos,
- Eric Klein,
- Patrick A. Kupelian,
- David I. Quinn,
- Susan M. Henshall,
- John J. Grygiel,
- Robert L. Sutherland,
- Phillip D. Stricker,
- Jean de Kernion,
- Thomas Cangiano,
- Fritz H. Schröder,
- Mark F. Wildhagen,
- Peter T. Scardino and
- Michael W. Kattan
- From the Departments of Urology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY; Cleveland Clinic, Cleveland, OH; Garvan Institute of Medical Research and St Vincent’s Hospital, Sydney, Australia; University of California Los Angeles, Los Angeles, CA; and Department of Urology, Erasmus University and Academic Hospital, Rotterdam, the Netherlands.
- Address reprint requests to Michael W. Kattan, PhD, Departments of Urology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Ave, C1068, New York, NY 10021; email: kattanm@mskcc.org
Abstract
PURPOSE: A postoperative nomogram for prostate cancer was developed at Baylor College of Medicine. This nomogram uses readily available clinical and pathologic variables to predict 7-year freedom from recurrence after radical prostatectomy. We evaluated the predictive accuracy of the nomogram when applied to patients of four international institutions.
PATIENTS AND METHODS: Clinical and pathologic data of 2,908 patients were supplied for validation, and 2,465 complete records were used. Nomogram-predicted probabilities of 7-year freedom from recurrence were compared with actual follow-up in two ways. First, the area under the receiver operating characteristic curve (AUC) was calculated for all patients and stratified by the time period of surgery. Second, calibration of the nomogram was assessed by comparing the predicted freedom from recurrence with that of an ideal nomogram. For patients in whom the pathologic report does not distinguish between focal and established extracapsular extension (an input variable of the nomogram), two separate calculations were performed assuming one or the other.
RESULTS: The overall AUC was 0.80 when applied to the validation data set, with individual institution AUCs ranging from 0.77 to 0.82. The predictive accuracy of the nomogram was apparently higher in patients who were operated on between 1997 and 2000 (AUC, 0.83) compared with those treated between 1987 and 1996 (AUC, 0.78). Nomogram predictions of 7-year freedom from recurrence were within 10% of an ideal nomogram.
CONCLUSION: The postoperative Baylor nomogram was accurate when applied at international treatment institutions. Our results suggest that accurate predictions may be expected when using this nomogram across different patient populations.
IN LONG-TERM FOLLOW-UP series, disease recurrence after radical prostatectomy (RP) for localized prostate cancer is reported for 15% to 40% of patients.1-3 Early identification of men who are likely to eventually experience disease progression is useful in considering adjuvant therapy. Accurate identification of the risk of disease recurrence would also be particularly useful in clinical trials to ensure comparability of treatment and control groups or to identify appropriate candidates for investigational treatment. Furthermore, accurate delineation of patients at risk of recurrence provides an evidence base for establishment of surveillance intervals and patient counseling. Several predictive tools to identify men at high risk for treatment failure on the basis of postoperative parameters have been published recently.3-11 Although some of these nomograms have been validated either internally3,4,6,10 or on a single external patient cohort,5,11 none of those predictive tools has been validated on a multi-institutional patient cohort to date.
There are numerous pitfalls to nomogram development, and chief among them is external validation failure. A nomogram may not predict well when applied to new patients if the predictor variables are irreproducible, unavailable, or confounded by other factors. If the sample size or follow-up used for nomogram development is inadequate, the estimates may similarly be suboptimal. In addition, the statistical model behind the nomogram may be a poor fit. Therefore, validation in one or, ideally, in several cohorts represents an essential step before a nomogram may safely be implemented in routine clinical practice.12
Herein, we assess the predictive accuracy of a postoperative nomogram4 that uses generally available clinical and pathologic features. The nomogram originally used patients from the same institution for validation, but despite good performance in that cohort, concerns relating to generalizability remain. These relate to homogeneity and similarity of the population at the same institution as well as to similarity in staging, diagnosis, and treatment processes. Further validation of the nomogram in different populations in which diagnostic, staging, and treatment processes may vary is clearly required. We submitted the original Baylor nomogram to further validity testing with data sets from institutions across the United States and overseas.
PATIENTS AND METHODS
Validation data representing men treated with RP were obtained from four institutions: Cleveland Clinic, Ohio (n = 1,174); University of California at Los Angeles (n = 607); Garvan Institute of Medical Research/St Vincent’s Hospital, Sydney, Australia (n = 818); and Erasmus University and Academic Hospital, Rotterdam, the Netherlands (n = 309). Patients treated with neoadjuvant hormonal therapy (n = 319) were excluded because the nomogram is not applicable in these men. Patients with missing pretreatment prostate-specific antigen (PSA) values (n = 32), missing Gleason sum in the specimen (n = 3), missing level of established extracapsular extension (ECE; n = 83), or missing margin status (n = 7) were excluded from analysis, leaving 2,465 patients (95.2% of patients without neoadjuvant therapy initially provided and applicable for validation) in the multi-institutional validation data set. In 250 men from the Cleveland Clinic, no lymph node (LN) dissection was performed, because of favorable preoperative findings. These men were considered as LN negative. Clinical stage in all centers was assigned by using the 1992 American Joint Committee on Cancer tumor-node-metastasis classification. PSA failures were defined by each center individually and ranged from 0.2 to 0.4 ng/mL and higher. No centralized review of pathology was performed.
Data Required for the Nomogram
For prediction, this nomogram4 requires pretreatment PSA level, Gleason sum in the prostatectomy specimen, levels of prostatic capsular invasion (PCI), surgical margin status, seminal vesicle invasion, and LN status (Fig 1). Level of PCI, with respect to the stroma of the prostate, prostatic capsule, and periprostatic soft tissue, was classified as previously published.13 Patients were observed until the first evidence of treatment failure, which could be demonstrated by increasing PSA, clinically detectable disease, or need for a second treatment. Patients who were treated before RP (neoadjuvant hormonal or radiation therapy) were excluded when the nomogram was developed. Men who received adjuvant hormonal therapy or radiotherapy (but before documented recurrence) were regarded as treatment failures at the time of the second therapy.
Validation Data
A total of 2,465 patients met the derivation criteria. Table 1 lists the clinical and pathologic characteristics of patients included in this validation data set and those of the original patients from Baylor College of Medicine used to develop the nomogram. Table 2 lists follow-up status and characteristics for the cohorts. In the databases from the University of California at Los Angeles, Cleveland Clinic, and Rotterdam, the level of PCI does not distinguish between focal and established ECE. Statistical analyses were performed assuming the worst-case (all ECEs considered established ECE) and best-case (all ECEs considered focal ECEs) ECE scenarios to explore whether the nomogram might be applicable to patient series in which focal and established ECE are not distinguished.
Statistical Methods
We performed receiver operating characteristic curve analysis comparing the nomogram-predicted 7-year probability of freedom from recurrence with the actual follow-up. Because the data are censored, the traditional area under the receiver operating characteristic curve (AUC) is problematic,14 and the version of Harrell et al15 was calculated. Nonetheless, its interpretation is similar. The AUC is the probability that, given two randomly drawn patients, the patient whose disease recurs first had a higher predicted probability of recurrence. Note that the calculation assumes that the patient with the shorter follow-up had the first recurrence. If both patients’ disease recurs at the same time or if the patient with nonrecurrent disease has a shorter follow-up, the probability does not apply to that pair of patients. AUC was assessed overall and for each center individually. To examine a possible change in predictive accuracy over time, AUCs were calculated for the patients stratified by the time period of surgery.
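The pairwise definition above can be illustrated with a minimal Python sketch of a concordance index for censored data. This is not the software used in the study; the function name `harrell_c`, the argument names, and the toy data in the usage note are hypothetical. Two simplifications are assumptions of this sketch: every pair with identical follow-up times is skipped, and a pair with identical predictions counts as half concordant.

```python
from itertools import combinations


def harrell_c(times, events, risk):
    """Concordance index for right-censored follow-up data.

    times  : follow-up time of each patient
    events : 1 if recurrence was observed, 0 if the patient was censored
    risk   : predicted probability of recurrence (higher means worse)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b]:
            continue  # identical follow-up times: pair is not usable (assumption)
        if not events[a]:
            continue  # shorter follow-up is censored: pair is not usable
        usable += 1
        if risk[a] > risk[b]:
            concordant += 1.0  # earlier recurrence had the higher predicted risk
        elif risk[a] == risk[b]:
            concordant += 0.5  # tied predictions count as half (assumption)
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

For example, with hypothetical follow-up times `[1, 3, 5, 7]`, event indicators `[1, 1, 0, 0]`, and predicted risks `[0.9, 0.4, 0.5, 0.1]`, five of the six pairs are usable and four of those five are concordant.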
Subsequently, by using the Kaplan-Meier method, we determined the actuarial probability of recurrence at 7 years after RP. Calibration of the nomogram was assessed by comparing nomogram predictions of recurrence with actuarial recurrence for the 2,465 patients. All statistical tests performed were two sided.
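As a rough illustration of the Kaplan-Meier step, the sketch below computes the product-limit estimate of freedom from recurrence at a chosen time point. The function name `kaplan_meier`, its arguments, and the toy data in the usage note are hypothetical and are not taken from the study; the actuarial probability of recurrence is one minus the returned value.

```python
def kaplan_meier(times, events, horizon):
    """Kaplan-Meier estimate of freedom from recurrence at `horizon`.

    times   : follow-up time of each patient
    events  : 1 if recurrence was observed, 0 if the patient was censored
    horizon : time point of interest (for example, 7 years)
    """
    # Distinct times, up to the horizon, at which a recurrence was observed.
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    surv = 1.0
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Recurrences observed exactly at time t.
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - failed / at_risk
    return surv
```

With hypothetical follow-up times `[1, 2, 3, 4, 5]` and event indicators `[1, 0, 1, 0, 1]`, the estimate at a horizon of 4 is the product (1 - 1/5) × (1 - 1/3), because censored patients leave the risk set without lowering the curve.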
RESULTS
The 7-year freedom from recurrence for the 2,465 patients used for the validation of the nomogram was 70% (95% confidence interval, 66% to 73%). At 7 years after RP, 121 men were at risk for treatment failure. The AUC for the entire data set was 0.80. The individual institutions’ AUCs were 0.77, 0.78, 0.79, and 0.82. The AUC did not differ on the basis of whether the worst or best scenario for extracapsular extension was used. Predictive accuracy of the nomogram seemed higher in patients who were operated on between 1997 and 2000 (n = 1,333; AUC, 0.83), compared with men treated between 1987 and 1996 (n = 1,132; AUC, 0.78).
Figure 2 illustrates how the predictions of the nomogram compare with actual outcomes for the 2,465 men. The x-axis is the prediction calculated with use of the nomogram, and the y-axis is the observed 7-year freedom from recurrence for the patients in the validation cohort as estimated by the Kaplan-Meier method. The dashed line represents the performance of an ideal nomogram, where predicted outcome would correspond perfectly with actual outcome. The performance of the Baylor nomogram is plotted as the solid line. The dotted lines represent a 10% margin of error, which was proposed in the original nomogram publication.4 The solid line is close to the dashed line of the ideal nomogram and is always within the 10% margin of error. This correspondence between actual and ideal nomogram predictions suggests good calibration of the nomogram in the validation cohort.
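One simple way to check calibration of this kind is sketched below in Python. It is an illustration only: the article does not describe how the calibration curve in Figure 2 was constructed, so grouping patients by predicted probability is an assumption of this sketch, and the names `km_survival` and `calibration_table` are hypothetical. Each group's mean predicted freedom from recurrence is compared with its Kaplan-Meier estimate, and the group is flagged as being within the 10% margin or not.

```python
def km_survival(times, events, horizon):
    """Kaplan-Meier estimate of freedom from recurrence at `horizon`."""
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    surv = 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - failed / at_risk
    return surv


def calibration_table(predicted, times, events, horizon=7.0, n_groups=4, margin=0.10):
    """Compare predicted and observed freedom from recurrence by group.

    predicted : predicted probability of freedom from recurrence at `horizon`
    Returns one (mean predicted, observed, within margin) tuple per group.
    """
    # Sort patients by prediction and cut them into roughly equal groups.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    rows = []
    for g in range(n_groups):
        idx = order[g * len(order) // n_groups:(g + 1) * len(order) // n_groups]
        if not idx:
            continue  # fewer patients than groups
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = km_survival([times[i] for i in idx],
                               [events[i] for i in idx], horizon)
        rows.append((mean_pred, observed, abs(mean_pred - observed) <= margin))
    return rows
```

A well-calibrated model yields rows whose third element is `True` across the whole range of predictions; the number of groups trades resolution against the stability of each Kaplan-Meier estimate.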
DISCUSSION
In the initial derivation study of the Baylor nomogram, Kattan et al4 validated their predictive tool by using a separate sample of 332 men operated on by five separate surgeons from the same institution. With this cohort, the nomogram performed with high predictive accuracy (AUC, 0.89). Considering the homogeneity and similarity of this internal validation cohort as well as similarities in staging, diagnosis, and treatment processes, it is not surprising that predictive accuracy decreased when applied to the multi-institutional validation data set described here. Nonetheless, the observed AUC of 0.80 still represents a high level of predictive accuracy. Although the AUC seems to be improving with time (ie, the more recently treated patients were predicted more accurately), this may be a statistical artifact. Additional follow-up and simulation are needed.
The individual treatment centers differed with respect to patient selection, extracapsular extension measurement, and follow-up assessment (Table 1). Furthermore, centralized review of pathology was not performed. For the purposes of nomogram validation, such heterogeneity is desirable to gain insight into how the nomogram will perform across varied settings. The nomogram was consistently accurate at all four centers, with a range in AUC from 0.77 to 0.82. Further assessment in a community hospital setting might be valuable.
The predicted treatment failure probability from the nomogram was within 10% of the observed treatment failure probability throughout the spectrum of predictions (Fig 2). The nomogram seemed especially accurate when predicting outcomes for patients with a high likelihood of recurrence. Accuracy in these men is important because they are potential candidates for adjuvant therapy. However, the level of risk that should prompt the administration of therapy is controversial, and there is no good agreement on it. We therefore suggest discussion between patient and physician regarding benefits and consequences, centered around risk estimations.
In our validation study, we focused on a postoperative nomogram that uses readily available variables and predicts outcome at 7 years after RP. Other nomograms predict over a shorter time frame7,8 and thus are limited in their ability to judge ultimate treatment failure. Investigators have demonstrated that postoperative PSA may remain undetectable for up to 5 or 10 years before biochemical recurrence.2,16 Nomograms that include nonestablished markers were not evaluated simply because some of these features, such as p53,9 bcl-2,9 volume,7 or percentage8 of high-grade cancer, are not routinely measured and were therefore not available in our validation cohort. Even if those markers might enhance predictive ability above established parameters, applicability was another important concern in our validation study.
Other nomograms are available for patients with long-term follow-up in the postoperative setting. One of the earliest models was by Partin et al,5 for predicting 5-year freedom from recurrence in clinical stage T2b/T2c patients. This risk stratification scheme was derived by using 216 men with clinical stage T2b and T2c prostate cancer who were treated by a single urologist. In a validation cohort of 214 patients treated by one of three different urologists at two institutions, they were able to illustrate that the model was able to stratify those patients as well, on the basis of their Kaplan-Meier recurrence-free survival rates. However, no statistical testing of strata differences was performed in that study. This predictive tool would have been applicable to a subset of our patients (21.6%), except that it is not applicable for patients treated with adjuvant therapy.
Blute et al6 recently published a nomogram based on 2,000 men by using Gleason sum, margin status, seminal vesicle involvement, and a categorized pretreatment PSA level to assign a risk score for predicting disease-free survival 5 years after surgery. For patients who already had received adjuvant therapy, points from that risk score were subtracted. This nomogram was validated with the data of 518 randomly drawn men out of the initial 2,518 patients in their data set, with an AUC of 0.76.
We were unable to compare the accuracy of the Baylor nomogram with this predictive tool because they are used for different purposes. As previously reported,4,6 statistical considerations for patients who received adjuvant therapy are complex. Any second treatment has the potential to mask a recurrence, which would falsely decrease the recurrence rate. However, simply omitting these patients might eliminate a major fraction of all the recurrences, also biasing the recurrence rates. Furthermore, the true effect of a variable can be masked when patients with adjuvant therapy who have a higher incidence of pathologic features with adverse association to prognosis are excluded.6 The motivation for the Baylor nomogram was to identify patients who might benefit from early adjuvant therapy. To be useful, this nomogram should identify these patients before they experience recurrence. Therefore, the Baylor nomogram was designed to use information that is available immediately postoperatively (ie, pathologic features of the specimen). Because the purpose of the tool was to identify patients who might benefit from adjuvant therapy, we considered this group to comprise patients whose surgery failed and those who were deemed to be at such high risk of treatment failure that they received adjuvant therapy before evidence of recurrence. Therefore, for our purposes, it did not make sense to include adjuvant therapy as a nomogram predictor variable because patients who potentially benefit from second therapy are the patients we wanted to identify.
For other purposes, the nomogram of Blute et al6 models the probability of recurrence after adjusting for the administration of adjuvant therapy. With this tool, adjuvant therapy is a predictor of outcome but not a treatment failure end point. The tool published by Partin et al5 models treatment failure after excluding the adjuvant-therapy patients and would therefore be useful for that purpose. Because the nomograms use different subsets of patients and require the end point to be coded differently, they are difficult to compare on matching data sets.
In conclusion, it seems that the Baylor postoperative nomogram provides reasonably accurate predictions regardless of minor variations in PSA recurrence definitions and pathologic assessment. Software versions for the Palm and Windows platforms are available free of charge from http://www.nomograms.org.
Acknowledgments
Supported by grant no. RPG-00-202-01-CCE from the American Cancer Society, the Deutsche Krebshilfe, grant no. GR 1866/1-1 from the Deutsche Forschungsgemeinschaft, the National Health and Medical Research Council (NHMRC) of Australia, New South Wales Cancer Council, R.T. Hall Trust, Freedman Foundation, St Vincent’s Clinic Foundation, St Vincent’s Hospital Foundation, and the Leon Lowenstein Foundation, Inc. P.I.K. was supported in part by the American Foundation for Urologic Diseases, National Cancer Institute of Canada, and Medical Research Council of Canada. D.I.Q. is the recipient of an NHMRC Neil Hamilton Fairley Postdoctoral Fellowship and the Vincent Fairfax Family Foundation Fellowship from the Royal Australasian College of Physicians.
- Received May 21, 2001.
- Accepted October 5, 2001.