Use of Modeling to Evaluate the Cost-Effectiveness of Cancer Screening Programs

G. Scott Gazelle

From the Harvard Medical School and Massachusetts General Hospital, Boston, MA

Address reprint requests to G. Scott Gazelle, MD, MPH, PhD, Institute for Technology Assessment, 101 Merrimac St 10th floor, Boston, MA 02114; e-mail: scott{at}mgh-ita.org

Abstract

Cost-effectiveness analysis (CEA) is an analytic tool that provides a framework for comparing the health benefits and resource expenditures associated with competing medical and public health interventions, thereby allowing decision makers to identify interventions that yield the greatest amount of health, given their resource constraints. Models are important components of most, if not all, CEAs, and they play a key role in evaluating the cost-effectiveness of cancer screening programs, in particular. In this article, we describe the basic types of models used to evaluate cancer screening programs and provide examples of the use of models in CEAs and to guide cancer screening policy. Finally, we offer some suggestions for important concepts to consider when interpreting model results.

INTRODUCTION

In 2006, approximately 565,000 Americans will die from cancer, making cancer second only to heart disease as the leading cause of death in the United States.1 Mass screening of asymptomatic individuals can diagnose some cancers at earlier stages when treatments are more effective. However, cancer deaths are prevented in only a small fraction of the total population screened; the vast majority of individuals undergoing cancer screening receive no benefit and may in fact be exposed to additional health risks as a result of screening. These risks arise from complications from the screening procedure that result in hospitalization or even death (eg, perforation during a screening colonoscopy), from false-positive test results that trigger unnecessary invasive follow-up procedures (eg, biopsy), and from the treatment of cancers that—in the absence of screening—would neither have been detected in the patient's lifetime nor caused death (ie, overdiagnosis). Mass screening may therefore require society to trade off large benefits to a few against small risks to many.

Screening large numbers of people is a costly undertaking. For example, Burnside et al2 estimate that in the year 2000, the total costs of mammography screening for breast cancer and the associated work-up of positive findings were approximately $4.5 billion. Given that total US expenditures on health care3 doubled over the period from 1993 to 2004 and are projected4 to continue to rise through 2014, payers and policymakers are increasingly interested in the efficient allocation of medical care resources, including those used for cancer screening programs.

As discussed elsewhere in this issue, cost-effectiveness analysis (CEA) is an analytic tool that provides a framework for comparing the health benefits and resource expenditures associated with competing medical and public health interventions, thereby allowing decision makers to identify interventions that yield the greatest amount of health, given their resource constraints. Models are important components of most, if not all, CEAs, and they play a key role in evaluating the cost-effectiveness of cancer screening programs, in particular.

In this article, we describe the basic types of models used to evaluate cancer screening programs and provide examples of the use of models in CEAs and to guide cancer screening policy. Finally, we offer some suggestions for important concepts to consider when interpreting model results.

ROLE OF MODELS IN THE EVALUATION OF CANCER SCREENING PROGRAMS

While randomized trials provide the strongest evidence of the efficacy of screening, a number of factors prevent them from providing all of the information necessary to inform decisions about cancer screening, particularly those regarding cost-effectiveness. The size and expense of screening trials place practical limits on the number of screening strategies evaluated, the duration of follow-up, and the populations included. Given these limitations, results from models are often used in conjunction with data from trials and other clinical and epidemiologic studies to inform decisions regarding cancer screening.

A trial may demonstrate that a particular screening methodology yields a more favorable stage distribution of diagnosed cancers compared with an alternative strategy (often a no screening strategy), but longer follow-up is required to generate mortality results critical for decision makers. For example, the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial5 began enrollment in 1992, and follow-up of participants will continue through 2017. Decisions about whether and how to screen for cancer must be made and cannot always be delayed until mortality data are available. Some degree of modeling is virtually always required to translate intermediate trial outcomes (eg, changes in stage distribution of detected cancers) to the long-term end points (ie, number of life-years or quality-adjusted years of life saved) needed for CEA.

Similarly, randomized screening trials often cannot determine whether the screening interval evaluated and the follow-up procedure employed are the optimal interval and procedure, or whether similar findings would be observed among different populations or in routine medical care settings. Data to support comprehensive CEAs are virtually impossible to derive from a single trial; due to selected populations and trial sites, trial findings and economic analyses performed alongside them may not be generalizable to other populations and settings. Models synthesize the available evidence and predict the clinical and/or economic outcomes of interest (eg, long-term cancer incidence and mortality rates, life expectancy, and lifetime costs) under alternative strategies for screening for the disease.

Furthermore, with the current pace of improvement in imaging technologies and treatments, as well as in our understanding of the genetics of cancer, trial results may not be relevant by the time they are available (ie, the moving target problem). In addition, a screening test may undergo rapid dissemination throughout the population before its efficacy has been demonstrated in clinical trials, thereby raising both ethical and practical concerns about recruiting individuals to a trial involving a no screening arm.

In addition to CEA, models have other uses in the realm of cancer screening. As demonstrated by the randomized trials of mammography screening for breast cancer,6-9 trials may yield conflicting estimates of the efficacy of the intervention of interest. Models may be used to reconcile these results.10,11 Models may also be used to explain observed trends in cancer mortality,12 estimate the mean lead time and the magnitude of overdiagnosis due to screening,13-15 and provide insights into important facets of the disease processes.16,17 For example, detailed simulation modeling suggests that some breast tumors have limited malignant potential.16 Finally, models may be used to guide screening policy18; the American Cancer Society (ACS) based its guidelines on the age at which to stop screening for cervical cancer and on the intervals between repeat liquid-based cytology in part on results of modeling studies.19 Results from models also informed the ACS guideline for mammography screening intervals.20

Although incredibly powerful, models have limitations. Because most cancer screening models attempt to simulate unobservable phases of the disease process, assumptions are a necessary component of models. Models also necessitate assumptions about the extent of heterogeneity in the population at risk and the characteristics and performance of the screening tests in routine clinical settings. If models are to be useful in guiding cancer screening policy, all assumptions must be stated explicitly and the implications of alternative assumptions should be discussed or, better yet, evaluated with the model.

TYPES OF MODELS USED IN CEAs OF CANCER SCREENING PROGRAMS

Cancer screening models may be classified on a number of dimensions. For example, they may be grouped by whether they simulate events among a single cohort of individuals over their lifetimes or among the entire population over a specified period of time, whether events are deterministic or stochastic in nature, or whether events occur in continuous versus discrete time. While these differences are important, a more meaningful distinction for the purposes of assessing the cost-effectiveness of cancer screening programs is whether a model is an empirically based shallow model or a more biologically based deep model.21,22

Shallow models reproduce observable outcomes (eg, cancer incidence and/or mortality) but do not specify the processes that generate those outcomes. For example, a shallow model might track a tumor from the time at which it is first detectable by screening, rather than simulating a tumor from onset and modeling its growth over time. Similarly, a shallow model may assign the probability of death using stage-specific survival estimates from tumor registries (eg, Surveillance, Epidemiology and End Results), rather than by explicitly modeling the effects of treatment and simulating recurrence and metastasis.

A stage-shift model is an example of a shallow model that has been used to assess the cost-effectiveness of a cancer screening program. In a stage-shift model, the effect of screening is modeled by shifting the cancer diagnosis to either a less advanced stage (ie, external stage shift) or earlier within the same stage (ie, internal stage shift), which may lead to life expectancy gains, differences in treatment-related costs, and ultimately an incremental cost-effectiveness ratio (ICER). However, because screening trial results may be subject to lead-time bias, length-time bias, and overdiagnosis bias,23-25 screen-detected cases of a given stage do not necessarily have the same prognosis as nonscreen-detected cases of the same stage. For example, it would be incorrect to attribute any gain in life expectancy to the detection of an overdiagnosed case. To account for these biases, stage-shift models should incorporate estimates of the magnitude of each screening bias as model inputs, and assess the impact of the estimates on model predictions via sensitivity analyses. Unfortunately, there are few empiric data on which to base such estimates.
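The mechanics of an external stage shift can be illustrated with a minimal sketch. All stage distributions, life expectancies, and costs below are hypothetical placeholders rather than values from any study cited here, and the lead-time adjustment (subtracting a fixed number of years from the survival of every cancer in the screened arm) is a deliberately crude assumption.

```python
def expected(dist, values):
    """Expectation of a stage-specific quantity over a stage distribution."""
    return sum(dist[stage] * values[stage] for stage in dist)


def stage_shift_icer(dist_no_screen, dist_screen, life_years, treat_cost,
                     cancers_per_person, cost_per_screen, lead_time=0.0):
    """ICER (cost per life-year saved) of screening vs no screening.

    Screening changes only the stage distribution at diagnosis. lead_time
    (years) is subtracted from survival in the screened arm so that earlier
    diagnosis alone is not counted as longer life.
    """
    ly_gain = cancers_per_person * (
        expected(dist_screen, life_years) - lead_time
        - expected(dist_no_screen, life_years))
    extra_cost = cost_per_screen + cancers_per_person * (
        expected(dist_screen, treat_cost) - expected(dist_no_screen, treat_cost))
    if ly_gain <= 0:
        raise ValueError("screening yields no life-year gain; ICER undefined")
    return extra_cost / ly_gain


# Hypothetical inputs, for illustration only.
DIST_NO_SCREEN = {"local": 0.20, "regional": 0.30, "distant": 0.50}
DIST_SCREEN = {"local": 0.60, "regional": 0.25, "distant": 0.15}
LIFE_YEARS = {"local": 8.0, "regional": 3.0, "distant": 1.0}
TREAT_COST = {"local": 30000.0, "regional": 50000.0, "distant": 40000.0}

for lt in (0.0, 1.0):
    icer = stage_shift_icer(DIST_NO_SCREEN, DIST_SCREEN, LIFE_YEARS, TREAT_COST,
                            cancers_per_person=0.02, cost_per_screen=300.0,
                            lead_time=lt)
    print(f"lead time {lt:.0f} y: ICER = ${icer:,.0f} per life-year saved")
```

Varying `lead_time` in this way is the kind of sensitivity analysis described above; overdiagnosis and length bias would need analogous, separately justified adjustments.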

In contrast to shallow models, deep or biologically based models incorporate hypotheses about the nature of the underlying disease processes. This typically means that deep models simulate the unobservable natural history of cancer, including tumor onset, growth, and metastasis, as well as the mechanism by which a screening test detects preclinical disease. Survival is modeled as a function of true disease characteristics: occult metastases reduce survival, whereas a person with a truly indolent cancer would have the same survival as a disease-free person. With deep models, factors such as lead time and overdiagnosis are model outputs, rather than model inputs.
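To illustrate how lead time and overdiagnosis emerge as outputs rather than inputs, the following is a toy microsimulation, not a reconstruction of any model cited in this article. The distributions for age at tumor detectability, preclinical duration, and other-cause death, and the test sensitivity, are all hypothetical assumptions chosen only to make the sketch run.

```python
import random


def simulate_one_time_screen(n, screen_age, sensitivity=0.8, seed=0):
    """Simulate n individuals offered a single screen at screen_age."""
    rng = random.Random(seed)
    lead_times = []
    screen_detected = 0
    overdiagnosed = 0
    for _ in range(n):
        onset = rng.uniform(40, 90)           # age tumor becomes detectable (assumed)
        sojourn = rng.expovariate(1 / 5.0)    # preclinical duration, mean 5 y (assumed)
        clinical_age = onset + sojourn        # age at symptomatic diagnosis
        other_death = rng.gauss(80, 10)       # age at death from other causes (assumed)
        if other_death <= screen_age:
            continue                          # died before the screen
        preclinical = onset <= screen_age < clinical_age
        if preclinical and rng.random() < sensitivity:
            screen_detected += 1
            if other_death < clinical_age:
                overdiagnosed += 1            # would never have surfaced clinically
            else:
                lead_times.append(clinical_age - screen_age)
    return {
        "screen_detected": screen_detected,
        "overdiagnosis_fraction": (overdiagnosed / screen_detected
                                   if screen_detected else None),
        "mean_lead_time": (sum(lead_times) / len(lead_times)
                           if lead_times else None),
    }


for age in (55, 75):
    print(age, simulate_one_time_screen(20000, age, seed=1))
```

Neither the lead time nor the overdiagnosis fraction is specified anywhere in the inputs; both follow from the assumed natural history and competing mortality.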

Because deep models attempt to characterize an unobservable process, there are limited data from which to estimate the natural history parameters. Plausible values of these parameters must be inferred by constraining the model to yield predictions that are consistent with observable end points, such as cancer incidence and mortality in the absence of screening and, for cancers with precancerous lesions, the prevalence and characteristics of such lesions. This process is called model calibration. A calibrated deep model must also be validated by evaluating how well the model predicts data that were not used in the calibration. Validation often entails simulating the population of a screening trial or observational study that was not used in the estimation or calibration of the model and comparing the model predictions against the study results.
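Calibration followed by validation can be sketched with a deliberately simple one-parameter example. It assumes the steady-state approximation that preclinical prevalence is roughly incidence multiplied by mean preclinical duration; the incidence rates, prevalence targets, held-out data point, and grid are all hypothetical, and real deep models calibrate many parameters against many targets.

```python
# Hypothetical calibration targets by age group (not from any cited study).
INCIDENCE = {"50-59": 0.001, "60-69": 0.002, "70-79": 0.004}
OBSERVED_PREVALENCE = {"50-59": 0.0041, "60-69": 0.0079, "70-79": 0.0162}


def predicted_prevalence(incidence, mean_sojourn):
    """Toy natural history model: prevalence ~ incidence x mean duration."""
    return incidence * mean_sojourn


def calibrate(incidence, observed, grid):
    """Return the grid value minimizing squared error against the targets."""
    def sse(mean_sojourn):
        return sum(
            (predicted_prevalence(incidence[age], mean_sojourn) - observed[age]) ** 2
            for age in observed)
    return min(grid, key=sse)


GRID = [x / 10 for x in range(5, 101)]        # candidate mean sojourn times, 0.5-10 y
best = calibrate(INCIDENCE, OBSERVED_PREVALENCE, GRID)

# Validation: compare a prediction with data NOT used in the calibration.
HELD_OUT_INCIDENCE, HELD_OUT_PREVALENCE = 0.003, 0.0125   # hypothetical
prediction = predicted_prevalence(HELD_OUT_INCIDENCE, best)
relative_error = (prediction - HELD_OUT_PREVALENCE) / HELD_OUT_PREVALENCE

print(f"calibrated mean sojourn time: {best} y")
print(f"validation relative error: {relative_error:+.1%}")
```

The key design point is the separation of the two data sets: the held-out observation plays no role in choosing the parameter.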

Comparing the two model types, it is clear that a shallow model, while simpler to implement, has shortcomings that make evaluation of many interesting scenarios difficult or impossible. For example, to use a stage-shift model the population must be like that in a past screening trial; the model could not estimate the cost-effectiveness of screening a population with different risk factors or of a different age group because the disease progression would likely be completely different. More challenging is the moving target problem. There is no way to accommodate new screening methods or changes in test sensitivity and specificity in a stage-shift model without a new trial. As screening strategies evolve to incorporate new genetic (and soon, genomic and proteomic) profiles, shallow models that do not account for individual variation in risk will be less useful.

Deep models are more comprehensive and can more easily be updated to incorporate new information as it becomes available, including new screening methods and improved performance of existing tests. Furthermore, because of their comprehensive nature, deep models can be used to address a broad range of policy issues in addition to those pertaining to the cost-effectiveness of screening. However, deep models take longer to develop and to calibrate, may be less transparent or understandable to clinicians and policy-makers, and are subject to identifiability problems (ie, when multiple parameter sets provide a reasonable fit with the observed data).
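The identifiability problem can be made concrete with a small sketch under hypothetical assumptions: suppose a model has two unobservable parameters, the annual tumor onset rate and the fraction of tumors that ever progress to clinical disease, while the only calibration target is observed clinical incidence (here taken to be their product). The target value, grid, and tolerance are illustrative only.

```python
TARGET_CLINICAL_INCIDENCE = 0.002   # hypothetical observed rate per person-year
TOLERANCE = 0.03                    # accept fits within 3% of the target


def acceptable_fits(onset_rates, progressive_fractions):
    """Return every (onset_rate, progressive_fraction) pair matching the target."""
    fits = []
    for onset in onset_rates:
        for frac in progressive_fractions:
            predicted = onset * frac
            rel_err = abs(predicted - TARGET_CLINICAL_INCIDENCE) / TARGET_CLINICAL_INCIDENCE
            if rel_err < TOLERANCE:
                fits.append((onset, frac))
    return fits


ONSET_GRID = [i / 1000 for i in range(2, 11)]      # 0.002 ... 0.010
FRACTION_GRID = [j / 10 for j in range(1, 11)]     # 0.1 ... 1.0

fits = acceptable_fits(ONSET_GRID, FRACTION_GRID)
for onset, frac in fits:
    print(f"onset rate {onset:.3f}, progressive fraction {frac:.1f}, "
          f"nonprogressive fraction {1 - frac:.1f}")
```

Several parameter sets reproduce the same observed incidence yet imply very different proportions of nonprogressive disease, which is why additional calibration targets or external validation are needed to discriminate among them.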

EXAMPLES OF MODELS OF CANCER SCREENING PROGRAMS

Numerous models have been developed to address issues pertaining to cancer screening policy; a thorough review is beyond the scope of this article. A comprehensive database of published CEAs of cancer screening programs can be found in the Cost-Effectiveness Analysis Registry.26 Below we provide examples of the application of models to address lung, colorectal, breast, and prostate cancer screening.

Lung cancer.

Because lung cancer is the leading cause of cancer death in the United States1 and fewer than 16% of patients are diagnosed at localized stages when survival is higher,27 an effective way to screen for the disease is urgently needed. Three large controlled US trials of chest x-ray and sputum cytology conducted in the 1970s failed to demonstrate a reduction in lung cancer mortality in the screened group.28-31 More recent screening efforts have focused on helical computed tomography (CT). Results from a number of nonrandomized (ie, single arm) studies suggest that helical CT detects more early-stage cancers than would be expected in the absence of screening.32-35 However, a concern raised by these studies is the high percentage of ultimately benign lesions detected by CT, most of which require costly (and potentially invasive) clinical work-ups. Randomized trials of helical CT versus chest x-ray are under way,36,37 with results expected by 2009.

Extrapolating from the baseline results of the Early Lung Cancer Action Project (ELCAP) single-arm trial,32,38 two groups39,40 have developed shallow models to evaluate the cost-effectiveness of a one-time helical CT screen of heavy smokers over age 60. Marshall et al39 assumed that one-time screening with helical CT diagnoses the same number of cancers as observed among an unscreened population, but that the stage distribution of those cancers is shifted. In base-case analyses, the authors assumed that screen-detected cancers have the same stage-, age-, and sex-specific survival as those diagnosed by symptoms, and in sensitivity analyses, they evaluated the impact of a 1-year lead-time bias. They conclude that compared with no screening, one-time helical CT screening among this high-risk population has an ICER of approximately $5,900 per year of life saved in the absence of a lead-time bias and of approximately $15,000 with a lead-time bias of 1 year. The model did not include costs related to complications from biopsies, nor did it include the increased risks of death and associated costs from the numerous diseases common in older smokers, such as cardiovascular disease and other cancers.41-44 Including these competing risks would decrease the life-years saved by effective lung cancer screening, and thereby increase the ICER for screening (ie, make it less cost-effective). The model did not consider the potential for overdiagnosis or length bias or evaluate the impact of a lead-time bias of more than 1 year, all of which would make screening appear less favorable. Wisnivesky et al40 assumed stage-specific lead times ranging from 1.5 years for stage I cancers to 4.5 years for cancers diagnosed at stage IV. They conclude that one-time helical CT among heavy smokers older than age 60 has an ICER of $2,500 per year of life saved, compared with no screening.
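The direction of the competing-risk effect described above can be checked with a short calculation. The sketch assumes constant (exponential) mortality hazards, so that life expectancy is the reciprocal of the total hazard; the hazard values and the incremental cost are hypothetical and are not taken from any of the cited analyses.

```python
def life_expectancy(cancer_hazard, other_hazard):
    """Mean survival (years) under constant competing hazards."""
    return 1.0 / (cancer_hazard + other_hazard)


def life_years_gained(late_hazard, early_hazard, other_hazard):
    """Gain from shifting a cancer from a late-stage to an early-stage hazard."""
    return (life_expectancy(early_hazard, other_hazard)
            - life_expectancy(late_hazard, other_hazard))


LATE, EARLY = 0.5, 0.1            # hypothetical annual cancer mortality hazards
INCREMENTAL_COST = 20000.0        # hypothetical cost per cancer shifted

gain_low = life_years_gained(LATE, EARLY, other_hazard=0.03)   # general population
gain_high = life_years_gained(LATE, EARLY, other_hazard=0.06)  # older heavy smokers

for label, gain in (("low competing risk", gain_low),
                    ("high competing risk", gain_high)):
    print(f"{label}: {gain:.2f} life-years gained, "
          f"ICER = ${INCREMENTAL_COST / gain:,.0f} per life-year")
```

With the same stage shift and the same cost, the higher other-cause hazard yields fewer life-years gained and therefore a higher ICER.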

Mahadevia et al45 improved on the analyses above by increasing competing mortality rates for smokers (although the article provides insufficient detail to determine the exact adjustment used) and by doubling lung cancer incidence in the base case as a crude adjustment for overdiagnosis bias. As expected, they report higher (less favorable) cost-effectiveness ratios (base-case estimate of $116,300 per quality-adjusted life-year [QALY] for current smokers) than Marshall et al39 and Wisnivesky et al.40 However, Mahadevia et al45 used the same 1-year lead-time adjustment for screen-detected cancers as Marshall et al39 and included no discussion of calibration or validation of the model.

In contrast to the shallow stage-shift models described herein, we are developing a deep model of lung cancer to evaluate proposed and hypothetical screening programs46 (grant No. R01 CA 97337, principal investigator, G.S.G.). The Lung Cancer Policy model (LCPM) is a comprehensive microsimulation model of lung cancer, populated with individuals assigned smoking histories representative of a specified age-sex-race cohort of the US population. Competing mortality risks are a function of age, sex, race, and smoking status.47 The underlying natural history model simulates lung cancer development, growth, and metastasis, rather than transitions between disease stages (eg, local to regional), avoiding the screening biases inherent in stage-shift models. Benign lesions, which cause high positivity rates on screening examinations, are also simulated. Clinical algorithms for staging and follow-up are modeled explicitly. Survival in the LCPM depends on the underlying disease state and the treatment received.

The natural history component of the LCPM has been calibrated to observed characteristics of incident cancers (rates, cell type, stage, and size) and survival curves,46 and we are currently validating the model against screening trial data. When the validation is complete, we will use the LCPM to extrapolate the lifetime costs and health benefits from soon-to-be-available interim trial results, make comparisons not included in ongoing trials (eg, annual v biennial screening), and provide insight into questions that may linger despite the trials. For example, ongoing trials employ different clinical algorithms for follow-up of indeterminate lesions and for staging and treating lung cancers. The LCPM will be able to assess the influence of clinical algorithms on the effectiveness of screening and help interpret any inconsistent trial results. It will also be used to assess the effectiveness and cost-effectiveness of screening lighter smokers or those with different compliance rates. None of these issues could be addressed with a shallow model.

Colorectal cancer.

The ACS recommends that individuals at average risk of colorectal cancer begin screening for the disease at age 50 with one of the following screening strategies: annual fecal occult blood testing (FOBT) or fecal immunochemical testing; sigmoidoscopy every 5 years; annual FOBT or fecal immunochemical testing and sigmoidoscopy every 5 years; double-contrast barium enema every 5 years; or colonoscopy every 10 years.48 However, to date, only the efficacy of annual and biennial FOBT has been demonstrated in randomized clinical trials.49-54 Given the lack of comparative clinical trials, there is limited direct evidence on how a decision maker should rank FOBT relative to other available strategies, such as colonoscopy or sigmoidoscopy, in terms of mortality reductions and cost-effectiveness ratios. Furthermore, it is unclear whether other strategies that vary the age at first screen and the frequency of repeat screening may yield greater mortality reductions or more favorable cost-effectiveness ratios. Similarly, existing trials offer little evidence on how differential adherence rates affect the relative mortality reductions from these strategies.

Several models have been developed to specifically address these issues.55-61 For example, Frazier et al55 developed a deep model of the natural history of colorectal cancer and used it to evaluate the lifetime risks of developing and of dying from colorectal cancer, the lifetime costs, life expectancy, and ICERs for 22 alternative strategies for colorectal cancer screening. The strategies varied in terms of the type of screening test and its frequency, as well as in the protocol for follow-up of positive findings (for sigmoidoscopy strategies). Such a comprehensive assessment is unlikely to be feasible in a clinical trial setting. The authors found that compared with no screening, all of the screening strategies reduced colorectal cancer incidence and mortality. Assuming a 60% adherence rate for screening and an 80% adherence rate with follow-up and surveillance tests, seven of 22 screening strategies were efficient (ie, yielded greater gains in life expectancy at a lower cost per unit than an alternative strategy or combination of strategies). These efficient strategies included one of the strategies included in current guidelines,48,62,63 namely annual FOBT and sigmoidoscopy screening every 5 years (with referral of all individuals with an adenoma detected by sigmoidoscopy for follow-up colonoscopy). This strategy had an ICER of $51,200 per year of life saved with unrehydrated FOBT and $92,900 with rehydrated FOBT. Two other recommended strategies—annual FOBT alone and colonoscopy screening every 10 years—were not among the efficient options. A strategy of one-time sigmoidoscopy screening at age 55 years (with referral of individuals with an adenoma 10+ mm or an adenoma with villous histology for follow-up colonoscopy) had the lowest ICER at $1,200 per year of life saved. This ratio increased to $11,000 per year of life saved if all individuals with adenomas detected by sigmoidoscopy are referred for follow-up colonoscopy.
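The notion of an efficient strategy used above can be sketched as a standard efficiency-frontier calculation: strategies that cost more without adding life expectancy are removed (strong dominance), then strategies whose ICER exceeds that of the next more effective option are removed (extended dominance). The strategy names, costs, and life expectancies below are hypothetical and do not reproduce the results of Frazier et al.

```python
def efficient_frontier(strategies):
    """strategies: dict of name -> (lifetime cost, life expectancy).

    Returns [(name, icer), ...] for the efficient strategies in order of
    increasing cost; the ICER of the cheapest strategy is None.
    """
    ordered = sorted(strategies.items(), key=lambda kv: (kv[1][0], -kv[1][1]))

    # Strong dominance: drop options no more effective than a cheaper one.
    frontier = []
    for name, (cost, effect) in ordered:
        if frontier and effect <= frontier[-1][2]:
            continue
        frontier.append((name, cost, effect))

    def icer(a, b):
        return (b[1] - a[1]) / (b[2] - a[2])

    # Extended dominance: ICERs must increase along the frontier.
    changed = True
    while changed:
        changed = False
        for i in range(1, len(frontier) - 1):
            if icer(frontier[i - 1], frontier[i]) > icer(frontier[i], frontier[i + 1]):
                del frontier[i]
                changed = True
                break

    result = [(frontier[0][0], None)]
    for prev, cur in zip(frontier, frontier[1:]):
        result.append((cur[0], icer(prev, cur)))
    return result


# Hypothetical strategies: (lifetime cost in $, life expectancy in years).
STRATEGIES = {
    "no screening": (1000.0, 17.000),
    "A": (1200.0, 17.020),
    "B": (1500.0, 17.015),   # costs more than A, less effective
    "C": (1600.0, 17.030),
    "D": (2500.0, 17.050),
    "E": (1500.0, 17.024),   # beaten by a mix of A and C
}

for name, ratio in efficient_frontier(STRATEGIES):
    print(name, "reference" if ratio is None else f"${ratio:,.0f} per life-year")
```

In this illustration B is strongly dominated and E is removed by extended dominance, leaving a frontier along which each successive ICER is higher than the last.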

Breast cancer.

Mammography screening for breast cancer has been widely disseminated in the US population. Data from the 1987 National Health Interview Survey indicate that approximately 30% of women age 40 years or older report having had a mammogram in the past 2 years.64 In the 2000 National Health Interview Survey, this estimate increased to 70%.65 A recent analysis using seven independently developed breast cancer models suggests that mammography dissemination has played an important role in explaining the declining breast cancer mortality rates in the past decade,12 and numerous studies have demonstrated that screening for breast cancer with mammography is cost-effective, compared with no screening.66-71

Stout et al72 evaluated the cost-effectiveness of mammography screening from a different perspective. Using a deep model that simulates the natural history of breast cancer, they estimated the total lifetime costs and number of QALYs among women in the United States age 40 years or older during the period 1990 to 2000 under 64 alternative scenarios for mammography screening that varied in terms of the age at first screen, age at last screen, and the interval between screens. They also evaluated two additional scenarios: no screening, and screening as it actually occurred over this period. The authors found that annual screening of women age 40 to 80 years, a strategy that resembles some current guidelines, is one of 11 efficient screening scenarios but is the most costly of the efficient options. Mammography screening as practiced during this time period was not efficient; that is, other screening scenarios could have achieved more QALYs for the same or lower costs. However, the efficient scenarios tended to involve screening a narrower range of ages and/or screening less frequently than recommended by current guidelines. The findings are sensitive to assumptions about the level of participation in screening programs and to the potential short-term detrimental quality of life effects associated with mammography.

Prostate cancer.

The US Preventive Services Task Force concluded that at present there is insufficient evidence to recommend for or against routine prostate-specific antigen (PSA) screening of asymptomatic men.73 Of particular concern is whether the benefits of early detection outweigh the risks and consequences associated with false positives and overdiagnosis. To shed light on this issue, Draisma and de Koning15 used a deep model of the natural history of prostate cancer to estimate the magnitude of overdiagnosis due to PSA screening. They simulated prostate cancer incidence and mortality in the absence of screening, with one-time screens at age 55, 65, and 75 years, with annual screening of men age 55 to 67 years, or with screening men in this age range every 4 years. Their estimates of overdiagnosis range from 27% for one-time screening at age 55 years to 56% for one-time screening at age 75 years. Annual and quadrennial PSA testing—the screening intervals under study in ongoing randomized trials of PSA screening5,74—have estimated overdiagnosis rates of 50% and 48% respectively, suggesting that approximately one of every two cases of prostate cancer detected by PSA screening in these trials may not otherwise have been diagnosed in the patient's lifetime.

SUMMARY

Modeling is an essential component of screening program evaluation, particularly when CEA is involved. Simple models can be useful, but comprehensive evaluations of cancer screening programs typically require deep natural history models with sufficient detail and flexibility to simulate realistic clinical algorithms and populations with different risk profiles. When evaluating the results from a cancer screening model of any type, readers should critically appraise the underlying assumptions and evaluate the extent to which the model has been calibrated and validated against empirical data. Furthermore, one should pay careful attention to the population under consideration. A screening test that yields a favorable incremental cost-effectiveness ratio among one population subgroup may not be cost-effective among other populations. This is of particular concern with stage-shift models because the results may not be generalizable to populations and settings not included in the trial. Finally, and most importantly, one should remember that no model can (nor should claim to) predict the truth. A model can, however, provide insights into unobservable processes and shed light on the potential implications of alternative strategies. In that regard, models are useful adjuncts to clinical and epidemiologic evidence in guiding cancer screening policy.

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The authors indicated no potential conflicts of interest.

AUTHOR CONTRIBUTIONS

Conception and design: Amy B. Knudsen, Pamela M. McMahon, G. Scott Gazelle

Collection and assembly of data: Amy B. Knudsen, Pamela M. McMahon

Manuscript writing: Amy B. Knudsen, Pamela M. McMahon

Final approval of manuscript: Amy B. Knudsen, Pamela M. McMahon, G. Scott Gazelle

Footnotes

  • Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.

  • Received June 16, 2006.
  • Accepted September 14, 2006.
