Standardizing Patient-Reported Outcomes Assessment in Cancer Clinical Trials: A Patient-Reported Outcomes Measurement Information System Initiative

Kevin Weinfurt
From the Center on Outcomes, Research and Education, Evanston Northwestern Healthcare, Evanston; Department of Psychiatry and Behavioral Sciences, Institute for Healthcare Studies, and Department of Pediatrics, Feinberg School of Medicine, Northwestern University; Hematology/Oncology Division, Stroger Hospital of Cook County, Chicago, IL; Outcomes Research Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD; Center for Clinical and Genetic Economics, Duke Clinical Research Institute; Departments of Psychiatry and Behavioral Sciences and Psychology and Neuroscience, Duke University, Durham, NC; and Department of Psychiatry and Behavioral Science, Stony Brook University, Stony Brook, NY
Address reprint requests to David Cella, PhD, Center on Outcomes, Research and Education (CORE), Evanston Northwestern Healthcare, 1001 University Pl, Suite 100, Evanston, Illinois 60201; e-mail: d-cella{at}northwestern.edu

Abstract

Patient-reported outcomes (PROs), such as symptom scales or more broad-based health-related quality-of-life measures, play an important role in oncology clinical trials. They frequently are used to help evaluate cancer treatments, as well as for supportive and palliative oncology care. To be most beneficial, these PROs must be relevant to patients and clinicians, valid, and easily understood and interpreted. The Patient-Reported Outcomes Measurement Information System (PROMIS) Network, part of the National Institutes of Health Roadmap Initiative, aims to improve appreciably how PROs are selected and assessed in clinical research, including clinical trials. PROMIS is establishing a publicly available resource of standardized, accurate, and efficient PRO measures of major self-reported health domains (eg, pain, fatigue, emotional distress, physical function, social function) that are relevant across chronic illnesses including cancer. PROMIS is also developing measures of self-reported health domains specifically targeted to cancer, such as sleep/wake function, sexual function, cognitive function, and the psychosocial impacts of the illness experience (ie, stress response and coping; shifts in self-concept, social interactions, and spirituality). We outline the qualitative and quantitative methods by which PROMIS measures are being developed and adapted for use in clinical oncology research. At the core of this activity is the formation and application of item banks using item response theory modeling. We also present our work in the fatigue domain, including a short-form measure, as a sample of PROMIS methodology and work to date. Plans for future validation and application of PROMIS measures are discussed.

INTRODUCTION

The last few decades have witnessed increased attention to the role of patient-reported outcomes (PROs), such as symptom scales or health-related quality-of-life (HRQOL) measures, in clinical oncology trials.1 PROs can provide understanding and detail regarding the impact of new treatments in both cancer treatment and supportive care settings. In some cases, the analysis of PROs in clinical trials has led to labeling claims for new therapeutic agents.2-4 PROs are also important for monitoring adverse events that might arise as a result of therapy, whether in the context of traditional clinical trials or in registry databases.

Challenge of Measuring PROs

Despite the importance of PROs for characterizing the value of new treatment strategies, researchers and policy makers have cited several methodologic issues that need to be addressed to improve the measurement and interpretation of PROs in clinical trials. In a recent study, Flynn et al5 report the results of qualitative interviews with 42 lead authors of clinical trials published in top-tier journals, including 11 oncology trialists. Results from those interviews, as well as comments by others in recent years, highlight a number of concerns. First, clinical trials often use different measures to assess the same concepts, limiting the ability of decision makers to compare results across studies. Second, some PRO measures are perceived to be unresponsive to changes that investigators believe are present. Lack of instrument responsiveness can be attributed partly to the floor or ceiling effects in some measures, such that the experiences of individuals reporting very low or high levels of a symptom (eg, pain) are not assessed adequately.

A third issue raised by many trialists is the burden of PRO measurement on research participants and personnel. Brief measures that do not sacrifice precision for brevity would help make PRO assessments more widely adopted in clinical research (and practice). This is particularly important given the increasing prevalence of electronic diary and interactive voice response systems for collecting more frequent data from trial participants who are being queried away from the clinic, in their home environments.

A fourth barrier cited by clinical trialists is that some attractive PRO measures have not been validated specifically in the clinical population under study. A fifth and related issue is that appropriate measures must possess adequate evidence for validity. The US Food and Drug Administration has called special attention to this concern in their draft guidance document on the use of PROs for pharmaceutical labeling claims.6 The remainder of this article describes current efforts by a collaborative network of investigators to address these concerns and arrive at a better alternative for measuring PROs in applications pertaining to clinical research.

PROMIS Network

Recognizing the importance of PROs for clinical research, the National Institutes of Health (NIH) funded the Patient-Reported Outcomes Measurement Information System (PROMIS) Network as part of the NIH Roadmap Initiative to reengineer the clinical research enterprise.7 The PROMIS Network is a cooperative group that includes six primary research sites and a statistical coordinating center, all of which work closely with scientists from the NIH.8 The PROMIS Network's overall goal is to develop a publicly available set of standardized instruments for measuring major self-reported health domains that are affected by many chronic illnesses, and to do so by incorporating state-of-the-art cognitive, qualitative, quantitative, and health survey methodologies. To date, the network has developed first-generation measures of self-reported pain, fatigue, emotional distress, physical function, and social function, with the expectation that additional domains will be developed in the future (Fig 1).8

Fig 1.

Patient-Reported Outcomes Measurement Information System (PROMIS) domain framework. *Cancer-specific versions of PROMIS chronic illness banks are being developed in these areas. Reproduced with permission from the PROMIS Health Organization and the PROMIS Cooperative Group.

The National Cancer Institute (NCI) provided supplementary funding to the PROMIS Network to ensure that the network's measures were valid for cancer patients and survivors across the continuum of care, and that its measurement tools addressed the needs of cancer researchers. First, the NCI supplement made possible the collection of data for item calibration and norming from more than 2,000 patients with cancer (reflecting multiple tumor sites and different stages of treatment). In addition, domain expert and patient input was obtained to enhance the cancer relevance of PROMIS measures of pain, fatigue, emotional distress, and physical function; the same will later be done for social function. Together, these quantitative and qualitative approaches provide greater confidence that the PROMIS measures have precise and valid interpretations for patients and survivors along the continuum of cancer care.

Second, the NCI supplement is supporting the development of PRO measures assessing additional domains that are especially relevant for cancer patients and survivors. Researchers at NCI (Bethesda, MD), Northwestern University (Evanston, IL), and Duke University (Durham, NC) are focusing on four important self-reported health domains for which there are no well-accepted measures: cognitive function, the negative and positive psychosocial impacts of illness (ie, stress response and coping; shifts in self-concept, social interactions, and spirituality), sleep/wake function, and sexual function. Measures of these domains are being developed in tandem with the other PROMIS domains using the same rigorous development process, which we describe in this article.

Finally, the NCI supplement is providing support to identify and address barriers to the adoption of PROMIS measures in oncology clinical trials. The supplement seeks to augment the utility of PROMIS measures in oncology by identifying minimally important differences (MIDs) in scores on PROMIS measures used in cancer populations; gathering clinician feedback on formats for reports of patients' scores on PROMIS measures; and working collaboratively with NCI-funded cooperative groups to select optimal PRO measures for use in clinical trials that include HRQOL components. An MID on a PRO measure represents the smallest score difference (either improvement or deterioration) that patients perceive as important and that would lead clinicians to consider a change in care.10 By representing the smallest clinically significant score changes, MIDs increase the utility of PRO scores for clinicians and clinical researchers (ie, facilitating interpretation of patients' responses to treatment and other changes over time). Likewise, incorporating clinician input in designing graphical reports of patients' PRO scores helps to ensure the interpretability of assessment results, which researchers have emphasized is fundamental in symptom monitoring and management trials.11-13 Together, these efforts are expected to improve substantially the ability of oncology researchers to assess PRO end points that are important to patients and clinicians with greater efficiency and precision.
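The MID definition above can be made concrete with a minimal sketch. The function name, the score values, and the 5-point MID are hypothetical and chosen only for illustration; no MID values for PROMIS measures are given in this article.

```python
def classify_change(baseline, follow_up, mid):
    """Label a score change relative to a minimally important difference (MID).

    `mid` is a hypothetical threshold supplied by the caller; it is not a
    published PROMIS value.
    """
    if mid <= 0:
        raise ValueError("mid must be positive")
    change = follow_up - baseline
    if abs(change) < mid:
        return "no important change"
    # Direction only: whether a higher score means improvement or
    # deterioration depends on how the measure is scored.
    return "important increase" if change > 0 else "important decrease"


# Illustrative scores with an assumed MID of 5 points.
print(classify_change(50, 53, mid=5))
print(classify_change(50, 56, mid=5))
```

Treating a change exactly equal to the MID as important is a design choice of this sketch, not something the definition settles.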

We next describe the PROMIS process for developing item banks and short-form measures to assess PROs, using fatigue as an example. As we later explain in greater detail, an item bank is a grouping of questions, the measurement properties of which are carefully calibrated such that they can provide an operational definition of a concept (eg, anxiety) and accurately assess the entire continuum (eg, severity or frequency) of that concept. First-generation PROMIS Fatigue (and other domain) measures will be available later in 2007. We present a sample fatigue short form in this article to illustrate PROMIS measure development processes and to preview the nature of resultant measures for use in oncology clinical trials.

PROMIS METHODOLOGY

Defining Domains and Generating Items

Table 1 outlines the measure development process for the PROMIS domains of self-reported health. After determining what health domains PROMIS would address, the next step was to conduct extensive literature reviews within each domain (pain, fatigue, emotional distress, physical function, and social function) to derive a working definition and a conceptual framework that summarized the core subdomains (Fig 1).14 When possible, PROMIS investigators developed a domain hierarchy or end point model to describe the structure or relationship of the subdomains (eg, anxiety and depression) within the larger domains of health (eg, emotional distress). To identify gaps in the conceptual framework and to capture how patients talk about their experiences of the domain topics, multiple focus groups were conducted for each domain.15 Patients with a range of chronic health conditions (eg, psychiatric outpatients, patients with arthritis) were recruited from clinics and registries across the country to participate in the focus groups. Conceptual frameworks were revised based on focus group results.

Table 1.

Overview of PROMIS Item Bank Development Process

Drawing from literature reviews and expert recommendations, approximately 7,000 questionnaire items were extracted from existing PRO measures to form item libraries for PROMIS domains.15 Each library catalogued items based on their instrument of origin, instructions, time frame (recall period), wording (stem), and possible answers (response options). Next, members of PROMIS domain groups put items through a binning and winnowing process in which items of similar content were categorized together based on the conceptual framework (binning) and then those items from each bin believed to best exemplify the domain definition were selected as representatives of the subdomains (winnowing).15 Items not selected during the winnowing process were given a standardized reason for rejection (most often content redundancy with a selected item or inconsistency with the domain definition). The review and rewriting of selected items focused on standardizing the style of item stems (ie, clear and concise wording) and rephrasing stems to match PROMIS preferred recall periods and response options.15 PROMIS Network experts used the Lexile Framework for Reading16,17 when drafting items targeted at or below the sixth-grade reading level.

Patients with a variety of chronic health conditions were presented with the rewritten items during cognitive interviews assessing how respondents understood the questions, formulated their responses, and matched them to the response options, as well as other decision processes (eg, social desirability) that can modify responses.18 At least five respondents were interviewed to evaluate each potential PROMIS item, with at least one nonwhite and one white respondent and at least two respondents with less than 12 years of education or ninth-grade reading ability (as measured by the Wide Range Achievement Test-3 Reading Subtest).19 Items that were revised substantially after the first round of interviews underwent additional cognitive interviewing by another three to five respondents with similar characteristics.15
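The respondent requirements for each item's cognitive interviews can be written as a simple check. This is only a bookkeeping sketch: the `Respondent` class and its field names are hypothetical, and the `lower_literacy` flag stands in for the education or reading-ability criterion described above.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    white: bool
    # True if the respondent meets the education/reading criterion in the
    # text (hypothetical encoding of that criterion as a single flag).
    lower_literacy: bool


def meets_interview_criteria(respondents):
    """Check the per-item panel rules: at least five respondents, at least
    one white and one nonwhite respondent, and at least two respondents
    meeting the education/reading criterion."""
    n_white = sum(r.white for r in respondents)
    n_nonwhite = len(respondents) - n_white
    n_lower_literacy = sum(r.lower_literacy for r in respondents)
    return (
        len(respondents) >= 5
        and n_white >= 1
        and n_nonwhite >= 1
        and n_lower_literacy >= 2
    )


# Illustrative panel (invented data).
panel = [
    Respondent(white=True, lower_literacy=False),
    Respondent(white=False, lower_literacy=True),
    Respondent(white=True, lower_literacy=True),
    Respondent(white=True, lower_literacy=False),
    Respondent(white=False, lower_literacy=False),
]
print(meets_interview_criteria(panel))
```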

The general PROMIS item libraries underwent additional refinement to ensure that they are applicable to cancer populations. This involved the use of expert consensus and qualitative studies to identify items or concepts that might be of low relevance for cancer populations as well as those that should be added to reflect cancer-specific concerns. First, we performed content analysis of data from diverse samples of patients with cancer who participated in focus groups (N = 21) or cognitive interviews (N = 40).20 These results then informed domain experts' qualitative item review, which also incorporated their clinical and measurement expertise. One example of consequent item modifications included removal of certain items (eg, “I got tired more easily than usual”) assessing somatic symptoms of depression that the expert panel judged to be more indicative of treatment adverse effects than of emotional distress in patients with cancer; this decision is consistent with literature examining relationships between somatic and nonsomatic symptoms of depression in oncology populations.21,22 Another example was the addition of items assessing neuropathic pain, which was supported by the qualitative data and expert consensus. Future analyses will compare how these cancer-modified PROMIS item pools (sets of items not yet calibrated in an established item bank) perform in cancer populations as compared with the general PROMIS item banks.

Creating and Using Item Banks

After final revisions to PROMIS items were made, they were submitted for an initial wave of field testing with a sample of more than 3,000 adults from the general US population, with approximately 500 people randomly assigned to respond to each domain (pain, fatigue, emotional distress, physical function, and social function). Also included in this field testing were items from legacy instruments—existing measures most often used in clinical research—that will serve as benchmarks in future validation of PROMIS measures (see Reeve et al23 for complete description of analysis plan). A second wave of testing was conducted by administering the PROMIS item pools to approximately 500 patients recruited from cancer clinics and tumor registries. In addition, a third wave of testing with patients across the continuum of cancer care (N = 1,500) currently is underway to test the modified item pools created after gathering patient and expert input into the cancer relevance of the general PROMIS items. Data collected from all waves are used to perform thorough examinations of the items' statistical properties. Modern psychometric techniques will be used to evaluate items for their ability to measure different levels of measured constructs (ie, frequency or severity of pain) and to demonstrate that the items can be interpreted similarly for people from different study populations where PROMIS measures may be used (eg, different races, sexes, or chronic health conditions).

Item response theory (IRT) modeling will be conducted to specify how each item should be used to measure the concept of interest (eg, fatigue).23 Specifically, IRT characterizes each item according to the location of its response options—on the concept's continuum (ie, severity or frequency of a symptom)—and its ability to discriminate people from one another along the continuum of the concept being measured. For example, in the assessment of fatigue, the question “How often were you too tired to take a bath or shower?” tends to measure more extreme levels of fatigue than the question, “How often did you have enough energy to exercise strenuously?” Discrimination refers to how well an item measures different levels of a concept experienced by a person. For example, the latter question (“How often did you have enough energy to exercise strenuously?”) has more precision for assessing people with low fatigue levels and is less informative (discriminating) for assessing people with extreme fatigue.
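To illustrate how location and discrimination shape an item's behavior, the following minimal sketch computes response-category probabilities under Samejima's graded response model, the standard IRT model for items with ordered response options. The parameter values are invented for illustration and are not PROMIS calibrations; the exercise item is assumed to be reverse-scored so that higher categories indicate more fatigue.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    theta: respondent's level on the latent continuum (e.g., fatigue)
    a: item discrimination (slope), assumed positive
    thresholds: increasing location parameters b_1 < ... < b_{m-1}
    Returns a list of m probabilities, one per ordered response category.
    """
    # Cumulative probability of responding in category k or higher,
    # bracketed by 1.0 (lowest category or higher) and 0.0 (above the top).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical parameters, for illustration only.
bath_item = {"a": 2.0, "thresholds": [0.5, 1.2, 1.9, 2.6]}        # located at severe fatigue
exercise_item = {"a": 2.0, "thresholds": [-1.5, -0.8, -0.1, 0.6]}  # located at milder fatigue

for theta in (-1.0, 0.0, 2.0):
    for name, item in (("bath", bath_item), ("exercise", exercise_item)):
        probs = grm_category_probs(theta, item["a"], item["thresholds"])
        print(theta, name, [round(p, 2) for p in probs])
```

With these assumed values, respondents at low or average fatigue almost all fall in the lowest category of the bath item, so that item distinguishes among people only at the severe end of the continuum.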

In an item bank, an item's location, discrimination, and other statistical properties are combined with additional information (eg, wording and response categories) to provide a library from which one can draw questions for a given clinical research project. Item banks are thus a foundation for a new era of PRO assessment. The library of detailed information behind an item bank aids researchers in selecting the optimal set of questions to match the needs of their studies. For example, measuring physical functioning in highly symptomatic patients with advanced cancer may require items tapping into less strenuous activities, whereas measuring physical functioning in less symptomatic patients may require items tapping into moderate or vigorous activities. Because each of the many items within an item bank has been mapped onto a common metric, those two sets of physical functioning measures can be compared to each other or combined for a meta-analysis.
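As a rough sketch of what drawing questions from such a library might look like, the example below stores each item's wording, response options, and calibrated parameters, then selects items located within a target range of the common metric. The item stems, parameter values, and function names are all invented for illustration and do not come from a PROMIS bank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankItem:
    stem: str
    response_options: tuple
    location: float        # hypothetical calibrated location on the common metric
    discrimination: float  # hypothetical calibrated slope


def select_items(bank, low, high, k):
    """Return up to k items whose location lies in [low, high],
    most discriminating first."""
    in_range = [item for item in bank if low <= item.location <= high]
    return sorted(in_range, key=lambda item: item.discrimination, reverse=True)[:k]


OPTIONS = ("Unable", "With much difficulty", "With some difficulty", "Without difficulty")

# Invented physical-function items; higher location = more demanding activity.
bank = [
    BankItem("Are you able to get in and out of bed?", OPTIONS, -2.0, 1.8),
    BankItem("Are you able to walk a block on flat ground?", OPTIONS, -1.0, 2.2),
    BankItem("Are you able to climb several flights of stairs?", OPTIONS, 0.5, 2.0),
    BankItem("Are you able to run five miles?", OPTIONS, 1.5, 1.6),
]

# Items suited to highly symptomatic patients: the low end of the continuum.
for item in select_items(bank, low=-3.0, high=-0.5, k=2):
    print(item.location, item.stem)
```

Because every item carries a location on the same metric, a second study could call `select_items` with a higher range and still report scores that are comparable with the first.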

Item banks also allow researchers to assess components of HRQOL and other PROs in cancer populations with short-form measures or computerized adaptive testing (CAT) technology. For the former, researchers can select optimal sets of items from the item banks to create accurate short-form measures tailored to clinical trial populations. CAT assessment uses an automated system (ie, computers, Internet, telephone using interactive voice response) that tailors which bank items are administered on a respondent-by-respondent basis. CATs provide efficient and precise PRO measurement as a result of administering the most informative set of questions for individual respondents by selecting each question based on the respondent's pattern of responses to all previously administered questions. Well-defined item banks allow CATs to minimize or eliminate floor and ceiling effects. Furthermore, PRO data collected from any short form or CAT using the same PROMIS item banks can be compared or combined, even when the respondents receive different sets of questions.
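The item-selection step of a CAT can be sketched as follows. For brevity this uses a two-parameter logistic model for yes/no items rather than a polytomous model, and it omits the re-estimation of the respondent's level after each answer that a full CAT would perform. Item names and parameters are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(theta, bank, administered):
    """Pick the not-yet-administered item that is most informative at the
    current estimate theta. `bank` maps item name -> (a, b)."""
    candidates = {name: ab for name, ab in bank.items() if name not in administered}
    if not candidates:
        return None
    return max(candidates, key=lambda name: item_information(theta, *candidates[name]))


# Hypothetical bank: equal discrimination, different locations.
bank = {
    "item_low": (1.5, -1.0),
    "item_mid": (1.5, 0.0),
    "item_high": (1.5, 1.0),
}

print(next_item(0.9, bank, administered=set()))
print(next_item(0.9, bank, administered={"item_high"}))
```

When discriminations are equal, the most informative item is the one located closest to the current estimate, which is why a well-populated bank lets the algorithm keep finding useful questions even for respondents at the extremes.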

SAMPLE PROMIS FATIGUE MEASURE FOR ONCOLOGY

In this section, we present the creation of a PROMIS-derived fatigue short form that was developed for use in oncology populations. The PROMIS network adopted the definition of fatigue as “an overwhelming, debilitating, and sustained sense of exhaustion that decreases one's ability to carry out daily activities, including the ability to work effectively and to function at one's usual level in family or social roles.”24-26 Subsequent to domain group experts classifying fatigue into two content-driven subdomains (experience and impact), a 58-item Fatigue Impact and a 54-item Fatigue Experience item pool were developed for use across chronic illnesses.

Items' psychometric properties were evaluated using the general US population data mentioned previously, of which a subsample of approximately 450 individuals completed the fatigue items. Of those individuals (mean age, 53.2 years; standard deviation, 19.0 years), 236 (52.4%) were female; 384 (85.3%) were white; 41 (9.1%) were black; and 52 (11.6%) were of Hispanic/Latino origin. The items' psychometric properties were evaluated using classical test theory indices (Spearman's ρ, Cronbach's α, and item-scale correlation), monotonicity, and scalability (using Mokken scale procedures). Sufficient unidimensionality of the bank, a requirement for sound application of IRT models, was evaluated using confirmatory factor analytic techniques (eg, bifactor analysis; see Lai et al27 for an example). In addition, item parameters were estimated using the graded response model as implemented in PARSCALE software.23 Preliminary analyses suggested that 56 of 58 Impact items and 52 of 54 Experience items demonstrated satisfactory psychometric properties and were therefore retained in the PROMIS fatigue item pools for use across chronic illnesses.
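Of the classical test theory indices named above, Cronbach's α is the simplest to show in code. The sketch below uses the standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores), with only the Python standard library; the response data are invented and the function is not the software used in the PROMIS analyses.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of respondents' summed scores.
    total_var = variance([sum(r) for r in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Invented responses from four people to two items.
data = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(round(cronbach_alpha(data), 2))
```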

Enhancing the Cancer Relevance of PROMIS Fatigue Measures

Within the measure development phase in which items were refined for use in oncology populations, we implemented a three-step process to augment the cancer relevance of PROMIS banks, including fatigue (Table 1). The first step consisted of systematic, content-based expert selection of PROMIS Fatigue Experience and Fatigue Impact items deemed highly relevant to cancer populations. Using an expert consensus process, PROMIS Fatigue Experience and Fatigue Impact items were selected for exclusion from the cancer-specific item pool or earmarked for exclusion pending their psychometric performance in field testing (described below). Items were nominated for exclusion if they tapped aspects of fatigue that in the context of cancer would often indicate a different concept (eg, “How often were you emotionally exhausted?”), compound or obscured concepts (eg, “How often did you think about your fatigue?”), or attributional judgments (eg, “To what degree did your fatigue make you more forgetful?”).

The second step involved the previously mentioned expert review of the qualitative summaries from the focus group and cognitive interviews with cancer populations. For the fatigue domain, unlike for the others (pain, emotional distress, and physical function), all the key patient-identified concepts (amounts and types of symptoms/distress/dysfunction; negative and positive impacts) were judged by expert consensus to be sufficiently addressed by the general, noncancer-specific PROMIS Fatigue item banks, and they supported the conceptual subdomains of Fatigue Experience and Fatigue Impact. As a result, no non-PROMIS items were added to the cancer-specific fatigue item pools.

The third step comprised item selection based on psychometric performance in the first wave of PROMIS field testing (general US population sample) as well as on content balancing. Estimates of location and discrimination were generated separately for items in the Fatigue Impact and Fatigue Experience banks using IRT modeling. Through a process of expert review, items were chosen based on their ability to cover both the fatigue continuum (ie, severity or frequency, from low to high) and the range of ways fatigue is expressed clinically (ie, content areas within the PROMIS conceptual framework, such as energy, impact on mental processing or occupational functioning). This process led to 20 Fatigue Impact and 16 Fatigue Experience items being excluded, producing two 36-item cancer-specific banks that measured the entire spectrum of the fatigue continuum.
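The two selection criteria in this step, coverage of the fatigue continuum and coverage of content areas, lend themselves to a simple gap check on a candidate item set. The sketch below is a hypothetical aid to expert review, not the procedure the domain experts followed; the locations and band boundaries are invented.

```python
def coverage_gaps(items, content_areas, location_bands):
    """Report which content areas and location bands a candidate set misses.

    items: list of (content_area, location) pairs
    content_areas: names of areas that should each be represented
    location_bands: list of (low, high) half-open intervals on the continuum
    """
    covered_areas = {area for area, _ in items}
    missing_areas = [area for area in content_areas if area not in covered_areas]
    missing_bands = [
        (low, high)
        for low, high in location_bands
        if not any(low <= location < high for _, location in items)
    ]
    return missing_areas, missing_bands


# Invented candidate set and targets.
candidate = [("energy", -1.2), ("mental processing", 0.3), ("energy", 1.4)]
areas = ["energy", "mental processing", "occupational functioning"]
bands = [(-2, -1), (-1, 0), (0, 1), (1, 2)]

print(coverage_gaps(candidate, areas, bands))
```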

PROMIS Fatigue Short Form for Cancer Trials

Domain experts selected the 20 best Fatigue items (10 Fatigue Impact and 10 Fatigue Experience) from the cancer-specific banks based on statistical and conceptual considerations. These select items were presented for review to multidisciplinary panels of clinical experts working in oncology (including physicians, nurses, pharmacists, and psychologists) using independent voting regarding item inclusion followed by domain group discussion and consensus. Overall, this clinician feedback supported the 20 items' content coverage, clinical relevance, and utility in providing general assessment and guiding interventions. There was only one recommended item exclusion due to content redundancy between two items: “How often were you exhausted?” and “How often did you experience extreme exhaustion?” We therefore selected the latter of the two items for the short-form pool based on its more favorable statistical performance. From there, we created a preliminary fatigue short-form measure covering both Fatigue Impact and Fatigue Experience subdomains, which can be scored separately in addition to contributing to a global fatigue score. The seven items in this preliminary PROMIS Cancer Fatigue Short Form (Table 2) were selected so that there was consistency in the response scale, broad coverage across the fatigue continuum (ie, high to low), and good precision of measurement (discrimination function). Analyses of the second wave of PROMIS field testing (including a cancer sample) and the third wave of testing (exclusively patients with cancer) are currently underway. These data will be used to evaluate further the psychometric properties of this particular short form and the other PROMIS fatigue items, resulting in finalized fatigue item banks and short-form options for use in cancer populations and available for clinical trials.

Table 2.

Preliminary PROMIS Cancer Fatigue Short Form

FUTURE DIRECTIONS

The measures created by the PROMIS Network seek to address and remedy the methodologic concerns previously identified by researchers and policy makers as barriers to greater use of PROs in clinical trials. First, as an NIH Roadmap Initiative, the PROMIS Network's goal is to enhance coordination of research being conducted across institutions by establishing a publicly accessible national resource for standardized, well-validated PRO measures that can be used and compared across different studies. Second, the PROMIS measure development process emphasizes content and construct validity across all stages of development, paying particular attention to incorporating patient input from formative qualitative research. Third, using IRT modeling to create PROMIS item banks allows for the development of precise, low-burden short-form measures that can be tailored to particular trial populations, and CATs that can diminish or eradicate floor or ceiling effects and subsequently allow for better evaluation of clinical change. Fourth, the NCI supplement aims to establish further the validity of PROMIS measures for use in oncology clinical trials by gathering both qualitative and quantitative data from diverse cancer populations toward the creation of cancer-specific PROMIS item banks and short forms. Lastly, the identification of minimally important differences or changes in scores on PROMIS measures used in cancer populations will extend their value in oncology clinical trials. We hope these advancements will improve the application of PROs in clinical research and further develop the state of the science by allowing comparisons across trials as the use of item banks and PRO technology matures.

Efforts supported by the NCI also are facilitating the inclusion of PROMIS measures into clinical trials. Representatives from all NCI-funded adult cooperative groups have been informed of the opportunities for collaboration with PROMIS, and most have committed to inclusion of PROMIS short-form measures in clinical trials that include a HRQOL component. The cooperative groups will be provided with customized short forms tailored to their specific trials and research interests. In addition, we are working collaboratively with several cooperative groups to contribute to additional item bank development and validation. Our outreach activities will help to ensure that the final PROMIS measures are responsive to the needs of the research community.

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

Although all authors completed the disclosure declaration, the following author(s) indicated a financial or other interest that is relevant to the subject matter under consideration in this article. Certain relationships marked with a “U” are those for which no compensation was received; those relationships marked with a “C” were compensated. For a detailed description of the disclosure categories, or for more information about ASCO's conflict of interest policy, please refer to the Author Disclosure Declaration and the Disclosures of Potential Conflicts of Interest section in Information for Contributors.

Employment or Leadership Position: Arthur A. Stone, Gallup Organization (C)

Consultant or Advisory Role: Arthur A. Stone, Invivodata Inc (C)

Stock Ownership: None

Honoraria: None

Research Funding: None

Expert Testimony: None

Other Remuneration: None

AUTHOR CONTRIBUTIONS

Conception and design: Sofia F. Garcia, David Cella, Steven B. Clauser, Kathryn E. Flynn, Jin-Shei Lai, Bryce B. Reeve, Arthur A. Stone, Kevin Weinfurt

Financial support: Steven B. Clauser, Bryce B. Reeve, Ashley Wilder Smith

Administrative support: Steven B. Clauser, Bryce B. Reeve, Ashley Wilder Smith

Provision of study materials or patients: David Cella, Thomas Lad, Kevin Weinfurt

Collection and assembly of data: Sofia F. Garcia, David Cella, Kathryn E. Flynn, Kevin Weinfurt

Data analysis and interpretation: Sofia F. Garcia, David Cella, Kathryn E. Flynn, Jin-Shei Lai, Arthur A. Stone, Kevin Weinfurt

Manuscript writing: Sofia F. Garcia, David Cella, Steven B. Clauser, Kathryn E. Flynn, Thomas Lad, Jin-Shei Lai, Bryce B. Reeve, Ashley Wilder Smith, Arthur A. Stone, Kevin Weinfurt

Final approval of manuscript: Sofia F. Garcia, David Cella, Steven B. Clauser, Kathryn E. Flynn, Thomas Lad, Jin-Shei Lai, Bryce B. Reeve, Ashley Wilder Smith, Arthur A. Stone, Kevin Weinfurt

Acknowledgments

We thank the PROMIS Cooperative Group, reviewers at the National Cancer Institute for their comments on this article, and all of the patients who provided qualitative and quantitative data.

Footnotes

  • Supported by National Institutes of Health Grant No. U01AR052177 to the Center on Outcomes, Research and Education, Evanston Northwestern Healthcare, and Grant No. U01AR052186 to Duke University Medical Center. In addition to financial support, the sponsor reviewed and approved the study design and monitored its progress and results.

  • Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.

  • Received April 24, 2007.
  • Accepted July 27, 2007.

REFERENCES
