- © 2009 by American Society of Clinical Oncology
Three Immunomarker Support Vector Machines–Based Prognostic Classifiers for Stage IB Non–Small-Cell Lung Cancer
- Zhi-Hua Zhu,
- Bing-Yu Sun,
- Yun Ma,
- Jian-Yong Shao,
- Hao Long,
- Xu Zhang,
- Jian-Hua Fu,
- Lan-Jun Zhang,
- Xiao-Dong Su,
- Qiu-Liang Wu,
- Peng Ling,
- Ming Chen,
- Ze-Ming Xie,
- Yi Hu and
- Tie-Hua Rong
- From the Departments of Thoracic Oncology, Pathology, and Radiation Oncology, State Key Laboratory of Oncology in South China, Cancer Center, Lung Cancer Research Center; and Reproductive Research Center, the Second Affiliated Hospital, Sun Yat-sen University, Guangzhou; and Institute of Intelligence Machine, Chinese Academy of Sciences, Hefei, People's Republic of China.
- Corresponding author: Tie-Hua Rong, MD, Department of Thoracic Oncology, Cancer Center of Sun Yat-Sen University, 651 Dongfeng Rd E, Guangzhou 510060, People's Republic of China; e-mail: rongth{at}live.cn.
Z.-H.Z., B.-Y.S., and Y.M. contributed equally to this article.
Abstract
Purpose Approximately 30% of patients with stage IB non–small-cell lung cancer (NSCLC) die within 5 years after surgery. Current staging methods are inadequate for predicting the prognosis of this particular subgroup. This study identifies prognostic markers for NSCLC.
Patients and Methods We used computer-generated random numbers to select 148 paraffin-embedded specimens for immunohistochemical analysis. We studied gene expression in paraffin-embedded specimens of lung cancer tissue from 73 randomly selected patients with stage IB NSCLC who had undergone radical surgical resection and evaluated the association between the level of expression and survival. We used support vector machines (SVM)–based methods to develop three immunomarker-SVM–based prognostic classifiers for stage IB NSCLC. For validation, we used randomly assigned specimens from 75 other patients.
Results We devised three immunomarker-SVM–based prognostic classifiers, including SVM1, SVM2, and SVM3, to refine prognosis of stage IB NSCLC successfully. The SVM1 model integrates age, cancer cell type, and five markers, including CD34MVD, EMA, p21ras, p21WAF1, and tissue inhibitors of metalloproteinases (TIMP) –2. The SVM2 model integrates age, cancer cell type, and 19 markers, including BCL2, caspase-9, CD34MVD, low-molecular-weight cytokeratin, high-molecular-weight cytokeratin, cyclo-oxygenase-2, EMA, HER2, matrix metalloproteinases (MMP) –2, MMP-9, p16, p21ras, p21WAF1, p27kip1, p53, TIMP-1, TIMP-2, vascular endothelial growth factor (VEGF), and β-catenin. The SVM3 model consists of SVM1 and SVM2. The three models were independent predictors of overall survival. We validated the classifiers with data from an independent cohort of 75 patients with stage IB NSCLC.
Conclusion The three immunomarker-SVM–based prognostic characteristics are closely associated with overall survival among patients with stage IB NSCLC.
INTRODUCTION
The 5-year survival rate for stage IB non–small-cell lung cancer (NSCLC) is only 70%, despite surgery.1 Several mRNA-based models that correlate with the prognosis of NSCLC have been validated, supporting the reliability of specific gene expression as a marker in clinical practice.2–13 A large genomic meta-analysis further demonstrated that classes associated with survival are reproducibly identified in multiple independent cohorts and can be detected using immunohistochemistry (IHC) in addition to mRNA.2 Molecular prognostic markers could potentially be represented by changes in gene copy number, mRNA expression, or protein expression levels. IHC is the most practical method for assessing protein expression changes by histopathology. IHC not only provides a semiquantitative assessment of protein abundance but also defines the cellular localization of expression. It may also detect functionally important post-translational protein modifications, such as phosphorylation. These considerations have led to the extensive use of IHC in studies on prognostic markers for tumors.14 Because no special processing of tissue samples is needed and labor-intensive, expensive molecular diagnostic techniques are avoided, IHC is perhaps the technique most readily adaptable to clinical practice.15 Interestingly, even the immunomarkers in NSCLC that have been studied most exhaustively have yielded inconsistent results, suggesting that their prognostic values are suboptimal.14
When a single clinicopathologic feature or immunomarker is used to predict prognosis, the results may not be reliable. By combining specific clinicopathologic features or immunomarkers, the value of prognostic predictors can be greatly enhanced. This hypothesis has been tested at the mRNA level,2–13 whereas there is a paucity of reliable IHC markers for prognosis.
Several supervised methods, such as decision trees, have been applied to the analysis of cDNA microarrays to refine prognosis in breast cancer16,17 and NSCLC.2–7,9–13,16,18 Recently, it has been demonstrated that a small subset of highly discriminating genes can be extracted to build extremely reliable cancer classifiers by applying state-of-the-art classification algorithms (support vector machines [SVM]19), and that SVM is also effective for discovering informative features or attributes (such as critically important genes).20
In this article, we used SVM-based methods to develop three immunomarker-SVM–based prognostic classifiers for the prediction of prognosis of NSCLC.
PATIENTS AND METHODS
Patient Selection
We used computer-generated random numbers to assign specimens from 148 consecutive patients with stage IB NSCLC for IHC analysis. Patients were selected based on the following eligibility criteria: histologic proof of NSCLC; disease stage T2N0M0 according to the 1997 American Joint Committee on Cancer/International Union Against Cancer staging system; age of at least 18 years; and no evidence of metastatic disease as determined by history, physical examination, blood chemistry analysis, and routine computed tomography. No patient received adjuvant therapy. Patients were excluded if they had a history of previously treated cancer other than basal or squamous cell carcinoma of the skin or had received preoperative chemotherapy and/or radiotherapy.21
Tissue Microarrays Construction and Immunohistochemistry
Tissue microarrays were constructed as previously described (complete details are shown in the Appendix, online only).22 The prognostic immunomarkers identified in previous studies14 represent the best candidates for the new prognostic classifiers; 33 molecular markers involved in different aspects of NSCLC development and metastasis were chosen for investigation in this study (Fig 1), including proliferation (epidermal growth factor receptor [EGFR], HER2), cell cycle (cyclin D1, proliferating cell nuclear antigen [PCNA], transforming growth factor β [TGF-β], p21WAF1, p27kip1, and p16), apoptosis (BAX, BCL2, caspase-9, fas, and survivin), angiogenesis and lymphangiogenesis (vascular endothelial growth factor [VEGF] and CD34MVD), cell adhesion molecules (CD44v6, E-cadherin, and β-catenin), matrix metalloproteinases (MMP-2, MMP-9), tissue inhibitors of metalloproteinases (TIMP-1, TIMP-2, cyclo-oxygenase-2, and EMA), proto-oncogenes (c-Myc, p21ras), tumor suppressor genes (p53, p63, NM23-H1, and PTEN), and cancer cell antigens (carcinoembryonic antigen [CEA], low-molecular-weight and high-molecular-weight cytokeratin). The antibody dilutions and antigen retrieval conditions are shown in Appendix Table A1 (online only).
IHC Scoring
A positive control sample was evaluated with each batch of slides. Each slide was assigned a score: the average of the tumor cell staining score multiplied by the staining intensity score. Tumor cell staining was assigned a score using a semiquantitative six-category grading system: 0, no tumor cells staining; 1, 1% to 10%; 2, 11% to 25%; 3, 26% to 50%; 4, 51% to 75%; and 5, 76% to 100% of tumor cells staining. Staining intensity was assigned a score using a semiquantitative four-category grading system: 0, no staining; 1, weak staining; 2, moderate staining; and 3, strong staining. Tumor neovascularization was assessed by counting CD34+ capillaries and small venules, using the intratumor microvessel density (MVD) counting procedure described by Weidner.23 Two trained pathologists learned the IHC scoring system by simultaneously evaluating a panel of 300 NSCLC tissue samples that were immunostained for EGFR and were not part of the study presented here. They then independently scored all cases, blinded to the clinical follow-up data. Their results were in complete agreement for 86% of the cases, indicating that the scoring method was easy to learn and highly reproducible. A third pathologist intervened when a difference of opinion arose between the two pathologists: if the third pathologist agreed with one of them, that value was selected; if the third pathologist's conclusion was completely different, the three worked collaboratively to reach a consensus.
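The composite scoring scheme described above can be sketched in code. The following Python translation is illustrative only (the function names are ours, not from the study): the staining extent category (0 to 5) is multiplied by the intensity category (0 to 3), giving a per-core score from 0 to 15.

```python
def staining_extent_score(pct_positive):
    """Map the percentage of stained tumor cells to the 0-5 extent category."""
    if pct_positive == 0:
        return 0
    if pct_positive <= 10:
        return 1
    if pct_positive <= 25:
        return 2
    if pct_positive <= 50:
        return 3
    if pct_positive <= 75:
        return 4
    return 5  # 76% to 100%

def ihc_score(pct_positive, intensity):
    """Composite IHC score: extent category (0-5) times intensity (0-3)."""
    assert intensity in (0, 1, 2, 3)
    return staining_extent_score(pct_positive) * intensity
```

Per the Appendix, three cores were taken per case, so the slide-level score would be the average of this composite score across cores.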
Statistics
The goal of this study was to identify prognostic classifiers that predict overall survival, defined as the time between surgery and death or the last follow-up date. Survival distributions were estimated with the Kaplan-Meier method. The relationship between survival and each variable was determined with the log-rank test. Multivariate analysis of prognostic factors was performed using Cox's regression model. A significant difference was declared if the P value from a two-tailed test was less than .01. All statistical analysis was performed using the SPSS 13.0 for Windows software system (SPSS Inc, Chicago, IL).
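As an illustration of the survival analysis, the Kaplan-Meier product-limit estimate can be computed directly. The following Python sketch is our own minimal implementation, not the SPSS procedure the authors used:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier product-limit survival estimate.
    times: follow-up time per patient; events: 1 = death observed, 0 = censored.
    Returns (distinct death times, survival probability at each)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    death_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in death_times:
        at_risk = np.sum(times >= t)                  # still under observation at t
        deaths = np.sum((times == t) & (events == 1))
        s *= 1.0 - deaths / at_risk                   # product-limit update
        surv.append(s)
    return death_times, np.array(surv)
```

For example, with times [1, 2, 2, 3, 4] and events [1, 1, 0, 1, 0], the estimate steps down at times 1, 2, and 3.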
Selection of Cutoff Score for the Kaplan-Meier Method
We selected the cutoff scores based on receiver operating characteristic (ROC) curve analysis.24–26 At each immunomarker score, the sensitivity and specificity for each outcome under study were plotted, thus generating an ROC curve. The score closest to the point with both maximum sensitivity and specificity (ie, the point [0.0, 1.0] on the curve) was selected as the cutoff score, leading to the greatest number of tumors correctly classified as having or not having the clinical outcome. To use ROC curve analysis, the clinicopathologic features and the SVM models were dichotomized: cancer cell type (adenocarcinoma [excluding bronchioloalveolar] or others [bronchioloalveolar + squamous + large-cell + adenosquamous + NSCLC not otherwise specified]), age (≥ 65 years or < 65 years), the SVM models (alive [patient survives > 5 years] or dead [patient dies within 5 years]), and survival (death owing to NSCLC or censored [not followed up, alive, or dead from other causes]).
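The cutoff selection rule described above, choosing the marker score whose ROC point lies closest to (0.0, 1.0), can be sketched as follows. This Python function is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

def roc_optimal_cutoff(scores, outcomes):
    """Pick the marker score whose (FPR, TPR) point lies closest to (0, 1)."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    best_cut, best_dist = None, np.inf
    for cut in np.unique(scores):
        pred = scores >= cut                 # dichotomize at this candidate cutoff
        tpr = np.mean(pred[outcomes == 1])   # sensitivity
        fpr = np.mean(pred[outcomes == 0])   # 1 - specificity
        d = np.hypot(fpr, 1.0 - tpr)         # Euclidean distance to (0, 1)
        if d < best_dist:
            best_dist, best_cut = d, cut
    return best_cut
```

With perfectly separated toy scores [1..6] where outcomes 4 to 6 are positive, the rule returns 4, the score at which sensitivity and specificity are both maximal.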
Prognosis Prediction Using SVM-Based Methods
The SVM was introduced by Vapnik19 for data classification and function approximation. We addressed the prognostic prediction of NSCLC as a two-class classification problem (ie, whether a patient survives more than 5 years). We trained two SVM models for prognosis using two different loss functions. The first model, SVM1, is highly specific, whereas the second, SVM2, has the highest sensitivity. To obtain improved overall performance, SVM1 and SVM2 were combined into a third model, SVM3. In SVM3, the prognosis of a patient is first predicted using SVM1; if the patient is predicted to be at high risk, SVM2 is then used to further predict the prognosis (Fig 2C). Thus the advantages of SVM1 and SVM2 can be integrated. The programs were coded using Matlab software (MathWorks Inc, Natick, MA), and Matlab scripts are available on request. Complete details are provided in the Appendix.
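The SVM3 cascade described above can be sketched as follows. This is a minimal Python illustration using scikit-learn rather than the authors' Matlab implementation; the label convention (1 = predicted high risk) and the use of class weights to obtain a more sensitive second model are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

class CascadeClassifier:
    """SVM3-style cascade: a high-specificity model screens first; only
    patients it flags as high risk are re-examined by a more sensitive model."""
    def __init__(self, specific_model, sensitive_model):
        self.specific = specific_model      # plays the role of SVM1
        self.sensitive = sensitive_model    # plays the role of SVM2
    def fit(self, X, y):
        self.specific.fit(X, y)
        self.sensitive.fit(X, y)
        return self
    def predict(self, X):
        first = self.specific.predict(X)    # first-pass prediction
        out = first.copy()
        flagged = first == 1                # 1 = predicted high risk
        if np.any(flagged):
            # second pass only for patients flagged as high risk
            out[flagged] = self.sensitive.predict(X[flagged])
        return out
```

Usage: `CascadeClassifier(SVC(kernel="linear"), SVC(kernel="linear", class_weight={1: 5.0})).fit(X, y)`, where the class weight biases the second model toward sensitivity.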
RESULTS
Patient Characteristics
Table 1 lists the demographic and clinical characteristics of the patients (and their tumors) used to develop and test the prognostic model. We studied lung cancer tumor tissue from 73 randomly selected patients at the Cancer Center of Sun Yat-Sen University (Guangzhou, People's Republic of China) between January 1990 and October 1998. We validated the immunomarker-SVM–based prognostic classifiers using an independent cohort of 75 randomly selected patients between November 1998 and March 2002 in the same hospital. We verified and updated the survival data in the patient records through May 2007 using the database. The study was approved by the Research Ethics Committee of the Cancer Center of Sun Yat-Sen University.
Thirty-Seven Features and Survival
Table 2 lists the cutoff score for each of the 37 features and the univariate analysis of the 37 features in NSCLC. The ROC curve for each feature (Fig 3) clearly illustrates the point on the curve closest to (0.0, 1.0), which maximizes both sensitivity and specificity for the outcome. Analysis was performed with each training and validation cohort individually and with all patients as a group. When P values less than .01 were considered statistically significant, only p21WAF1 (hazard ratio, 0.1463; 95% CI, 0.0005374 to 0.06914; P < .0001) showed prognostic significance for the validation cohort; CEA (hazard ratio, 0.4058; 95% CI, 0.2339 to 0.7159; P = .0017), EMA (hazard ratio, 0.4158; 95% CI, 0.2641 to 0.8182; P = .0079), and p21WAF1 (hazard ratio, 0.4390; 95% CI, 0.1546 to 0.7454; P = .0071) showed prognostic significance for all patients. CEA (hazard ratio, 0.3196; 95% CI, 0.1621 to 0.7810; P = .01) also showed prognostic significance for the training cohort.
SVM1 and Survival
The SVM1 model integrates two clinicopathologic features (age, cancer cell type) and five immunomarkers (CD34MVD, EMA, p21ras, p21WAF1, and TIMP-2). When analysis was performed with the validation cohort individually, we identified 17 patients with high risk and 58 with low risk according to SVM1. The 5-year survival rates were 17.6% for high-risk patients and 80.8% for low-risk patients (Fig 2A; P < .01 by the log-rank test). The SVM1 model was also strongly associated with overall survival (sensitivity, 56%; specificity, 94%; positive predictive value, 92.4%; negative predictive value, 80%; and overall accuracy, 81.3%). According to Cox multivariate regression analysis, the SVM1 model was significantly associated with death among the 75 patients (Appendix Table A2, online only; hazard ratio for high risk v low risk, 17.605; 95% CI, 3.051 to 101.6; P < .01).
When all patients were analyzed as a group, we identified 41 patients with high risk and 107 patients with low risk according to SVM1. The 5-year survival rates were 7.3% for high-risk patients and 88.7% for low-risk patients (Fig 2A; P < .01 by the log-rank test). The SVM1 model was strongly associated with overall survival (sensitivity, 78%; specificity, 96.9%; positive predictive value, 95.1%; negative predictive value, 88.8%; and overall accuracy, 90.5%). According to Cox multivariate regression analysis, the SVM1 model was significantly associated with death among the 148 patients (Appendix Table A3, online only; hazard ratio for high risk v low risk, 87.618; 95% CI, 28.17 to 272.522; P < .01).
SVM2 and Survival
The SVM2 model integrates two clinicopathologic features (age, cancer cell type) and 19 immunomarkers, including BCL2, caspase-9, CD34MVD, low-molecular-weight cytokeratin, high-molecular-weight cytokeratin, cyclo-oxygenase-2, EMA, HER2, MMP-2, MMP-9, p16, p21ras, p21WAF1, p27kip1, p53, TIMP-1, TIMP-2, VEGF, and β-catenin.
When analysis was performed with the validation cohort individually, we identified 26 patients with high risk and 49 patients with low risk according to SVM2. The 5-year survival rates were 33.6% for high-risk patients and 83.7% for low-risk patients (Fig 2B; P < .01 by the log-rank test). The SVM2 model was also strongly associated with overall survival (sensitivity, 68%; specificity, 82%; positive predictive value, 65.4%; negative predictive value, 83.7%; and overall accuracy, 77.3%). According to Cox multivariate regression analysis, the SVM2 model was significantly associated with death among the 75 patients (Appendix Table A4, online only; hazard ratio for high risk v low risk, 5.336; 95% CI, 1.51 to 18.858; P < .01).
When all patients were analyzed as a group, we identified 50 patients with high risk and 98 patients with low risk according to SVM2. The 5-year survival rates were 17.1% for high-risk patients and 90.8% for low-risk patients (Fig 2B; P < .01 by the log-rank test). The SVM2 model was strongly associated with overall survival (sensitivity, 82%; specificity, 91.8%; positive predictive value, 82%; negative predictive value, 91.8%; and overall accuracy, 88.5%). According to Cox multivariate regression analysis, the SVM2 model was significantly associated with death among the 148 patients (Appendix Table A5, online only; hazard ratio for high risk v low risk, 28.705; 95% CI, 11.642 to 70.777; P < .01).
SVM3 and Survival
The SVM3 model consisted of two SVM models (SVM1 and SVM2). Preclassification was performed with the SVM1 model, based on seven features; the SVM2 model, based on 14 features, was then used to produce the final classification.
When analysis was performed with the validation cohort individually, we identified 29 patients with high risk and 46 patients with low risk according to SVM3. The 5-year survival rates were 37.1% for high-risk patients and 84.8% for low-risk patients (Fig 2C; P < .01 by the log-rank test). The SVM3 model was also strongly associated with overall survival (sensitivity, 72%; specificity, 78%; positive predictive value, 62.1%; negative predictive value, 84.8%; and overall accuracy, 76%). According to Cox multivariate regression analysis, the SVM3 model was significantly associated with death among the 75 patients (Appendix Table A6, online only; hazard ratio for high risk v low risk, 6.726; 95% CI, 1.982 to 22.828; P < .01).
When all patients were analyzed as a group, we identified 50 patients with high risk and 98 patients with low risk according to SVM3. The 5-year survival rates were 20% for high-risk patients and 91.6% for low-risk patients (Fig 2C; P < .01 by the log-rank test). The SVM3 model was strongly associated with overall survival (sensitivity, 84%; specificity, 88.8%; positive predictive value, 79.2%; negative predictive value, 91.6%; and overall accuracy, 87.2%). According to Cox multivariate regression analysis, the SVM3 model was significantly associated with death among the 148 patients (Appendix Table A7, online only; hazard ratio for high risk v low risk, 34.579; 95% CI, 13.557 to 88.196; P < .01).
DISCUSSION
Staging systems for lung cancer that are based on clinical and pathologic findings may have reached their limit of usefulness for predicting outcomes; molecular methods may prove more useful.10,12,26 A number of previous mRNA expression-based prognostic techniques, combined with the use of microarrays or polymerase chain reaction, have been shown to estimate the prognosis of patients with lung cancer more accurately.2–7,9–13,16,18 However, the requirements for fresh or snap-frozen tissue, uncertainties about the reproducibility of complicated molecular biology methods, the lack of independent validation, and expensive examination costs have limited their clinical application, especially in developing countries.5,9 IHC is perhaps the method most readily adaptable to clinical practice, as it is already widely used to guide treatment of patients (eg, estrogen receptor and HER2 in breast cancer). No special processing of tissue samples is needed, and the use of labor-intensive and expensive molecular diagnostic techniques is avoided.10
We measured the expression of candidate prognostic genes by IHC in 73 paraffin-embedded specimens from patients with stage IB NSCLC and then applied SVM-based methods to select and integrate the robust genes with the most powerful prognosis-predicting ability. Three immunomarker-SVM–based prognostic classifiers (SVM1, SVM2, and SVM3) were devised and found to be independent predictors of overall survival. The results were also validated in an independent cohort of 75 patients with stage IB NSCLC.
Although most published studies have focused on poor-prognosis patients who may benefit from adjuvant chemotherapy, it is equally important that prognostic classifiers identify patients with stage IB disease and a good prognosis who may not require further treatment after complete resection. The benefit of adjuvant chemotherapy in patients with stage IB NSCLC is still unclear.27 Our prognostic classifiers provide a new strategy for making optimal clinical decisions. With our prognostic tools, clinicians can weigh high sensitivity against high specificity and select different classifiers according to the therapeutic effectiveness and side effects of the adjuvant therapy under consideration, such as molecularly targeted therapy or chemotherapy. If an adjuvant therapy has limited therapeutic effectiveness and serious side effects, clinicians should prefer high specificity and choose the SVM1 model to minimize the risk of overtreatment. If an adjuvant therapy is highly effective despite serious side effects, clinicians should prefer high sensitivity and choose the SVM3 model to maximize therapeutic benefit.
Comparison of previously published prognostic gene lists with our immunomarker-SVM–based prognostic classifiers showed minimal overlap. One interpretation is that although different gene sets are used as predictors, each tracks a common set of biologic characteristics present in different groups of patients with lung cancer, resulting in similar predictions of outcome.13 It is possible that multiple small NSCLC gene classifiers provide similar prognostic capabilities, especially when they include genes that belong to commonly deregulated pathways in lung carcinogenesis. For example, STAT1, proposed in the five-gene signature, can induce the expression of p21WAF1 and caspases.5,19
Compared with other machine learning algorithms, such as decision trees, artificial neural networks, and nearest-neighbor classifiers, SVM is well suited to classification problems involving high-dimensional data and a limited number of training samples. Another important use of SVM is to select a small set of informative features from all available candidates; we used SVM to extract, from a large number of candidate features, the variables with the strongest ability to refine the prognosis of NSCLC. Moreover, because SVM is robust to noisy data, it can refine the prognosis of stage IB NSCLC successfully. The expression of a single gene does not have enough predictive power, and most of these genes are not independent of one another; only by combining several predictive genes can a satisfactory and reliable prognosis be achieved. With SVM, clinicopathologic features can be combined with the most informative genes to predict patient outcome, while accounting for interactions among genes and the choice of cutoff points for immunomarkers.
In conclusion, the present study supports the use of IHC and SVM-based approaches to provide reliable prognostic prediction for patients with stage IB NSCLC.
AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The author(s) indicated no potential conflicts of interest.
AUTHOR CONTRIBUTIONS
Conception and design: Zhi-Hua Zhu, Bing-Yu Sun, Yun Ma, Tie-Hua Rong
Financial support: Tie-Hua Rong
Administrative support: Tie-Hua Rong
Provision of study materials or patients: Zhi-Hua Zhu, Hao Long, Xu Zhang, Jian-Hua Fu, Lan-Jun Zhang, Xiao-Dong Su, Qiu-Liang Wu, Peng Ling, Ming Chen, Ze-Ming Xie, Yi Hu
Collection and assembly of data: Zhi-Hua Zhu, Yun Ma, Jian-Yong Shao, Hao Long, Xu Zhang, Ze-Ming Xie
Data analysis and interpretation: Zhi-Hua Zhu, Bing-Yu Sun, Yun Ma, Jian-Yong Shao
Manuscript writing: Zhi-Hua Zhu, Bing-Yu Sun, Yun Ma, Jian-Yong Shao
Final approval of manuscript: Zhi-Hua Zhu, Yun Ma, Jian-Yong Shao, Tie-Hua Rong
Acknowledgment
We thank our patients and their families for their willingness to take part in this study.
Appendix
This appendix has been provided by the authors to give readers additional information about their work.
Patient Selection
We used computer-generated random numbers to assign specimens from 148 consecutive patients with stage IB non–small-cell lung cancer (NSCLC) for immunohistochemical (IHC) analysis. Patients were selected based on the following eligibility criteria: histologic proof of NSCLC was required; disease stage had to be T2N0M0 according to the 1997 American Joint Committee on Cancer/International Union Against Cancer staging system; and patients had to be at least 18 years of age. In addition, before surgery there had to be no evidence of metastatic disease as determined by history, physical examination, and blood chemistry analysis. Routine computed tomographic examination of the brain, lung, liver, and adrenal glands to detect occult metastases was required. A radionuclide bone scan was performed only in patients who complained of bone pain or chest pain, had an elevated serum calcium level, or had an elevated serum alkaline phosphatase level. All patients had undergone either lobectomy or pneumonectomy with radical systematic mediastinal lymphadenectomy, and none received adjuvant therapy. Patients were excluded if they had a history of previously treated cancer other than basal or squamous cell carcinoma of the skin or had received preoperative chemotherapy and/or radiotherapy.
Tissue Samples
This study was performed using 148 formalin-fixed, paraffin-embedded tumor samples from patients treated with lobectomy or pneumonectomy at the Cancer Center of Sun Yat-Sen University in Guangzhou, People's Republic of China (January 1990 to March 2002). Samples were collected in the operating room and routinely fixed immediately after collection in 10% neutral buffered formalin for approximately 24 hours at room temperature. After fixation, samples were dehydrated, incubated in xylene, infiltrated with paraffin, and finally embedded in paraffin (Oxford Labware, St Louis, MO).
Tissue Microarrays Construction
Briefly, areas containing viable tumor were marked on the paraffin wax tissue blocks. Duplicate 1.0-mm tissue cores taken from different areas of the same tissue block for each case (three cores per case) were used to construct the tissue microarrays using an arraying machine from Beecher Instruments (Sun Prairie, WI). Array blocks were sectioned to produce 4-μm sections.
Support Vector Machines Model
Classification problems.
In this article we address the prognosis prediction problem of NSCLC as a classification problem in which the input is a vector, called a "pattern," of n components, called "features." The features consist of clinicopathologic variables and immunohistochemical markers, and each pattern corresponds to a patient. We limit ourselves to a two-class classification problem (ie, whether a patient can survive > 5 years). We identify the two classes with the symbols {+} and {−}. A training set of patterns {x1, x2, …, xN} with known class labels {y1, y2, …, yN}, yi ∈ {−1, +1}, is given. The training patterns are used to build a decision function (or discriminant function) D(x), which is a linear or nonlinear function of an input pattern x. New patterns are classified according to the sign of the decision function: x belongs to class {+} if D(x) > 0 and to class {−} if D(x) < 0.
The decision function is a discriminant function of the training patterns. For the linear case, it can be written as D(x) = w · x + b (equation 4), where w is the weight vector and b is the bias value.
However, in most problems, the data set is not "linearly separable" (ie, a linear discriminant function cannot separate it with enough accuracy). In this case, a nonlinear discriminant function should be used: D(x) = w · f(x) + b, where f(·) is a nonlinear function.
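As a concrete illustration of classifying by the sign of the decision function, a linear D(x) = w · x + b can be evaluated as follows (a minimal sketch; in practice w and b would come from training the SVM):

```python
import numpy as np

def decide(x, w, b):
    """Classify pattern x by the sign of D(x) = w . x + b:
    +1 for class {+} when D(x) > 0, -1 for class {-} otherwise."""
    return 1 if np.dot(w, x) + b > 0 else -1
```

For example, with w = (1, −1) and b = 0, the pattern (2, 1) falls on the positive side and (0, 3) on the negative side.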
Space dimensionality reduction and feature selection.
An important problem in prognosis prediction of NSCLC is feature selection. In our case, 37 features are used, some of which may be irrelevant to the prognosis of the patient. Accordingly, we should find ways to reduce the dimensionality M of the feature space to overcome the risk of "overfitting." In our problem, data overfitting would arise if we failed to exclude irrelevant features, because the number of training patterns is comparatively small. In an overfitting situation, one can easily find a decision function that separates the training data (even a linear decision function) but performs poorly on testing data.19,20
Projecting onto the first few principal directions of the data is a method commonly used to reduce feature space dimensionality (Shawe-Taylor J, Cristianini N: Kernel Methods for Pattern Analysis, Cambridge, United Kingdom, Cambridge University Press, 2004). With such a method, new features that are linear combinations of the original features are obtained. One disadvantage of projection methods is that none of the original input features can be discarded. Therefore, in this article, we investigate pruning techniques that eliminate some of the original input features and retain a minimum subset of features that yields the best classification performance. Pruning techniques lend themselves to the applications in which we are interested: to build prognostic tests, it is of practical importance to be able to select a small subset of markers, for reasons of cost effectiveness and ease of verifying the relevance of the selected features.
The problem of feature selection is well known in machine learning. Given a particular classification technique, it is conceivable to select the best subset of features satisfying a given model selection criterion by exhaustive enumeration of all subsets of features. Exhaustive enumeration is impractical for large numbers of features, however, because of the combinatorial explosion of the number of subsets. In this article, a pruning algorithm (Guyon I, Elisseeff A. J Mach Learn Res 3:1157-1182, 2003) is used in combination with the support vector machine (SVM) method to exclude irrelevant features and form an efficient feature set.
Excluding Features Using SVM
The introduction of SVM.
An SVM is a binary classifier trained on a set of labeled patterns called training samples. Let {xi, yi} ∈ RM × {±1} be such a set of training samples, with inputs xi ∈ RM and outputs yi ∈ {±1}. The objective of training an SVM is to find a hyperplane that divides these samples so that all points with the same label lie on the same side of the hyperplane; in other words, to find the w and b of equation 4. After training, we obtain the classifier decision function y = sgn(w · x + b), where sgn stands for the bipolar sign function. The separating hyperplane of the classifier should satisfy yi(w · xi + b) ≥ 1, i = 1, …, N (equation 7).
Among all the separating hyperplanes satisfying equation 7, the one with the maximal distance to the closest point is called the optimal separating hyperplane, which results in optimal generalization. In many practical situations, however, such an ideal hyperplane may not exist. To allow for possible violations of equation 7, slack variables ei ≥ 0 can be introduced into equation 7, and we obtain yi(w · xi + b) ≥ 1 − ei, i = 1, …, N (equation 8).
According to the structural risk minimization inductive principle, the training of an SVM minimizes the guaranteed risk bound min (1/2)||w||² + C Σi ei (equation 9), subject to equation 8, where C is a regularization parameter that balances margin width against training errors.
The above optimization problem (equation 9) applies to a linear recognition problem, but in general a classification problem is nonlinear. To solve a nonlinear classification problem, we can first map the training data to another dot product space F (called the feature space) via a nonlinear map ϕ: RN → F and then perform the computations of equation 9 in F. Two commonly used kernel functions for SVMs are polynomial kernels and Gaussian radial basis function (RBF) kernels (Muller K-R, Mika S, Ratsch G, et al. IEEE Trans Neural Netw 12:181-201, 2001).
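The two kernel families mentioned above can be written out directly; the parameter values below (degree, c, gamma) are illustrative defaults, not values from the study:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """Polynomial kernel K(x, z) = (x . z + c)^degree."""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```

Either function can replace the dot product in equation 9, which is how the nonlinear map ϕ is applied implicitly without ever computing it.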
Recursive Feature Exclusion Using SVM
In practice, it is usually not possible to achieve satisfactory classification with a single feature; better results are obtained by increasing the number of features. Classical feature selection methods select the features that individually classify the training data best. These methods, which include correlation methods and expression ratio methods, eliminate features that are useless for discrimination (noise); however, complementary features that individually do not adequately separate the data can be missed. Therefore, in this article, the pruning method is used to exclude useless features. We first evaluate how well an individual feature contributes to the prognosis prediction (eg, > 5 years v < 5 years), and then all candidate features are ranked by their contribution. To evaluate the contribution of a feature, we exclude it from the original feature set to obtain a reduced feature set; the contribution of the feature can then be measured by the performance of the SVM trained with this reduced set. After ranking the features, we can exclude those with the least contribution to the prognosis prediction.
However, a good criterion for ranking individual features is not necessarily a good criterion for ranking feature subsets: estimating the effect of removing one feature at a time becomes very suboptimal when several features must be removed at once, as is necessary to obtain a small feature subset. This problem can be overcome by using a recursive feature elimination algorithm, in which only the single worst feature is removed at each step.20
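The recursive elimination loop can be sketched as follows. This is an illustrative sketch only: loss_of_subset is a placeholder standing in for the real criterion used in this article (the loss of an SVM trained on the reduced feature set), and the toy criterion in the demonstration, in which only two of the article's markers are assumed informative, is hypothetical.

```python
def recursive_feature_elimination(features, loss_of_subset, n_keep):
    """At each step, drop the single feature whose exclusion increases
    the loss the least (the 'worst' feature), until n_keep remain."""
    kept = list(features)
    while len(kept) > n_keep:
        # loss obtained when each candidate feature is excluded in turn
        losses = {f: loss_of_subset([g for g in kept if g != f]) for f in kept}
        worst = min(losses, key=losses.get)
        kept.remove(worst)
    return kept

# Hypothetical criterion for demonstration: pretend only EMA and p21ras
# carry signal, so the loss is the number of informative markers missing.
informative = {"EMA", "p21ras"}
demo = recursive_feature_elimination(
    ["EMA", "p21ras", "BCL2", "p53"],
    lambda subset: len(informative - set(subset)),
    n_keep=2,
)
```

Because the loop re-scores every remaining feature after each removal, complementary features that survive together are preserved, which is exactly what one-shot individual ranking can miss.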
Data Preprocessing and the Training of SVM
To investigate the possibility of identifying different subgroups of stage IB NSCLC based on clinicopathologic features and immunomarkers using SVM, we performed a set of experiments using the data set described above (148 patients with stage IB NSCLC, each with 37 features). The experiment was to predict whether an individual patient would survive more than 5 years. The whole data set was divided into two parts: the first 73 patients were used for training, and the remaining 75 patients were used for testing. Because previous experimental results showed that our prediction problem is nonlinear, the RBF kernel was used as the kernel of the SVM classifier, and the kernel parameters were determined by the leave-one-out (LOO) cross-validation method.
We evaluate the importance of each marker based on the classification accuracy of the trained SVM classifier. Traditionally, specificity and sensitivity are given equal importance in a statistical model; in some cases, however, sensitivity may be more important than specificity. In this article, therefore, the classification accuracy of each trained SVM classifier is calculated using two different loss functions:

LOSS = Σ_i L(y_p, y), with L(y_p, y) = 0 if y_p = y, and 1 otherwise; (10)

LOSS = Σ_i L(y_p, y), with L(y_p, y) = 0 if y_p = y, 1 if y_p ≠ y and y = −1, and 2 if y_p ≠ y and y = +1; (11)

where y_p and y are the predicted label and actual label, respectively, of a training sample. The difference between these two loss functions is that specificity and sensitivity carry equal weight in the first, whereas in the second, sensitivity is twice as important as specificity. The classification accuracy of each classifier is therefore evaluated using two criteria. The whole procedure is listed as follows:
Algorithm I:
Input: {x_i, y_i} ∈ R^M × {±1}, i = 1, 2, …, N
For t = 1 to M
{
X_t = {x_i,m | m ≠ t} // exclude the t-th feature;
Train the t-th SVM using {X_t, Y};
Compute LOSS(t) using equation 10 or 11 // the loss of the t-th SVM;
}
Output: p = argmin_t LOSS(t) // find the worst feature, ie, the one whose exclusion increases the loss least.
In this article, N = 73 is the number of patients used for training, M is the number of features remaining at the current step, and y_i is the label of the i-th patient. The above algorithm is repeated until the desired number of features remains.
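One pass of Algorithm I can be sketched in Python as below. This is a hedged sketch, not the authors' code: loo_predictions_without(t) is a placeholder for the LOO predictions of an SVM trained with the t-th feature excluded, and the weighted loss assumes, as we read the text, that a misclassified positive sample (y = +1) costs twice a misclassified negative one.

```python
def loss_equal(y_pred, y_true):
    """Loss function 10: false positives and false negatives count equally."""
    return sum(1 for p, y in zip(y_pred, y_true) if p != y)

def loss_weighted(y_pred, y_true):
    """Loss function 11 (assumed form): a misclassified positive sample
    costs 2, making sensitivity twice as important as specificity."""
    return sum((2 if y == 1 else 1) for p, y in zip(y_pred, y_true) if p != y)

def rank_features(n_features, loo_predictions_without, y_true, loss=loss_equal):
    """One pass of Algorithm I: compute LOSS(t) for each excluded feature t;
    the feature whose exclusion yields the smallest loss contributed least
    and is returned as the worst feature."""
    losses = [loss(loo_predictions_without(t), y_true) for t in range(n_features)]
    worst = min(range(n_features), key=lambda t: losses[t])
    return worst, losses
```

Running this pass repeatedly, removing the returned worst feature each time, reproduces the recursive elimination schedule described above.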
Using Algorithm I, two SVM models, SVM1 and SVM2, were trained on two extracted feature subsets. Each model has its own advantage: SVM1 has higher specificity, and SVM2 has higher sensitivity. To improve the performance of the prognosis, we also constructed a hierarchical classification model, SVM3, consisting of SVM1 and SVM2; with this model, the sensitivity of the classifier can be further improved.
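The article does not spell out how SVM1 and SVM2 are combined in the hierarchical model. One plausible reading, sketched below purely under that assumption, is a cascade in which a sample flagged positive by the high-specificity SVM1 is accepted immediately and all remaining samples are re-examined by the high-sensitivity SVM2; a sample is then positive if either model flags it, which can only raise sensitivity relative to SVM1 alone.

```python
def svm3_predict(x, svm1_predict, svm2_predict):
    """Hypothetical two-stage cascade (assumed combination rule):
    trust a positive call from the high-specificity SVM1, otherwise
    defer to the high-sensitivity SVM2."""
    if svm1_predict(x) == 1:
        return 1
    return svm2_predict(x)
```

Under this rule the combined model's false negatives are only those missed by both classifiers, at the cost of inheriting false positives from either.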
Details of the Experiments
In our experiments, the RBF kernel function k(x,x′) = exp(−|x − x′|²/σ) is used because our classification problem is nonlinear. LOO cross-validation was used to determine the optimal parameters of the SVM model (the kernel parameter σ and the regularization parameter C), and the testing error was obtained using the tuned parameters. We carried out a grid search over the region −10 ≤ log2 σ ≤ 10 and −10 ≤ log2 C ≤ 10 with a step size of 0.5 on the log2 scale. Algorithm I is performed 36 times, excluding one feature each time. During training, we evaluated the performance of the SVM by its LOO cross-validation error; we then selected the feature subset with the best LOO cross-validation performance, predicted the labels of the testing samples, and recorded the performance of the trained SVM models on the testing samples.
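The parameter search just described can be sketched as follows. This is a minimal illustration, assuming loo_error_for is a placeholder that trains an RBF-kernel SVM with the given (σ, C) and returns its leave-one-out error; only the log2 grid itself is taken directly from the text.

```python
import itertools

def log2_grid(low=-10.0, high=10.0, step=0.5):
    """Parameter values spaced by `step` on the log2 scale,
    matching the reported search region."""
    n = int(round((high - low) / step)) + 1
    return [2.0 ** (low + i * step) for i in range(n)]

def grid_search(train, loo_error_for, sigma_grid=None, c_grid=None):
    """Pick the (sigma, C) pair minimizing LOO cross-validation error.
    loo_error_for(train, sigma, C) stands in for training the SVM."""
    sigma_grid = sigma_grid or log2_grid()
    c_grid = c_grid or log2_grid()
    return min(itertools.product(sigma_grid, c_grid),
               key=lambda sc: loo_error_for(train, sc[0], sc[1]))
```

With a 0.5 step over [−10, 10] on each axis, the search evaluates 41 × 41 = 1,681 parameter pairs, each scored by a full LOO pass over the 73 training patients.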
Footnotes
- Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
- Received March 8, 2008.
- Accepted September 19, 2008.