We propose AFTNet, a novel network-constrained survival analysis method based on the Weibull accelerated failure time (AFT) model, fitted by a penalized likelihood approach for variable selection and estimation. Under the log-linear representation, the inference problem becomes a structured sparse regression problem in which we explicitly incorporate the correlation patterns among predictors through a double penalty that promotes both sparsity and a grouping effect. Moreover, we establish the theoretical consistency of the AFTNet estimator and present an efficient iterative computational algorithm based on the proximal gradient descent method. Finally, we evaluate the performance of AFTNet on both synthetic and real data examples.
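As a reading aid, the ingredients named in this abstract can be written out as follows. This is only a sketch under stated assumptions: the abstract does not give the penalty, so the elastic-net-style form with a graph Laplacian $L$, the mixing weight $\alpha$, the step size $t$ and the $1/n$ scaling are illustrative choices and not necessarily those of AFTNet.

```latex
% Log-linear (AFT) representation with right censoring:
%   y_i = \log\min(T_i, C_i), \delta_i = event indicator,
%   \varepsilon_i follows the standard extreme-value (minimum Gumbel) law.
\log T_i = x_i^{\top}\beta + \sigma\,\varepsilon_i ,
\qquad z_i = \frac{y_i - x_i^{\top}\beta}{\sigma}

% Weibull AFT log-likelihood
\ell(\beta,\sigma) = \sum_{i=1}^{n} \Big[ \delta_i\,(z_i - \log\sigma) - e^{z_i} \Big]

% Assumed double penalty: lasso (sparsity) + Laplacian quadratic form (grouping)
\min_{\beta,\sigma}\; -\frac{1}{n}\,\ell(\beta,\sigma)
  + \lambda \Big[ \alpha \lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\beta^{\top} L\,\beta \Big]

% Proximal gradient step on \beta, with g the smooth part of the objective
g(\beta) = -\frac{1}{n}\,\ell(\beta,\sigma) + \frac{\lambda(1-\alpha)}{2}\,\beta^{\top} L\,\beta ,
\qquad
\beta^{(k+1)} = S_{t\lambda\alpha}\!\Big( \beta^{(k)} - t\,\nabla g\big(\beta^{(k)}\big) \Big),
\qquad
S_{\tau}(u)_j = \operatorname{sign}(u_j)\,\max\big(\lvert u_j\rvert - \tau,\, 0\big)
```

The soft-thresholding operator $S_\tau$ handles the non-smooth lasso term, while the Laplacian term is smooth and is absorbed into the gradient step.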
Breast cancer is one of the most common invasive tumors and causes high mortality among women. It is highly heterogeneous in its biological and clinical characteristics. Several high-throughput assays have been used to collect genome-wide information for many patients in large collaborative studies. This knowledge has improved our understanding of the biology of the disease and led to new methods of diagnosing and treating it. In particular, systems biology has become a valid approach to obtain better insights into the biological mechanisms of breast cancer. A crucial component of current research lies in identifying novel biomarkers that can predict patient prognosis on the basis of the molecular signature of the tumor sample. However, the high dimension and low sample size of the data greatly increase the difficulty of cancer survival analysis and call for the development of ad hoc statistical methods. In this work, we propose novel screening-network methods that predict patient survival outcome by screening key survival-related genes, and we assess the capability of the proposed approaches using the METABRIC dataset. In particular, we first identify a subset of genes by applying variable screening techniques to gene expression data. Then, we perform Cox regression analysis incorporating network information associated with the selected subset of genes. The novelty of this work lies in the improved prediction of survival responses obtained by applying different types of screening (i.e., biomedical-driven, data-driven, and a combination of the two) before building the network-penalized model. Indeed, the combination of the two screening approaches allows us to use the available biological knowledge on breast cancer and complement it with additional information emerging from the data used for the analysis. Moreover, we illustrate how to extend the proposed approaches to integrate an additional omic layer, such as copy number aberrations, and we show that such strategies can further improve prediction. In conclusion, our approaches make it possible to separate patients into high- and low-risk groups using few potential biomarkers and can therefore help clinicians to provide more precise prognoses and facilitate the subsequent clinical management of patients at risk of disease.
Network penalized approaches
Cox-Regression
Data integration
Omics
International initiatives such as the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) are collecting multiple datasets at different genomic scales with the aim of identifying novel cancer biomarkers and predicting patient survival. Several statistical methods have been applied to analyze such data, among them Cox regression models. Although these models provide a good statistical framework for analyzing omic data, there is still a lack of studies that illustrate the advantages and drawbacks of integrating biological information and selecting groups of biomarkers. In fact, classical Cox regression algorithms focus on the selection of a single biomarker, without taking into account the strong correlation between genes. Network-based Cox regression algorithms overcome these drawbacks, yet they are less widely used within the life science community. In this article, we aim to provide a clear methodological framework for the use of such approaches in order to turn cancer research results into clinical applications. We first discuss the rationale and the practical usage of three recently proposed network-based Cox regression algorithms (i.e., Net-Cox, AdaLnet, and fastcox). Then, we show how to combine existing biological knowledge and available data with such algorithms to identify networks of cancer biomarkers and to estimate patient survival. Finally, we describe in detail a new permutation-based approach to better validate the significance of the selection in terms of cancer gene signatures and pathway/network identification. We illustrate the proposed methodology by means of both simulations and real case studies. Overall, the aim of our work is two-fold: first, to show how network-based Cox regression models can be used to integrate biological knowledge (e.g., multi-omics data) for the analysis of survival data; second, to provide a clear methodological and computational approach for investigating cancer regulatory networks.
cancer
Cox model
high-dimensionality
gene expression
network
regularization
survival
Motivation
Gene expression data from high-throughput assays, such as microarray, are often used to
predict cancer survival. However, available datasets consist of a small number of samples (n patients)
and a large number of gene expression measurements (p predictors). Therefore, the main challenge
is to cope with the high dimensionality, i.e. p >> n, and an appealing recent approach is to use
screening procedures to reduce the size of the feature space to a moderate scale (Wu & Yin 2015,
Song et al. 2014, He et al. 2013). In addition, genes are often co-regulated and their expression
levels are expected to be highly correlated. Genes that are involved in the same biological process
are grouped in pathway structures. In order to incorporate the pathway information of genes,
network-based methods have been applied (Zhang et al. 2013, Sun et al. 2013). Motivated
by the most recent models based on variable screening techniques and integration of pathway
information into penalized Cox methods, we propose a new procedure to obtain more accurate
predictions. First, we identify the high-risk genes by using variable screening techniques and
then, we perform Cox regression analysis integrating network information associated with the
selected high-risk genes. By combining these two approaches, we present a new method to select
important core pathways and genes that are related to the survival outcome and we show the
benefits of our proposal both in simulation and real studies.
Methods
In our study, we combine variable screening techniques and network methods to identify
genes and pathways highly associated with the disease and to better predict patient risk. We
propose a new method for survival analysis based on the following steps. First, (i) we perform
variable screening, such as sure independence screening (Fan et al. 2008) and its extensions
(Gorst-Rasmussen & Scheike 2013, Zhao & Li 2012, Fan et al. 2010), to select the active set of
variables strongly correlated with the survival response, and then (ii) we apply network-based
Cox regression models, such as Net-Cox and AdaLnet, which use a network whose size depends on
the number of selected signature genes to predict survival probability. To build our a priori network
information, we use the human gene functional linkage approach (Huttenhower et al. 2009).
This network contains maps of functional activity and interaction networks in over 200 areas of
human cellular biology, with information from 30,000 genome-scale experiments. The functional
linkage network summarizes information from a variety of biologically informative perspectives:
prediction of protein function and functional modules, cross-talk among biological processes, and
association of novel genes and pathways with known genetic disorders. In particular, our gene
network is built by using the HEFalMp tool to determine the weight w of the edge between two nodes
(i.e. genes). The resulting network consists of a fixed number of unique genes (about 2000
genes), where w takes values in [0,1] and describes how strong the relation between two genes
is. Hence, while the screening methods recruit the features with the best marginal utility to
reduce the dimensionality of the data, the network incorporates the pathway information into
the survival analysis as prior knowledge.
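The two steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the marginal utility is taken to be the score-test statistic of a one-gene Cox model at beta = 0 (a stand-in for the cited screening rules), tied event times are ignored, the weight matrix is random rather than obtained from HEFalMp, and all function names are hypothetical.

```python
import numpy as np

def marginal_cox_score(X, time, event):
    """Score-test statistic (at beta = 0) of a one-gene Cox model, per column of X.
    Assumption: tied times are ignored when forming risk sets."""
    order = np.argsort(-time, kind="stable")      # decreasing time: each risk set is a prefix
    X = X[order]
    event = event[order].astype(bool)
    size = np.arange(1, X.shape[0] + 1)[:, None]  # risk-set sizes
    mean1 = np.cumsum(X, axis=0) / size           # risk-set mean of each gene
    mean2 = np.cumsum(X ** 2, axis=0) / size      # risk-set mean of squares
    U = (X - mean1)[event].sum(axis=0)            # score, summed over observed events
    V = (mean2 - mean1 ** 2)[event].sum(axis=0)   # information, summed over observed events
    return U ** 2 / np.maximum(V, 1e-12)

def screen(X, time, event, d):
    """Step (i): indices of the d genes with the largest marginal utility."""
    stat = marginal_cox_score(X, time, event)
    return np.sort(np.argsort(stat)[-d:])

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W with entries in [0, 1]."""
    deg = W.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

def network_penalty(beta, L):
    """Quadratic form beta' L beta; small when linked genes have similar coefficients."""
    return float(beta @ L @ beta)

# Synthetic usage (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p, d = 200, 50, 5
X = rng.normal(size=(n, p))
T = rng.exponential(1.0 / np.exp(2.0 * X[:, 0]))  # only gene 0 drives the hazard
C = rng.exponential(1.0, size=n)                  # independent censoring times
time, event = np.minimum(T, C), (T <= C).astype(int)

A = rng.uniform(size=(p, p))
W = np.triu(A, 1)
W = W + W.T                                       # stand-in for functional-linkage weights

keep = screen(X, time, event, d)                  # step (i): screened gene indices
L_sub = normalized_laplacian(W[np.ix_(keep, keep)])  # step (ii) input: sub-network Laplacian
```

The Laplacian `L_sub` is the object that a network-penalized Cox solver would consume. The solver itself (for example Net-Cox or AdaLnet) is not reproduced here.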
Results
We combine variable screening procedures and network-penalized Cox models for high-dimensional
survival data aimed at determining pathway structures and biomarkers involved in cancer progression.
By using this approach, it is possible to gain deeper insight into the gene-regulatory
networks and investigate the gene signatures related to cancer survival time in order to understand
how patient features (molecular and clinical information) can influence cancer treatment
and detection. In particular, we show the results obtained in simulation and real cancer studies,
along with screening rules. The simulated data are designed to illustrate two different biological
scenarios. In the first setting, we examine the situation where all genes within the same module
belong to different groups or pathways. In the second one, the pathways are not mutually
independent (as in genomic studies), but the activation of some groups is conditional on other
pathways. We use specificity, sensitivity and the Matthews correlation coefficient to compare the
prediction performance. We also predict patient survival using molecular data from different cancer
types, such as ovarian and breast cancer. We investigate the set of active signature genes and
the corresponding pathways involved in the cancer disease process. Then, using the biological
network as prior information, we fit the network-based Cox model and assess it with Kaplan-Meier
curves and the log-rank test. Overall, this study shows that the new screening-network analysis
is useful for improving
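
The three measures named in the Results (sensitivity, specificity and the Matthews correlation coefficient) could be computed as sketched below. The sketch assumes they are applied to the selected gene set in simulations where the truly active genes are known. The function name and the example index sets are hypothetical.

```python
import math
import numpy as np

def selection_metrics(true_support, selected, p):
    """Sensitivity, specificity and MCC of a selected gene set against the true one.
    true_support and selected are collections of gene indices in range(p)."""
    truth = np.zeros(p, dtype=bool)
    truth[np.asarray(list(true_support), dtype=int)] = True
    pred = np.zeros(p, dtype=bool)
    pred[np.asarray(list(selected), dtype=int)] = True

    tp = int(np.sum(truth & pred))    # active genes that were selected
    tn = int(np.sum(~truth & ~pred))  # inactive genes that were left out
    fp = int(np.sum(~truth & pred))   # inactive genes that were selected
    fn = int(np.sum(truth & ~pred))   # active genes that were missed

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0  # convention: 0 when undefined
    return sensitivity, specificity, mcc

# Example: 10 genes, 4 truly active, 3 of them recovered plus one false positive.
sens, spec, mcc = selection_metrics({0, 1, 2, 3}, {0, 1, 2, 7}, p=10)
```

MCC is useful here because the active set is typically a small fraction of all genes, so a balanced measure is more informative than raw accuracy.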
Gene expression data from high-throughput assays, such as
microarray, are often used to predict cancer survival. Available datasets
consist of a small number of samples (n patients) and a large number of
genes (p predictors). Therefore, the main challenge is to cope with the
high-dimensionality. Moreover, genes are co-regulated and their expression
levels are expected to be highly correlated. To address these
two issues, network-based approaches can be applied. In our analysis,
we compare the most recent network-penalized Cox models for high-dimensional
survival data, aimed at determining pathway structures and
biomarkers involved in cancer progression.
Using these network-based models, we show how to obtain a deeper
understanding of the gene-regulatory networks and investigate the gene
signatures related to prognosis and survival in different types of tumors.
Comparisons are carried out on three different real cancer datasets.
Gene expression data from high-throughput assays, such as microarrays, are often used to predict cancer survival. However, available datasets consist of a small number of samples (n patients) and a large number of gene expression measurements (p predictors). Therefore, the main challenge is to cope with the high dimensionality. Moreover, genes are co-regulated and their expression levels are expected to be highly correlated. To address these two issues, network-based approaches have been proposed. In our analysis, we compare four network-penalized Cox models for high-dimensional survival data, aimed at determining pathway structures and biomarkers involved in cancer progression. Using these network-based models, it is possible to obtain a deeper understanding of the gene-regulatory networks and investigate the gene signatures related to cancer survival time. We evaluate cancer survival prediction to illustrate the benefits and drawbacks of the network techniques and to understand how patient features (i.e. age, gender and comorbidities) can influence cancer treatment, detection and outcome. In particular, we show results obtained on simulated and real cancer datasets using the functional linkage network as prior network information.