Hybrid Causal Feature Selection for Cancer Biomarker Identification from RNA-seq Data.

Xu, Wenwei; Zhang, Hao; Xia, Yewei; Ren, Yixin; Guan, Jihong; Zhou, Shuigeng

ABSTRACT

The discovery of cancer biomarkers advances medical diagnosis and plays an important role in biomedical applications. Most existing data-driven methods identify biomarkers with ranking-based strategies, which generally return a subset or superset of the actual biomarkers. Other, causal feature selection methods are based on Markov Blanket (MB) learning and struggle with the high dimensionality and small sample sizes of the data. In this work, we propose a novel hybrid causal feature selection method, called CAFES, to support large-scale cancer biomarker discovery from real RNA-seq data. Concretely, CAFES first applies a minimal-redundancy, maximal-relevance (mRMR) strategy for dimensionality reduction, which returns a set of candidate features. CAFES then learns the causal skeleton over those features using conditional independence (CI) tests and from it obtains an appropriate superset of the MB of the target variable. Finally, CAFES learns the causal structure of this superset with the DAG-GNN algorithm and extracts the MB of the target variable, whose members are treated as the cancer biomarkers. We evaluate the proposed method on two well-known real RNA-seq datasets that cover both binary and multi-class cases, and compare CAFES with seven recent methods: Semi-HITON-MB, STMB, BAMB, FBED, LCS-FS, EEMB, and EAMB. The results show that CAFES identifies dozens of cancer biomarkers, and that between 1/6 and 1/2 of the discovered biomarkers are confirmed by existing studies to be directly related to the corresponding disease. An advantage of CAFES is that its Recall is significantly higher than that of all the counterparts. This indicates that continuous optimization (DAG-GNN) combined with the causal skeleton returned after feature selection, which can be treated as a conditional independence-based constraint on the optimization problem, is effective for cancer biomarker identification from high-dimensional, low-sample RNA-seq data.
The source code of CAFES is available at https://github.com/Milkteaww/CFS.
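
To make the final step of the abstract concrete, the following minimal sketch shows how a Markov Blanket can be read off a directed acyclic graph once its structure is known: the MB of a target consists of its parents, its children, and the other parents of its children (spouses). This is an illustration only and not the CAFES implementation; the function name `markov_blanket`, the adjacency-matrix convention (`adj[i][j] == 1` means an edge from i to j), and the hand-built toy graph are all assumptions introduced here. In the pipeline described above, the graph would instead come from the learned causal structure.

```python
from typing import List, Set


def markov_blanket(adj: List[List[int]], target: int) -> Set[int]:
    """Return the Markov Blanket of `target` in a DAG.

    Assumed convention (not taken from the paper): adj[i][j] == 1
    means there is a directed edge i -> j.
    """
    n = len(adj)
    # Parents: nodes with an edge into the target.
    parents = {i for i in range(n) if adj[i][target]}
    # Children: nodes the target has an edge into.
    children = {j for j in range(n) if adj[target][j]}
    # Spouses: other parents of the target's children.
    spouses = {
        i
        for child in children
        for i in range(n)
        if adj[i][child] and i != target
    }
    return (parents | children | spouses) - {target}


# Hypothetical toy DAG with five nodes: 0 -> 2, 2 -> 3, 1 -> 3, 4 -> 1.
adj = [[0] * 5 for _ in range(5)]
for src, dst in [(0, 2), (2, 3), (1, 3), (4, 1)]:
    adj[src][dst] = 1

mb = markov_blanket(adj, target=2)
print(sorted(mb))  # → [0, 1, 3]
```

In this toy graph, node 4 is excluded because it reaches the target's neighborhood only through node 1, so it is neither a parent, a child, nor a spouse of node 2.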