Feature selection for cancer classification using microarray gene expression data

Zhong, Wenyan

atmire.migration.oldid	2719
dc.contributor.advisor	Wu, Jingjing
dc.contributor.advisor	Lu, Xuewen
dc.contributor.author	Zhong, Wenyan
dc.date.accessioned	2014-09-29T21:46:18Z
dc.date.available	2014-11-17T08:00:50Z
dc.date.issued	2014-09-29
dc.date.submitted	2014	en
dc.description.abstract	The rapid development of DNA microarray technology enables researchers to measure the expression levels of thousands of genes simultaneously and allows biologists to gain insight into the complex interactions in tumours at the level of gene expression. Its application in cancer studies has shown great success in both diagnosis and elucidating pathological mechanisms. However, DNA microarray data usually contain thousands of genes, most of which prove to be uninformative and redundant. Meanwhile, the small sample sizes of microarray data undermine the diagnostic accuracy of statistical models. Therefore, selecting highly discriminative genes from raw gene expression data can improve the performance of cancer classification and cut down the cost of medical diagnosis. This M.Sc. thesis proposes and investigates a new method of selecting highly discriminative genes for cancer classification based on DNA microarray data. For the two-group classification problem, the Bhattacharyya distance is proposed to measure the dissimilarity in gene expression levels between the two groups. For any particular gene, we calculate the Bhattacharyya distance between the two groups based on the expression levels of that gene. We use the calculated distances, one for each gene, as a criterion to rank all the genes. Finally, a support vector machine is used to obtain the optimal subset of genes achieving the lowest misclassification rate. Compared with two other methods, SWKC (supervised weighted kernel clustering) (Shim et al., 2009) and SVM-RFE (support vector machine with recursive feature elimination) (Guyon et al., 2002), the proposed method is shown to be more effective and more sensitive to differentially expressed genes. In the simulation study, the proposed method has a much higher recovery rate than the other two methods.
The three gene selection methods are also compared on two publicly available real DNA microarray datasets, the colon dataset and the leukemia dataset. Based on three classification performance indexes, i.e. average number of genes selected, average number of classification errors in the test set and misclassification rate, the proposed method achieves slightly better classification results than SVM-RFE on the colon dataset at a much lower computational cost. It also achieves better classification results than the SWKC method on both datasets. Finally, we discuss how future work could improve performance by introducing kernel density estimators and replacing the Bhattacharyya distance with the Hellinger distance as the feature selection criterion. Because kernel density estimation is free of distributional assumptions, the resulting classification would be more robust than that obtained with the Bhattacharyya distance under the normality assumption.	en_US
dc.identifier.citation	Zhong, W. (2014). Feature selection for cancer classification using microarray gene expression data (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/26170	en_US
dc.identifier.doi	http://dx.doi.org/10.11575/PRISM/26170
dc.identifier.uri	http://hdl.handle.net/11023/1846
dc.language.iso	eng
dc.publisher.faculty	Graduate Studies
dc.publisher.institution	University of Calgary	en
dc.publisher.place	Calgary	en
dc.rights	University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
dc.subject	Statistics
dc.subject.classification	feature selection	en_US
dc.subject.classification	microarray	en_US
dc.subject.classification	Classification	en_US
dc.title	Feature selection for cancer classification using microarray gene expression data
dc.type	master thesis
thesis.degree.discipline	Mathematics and Statistics
thesis.degree.grantor	University of Calgary
thesis.degree.name	Master of Science (MSc)
ucalgary.item.requestcopy	true
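
The gene-ranking step described in the abstract can be illustrated with a minimal sketch. It assumes each gene's expression levels are normally distributed within each group (the abstract mentions a normal assumption) and uses the standard closed form of the Bhattacharyya distance between two univariate normal distributions. The function names `bhattacharyya_normal` and `rank_genes`, the data layout, and the use of sample variances are illustrative assumptions, not taken from the thesis, and the subsequent support vector machine step is not shown.

```python
import math
from statistics import mean, variance


def bhattacharyya_normal(x, y):
    """Bhattacharyya distance between two univariate normals fitted to x and y.

    Assumes both samples have at least two values and non-zero variance
    (a zero variance would cause a division by zero here).
    """
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def rank_genes(expr, labels):
    """Score every gene and return (order, scores).

    expr:   list of samples, each a list of expression values (one per gene)
    labels: list of 0/1 group labels, one per sample
    order:  gene indices sorted from largest to smallest distance
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        group0 = [row[g] for row, lab in zip(expr, labels) if lab == 0]
        group1 = [row[g] for row, lab in zip(expr, labels) if lab == 1]
        scores.append(bhattacharyya_normal(group0, group1))
    order = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return order, scores


# Hypothetical toy data: 6 samples, 3 genes; only gene 1 differs strongly.
expr = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.5],
    [0.9, 4.9, 1.8],
    [1.1, 9.0, 2.2],
    [1.0, 9.2, 2.1],
    [1.3, 8.9, 2.4],
]
labels = [0, 0, 0, 1, 1, 1]
order, scores = rank_genes(expr, labels)
```

In this toy example, `order[0]` is the index of the gene whose two groups are furthest apart. In a full pipeline, the top-ranked genes would then be passed to a classifier such as a support vector machine to choose the subset size.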