Cancer biomarker extraction from gene expression microarray data

Alhajj, Reda
Alshalalfa, Mohammed
2017-12-18
2008
Alshalalfa, M. (2008). Cancer biomarker extraction from gene expression microarray data (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/2304
http://hdl.handle.net/1880/103305
Bibliography: p. 115-124
Some pages are in colour.
Bioinformatics is a new field of science that integrates computer science, mathematics, statistics and biology with the aim of discovering knowledge hidden within biological data. One widely investigated type of biological data is gene expression microarray data. Microarray technology, which can accommodate the whole genome, makes it possible to profile global gene expression patterns in different tissues or samples in a few days, whereas traditional methods may take months. However, analyzing microarray data is challenging because the number of features (genes) is very large relative to the number of samples. Nevertheless, microarrays have been used successfully to study gene expression, which has allowed researchers to investigate different diseases, including cancer. In particular, microarray-based cancer diagnosis has proved efficient and reliable, but the large number of genes makes the data noisy and difficult to handle. Consequently, identifying relevant genes has received considerable attention. In this thesis, we combine biological knowledge with machine learning techniques to propose three methods for extracting the most informative genes for cancer classification. The first method is based on double clustering: we first filter the data with a statistical test and then cluster the data iteratively to find the best number of clusters. The genes closest to the centroids of the resulting clusters proved to have high potential as significant features for sample classification. These genes (one per centroid) are used as input for building a classification model.
The second method is based on an iterative t-test that eliminates noise from the data. The third method is a hybrid approach that combines statistical tests with entropy-based tests. It uses the t-test together with Singular Value Decomposition (SVD) based entropy, and it proved effective because it considers both the feature itself and its effect on the data entropy. This approach is the first to combine entropy and statistical significance for gene ranking. We have also developed an SVD-based gene extraction method for multi-class data; it is only introduced at a high level in this thesis, and the details are left as future work. The reported test results demonstrate the applicability and effectiveness of the three proposed approaches.
Index Terms: Classification, clustering, t-test, singular value decomposition, support vector machine, microarray data, gene expression data, over-expression, under-expression.
xiii, 127 leaves : ill. ; 30 cm.
eng
University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
master thesis
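The abstract's third method pairs a t-test with SVD-based entropy. The record does not give the thesis's exact formulas, so the following is only a minimal sketch under stated assumptions: the entropy is the commonly used normalized entropy of the squared singular values, a gene's contribution is measured by leave-one-out, and the two scores are combined by filtering on a Welch t-statistic and then ranking by entropy contribution. The function names (svd_entropy, entropy_contribution, welch_t, rank_genes) and the combination rule are hypothetical and are not taken from the thesis.

```python
import numpy as np


def svd_entropy(matrix):
    """Normalized entropy of the squared singular values (0 to 1)."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.size == 0:
        return 0.0
    s = np.linalg.svd(matrix, compute_uv=False)
    power = s ** 2
    total = power.sum()
    if total == 0 or len(power) < 2:
        return 0.0
    v = power / total
    v = v[v > 0]  # treat 0 * log(0) as 0
    return float(-(v * np.log(v)).sum() / np.log(len(power)))


def entropy_contribution(expr):
    """Leave-one-out change in SVD entropy for each gene (row)."""
    full = svd_entropy(expr)
    return np.array(
        [full - svd_entropy(np.delete(expr, i, axis=0)) for i in range(expr.shape[0])]
    )


def welch_t(expr, labels):
    """Absolute Welch t-statistic per gene for two classes labelled 0 and 1."""
    labels = np.asarray(labels)
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return np.abs(a.mean(axis=1) - b.mean(axis=1)) / np.maximum(se, 1e-12)


def rank_genes(expr, labels, n_keep):
    """Assumed combination: keep the n_keep genes with the largest t-statistic,
    then order them by decreasing entropy contribution."""
    expr = np.asarray(expr, dtype=float)
    keep = np.argsort(welch_t(expr, labels))[::-1][:n_keep]
    ce = entropy_contribution(expr[keep])
    return keep[np.argsort(ce)[::-1]]
```

Here expr is assumed to be a genes-by-samples array and labels a 0/1 vector with one entry per sample; rank_genes(expr, labels, 50) would return the row indices of 50 candidate genes. The leave-one-out loop recomputes one SVD per gene, so filtering with the t-statistic first keeps the cost manageable on whole-genome arrays.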