Gene Clustering

The ARCHS4 dataset contains gene counts for humans and mice drawn from multiple authoritative sources, including the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA). Its file format lets us access matrix entries programmatically by row and column index, and we used Python 3 to process the data. Given the large size of ARCHS4 and our own interests, we focus mainly on the human gene counts.

Our goal is knowledge discovery: we apply several clustering methods to the dataset, examine the results, compare the performance of each method, and then pose and tackle a classification problem on sample data based on gene counts. Concretely, the first problem is to find interesting partitions of the dataset by clustering. For instance, we may find partitions of samples that have similar gene expression profiles, or partitions of genes that show similar expression patterns across a wide range of samples.

Second, we hope to discover interesting patterns within and between clusters that could help answer real-world questions. For example, suppose we have a large enough set of samples from a cancerous ovarian tissue cell line and from healthy ovarian tissue, and clustering separates them into different clusters with high purity. We can then test whether other machine learning techniques can effectively classify these samples based on gene expression. These are the kinds of questions we hope to explore and answer in stage 3 of the project using supervised machine learning methods.
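To make the two clustering views and the notion of cluster purity concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the project's actual pipeline: the toy count matrix, the gene names, the tissue labels, the log2(count + 1) transform, and the choice of a basic k-means are all hypothetical stand-ins. Purity is taken here to mean the fraction of samples that carry the majority label of the cluster they were assigned to.

```python
import math
from collections import Counter

def log_transform(counts):
    """Apply log2(count + 1) to every entry to dampen very large counts."""
    return [[math.log2(c + 1) for c in row] for row in counts]

def transpose(matrix):
    """Swap rows and columns (genes x samples <-> samples x genes)."""
    return [list(col) for col in zip(*matrix)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Basic Lloyd's algorithm. Centroids start at the first k points so the
    sketch is deterministic; real work would use better seeding and restarts."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def purity(cluster_labels, true_labels):
    """Fraction of items that carry the majority true label of their cluster."""
    matched = 0
    for cluster in set(cluster_labels):
        members = [t for c, t in zip(cluster_labels, true_labels) if c == cluster]
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(true_labels)

# Hypothetical toy matrix: rows are genes, columns are samples.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
tissue = ["tumor", "tumor", "tumor", "healthy", "healthy", "healthy"]
counts = [
    [500, 480, 520, 10, 12, 8],
    [450, 470, 430, 15, 9, 11],
    [5, 7, 6, 300, 310, 290],
    [8, 4, 9, 280, 295, 305],
]

log_counts = log_transform(counts)

# View 1: cluster samples by their expression profile across genes.
sample_clusters = kmeans(transpose(log_counts), k=2)
sample_purity = purity(sample_clusters, tissue)

# View 2: cluster genes by their expression pattern across samples.
gene_clusters = kmeans(log_counts, k=2)

print("sample clusters:", sample_clusters)
print("sample purity:", sample_purity)
print("gene clusters:", dict(zip(gene_names, gene_clusters)))
```

On the real dataset, the same two views would come from slicing the count matrix by row and column index. High sample purity against known tissue labels is the signal that a supervised classifier is worth trying next.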

Read the full paper here

GitHub