SUPREME: A Cancer Subtype Prediction Methodology by Integrating High-Dimensional Biological Datasets

Revision as of 04:38, 17 March 2018 by Brylow


Cancer, the second leading cause of death in the world, is a complex genetic disease. Every cancer patient is unique in terms of disease progression and response to treatment. In recent years, vast numbers of biological datasets from cancer tissues have been generated to better characterize cancer biology. With these efforts, subtypes of some cancer types have been discovered, and tools to predict the subtype of a new patient have been developed. Several of these studies relied on a single type of biological dataset, such as gene expression or DNA methylation, while other tools attempted to integrate various datasets.

In this study, we aim to develop a cancer subtype prediction methodology called SUPREME that integrates multiple types of biological data to discover novel cancer subtypes, predict the subtypes of cancer patients and discover subtype-specific biomarkers. We will test SUPREME on publicly available cancer datasets, such as the breast cancer dataset from The Cancer Genome Atlas project.

Students will work with a team of PhD students and the faculty mentor and contribute to various parts of this project.

Students are expected to be proficient in programming. Experience in molecular biology, basic Linux commands and high-performance computing is preferred, but not required.

Student learning objectives: After this project, students will

  • Have a basic understanding of molecular biology and high-dimensional biological datasets.
  • Be familiar with the R or Python programming language and some bioinformatics libraries in those languages.
  • Learn to gather biological data from public repositories.
  • Build a computational pipeline that pre-processes and integrates high-dimensional biological datasets.
  • Be familiar with data visualization tools to analyze and visualize gene networks.
  • Learn methods to evaluate predictive models by computing true positive rate, false positive rate, precision, recall, ROC curves, etc.
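
The evaluation measures named in the last objective can be illustrated with a short sketch. This is not part of SUPREME. It is a minimal standard-library Python example that assumes a binary setting (one subtype labeled 1, all others labeled 0) and uses made-up labels and scores. The function names `confusion_counts`, `rates`, `roc_points` and `auc` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute true positive rate (recall), false positive rate and precision."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "recall": tpr}


def roc_points(y_true: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Sweep a threshold over the distinct scores and collect (FPR, TPR) points."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        r = rates(y_true, y_pred)
        points.append((r["fpr"], r["tpr"]))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: four patients, two of which belong to the positive subtype.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]  # hypothetical classifier scores

example_curve = roc_points(example_labels, example_scores)
print(example_curve)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(example_curve))  # 0.75
```

For a problem with more than two subtypes, one common option is to compute these measures once per subtype in a one-vs-rest fashion. In practice an established library would normally replace these hand-written helpers.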