- Bucknell University Class of 2018
- Computer Engineering Major
Log 0: Orientation Day - 5/31
Today we met the REU program coordinators and our project mentors. We also got a tour of the labs we will be working in.
I spoke with my faculty mentor, Dr. Serdar Bozdag, about the different projects that he is currently working on. He works in bioinformatics, a field that I am very interested in but have very little experience with. During the first couple of weeks of the program I will probably be reading a lot of background information on molecular biology. I will likely be working on developing a ranked list of the transcription factors that most affect the expression of certain genes, using a computational model.
Log 1: 5/31
Today I met with a PhD student, Duc, who is also working with Dr. Bozdag. He helped me
- Download R and RStudio onto my laptop
- Start a tutorial for learning R
R is one of the primary programming languages used in bioinformatics, so it is important for me to be familiar with it. It is a very high-level language, so I don't think it will take long to learn.
Today the other REU students and I also got a tour of the library and learned how to search the library catalog and the many online databases that the library subscribes to. This information will be helpful when we search for published papers related to our summer research. I checked out four books from the library: Gene Transcription, RNA Motifs and Regulatory Elements, R Programming for Bioinformatics, and Bioinformatics for Biomedical Science and Clinical Applications. I am hoping that these books will help me gain the background knowledge necessary to begin my research project.
Log 2: 6/01
- Continued working on the R tutorial for most of the day
- Listened to a lecture by Dr. Factor on good research practices
Log 3: 6/02
- Finished the R tutorial
- Began reading background information
Today I began reading through the books I checked out on Wednesday. Dr. Bozdag also sent me a pdf of a book chapter called "Molecular Biology for Computer Scientists" for me to read through. I am currently about a third of the way through taking notes on the pdf, and refreshing my memory on all the concepts I learned in my high school biology course. My mentor and I also developed a list of specific goals and milestones for the rest of the summer. This list can be found on my user page.
Log 4: 6/03 and 6/04
- Continued reading through background information
- Reformatted wiki page
- Goals and Milestones are now easily accessible from the 2017 projects list
Log 5: 6/05
- Continued reading through "Molecular Biology for Computer Scientists"
Log 6: 6/06
- Completed ethics training with other REU students
- Finished reading and taking notes on "Molecular Biology for Computer Scientists"
- Met with Dr. Bozdag and Duc to discuss more details of the project and goals for the next few weeks
Over the next week or so, I will be conducting a literature search. Duc has sent me some initial papers to read, as well as a tutorial for some of the main R libraries used for RNA sequencing in bioinformatics.
Log 7: 6/07
- Completed a tutorial going over some major R functions created to help computational biologists model gene expression data
There are several open source R libraries and packages compiled on bioconductor.org, including tutorials and sample data sets for understanding how to use the packages. Today I read through a tutorial for modeling RNA sequencing data using the limma, Glimma, and edgeR libraries. I don't completely understand the details of every function used in the tutorial, but I do now know that it is possible to filter anomalous data, normalize and plot gene expression distributions, and create detailed boxplots, multi-dimensional scaling plots (both static and interactive), mean-variance plots, Venn diagrams, and even heatmaps to highlight statistically significant differences in the data between samples.
Log 8: 6/08
- Created a git repository for this research project
- Wrote a program that runs all of the code in the tutorial on bioconductor.org in separate functions
- Made sure I understood the code and that it is well-commented so I can reference it later
Log 9: 6/09
- Met with Duc to discuss papers I should start with for the literature survey
Duc explained that the key points I should be looking for with each paper are 1) What research question is being asked? 2) What data sets are being used? and 3) How are the results being evaluated? Duc also sent me a number of tutorials so that I can work on starting to download and model gene expression data.
Log 10: 6/12
- Read through and took notes on two research papers, including one review paper which compared several methods both for pre-processing data and evaluating predictions of microRNA - target gene interactions and microRNA - transcription factor - target gene interactions.
Log 11: 6/13
- Read and took notes on three more papers
I will be making a small presentation next Thursday on common techniques for predicting transcription factor - target gene interactions based on what I am learning this week.
Log 12: 6/14
- Continued reading and taking notes on papers, preparing for the presentation
Log 13: 6/15
- Completed all three modules of the Responsible Conduct of Research course on the Collaborative Institutional Training Initiative (CITI) website
Log 14: 6/20
- Continued reading and taking notes on papers, preparing for the presentation
Log 15: 6/21
- Completed a tutorial on the TCGAbiolinks R package on the bioconductor website
I now understand how to search for, download, format, and display the many different types of data available for several tumor types from The Cancer Genome Atlas (TCGA) through the Genomic Data Commons (GDC). Because the files are so large, I can currently only download small portions of the data on my laptop. Once I am able to access a remote server, I will be able to download and analyze the data from several samples.
Log 16: 6/22
- Gained access to server
- Downloaded KIRC (kidney cancer) data from GDC to server
I am using a workflow and code that Duc developed for a separate research project on ceRNA analysis. Before running it, I read through the data download code to make sure I understood everything.
Log 17: 6/23
- Began preprocessing GDC data
I downloaded four types of data from TCGA: mRNA (or gene) expression, miRNA expression, copy number alteration, and DNA methylation data. The first step in preprocessing the data is to find the set of patients with all four types of data.
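Finding the patients with all four data types is a set intersection. A minimal Python sketch of the idea (the project code itself was in R; the patient IDs and dictionary keys below are made up for illustration):

```python
# Hypothetical patient IDs per data type (illustrative values, not real TCGA data).
patients_by_type = {
    "mRNA": {"P01", "P02", "P03", "P04"},
    "miRNA": {"P01", "P02", "P04"},
    "copy_number": {"P01", "P02", "P03", "P04", "P05"},
    "methylation": {"P02", "P04", "P05"},
}


def patients_with_all_types(by_type):
    """Return the sorted patient IDs that appear in every data type."""
    id_sets = list(by_type.values())
    return sorted(set.intersection(*id_sets))


common_patients = patients_with_all_types(patients_by_type)
print(common_patients)
```

Every later table (expression, copy number, methylation) would then be subset to these common patients so that the columns line up.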
Log 18: 6/26
- Attended talk on how to give an effective research talk
- Finished processing data for differential expression analysis
The final processing steps include downloading clinical data, creating expression matrices, and converting barcode names. Each sample is represented by a barcode, which was reformatted to be more easily readable.
Log 19: 6/27
- Researched databases for putative miRNA-target gene interactions
- Downloaded raw interaction data from miRTarBase
- Began preparing research talk for Thursday
miRTarBase is the largest database of experimentally validated miRNA - target gene interactions, containing over 360,000 interactions from several species. After looking over code that Duc wrote to download and process similar data, I started writing an R script to upload the raw data to the server and then process and filter it. The first step in processing the data was to filter out any non-human miRNA or target genes.
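The species filter can be sketched as follows. This is a Python illustration only (my script was in R), and the field names and rows are hypothetical rather than the actual miRTarBase column names:

```python
# Made-up rows standing in for raw interaction records.
raw_rows = [
    {"mirna": "hsa-miR-X", "mirna_species": "Homo sapiens",
     "target": "GENE_A", "target_species": "Homo sapiens"},
    {"mirna": "mmu-miR-Y", "mirna_species": "Mus musculus",
     "target": "GENE_B", "target_species": "Mus musculus"},
    {"mirna": "hsa-miR-Z", "mirna_species": "Homo sapiens",
     "target": "GENE_C", "target_species": "Mus musculus"},
]

HUMAN = "Homo sapiens"


def keep_human_interactions(rows):
    """Keep rows where both the miRNA and the target gene are human."""
    return [
        row for row in rows
        if row["mirna_species"] == HUMAN and row["target_species"] == HUMAN
    ]


human_rows = keep_human_interactions(raw_rows)
print(len(human_rows), "of", len(raw_rows), "interactions kept")
```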
Log 20: 6/28
- Created first draft of research talk and presented to mentors
- Revised research talk for Thursday's presentation
- Continued processing miRTarBase data
Tomorrow morning every student in the MSCS REU will be giving a short (8-minute) presentation on what they have accomplished so far.
Log 21: 6/29
- Presented research talk
- Finished processing miRTarBase data
- Downloaded and processed raw miRNA - mRNA interaction data from TargetScan
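TargetScan is a database of predicted miRNA - target interactions, where predictions are based on sequence features (matches between the miRNA seed region and sites in the target mRNA) rather than on experimental validation.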
Log 22: 6/30
- Integrated miRTarBase and TargetScan datasets
- Began differential expression analysis on miRNA and mRNA
Differential expression analysis first filters out any genes or miRNAs with very low expression (in counts per million) in a majority of samples, and then filters out genes or miRNAs that are not significantly up- or down-regulated in enough samples.
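The low-expression filter can be sketched in Python (the real analysis used R packages). Counts per million divides each raw count by that sample's total count and multiplies by one million. The threshold of 1 CPM, the minimum number of samples, and the toy counts are all assumptions for illustration:

```python
# Toy raw counts: gene -> one count per sample (made-up numbers).
raw_counts = {
    "gene_a": [500, 600, 550],
    "gene_b": [0, 1, 0],
    "gene_c": [499500, 399399, 449450],
}


def counts_per_million(counts):
    """Convert raw counts to CPM using each sample's total count."""
    n_samples = len(next(iter(counts.values())))
    library_sizes = [
        sum(gene_counts[i] for gene_counts in counts.values())
        for i in range(n_samples)
    ]
    return {
        gene: [c / size * 1_000_000 for c, size in zip(gene_counts, library_sizes)]
        for gene, gene_counts in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return [
        gene for gene, values in cpm.items()
        if sum(value > min_cpm for value in values) >= min_samples
    ]


kept_genes = filter_low_expression(raw_counts)
print(kept_genes)
```

Here gene_b is dropped because it passes the threshold in only one of the three samples.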
Log 23: 7/1 and 7/2
- Downloaded and processed raw TF - mRNA interaction data from https://github.com/slowkow/tftargets
This GitHub user has compiled interactions from several databases into one file, which can be downloaded from the above website. I am using data from the TRED, ENCODE, ITFP, and TRRUST databases. Processing these transcription factor interactions was a little more complicated than processing the microRNA interactions because the file is a nested list: one entry per database, each of which is itself a large list holding the target genes for every TF in that database. I had to build a single data frame from these lists, whereas each microRNA file I downloaded was already a data frame.
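Flattening that nested structure into one table looks roughly like this in Python (my actual code was in R; the TF and gene names are placeholders, and only the database names come from the file described above):

```python
# Placeholder data shaped like: database -> TF -> list of target genes.
nested_targets = {
    "TRED": {"TF_A": ["GENE_1", "GENE_2"], "TF_B": ["GENE_2"]},
    "TRRUST": {"TF_A": ["GENE_2", "GENE_3"]},
}


def flatten_tf_targets(nested):
    """Return one (tf, target, source) row per interaction."""
    rows = []
    for source, tf_to_targets in nested.items():
        for tf, targets in tf_to_targets.items():
            for target in targets:
                rows.append((tf, target, source))
    return rows


tf_rows = flatten_tf_targets(nested_targets)
for row in tf_rows:
    print(row)
```

Keeping the source as a third column makes it possible to see later which databases support each interaction.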
Log 24: 7/3
- Finished differential expression analysis
The output of the differential expression analysis is a list of statistics for the mRNA analysis, expression data for the differentially expressed mRNA, statistics for the miRNA analysis, and expression data for the differentially expressed miRNA. I then filtered the other datasets so that copy number alteration and DNA methylation data was only stored for the differentially expressed mRNA.
Log 25: 7/4
I took this day off for the holiday.
Log 26: 7/5
- Began filtering putative interactions using differential expression analysis results
Next week, I will create correlation matrices from the expression data for the differentially expressed mRNA and miRNA. Then I will integrate the correlation data with the putative interactions. In order to do so, I need to create data frames that only contain putative interactions with differentially expressed mRNA and miRNA. I started this process by filtering the miRNA - target gene (or miRNA - mRNA) putative interactions.
Log 27: 7/6
- Finished filtering putative interactions using differential expression analysis results
- Explained steps to download and process TCGA data
- Finished reading KIRC paper
- Met with mentors and rest of Bioinformatics lab to go over progress
Today I first met with Duc to discuss some of the issues I'd been having with filtering the putative interactions data frames. I had been using the "==" operator (which compares values element by element instead of testing membership), and it only returned one regulator - target gene interaction for each differentially expressed regulator, when most of the regulators had multiple interactions. Duc told me about the %in% operator, which makes subsetting data frames in R really easy. Using it allowed me to keep every interaction with a differentially expressed regulator, not just one. I then met with Matt, another undergraduate researcher working on identifying microRNA - Transcription Factor - target gene modules. He is just starting to download data from TCGA, so I walked him through part of that process. I also finished reading the KIRC paper my mentor had sent me. I was supposed to present on the paper in today's meeting, but even though I had read the paper multiple times I still didn't understand many of the more complex biology and genetics concepts. So we spent much of the meeting trying to understand what the authors of the study actually did. Basically, the study identified significant genes and potential regulators in KIRC patients using a combination of DNA methylation, copy number alteration, and sequencing data. The researchers also attempted to predict clinical outcomes based on expression levels of potential oncogenes and tumor suppressors in patient samples. Hopefully my research will identify some of the same genes and regulators as the study did.
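The membership-style filter that %in% performs in R can be illustrated with a short Python sketch (the regulator and gene names are made up). Each interaction is kept if its regulator appears anywhere in the set of differentially expressed regulators:

```python
# Made-up putative interactions: (regulator, target gene).
putative_interactions = [
    ("miR_1", "GENE_A"),
    ("miR_1", "GENE_B"),
    ("miR_2", "GENE_C"),
    ("miR_3", "GENE_A"),
]

# Made-up set of differentially expressed regulators.
de_regulators = {"miR_1", "miR_3"}


def filter_by_regulator(interactions, allowed_regulators):
    """Keep every interaction whose regulator is in the allowed set."""
    return [
        (regulator, target)
        for regulator, target in interactions
        if regulator in allowed_regulators
    ]


filtered_interactions = filter_by_regulator(putative_interactions, de_regulators)
print(filtered_interactions)
```

Both interactions for miR_1 survive, which is exactly what a membership test provides and a one-to-one comparison does not.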
Log 28: 7/7
- Compiled statistics on putative interactions data
- Updated wiki logs
I found the total number of interactions, the number of unique regulators, and the number of unique target genes for both the original putative interactions data sets and the filtered data sets (both miRNA - mRNA and TF - mRNA). I also found the number of target genes with both microRNA and Transcription Factor interactions. I then compiled all of this information into a data table. I realized that I still had a number of duplicate interactions in both the miRNA and TF putative interaction data sets, but the unique function was not removing them because they came from different database sources. So I removed the source column, ran the unique function, and then redid my original data table.
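The duplicate problem is easy to reproduce in a small Python sketch (illustrative rows only; the real work was done with R's unique function). Two rows that differ only in their source column count as distinct until that column is dropped:

```python
# Made-up rows: (regulator, target, source database).
rows_with_source = [
    ("miR_1", "GENE_A", "miRTarBase"),
    ("miR_1", "GENE_A", "TargetScan"),
    ("miR_2", "GENE_B", "TargetScan"),
]


def unique_rows(rows):
    """Remove exact duplicate rows while preserving first-seen order."""
    return list(dict.fromkeys(rows))


def drop_source(rows):
    """Keep only the (regulator, target) part of each row."""
    return [(regulator, target) for regulator, target, _source in rows]


unique_with_source = unique_rows(rows_with_source)
unique_without_source = unique_rows(drop_source(rows_with_source))

print(len(unique_with_source), "unique rows with the source column")
print(len(unique_without_source), "unique rows without it")
```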
Log 29: 7/8 and 7/9
- Researched WGCNA package
Duc told me that the functions I would need to create the correlation matrix were in a package called WGCNA, or weighted gene co-expression network analysis. So this weekend I researched that package and went through some of the online tutorials.
Log 30: 7/10
- Created mRNA - mRNA and miRNA - mRNA correlation matrices
- Added correlation and correlation p-values to miRNA - target gene and TF - target gene interaction data frames
- Explained steps for differential expression analysis
Today I learned that I only needed one function in the WGCNA package to create the kind of correlation matrices I wanted. The matrices I created used the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between the expression levels of two genes or miRNAs, on a scale from -1 to 1. If the two are the same mRNA, for example, then the correlation is 1. The function I used, called corAndPvalue, also creates a table of p-values (or probability values) for each correlation. I then added columns for correlation and correlation p-value to the miRNA - target gene and TF - target gene data frames. While doing this, I realized I had to further filter my TF - target gene data because Transcription Factors are also mRNA, and so I didn't have correlation data for any Transcription Factors that were not in my list of differentially expressed mRNA. Once I re-filtered the TF - target gene data frame, I was able to add all the correlation and p-value data. Also, Matt is now working on running the differential expression analysis, and I answered some of his questions about the process.
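To make the Pearson coefficient concrete, here is a standard-library Python sketch (not the WGCNA code I ran). It computes r for two expression vectors, plus the Student t statistic with n - 2 degrees of freedom that is the usual starting point for a correlation p-value. Converting t into a p-value needs a t-distribution function, which is left out here; the expression values are made up:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """Student t statistic for testing r against zero (n - 2 degrees of freedom)."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up expression levels for a regulator and a target across five samples.
regulator_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
target_expr = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(regulator_expr, target_expr)
t = t_statistic(r, len(regulator_expr))
print("r =", round(r, 4), "t =", round(t, 4))
```

Correlating a vector with itself gives 1, which matches the note above about a gene compared with itself.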
Log 31: 7/11
- Created density plots of correlation between regulators and target genes
- Filtered regulator - target gene interactions by correlation p-value
- Filtered regulator - target gene interactions by correlation
I used the density function to create bell curves of the correlations for the miRNA - target gene and TF - target gene interactions. I then filtered the interactions by p-value (only keeping interactions with correlation p-value < 0.05), and created another graph. I then filtered the interactions by correlation, and compiled information on the number of interactions with correlations above or below a certain value.
Log 32: 7/12
- Created igraph objects of regulator - target genes
- Explained steps for filtering putative interactions based on differential expression analysis results
- Prepared presentation for tomorrow's meeting
This morning I researched the igraph package in R, and figured out how to create igraph objects for miRNA - target gene interactions, TF - target gene interactions, and all regulator - target gene interactions. I now have a directed graph showing microRNA - Transcription Factor - target gene modules, with correlations and p-values for each possible interaction. Tomorrow I will analyze these graphs to find in-hubs (target genes with the most regulators) and out-hubs (regulators with the most target genes). For now, I am preparing a presentation showing how I came up with these modules, how I filtered the putative interactions first using expression data and then using correlation data, and how the data set of putative interactions changed during each of these steps. I have also been teaching Matt about subsetting data frames in R and helping him filter the putative interactions for differentially expressed mRNA and miRNA.
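Finding in-hubs and out-hubs amounts to counting in-degree and out-degree in the directed graph. A small Python sketch of the idea (my graphs were igraph objects in R; the edges below are made up):

```python
from collections import Counter

# Made-up directed edges: (regulator, target gene).
edges = [
    ("miR_1", "GENE_A"),
    ("TF_1", "GENE_A"),
    ("TF_1", "GENE_B"),
    ("TF_1", "GENE_C"),
    ("miR_2", "GENE_A"),
]


def degree_counts(edge_list):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_degree = Counter(target for _regulator, target in edge_list)
    out_degree = Counter(regulator for regulator, _target in edge_list)
    return in_degree, out_degree


in_degree, out_degree = degree_counts(edges)

# most_common(1) returns a one-item list holding the (node, count) pair with the highest count.
top_in_hub = in_degree.most_common(1)[0]
top_out_hub = out_degree.most_common(1)[0]

print("in-hub:", top_in_hub)
print("out-hub:", top_out_hub)
```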
Log 33: 7/13
- Presented on work during the last two weeks
- Found hub genes
- Identified the target gene with the most regulators
- Identified the regulator with the most target genes
- Updated wiki logs
Log 34: 7/14
- Researched layouts for visualization in the igraph package
- Updated wiki logs
Plotting an igraph object with a layout in which all vertices can actually be seen is very difficult, especially for a graph with as many vertices and edges as the interaction network I am working with. Duc is going to show me how to use a Java-based application called Cytoscape, which makes this much easier.
Log 35: 7/17
- Wrote first draft of abstract for final paper
- Compiled statistics on results so far
- Began researching clustering algorithms in igraph package
I compiled summary statistics on both sets of hub genes, and found that over 50% of target genes only have one regulator, and over 25% of regulators only have one target gene. Also, 390, or over 70%, of transcription factors are also target genes. I then identified hub genes that are known cancer genes, and highly correlated interactions with known cancer genes.
Log 36: 7/18
- Ran Infomap clustering algorithm
- Compiled statistics on clusters
- Began creating data structure of target genes
I was having trouble finding tutorials on the community finding algorithms in the igraph package, so Duc showed me a website that provided summaries of each of them. According to those summaries, only one algorithm, Infomap, could be used on directed graphs. After running that algorithm I found 155 clusters. I used some of Duc's code from previous research to compile statistics on each of the clusters into a data table showing the number of miRNAs, TFs, targets, edges, and vertices for each cluster. I also began creating a data structure for the target genes. When completed, it will essentially be a list of all target genes, each of which is another list holding the miRNA regulators and the TF regulators for that particular gene.
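The target gene data structure can be sketched as a nested mapping. This is a Python illustration with made-up names (the real structure is a nested R list): each target gene maps to its miRNA regulators and its TF regulators.

```python
# Made-up interactions: (regulator, regulator type, target gene).
typed_interactions = [
    ("miR_1", "miRNA", "GENE_A"),
    ("TF_1", "TF", "GENE_A"),
    ("miR_2", "miRNA", "GENE_A"),
    ("TF_1", "TF", "GENE_B"),
]


def build_target_index(interactions):
    """Map each target gene to its miRNA regulators and its TF regulators."""
    index = {}
    for regulator, regulator_type, target in interactions:
        entry = index.setdefault(target, {"miRNA": [], "TF": []})
        entry[regulator_type].append(regulator)
    return index


target_index = build_target_index(typed_interactions)
for gene, regulators in target_index.items():
    print(gene, regulators)
```

With this layout, looking up all regulators of one gene is a single lookup instead of a scan over the whole interaction table.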
Log 37: 7/19
- Continued working on first draft of final paper
Log 38: 7/20
- Completed first draft of final paper
Log 39: 7/21
- Began researching potential algorithms for enrichment analysis
Enrichment analysis compares the gene ontology (GO) terms, or the biological processes associated with a particular gene, of all the genes within a set to the gene ontology terms of all the genes in a particular universe. In this case, the gene set is all the target genes in a particular cluster, and the gene universe is the target genes in all clusters. After running enrichment analysis, the best case scenario would be that biological processes associated with cancer, such as cell division and mitosis, are overrepresented or enriched in each of the clusters.
Log 40: 7/22 and 7/23
- Started working on research poster
Log 41: 7/24
- Completed enrichment analysis
- Finished creating target gene data structure
I used some of Duc's code from previous research to apply the enrichGO function to each of the clusters and then combine the results into one data frame. The enrichGO function is part of the clusterProfiler package in R, and uses the hypergeometric test to compute p-values for each possible biological process. At first, I was not seeing any processes enriched in any of the clusters. This was because the clusters that the Infomap algorithm produced were either too large or too small to give accurate enrichment analysis results. The largest cluster had 1,783 vertices, and I was comparing it to a universe of 2,011 target genes, which is much too small a difference. The other clusters had 25 vertices or fewer, and were essentially too small to analyze. So I created an undirected igraph object, and ran a different clustering algorithm where the largest cluster was 306 vertices. I also used all differentially expressed genes as the universe so that the difference in size between the universe and the gene set would be reasonable. After rerunning the enrichment analysis functions, I finally had results.
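The hypergeometric test behind this kind of enrichment analysis can be written out directly. The Python sketch below is not the enrichGO implementation, and the example numbers are invented. It computes the probability of seeing at least k annotated genes in a set of size n drawn from a universe of N genes, K of which carry the annotation:

```python
from math import comb


def enrichment_p_value(k, universe_size, annotated_in_universe, set_size):
    """Upper-tail hypergeometric probability P(X >= k).

    k: annotated genes observed in the gene set
    universe_size: N, total genes in the universe
    annotated_in_universe: K, genes in the universe with the annotation
    set_size: n, genes in the set being tested (for example, one cluster)
    """
    upper = min(annotated_in_universe, set_size)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favorable = sum(
        comb(annotated_in_universe, i)
        * comb(universe_size - annotated_in_universe, set_size - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(universe_size, set_size)


# Invented example: 200 genes in the universe, 20 annotated with a process,
# a cluster of 30 genes containing 8 of the annotated ones.
p_value = enrichment_p_value(8, 200, 20, 30)
print("p-value:", p_value)
```

A small p-value means the annotation shows up in the cluster more often than random sampling from the universe would explain.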
Log 42: 7/25
- Continued working on research poster
- Created visualizations of clusters and extracted modules using Cytoscape
Duc showed me how to get Cytoscape working on my laptop, import graph objects from RStudio, and change attributes in the Cytoscape visualization based on attributes of the graph object, in order to easily display the differences between miRNAs, TFs, and target genes.
Log 43: 7/26
- Finished poster
Log 44: 7/27
- Began working on final draft of paper
- Began working on final 15-minute presentation
Log 45: 7/28
- Continued working on final paper and presentation
Log 46: 7/31
- Completed final presentation
Unfortunately, because both Duc and Dr. Bozdag were out of town on Friday and today, I did not have the tools or time to complete survival analysis. Duc will likely continue the analysis after I complete and leave the REU program.