- Bucknell University Class of 2018
- Computer Engineering Major
Log 0: Orientation Day - 5/31
Today we met the REU program coordinators and our project mentors. We also got a tour of the labs we will be working in.
I spoke with my faculty mentor, Dr. Serdar Bozdag, about the different projects that he is currently working on. He is involved in the field of Bioinformatics, which I am very interested in but have very little experience with. During the first couple of weeks of the program I will probably be reading a lot of background information on molecular biology. I will likely be working on developing a ranked list of the transcription factors that most affect the expression of certain genes using a computational model.
Log 1: 5/31
Today I met with a PhD student, Duc, who is also working with Dr. Bozdag. He helped me
- Download R and RStudio onto my laptop
- Start a tutorial for learning R
R is the primary programming language used in the field of bioinformatics, so it is important for me to be familiar with it. It is a very high-level language, so I don't think it will take long at all to learn.
Today the other REU students and I also got a tour of the library and learned how to search the library catalog and the many online databases that the library subscribes to. This information will be helpful when we search for published papers related to our summer research. I checked out four books from the library: Gene Transcription, RNA Motifs and Regulatory Elements, R Programming for Bioinformatics, and Bioinformatics for Biomedical Science and Clinical Applications. I am hoping that these books will help me gain the background knowledge necessary to begin my research project.
Log 2: 6/01
- Continued working on the R tutorial for most of the day
- Listened to a lecture by Dr. Factor on good research practices
Log 3: 6/02
- Finished the R tutorial
- Began reading background information
Today I began reading through the books I checked out on Wednesday. Dr. Bozdag also sent me a PDF of a book chapter called "Molecular Biology for Computer Scientists" to read through. I am currently about a third of the way through taking notes on the PDF and refreshing my memory of the concepts I learned in my high school biology course. My mentor and I also developed a list of specific goals and milestones for the rest of the summer. This list can be found on my user page.
Log 4: 6/03 and 6/04
- Continued reading through background information
- Reformatted wiki page
- Goals and Milestones are now easily accessible from the 2017 projects list
Log 5: 6/05
- Continued reading through "Molecular Biology for Computer Scientists"
Log 6: 6/06
- Completed ethics training with other REU students
- Finished reading and taking notes on "Molecular Biology for Computer Scientists"
- Met with Dr. Bozdag and Duc to discuss more details of the project and goals for the next few weeks
Over the next week or so, I will be conducting a literature search. Duc has sent me some initial papers to read, as well as a tutorial for some of the main R libraries used for RNA sequencing in bioinformatics.
Log 7: 6/07
- Completed a tutorial going over some major R functions created to help computational biologists model gene expression data
There are several open source R libraries and packages compiled on bioconductor.org, along with tutorials and sample data sets that show how to use them. Today I read through a tutorial for modeling RNA sequencing data using the limma, Glimma, and edgeR libraries. I don't completely understand the details of every function used in the tutorial, but I now know that it is possible to filter out anomalous data, normalize and plot gene expression distributions, and create detailed boxplots, multi-dimensional scaling plots (both static and interactive), mean-variance plots, Venn diagrams, and even heatmaps that highlight statistically significant differences between samples.
Log 8: 6/08
- Created a git repository for this research project
- Wrote a program that runs all of the code in the tutorial on bioconductor.org in separate functions
- Made sure I understood the code and that it is well-commented so I can reference it later
Log 9: 6/09
- Met with Duc to discuss papers I should start with for the literature survey
Duc explained that the key points I should be looking for with each paper are 1) What research question is being asked? 2) What data sets are being used? and 3) How are the results being evaluated? Duc also sent me a number of tutorials so that I can work on starting to download and model gene expression data.
Log 10: 6/12
- Read through and took notes on two research papers, including one review paper which compared several methods both for pre-processing data and evaluating predictions of microRNA - target gene interactions and microRNA - transcription factor - target gene interactions.
Log 11: 6/13
- Read and took notes on three more papers
I will be making a small presentation next Thursday on common techniques for predicting transcription factor - target gene interactions based on what I am learning this week.
Log 12: 6/14
- Continued reading and taking notes on papers, preparing for the presentation
Log 13: 6/15
- Completed all three modules of the Responsible Conduct of Research course on the Collaborative Institutional Training Initiative (CITI) website
Log 14: 6/20
- Continued reading and taking notes on papers, preparing for the presentation
Log 15: 6/21
- Completed a tutorial on the TCGAbiolinks R package on the bioconductor website
I now understand how to search for, download, format, and display the many different types of data from The Cancer Genome Atlas (TCGA) that are available for several tumor types through the Genomic Data Commons (GDC). Because the files are so large, I can currently only download small portions of the data on my laptop. Once I am able to access a remote server, I will be able to download and analyze the data from several samples.
Log 16: 6/22
- Gained access to server
- Downloaded KIRC (kidney cancer) data from GDC to server
I am using a workflow and code that Duc developed for a separate research project on ceRNA analysis. I read through the code to download data to make sure I understood everything before running it.
Log 17: 6/23
- Began preprocessing GDC data
I downloaded four types of data from TCGA: mRNA (or gene) expression, miRNA expression, copy number alteration, and DNA methylation data. The first step in preprocessing the data is to find the set of patients who have all four types of data.
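Finding the patients shared by all four data types amounts to a set intersection. The following is a minimal Python sketch of that idea, not the R workflow used in the project; the patient IDs and the `patients_with_all_types` helper are hypothetical.

```python
# Hypothetical patient IDs per data type (placeholders, not real TCGA IDs).
patients_by_type = {
    "mRNA": {"P01", "P02", "P03", "P04"},
    "miRNA": {"P02", "P03", "P04", "P05"},
    "copy_number": {"P01", "P02", "P03"},
    "methylation": {"P02", "P03", "P06"},
}


def patients_with_all_types(by_type):
    """Return the sorted patient IDs that appear in every data type."""
    return sorted(set.intersection(*by_type.values()))


common_patients = patients_with_all_types(patients_by_type)
print(common_patients)
```

Only the patients in `common_patients` would be carried forward, so every later step sees complete data for each patient.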
Log 18: 6/26
- Attended talk on how to give an effective research talk
- Finished processing data for differential expression analysis
The final processing steps include downloading clinical data, creating expression matrices, and converting barcode names. Each sample is represented by a barcode, which was reformatted to be more easily readable.
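The log does not say how the barcodes were reformatted. One common simplification, assumed here, is to keep only the leading dash-separated fields of a TCGA barcode so that it identifies the patient rather than a single aliquot. The Python sketch below uses a made-up barcode and a hypothetical `shorten_barcode` helper.

```python
def shorten_barcode(barcode, fields=3):
    """Keep the first `fields` dash-separated parts of a barcode.

    With fields=3 this keeps project, tissue source site, and participant,
    which is an assumption about what "more readable" means here.
    """
    return "-".join(barcode.split("-")[:fields])


# Made-up barcode in the TCGA layout; it does not refer to a real sample.
full_barcode = "TCGA-AB-1234-01A-11R-A123-07"
print(shorten_barcode(full_barcode))
print(shorten_barcode(full_barcode, fields=4))
```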
Log 19: 6/27
- Researched databases for putative miRNA-target gene interactions
- Downloaded raw interaction data from miRTarBase
- Began preparing research talk for Thursday
miRTarBase is the largest database of experimentally validated miRNA - target gene interactions, containing over 360,000 interactions from several species. After looking over code that Duc wrote to download and process similar data, I started writing an R script to upload the raw data to the server and then process and filter it. The first step in processing the data was to filter out any non-human miRNA or target genes.
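The species filter can be sketched as follows, in Python rather than the R script described above. The field names (`mirna_species`, `target_species`) and all rows are placeholders chosen for illustration, not the actual miRTarBase column names or records.

```python
HUMAN = "Homo sapiens"

# Placeholder rows; the names are invented for illustration.
interactions = [
    {"mirna": "hsa-miR-X", "mirna_species": "Homo sapiens",
     "target": "GENE_A", "target_species": "Homo sapiens"},
    {"mirna": "mmu-miR-Y", "mirna_species": "Mus musculus",
     "target": "Gene_b", "target_species": "Mus musculus"},
    {"mirna": "hsa-miR-Z", "mirna_species": "Homo sapiens",
     "target": "Gene_c", "target_species": "Mus musculus"},
]


def keep_human(rows):
    """Keep rows where both the miRNA and the target gene are human."""
    return [
        row for row in rows
        if row["mirna_species"] == HUMAN and row["target_species"] == HUMAN
    ]


human_interactions = keep_human(interactions)
print(len(human_interactions), "of", len(interactions), "rows kept")
```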
Log 20: 6/28
- Created first draft of research talk and presented to mentors
- Revised research talk for Thursday's presentation
- Continued processing miRTarBase data
Tomorrow morning every student in the MSCS REU will be giving a short (8-minute) presentation on what they have accomplished so far.
Log 21: 6/29
- Presented research talk
- Finished processing miRTarBase data
- Downloaded and processed raw miRNA - mRNA interaction data from TargetScan
TargetScan is a database of predicted miRNA - target interactions, where predictions are based on sequence matches between the miRNA seed region and the target mRNA rather than on experimental validation.
Log 22: 6/30
- Integrated miRTarBase and TargetScan datasets
- Began differential expression analysis on miRNA and mRNA
Differential expression analysis first filters out any genes or miRNAs with very low expression (in counts per million) in a majority of samples, and then filters out genes or miRNAs that are not significantly up- or down-regulated in enough samples.
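The low-expression filter can be illustrated with a small Python sketch. It is not the edgeR code used in the project: the counts are invented, and the 1 CPM cutoff and the "more than half of the samples" rule are assumptions made for the example.

```python
# Invented raw counts: gene -> one count per sample.
counts = {
    "GENE_A": [500, 600, 550],
    "GENE_B": [0, 1, 0],
    "GENE_C": [499500, 399399, 449450],
}


def counts_per_million(raw):
    """Scale each count by its sample's library size (total counts)."""
    n_samples = len(next(iter(raw.values())))
    lib_sizes = [sum(row[i] for row in raw.values()) for i in range(n_samples)]
    return {
        gene: [c * 1_000_000 / lib for c, lib in zip(row, lib_sizes)]
        for gene, row in raw.items()
    }


def drop_low_expression(cpm, min_cpm=1.0):
    """Drop genes whose CPM is below min_cpm in more than half the samples."""
    kept = {}
    for gene, row in cpm.items():
        n_low = sum(value < min_cpm for value in row)
        if n_low <= len(row) / 2:
            kept[gene] = row
    return kept


cpm = counts_per_million(counts)
expressed = drop_low_expression(cpm)
print(sorted(expressed))
```

The second stage, testing for significant up- or down-regulation, needs a statistical model and is left to packages such as limma and edgeR.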
Log 23: 7/1 and 7/2
- Downloaded and processed raw TF - mRNA interaction data from https://github.com/slowkow/tftargets
This GitHub user has compiled interactions from several databases into one file, which can be downloaded from the above repository. I am using data from the TRED, ENCODE, ITFP, and TRRUST databases. Processing these Transcription Factor interactions was a little more complicated than processing the microRNA interactions because the file is a list with one entry per database, and each entry is itself a large list of the target genes for each TF in that database. I had to build one data frame from these nested lists, whereas each microRNA file I downloaded was already a data frame.
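Flattening a database -> TF -> target genes structure into one table can be sketched like this in Python. The nested dictionary stands in for the R list described above; the TF and gene names are placeholders, and `flatten_tf_targets` is a hypothetical helper.

```python
# Placeholder structure: database -> TF -> list of target genes.
tf_targets = {
    "TRED": {"TF_A": ["GENE_1", "GENE_2"], "TF_B": ["GENE_2"]},
    "TRRUST": {"TF_A": ["GENE_3"]},
}


def flatten_tf_targets(nested):
    """Return one row per (TF, target gene, source database) triple."""
    return [
        {"tf": tf, "target": target, "source": database}
        for database, tfs in nested.items()
        for tf, targets in tfs.items()
        for target in targets
    ]


tf_rows = flatten_tf_targets(tf_targets)
for row in tf_rows:
    print(row["tf"], row["target"], row["source"])
```

Keeping the `source` field makes it possible to trace each interaction back to its database after the lists are merged.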
Log 24: 7/3
- Finished differential expression analysis
The output of the differential expression analysis is a list of statistics for the mRNA analysis, expression data for the differentially expressed mRNA, statistics for the miRNA analysis, and expression data for the differentially expressed miRNA. I then filtered the other datasets so that copy number alteration and DNA methylation data was only stored for the differentially expressed mRNA.
Log 25: 7/4
I took this day off for the holiday.
Log 26: 7/5
- Began filtering putative interactions using differential expression analysis results
Next week, I will create correlation matrices from the expression data for the differentially expressed mRNA and miRNA. Then I will integrate the correlation data with the putative interactions. In order to do so, I need to create data frames that only contain putative interactions with differentially expressed mRNA and miRNA. I started this process by filtering the miRNA - target gene (or miRNA - mRNA) putative interactions.
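The filtering step keeps a putative interaction only when both its regulator and its target gene are differentially expressed. A minimal Python sketch under that assumption, with placeholder names and a hypothetical `filter_to_de` helper:

```python
# Placeholder sets of differentially expressed (DE) miRNA and mRNA.
de_mirna = {"miR_A", "miR_B"}
de_mrna = {"GENE_1", "GENE_3"}

putative = [
    {"regulator": "miR_A", "target": "GENE_1"},
    {"regulator": "miR_A", "target": "GENE_2"},  # target is not DE
    {"regulator": "miR_B", "target": "GENE_3"},
    {"regulator": "miR_C", "target": "GENE_1"},  # regulator is not DE
]


def filter_to_de(rows, de_regulators, de_targets):
    """Keep interactions whose regulator and target are both DE."""
    return [
        row for row in rows
        if row["regulator"] in de_regulators and row["target"] in de_targets
    ]


filtered = filter_to_de(putative, de_mirna, de_mrna)
print(filtered)
```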
Log 27: 7/6
- Finished filtering putative interactions using differential expression analysis results
- Explained steps to download and process TCGA data
- Finished reading KIRC paper
- Met with mentors and rest of Bioinformatics lab to go over progress
Today I first met with Duc to discuss some of the issues I'd been having with filtering the putative interaction data frames. I had been using the "==" operator, which returned only one regulator - target gene interaction for each differentially expressed regulator, even though most of the regulators had multiple interactions. Duc told me about the %in% operator, which tests whether each value belongs to a set and makes subsetting data frames in R really easy. Using it allowed me to get every interaction with a differentially expressed regulator, not just one. I then met with Matt, another undergraduate researcher working on identifying microRNA - Transcription Factor - target gene modules. He is just starting to download data from TCGA, so I walked him through part of that process. I also finished reading the KIRC paper my mentor had sent me. I was supposed to present on the paper in today's meeting, but even though I had read the paper multiple times I still didn't understand many of the more complex biology and genetics concepts. So we spent much of the meeting trying to understand what the authors of the study actually did. Basically, the study identified significant genes and potential regulators in KIRC patients using a combination of DNA methylation, copy number alteration, and sequencing data. The researchers also attempted to predict clinical outcomes based on expression levels of potential oncogenes and tumor suppressors in patient samples. Hopefully my research will identify some of the same genes and regulators as the study did.
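The exact filtering code is not shown in this log, so the following is only one plausible explanation of the difference. In R, "==" between two vectors compares them position by position and recycles the shorter one, while %in% asks whether each value appears anywhere in the other vector. The Python sketch below emulates both behaviors with placeholder names.

```python
from itertools import cycle

# Regulator column of a placeholder interaction table.
regulators = ["miR_A", "miR_A", "miR_B", "miR_C", "miR_B", "miR_C"]
de_regulators = ["miR_A", "miR_B"]

# Emulates R's `regulators == de_regulators`: position-by-position
# comparison, with the shorter vector repeated to the same length.
elementwise = [r == d for r, d in zip(regulators, cycle(de_regulators))]

# Emulates R's `regulators %in% de_regulators`: membership test per value.
de_set = set(de_regulators)
membership = [r in de_set for r in regulators]

print(elementwise)
print(membership)
```

The elementwise version matches a row only when the regulator happens to line up with the same value in the recycled vector, so most valid rows are missed. The membership version keeps every row whose regulator is differentially expressed.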
Log 28: 7/7
- Compiled statistics on putative interactions data
- Updated wiki logs
I found the total number of interactions, the number of unique regulators, and the number of unique target genes for both the original putative interaction datasets and the filtered datasets (both miRNA - mRNA and TF - miRNA). I also found the number of target genes with both microRNA and Transcription Factor interactions. I then compiled all of this information into a data table. I realized that I still had a number of duplicate interactions in both the miRNA and TF putative interaction datasets, but the unique function was not removing them because they came from different database sources. So I removed the source column, ran the unique function, and then redid my original data table.
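The duplicate problem arises because two rows that differ only in their source column are not identical, so a uniqueness check keeps both. A small Python sketch of dropping the source and then deduplicating and summarizing, with placeholder rows and hypothetical helper names:

```python
rows = [
    {"regulator": "miR_A", "target": "GENE_1", "source": "miRTarBase"},
    {"regulator": "miR_A", "target": "GENE_1", "source": "TargetScan"},
    {"regulator": "miR_A", "target": "GENE_2", "source": "TargetScan"},
    {"regulator": "miR_B", "target": "GENE_2", "source": "miRTarBase"},
]


def drop_source_and_dedupe(interactions):
    """Ignore the source column and keep each regulator-target pair once."""
    seen = set()
    unique_rows = []
    for row in interactions:
        pair = (row["regulator"], row["target"])
        if pair not in seen:
            seen.add(pair)
            unique_rows.append({"regulator": pair[0], "target": pair[1]})
    return unique_rows


def summarize(interactions):
    """Count interactions, unique regulators, and unique target genes."""
    return {
        "interactions": len(interactions),
        "regulators": len({row["regulator"] for row in interactions}),
        "targets": len({row["target"] for row in interactions}),
    }


deduped = drop_source_and_dedupe(rows)
summary = summarize(deduped)
print(summary)
```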
Log 29: 7/8 and 7/9
- Researched WGCNA package
Duc told me that the functions I would need to create the correlation matrix were in a package called WGCNA, which stands for weighted gene co-expression network analysis. So this weekend I researched that package and went through some of the online tutorials.
Log 30: 7/10
- Created mRNA - mRNA and miRNA - mRNA correlation matrices
- Added correlation and correlation p-values to miRNA - target gene and TF - target gene interaction data frames
- Explained steps for differential expression analysis
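Each entry of a correlation matrix is the correlation between two expression profiles measured across the same samples. The Python sketch below computes a Pearson correlation by hand on invented expression values; it does not use WGCNA, and it leaves out the p-value, which requires a Student t distribution.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented expression values across four samples.
mirna_expr = {"miR_A": [1.0, 2.0, 3.0, 4.0]}
mrna_expr = {"GENE_1": [8.0, 6.0, 4.0, 2.0], "GENE_2": [1.0, 3.0, 2.0, 4.0]}

# One row per miRNA - mRNA pair, ready to join onto an interaction table.
correlations = {
    (mirna, gene): pearson(m_values, g_values)
    for mirna, m_values in mirna_expr.items()
    for gene, g_values in mrna_expr.items()
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```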
Log 31: 7/11
- Created density plots of correlation between regulators and target genes
- Filtered regulator - target gene interactions by correlation p-value
- Filtered regulator - target gene interactions by correlation
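The two filters can be applied one after the other. The log does not state the cutoffs, so the values below (p-value under 0.05, absolute correlation of at least 0.3) are assumptions, and the rows are invented. This is a Python sketch, not the project's R code.

```python
# Invented rows with a correlation and a correlation p-value per interaction.
scored = [
    {"regulator": "miR_A", "target": "GENE_1", "cor": -0.62, "p_value": 0.001},
    {"regulator": "miR_A", "target": "GENE_2", "cor": -0.10, "p_value": 0.40},
    {"regulator": "TF_A", "target": "GENE_1", "cor": 0.45, "p_value": 0.01},
    {"regulator": "TF_B", "target": "GENE_3", "cor": 0.25, "p_value": 0.03},
]

MAX_P_VALUE = 0.05  # assumed cutoff
MIN_ABS_COR = 0.3   # assumed cutoff

# Step 1: keep interactions whose correlation is statistically significant.
significant = [row for row in scored if row["p_value"] < MAX_P_VALUE]

# Step 2: of those, keep interactions with a strong enough correlation.
strong = [row for row in significant if abs(row["cor"]) >= MIN_ABS_COR]

print(len(scored), len(significant), len(strong))
```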
Log 32: 7/12
- Created igraph objects of regulator - target genes
- Explained steps for filtering putative interactions based on differential expression analysis results
- Prepared presentation for tomorrow's meeting
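A regulator - target gene network is a directed graph with an edge from each regulator to each gene it targets. The sketch below builds such a graph with plain Python dictionaries instead of the R igraph package; the edges and helper names are placeholders.

```python
# Placeholder edges: (regulator, target gene).
edges = [
    ("miR_A", "GENE_1"),
    ("TF_A", "GENE_1"),
    ("TF_A", "GENE_2"),
]


def build_graph(edge_list):
    """Map every node to the set of target genes it regulates."""
    graph = {}
    for regulator, target in edge_list:
        graph.setdefault(regulator, set()).add(target)
        graph.setdefault(target, set())  # targets are nodes too
    return graph


def regulators_of(graph, gene):
    """Return the sorted regulators with an edge pointing at `gene`."""
    return sorted(node for node, targets in graph.items() if gene in targets)


graph = build_graph(edges)
print(sorted(graph["TF_A"]))
print(regulators_of(graph, "GENE_1"))
```

A gene such as `GENE_1` that has both a microRNA and a Transcription Factor pointing at it is a candidate for a shared regulatory module.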