---
title: "Enhancer RNAs Predict Enhancer-Gene Regulatory Links and are Critical for Enhancer Function in Neuronal systems\nSupplemental Code"
author: "Robert A. Phillips III & Jeremy J. Day "
#date: "5/5/2020"
output:
  word_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Mapping Transcriptionally Active Putative Enhancers (TAPEs) to Genes
The goal of this analysis is to identify high-confidence TAPE-gene pairs. To do this, transcription start sites located within 1 Mb upstream or downstream of the center of the TAPE are identified. Then, Pearson's correlations are calculated using counts for the TAPEs and their associated genes.
## Load Libraries
First, all essential libraries are loaded for the analysis.
Now we are going to create a column indicating these as intergenic TAPEs. Then, we are going to create an ID column consisting of the TAPE chromosome, starting position, and ending position. Next, we will rename the first four columns. The start and end will be referred to as the five prime and three prime ends because the SeqMonk software used during the pipeline refers to the 5' end of all genes as the start, regardless of whether that gene is on the - strand. Thus, referring to the start/end as five prime and three prime will allow us to correctly calculate distance later in this workflow. Finally, the center of the TAPE is calculated.
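A minimal sketch of these steps on a toy table is given here. The input column names (`Probe`, `Chromosome`, `Start`, `End`), the renamed columns, the `_` separator in the ID, and all coordinates are illustrative assumptions, not the names or values used in the actual data. The block is shown as plain `r` so that it is not evaluated when the document is knit.

```r
# Toy TAPE table; every name and value here is a hypothetical stand-in.
toy_tapes <- data.frame(
  Probe      = c("TAPE_1", "TAPE_2"),
  Chromosome = c("chr1", "chr2"),
  Start      = c(999000L, 4999000L),
  End        = c(1001000L, 5001000L),
  stringsAsFactors = FALSE
)

# Flag every row as an intergenic TAPE
toy_tapes$Type <- "Intergenic"

# ID built from chromosome, start, and end
toy_tapes$TAPE_ID <- paste(toy_tapes$Chromosome, toy_tapes$Start,
                           toy_tapes$End, sep = "_")

# Rename the first four columns; start/end become five/three prime ends
colnames(toy_tapes)[1:4] <- c("TAPE_Name", "TAPE_Chr",
                              "TAPE_Five_Prime_End", "TAPE_Three_Prime_End")

# Center of the TAPE is the midpoint of its two ends
toy_tapes$TAPE_Center <- (toy_tapes$TAPE_Five_Prime_End +
                            toy_tapes$TAPE_Three_Prime_End) / 2
```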
To map the TAPEs to associated genes, we will use a for loop. As the for loop proceeds, any TAPE that maps to the gene will be input into a large list created below. Here, every element of the list corresponds to a unique gene.
```{r build genes_list}
# Create an empty list to hold the TAPEs annotated to each gene
genes_list <- vector(mode = "list",
                     length = nrow(genes))
# Give every list element a unique name for its gene: a comma-separated string of chr, 5' end, 3' end, gene name, and strand
```
Now we annotate all genes that are within 1 Mbp upstream or downstream of the TAPE. This loops through the genes dataframe, and first asks if the strand of the gene is + or -. If the strand of the gene is positive, all calculations are computed using the five prime end of the gene. If the strand of the gene is negative, all calculations are computed using the three prime end of the gene. This search is gene-centric in that the loop identifies genes that fall within 1 Mbp windows from the center of the TAPE. A progress bar will also print the progress of the loop.
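The loop described above could look roughly like the following sketch. It uses base R only, and the toy tables, their column names, and the `window` variable are assumptions made for illustration; the original loop may differ in its details.

```r
# Toy inputs; all column names and coordinates are hypothetical.
toy_genes <- data.frame(
  Chr             = c("chr1", "chr1", "chr2"),
  Five_Prime_End  = c(1200000L, 2900000L, 400000L),
  Three_Prime_End = c(1250000L, 2950000L, 450000L),
  Gene            = c("GeneA", "GeneB", "GeneC"),
  Strand          = c("+", "-", "+"),
  stringsAsFactors = FALSE
)
toy_tapes <- data.frame(
  TAPE_Chr    = c("chr1", "chr1", "chr2"),
  TAPE_Center = c(1000000, 2000000, 5000000),
  stringsAsFactors = FALSE
)

window <- 1e6  # 1 Mbp on either side of the TSS

# One list element per gene, filled only when at least one TAPE is in range
toy_list <- vector(mode = "list", length = nrow(toy_genes))
names(toy_list) <- toy_genes$Gene

pb <- txtProgressBar(min = 0, max = nrow(toy_genes), style = 3)
for (i in seq_len(nrow(toy_genes))) {
  # SeqMonk-style coordinates: the TSS is the five prime end for + strand
  # genes and the three prime end for - strand genes
  tss <- if (toy_genes$Strand[i] == "+") {
    toy_genes$Five_Prime_End[i]
  } else {
    toy_genes$Three_Prime_End[i]
  }
  hits <- toy_tapes[toy_tapes$TAPE_Chr == toy_genes$Chr[i] &
                      abs(toy_tapes$TAPE_Center - tss) <= window, ]
  # Leave the element NULL when no TAPE maps to this gene
  if (nrow(hits) > 0) {
    toy_list[[i]] <- hits
  }
  setTxtProgressBar(pb, i)
}
close(pb)
```

In this toy example `GeneA` receives two TAPEs, `GeneB` receives one, and `GeneC` stays `NULL` because its only same-chromosome TAPE is more than 1 Mbp from its TSS.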
Next, we identify any genes in which there were no associated TAPEs. These empty elements are then removed from the list.
```{r Identify empty elements}
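# Empty vector for identification of empty list elements
x <- integer(0)
# Run loop and record the index of every NULL element
for (i in seq_along(genes_list)) {
  if (is.null(genes_list[[i]])) {
    x <- append(x = x, values = i)
  }
}
# Remove genes with no TAPEs. The guard is needed because
# genes_list[-integer(0)] would return an empty list.
if (length(x) > 0) {
  genes_list <- genes_list[-x]
}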
```
This for loop adds a column to every dataframe in the list that indicates the gene name, strand, chr, 5' end, and 3' end. The list is then unlisted to create a large dataframe that can be exported. Finally, a sanity check is run to make sure that no rows are duplicated.
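One possible way to carry out these three steps in base R is sketched here. The list contents, the `Gene_Info` column name, and the use of `do.call(rbind, ...)` to flatten the list are assumptions for illustration; the element names follow the comma-separated convention described earlier.

```r
# Toy list: one data frame of TAPEs per gene (hypothetical values)
toy_list <- list(
  "chr1,1200000,1250000,GeneA,+" = data.frame(
    TAPE_ID = c("chr1_999000_1001000", "chr1_1999000_2001000"),
    stringsAsFactors = FALSE
  ),
  "chr1,2900000,2950000,GeneB,-" = data.frame(
    TAPE_ID = "chr1_1999000_2001000",
    stringsAsFactors = FALSE
  )
)

# Add the gene information (taken from the element name) to every data frame
for (i in seq_along(toy_list)) {
  toy_list[[i]]$Gene_Info <- names(toy_list)[i]
}

# Stack all per-gene data frames into one large data frame
toy_pairs <- do.call(rbind, toy_list)
rownames(toy_pairs) <- NULL

# Sanity check: anyDuplicated() returns 0 when no row is repeated
stopifnot(anyDuplicated(toy_pairs) == 0)
```

The same TAPE can appear in several rows as long as it is paired with different genes, so the check is on whole rows, not on `TAPE_ID` alone.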
Here distance is calculated. If the gene is on the + strand, the TSS is the five prime end of the gene. If the gene is on the - strand, the TSS is the three prime end of the gene.
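As a rough sketch of this calculation, the TSS can be chosen per row with `ifelse()` and the distance taken from the TAPE center. The `Gene_Strand`, `TAPE_Center`, `TSS`, `Distance`, and `Orientation` column names, the toy values, and the definition of orientation (upstream or downstream relative to the direction of transcription) are assumptions for illustration.

```r
# Toy TAPE-gene pairs (hypothetical values)
toy_pairs <- data.frame(
  Gene_Strand          = c("+", "-"),
  Gene_Five_Prime_End  = c(1200000, 2900000),
  Gene_Three_Prime_End = c(1250000, 2950000),
  TAPE_Center          = c(1000000, 2000000),
  stringsAsFactors = FALSE
)

# TSS: five prime end for + strand genes, three prime end for - strand genes
toy_pairs$TSS <- ifelse(toy_pairs$Gene_Strand == "+",
                        toy_pairs$Gene_Five_Prime_End,
                        toy_pairs$Gene_Three_Prime_End)

# Signed offset in the direction of transcription: negative means upstream
strand_sign <- ifelse(toy_pairs$Gene_Strand == "+", 1, -1)
signed_offset <- (toy_pairs$TAPE_Center - toy_pairs$TSS) * strand_sign

toy_pairs$Distance <- abs(signed_offset)
toy_pairs$Orientation <- ifelse(signed_offset < 0, "Upstream", "Downstream")
```

Here the first TAPE lies 200,000 bp upstream of a + strand TSS, and the second lies 950,000 bp downstream of a - strand TSS, because lower coordinates are downstream for a gene transcribed on the - strand.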
```{r Make TSS column}
# Calculate distance and orientation based on strand
# To calculate the distance, the TSS for each gene must be identified. The TSS depends on the gene's strand: a + strand gene's TSS is in the Gene_Five_Prime_End column, and a - strand gene's TSS is in the Gene_Three_Prime_End column
```
To calculate correlations we need TAPE and gene counts. The TAPE counts are already within the TAPEs dataframe. Here, we load in mRNA count information. Next, we keep only genes in which there is one annotation. Finally, only useful columns are kept.
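The filtering step might be sketched as follows, under the assumption that "one annotation" means a gene name that occurs exactly once in the count table. The `Probe`, `Description`, `S1`, and `S2` columns and their values are hypothetical.

```r
# Toy mRNA count table (hypothetical names and values)
toy_counts <- data.frame(
  Probe       = c("GeneA", "GeneB", "GeneB", "GeneC"),
  Description = c("a", "b1", "b2", "c"),
  S1          = c(10, 4, 7, 0),
  S2          = c(12, 5, 9, 0),
  stringsAsFactors = FALSE
)

# Count how many rows each gene name has and keep names seen exactly once
n_annotations <- table(toy_counts$Probe)
single <- names(n_annotations)[n_annotations == 1]

# Keep single-annotation genes and only the columns needed downstream
toy_counts <- toy_counts[toy_counts$Probe %in% single, c("Probe", "S1", "S2")]
```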
```{r}
# Read in the gene probe counts identified with SeqMonk
```
Some of the genes have count values of 0 for every sample across all cell types. This results in a correlation value of 0. Thus, these TAPE-gene pairs are removed, leaving us with 388,605 potential TAPE-gene pairs.
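A small sketch of the correlation and filtering step on toy count matrices is given here. The matrix layout (features in rows, samples in columns) and all names and values are assumptions. Note that in base R, `cor()` returns `NA` with a warning when one of the vectors is constant, so this sketch removes all-zero genes by inspecting the counts directly before computing correlations.

```r
# Toy count matrices: rows are features, columns are samples (hypothetical)
toy_tape_counts <- matrix(c(5, 10, 15, 20,
                            8,  6,  4,  2),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("TAPE1", "TAPE2"),
                                          paste0("S", 1:4)))
toy_gene_counts <- matrix(c(1, 2, 3, 4,
                            0, 0, 0, 0),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("GeneA", "GeneB"),
                                          paste0("S", 1:4)))
toy_pairs <- data.frame(
  TAPE = c("TAPE1", "TAPE2", "TAPE1"),
  Gene = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Drop pairs whose gene has a count of 0 in every sample
all_zero <- rowSums(toy_gene_counts != 0) == 0
toy_pairs <- toy_pairs[!all_zero[toy_pairs$Gene], ]

# Pearson's correlation between TAPE counts and gene counts for each pair
toy_pairs$Pearson_r <- mapply(
  function(tape, gene) {
    cor(toy_tape_counts[tape, ], toy_gene_counts[gene, ], method = "pearson")
  },
  toy_pairs$TAPE, toy_pairs$Gene,
  USE.NAMES = FALSE
)
```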