Introduction to R for bioinformatics (part 1)

Variables, assignments, and basic arithmetic

Variable names usually begin with a letter and should be descriptive.
The assignment operator "<-" stores the value on its right-hand side in the variable named on its left-hand side.
You can also assign the value of one variable to another variable.

experimental_condition <- "KO" #assign character strings
num_animals <- 6 # assign numbers
replicates <- num_animals #now the variable "replicates" stores the number 6.  
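
As a small illustration of the last point, the sketch below shows that assigning one variable to another copies the value at that moment, so changing the original afterwards leaves the copy alone. The class function is not used elsewhere in this post; it is included here only to show what type of value a variable holds.

```r
num_animals <- 6
replicates <- num_animals        # replicates now holds 6

num_animals <- 8                 # change the original variable
replicates                       # still 6: the copy is unaffected

class(num_animals)               # "numeric"
experimental_condition <- "KO"
class(experimental_condition)    # "character"
```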

R can be used as a calculator.
You can store calculation results in variables for later use.

6 + 3 
6 - 3
6 / 3.1
6 * 4.3
6^3 #6 to the power of 3, equivalent to 6 * 6 * 6

result <- log2(8)
result
result/1.2
result/num_animals #result is 3, num_animals is 6

Vector, matrix and dataframe

Vectors

Vectors store values of the same type (all numbers, all characters etc.). A vector is created with the c function, with its elements listed inside the parentheses and separated by commas.
You can access a specific element of a vector by its position. R counts positions starting from 1.

conditions <- c("KO","KO","KO","WT","WT","WT")
batch_sizes <- c(6, 5, 6, 7, 9)
batch_sizes[4]
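
Position-based access extends beyond a single element. The sketch below reuses batch_sizes from above to show a few common forms; the vector named mixed is a hypothetical example added to show what happens when types are combined.

```r
batch_sizes <- c(6, 5, 6, 7, 9)

length(batch_sizes)      # number of elements: 5
batch_sizes[1]           # first element: 6
batch_sizes[c(2, 4)]     # elements at positions 2 and 4: 5 7
batch_sizes[2:4]         # positions 2 through 4: 5 6 7
batch_sizes[-1]          # everything except the first element: 5 6 7 9

# A vector holds a single type, so mixing numbers and text
# turns every element into a character string.
mixed <- c(6, "KO")
class(mixed)             # "character"
```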

Matrix

Vectors are one-dimensional; matrices are two-dimensional.
You can create a matrix with the matrix function.
You can also reshape a vector into a matrix.

new_matrix <- matrix(0, nrow= 4, ncol=6)
new_matrix

#assign values to matrix:
new_matrix[1,3]<-1.4
new_matrix

#reshape vectors into matrix
input_vector<-seq(from=1, to=12, by=1)
new_matrix <- matrix(input_vector,nrow=3, ncol = 4, byrow = TRUE) #fill the matrix with the input_vector values row by row
new_matrix

new_matrix <- matrix(input_vector,nrow=2, ncol = 6, byrow = FALSE) #fill the matrix with the input_vector values column by column
new_matrix
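
To make the effect of byrow easier to see, the sketch below fills two matrices of the same shape from the same vector and then picks out rows, columns, and single cells. The names by_row and by_col are chosen for this illustration only.

```r
input_vector <- seq(from = 1, to = 12, by = 1)

by_row <- matrix(input_vector, nrow = 3, ncol = 4, byrow = TRUE)
by_col <- matrix(input_vector, nrow = 3, ncol = 4, byrow = FALSE)

by_row[1, ]     # first row when filled row by row: 1 2 3 4
by_col[1, ]     # first row when filled column by column: 1 4 7 10

dim(by_row)     # number of rows, then number of columns: 3 4
by_row[2, 3]    # single cell at row 2, column 3: 7
by_row[, 2]     # whole second column: 2 6 10
```

Leaving the row or the column position empty, as in by_row[1, ] or by_row[, 2], selects the entire row or column.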

Data frames

Data frames are R’s equivalent to Excel spreadsheets: you can store numbers, names, dates etc. in the same table.

sample_info<-data.frame(id=1:6,condition=c("KO","KO","KO","WT","WT","WT"), weight = c(2.3,2.5,2.1, 4.6,3.8, 4.5))
sample_info

#show a specific column using the dollar sign:
sample_info$condition

#change values in specific columns with dollar sign and position index:
sample_info$weight
sample_info$weight[2]<-3.1

#note that position 2 is changed from 2.5 to 3.1
sample_info$weight 
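
Columns can also be used to pick out rows or to build new columns. The following is a minimal sketch that rebuilds sample_info with its original weights; the names ko_samples and heavy, and the cutoff of 3, are made up for this example.

```r
sample_info <- data.frame(id = 1:6,
                          condition = c("KO", "KO", "KO", "WT", "WT", "WT"),
                          weight = c(2.3, 2.5, 2.1, 4.6, 3.8, 4.5),
                          stringsAsFactors = FALSE)

str(sample_info)   # compact overview: column names, types, first values

# Keep only the rows whose condition is "KO" (all columns are kept)
ko_samples <- sample_info[sample_info$condition == "KO", ]
ko_samples
mean(ko_samples$weight)   # average weight of the KO samples

# Add a new TRUE/FALSE column computed from an existing one
sample_info$heavy <- sample_info$weight > 3
sample_info
```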

Getting data into and out of R

We rarely create data frames by hand inside R. More often, a data frame is what R gives us when we read a file into R.
A commonly used function for this is read.table, with the following arguments:

  1. header=T tells R that your file has column names. R is case-sensitive, so the argument name must be written in lowercase.
  2. sep="\t" tells R that tabs separate the columns (another example: sep="," for csv files).
  3. stringsAsFactors = F tells R NOT to convert text columns, such as your gene names/symbols, into factors.
  4. skip=2 tells R to skip the first two lines of the file (open the file to see why).

first<-read.table("370660.gene_tpm.gct",header=T,sep="\t",stringsAsFactors = F,skip = 2)
head(first, n=10) #check first 10 rows
tail(first, n=10) #check last 10 rows
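
If you do not have the file at hand, the four arguments can be tried on a tiny stand-in. The sketch below writes a made-up file with two extra lines before the column names into a temporary location and reads it back; the file contents, gene names, and the names demo_file and demo are all hypothetical and are not taken from the real data.

```r
# Create a small tab-separated file with two lines to skip (made-up content)
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "file_version_1",
  "3\t1",
  "Name\tDescription\tsample_1",
  "gene_a\tGENEA\t10.5",
  "gene_b\tGENEB\t0",
  "gene_c\tGENEC\t3.2"
), demo_file)

# skip = 2 drops the first two lines; the third line becomes the header
demo <- read.table(demo_file, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE, skip = 2)
demo
str(demo)   # Name and Description are character, sample_1 is numeric
```

Without skip = 2, R would try to use the first line as the header and would fail or misread the table, because the first lines do not have the same number of fields as the rest.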
    

You can check the dimensions of a data frame and compute summary statistics for a column.

dim(first)
nrow(first)
ncol(first)
summary(first$X370660)

#select specific values:
first[10:20,c(1,3)] #10:20: the 10th to 20th row; c(1,3): the 1st and 3rd column

You can write the data frames you created or calculated in R to a file in your folder for later use. To save as a text file, use write.table.
To save as an Excel file, load the openxlsx library and use write.xlsx.

#write the first 100 rows and the 1st and 2nd columns of the "first" data frame into a tab-separated text file. Please examine the resulting file.
write.table(first[1:100,1:2],file="example_output_text.txt",row.names = FALSE, col.names = TRUE,quote=F,sep="\t")
#write the whole data frame to text file
write.table(first,file="example_output_text_whole.txt",row.names = FALSE, col.names = TRUE,quote=F,sep="\t")

library(openxlsx)
write.xlsx(sample_info,file="sample_info_test_file.xlsx")

You can also use read.xlsx from the openxlsx library to read an Excel file into R.

sample_info_from_excel <- read.xlsx("sample_info_test_file.xlsx",sheet=1) #read the first sheet. If you have multiple sheets, you can specify which one to read, either by sheet number (sheet = 3) or by sheet name (sheet = "Table S3")
sample_info_from_excel
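
A quick way to convince yourself that a text file written with write.table can be read back unchanged is a round trip. The sketch below uses only base R and a temporary file, so nothing is left in your folder; out_file and sample_info_again are names chosen for this example.

```r
sample_info <- data.frame(id = 1:6,
                          condition = c("KO", "KO", "KO", "WT", "WT", "WT"),
                          weight = c(2.3, 2.5, 2.1, 4.6, 3.8, 4.5),
                          stringsAsFactors = FALSE)

# Write to a temporary tab-separated file
out_file <- tempfile(fileext = ".txt")
write.table(sample_info, file = out_file, row.names = FALSE,
            col.names = TRUE, quote = FALSE, sep = "\t")

readLines(out_file, n = 2)   # header line and first data row, as plain text

# Read the file back and compare with the original data frame
sample_info_again <- read.table(out_file, header = TRUE, sep = "\t",
                                stringsAsFactors = FALSE)
all.equal(sample_info, sample_info_again)   # TRUE if nothing changed
```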

For loops

A for loop repeats a block of code once for each element of a sequence.

a <- c(1, 3, 10, -1, 8, 3.2, 5.1)
for (i in seq_along(a)){ #seq_along(a) gives the positions 1, 2, ..., length(a)
  if (a[i] > 4){
    print(a[i])
    print("I am bigger than 4")
  }
}
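
For simple element-by-element checks like the one above, R can often do the same work without an explicit loop, because comparisons apply to every element of a vector at once. The following sketch shows that alternative for the same vector; the name bigger_than_4 is introduced here for illustration.

```r
a <- c(1, 3, 10, -1, 8, 3.2, 5.1)

a > 4                      # one TRUE/FALSE per element
bigger_than_4 <- a[a > 4]  # keep only the elements where the test is TRUE
bigger_than_4              # 10, 8 and 5.1
sum(a > 4)                 # how many elements are bigger than 4: 3
```

The loop remains useful when each step does more than a single comparison, for example printing messages or reading one file per iteration.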

Statistical tests

The t-test, run with the t.test function, is one of the most commonly used statistical tests.

#test whether the expression of the first 100 genes and of genes 1000 to 1100 have equal means
t.test(first[1:100,3],first[1000:1100,3]) 

#you can store t-test results and extract a specific field. 
t_result <- t.test(first[1:100,3],first[1000:1100,3]) 
names(t_result) #which individual results does the t-test return?
t_result$p.value #extract only the p-value
t_result$conf.int #extract only the confidence interval
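
The same steps can be tried without the expression file. The sketch below compares the KO and WT weights from the sample_info data frame built earlier, using its original values; ko_weights, wt_weights, and weight_test are names chosen for this example. With only three animals per group it is meant purely to show the mechanics, not to draw a conclusion.

```r
sample_info <- data.frame(id = 1:6,
                          condition = c("KO", "KO", "KO", "WT", "WT", "WT"),
                          weight = c(2.3, 2.5, 2.1, 4.6, 3.8, 4.5),
                          stringsAsFactors = FALSE)

# Split the weights by condition
ko_weights <- sample_info$weight[sample_info$condition == "KO"]
wt_weights <- sample_info$weight[sample_info$condition == "WT"]

# By default t.test runs Welch's t-test, which does not assume
# that the two groups have equal variances
weight_test <- t.test(ko_weights, wt_weights)

weight_test$estimate         # the two group means
weight_test$p.value          # the p-value
weight_test$p.value < 0.05   # TRUE if below the conventional 0.05 cutoff
```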