Welcome to the Festival of Genomics workshop on single-cell analyses

This is an R Markdown document that contains instructions and code for the examples used in today’s workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses. If you are running these analyses on your own machine (not on the RStudio Server instance provided), you will need to have the latest version of R installed (3.3.1, download here) and RStudio is also recommended (download here).

Hour I: Getting Started

Check that the Brain Atlas data files are present

The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")

setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("genes_rpkm.csv")
## [1] TRUE
file.exists("ercc_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE

If any of the preceding lines return FALSE, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.

Read the Brain Atlas data files into R

The read.csv function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv). In the resulting counts object, we have genes in rows (24057) and cells in columns (1679). We will also import the corresponding RPKM normalized data from genes_rpkm.csv

counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame':    24057 obs. of  20 variables:
##  $ A01101401: num  0 992 2.57 0 0 0 0 0 0 1 ...
##  $ A01101402: num  0 2287 177 0 0 ...
##  $ A01101403: num  0 492 0 0 0 ...
##  $ A01101404: num  0 1932 1 0 0 ...
##  $ A01101405: num  0 1425 2 0 0 ...
##  $ A01101406: num  0 130 3 0 0 ...
##  $ A01101407: num  0 2110 3041 0 17 ...
##  $ A01101408: num  0 955 101 0 0 ...
##  $ A02271433: num  0 326 0 0 0 349 0 0 0 0 ...
##  $ A02271434: num  0 933 1042 0 0 ...
##  $ A02271435: num  0 296 0 0 0 ...
##  $ A02271436: num  0 31 200 0 0 984 0 0 0 36 ...
##  $ A02271437: num  0 970 0 0 3 ...
##  $ A02271438: num  0 594 355 0 228 ...
##  $ A02271439: num  0 2290 3294 0 0 ...
##  $ A02271440: num  0 29 1 0 0 ...
##  $ A12101401: num  0 373 540 0 0 ...
##  $ A12101402: num  0 462 197 0 0 0 0 0 0 0 ...
##  $ A12101403: num  0 516 96 0 498 ...
##  $ A12101404: num  0 1019 0 0 0 ...
rpkm <- read.csv("genes_rpkm.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(rpkm[,1:20]) # restrict to the first 20 columns (cells) 
## 'data.frame':    24057 obs. of  20 variables:
##  $ A01101401: num  0 76.13 0.05 0 0 ...
##  $ A01101402: num  0 512.7 10.1 0 0 ...
##  $ A01101403: num  0 173 0 0 0 ...
##  $ A01101404: num  0 170.56 0.02 0 0 ...
##  $ A01101405: num  0 113.87 0.04 0 0 ...
##  $ A01101406: num  0 10.9 0.06 0 0 ...
##  $ A01101407: num  0 191.23 70.19 0 0.53 ...
##  $ A01101408: num  0 67.41 1.82 0 0 ...
##  $ A02271433: num  0 70.6 0 0 0 ...
##  $ A02271434: num  0 73.3 20.9 0 0 ...
##  $ A02271435: num  0 92.7 0 0 0 ...
##  $ A02271436: num  0 6.16 10.12 0 0 ...
##  $ A02271437: num  0 99.96 0 0 0.11 ...
##  $ A02271438: num  0 54.48 8.29 0 7.12 ...
##  $ A02271439: num  0 209.7 76.9 0 0 ...
##  $ A02271440: num  0 2.32 0.02 0 0 ...
##  $ A12101401: num  0 88 32.4 0 0 ...
##  $ A12101402: num  0 134.3 14.6 0 0 ...
##  $ A12101403: num  0 150.02 7.11 0 23.77 ...
##  $ A12101404: num  0 254 0 0 0 ...
rpkm <- log(rpkm + 1, 2) # put on the log scale

The cell_metadata.csv file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.

cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame':    1679 obs. of  16 variables:
##  $ long_name         : chr  "A01101401" "A01101402" "A01101403" "A01101404" ...
##  $ cre               : chr  "Calb2" "Calb2" "Calb2" "Calb2" ...
##  $ collection_date   : chr  "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
##  $ sequencing_type   : chr  "hiseq" "hiseq" "hiseq" "hiseq" ...
##  $ total_reads       : int  23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
##  $ all_mapped_percent: num  93.5 92.9 90.5 93.2 93.1 ...
##  $ mRNA_percent      : num  54.4 45.7 48.3 51.4 51.1 ...
##  $ genome_percent    : num  30.1 35.5 34 33.8 32.1 ...
##  $ ercc_percent      : num  4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
##  $ tdt_permillion    : num  306 341 106 371 264 ...
##  $ major_class       : chr  "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
##  $ sub_class         : chr  "Vip" "Vip" "L4" "Vip" ...
##  $ major_dissection  : chr  "V1" "V1" "V1" "V1" ...
##  $ layer_dissectoin  : chr  "All" "All" "All" "All" ...
##  $ color_code        : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ short_name        : chr  "A200_V" "A201_V" "A202_V" "A203_V" ...

The ‘ercc_counts.csv’ file contains 1679 columns (one for each cell) and 93 rows containing the counts for the ERCC spike-in control RNA transcripts (and spike in tdTomato). The ERCC spike-ins are a standard set of RNA transcripts that are spiked in at known concentrations to each cell (more on this later). We’ll remove the tdTomato transcript since this does not serve the same purpose as the ERCC spike-ins, and add the ERCC spike-ins to the main counts matrix.

ercc <- read.csv("ercc_counts.csv", stringsAsFactors = FALSE, header = TRUE, row.names=1)
str(ercc[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame':    93 obs. of  20 variables:
##  $ A01101401: int  167620 16658 35838 18394 0 0 0 0 0 0 ...
##  $ A01101402: int  131187 10413 31563 20370 0 0 0 0 0 578 ...
##  $ A01101403: int  37523 2248 9723 1342 0 0 0 0 0 185 ...
##  $ A01101404: int  136565 4789 34892 8611 0 0 1354 0 1 0 ...
##  $ A01101405: int  222462 15067 53617 19096 0 0 0 0 0 0 ...
##  $ A01101406: int  125448 7055 31013 4439 0 0 0 0 0 0 ...
##  $ A01101407: int  109946 9494 25467 4122 0 0 1 0 0 0 ...
##  $ A01101408: int  79354 5755 17085 9585 0 0 0 0 0 296 ...
##  $ A02271433: int  29555 5180 26581 3004 0 0 0 0 0 304 ...
##  $ A02271434: int  58598 4188 29983 3939 0 0 0 0 0 233 ...
##  $ A02271435: int  18494 836 8201 359 0 0 0 0 0 0 ...
##  $ A02271436: int  10725 764 5968 619 0 0 0 0 0 0 ...
##  $ A02271437: int  48311 1914 18605 1635 0 0 0 0 0 634 ...
##  $ A02271438: int  31437 1279 12595 2102 0 0 0 0 0 332 ...
##  $ A02271439: int  58475 5016 25756 11074 0 0 0 0 0 0 ...
##  $ A02271440: int  127465 6838 67261 2358 0 0 0 0 0 0 ...
##  $ A12101401: int  48418 2363 8437 3996 0 0 0 0 0 119 ...
##  $ A12101402: int  77486 3819 17759 4358 0 521 0 0 0 0 ...
##  $ A12101403: int  72009 4899 13997 4956 0 0 0 0 0 0 ...
##  $ A12101404: int  62729 2947 16044 1551 0 0 0 0 0 0 ...
# remove the tdTomato row
whichTomato <- grep("tdTomato", rownames(ercc))
ercc <- ercc[-whichTomato,]

# combine the counts and erccs into the same data frame
all.equal(colnames(counts), colnames(ercc)) # check that the cells are in the same order across the two datasets
## [1] TRUE
counts <- rbind(counts, ercc)  

More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt file provided. Here is a peek at the contents of that file.

# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1.   genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2.   genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3.   ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4.  cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5.   cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster). 
## 6.   cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49). 
## 
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585

Check that the desired R packages have been installed

We will make use of many existing R packages (collections of functions to carry out a specific task). If you are using your own machine instead of the RStudio Server instance provided, you’ll need to install each of these manually. Instructions are below. The following is a list of the packages we’ll be using along where they can be found, and the code checks whether they are already installed.

require(scde)         #bioconductor
require(monocle)      #bioconductor
require(scran)        #bioconductor
require(scater)       #bioconductor
require(Biobase)      #bioconductor
require(EBSeq)        #bioconductor
require(scDD)         #github
require(ggplot2)      #cran
require(devtools)     #cran
require(RColorBrewer) #cran

If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed (this is expected if you have just installed R for the first time, since most packages are optional add-ons). Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example monocle, this can be done with the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

For CRAN packages, for example devtools, installation can be done with the following code:

install.packages(devtools)

For Github packages, for example scDD, installation can be done with the following code:

install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")

Explore the data

Let’s take a quick peek at what our data looks like to get a handle on the information it contains

# Take the dimensions of the count and normalized objects
dim(rpkm)
## [1] 24057  1679
dim(counts)
## [1] 24149  1679
#both objects contain the same genes in the same order
head(rownames(rpkm))
## [1] "0610005C13Rik" "0610007C21Rik" "0610007L01Rik" "0610007N19Rik"
## [5] "0610007P08Rik" "0610007P14Rik"
head(rownames(counts))
## [1] "0610005C13Rik" "0610007C21Rik" "0610007L01Rik" "0610007N19Rik"
## [5] "0610007P08Rik" "0610007P14Rik"
#as well as the same samples in the same order
head(colnames(rpkm))
## [1] "A01101401" "A01101402" "A01101403" "A01101404" "A01101405" "A01101406"
head(colnames(counts))
## [1] "A01101401" "A01101402" "A01101403" "A01101404" "A01101405" "A01101406"

Although they contain the same basic information. The two datasets, counts and rpkm, are distributed very differently

hist(as.vector(as.matrix(counts)))