This is an R Markdown document that contains instructions and code for the examples used in today’s workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses. If you are running these analyses on your own machine (not on the RStudio Server instance provided), you will need to have the latest version of R installed (3.3.1, download here) and RStudio is also recommended (download here).
The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")
setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("genes_rpkm.csv")
## [1] TRUE
file.exists("ercc_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE
If any of the preceding lines return FALSE
, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.
The read.csv
function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv
to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv)
. In the resulting counts
object, we have genes in rows (24057) and cells in columns (1679). We will also import the corresponding RPKM normalized data from genes_rpkm.csv
counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame': 24057 obs. of 20 variables:
## $ A01101401: num 0 992 2.57 0 0 0 0 0 0 1 ...
## $ A01101402: num 0 2287 177 0 0 ...
## $ A01101403: num 0 492 0 0 0 ...
## $ A01101404: num 0 1932 1 0 0 ...
## $ A01101405: num 0 1425 2 0 0 ...
## $ A01101406: num 0 130 3 0 0 ...
## $ A01101407: num 0 2110 3041 0 17 ...
## $ A01101408: num 0 955 101 0 0 ...
## $ A02271433: num 0 326 0 0 0 349 0 0 0 0 ...
## $ A02271434: num 0 933 1042 0 0 ...
## $ A02271435: num 0 296 0 0 0 ...
## $ A02271436: num 0 31 200 0 0 984 0 0 0 36 ...
## $ A02271437: num 0 970 0 0 3 ...
## $ A02271438: num 0 594 355 0 228 ...
## $ A02271439: num 0 2290 3294 0 0 ...
## $ A02271440: num 0 29 1 0 0 ...
## $ A12101401: num 0 373 540 0 0 ...
## $ A12101402: num 0 462 197 0 0 0 0 0 0 0 ...
## $ A12101403: num 0 516 96 0 498 ...
## $ A12101404: num 0 1019 0 0 0 ...
rpkm <- read.csv("genes_rpkm.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(rpkm[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame': 24057 obs. of 20 variables:
## $ A01101401: num 0 76.13 0.05 0 0 ...
## $ A01101402: num 0 512.7 10.1 0 0 ...
## $ A01101403: num 0 173 0 0 0 ...
## $ A01101404: num 0 170.56 0.02 0 0 ...
## $ A01101405: num 0 113.87 0.04 0 0 ...
## $ A01101406: num 0 10.9 0.06 0 0 ...
## $ A01101407: num 0 191.23 70.19 0 0.53 ...
## $ A01101408: num 0 67.41 1.82 0 0 ...
## $ A02271433: num 0 70.6 0 0 0 ...
## $ A02271434: num 0 73.3 20.9 0 0 ...
## $ A02271435: num 0 92.7 0 0 0 ...
## $ A02271436: num 0 6.16 10.12 0 0 ...
## $ A02271437: num 0 99.96 0 0 0.11 ...
## $ A02271438: num 0 54.48 8.29 0 7.12 ...
## $ A02271439: num 0 209.7 76.9 0 0 ...
## $ A02271440: num 0 2.32 0.02 0 0 ...
## $ A12101401: num 0 88 32.4 0 0 ...
## $ A12101402: num 0 134.3 14.6 0 0 ...
## $ A12101403: num 0 150.02 7.11 0 23.77 ...
## $ A12101404: num 0 254 0 0 0 ...
rpkm <- log(rpkm + 1, 2) # put on the log scale
The cell_metadata.csv
file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.
cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame': 1679 obs. of 16 variables:
## $ long_name : chr "A01101401" "A01101402" "A01101403" "A01101404" ...
## $ cre : chr "Calb2" "Calb2" "Calb2" "Calb2" ...
## $ collection_date : chr "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
## $ sequencing_type : chr "hiseq" "hiseq" "hiseq" "hiseq" ...
## $ total_reads : int 23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
## $ all_mapped_percent: num 93.5 92.9 90.5 93.2 93.1 ...
## $ mRNA_percent : num 54.4 45.7 48.3 51.4 51.1 ...
## $ genome_percent : num 30.1 35.5 34 33.8 32.1 ...
## $ ercc_percent : num 4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
## $ tdt_permillion : num 306 341 106 371 264 ...
## $ major_class : chr "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
## $ sub_class : chr "Vip" "Vip" "L4" "Vip" ...
## $ major_dissection : chr "V1" "V1" "V1" "V1" ...
## $ layer_dissectoin : chr "All" "All" "All" "All" ...
## $ color_code : int 11 11 11 11 11 11 11 11 11 11 ...
## $ short_name : chr "A200_V" "A201_V" "A202_V" "A203_V" ...
The ‘ercc_counts.csv’ file contains 1679 columns (one for each cell) and 93 rows containing the counts for the ERCC spike-in control RNA transcripts (and spike in tdTomato). The ERCC spike-ins are a standard set of RNA transcripts that are spiked in at known concentrations to each cell (more on this later). We’ll remove the tdTomato transcript since this does not serve the same purpose as the ERCC spike-ins, and add the ERCC spike-ins to the main counts matrix.
ercc <- read.csv("ercc_counts.csv", stringsAsFactors = FALSE, header = TRUE, row.names=1)
str(ercc[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame': 93 obs. of 20 variables:
## $ A01101401: int 167620 16658 35838 18394 0 0 0 0 0 0 ...
## $ A01101402: int 131187 10413 31563 20370 0 0 0 0 0 578 ...
## $ A01101403: int 37523 2248 9723 1342 0 0 0 0 0 185 ...
## $ A01101404: int 136565 4789 34892 8611 0 0 1354 0 1 0 ...
## $ A01101405: int 222462 15067 53617 19096 0 0 0 0 0 0 ...
## $ A01101406: int 125448 7055 31013 4439 0 0 0 0 0 0 ...
## $ A01101407: int 109946 9494 25467 4122 0 0 1 0 0 0 ...
## $ A01101408: int 79354 5755 17085 9585 0 0 0 0 0 296 ...
## $ A02271433: int 29555 5180 26581 3004 0 0 0 0 0 304 ...
## $ A02271434: int 58598 4188 29983 3939 0 0 0 0 0 233 ...
## $ A02271435: int 18494 836 8201 359 0 0 0 0 0 0 ...
## $ A02271436: int 10725 764 5968 619 0 0 0 0 0 0 ...
## $ A02271437: int 48311 1914 18605 1635 0 0 0 0 0 634 ...
## $ A02271438: int 31437 1279 12595 2102 0 0 0 0 0 332 ...
## $ A02271439: int 58475 5016 25756 11074 0 0 0 0 0 0 ...
## $ A02271440: int 127465 6838 67261 2358 0 0 0 0 0 0 ...
## $ A12101401: int 48418 2363 8437 3996 0 0 0 0 0 119 ...
## $ A12101402: int 77486 3819 17759 4358 0 521 0 0 0 0 ...
## $ A12101403: int 72009 4899 13997 4956 0 0 0 0 0 0 ...
## $ A12101404: int 62729 2947 16044 1551 0 0 0 0 0 0 ...
# remove the tdTomato row
whichTomato <- grep("tdTomato", rownames(ercc))
ercc <- ercc[-whichTomato,]
# combine the counts and erccs into the same data frame
all.equal(colnames(counts), colnames(ercc)) # check that the cells are in the same order across the two datasets
## [1] TRUE
counts <- rbind(counts, ercc)
More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt
file provided. Here is a peek at the contents of that file.
# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1. genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2. genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3. ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4. cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5. cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster).
## 6. cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49).
##
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585
We will make use of many existing R packages (collections of functions to carry out a specific task). If you are using your own machine instead of the RStudio Server instance provided, you’ll need to install each of these manually. Instructions are below. The following is a list of the packages we’ll be using along where they can be found, and the code checks whether they are already installed.
require(scde) #bioconductor
require(monocle) #bioconductor
require(scran) #bioconductor
require(scater) #bioconductor
require(Biobase) #bioconductor
require(EBSeq) #bioconductor
require(scDD) #github
require(ggplot2) #cran
require(devtools) #cran
require(RColorBrewer) #cran
If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed (this is expected if you have just installed R for the first time, since most packages are optional add-ons). Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example monocle
, this can be done with the following code:
source("http://bioconductor.org/biocLite.R")
biocLite("monocle")
For CRAN packages, for example devtools
, installation can be done with the following code:
install.packages(devtools)
For Github packages, for example scDD
, installation can be done with the following code:
install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")
Let’s take a quick peek at what our data looks like to get a handle on the information it contains
# Take the dimensions of the count and normalized objects
dim(rpkm)
## [1] 24057 1679
dim(counts)
## [1] 24149 1679
#both objects contain the same genes in the same order
head(rownames(rpkm))
## [1] "0610005C13Rik" "0610007C21Rik" "0610007L01Rik" "0610007N19Rik"
## [5] "0610007P08Rik" "0610007P14Rik"
head(rownames(counts))
## [1] "0610005C13Rik" "0610007C21Rik" "0610007L01Rik" "0610007N19Rik"
## [5] "0610007P08Rik" "0610007P14Rik"
#as well as the same samples in the same order
head(colnames(rpkm))
## [1] "A01101401" "A01101402" "A01101403" "A01101404" "A01101405" "A01101406"
head(colnames(counts))
## [1] "A01101401" "A01101402" "A01101403" "A01101404" "A01101405" "A01101406"
Although they contain the same basic information. The two datasets, counts and rpkm, are distributed very differently
hist(as.vector(as.matrix(counts)))