mcSCRB-seq
Check this GitHub page to see how SCRB-seq/mcSCRB-seq libraries are generated experimentally. These are plate-based methods, in which single cells are sorted into 96- or 384-well plates, one cell per well. Each well contains barcoded RT primers that label the mRNA from that well. After reverse transcription, the cDNA from all wells is pooled, and a single library is prepared per plate. A plate barcode is added at the library amplification stage using the Illumina Nextera Index primers. Each cell can then be identified by the combination of its well barcode and plate barcode.
For Your Own Experiments
Your sequencing read configuration is like this:
Order | Read | Cycle | Description |
---|---|---|---|
1 | Read 1 | 16 | This yields |
2 | Index 1 (i7) | 8 | This yields |
3 | Index 2 (i5) | 8 or None | This yields |
4 | Read 2 | >50 | This yields |
If you sequence your data via your core facility or a company, you will need to provide them with the plate index sequence, which is the N7xx primer from the Illumina Nextera Index kit, and they will demultiplex for you. You will get two fastq files per plate. The cell barcode is basically the first 6 bp of Read 1 (R1_001.fastq.gz).
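As a quick check on a demultiplexed plate, the well barcodes can be tallied straight from Read 1. The following is only a minimal sketch on a tiny made-up fastq: the file name toy_R1.fastq.gz and its three reads are assumptions for illustration, and on real data you would point the same awk command at your own R1_001.fastq.gz file.

```shell
# Toy example: tally the well barcodes (first 6 bp of Read 1).
# The file name and the three reads below are made up for illustration.
workdir=$(mktemp -d)

# Three 16-bp reads: 6 bp well barcode followed by 10 bp UMI
printf '%s\n' \
  '@read1' 'AAAACTGGTCAGTCAG' '+' 'IIIIIIIIIIIIIIII' \
  '@read2' 'AAAACTCCATGCCATG' '+' 'IIIIIIIIIIIIIIII' \
  '@read3' 'AAAATCTTGACTTGAC' '+' 'IIIIIIIIIIIIIIII' |
  gzip > "$workdir/toy_R1.fastq.gz"

# The sequence is the 2nd line of every 4-line fastq record
barcode_counts=$(gzip -dc "$workdir/toy_R1.fastq.gz" |
  awk 'NR % 4 == 2 { n[substr($0, 1, 6)]++ }
       END { for (bc in n) print bc, n[bc] }' |
  sort)

echo "$barcode_counts"
```

On a real plate, barcodes that do not appear in the list of well barcodes used in the experiment are most likely sequencing errors or background.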
If you sequence by yourself, you need to run bcl2fastq yourself. You have two choices:
Option 1: Prepare a SampleSheet.csv with index information for each plate, and get the fastq files for each plate by running bcl2fastq. Here is an example of a SampleSheet.csv for a NextSeq run with five different plates:
[Header],,,,,,,,,,,
IEMFileVersion,5,,,,,,,,,,
Date,17/12/2019,,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,,,,
Instrument Type,NextSeq/MiniSeq,,,,,,,,,,
Assay,AmpliSeq Library PLUS for Illumina,,,,,,,,,,
Index Adapters,AmpliSeq CD Indexes (384),,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,,
,,,,,,,,,,,
[Reads],,,,,,,,,,,
16,,,,,,,,,,,
75,,,,,,,,,,,
,,,,,,,,,,,
[Settings],,,,,,,,,,,
,,,,,,,,,,,
[Data],,,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Plate01,,,,,,N701,TAAGGCGA,,,,
Plate02,,,,,,N702,CGTACTAG,,,,
Plate03,,,,,,N703,AGGCAGAA,,,,
Plate04,,,,,,N704,TCCTGAGC,,,,
Plate05,,,,,,N705,GGACTCCT,,,,
Then, for each plate, you will get two fastq files, like this:
Plate01_S1_R1_001.fastq.gz
Plate01_S1_R2_001.fastq.gz
Plate02_S2_R1_001.fastq.gz
Plate02_S2_R2_001.fastq.gz
Plate03_S3_R1_001.fastq.gz
Plate03_S3_R2_001.fastq.gz
Plate04_S4_R1_001.fastq.gz
Plate04_S4_R2_001.fastq.gz
Plate05_S5_R1_001.fastq.gz
Plate05_S5_R2_001.fastq.gz
You can process each plate independently. See the later section on how to prepare a cell barcode whitelist.
Option 2 (the way I actually prefer): run bcl2fastq without a SampleSheet.csv, like this:
bcl2fastq --create-fastq-for-index-reads \
--no-lane-splitting \
--ignore-missing-positions \
--ignore-missing-controls \
--ignore-missing-filter \
--ignore-missing-bcls \
-r 4 -w 4 -p 4
In this case, you will get three fastq files for each experiment:
Undetermined_S0_I1_001.fastq.gz
Undetermined_S0_R1_001.fastq.gz
Undetermined_S0_R2_001.fastq.gz
Your plate barcodes are in Undetermined_S0_I1_001.fastq.gz, and the well barcodes are the first 6 bp of Undetermined_S0_R1_001.fastq.gz. The cell barcode is basically the combination of the plate barcode and the well barcode. To run STARsolo, we need to prepare a cell barcode and UMI read file by stitching I1 and R1 together, that is, putting the sequence from I1 in front of the sequence from R1:
paste <(zcat Undetermined_S0_I1_001.fastq.gz) <(zcat Undetermined_S0_R1_001.fastq.gz) | \
awk -F '\t' '{ if(NR%4==1||NR%4==3) {print $1} else {print $1 $2} }' | \
gzip > Undetermined_S0_CB_UMI.fastq.gz
The resulting Undetermined_S0_CB_UMI.fastq.gz file contains reads that are 24 bp long. The first 14 bp are the cell barcode (8 bp plate barcode + 6 bp well barcode), and the remaining 10 bp are the UMI. See the later section on how to prepare a cell barcode whitelist.
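To see what the stitching does, here is a minimal sketch that applies the same awk logic to one made-up read and then checks the read length. The file names and sequences are assumptions for illustration, and uncompressed temporary files are used instead of process substitution so that the example runs in any POSIX shell.

```shell
# Toy check of the I1 + R1 stitching on a single made-up read.
workdir=$(mktemp -d)

# One 8-bp index read (plate barcode) and one 16-bp Read 1 (well barcode + UMI)
printf '%s\n' '@read1 1:N:0:0' 'TAAGGCGA' '+' 'IIIIIIII' > "$workdir/I1.fastq"
printf '%s\n' '@read1 1:N:0:0' 'AAAACTGGTCAGTCAG' '+' 'FFFFFFFFFFFFFFFF' > "$workdir/R1.fastq"

# Same logic as the command above: keep the header and "+" lines from I1,
# and concatenate the sequence and quality lines of I1 and R1
paste "$workdir/I1.fastq" "$workdir/R1.fastq" |
  awk -F '\t' '{ if (NR%4==1 || NR%4==3) {print $1} else {print $1 $2} }' \
  > "$workdir/CB_UMI.fastq"

# The stitched sequence: plate barcode + well barcode + UMI
stitched_seq=$(sed -n '2p' "$workdir/CB_UMI.fastq")
echo "$stitched_seq"

# Count the reads that are not 24 bp long; this should be zero
bad_reads=$(awk 'NR % 4 == 2 && length($0) != 24 { n++ } END { print n + 0 }' "$workdir/CB_UMI.fastq")
echo "reads not 24 bp long: $bad_reads"
```

Running the same length check on the real stitched file (after decompressing it) is a cheap way to catch a run where Index 1 was not sequenced with 8 cycles.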
Public Data
For the purpose of demonstration, we will use the mcSCRB-seq data from the following paper:
Note
Mereu E, Lafzi A, Moutinho C, Ziegenhain C, McCarthy DJ, Álvarez-Varela A, Batlle E, Sagar, Grün D, Lau JK, Boutet SC, Sanada C, Ooi A, Jones RC, Kaihara K, Brampton C, Talaga Y, Sasagawa Y, Tanaka K, Hayashi T, Braeuning C, Fischer C, Sauer S, Trefzer T, Conrad C, Adiconis X, Nguyen LT, Regev A, Levin JZ, Parekh S, Janjic A, Wange LE, Bagnoli JW, Enard W, Gut M, Sandberg R, Nikaido I, Gut I, Stegle O, Heyn H (2020) Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat Biotechnol 38:747–755. https://doi.org/10.1038/s41587-020-0469-4
where the authors benchmarked quite a few different scRNA-seq methods using a standardised sample: a mixture of human, mouse and dog cells. We are going to use the data from the mcSCRB-seq method. You can download the fastq files from this ENA page. There are 8 plates in total, but I’m just downloading one plate for the demonstration.
mkdir -p mereu2020/mcscrb-seq
wget -P mereu2020/mcscrb-seq -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR962/007/SRR9621767/SRR9621767_1.fastq.gz
wget -P mereu2020/mcscrb-seq -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR962/007/SRR9621767/SRR9621767_2.fastq.gz
Prepare Whitelist
The data we get from the above paper are already demultiplexed by plate, which is the same as the output from Option 1 described above. In this case, the cell barcode is essentially the barcode in the RT primer in the well. The barcoded oligo-dT sequences can be downloaded from the protocols.io page of mcSCRB-seq.
# Download the oligo-dT file from protocols.io
# If this does not work, go to the protocols.io page and download manually
wget -O mereu2020/mcscrb-seq/mcSCRB-seq_oligodT.txt https://s3.amazonaws.com/pr-journal/uhyjf6e.txt
# the 3rd column is the well barcode
tail -n +2 mereu2020/mcscrb-seq/mcSCRB-seq_oligodT.txt | \
cut -f 3 > mereu2020/mcscrb-seq/whitelist.txt
You should get 384 lines (well barcodes) in mereu2020/mcscrb-seq/whitelist.txt, and here I’m just showing the first five lines:
AAAACT
AAAATC
AAAGTT
AAATAC
AAATTG
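Before passing a whitelist to STAR, it can be worth confirming the number of barcodes, their length and the absence of duplicates. The sketch below does this on a toy file that holds only the five barcodes shown above; the file name toy_whitelist.txt is an assumption for illustration, and on the real whitelist.txt you would expect 384 barcodes.

```shell
# Sanity-check a whitelist: line count, barcode length and duplicates.
# toy_whitelist.txt holds only the five barcodes shown above.
workdir=$(mktemp -d)
printf '%s\n' AAAACT AAAATC AAAGTT AAATAC AAATTG > "$workdir/toy_whitelist.txt"

# Number of barcodes in the file
n_barcodes=$(awk 'END { print NR }' "$workdir/toy_whitelist.txt")

# Barcodes that are not exactly 6 bp long
n_wrong_len=$(awk 'length($0) != 6 { n++ } END { print n + 0 }' "$workdir/toy_whitelist.txt")

# Barcodes that occur more than once
n_dups=$(sort "$workdir/toy_whitelist.txt" | uniq -d | awk 'END { print NR }')

echo "barcodes: $n_barcodes, wrong length: $n_wrong_len, duplicated: $n_dups"
```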
If you perform mcSCRB-seq on your own and generate the fastq files with Option 2 described above, the whitelist should be a combination of the plate barcode and the well barcode. Using the five plates (N701, N702, N703, N704 and N705) described above as an example, you should put each of the N7xx barcodes in front of each of the 384 well barcodes to get a total of 5 * 384 = 1920 cell barcodes. You can do this in the following way:
for pb in TAAGGCGA CGTACTAG AGGCAGAA TCCTGAGC GGACTCCT;
do tail -n +2 mereu2020/mcscrb-seq/mcSCRB-seq_oligodT.txt | \
awk -F '\t' -v PLATE_BARCODE="${pb}" '{print PLATE_BARCODE $3}';
done > mereu2020/mcscrb-seq/whitelist2.txt
You should get 1920 lines (cell barcodes) in mereu2020/mcscrb-seq/whitelist2.txt, and here I’m just showing the first five lines:
TAAGGCGAAAAACT
TAAGGCGAAAAATC
TAAGGCGAAAAGTT
TAAGGCGAAAATAC
TAAGGCGAAAATTG
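A simple way to test whether a combined whitelist matches the stitched reads is to count how many reads start with a whitelisted barcode. The following is a minimal sketch on made-up data: the file names, the two-barcode whitelist and the three reads are assumptions for illustration. On real data you would pass whitelist2.txt and the decompressed CB_UMI fastq instead.

```shell
# Toy check: how many reads start with a barcode from the whitelist?
workdir=$(mktemp -d)

# A two-barcode whitelist (8 bp plate barcode + 6 bp well barcode)
printf '%s\n' TAAGGCGAAAAACT TAAGGCGAAAAATC > "$workdir/toy_whitelist2.txt"

# Three 24-bp reads; the third one has an unreadable plate barcode
printf '%s\n' \
  '@read1' 'TAAGGCGAAAAACTGGTCAGTCAG' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  '@read2' 'TAAGGCGAAAAATCCCATGCCATG' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNAAAACTTTGACTTGAC' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  > "$workdir/toy_CB_UMI.fastq"

match_summary=$(awk -v cb_len=14 '
  NR == FNR { valid[$1] = 1; next }     # first file: load the whitelist
  FNR % 4 == 2 {                        # second file: sequence lines only
    total++
    if (substr($0, 1, cb_len) in valid) hit++
  }
  END { printf "%d of %d reads carry a whitelisted barcode\n", hit, total }
' "$workdir/toy_whitelist2.txt" "$workdir/toy_CB_UMI.fastq")

echo "$match_summary"
```

A very low fraction on real data usually points to a mismatch between the whitelist and the read layout, for example a plate barcode that needs to be reverse-complemented, rather than to poor data quality.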
From FastQ To Count Matrix
Now we can start the preprocessing by simply running:
STAR --runThreadN 4 \
--genomeDir mix_hg38_mm10/star_index \
--readFilesCommand zcat \
--outFileNamePrefix mereu2020/star_outs/ \
--readFilesIn mereu2020/mcscrb-seq/SRR9621767_2.fastq.gz mereu2020/mcscrb-seq/SRR9621767_1.fastq.gz \
--soloType CB_UMI_Simple \
--soloCBstart 1 --soloCBlen 6 --soloUMIstart 7 --soloUMIlen 10 \
--soloCBwhitelist mereu2020/mcscrb-seq/whitelist.txt \
--soloStrand Forward \
--outSAMattributes CB UB \
--outSAMtype BAM SortedByCoordinate
The above scenario is from Option 1. If you are doing the experiments yourself and using Option 2, you should run:
STAR --runThreadN 4 \
--genomeDir mix_hg38_mm10/star_index \
--readFilesCommand zcat \
--outFileNamePrefix /path/to/star_outs/ \
--readFilesIn /path/to/Undetermined_S0_R2_001.fastq.gz /path/to/Undetermined_S0_CB_UMI.fastq.gz \
--soloType CB_UMI_Simple \
--soloCBstart 1 --soloCBlen 14 --soloUMIstart 15 --soloUMIlen 10 \
--soloCBwhitelist mereu2020/mcscrb-seq/whitelist2.txt \
--soloStrand Forward \
--outSAMattributes CB UB \
--outSAMtype BAM SortedByCoordinate
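The start and length values passed to --soloCBstart, --soloCBlen, --soloUMIstart and --soloUMIlen in the two commands are 1-based positions in the barcode read. As a minimal illustration, the sketch below pulls the well barcode and UMI out of one made-up 16-bp Read 1 with the Option 1 layout; the read sequence is an assumption for illustration only.

```shell
# A made-up 16-bp Read 1: 6 bp well barcode followed by 10 bp UMI
read_seq='AAAACTGGTCAGTCAG'

# substr() in awk is 1-based, like --soloCBstart and --soloUMIstart:
# the cell barcode starts at position 1 and is 6 bp long,
# the UMI starts at position 7 and is 10 bp long
cell_barcode=$(printf '%s\n' "$read_seq" | awk '{ print substr($0, 1, 6) }')
umi=$(printf '%s\n' "$read_seq" | awk '{ print substr($0, 7, 10) }')

echo "CB=$cell_barcode UMI=$umi"
```

For the stitched reads from Option 2, the same logic applies with a 14-bp cell barcode starting at position 1 and the UMI starting at position 15.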
Explanation
If you understand the SCRB-seq/mcSCRB-seq experimental procedures described in this GitHub Page, the command above should be straightforward to understand.
--runThreadN 4
Use 4 cores for the preprocessing. Change accordingly if using more or fewer cores.
--genomeDir mix_hg38_mm10/star_index
Points to the directory of the STAR index. The public data from the above paper were produced using the HCA reference sample, which consists of human PBMCs (60%) and HEK293T cells (6%), mouse colon (30%) and NIH3T3 cells (3%), and dog MDCK cells (1%). Therefore, we need to use a species-mixing reference genome. Strictly speaking, the dog genome should be included as well, but dog cells make up only 1% of all cells, so I did not bother in this documentation.
--readFilesCommand zcat
Since the fastq files are in .gz format, we need the zcat command to extract them on the fly.
--outFileNamePrefix mereu2020/star_outs/
We want to keep everything organised. This directs all output files into the mereu2020/star_outs directory.
--readFilesIn mereu2020/mcscrb-seq/SRR9621767_2.fastq.gz mereu2020/mcscrb-seq/SRR9621767_1.fastq.gz
or --readFilesIn /path/to/Undetermined_S0_R2_001.fastq.gz /path/to/Undetermined_S0_CB_UMI.fastq.gz
According to the manual, we should put two files here. The first file contains the reads that come from cDNA, and the second file should contain the cell barcode and UMI. In SCRB-seq/mcSCRB-seq, cDNA reads come from Read 2, and the cell barcode and UMI come from either Read 1 or the CB_UMI.fastq.gz file we created as described above. Check the mcSCRB-seq GitHub Page if you are not sure.
--soloType CB_UMI_Simple
Most of the time, you should use this option and specify the configuration of the cell barcode and UMI on the command line (see immediately below). Sometimes, it is actually easier to prepare the cell barcode and UMI file upfront so that we can use this parameter. This is what we did for Option 2 as described above.
--soloCBstart 1 --soloCBlen 6 --soloUMIstart 7 --soloUMIlen 10
or --soloCBstart 1 --soloCBlen 14 --soloUMIstart 15 --soloUMIlen 10
The names of the parameters are pretty much self-explanatory. If using --soloType CB_UMI_Simple, we can specify where the cell barcode and UMI start and how long they are in the reads from the second file passed to --readFilesIn. Note the position is 1-based (the first base of the read is 1, NOT 0).
--soloCBwhitelist mereu2020/mcscrb-seq/whitelist.txt
The plain text file containing all possible valid cell barcodes, one per line. The content of this file varies depending on the strategy (Option 1 vs Option 2) and the primer used during the experiment. We just demonstrated the standard whitelist in the paper.
--soloStrand Forward
The choice of this parameter depends on where the cDNA reads come from, i.e. the reads from the first file passed to --readFilesIn. You need to check the experimental protocol. If the cDNA reads are from the same strand as the mRNA (the coding strand), this parameter will be Forward (this is the default). If they are from the opposite strand to the mRNA, which is often called the first strand, this parameter will be Reverse. In the case of SCRB-seq and mcSCRB-seq, the cDNA reads are from the Read 2 file. During the experiment, the mRNA molecules are captured by a barcoded oligo-dT primer containing the Illumina Read 1 sequence. Therefore, Read 1 comes from the first strand, complementary to the coding strand, and Read 2 comes from the coding strand. Therefore, use Forward for SCRB-seq/mcSCRB-seq data. Forward is the default, because many protocols generate data like this, but I still specified it here to make it clear.
--outSAMattributes CB UB
We want the cell barcode and UMI sequences in the CB and UB attributes of the output, respectively. The information will be very helpful for downstream analysis.
--outSAMtype BAM SortedByCoordinate
We want a sorted BAM for easy handling by other programs.
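To illustrate how the CB and UB attributes can be used downstream, here is a minimal sketch that counts distinct UMIs per cell barcode from SAM-formatted text. The three records are made up for illustration, and the sketch assumes that the tags appear as CB:Z: and UB:Z: and that a "-" value marks reads without an assigned barcode or UMI. On real output you would typically feed it the text form of the BAM file, for example via samtools view (assumed to be installed). Note that this is only a rough per-cell tally that ignores genes; it is not a replacement for the count matrix produced by STARsolo.

```shell
# Toy example: count distinct UMIs per cell barcode from SAM text.
workdir=$(mktemp -d)

# Three made-up SAM records (11 mandatory fields, then optional tags);
# r1 and r2 share the same cell barcode and UMI
printf 'r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tCB:Z:AAAACT\tUB:Z:GGTCAGTCAG\n'  > "$workdir/toy.sam"
printf 'r2\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tCB:Z:AAAACT\tUB:Z:GGTCAGTCAG\n' >> "$workdir/toy.sam"
printf 'r3\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tCB:Z:AAAATC\tUB:Z:CCATGCCATG\n' >> "$workdir/toy.sam"

umis_per_cell=$(awk -F '\t' '
  {
    cb = ""; ub = ""
    # optional tags start at field 12
    for (i = 12; i <= NF; i++) {
      if ($i ~ /^CB:Z:/) cb = substr($i, 6)
      if ($i ~ /^UB:Z:/) ub = substr($i, 6)
    }
    # skip reads without a usable barcode or UMI (assumed to be "-" or absent)
    if (cb == "" || cb == "-" || ub == "" || ub == "-") next
    # count each barcode and UMI pair only once
    if (!seen[cb, ub]++) n[cb]++
  }
  END { for (cb in n) print cb, n[cb] }
' "$workdir/toy.sam" | sort)

echo "$umis_per_cell"
```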
If everything goes well, your directory should look like the following:
scg_prep_test/mereu2020/
├── mcscrb-seq
│   ├── mcSCRB-seq_oligodT.txt
│   ├── SRR9621767_1.fastq.gz
│   ├── SRR9621767_2.fastq.gz
│   ├── whitelist2.txt
│   └── whitelist.txt
└── star_outs
    ├── Aligned.sortedByCoord.out.bam
    ├── Log.final.out
    ├── Log.out
    ├── Log.progress.out
    ├── SJ.out.tab
    └── Solo.out
        ├── Barcodes.stats
        └── Gene
            ├── Features.stats
            ├── filtered
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   └── matrix.mtx
            ├── raw
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   └── matrix.mtx
            ├── Summary.csv
            └── UMIperCellSorted.txt
6 directories, 20 files
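As a first look at the count matrix, the total UMI count per cell can be summed directly from barcodes.tsv and matrix.mtx. The sketch below runs on a tiny made-up matrix and assumes the usual MatrixMarket coordinate layout, with features as rows and barcodes as columns; the file contents are assumptions for illustration. On real data you would point it at the files in Solo.out/Gene/raw or Solo.out/Gene/filtered.

```shell
# Toy example: total UMI counts per barcode from a MatrixMarket file.
workdir=$(mktemp -d)

# Two barcodes, one per line (the order defines the column index)
printf '%s\n' AAAACT AAAATC > "$workdir/barcodes.tsv"

# A made-up matrix with 3 features (rows), 2 barcodes (columns), 4 entries
printf '%s\n' \
  '%%MatrixMarket matrix coordinate integer general' \
  '%' \
  '3 2 4' \
  '1 1 5' \
  '2 1 1' \
  '3 2 2' \
  '1 2 7' > "$workdir/matrix.mtx"

umi_totals=$(awk '
  NR == FNR { barcode[FNR] = $1; next }   # first file: barcodes.tsv
  /^%/ { next }                           # skip the header and comment lines
  !dims_seen { dims_seen = 1; next }      # skip the "rows columns entries" line
  { total[$2] += $3 }                     # column index = barcode index
  END { for (j in total) print barcode[j], total[j] }
' "$workdir/barcodes.tsv" "$workdir/matrix.mtx" | sort)

echo "$umi_totals"
```

Barcodes with no counts at all do not appear in the output, because the sparse matrix only stores non-zero entries.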