STRT-seq#
Check this GitHub page to see how STRT-seq libraries are generated experimentally. This is one of the few methods that sequence the 5’ end of the transcript, which is useful for both gene expression quantification and transcription start site (TSS) identification. The method is plate-based: single cells are sorted into the wells of a plate, one cell per well.
Important
Be aware that there are three versions of STRT-seq, which are quite different in terms of experimental procedures and library structures. Make sure you check the STRT-seq GitHub Page for some details.
In this documentation, we will go through the procedures to process all versions just for the sake of record keeping.
For Your Own Experiments#
The read configuration varies greatly depending on the version.
The Original Version#
Order | Read | Cycle | Description |
---|---|---|---|
1 | Read 1 | >=50 | This yields |
2 | Index 1 (i7) | 6 or 8 | This yields |
3 | Index 2 (i5) | Optional | This yields |
4 | Read 2 | >=50 | This yields |
Read 1 is the only required read, and its content is like this:
Length | Sequence (5’ -> 3’) |
---|---|
>=50 | 6 bp Cell barcodes + GGG + 5’ of cDNA |
The C1 Version#
Order | Read | Cycle | Description |
---|---|---|---|
1 | Read 1 | >=50 | This yields |
2 | Index 1 (i7) | 8 | This yields |
3 | Index 2 (i5) | Optional | This yields |
4 | Read 2 | >=50 | This yields |
The Read 1 and Index 1 are the only required reads, and the content of Read 1 is like this:
Length | Sequence (5’ -> 3’) |
---|---|
>=50 | 5 bp UMI + GGG + 5’ of cDNA |
The 2i Version#
This configuration is more complicated and the naming of the output files does not really follow our normal convention. DO NOT get confused.
Order | Read | Cycle | Description |
---|---|---|---|
1 | Read 1 | >=50 | This normally yields |
2 | Index 1 | 8 | This normally yields |
3 | Index 2 (i7) | 5 | This normally yields |
4 | Read 2 | Optional | This normally yields |
The content of Read 1 is like this:
Length | Sequence (5’ -> 3’) |
---|---|
>=50 | 6 bp UMI + GGG + 5’ of cDNA |
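All three layouts place a short tag followed by GGG at the start of Read 1, so a quick look at a FASTQ file can help confirm which layout you have. The function below is only a minimal sketch and not part of the original workflow: `ggg_fraction` is a hypothetical helper name, and it assumes uncompressed four-line FASTQ records on standard input.

```shell
# Hypothetical helper: count reads whose GGG sits right after a tag of the
# given length (6 for the original and 2i versions, 5 for the C1 version).
ggg_fraction() {
    awk -v n="$1" '
        NR % 4 == 2 { total++; if (substr($0, n + 1, 3) == "GGG") hit++ }
        END { printf "%d/%d\n", hit, total }
    '
}

# Two made-up reads: the first follows the 6 bp + GGG layout, the second does not
printf '@r1\nACGTACGGGTTTT\n+\nIIIIIIIIIIIII\n@r2\nACGTACTTTTTTT\n+\nIIIIIIIIIIIII\n' | ggg_fraction 6
```

On real data, something like `gzip -dc Sample01_S1_R1_001.fastq.gz | head -n 400000 | ggg_fraction 6` should be enough to get an impression without reading the whole file.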
In all cases, the paired-end sequencing mode can be used, but the original publications only used single-end reads. If you use this method, you have to sequence the library on your own because custom sequencing primers are used, but that can be modified. You need to get the fastq files by running bcl2fastq by yourself. In the original version, it is better to write a SampleSheet.csv with i7 indices for each sample. In the C1 and 2i versions, it is better to just run bcl2fastq without a SampleSheet.csv. You will see the reason later. Here is an example of a SampleSheet.csv for a NextSeq run with two samples using some standard indices with the original version of STRT-seq:
[Header],,,,,,,,,,,
IEMFileVersion,5,,,,,,,,,,
Date,17/12/2019,,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,,,,
Instrument Type,NextSeq/MiniSeq,,,,,,,,,,
Assay,AmpliSeq Library PLUS for Illumina,,,,,,,,,,
Index Adapters,AmpliSeq CD Indexes (384),,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,,
,,,,,,,,,,,
[Reads],,,,,,,,,,,
50,,,,,,,,,,,
50,,,,,,,,,,,
,,,,,,,,,,,
[Settings],,,,,,,,,,,
,,,,,,,,,,,
[Data],,,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Sample01,,,,,,BC1,CAGATC,,,,
Sample02,,,,,,BC2,ACTTGA,,,,
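Before starting bcl2fastq, it can be worth listing which index belongs to which sample. The sketch below assumes the column order shown in the example sheet (Sample_ID in column 1, index in column 8); `list_sample_indices` is a hypothetical helper name, not a bcl2fastq feature.

```shell
# Hypothetical helper: print Sample_ID and i7 index from the [Data] section
# of a SampleSheet.csv read from standard input.
list_sample_indices() {
    awk -F ',' '
        in_data && header_seen && $1 != "" { print $1 "\t" $8 }
        in_data && !header_seen            { header_seen = 1 }
        /^\[Data\]/                        { in_data = 1 }
    '
}

# Trimmed-down copy of the [Data] section from the example sheet
printf '%s\n' \
    '[Data],,,,,,,,,,,' \
    'Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description' \
    'Sample01,,,,,,BC1,CAGATC,,,,' \
    'Sample02,,,,,,BC2,ACTTGA,,,,' | list_sample_indices
```

With a real file, the call would be `list_sample_indices < SampleSheet.csv`.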
You need to run bcl2fastq differently depending on the version, like this:
# for the original version with the above SampleSheet.csv
bcl2fastq --no-lane-splitting \
--ignore-missing-positions \
--ignore-missing-controls \
--ignore-missing-filter \
--ignore-missing-bcls \
-r 4 -w 4 -p 4
# for the C1 version without a SampleSheet.csv
bcl2fastq --use-bases-mask=Y50,I8,Y50 \
--create-fastq-for-index-reads \
--no-lane-splitting \
--ignore-missing-positions \
--ignore-missing-controls \
--ignore-missing-filter \
--ignore-missing-bcls \
-r 4 -w 4 -p 4
# for the 2i version without a SampleSheet.csv
bcl2fastq --use-bases-mask=Y50,I8,I5,Y50 \
--create-fastq-for-index-reads \
--no-lane-splitting \
--ignore-missing-positions \
--ignore-missing-controls \
--ignore-missing-filter \
--ignore-missing-bcls \
-r 4 -w 4 -p 4
You can check the bcl2fastq manual for more information, but the important bit that needs explanation is the --use-bases-mask flag in the C1 and 2i versions. Using the 2i version as an example, we have four reads in this case, and that parameter specifies how we treat each read in the stated order:
Y50 at the first position indicates “use the cycle as a real read”, so you will get 50-nt sequences, output as R1_001.fastq.gz, because this is the 1st real read.
I8 at the second position indicates “use the cycle as an index read”, so you will get 8-nt sequences, output as I1_001.fastq.gz, because this is the 1st index read, though it is the 2nd read overall.
I5 at the third position indicates “use the cycle as an index read”, so you will get 5-nt sequences, output as I2_001.fastq.gz, because this is the 2nd index read, though it is the 3rd read overall.
Y50 at the fourth position indicates “use the cycle as a real read”, so you will get 50-nt sequences, output as R2_001.fastq.gz, because this is the 2nd real read, though it is the 4th read overall.
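The naming rule described above can be written down as a small shell function. This is just an illustration of the bookkeeping, with `mask_to_files` as a hypothetical name; it does not call bcl2fastq, and it assumes --create-fastq-for-index-reads is on so that index reads are written out.

```shell
# Hypothetical helper: walk through a --use-bases-mask value and print the
# FASTQ suffix each element ends up in (real reads count up R1, R2, ...;
# index reads count up I1, I2, ...).
mask_to_files() {
    echo "$1" | tr ',' '\n' | {
        r=0; i=0
        while read -r part; do
            case "${part}" in
                [Yy]*) r=$((r + 1)); echo "${part} -> R${r}_001.fastq.gz" ;;
                [Ii]*) i=$((i + 1)); echo "${part} -> I${i}_001.fastq.gz" ;;
                *)     echo "${part} -> no FASTQ written" ;;
            esac
        done
    }
}

mask_to_files "Y50,I8,I5,Y50"   # the 2i version
mask_to_files "Y50,I8,Y50"      # the C1 version
```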
After that, you will get two files per sample for the original version, three files per run for the C1 version and four files per run for the 2i version:
# Original version
Sample01_S1_R1_001.fastq.gz # 50 bp: 6 bp cell barcodes + GGG + 5' cDNA
Sample01_S1_R2_001.fastq.gz # 50 bp: cDNA reads
Sample02_S2_R1_001.fastq.gz # 50 bp: 6 bp cell barcodes + GGG + 5' cDNA
Sample02_S2_R2_001.fastq.gz # 50 bp: cDNA reads
# C1 version
Undetermined_S0_R1_001.fastq.gz # 50 bp: 5 bp UMI + GGG + 5' cDNA
Undetermined_S0_I1_001.fastq.gz # 8 bp: cell barcodes
Undetermined_S0_R2_001.fastq.gz # 50 bp: cDNA reads
# 2i version
Undetermined_S0_R1_001.fastq.gz # 50 bp: 6bp UMI + GGG + 5' cDNA
Undetermined_S0_I1_001.fastq.gz # 8 bp: Subarray barcodes
Undetermined_S0_I2_001.fastq.gz # 5 bp: Well barcodes
Undetermined_S0_R2_001.fastq.gz # 50 bp: cDNA reads
There are no UMIs in the original version. For that type of data, we should demultiplex the fastq files based on the cell barcodes (the first 6 bp in Read 1), making one (for single-end) or two (for paired-end) files per cell. This can be achieved using any demultiplexing program, but we will use cutadapt as the demonstration later. For the C1 and 2i versions, the cell barcodes and UMIs are distributed in different reads. We need to collect them into one fastq file in order to use starsolo. This can be done by simply stitching the reads:
# C1 version
paste <(zcat Undetermined_S0_I1_001.fastq.gz) \
<(zcat Undetermined_S0_R1_001.fastq.gz) | \
awk -F '\t' '{ if(NR%4==1||NR%4==3) {print $2} else {print $1 $2} }' | \
gzip > Undetermined_S0_CB_UMI_R1.fastq.gz
# 2i version
paste <(zcat Undetermined_S0_I1_001.fastq.gz) \
<(zcat Undetermined_S0_I2_001.fastq.gz) \
<(zcat Undetermined_S0_R1_001.fastq.gz) | \
awk -F '\t' '{ if(NR%4==1||NR%4==3) {print $3} else {print $1 $2 $3} }' | \
gzip > Undetermined_S0_CB_UMI_R1.fastq.gz
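To see what the stitching does to a single record, here is a toy run of the same awk logic on two made-up reads (the sequences and the temporary file names are invented for illustration). Header and "+" lines are taken from Read 1, while sequence and quality lines are concatenated.

```shell
# Made-up records for illustration: one 8-nt index read and its Read 1 mate
tmpdir=$(mktemp -d)
printf '@read1 1:N:0\nACGTACGT\n+\nIIIIIIII\n' > "${tmpdir}/I1.fastq"
printf '@read1 1:N:0\nTTTTTGGGACGT\n+\nFFFFFFFFFFFF\n' > "${tmpdir}/R1.fastq"

# Same awk logic as the C1 command above, applied to the toy files
stitched=$(paste "${tmpdir}/I1.fastq" "${tmpdir}/R1.fastq" | \
    awk -F '\t' '{ if (NR%4==1 || NR%4==3) {print $2} else {print $1 $2} }')
rm -r "${tmpdir}"

echo "${stitched}"
# @read1 1:N:0
# ACGTACGTTTTTTGGGACGT
# +
# IIIIIIIIFFFFFFFFFFFF
```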
After that, you are ready to go.
Public Data#
For the purpose of demonstration, we are using the data from the following publications:
Note
Original
Islam S, Kjällquist U, Moliner A, Zajac P, Fan J-B, Lönnerberg P, Linnarsson S (2011) Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res 21:1160–1167. https://doi.org/10.1101/gr.110882.110
C1
Islam S, Zeisel A, Joost S, Manno GL, Zajac P, Kasper M, Lönnerberg P, Linnarsson S (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 11:163–166. https://doi.org/10.1038/nmeth.2772
2i
Hochgerner H, Lönnerberg P, Hodge R, Mikes J, Heskol A, Hubschle H, Lin P, Picelli S, Manno GL, Ratz M, Dunne J, Husain S, Lein E, Srinivasan M, Zeisel A, Linnarsson S (2017) STRT-seq-2i: dual-index 5ʹ single cell and nucleus RNA-seq on an addressable microwell array. Sci Rep-uk 7:16327. https://doi.org/10.1038/s41598-017-16546-4
where the authors developed those methods for the first time.
The Original Version#
The raw data for the original version can be found on the PRJNA140307 ENA page. I have prepared the read information, and you can download it here. The authors have already demultiplexed the data for us. They used single-end sequencing mode, so there is one file per cell. To mimic what we get directly from the machine, we can merge all of them into one file.
# get individual fastq files and merge into one file
mkdir -p strt-seq/data
wget -P strt-seq/data https://teichlab.github.io/scg_lib_structs/data/STRT-seq_family/filereport_read_run_PRJNA140307.tsv
wget -i <(cut -f 8 strt-seq/data/filereport_read_run_PRJNA140307.tsv | tail -n +2 | awk '{print "ftp://" $0}') \
-O /dev/stdout >> strt-seq/data/STRT-seq.fastq.gz
Now we need to demultiplex the fastq file into individual files based on the first 6 bp. In this way, each cell has one file. Here, we use cutadapt. The cell barcode information can be found in this Supplementary Information from the Genome Res. paper. We need the barcodes in fasta format:
>bc01
TTTAGG
>bc02
ATTCCA
>bc03
GCTCAA
>bc04
CATCCC
>bc05
TTGGAC
. . .
I have already prepared the fasta file and you can download it from here, and pass the fasta to cutadapt:
wget -P strt-seq/data https://teichlab.github.io/scg_lib_structs/data/STRT-seq_family/STRT_bc.fa
cutadapt -j 4 -g ^file:strt-seq/data/STRT_bc.fa \
--no-indels \
-o "strt-seq/data/demul-{name}.fastq.gz" \
strt-seq/data/STRT-seq.fastq.gz
It should finish without any problem, and we should have 97 more files under strt-seq/data. They are named demul-bc{01..96}.fastq.gz and demul-unknown.fastq.gz. The size of the “unknown” file should be very small. We are ready to go from here for the original version.
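If you want to check the demultiplexing result beyond file sizes, counting reads per file is a simple option. The function below is a minimal sketch with a hypothetical name (`count_fastq_reads`); it assumes gzip-compressed FASTQ files with four lines per record.

```shell
# Hypothetical helper: print "<file> <TAB> <number of reads>" for each
# gzip-compressed FASTQ file given on the command line.
count_fastq_reads() {
    for f in "$@"; do
        lines=$(gzip -dc "${f}" | wc -l)
        printf '%s\t%d\n' "${f}" "$((lines / 4))"
    done
}

# Example call on the demultiplexed files (uncomment once they exist):
# count_fastq_reads strt-seq/data/demul-*.fastq.gz | sort -k2,2n | head
```

Sorting by the second column puts the cells with the fewest reads first, which makes empty or nearly empty barcodes easy to spot.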
The C1 Version#
The raw data for the C1 version can be found on the PRJNA203208 ENA page. I have prepared the read information as a TSV file including the barcode as the last column, and you can download it here. Again, the authors have already demultiplexed the data for us. They used single-end sequencing mode, so there is one file per cell.
mkdir -p strt-seq-c1/data
wget -P strt-seq-c1/data \
https://teichlab.github.io/scg_lib_structs/data/STRT-seq_family/filereport_read_run_PRJNA203208.tsv
# there are two types of libraries
# the one with the string "single" in the cell name is the regular one
# the one with the string "amplified" has extra 9 cycles of amplification
# we just use the regular one here
wget -P strt-seq-c1/data/ \
-i <(tail -n +2 strt-seq-c1/data/filereport_read_run_PRJNA203208.tsv | grep '_single' | cut -f 8 | awk '{print "ftp://" $0}')
Since the authors already demultiplexed the data to one file per cell, we need to add the cell barcode with fake quality scores in front of the reads and merge them into one fastq file. This mimics the Undetermined_S0_CB_UMI_R1.fastq.gz that we will get by ourselves. To this end, we do:
tail -n +2 strt-seq-c1/data/filereport_read_run_PRJNA203208.tsv | \
grep '_single' | cut -f 4,10 | \
while read -r line; do
srr=$(echo "${line}" | cut -f 1)
bc=$(echo "${line}" | cut -f 2)
zcat strt-seq-c1/data/${srr}.fastq.gz | \
awk -v BARCODE="${bc}" '{ if(NR%4==1||NR%4==3) {print $0} if(NR%4==2) {print BARCODE $0} if(NR%4==0) {print "IIIIIIII" $0} }' | \
gzip >> strt-seq-c1/data/CB_UMI_R1.fastq.gz
done
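Because the barcode and the fake quality string are added by hand, it is cheap insurance to confirm that every record still has sequence and quality lines of equal length. This is a minimal sketch, assuming uncompressed four-line FASTQ on standard input; `check_fastq_lengths` is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of records whose quality string
# length differs from the sequence length (0 means the file looks consistent).
check_fastq_lengths() {
    awk '
        NR % 4 == 2 { seq_len = length($0) }
        NR % 4 == 0 && length($0) != seq_len { bad++ }
        END { print bad + 0 }
    '
}

# Made-up record: 8 bp barcode + 5 bp read, with 8 fake + 5 original qualities
printf '@r1\nACGTACGTTTTTT\n+\nIIIIIIIIFFFFF\n' | check_fastq_lengths
```

For the merged file, the call would be `gzip -dc strt-seq-c1/data/CB_UMI_R1.fastq.gz | check_fastq_lengths`.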
We are ready to go from here for the C1 version.
The 2i Version#
The raw data for the 2i version can be found from the PRJNA394919 ENA page. To preprocess the data, we need the oligo sequences. I have not got this information from the paper. I will update once I get them.
Prepare Whitelist#
The Original Version#
There are no UMIs in this version, and each cell has been demultiplexed into individual files. Therefore, we do not need a whitelist, but we do need to prepare a manifest, pointing the files to starsolo:
for i in $(ls strt-seq/data/demul-bc*.gz); do
cell=$(echo ${i} | cut -f 3 -d '-')
echo -e "${i}\t-\t${cell%.fastq.gz}"
done > islam2011_manifest.tsv
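To make the loop above less opaque, here is the same string handling traced for a single file. Note that splitting on "-" only gives the barcode name in the third field because the path strt-seq/data/demul-bc01.fastq.gz contains exactly two hyphens before it; a different directory name would need a different field number.

```shell
path="strt-seq/data/demul-bc01.fastq.gz"

# Third "-"-separated field: strt | seq/data/demul | bc01.fastq.gz
cell=$(echo "${path}" | cut -f 3 -d '-')

# Strip the extension to get the cell ID
cell=${cell%.fastq.gz}

# One manifest line: file, "-" as the placeholder for the missing mate, cell ID
manifest_line=$(printf '%s\t-\t%s' "${path}" "${cell}")
echo "${manifest_line}"
```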
The C1 Version#
In this version, cDNA from individual cells is tagmented by barcoded Tn5 separately. The Tn5 barcode serves as the cell barcode. You can find the full sequences in Supplementary Table 2 of the Islam2014 paper in Nature Methods. There are 96 different 8-bp Tn5 barcodes:
Name | Sequence | Reverse complement |
---|---|---|
C1-TN5-1 | CGTCTAAT | ATTAGACG |
C1-TN5-2 | AGACTCGT | ACGAGTCT |
C1-TN5-3 | GCACGTCA | TGACGTGC |
C1-TN5-4 | TCAACGAC | GTCGTTGA |
C1-TN5-5 | ATTTAGCG | CGCTAAAT |
C1-TN5-6 | ATACAGAC | GTCTGTAT |
C1-TN5-7 | TGCGTAGG | CCTACGCA |
C1-TN5-8 | TGGAGCTC | GAGCTCCA |
C1-TN5-9 | TGAATACC | GGTATTCA |
C1-TN5-10 | TCTCACAC | GTGTGAGA |
C1-TN5-11 | TACTGGTA | TACCAGTA |
C1-TN5-12 | ACGATAGG | CCTATCGT |
C1-TN5-13 | GATGTCGA | TCGACATC |
C1-TN5-14 | TTACGGGT | ACCCGTAA |
C1-TN5-15 | CACAGCAT | ATGCTGTG |
C1-TN5-16 | CTTTGACA | TGTCAAAG |
C1-TN5-17 | CCTTCAAG | CTTGAAGG |
C1-TN5-18 | GAGTCCTG | CAGGACTC |
C1-TN5-19 | CACACTGA | TCAGTGTG |
C1-TN5-20 | GTTACAGG | CCTGTAAC |
C1-TN5-21 | GGACCTTT | AAAGGTCC |
C1-TN5-22 | TTCCGTTC | GAACGGAA |
C1-TN5-23 | ACTGTTTG | CAAACAGT |
C1-TN5-24 | AAGTGGCT | AGCCACTT |
C1-TN5-25 | CTGTACAA | TTGTACAG |
C1-TN5-26 | CGCAAAGT | ACTTTGCG |
C1-TN5-27 | GTGCATGA | TCATGCAC |
C1-TN5-28 | GTCATTAG | CTAATGAC |
C1-TN5-29 | AGCTCCTT | AAGGAGCT |
C1-TN5-30 | TCACCCGA | TCGGGTGA |
C1-TN5-31 | GTTGCCAC | GTGGCAAC |
C1-TN5-32 | TGTACCAA | TTGGTACA |
C1-TN5-33 | AACGAGGT | ACCTCGTT |
C1-TN5-34 | AGCCACCA | TGGTGGCT |
C1-TN5-35 | GGTAATCA | TGATTACC |
C1-TN5-36 | CCAGTCCA | TGGACTGG |
C1-TN5-37 | ACCTCAGC | GCTGAGGT |
C1-TN5-38 | GGTGGACT | AGTCCACC |
C1-TN5-39 | GACAAACC | GGTTTGTC |
C1-TN5-40 | TAACTCCG | CGGAGTTA |
C1-TN5-41 | ACACCGTG | CACGGTGT |
C1-TN5-42 | GTAGAACG | CGTTCTAC |
C1-TN5-43 | GGATTGAC | GTCAATCC |
C1-TN5-44 | ACGTATCC | GGATACGT |
C1-TN5-45 | TTCGGAAA | TTTCCGAA |
C1-TN5-46 | AGTTGTGT | ACACAACT |
C1-TN5-47 | AAGCACAT | ATGTGCTT |
C1-TN5-48 | CTGTCATT | AATGACAG |
C1-TN5-49 | GTCCTATA | TATAGGAC |
C1-TN5-50 | CTACGCTG | CAGCGTAG |
C1-TN5-51 | GGGATTGT | ACAATCCC |
C1-TN5-52 | TGATGTAG | CTACATCA |
C1-TN5-53 | TTCGCTGT | ACAGCGAA |
C1-TN5-54 | GAAGACTT | AAGTCTTC |
C1-TN5-55 | TCTGGGCA | TGCCCAGA |
C1-TN5-56 | CAACTAGA | TCTAGTTG |
C1-TN5-57 | CCATGGGA | TCCCATGG |
C1-TN5-58 | ATGCGACG | CGTCGCAT |
C1-TN5-59 | GAGGGTAG | CTACCCTC |
C1-TN5-60 | CGGGTGAA | TTCACCCG |
C1-TN5-61 | GCCATCTT | AAGATGGC |
C1-TN5-62 | GCATAATC | GATTATGC |
C1-TN5-63 | TCTATGGT | ACCATAGA |
C1-TN5-64 | AGGACTTA | TAAGTCCT |
C1-TN5-65 | CGTGATTC | GAATCACG |
C1-TN5-66 | ACTAGCGA | TCGCTAGT |
C1-TN5-67 | GTAACTCC | GGAGTTAC |
C1-TN5-68 | CGGAAGTG | CACTTCCG |
C1-TN5-69 | CCGAGTAC | GTACTCGG |
C1-TN5-70 | GACGCAAT | ATTGCGTC |
C1-TN5-71 | ACCTGGAG | CTCCAGGT |
C1-TN5-72 | CATGGGTT | AACCCATG |
C1-TN5-73 | ATTCCTAG | CTAGGAAT |
C1-TN5-74 | AATCATGC | GCATGATT |
C1-TN5-75 | GCTTCCCT | AGGGAAGC |
C1-TN5-76 | AGGTAAAG | CTTTACCT |
C1-TN5-77 | CCACAACT | AGTTGTGG |
C1-TN5-78 | ACAGGCAT | ATGCCTGT |
C1-TN5-79 | TTTGTGTC | GACACAAA |
C1-TN5-80 | TGAGCATA | TATGCTCA |
C1-TN5-81 | TTAGACGC | GCGTCTAA |
C1-TN5-82 | CGCTTGCT | AGCAAGCG |
C1-TN5-83 | AGTCTGCC | GGCAGACT |
C1-TN5-84 | CATAGTCG | CGACTATG |
C1-TN5-85 | TCTTGCTG | CAGCAAGA |
C1-TN5-86 | GGGACAAC | GTTGTCCC |
C1-TN5-87 | ATATTCCC | GGGAATAT |
C1-TN5-88 | TGTTAAGC | GCTTAACA |
C1-TN5-89 | TACGCCTC | GAGGCGTA |
C1-TN5-90 | CACTTATC | GATAAGTG |
C1-TN5-91 | ACCGCTAA | TTAGCGGT |
C1-TN5-92 | TAAGGTCC | GGACCTTA |
C1-TN5-93 | GAAAGGTG | CACCTTTC |
C1-TN5-94 | ACGTTGTA | TACAACGT |
C1-TN5-95 | GCAGAGAA | TTCTCTGC |
C1-TN5-96 | GCATTTGG | CCAAATGC |
I have prepared the full tables in csv format for you to download:
If we carefully check the oligo orientation in the STRT-seq C1 GitHub page, we can see that the Tn5 barcodes are sequenced using the bottom strand as the template. Therefore, the barcode reads are actually the reverse complement of the primer sequence. We should use the reverse complement as the whitelist:
wget -P strt-seq-c1/data \
https://teichlab.github.io/scg_lib_structs/data/STRT-seq_family/STRT-seq_C1_bc.csv
tail -n +2 strt-seq-c1/data/STRT-seq_C1_bc.csv | \
cut -f 3 -d, > strt-seq-c1/data/whitelist.txt
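The csv file already contains the reverse complement column, but if you only have the primer sequences, the conversion is a one-liner. This is a minimal sketch that assumes the `rev` utility is installed (it ships with util-linux and the BSD userland); `revcomp` is a hypothetical helper name.

```shell
# Hypothetical helper: reverse the sequence, then swap each base for its complement
revcomp() {
    echo "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp CGTCTAAT   # C1-TN5-1 from the table above, gives ATTAGACG
```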
The 2i Version#
According to Table 1 of the Hochgerner2017 paper in Scientific Reports, there should be 32 different well barcodes (DI-P1A-idx[1–32]-P1B) and 96 different subarray barcodes (STRT-Tn5-Idx[1–96]). The cell barcodes are basically the combination of the subarray and well barcodes. Therefore, we should generate all combinations of the 96 subarray barcodes and 32 well barcodes, for a total of 96 x 32 = 3072 barcodes, as the whitelist. However, the sequences are not available from the paper. I will update once I get them.
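Once the real sequences are available, building the combined whitelist is a nested loop. The sketch below uses placeholder sequences that are NOT the real STRT-seq-2i barcodes, and it assumes the subarray barcode comes first, matching the I1 + I2 order used when stitching the reads earlier. Whether the real barcodes need to be reverse complemented first is not something this example can tell you.

```shell
# Placeholder sequences for illustration only, NOT the real STRT-seq-2i barcodes
subarray_bcs="AAAAAAAA CCCCCCCC"
well_bcs="GGGGG TTTTT ACGTA"

# Every subarray barcode (8 bp) followed by every well barcode (5 bp)
whitelist=$(
    for s in ${subarray_bcs}; do
        for w in ${well_bcs}; do
            echo "${s}${w}"
        done
    done
)

echo "${whitelist}"
```

With 2 and 3 placeholder barcodes this prints 2 x 3 = 6 lines of 13 bp each; the real lists would give 96 x 32 = 3072 lines.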
From FastQ To Count Matrix#
Since we have already generated the manifest for the original version and the whitelist for the C1 version, it is now very easy to just run starsolo:
# for the original version
STAR --runThreadN 4 \
--genomeDir mm10/star_index \
--readFilesCommand zcat \
--outFileNamePrefix strt-seq/star_outs/ \
--readFilesManifest islam2011_manifest.tsv \
--soloType SmartSeq \
--clip5pNbases 3 \
--soloUMIdedup Exact NoDedup \
--soloStrand Forward \
--outSAMtype BAM SortedByCoordinate
# for the C1 version
STAR --runThreadN 4 \
--genomeDir mm10/star_index \
--readFilesCommand zcat \
--outFileNamePrefix strt-seq-c1/star_outs/ \
--readFilesIn strt-seq-c1/data/CB_UMI_R1.fastq.gz \
--soloType CB_UMI_Simple \
--soloCBstart 1 --soloCBlen 8 --soloUMIstart 9 --soloUMIlen 5 \
--soloBarcodeMate 1 \
--clip5pNbases 16 \
--soloCBwhitelist strt-seq-c1/data/whitelist.txt \
--soloStrand Forward \
--outSAMattributes CB UB \
--outSAMtype BAM SortedByCoordinate
Explanation#
If you understand the STRT-seq experimental procedures described in this GitHub Page, the commands above should be straightforward to understand.
--runThreadN 4
Use 4 cores for the preprocessing. Change accordingly if using more or fewer cores.
--genomeDir mm10/star_index
Pointing to the directory of the star index. The public data from the above paper was from mouse embryonic stem cells (mESC).
--readFilesCommand zcat
Since the fastq files are in .gz format, we need the zcat command to extract them on the fly.
--outFileNamePrefix
We want to keep everything organised. This parameter directs all output files into the star_outs directory within each method.
--readFilesManifest and --readFilesIn
For the original version, we need to provide the manifest here. For the C1 version, we provide the prepared read files containing cell barcodes, UMIs and the 5’ of cDNA.
--soloType
The original version has no UMIs, and each cell has its own file. This is the same situation as SMART-seq, so we put SmartSeq here. For the C1 version, we have prepared the files with cell barcodes and UMIs, so we use CB_UMI_Simple here.
--soloCBstart 1 --soloCBlen 8 --soloUMIstart 9 --soloUMIlen 5
This is for the C1 version. The names of the parameters are pretty much self-explanatory. If using --soloType CB_UMI_Simple, we can specify where the cell barcode and UMI start and how long they are in the reads from the first file passed to --readFilesIn. Note the position is 1-based (the first base of the read is 1, NOT 0).
--soloBarcodeMate 1
This is for the C1 version. This option is designed for the 5’ sequencing methods, where one of the reads contains not only cell barcodes + UMI, but useful cDNA as well. It tells the program that cell barcodes + UMI are in the first file in --readFilesIn. In this case, the public data is in single-end mode, so we only have one file.
--clip5pNbases
This option removes a certain number of bases from the 5’ end of the read. In the original version, the cell barcodes are removed during the cutadapt demultiplexing step, but there is still a GGG at the 5’ end. We need to ignore that. In the C1 version, the 5’ end of the read is the 8 bp cell barcode, the 5 bp UMI and GGG. Therefore, we need to remove 8 + 5 + 3 = 16 bp.
--soloUMIdedup Exact NoDedup
The original version does not have UMIs in the reads. Exact means perform the deduplication using the genomic coordinates, that is, fragments with the exact same starts and ends will be treated as duplicates. NoDedup means do not perform deduplication. In ChIP-seq, deduplication is standard. In non-UMI RNA-seq, it seems deduplication is not always enforced (I might be wrong). I’m not sure if this makes a huge difference. Putting both options here will generate two versions of count matrices, one with and one without deduplication.
--soloCBwhitelist
The plain text file containing all possible valid cell barcodes, one per line. We have prepared this file in the previous section. This is for the C1 version.
--soloStrand Forward
The choice of this parameter depends on where the cDNA reads come from, i.e. the reads from the first file passed to --readFilesIn. You need to check the experimental protocol. If the cDNA reads are from the same strand as the mRNA (the coding strand), this parameter will be Forward (this is the default). If they are from the opposite strand to the mRNA, which is often called the first strand, this parameter will be Reverse. In all versions of STRT-seq, the cDNA reads from the Read 1 file are in the same direction as the mRNA, i.e. the coding strand. Therefore, use Forward for all STRT-seq data. Forward is the default, because many protocols generate data like this, but I still specified it here to make it clear. Check the STRT-seq GitHub Page if you are not sure.
--outSAMattributes CB UB
We want the cell barcode and UMI sequences in the CB and UB attributes of the output, respectively. The information will be very helpful for downstream analysis.
--outSAMtype BAM SortedByCoordinate
We want sorted BAM for easy handling by other programs.
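The barcode coordinates and the clipping length for the C1 version all follow from the layout of the prepared reads (8 bp cell barcode + 5 bp UMI + GGG + cDNA). As a small sketch, the arithmetic can be spelled out in shell; the variable names here are made up for illustration and are not STAR options.

```shell
cb_len=8       # Tn5 cell barcode that we put in front of each read
umi_len=5      # UMI at the start of the original Read 1
linker_len=3   # the GGG between the UMI and the cDNA

cb_start=1                                  # positions are 1-based
umi_start=$((cb_start + cb_len))            # UMI begins right after the barcode
clip_5p=$((cb_len + umi_len + linker_len))  # everything before the cDNA

echo "--soloCBstart ${cb_start} --soloCBlen ${cb_len} --soloUMIstart ${umi_start} --soloUMIlen ${umi_len} --clip5pNbases ${clip_5p}"
```

This prints the same values used in the C1 command above: the UMI starts at position 9 and 16 bases are clipped.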
If everything goes well, your directory should look the same as the following:
# The Original Version
scg_prep_test/strt-seq
├── data
│ ├── demul-bc01.fastq.gz
│ ├── demul-bc02.fastq.gz
│ ├── demul-bc03.fastq.gz
│ ├── demul-bc04.fastq.gz
│ ├── demul-bc05.fastq.gz
│ ├── demul-bc06.fastq.gz
│ ├── demul-bc07.fastq.gz
│ ├── demul-bc08.fastq.gz
│ ├── demul-bc09.fastq.gz
│ ├── demul-bc10.fastq.gz
│ ├── demul-bc11.fastq.gz
│ ├── demul-bc12.fastq.gz
│ ├── demul-bc13.fastq.gz
│ ├── demul-bc14.fastq.gz
│ ├── demul-bc15.fastq.gz
│ ├── demul-bc16.fastq.gz
│ ├── demul-bc17.fastq.gz
│ ├── demul-bc18.fastq.gz
│ ├── demul-bc19.fastq.gz
│ ├── demul-bc20.fastq.gz
│ ├── demul-bc21.fastq.gz
│ ├── demul-bc22.fastq.gz
│ ├── demul-bc23.fastq.gz
│ ├── demul-bc24.fastq.gz
│ ├── demul-bc25.fastq.gz
│ ├── demul-bc26.fastq.gz
│ ├── demul-bc27.fastq.gz
│ ├── demul-bc28.fastq.gz
│ ├── demul-bc29.fastq.gz
│ ├── demul-bc30.fastq.gz
│ ├── demul-bc31.fastq.gz
│ ├── demul-bc32.fastq.gz
│ ├── demul-bc33.fastq.gz
│ ├── demul-bc34.fastq.gz
│ ├── demul-bc35.fastq.gz
│ ├── demul-bc36.fastq.gz
│ ├── demul-bc37.fastq.gz
│ ├── demul-bc38.fastq.gz
│ ├── demul-bc39.fastq.gz
│ ├── demul-bc40.fastq.gz
│ ├── demul-bc41.fastq.gz
│ ├── demul-bc42.fastq.gz
│ ├── demul-bc43.fastq.gz
│ ├── demul-bc44.fastq.gz
│ ├── demul-bc45.fastq.gz
│ ├── demul-bc46.fastq.gz
│ ├── demul-bc47.fastq.gz
│ ├── demul-bc48.fastq.gz
│ ├── demul-bc49.fastq.gz
│ ├── demul-bc50.fastq.gz
│ ├── demul-bc51.fastq.gz
│ ├── demul-bc52.fastq.gz
│ ├── demul-bc53.fastq.gz
│ ├── demul-bc54.fastq.gz
│ ├── demul-bc55.fastq.gz
│ ├── demul-bc56.fastq.gz
│ ├── demul-bc57.fastq.gz
│ ├── demul-bc58.fastq.gz
│ ├── demul-bc59.fastq.gz
│ ├── demul-bc60.fastq.gz
│ ├── demul-bc61.fastq.gz
│ ├── demul-bc62.fastq.gz
│ ├── demul-bc63.fastq.gz
│ ├── demul-bc64.fastq.gz
│ ├── demul-bc65.fastq.gz
│ ├── demul-bc66.fastq.gz
│ ├── demul-bc67.fastq.gz
│ ├── demul-bc68.fastq.gz
│ ├── demul-bc69.fastq.gz
│ ├── demul-bc70.fastq.gz
│ ├── demul-bc71.fastq.gz
│ ├── demul-bc72.fastq.gz
│ ├── demul-bc73.fastq.gz
│ ├── demul-bc74.fastq.gz
│ ├── demul-bc75.fastq.gz
│ ├── demul-bc76.fastq.gz
│ ├── demul-bc77.fastq.gz
│ ├── demul-bc78.fastq.gz
│ ├── demul-bc79.fastq.gz
│ ├── demul-bc80.fastq.gz
│ ├── demul-bc81.fastq.gz
│ ├── demul-bc82.fastq.gz
│ ├── demul-bc83.fastq.gz
│ ├── demul-bc84.fastq.gz
│ ├── demul-bc85.fastq.gz
│ ├── demul-bc86.fastq.gz
│ ├── demul-bc87.fastq.gz
│ ├── demul-bc88.fastq.gz
│ ├── demul-bc89.fastq.gz
│ ├── demul-bc90.fastq.gz
│ ├── demul-bc91.fastq.gz
│ ├── demul-bc92.fastq.gz
│ ├── demul-bc93.fastq.gz
│ ├── demul-bc94.fastq.gz
│ ├── demul-bc95.fastq.gz
│ ├── demul-bc96.fastq.gz
│ ├── demul-unknown.fastq.gz
│ ├── STRT_bc.fa
│ └── STRT-seq.fastq.gz
└── star_outs
├── Aligned.sortedByCoord.out.bam
├── Log.final.out
├── Log.out
├── Log.progress.out
├── SJ.out.tab
└── Solo.out
├── Barcodes.stats
└── Gene
├── Features.stats
├── filtered
│ ├── barcodes.tsv
│ ├── features.tsv
│ └── umiDedup-Exact.mtx
├── raw
│ ├── barcodes.tsv
│ ├── features.tsv
│ ├── umiDedup-Exact.mtx
│ └── umiDedup-NoDedup.mtx
├── Summary.csv
└── UMIperCellSorted.txt
6 directories, 115 files
# The C1 Version
scg_prep_test/strt-seq-c1/
├── data
│ ├── CB_UMI_R1.fastq.gz
│ ├── filereport_read_run_PRJNA203208.tsv
│ ├── SRR1043197.fastq.gz
│ ├── SRR1043198.fastq.gz
│ ├── SRR1043199.fastq.gz
│ ├── SRR1043200.fastq.gz
│ ├── SRR1043201.fastq.gz
│ ├── SRR1043202.fastq.gz
│ ├── SRR1043203.fastq.gz
│ ├── SRR1043204.fastq.gz
│ ├── SRR1043205.fastq.gz
│ ├── SRR1043206.fastq.gz
│ ├── SRR1043207.fastq.gz
│ ├── SRR1043208.fastq.gz
│ ├── SRR1043209.fastq.gz
│ ├── SRR1043210.fastq.gz
│ ├── SRR1043211.fastq.gz
│ ├── SRR1043212.fastq.gz
│ ├── SRR1043213.fastq.gz
│ ├── SRR1043214.fastq.gz
│ ├── SRR1043215.fastq.gz
│ ├── SRR1043216.fastq.gz
│ ├── SRR1043217.fastq.gz
│ ├── SRR1043218.fastq.gz
│ ├── SRR1043219.fastq.gz
│ ├── SRR1043220.fastq.gz
│ ├── SRR1043221.fastq.gz
│ ├── SRR1043222.fastq.gz
│ ├── SRR1043223.fastq.gz
│ ├── SRR1043224.fastq.gz
│ ├── SRR1043225.fastq.gz
│ ├── SRR1043226.fastq.gz
│ ├── SRR1043227.fastq.gz
│ ├── SRR1043228.fastq.gz
│ ├── SRR1043229.fastq.gz
│ ├── SRR1043230.fastq.gz
│ ├── SRR1043231.fastq.gz
│ ├── SRR1043232.fastq.gz
│ ├── SRR1043233.fastq.gz
│ ├── SRR1043234.fastq.gz
│ ├── SRR1043235.fastq.gz
│ ├── SRR1043236.fastq.gz
│ ├── SRR1043237.fastq.gz
│ ├── SRR1043238.fastq.gz
│ ├── SRR1043239.fastq.gz
│ ├── SRR1043240.fastq.gz
│ ├── SRR1043241.fastq.gz
│ ├── SRR1043242.fastq.gz
│ ├── SRR1043243.fastq.gz
│ ├── SRR1043244.fastq.gz
│ ├── SRR1043245.fastq.gz
│ ├── SRR1043246.fastq.gz
│ ├── SRR1043247.fastq.gz
│ ├── SRR1043248.fastq.gz
│ ├── SRR1043249.fastq.gz
│ ├── SRR1043250.fastq.gz
│ ├── SRR1043251.fastq.gz
│ ├── SRR1043252.fastq.gz
│ ├── SRR1043253.fastq.gz
│ ├── SRR1043254.fastq.gz
│ ├── SRR1043255.fastq.gz
│ ├── SRR1043256.fastq.gz
│ ├── SRR1043257.fastq.gz
│ ├── SRR1043258.fastq.gz
│ ├── SRR1043259.fastq.gz
│ ├── SRR1043260.fastq.gz
│ ├── SRR1043261.fastq.gz
│ ├── SRR1043262.fastq.gz
│ ├── SRR1043263.fastq.gz
│ ├── SRR1043264.fastq.gz
│ ├── SRR1043265.fastq.gz
│ ├── SRR1043266.fastq.gz
│ ├── SRR1043267.fastq.gz
│ ├── SRR1043268.fastq.gz
│ ├── SRR1043269.fastq.gz
│ ├── SRR1043270.fastq.gz
│ ├── SRR1043271.fastq.gz
│ ├── SRR1043272.fastq.gz
│ ├── SRR1043273.fastq.gz
│ ├── SRR1043274.fastq.gz
│ ├── SRR1043275.fastq.gz
│ ├── SRR1043276.fastq.gz
│ ├── SRR1043277.fastq.gz
│ ├── SRR1043278.fastq.gz
│ ├── SRR1043279.fastq.gz
│ ├── SRR1043280.fastq.gz
│ ├── SRR1043281.fastq.gz
│ ├── SRR1043282.fastq.gz
│ ├── SRR1043283.fastq.gz
│ ├── SRR1043284.fastq.gz
│ ├── SRR1043285.fastq.gz
│ ├── SRR1043286.fastq.gz
│ ├── SRR1043287.fastq.gz
│ ├── SRR1043288.fastq.gz
│ ├── SRR1043289.fastq.gz
│ ├── SRR1043290.fastq.gz
│ ├── SRR1043291.fastq.gz
│ ├── SRR1043292.fastq.gz
│ ├── SRR1043293.fastq.gz
│ ├── SRR1043294.fastq.gz
│ ├── SRR1043295.fastq.gz
│ ├── SRR1043296.fastq.gz
│ ├── SRR1043297.fastq.gz
│ ├── SRR1043298.fastq.gz
│ ├── SRR1043299.fastq.gz
│ ├── SRR1043300.fastq.gz
│ ├── SRR1043301.fastq.gz
│ ├── SRR1043302.fastq.gz
│ ├── SRR1043303.fastq.gz
│ ├── SRR1043304.fastq.gz
│ ├── SRR1043305.fastq.gz
│ ├── SRR1043306.fastq.gz
│ ├── SRR1043307.fastq.gz
│ ├── SRR1043308.fastq.gz
│ ├── SRR1043309.fastq.gz
│ ├── SRR1043310.fastq.gz
│ ├── SRR1043311.fastq.gz
│ ├── SRR1043312.fastq.gz
│ ├── SRR1043313.fastq.gz
│ ├── SRR1043314.fastq.gz
│ ├── SRR1043315.fastq.gz
│ ├── SRR1043316.fastq.gz
│ ├── SRR1043317.fastq.gz
│ ├── SRR1043318.fastq.gz
│ ├── SRR1043319.fastq.gz
│ ├── SRR1043320.fastq.gz
│ ├── SRR1043321.fastq.gz
│ ├── SRR1043322.fastq.gz
│ ├── SRR1043323.fastq.gz
│ ├── SRR1043324.fastq.gz
│ ├── SRR1043325.fastq.gz
│ ├── SRR1043326.fastq.gz
│ ├── SRR1043327.fastq.gz
│ ├── SRR1043328.fastq.gz
│ ├── SRR1043329.fastq.gz
│ ├── SRR1043330.fastq.gz
│ ├── SRR1043331.fastq.gz
│ ├── SRR1043332.fastq.gz
│ ├── SRR1043333.fastq.gz
│ ├── SRR1043334.fastq.gz
│ ├── SRR1043335.fastq.gz
│ ├── SRR1043336.fastq.gz
│ ├── SRR1043337.fastq.gz
│ ├── SRR1043338.fastq.gz
│ ├── SRR1043339.fastq.gz
│ ├── SRR1043340.fastq.gz
│ ├── SRR1043341.fastq.gz
│ ├── SRR1043342.fastq.gz
│ ├── SRR1043343.fastq.gz
│ ├── SRR1043344.fastq.gz
│ ├── SRR1043345.fastq.gz
│ ├── SRR1043346.fastq.gz
│ ├── SRR1043347.fastq.gz
│ ├── SRR1043348.fastq.gz
│ ├── SRR1043349.fastq.gz
│ ├── SRR1043350.fastq.gz
│ ├── SRR1043351.fastq.gz
│ ├── SRR1043352.fastq.gz
│ ├── SRR1043353.fastq.gz
│ ├── SRR1043354.fastq.gz
│ ├── SRR1043355.fastq.gz
│ ├── SRR1043356.fastq.gz
│ ├── SRR1043357.fastq.gz
│ ├── SRR1043358.fastq.gz
│ ├── SRR1043359.fastq.gz
│ ├── SRR1043360.fastq.gz
│ ├── SRR1043361.fastq.gz
│ ├── SRR1043362.fastq.gz
│ ├── SRR1043363.fastq.gz
│ ├── SRR1043364.fastq.gz
│ ├── SRR1043365.fastq.gz
│ ├── SRR1043366.fastq.gz
│ ├── SRR1043367.fastq.gz
│ ├── SRR1043368.fastq.gz
│ ├── SRR1043369.fastq.gz
│ ├── SRR1043370.fastq.gz
│ ├── SRR1043371.fastq.gz
│ ├── SRR1043372.fastq.gz
│ ├── SRR1043373.fastq.gz
│ ├── SRR1043374.fastq.gz
│ ├── SRR1043375.fastq.gz
│ ├── SRR1043376.fastq.gz
│ ├── SRR1043377.fastq.gz
│ ├── SRR1043378.fastq.gz
│ ├── SRR1043379.fastq.gz
│ ├── SRR1043380.fastq.gz
│ ├── SRR1043381.fastq.gz
│ ├── SRR1043382.fastq.gz
│ ├── SRR1043383.fastq.gz
│ ├── SRR1043384.fastq.gz
│ ├── SRR1043385.fastq.gz
│ ├── SRR1043386.fastq.gz
│ ├── SRR1043387.fastq.gz
│ ├── SRR1043388.fastq.gz
│ ├── SRR1043389.fastq.gz
│ ├── SRR1043390.fastq.gz
│ ├── SRR1043391.fastq.gz
│ ├── SRR1043392.fastq.gz
│ ├── SRR1043393.fastq.gz
│ ├── SRR1043394.fastq.gz
│ ├── SRR1043395.fastq.gz
│ ├── SRR1043396.fastq.gz
│ ├── SRR1043397.fastq.gz
│ ├── SRR1043398.fastq.gz
│ ├── SRR1043399.fastq.gz
│ ├── SRR1043400.fastq.gz
│ ├── SRR1043401.fastq.gz
│ ├── SRR1043402.fastq.gz
│ ├── SRR1043403.fastq.gz
│ ├── SRR1043404.fastq.gz
│ ├── SRR1043405.fastq.gz
│ ├── SRR1043406.fastq.gz
│ ├── SRR1043407.fastq.gz
│ ├── SRR1043408.fastq.gz
│ ├── SRR1043409.fastq.gz
│ ├── SRR1043410.fastq.gz
│ ├── SRR1043411.fastq.gz
│ ├── SRR1043412.fastq.gz
│ ├── SRR1043413.fastq.gz
│ ├── SRR1043414.fastq.gz
│ ├── SRR1043415.fastq.gz
│ ├── SRR1043416.fastq.gz
│ ├── SRR1043417.fastq.gz
│ ├── SRR1043418.fastq.gz
│ ├── SRR1043419.fastq.gz
│ ├── SRR1043420.fastq.gz
│ ├── SRR1043421.fastq.gz
│ ├── SRR1043422.fastq.gz
│ ├── SRR1043423.fastq.gz
│ ├── SRR1043424.fastq.gz
│ ├── SRR1043425.fastq.gz
│ ├── SRR1043426.fastq.gz
│ ├── SRR1043427.fastq.gz
│ ├── SRR1043428.fastq.gz
│ ├── SRR1043429.fastq.gz
│ ├── SRR1043430.fastq.gz
│ ├── SRR1043431.fastq.gz
│ ├── SRR1043432.fastq.gz
│ ├── SRR1043433.fastq.gz
│ ├── SRR1043434.fastq.gz
│ ├── SRR1043435.fastq.gz
│ ├── SRR1043436.fastq.gz
│ ├── SRR1043437.fastq.gz
│ ├── SRR1043438.fastq.gz
│ ├── SRR1043439.fastq.gz
│ ├── SRR1043440.fastq.gz
│ ├── SRR1043441.fastq.gz
│ ├── SRR1043442.fastq.gz
│ ├── SRR1043443.fastq.gz
│ ├── SRR1043444.fastq.gz
│ ├── SRR1043445.fastq.gz
│ ├── SRR1043446.fastq.gz
│ ├── SRR1043447.fastq.gz
│ ├── SRR1043448.fastq.gz
│ ├── SRR1043449.fastq.gz
│ ├── SRR1043450.fastq.gz
│ ├── SRR1043451.fastq.gz
│ ├── SRR1043452.fastq.gz
│ ├── SRR1043453.fastq.gz
│ ├── SRR1043454.fastq.gz
│ ├── SRR1043455.fastq.gz
│ ├── SRR1043456.fastq.gz
│ ├── SRR1043457.fastq.gz
│ ├── SRR1043458.fastq.gz
│ ├── SRR1043459.fastq.gz
│ ├── SRR1043460.fastq.gz
│ ├── SRR1043461.fastq.gz
│ ├── SRR1043462.fastq.gz
│ ├── SRR1043463.fastq.gz
│ ├── SRR1043464.fastq.gz
│ ├── SRR1043465.fastq.gz
│ ├── SRR1043466.fastq.gz
│ ├── SRR1043467.fastq.gz
│ ├── SRR1043468.fastq.gz
│ ├── SRR1043469.fastq.gz
│ ├── SRR1043470.fastq.gz
│ ├── SRR1043471.fastq.gz
│ ├── SRR1043472.fastq.gz
│ ├── SRR1043473.fastq.gz
│ ├── SRR1043474.fastq.gz
│ ├── SRR1043475.fastq.gz
│ ├── SRR1043476.fastq.gz
│ ├── SRR1043477.fastq.gz
│ ├── SRR1043478.fastq.gz
│ ├── SRR1043479.fastq.gz
│ ├── SRR1043480.fastq.gz
│ ├── SRR1043481.fastq.gz
│ ├── SRR1043482.fastq.gz
│ ├── SRR1043483.fastq.gz
│ ├── SRR1043484.fastq.gz
│ ├── STRT-seq_C1_bc.csv
│ └── whitelist.txt
└── star_outs
├── Aligned.sortedByCoord.out.bam
├── Log.final.out
├── Log.out
├── Log.progress.out
├── SJ.out.tab
└── Solo.out
├── Barcodes.stats
└── Gene
├── Features.stats
├── filtered
│ ├── barcodes.tsv
│ ├── features.tsv
│ └── matrix.mtx
├── raw
│ ├── barcodes.tsv
│ ├── features.tsv
│ └── matrix.mtx
├── Summary.csv
└── UMIperCellSorted.txt
6 directories, 307 files