SPLiT-seq#

Check this GitHub page to see how SPLiT-seq libraries are generated experimentally. This is a split-pool based combinatorial indexing strategy, where fixed cells are used as the reaction chamber. mRNA molecules are first marked by oligo-dT primer with distinct barcodes (Round1 barcodes) in 48 minibulk reactions in a plate. Then all cells are pooled and randomly distributed into wells of a new 96 well plate where another level of barcode (Round2 barcodes) is added. The same procedure is repeated to add a third level of barcodes (Round3 barcodes). After that, cells are pooled, counted and split into sublibraries. For each sublibrary, an i7 index is added. Single cells can be identified by the combination of the Round1 + Round2 + Round3 + i7 barcodes.

Important

Be aware that there are different versions of SPLiT-seq. They have slightly different adaptor sequences and hence require slightly different parameters for the preprocessing steps. Check the SPLiT-seq GitHub Page and this thread for more details. In this documentation, we are using the Science 2018 version.

For Your Own Experiments#

The read configuration is the same as a standard library:

Order	Read	Cycle	Description
1	Read 1	>50	This yields `R1_001.fastq.gz`, cDNA reads
2	Index 1 (i7)	6	This yields `I1_001.fastq.gz`, index for sublibrary
3	Index 2 (i5)	Optional	This yields `I2_001.fastq.gz`, not used but can be present
4	Read 2	94	This yields `R2_001.fastq.gz`, Cell barcode and UMI

The content of Read 2 is like this:

Length	Sequence (5’ -> 3’)
94	10 bp UMI + 8 bp Round3 barcode + GTGGCCGATGTTTCGCATCGGCGTACGACT + 8 bp Round2 barcode + ATCCACGTGCTTGAGAGGCCAGAGCATTCG + 8 bp Round1 barcode

You can think of the 8 bp Round1, Round2 and Round3 barcodes as the well barcodes for the 1st, 2nd and 3rd plates, respectively. The 6 bp i7 barcode is the index for each sublibrary at the final stage. A cell goes into a well in the 1st plate, then another well in the 2nd plate, then another well in the 3rd plate and finally into a sublibrary. Different cells have a very low chance of going through the same combination of wells in the three plates and the same final sublibrary. Therefore, if reads have the same combination of well barcodes and sublibrary index (Round1 + Round2 + Round3 + i7), we can safely assume they are from the same cell.
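As a rough illustration of the Read 2 layout described above, the following sketch slices a made-up 94-nt read into its parts. The UMI and the barcode choices in the example read are arbitrary, and the variable names are only for illustration; the two linker sequences are the ones listed in the table.

```shell
# Linker sequences as listed in the Read 2 layout above (30 bp each).
linker1=GTGGCCGATGTTTCGCATCGGCGTACGACT
linker2=ATCCACGTGCTTGAGAGGCCAGAGCATTCG

# A made-up Read 2: UMI + Round3 + linker1 + Round2 + linker2 + Round1.
umi_in=ACGTACGTAC
read2="${umi_in}AAACATCG${linker1}ATGCCTAA${linker2}AACGTGAT"

# cut -c uses 1-based, inclusive character positions.
umi=$(printf '%s\n' "$read2" | cut -c1-10)
round3=$(printf '%s\n' "$read2" | cut -c11-18)
round2=$(printf '%s\n' "$read2" | cut -c49-56)
round1=$(printf '%s\n' "$read2" | cut -c87-94)

echo "length=${#read2} UMI=${umi} Round3=${round3} Round2=${round2} Round1=${round1}"
```

The read length comes out as 94, which matches the 94 cycles requested for Read 2.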

If you sequence the library via your core facility or a company, you need to provide the i7 index sequences you used during the sublibrary PCR. Then you will get two fastq files (R1 and R2) per sublibrary. The total number of files depends on how many sublibraries you made in the final library preparation step.

If you sequence the library on your own, you need to get the fastq files by running bcl2fastq by yourself. In this case it is better to write a SampleSheet.csv with i7 indices for each sublibrary. This will yield the fastq files similar to those from your core facility or the company. Here is an example of the SampleSheet.csv from a NextSeq run with 4 sublibraries using indices from the Science 2018 paper:

[Header],,,,,,,,,,,
IEMFileVersion,5,,,,,,,,,,
Date,17/12/2019,,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,,,,
Instrument Type,NextSeq/MiniSeq,,,,,,,,,,
Assay,AmpliSeq Library PLUS for Illumina,,,,,,,,,,
Index Adapters,AmpliSeq CD Indexes (384),,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,,
,,,,,,,,,,,
[Reads],,,,,,,,,,,
66,,,,,,,,,,,
94,,,,,,,,,,,
,,,,,,,,,,,
[Settings],,,,,,,,,,,
,,,,,,,,,,,
[Data],,,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Sublib01,,,,,,BC_0076,CAGATC,,,,
Sublib02,,,,,,BC_0077,ACTTGA,,,,
Sublib03,,,,,,BC_0078,GATCAG,,,,
Sublib04,,,,,,BC_0079,TAGCTT,,,,

Simply run bcl2fastq like this:

bcl2fastq --no-lane-splitting \
          --ignore-missing-positions \
          --ignore-missing-controls \
          --ignore-missing-filter \
          --ignore-missing-bcls \
          -r 4 -w 4 -p 4

Then, you will have R1_001.fastq.gz and R2_001.fastq.gz per sublibrary like this:

Sublib01_S1_R1_001.fastq.gz # 66 bp: cDNA reads
Sublib01_S1_R2_001.fastq.gz # 94 bp: UMI, three rounds of barcode, linkers
Sublib02_S2_R1_001.fastq.gz # 66 bp: cDNA reads
Sublib02_S2_R2_001.fastq.gz # 94 bp: UMI, three rounds of barcode, linkers
Sublib03_S3_R1_001.fastq.gz # 66 bp: cDNA reads
Sublib03_S3_R2_001.fastq.gz # 94 bp: UMI, three rounds of barcode, linkers
Sublib04_S4_R1_001.fastq.gz # 66 bp: cDNA reads
Sublib04_S4_R2_001.fastq.gz # 94 bp: UMI, three rounds of barcode, linkers

That’s it. You are ready to go from here using starsolo. Treat each sublibrary as a separate experiment and process it independently: within a sublibrary, a single cell is identified by the combination of the Round1 + Round2 + Round3 barcodes in R2. For example, if you detect barcode AACGTGAT + AACGTGAT + AACGTGAT in both Sublib01 and Sublib02, you should treat them as different cells. Therefore, we need to generate a count matrix for each sublibrary separately, and combine them in the downstream analysis.

The advantage of doing this is that we divide each experiment into small chunks and use exactly the same procedure for each chunk independently. In addition, the whitelist is simply the combination of the Round1 + Round2 + Round3 barcodes for every analysis.
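One simple way to keep cells from different sublibraries apart when the count matrices are combined downstream is to prefix every barcode with the sublibrary name. This is only a sketch of that idea: the `tag_barcodes` helper is hypothetical, and the barcode lines are made up rather than taken from real output.

```shell
# Hypothetical helper: prefix every barcode read from stdin with a sublibrary name.
tag_barcodes() {
    awk -v s="$1" '{ print s "_" $0 }'
}

# The same barcode combination seen in two sublibraries (made-up example).
combined=$(
    printf 'AACGTGATAACGTGATAACGTGAT\n' | tag_barcodes Sublib01
    printf 'AACGTGATAACGTGATAACGTGAT\n' | tag_barcodes Sublib02
)
echo "$combined"

# After tagging, the two entries stay distinct.
n_cells=$(echo "$combined" | sort -u | wc -l | tr -d ' ')
echo "distinct cells: ${n_cells}"
```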

Public Data#

For the purpose of demonstration, we will use the SPLiT-seq data from the following paper:

Note

Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Gray L, Peeler DJ, Mukherjee S, Chen W, Pun SH, Sellers DL, Tasic B, Seelig G (2018) Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360:176–182. https://doi.org/10.1126/science.aam8999

where the authors developed SPLiT-seq for the first time and described the details of the method. The data is in GEO under the accession code GSE110823. You can get all the fastq files directly from this ENA page. There are a few samples, and we are going to use the 150000_CNS_nuclei sample. The sample accession is SAMN08567262. As you can see, there are a total of 14 run accessions. Each run accession represents the data from one sublibrary. This means the authors have already demultiplexed the data based on the i7 index for us. We can just download each run accession and process them independently. Single cells can be identified by the combination of the Round1 + Round2 + Round3 barcodes.

I’m not going to do all 14 sublibraries. Let’s just use the data SRR6750042 for the demonstration:

# get fastq files
mkdir -p split-seq/data
wget -P split-seq/data -c \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/002/SRR6750042/SRR6750042_1.fastq.gz \
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR675/002/SRR6750042/SRR6750042_2.fastq.gz

Prepare Whitelist#

The full oligo sequences can be found in the Supplementary Table S12 from the SPLiT-seq paper. As you can see, there are a total of 96 different Round2 barcodes and 96 different Round3 barcodes. For the sublibrary index, they provided 8 different ones (BC_0076 - BC_0083), but you can certainly use more. For the Round1 barcodes, it is a bit more complicated. There are 96 of them (Round1_01 - Round1_96). The first 48 are oligo-dT primers and the last 48 are random hexamers. They are mixed into 48 different wells:

Round1_01 and Round1_49 are mixed in the same well;
Round1_02 and Round1_50 are mixed in the same well;
Round1_03 and Round1_51 are mixed in the same well;
…
…
…
Round1_48 and Round1_96 are mixed in the same well.

Therefore, we effectively have 48 different Round1 barcodes, one pair of oligos per well. If you use the oligos provided in the Supplementary Table S12 from the SPLiT-seq paper, you should have a capacity of 48 * 96 * 96 * 8 = 3,538,944 combinations. For the preprocessing, we can treat the Round1 barcodes as if there were 96 different ones, and merge the two barcodes from each well during the downstream analysis.
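The arithmetic above, and the pairing of Round1 oligos into wells, can be written out as a short sketch. The counts are the ones quoted in the text; the `round1_well` helper is hypothetical and only encodes the rule that Round1_N and Round1_(N+48) share a well.

```shell
# Barcode counts taken from the text above.
round1_wells=48
round2_bcs=96
round3_bcs=96
sublib_indices=8

capacity=$((round1_wells * round2_bcs * round3_bcs * sublib_indices))
echo "capacity: ${capacity}"    # capacity: 3538944

# Hypothetical helper: map a Round1 oligo number (1-96) to its well (1-48).
round1_well() {
    echo $(( ($1 - 1) % 48 + 1 ))
}

round1_well 1     # 1
round1_well 49    # 1  (same well as Round1_01)
round1_well 96    # 48 (same well as Round1_48)
```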

I have collected the index table as follows, and the names of the oligos are directly taken from the paper to be consistent:

Round1 Barcodes (8 bp)

Name	Sequence	Reverse complement
Round1_01	AACGTGAT	ATCACGTT
Round1_02	AAACATCG	CGATGTTT
Round1_03	ATGCCTAA	TTAGGCAT
Round1_04	AGTGGTCA	TGACCACT
Round1_05	ACCACTGT	ACAGTGGT
Round1_06	ACATTGGC	GCCAATGT
Round1_07	CAGATCTG	CAGATCTG
Round1_08	CATCAAGT	ACTTGATG
Round1_09	CGCTGATC	GATCAGCG
Round1_10	ACAAGCTA	TAGCTTGT
Round1_11	CTGTAGCC	GGCTACAG
Round1_12	AGTACAAG	CTTGTACT
Round1_13	AACAACCA	TGGTTGTT
Round1_14	AACCGAGA	TCTCGGTT
Round1_15	AACGCTTA	TAAGCGTT
Round1_16	AAGACGGA	TCCGTCTT
Round1_17	AAGGTACA	TGTACCTT
Round1_18	ACACAGAA	TTCTGTGT
Round1_19	ACAGCAGA	TCTGCTGT
Round1_20	ACCTCCAA	TTGGAGGT
Round1_21	ACGCTCGA	TCGAGCGT
Round1_22	ACGTATCA	TGATACGT
Round1_23	ACTATGCA	TGCATAGT
Round1_24	AGAGTCAA	TTGACTCT
Round1_25	AGATCGCA	TGCGATCT
Round1_26	AGCAGGAA	TTCCTGCT
Round1_27	AGTCACTA	TAGTGACT
Round1_28	ATCCTGTA	TACAGGAT
Round1_29	ATTGAGGA	TCCTCAAT
Round1_30	CAACCACA	TGTGGTTG
Round1_31	GACTAGTA	TACTAGTC
Round1_32	CAATGGAA	TTCCATTG
Round1_33	CACTTCGA	TCGAAGTG
Round1_34	CAGCGTTA	TAACGCTG
Round1_35	CATACCAA	TTGGTATG
Round1_36	CCAGTTCA	TGAACTGG
Round1_37	CCGAAGTA	TACTTCGG
Round1_38	CCGTGAGA	TCTCACGG
Round1_39	CCTCCTGA	TCAGGAGG
Round1_40	CGAACTTA	TAAGTTCG
Round1_41	CGACTGGA	TCCAGTCG
Round1_42	CGCATACA	TGTATGCG
Round1_43	CTCAATGA	TCATTGAG
Round1_44	CTGAGCCA	TGGCTCAG
Round1_45	CTGGCATA	TATGCCAG
Round1_46	GAATCTGA	TCAGATTC
Round1_47	CAAGACTA	TAGTCTTG
Round1_48	GAGCTGAA	TTCAGCTC
Round1_49	GATAGACA	TGTCTATC
Round1_50	GCCACATA	TATGTGGC
Round1_51	GCGAGTAA	TTACTCGC
Round1_52	GCTAACGA	TCGTTAGC
Round1_53	GCTCGGTA	TACCGAGC
Round1_54	GGAGAACA	TGTTCTCC
Round1_55	GGTGCGAA	TTCGCACC
Round1_56	GTACGCAA	TTGCGTAC
Round1_57	GTCGTAGA	TCTACGAC
Round1_58	GTCTGTCA	TGACAGAC
Round1_59	GTGTTCTA	TAGAACAC
Round1_60	TAGGATGA	TCATCCTA
Round1_61	TATCAGCA	TGCTGATA
Round1_62	TCCGTCTA	TAGACGGA
Round1_63	TCTTCACA	TGTGAAGA
Round1_64	TGAAGAGA	TCTCTTCA
Round1_65	TGGAACAA	TTGTTCCA
Round1_66	TGGCTTCA	TGAAGCCA
Round1_67	TGGTGGTA	TACCACCA
Round1_68	TTCACGCA	TGCGTGAA
Round1_69	AACTCACC	GGTGAGTT
Round1_70	AAGAGATC	GATCTCTT
Round1_71	AAGGACAC	GTGTCCTT
Round1_72	AATCCGTC	GACGGATT
Round1_73	AATGTTGC	GCAACATT
Round1_74	ACACGACC	GGTCGTGT
Round1_75	ACAGATTC	GAATCTGT
Round1_76	AGATGTAC	GTACATCT
Round1_77	AGCACCTC	GAGGTGCT
Round1_78	AGCCATGC	GCATGGCT
Round1_79	AGGCTAAC	GTTAGCCT
Round1_80	ATAGCGAC	GTCGCTAT
Round1_81	ATCATTCC	GGAATGAT
Round1_82	ATTGGCTC	GAGCCAAT
Round1_83	CAAGGAGC	GCTCCTTG
Round1_84	CACCTTAC	GTAAGGTG
Round1_85	CCATCCTC	GAGGATGG
Round1_86	CCGACAAC	GTTGTCGG
Round1_87	CCTAATCC	GGATTAGG
Round1_88	CCTCTATC	GATAGAGG
Round1_89	CGACACAC	GTGTGTCG
Round1_90	CGGATTGC	GCAATCCG
Round1_91	CTAAGGTC	GACCTTAG
Round1_92	GAACAGGC	GCCTGTTC
Round1_93	GACAGTGC	GCACTGTC
Round1_94	GAGTTAGC	GCTAACTC
Round1_95	GATGAATC	GATTCATC
Round1_96	GCCAAGAC	GTCTTGGC

Round2 Barcodes (8 bp)

Name	Sequence	Reverse complement
Round2_01	AACGTGAT	ATCACGTT
Round2_02	AAACATCG	CGATGTTT
Round2_03	ATGCCTAA	TTAGGCAT
Round2_04	AGTGGTCA	TGACCACT
Round2_05	ACCACTGT	ACAGTGGT
Round2_06	ACATTGGC	GCCAATGT
Round2_07	CAGATCTG	CAGATCTG
Round2_08	CATCAAGT	ACTTGATG
Round2_09	CGCTGATC	GATCAGCG
Round2_10	ACAAGCTA	TAGCTTGT
Round2_11	CTGTAGCC	GGCTACAG
Round2_12	AGTACAAG	CTTGTACT
Round2_13	AACAACCA	TGGTTGTT
Round2_14	AACCGAGA	TCTCGGTT
Round2_15	AACGCTTA	TAAGCGTT
Round2_16	AAGACGGA	TCCGTCTT
Round2_17	AAGGTACA	TGTACCTT
Round2_18	ACACAGAA	TTCTGTGT
Round2_19	ACAGCAGA	TCTGCTGT
Round2_20	ACCTCCAA	TTGGAGGT
Round2_21	ACGCTCGA	TCGAGCGT
Round2_22	ACGTATCA	TGATACGT
Round2_23	ACTATGCA	TGCATAGT
Round2_24	AGAGTCAA	TTGACTCT
Round2_25	AGATCGCA	TGCGATCT
Round2_26	AGCAGGAA	TTCCTGCT
Round2_27	AGTCACTA	TAGTGACT
Round2_28	ATCCTGTA	TACAGGAT
Round2_29	ATTGAGGA	TCCTCAAT
Round2_30	CAACCACA	TGTGGTTG
Round2_31	GACTAGTA	TACTAGTC
Round2_32	CAATGGAA	TTCCATTG
Round2_33	CACTTCGA	TCGAAGTG
Round2_34	CAGCGTTA	TAACGCTG
Round2_35	CATACCAA	TTGGTATG
Round2_36	CCAGTTCA	TGAACTGG
Round2_37	CCGAAGTA	TACTTCGG
Round2_38	CCGTGAGA	TCTCACGG
Round2_39	CCTCCTGA	TCAGGAGG
Round2_40	CGAACTTA	TAAGTTCG
Round2_41	CGACTGGA	TCCAGTCG
Round2_42	CGCATACA	TGTATGCG
Round2_43	CTCAATGA	TCATTGAG
Round2_44	CTGAGCCA	TGGCTCAG
Round2_45	CTGGCATA	TATGCCAG
Round2_46	GAATCTGA	TCAGATTC
Round2_47	CAAGACTA	TAGTCTTG
Round2_48	GAGCTGAA	TTCAGCTC
Round2_49	GATAGACA	TGTCTATC
Round2_50	GCCACATA	TATGTGGC
Round2_51	GCGAGTAA	TTACTCGC
Round2_52	GCTAACGA	TCGTTAGC
Round2_53	GCTCGGTA	TACCGAGC
Round2_54	GGAGAACA	TGTTCTCC
Round2_55	GGTGCGAA	TTCGCACC
Round2_56	GTACGCAA	TTGCGTAC
Round2_57	GTCGTAGA	TCTACGAC
Round2_58	GTCTGTCA	TGACAGAC
Round2_59	GTGTTCTA	TAGAACAC
Round2_60	TAGGATGA	TCATCCTA
Round2_61	TATCAGCA	TGCTGATA
Round2_62	TCCGTCTA	TAGACGGA
Round2_63	TCTTCACA	TGTGAAGA
Round2_64	TGAAGAGA	TCTCTTCA
Round2_65	TGGAACAA	TTGTTCCA
Round2_66	TGGCTTCA	TGAAGCCA
Round2_67	TGGTGGTA	TACCACCA
Round2_68	TTCACGCA	TGCGTGAA
Round2_69	AACTCACC	GGTGAGTT
Round2_70	AAGAGATC	GATCTCTT
Round2_71	AAGGACAC	GTGTCCTT
Round2_72	AATCCGTC	GACGGATT
Round2_73	AATGTTGC	GCAACATT
Round2_74	ACACGACC	GGTCGTGT
Round2_75	ACAGATTC	GAATCTGT
Round2_76	AGATGTAC	GTACATCT
Round2_77	AGCACCTC	GAGGTGCT
Round2_78	AGCCATGC	GCATGGCT
Round2_79	AGGCTAAC	GTTAGCCT
Round2_80	ATAGCGAC	GTCGCTAT
Round2_81	ATCATTCC	GGAATGAT
Round2_82	ATTGGCTC	GAGCCAAT
Round2_83	CAAGGAGC	GCTCCTTG
Round2_84	CACCTTAC	GTAAGGTG
Round2_85	CCATCCTC	GAGGATGG
Round2_86	CCGACAAC	GTTGTCGG
Round2_87	CCTAATCC	GGATTAGG
Round2_88	CCTCTATC	GATAGAGG
Round2_89	CGACACAC	GTGTGTCG
Round2_90	CGGATTGC	GCAATCCG
Round2_91	CTAAGGTC	GACCTTAG
Round2_92	GAACAGGC	GCCTGTTC
Round2_93	GACAGTGC	GCACTGTC
Round2_94	GAGTTAGC	GCTAACTC
Round2_95	GATGAATC	GATTCATC
Round2_96	GCCAAGAC	GTCTTGGC

Round3 Barcodes (8 bp)

Name	Sequence	Reverse complement
Round3_01	AACGTGAT	ATCACGTT
Round3_02	AAACATCG	CGATGTTT
Round3_03	ATGCCTAA	TTAGGCAT
Round3_04	AGTGGTCA	TGACCACT
Round3_05	ACCACTGT	ACAGTGGT
Round3_06	ACATTGGC	GCCAATGT
Round3_07	CAGATCTG	CAGATCTG
Round3_08	CATCAAGT	ACTTGATG
Round3_09	CGCTGATC	GATCAGCG
Round3_10	ACAAGCTA	TAGCTTGT
Round3_11	CTGTAGCC	GGCTACAG
Round3_12	AGTACAAG	CTTGTACT
Round3_13	AACAACCA	TGGTTGTT
Round3_14	AACCGAGA	TCTCGGTT
Round3_15	AACGCTTA	TAAGCGTT
Round3_16	AAGACGGA	TCCGTCTT
Round3_17	AAGGTACA	TGTACCTT
Round3_18	ACACAGAA	TTCTGTGT
Round3_19	ACAGCAGA	TCTGCTGT
Round3_20	ACCTCCAA	TTGGAGGT
Round3_21	ACGCTCGA	TCGAGCGT
Round3_22	ACGTATCA	TGATACGT
Round3_23	ACTATGCA	TGCATAGT
Round3_24	AGAGTCAA	TTGACTCT
Round3_25	AGATCGCA	TGCGATCT
Round3_26	AGCAGGAA	TTCCTGCT
Round3_27	AGTCACTA	TAGTGACT
Round3_28	ATCCTGTA	TACAGGAT
Round3_29	ATTGAGGA	TCCTCAAT
Round3_30	CAACCACA	TGTGGTTG
Round3_31	GACTAGTA	TACTAGTC
Round3_32	CAATGGAA	TTCCATTG
Round3_33	CACTTCGA	TCGAAGTG
Round3_34	CAGCGTTA	TAACGCTG
Round3_35	CATACCAA	TTGGTATG
Round3_36	CCAGTTCA	TGAACTGG
Round3_37	CCGAAGTA	TACTTCGG
Round3_38	CCGTGAGA	TCTCACGG
Round3_39	CCTCCTGA	TCAGGAGG
Round3_40	CGAACTTA	TAAGTTCG
Round3_41	CGACTGGA	TCCAGTCG
Round3_42	CGCATACA	TGTATGCG
Round3_43	CTCAATGA	TCATTGAG
Round3_44	CTGAGCCA	TGGCTCAG
Round3_45	CTGGCATA	TATGCCAG
Round3_46	GAATCTGA	TCAGATTC
Round3_47	CAAGACTA	TAGTCTTG
Round3_48	GAGCTGAA	TTCAGCTC
Round3_49	GATAGACA	TGTCTATC
Round3_50	GCCACATA	TATGTGGC
Round3_51	GCGAGTAA	TTACTCGC
Round3_52	GCTAACGA	TCGTTAGC
Round3_53	GCTCGGTA	TACCGAGC
Round3_54	GGAGAACA	TGTTCTCC
Round3_55	GGTGCGAA	TTCGCACC
Round3_56	GTACGCAA	TTGCGTAC
Round3_57	GTCGTAGA	TCTACGAC
Round3_58	GTCTGTCA	TGACAGAC
Round3_59	GTGTTCTA	TAGAACAC
Round3_60	TAGGATGA	TCATCCTA
Round3_61	TATCAGCA	TGCTGATA
Round3_62	TCCGTCTA	TAGACGGA
Round3_63	TCTTCACA	TGTGAAGA
Round3_64	TGAAGAGA	TCTCTTCA
Round3_65	TGGAACAA	TTGTTCCA
Round3_66	TGGCTTCA	TGAAGCCA
Round3_67	TGGTGGTA	TACCACCA
Round3_68	TTCACGCA	TGCGTGAA
Round3_69	AACTCACC	GGTGAGTT
Round3_70	AAGAGATC	GATCTCTT
Round3_71	AAGGACAC	GTGTCCTT
Round3_72	AATCCGTC	GACGGATT
Round3_73	AATGTTGC	GCAACATT
Round3_74	ACACGACC	GGTCGTGT
Round3_75	ACAGATTC	GAATCTGT
Round3_76	AGATGTAC	GTACATCT
Round3_77	AGCACCTC	GAGGTGCT
Round3_78	AGCCATGC	GCATGGCT
Round3_79	AGGCTAAC	GTTAGCCT
Round3_80	ATAGCGAC	GTCGCTAT
Round3_81	ATCATTCC	GGAATGAT
Round3_82	ATTGGCTC	GAGCCAAT
Round3_83	CAAGGAGC	GCTCCTTG
Round3_84	CACCTTAC	GTAAGGTG
Round3_85	CCATCCTC	GAGGATGG
Round3_86	CCGACAAC	GTTGTCGG
Round3_87	CCTAATCC	GGATTAGG
Round3_88	CCTCTATC	GATAGAGG
Round3_89	CGACACAC	GTGTGTCG
Round3_90	CGGATTGC	GCAATCCG
Round3_91	CTAAGGTC	GACCTTAG
Round3_92	GAACAGGC	GCCTGTTC
Round3_93	GACAGTGC	GCACTGTC
Round3_94	GAGTTAGC	GCTAACTC
Round3_95	GATGAATC	GATTCATC
Round3_96	GCCAAGAC	GTCTTGGC
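The "Reverse complement" column in the tables above can be reproduced with standard command-line tools. This is a minimal sketch, assuming `rev` and `tr` are available on your system; the `revcomp` helper name is only for illustration.

```shell
# Hypothetical helper: reverse-complement a DNA sequence (uppercase A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

revcomp AACGTGAT    # ATCACGTT, as listed for Round1_01
revcomp CAGATCTG    # CAGATCTG, this barcode is its own reverse complement
```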

I have put those three tables into csv files and you can download them to have a look:

SPLiT-seq_Round1_bc.csv
SPLiT-seq_Round2_bc.csv
SPLiT-seq_Round3_bc.csv

Let’s download them:

wget -P split-seq/data \
    https://teichlab.github.io/scg_lib_structs/data/SPLiT-seq/SPLiT-seq_Round1_bc.csv \
    https://teichlab.github.io/scg_lib_structs/data/SPLiT-seq/SPLiT-seq_Round2_bc.csv \
    https://teichlab.github.io/scg_lib_structs/data/SPLiT-seq/SPLiT-seq_Round3_bc.csv

Now we need to generate the whitelists for those three rounds of barcodes. Those barcodes are sequenced in Read 2 using the top strand as the template. They are in the same direction as the Illumina TruSeq Read 2 sequence. Therefore, we should take their sequences as they are. In addition, if you check the SPLiT-seq GitHub page, you will see that the Round3 barcode is sequenced first, then the Round2 barcode and finally the Round1 barcode. Therefore, we should pass the whitelists to starsolo in that order. See the next section for more details.

tail -n +2 split-seq/data/SPLiT-seq_Round1_bc.csv | \
    cut -f 2 -d, > split-seq/data/round1_whitelist.txt

tail -n +2 split-seq/data/SPLiT-seq_Round2_bc.csv | \
    cut -f 2 -d, > split-seq/data/round2_whitelist.txt

tail -n +2 split-seq/data/SPLiT-seq_Round3_bc.csv | \
    cut -f 2 -d, > split-seq/data/round3_whitelist.txt
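Before passing the whitelists to starsolo, it may be worth checking that each file has the expected number of entries and that every entry is a clean 8-bp sequence (stray header lines or Windows line endings would show up as malformed). The `check_whitelist` helper below is a hypothetical sketch, demonstrated on a small temporary file rather than on the real whitelists.

```shell
# Hypothetical helper: report total, unique and malformed entries in a whitelist file.
check_whitelist() {
    total=$(wc -l < "$1" | tr -d ' ')
    unique=$(sort -u "$1" | wc -l | tr -d ' ')
    # Count lines that are not exactly 8 characters of A/C/G/T.
    bad=$(grep -c -v -E '^[ACGT]{8}$' "$1" || true)
    echo "total=${total} unique=${unique} malformed=${bad}"
}

# Demonstration on a tiny made-up whitelist.
tmp=$(mktemp)
printf 'AACGTGAT\nAAACATCG\nATGCCTAA\n' > "$tmp"
check_whitelist "$tmp"    # total=3 unique=3 malformed=0
rm -f "$tmp"
```

For the real files, running `check_whitelist split-seq/data/round1_whitelist.txt` should report 96 total, 96 unique and 0 malformed entries if the tables above were extracted correctly.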

From FastQ To Count Matrix#

We can run starsolo in the following way:

# map and generate the count matrix

STAR --runThreadN 4 \
     --genomeDir mm10/star_index \
     --readFilesCommand zcat \
     --outFileNamePrefix split-seq/star_outs/ \
     --readFilesIn split-seq/data/SRR6750042_1.fastq.gz split-seq/data/SRR6750042_2.fastq.gz \
     --soloType CB_UMI_Complex \
     --soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93 \
     --soloUMIposition 0_0_0_9 \
     --soloCBwhitelist split-seq/data/round3_whitelist.txt split-seq/data/round2_whitelist.txt split-seq/data/round1_whitelist.txt \
     --soloCBmatchWLtype 1MM \
     --soloCellFilter EmptyDrops_CR \
     --soloStrand Forward \
     --outSAMattributes CB UB \
     --outSAMtype BAM SortedByCoordinate

Once that is finished, you can do exactly the same thing with all the remaining sublibraries. In practice, you can do this via a loop or a pipeline. They can be run independently in parallel.
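A loop over sublibraries could look like the sketch below. It is written as a dry run that only prints the commands. The `star_cmd` helper and the per-run output directory are my own assumptions, and only SRR6750042 is listed because it is the only run accession named in this document; the others need to be added from the ENA page.

```shell
# Hypothetical helper: print the STARsolo command for one run accession.
star_cmd() {
    run=$1
    echo STAR --runThreadN 4 \
        --genomeDir mm10/star_index \
        --readFilesCommand zcat \
        --outFileNamePrefix "split-seq/star_outs/${run}/" \
        --readFilesIn "split-seq/data/${run}_1.fastq.gz" "split-seq/data/${run}_2.fastq.gz" \
        --soloType CB_UMI_Complex \
        --soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93 \
        --soloUMIposition 0_0_0_9 \
        --soloCBwhitelist split-seq/data/round3_whitelist.txt split-seq/data/round2_whitelist.txt split-seq/data/round1_whitelist.txt \
        --soloCBmatchWLtype 1MM \
        --soloCellFilter EmptyDrops_CR \
        --soloStrand Forward \
        --outSAMattributes CB UB \
        --outSAMtype BAM SortedByCoordinate
}

# Space-separated run accessions; add the remaining sublibraries here.
runs="SRR6750042"

for run in $runs; do
    star_cmd "$run"    # dry run: prints the command instead of running it
done
```

Writing each run into its own output directory keeps the sublibraries separate, which is what the downstream merging step relies on.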

Explanation#

If you understand the SPLiT-seq experimental procedures described in this GitHub Page, the command above should be straightforward to understand.

--runThreadN 4

Use 4 cores for the preprocessing. Change accordingly if using more or fewer cores.

--genomeDir mm10/star_index

Pointing to the directory of the star index. The public data from the above paper was produced from mouse brains.

--readFilesCommand zcat

Since the fastq files are in .gz format, we need the zcat command to extract them on the fly.

--outFileNamePrefix split-seq/star_outs/

We want to keep everything organised. This parameter directs all output files into the split-seq/star_outs/ directory.

--readFilesIn

If you check the manual, we should put two files here. The first file is the reads that come from cDNA, and the second file should contain cell barcode and UMI. In SPLiT-seq, cDNA reads come from Read 1, and the cell barcode and UMI come from Read 2. Check the SPLiT-seq GitHub Page if you are not sure.

--soloType CB_UMI_Complex

Read 2 contains not only the cell barcodes and UMI but also the common linker sequences. The cell barcodes are non-consecutive, separated by the linker sequences. In this case, we have to use the CB_UMI_Complex option. Of course, we could also extract them upfront into a new fastq file, but that’s slow. It is better to use this option.

--soloCBposition and --soloUMIposition

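These options specify the locations of the cell barcodes and UMI in the second fastq file we passed to --readFilesIn. In this case, it is Read 2. Read the STAR manual for more details. Each location is written as startAnchor_startPosition_endAnchor_endPosition, where anchor 0 is the read start and positions are 0-based. I have drawn a picture to help myself decide the exact parameters. There is some freedom here depending on what you use as anchors. In SPLiT-seq, the UMI and cell barcodes are at fixed positions in Read 2, so the parameters are relatively straightforward to specify: the UMI occupies positions 0-9, the Round3 barcode 10-17, the Round2 barcode 48-55 and the Round1 barcode 86-93.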
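As a cross-check, the values passed to --soloUMIposition and --soloCBposition can be derived from the segment lengths of Read 2 (10 bp UMI, 8 bp barcodes, 30 bp linkers). This is a minimal sketch, assuming anchor 0 (read start) and 0-based inclusive coordinates; the variable names are only for illustration.

```shell
# Segment lengths from the Read 2 layout.
umi_len=10
bc_len=8
linker_len=30

# 0-based start and end (inclusive) of each segment, counted from the read start.
umi_start=0
umi_end=$((umi_start + umi_len - 1))
r3_start=$((umi_end + 1))
r3_end=$((r3_start + bc_len - 1))
r2_start=$((r3_end + 1 + linker_len))
r2_end=$((r2_start + bc_len - 1))
r1_start=$((r2_end + 1 + linker_len))
r1_end=$((r1_start + bc_len - 1))

umi_pos="0_${umi_start}_0_${umi_end}"
cb_pos="0_${r3_start}_0_${r3_end} 0_${r2_start}_0_${r2_end} 0_${r1_start}_0_${r1_end}"

echo "--soloUMIposition ${umi_pos}"    # --soloUMIposition 0_0_0_9
echo "--soloCBposition ${cb_pos}"      # --soloCBposition 0_10_0_17 0_48_0_55 0_86_0_93
```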

--soloCBwhitelist

The real cell barcode consists of three non-consecutive parts: the three rounds of barcodes. The whitelist here is the combination of those three lists. We should provide them separately, in the order in which they appear in Read 2 (Round3, Round2, Round1), and STAR will take care of the combinations.

--soloCBmatchWLtype 1MM

This controls how stringently the cell barcode reads must match the whitelist. The default option (1MM_Multi) does not work here. We choose 1MM for simplicity, but you might want to experiment with different parameters to see what difference they make.
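To give a feeling for what matching with at most one mismatch means, the sketch below lists the whitelist entries that differ from an observed 8-bp barcode at no more than one position. It only illustrates the idea and is not how STAR implements it internally; the `match_1mm` helper and the example barcodes are made up for the demonstration.

```shell
# Hypothetical helper: print whitelist entries within one mismatch of a query,
# followed by the number of mismatches.
match_1mm() {
    awk -v q="$1" '
        length($0) == length(q) {
            mm = 0
            for (i = 1; i <= length(q); i++)
                if (substr($0, i, 1) != substr(q, i, 1)) mm++
            if (mm <= 1) print $0, mm
        }' "$2"
}

# Demonstration on a tiny made-up whitelist.
wl=$(mktemp)
printf 'AACGTGAT\nAAACATCG\nATGCCTAA\n' > "$wl"
match_1mm AACGTGAA "$wl"    # AACGTGAT 1
rm -f "$wl"
```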

--soloCellFilter EmptyDrops_CR

Experiments are never perfect. Even for barcodes that do not capture the molecules inside the cells, you may still get some reads due to various reasons, such as ambient RNA or DNA and leakage. In general, the number of reads from those cell barcodes should be much smaller, often orders of magnitude smaller, than those barcodes that come from real cells. In order to identify true cells from the background, you can apply different algorithms. Check the star manual for more information. We use EmptyDrops_CR which is the most frequently used parameter.

--soloStrand Forward

The choice of this parameter depends on where the cDNA reads come from, i.e. the reads from the first file passed to --readFilesIn. You need to check the experimental protocol. If the cDNA reads are from the same strand as the mRNA (the coding strand), this parameter will be Forward (this is the default). If they are from the opposite strand to the mRNA, which is often called the first strand, this parameter will be Reverse. In the case of SPLiT-seq, the cDNA reads are from the Read 1 file. During the experiment, the mRNA molecules are captured by barcoded primers, and the later rounds add the remaining barcodes, the UMI and the Illumina Read 2 sequence to this end. Therefore, Read 2 consists of the barcodes and UMI. They come from the first strand, complementary to the coding strand. Read 1 comes from the coding strand. Therefore, use Forward for SPLiT-seq data. Forward is the default because many protocols generate data like this, but I still specified it here to make it clear. Check the SPLiT-seq GitHub Page if you are not sure.

--outSAMattributes CB UB

We want the cell barcode and UMI sequences in the CB and UB attributes of the output, respectively. The information will be very helpful for downstream analysis.
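As an example of how the CB tag can be used downstream, the sketch below counts alignments per cell barcode from SAM-formatted text. The `count_by_cb` helper is hypothetical, and the three SAM records and their tag values are made up for the demonstration; with real data you would typically feed it the output of `samtools view` on the BAM file.

```shell
# Hypothetical helper: count SAM records per CB tag value (SAM text on stdin).
count_by_cb() {
    awk -F'\t' '
        { for (i = 12; i <= NF; i++)          # optional tags start at field 12
              if ($i ~ /^CB:Z:/) n[substr($i, 6)]++ }
        END { for (cb in n) print cb "\t" n[cb] }' | sort
}

# Three made-up SAM records carrying CB and UB tags.
printf 'r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tCB:Z:BC1\tUB:Z:AAA\n'  > mock.sam
printf 'r2\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tCB:Z:BC2\tUB:Z:CCC\n' >> mock.sam
printf 'r3\t0\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tCB:Z:BC1\tUB:Z:GGG\n' >> mock.sam

count_by_cb < mock.sam
rm -f mock.sam
```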

--outSAMtype BAM SortedByCoordinate

We want sorted BAM for easy handling by other programs.

Once that finishes, you could further merge some barcodes based on the information of the Round1 barcodes during the downstream analysis. We are not going to do it here.

If everything goes well, your directory should look the same as the following:

scg_prep_test/split-seq/
├── data
│   ├── round1_whitelist.txt
│   ├── round2_whitelist.txt
│   ├── round3_whitelist.txt
│   ├── SPLiT-seq_Round1_bc.csv
│   ├── SPLiT-seq_Round2_bc.csv
│   ├── SPLiT-seq_Round3_bc.csv
│   ├── SRR6750042_1.fastq.gz
│   └── SRR6750042_2.fastq.gz
└── star_outs
    ├── Aligned.sortedByCoord.out.bam
    ├── Log.final.out
    ├── Log.out
    ├── Log.progress.out
    ├── SJ.out.tab
    └── Solo.out
        ├── Barcodes.stats
        └── Gene
            ├── Features.stats
            ├── filtered
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   └── matrix.mtx
            ├── raw
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   └── matrix.mtx
            ├── Summary.csv
            └── UMIperCellSorted.txt

6 directories, 23 files