sci-ATAC-seq#

Check this GitHub page to see how sci-ATAC-seq libraries are generated experimentally. This is a split-pool based combinatorial indexing strategy, where nuclei are transposed in mini-bulk with indexed transposase Tn5 (C15 + D15). Then all nuclei are pooled and a fraction of the nuclei is randomly distributed into new wells. Library preparation is performed using a PCR primers with different indices (i5 + i7). Single cells can be identified by the combination of D15 + i7 + C15 + i5. You might wonder what C15 and D15 are, those are the primer in the published paper. Keep reading and you will see.

Important

The library of the sci-ATAC-seq is kind of complicated, especially for beginners. First, make sure you are familiar with different sequencing modes from different Illumina machines by looking at this page. Then make sure you understand how the actual sequencing is done for sci-ATAC-seq by checking this GitHub page and the associated publications.

For Your Own Experiments#

If you use this assay, you have to run the sequencing by yourself using a custom sequencing recipe or ask your core facility to do this for you. We don’t have the proper setup to do this experiment here, so I haven’t tried sci-ATAC-seq by myself. This is the educational guess of the sequencing configuration. Using more recent machines and chemistries, like iSeq 100, MiniSeq (Standard), NextSeq, HiSeq X, HiSeq 3000, HiSeq 4000 and NovaSeq 600 (v1.5), it should be (I call this Configuration 1):

Order

Read

Cycle

Description

1

Read 1

>50

This yields R1_001.fastq.gz, Genomic insert

2

Index 1 (i7)

8 normal + 27 dark + 10 normal cycles

This yields I1_001.fastq.gz, D15 + i7 index

3

Index 2 (i5)

8 normal + 21 dark + 10 normal cycles

This yields I2_001.fastq.gz, C15 + i5 index

4

Read 2

>50

This yields R2_001.fastq.gz, Genomic insert

Using older machines and chemistries like MiSeq, HiSeq 2000, HiSeq 2500, MiniSeq (Rapid) and NovaSeq 6000 (v1.0), it should be (I call this Configuration 2):

Order

Read

Cycle

Description

1

Read 1

>50

This yields R1_001.fastq.gz, Genomic insert

2

Index 1 (i7)

8 normal + 27 dark + 10 normal cycles

This yields I1_001.fastq.gz, D15 + i7 index

3

Index 2 (i5)

10 normal + 21 dark + 8 normal cycles

This yields I2_001.fastq.gz, i5 + C15 index

4

Read 2

>50

This yields R2_001.fastq.gz, Genomic insert

The reason for the dark cycles is to avoid sequencing those common adaptors with exact the same DNA sequences. Again, if you are not sure, check this GitHub page to see the reason.

After the sequencing is done, you could just run the bcl2fastq without a SampleSheet.csv, like this:

bcl2fastq --create-fastq-for-index-reads \
          --no-lane-splitting \
          --ignore-missing-positions \
          --ignore-missing-controls \
          --ignore-missing-filter \
          --ignore-missing-bcls \
          -r 4 -w 4 -p 4

In this case, you will have four fastq files per run:

Undetermined_S0_I1_001.fastq.gz  # 16 bp, D15 + i7
Undetermined_S0_I2_001.fastq.gz  # 16 bp, C15 + i5
Undetermined_S0_R1_001.fastq.gz  # >50 bp, genomic
Undetermined_S0_R2_001.fastq.gz  # >50 bp, genomic

Your cell barcodes will be I1 + I2. We cloud stitch them together to get the cell barcodes to a single fastq file:

paste <(zcat Undetermined_S0_I1_001.fastq.gz) <(zcat Undetermined_S0_I2_001.fastq.gz) | \
    awk -F '\t' '{ if(NR%4==1||NR%4==3) {print $1} else {print $1 $2} }' | \
    gzip > Undetermined_S0_CB_001.fastq.gz

After that, you are ready to throw Undetermined_S0_R1_001.fastq.gz, Undetermined_S0_R2_001.fastq.gz and Undetermined_S0_CB_001.fastq.gz into chromap. See the later section.

Public Data#

For the purpose of demonstration, we will use the sci-ATAC-seq data from the following paper:

Note

Cusanovich DA, Reddington JP, Garfield DA, Daza RM, Aghamirzaie D, Marco-Ferreres R, Pliner HA, Christiansen L, Qiu X, Steemers FJ, Trapnell C, Shendure J, Furlong EEM (2018) The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555:538–542. https://doi.org/10.1038/nature25981

where the authors used sci-ATAC-seq to profile the chromatin accessibility of Drosophila embryos during three landmark embryonic stages. The data is in GEO under the accession code GSE101581. The raw reads are in SRR5837698. To get the reads, we use fastq-dump from the sratools:

mkdir -p sci-atac/data
fastq-dump --split-files \
           --origfmt \
           --outdir sci-atac/data \
           --defline-seq '@rd.$si:$sg:$sn' SRR5837698

The reason of using the specific options above is a bit complicated. I wrote a post about this, and you can have a look to see the reason. Once the program finishes running, you will have two plain files: SRR5837698_1.fastq and SRR5837698_1.fastq, each of which has a size of 86G. Unfortunately, it seems fastq-dump cannot generate gzipped file on the fly. I’m going to gzip them manually to save space:

gzip sci-atac/data/SRR5837698_1.fastq sci-atac/data/SRR5837698_2.fastq

Anyway, let’s have a look at the first 5 reads from Read 1:

@rd.1::ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1
GGCTTNTATTATGACCGCAATGAAGTCCGATCGCAGATAATCCGCAAAGGA
+ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1
AAAAA#EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@rd.2::CGCTCATTATACCTCGACNGGCTATAGATACCTAAG:2
CTTCTNTTCTCTTCTCTTCTCTTCTCTTCTCTTCTCTTCACTTCTATTCTA
+CGCTCATTATACCTCGACNGGCTATAGATACCTAAG:2
AAAAA#EEEAEEEEEEEEEEAEEAEEEEE/EEEAE</E/////E</////A
@rd.3::CTGAAGCTTCCTCTGCCGNGGCTATAAAGCCGGCTG:3
GTAGTNGAGTGTATGGCCCACGCCGTAGCACCAGAAAAAAATTGGGTCGTT
+CTGAAGCTTCCTCTGCCGNGGCTATAAAGCCGGCTG:3
AAAAA#EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@rd.4::GAGATTCCGACTGGACCANCCTCTATATAAGAGGCG:4
ACCGANGAGGGTTGCGTCATTCGTCCTGTACGCTCCGCATTGTCCAGTAAA
+GAGATTCCGACTGGACCANCCTCTATATAAGAGGCG:4
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@rd.5::TAATGCGCGCGCGTTCATNTCAGTACGGTCCAGGAG:5
ATTTTNGATCAAATCCACATTCAAATGGGGATCTTTTTTCTCGGTCGATTA
+TAATGCGCGCGCGTTCATNTCAGTACGGTCCAGGAG:5
AAAAA#E/EEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEE

As you can see from the header, the 8 bp + 8 bp + 8 bp + 8 bp = 32 bp cell barcodes are in the 3rd field with : as the field separator. Unfortunately, the quality strings for them are lost, but we can just use them as if they are perfect. Now we need to generate a new fastq files holding these 32 bp cell barcodes with high Phred scores, say 40. The corresponding ASCII encoding is 40 + 33 = 73, which is I. You can use the header from either SRR5837698_1.fastq.gz or SRR5837698_2.fastq.gz, it does not matter:

zcat sci-atac/data/SRR5837698_1.fastq.gz | \
    awk 'NR%4==1' | \
    awk -F ':' '{print $0 "\n" $3 "\n+\n" "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' | \
    gzip > sci-atac/data/SRR5837698_CB.fastq.gz

Now we have the three fastq files SRR5837698_1.fastq.gz, SRR5837698_2.fastq.gz and SRR5837698_CB.fastq.gz. That’s all we need for chromap.

Prepare Whitelist#

This is the most confusing part of the pipeline, but if you understand the procedures in the sci-ATAC-seq GitHub page, you should be able to understand.

Important

For the exact primer sequences, check Supplementary Table 12 from the Cusanovich 2018 Nature paper.

This is the D15 primer sequence: 5’- GTCTCGTGGGCTCGGCTGTCCCTGTCCNNNNNNNNCACCGTCTCCGCCTCAGATGTGTATAAGAGACAG -3’. The 8 Ns are the barcode, and the barcode sequences (from 5’ -> 3’) are:

Primer

Barcode D15

Reverse complement

D15_1

CGAGTAAT

ATTACTCG

D15_2

TCTCCGGA

TCCGGAGA

D15_3

AATGAGCG

CGCTCATT

D15_4

GGAATCTC

GAGATTCC

D15_5

TTCTGAAT

ATTCAGAA

D15_6

ACGAATTC

GAATTCGT

D15_7

AGCTTCAG

CTGAAGCT

D15_8

GCGCATTA

TAATGCGC

D15_9

CATAGCCG

CGGCTATG

D15_10

TTCGCGGA

TCCGCGAA

D15_11

GCGCGAGA

TCTCGCGC

D15_12

CTATCGCT

AGCGATAG

This is the C15 primer sequence: 5’- TCGTCGGCAGCGTCTCCACGCNNNNNNNNGCGATCGAGGACGGCAGATGTGTATAAGAGACAG -3’. The 8 Ns are the barcode, and the barcode sequences (from 5’ -> 3’) are:

Primer

Barcode C15

Reverse complement

C15_1

TATAGCCT

AGGCTATA

C15_2

ATAGAGGC

GCCTCTAT

C15_3

CCTATCCT

AGGATAGG

C15_4

GGCTCTGA

TCAGAGCC

C15_5

AGGCGAAG

CTTCGCCT

C15_6

TAATCTTA

TAAGATTA

C15_7

CAGGACGT

ACGTCCTG

C15_8

GTACTGAC

GTCAGTAC

This is the P7 primer sequence: 5’- CAAGCAGAAGACGGCATACGAGATNNNNNNNNNNGTCTCGTGGGCTCGG -3’. The 10 Ns are the i7 barcode, and the barcode sequences (from 5’ -> 3’) are:

Primer

Barcode i7

Reverse complement

P7_1

CCGAATCCGA

TCGGATTCGG

P7_2

CCGCAGCCGC

GCGGCTGCGG

P7_3

AACGTAATCT

AGATTACGTT

P7_4

ACCTAGTTAG

CTAACTAGGT

P7_5

GGTCGCTATG

CATAGCGACC

P7_6

CTCTTAGCGG

CCGCTAAGAG

P7_7

TTCGTTCCAT

ATGGAACGAA

P7_8

AACGGAACGC

GCGTTCCGTT

P7_9

TTCGATAACC

GGTTATCGAA

P7_10

CATACGATGC

GCATCGTATG

P7_11

TTATCGTATT

AATACGATAA

P7_12

GTCGACGGAA

TTCCGTCGAC

P7_13

ATAAGCCGGA

TCCGGCTTAT

P7_14

TGCGCCTGGT

ACCAGGCGCA

P7_15

ATTCTCCTCT

AGAGGAGAAT

P7_16

ATAGGAGTAC

GTACTCCTAT

P7_17

ATCCGTTAGC

GCTAACGGAT

P7_18

TGATTCAACT

AGTTGAATCA

P7_19

TACCTAATCA

TGATTAGGTA

P7_20

GATGCTACGA

TCGTAGCATC

P7_21

AACCTCAAGA

TCTTGAGGTT

P7_22

AAGCTGACCT

AGGTCAGCTT

P7_23

AAGTCTAATA

TATTAGACTT

P7_24

ACTAATTGAG

CTCAATTAGT

P7_25

CCGGCGGCGA

TCGCCGCCGG

P7_26

AATCATACGG

CCGTATGATT

P7_27

TCTGCGCGTT

AACGCGCAGA

P7_28

CTACGACGAG

CTCGTCGTAG

P7_29

TCGCAATTAG

CTAATTGCGA

P7_30

TATGGCCGCG

CGCGGCCATA

P7_31

AAGTAATATT

AATATTACTT

P7_32

ATCTGCCAAT

ATTGGCAGAT

P7_33

CAGGCGCCAT

ATGGCGCCTG

P7_34

GAGTCCTTAT

ATAAGGACTC

P7_35

CGGCTTACTA

TAGTAAGCCG

P7_36

CTTGCATAAT

ATTATGCAAG

P7_37

GGCTTGCCAA

TTGGCAAGCC

P7_38

CGCCAATCAA

TTGATTGGCG

P7_39

GCTCATATGC

GCATATGAGC

P7_40

AGTCGAGTTC

GAACTCGACT

P7_41

GGCTGGCTAG

CTAGCCAGCC

P7_42

AGAGGTCGCA

TGCGACCTCT

P7_43

AGCTAAGAAT

ATTCTTAGCT

P7_44

ATCGTATCAA

TTGATACGAT

P7_45

AACTATTATA

TATAATAGTT

P7_46

CCTACGGCAA

TTGCCGTAGG

P7_47

GATATGGTCT

AGACCATATC

P7_48

TCCTTACCAA

TTGGTAAGGA

P7_49

CCGCTAGCTG

CAGCTAGCGG

P7_50

CAAGGCTTAG

CTAAGCCTTG

P7_51

AGCGGTAACG

CGTTACCGCT

P7_52

TGGTCCAGTC

GACTGGACCA

P7_53

ACGGTCTTGC

GCAAGACCGT

P7_54

AGGAGATTGA

TCAATCTCCT

P7_55

GTCGAGGTAT

ATACCTCGAC

P7_56

AACGCCTCTA

TAGAGGCGTT

P7_57

AAGTTACCTA

TAGGTAACTT

P7_58

AATATTCGAA

TTCGAATATT

P7_59

TAGTCGTCCA

TGGACGACTA

P7_60

TGCAGCCTAC

GTAGGCTGCA

P7_61

CTTATCCTAC

GTAGGATAAG

P7_62

GCGCTCGACG

CGTCGAGCGC

P7_63

AATGAATAGT

ACTATTCATT

P7_64

ATCTAAGCAA

TTGCTTAGAT

P7_65

GCTCCATTCG

CGAATGGAGC

P7_66

GGCTATATAG

CTATATAGCC

P7_67

TTATTAGTAG

CTACTAATAA

P7_68

ACGGCAACCA

TGGTTGCCGT

P7_69

CGGCAGAGGA

TCCTCTGCCG

P7_70

TTCAAGAATC

GATTCTTGAA

P7_71

TAGCTGCTAC

GTAGCAGCTA

P7_72

GGAGCTGAGG

CCTCAGCTCC

P7_73

TGAGCTACTT

AAGTAGCTCA

P7_74

TCCAGCAATA

TATTGCTGGA

P7_75

CCGTATCTGG

CCAGATACGG

P7_76

CGAATTCGTT

AACGAATTCG

P7_77

ACGATAAGCG

CGCTTATCGT

P7_78

TCGCGTACTT

AAGTACGCGA

P7_79

TGCGAAGATC

GATCTTCGCA

P7_80

CAGGCTAAGA

TCTTAGCCTG

P7_81

GCCTCAATAA

TTATTGAGGC

P7_82

ATGCTCGCAA

TTGCGAGCAT

P7_83

CTCTTCAAGC

GCTTGAAGAG

P7_84

GCAGCGGACT

AGTCCGCTGC

P7_85

TCAGGACTTA

TAAGTCCTGA

P7_86

CATGAGAACT

AGTTCTCATG

P7_87

CCTTAGTCTG

CAGACTAAGG

P7_88

CAGCGATAGA

TCTATCGCTG

P7_89

ACCATAGCGC

GCGCTATGGT

P7_90

AATAATAATG

CATTATTATT

P7_91

AACTACGGCT

AGCCGTAGTT

P7_92

CGCAATATCA

TGATATTGCG

P7_93

TTAACGCCGT

ACGGCGTTAA

P7_94

GGAGTAAGCC

GGCTTACTCC

P7_95

ATGAACGCGC

GCGCGTTCAT

P7_96

CATCGCGCTC

GAGCGCGATG

This is the P5 primer sequence: 5’- AATGATACGGCGACCACCGAGATCTACACNNNNNNNNNNTCGTCGGCAGCGTC -3’. The 10 Ns are the i5 barcode, and the barcode sequences (from 5’ -> 3’) are:

Primer

Barcode i5

Reverse complement

P5_1

CTCCATCGAG

CTCGATGGAG

P5_2

TTGGTAGTCG

CGACTACCAA

P5_3

GGCCGTCAAC

GTTGACGGCC

P5_4

CCTAGACGAG

CTCGTCTAGG

P5_5

TCGTTAGAGC

GCTCTAACGA

P5_6

CGTTCTATCA

TGATAGAACG

P5_7

CGGAATCTAA

TTAGATTCCG

P5_8

ATGACTGATC

GATCAGTCAT

P5_9

TCAATATCGA

TCGATATTGA

P5_10

GTAGACCTGG

CCAGGTCTAC

P5_11

TTATGACCAA

TTGGTCATAA

P5_12

TTGGTCCGTT

AACGGACCAA

P5_13

GGTACGTTAA

TTAACGTACC

P5_14

CAATGAGTCC

GGACTCATTG

P5_15

GATGCAGTTC

GAACTGCATC

P5_16

CCATCGTTCC

GGAACGATGG

P5_17

TTGAGAGAGT

ACTCTCTCAA

P5_18

ACTGAGCGAC

GTCGCTCAGT

P5_19

TGAGGAATCA

TGATTCCTCA

P5_20

CCTCCGACGG

CCGTCGGAGG

P5_21

CATTGACGCT

AGCGTCAATG

P5_22

TCGTCCTTCG

CGAAGGACGA

P5_23

TGATACTCAA

TTGAGTATCA

P5_24

TTCTACCTCA

TGAGGTAGAA

P5_25

TCGTCGGAAC

GTTCCGACGA

P5_26

ATCGAGATGA

TCATCTCGAT

P5_27

TAGACTAGTC

GACTAGTCTA

P5_28

GTCGAAGCAG

CTGCTTCGAC

P5_29

AGGCGCTAGG

CCTAGCGCCT

P5_30

AGATGCAACT

AGTTGCATCT

P5_31

AAGCCTACGA

TCGTAGGCTT

P5_32

GTAGGCAATT

AATTGCCTAC

P5_33

GGAGGCGGCG

CGCCGCCTCC

P5_34

CCAGTACTTG

CAAGTACTGG

P5_35

GGTCTCGCCG

CGGCGAGACC

P5_36

GGCGGAGGTC

GACCTCCGCC

P5_37

TAGTTCTAGA

TCTAGAACTA

P5_38

TTGGAGTTAG

CTAACTCCAA

P5_39

AGATCTTGGT

ACCAAGATCT

P5_40

GTAATGATCG

CGATCATTAC

P5_41

CAGAGAGGTC

GACCTCTCTG

P5_42

TTAATTAGCC

GGCTAATTAA

P5_43

CTCTAACTCG

CGAGTTAGAG

P5_44

TACGATCATC

GATGATCGTA

P5_45

AGGCGAGAGC

GCTCTCGCCT

P5_46

TCAAGATAGT

ACTATCTTGA

P5_47

TAATTGACCT

AGGTCAATTA

P5_48

CAGCCGGCTT

AAGCCGGCTG

P5_49

AGAACCGGAG

CTCCGGTTCT

P5_50

GAGATGCATG

CATGCATCTC

P5_51

GATTACCGGA

TCCGGTAATC

P5_52

TCGTAACGGT

ACCGTTACGA

P5_53

TGGCGACGGA

TCCGTCGCCA

P5_54

AGTCATAGCC

GGCTATGACT

P5_55

GTCAAGTCCA

TGGACTTGAC

P5_56

ATTCGGAAGT

ACTTCCGAAT

P5_57

GTCGGTAGTT

AACTACCGAC

P5_58

AGGACGGACG

CGTCCGTCCT

P5_59

CTCCTGGACC

GGTCCAGGAG

P5_60

TAGCCTCGTT

AACGAGGCTA

P5_61

GGTTGAACGT

ACGTTCAACC

P5_62

AGGTCCTCGT

ACGAGGACCT

P5_63

GGAAGTTATA

TATAACTTCC

P5_64

TGGTAATCCT

AGGATTACCA

P5_65

AAGCTAGGTT

AACCTAGCTT

P5_66

TCCGCGGACT

AGTCCGCGGA

P5_67

TGCGGATAGT

ACTATCCGCA

P5_68

TGGCAGCTCG

CGAGCTGCCA

P5_69

TGCTACGGTC

GACCGTAGCA

P5_70

GCGCAATGAC

GTCATTGCGC

P5_71

CTTAATCTTG

CAAGATTAAG

P5_72

GGAGTTGCGT

ACGCAACTCC

P5_73

ACTCGTATCA

TGATACGAGT

P5_74

GGTAATAATG

CATTATTACC

P5_75

TCCTTATAGA

TCTATAAGGA

P5_76

CCGACTCCAA

TTGGAGTCGG

P5_77

GCCAAGCTTG

CAAGCTTGGC

P5_78

CATATCCTAT

ATAGGATATG

P5_79

ACCTACGCCA

TGGCGTAGGT

P5_80

GGAATTCAGT

ACTGAATTCC

P5_81

TGGCGTAGAA

TTCTACGCCA

P5_82

ATTGCGGCCA

TGGCCGCAAT

P5_83

TTCAGCTTGG

CCAAGCTGAA

P5_84

CCATCTGGCA

TGCCAGATGG

P5_85

CTTATAAGTT

AACTTATAAG

P5_86

GATTAGATGA

TCATCTAATC

P5_87

TATAGGATCT

AGATCCTATA

P5_88

AGCTTATAGG

CCTATAAGCT

P5_89

GTCTGCAATC

GATTGCAGAC

P5_90

CGCCTCTTAT

ATAAGAGGCG

P5_91

GTTGGATCTT

AAGATCCAAC

P5_92

GCGATTGCAG

CTGCAATCGC

P5_93

TGCCAGTTGC

GCAACTGGCA

P5_94

CTTAGGTATC

GATACCTAAG

P5_95

GAGACCTACC

GGTAGGTCTC

P5_96

ATTGACCGAG

CTCGGTCAAT

The 36 bp cell barcode consists of the four parts listed above. To construct full whitelist, we need to get all possible combinations of D15, i7, C15 and i5. In total, we should have a whitelist of 8 * 12 * 96 * 96 = 884736 barcodes. However, the order of those four parts depends on Illumina machines. I have put those four tables into csv files and you can download them to have a look:

sci-ATAC-C15.csv
sci-ATAC-D15.csv
sci-ATAC-P5.csv
sci-ATAC-P7.csv

Configuration 1#

Position

Description

1 - 8

8 bp Tn5 barcode at the i7 side. This is the barcode in the D15 primer in Table S12 from the paper

9 - 18

10 bp i7 index. This is the barcode in the P7 primer in Table S12 from the paper

19 - 26

8 bp Tn5 barcode at the i5 side. This is the barcode in the C15 primer in Table S12 from the paper

27 - 36

10 bp i5 index. This is the barcode in the P5 primer in Table S12 from the paper

In this configuration, D15 and i7 are sequenced using the bottom strand as the template, and C15 and i5 are sequenced using the top strand as the template. Therefore, we should take the reverse complementary (rc) in the following order:

D15 rc + i7 rc + C15 rc + i5 rc

Now let’s generate the whitelist using the above order:

# download the tables
wget -P sci-atac/data/ \
    https://teichlab.github.io/scg_lib_structs/data/sci-ATAC-seq_family/sci-ATAC-C15.csv \
    https://teichlab.github.io/scg_lib_structs/data/sci-ATAC-seq_family/sci-ATAC-D15.csv \
    https://teichlab.github.io/scg_lib_structs/data/sci-ATAC-seq_family/sci-ATAC-P5.csv \
    https://teichlab.github.io/scg_lib_structs/data/sci-ATAC-seq_family/sci-ATAC-P7.csv

# generate combinations of them
for w in $(tail -n +2 sci-atac/data/sci-ATAC-D15.csv | cut -f 3 -d,); do
    for x in $(tail -n +2 sci-atac/data/sci-ATAC-P7.csv | cut -f 3 -d,); do
        for y in $(tail -n +2 sci-atac/data/sci-ATAC-C15.csv | cut -f 3 -d,); do
            for z in $(tail -n +2 sci-atac/data/sci-ATAC-P5.csv | cut -f 3 -d,); do
                echo "${w}${x}${y}${z}"
                done
            done
        done
    done > sci-atac/data/whitelist_config1.txt

Configuration 2#

Position

Description

1 - 8

8 bp Tn5 barcode at the i7 side. This is the barcode in the D15 primer in Table S12 from the paper

9 - 18

10 bp i7 index. This is the barcode in the P7 primer in Table S12 from the paper

19 - 28

10 bp i5 index. This is the barcode in the P5 primer in Table S12 from the paper

29 - 36

8 bp Tn5 barcode at the i5 side. This is the barcode in the C15 primer in Table S12 from the paper

In this configuration, D15 and i7 are sequenced using the bottom strand as the template, and we should take the reverse complementary (rc) of them. The C15 and i5 are sequenced using the bottom strand as the template as well, and we should take the sequence as they are. Therefore, we should take the index sequence in the following order:

D15 rc + i7 rc + i5 + C15

Now let’s generate the whitelist using the above order:

for w in $(tail -n +2 sci-atac/data/sci-ATAC-D15.csv | cut -f 3 -d,); do
    for x in $(tail -n +2 sci-atac/data/sci-ATAC-P7.csv | cut -f 3 -d,); do
        for y in $(tail -n +2 sci-atac/data/sci-ATAC-P5.csv | cut -f 2 -d,); do
            for z in $(tail -n +2 sci-atac/data/sci-ATAC-C15.csv | cut -f 2 -d,); do
                echo "${w}${x}${y}${z}"
                done
            done
        done
    done > sci-atac/data/whitelist_config2.txt

Normally, we are all set and ready to go now. However … see below …

Some Extra Work#

At this time of writing (31-July-2022), chromap only supports cell barcodes with <32 bp in length based on this thread. Therefore, we need a truncated version of the cell barcode and whitelist. We have a total of 884,736 valid cell barcodes with 36 bp. It turns out that the positions 3 - 33 (31 bp) are already unique. We could just use those 31 bp as our cell barcodes and whitelist. To this end, we do:

# truncate whitelist
cut -c 3-33 sci-atac/data/whitelist_config1.txt > sci-atac/data/whitelist_config1_base3-33.txt

# truncate cell barcode reads
zcat sci-atac/data/SRR5837698_CB.fastq.gz | \
    awk '{ if (NR%4==1||NR%4==3) {print $1} else {print substr($1, 3, 31)} }' | \
    gzip > sci-atac/data/SRR5837698_CB_base3-33.fastq.gz

Now we are ready to go.

Index Genome#

The data we are using here is from Drosophila embryos. We need to download the genome sequence and index it with chromap. In this case, we are using the dm6 build from UCSC:

mkdir -p dm6/chromap_index
wget -P dm6 -c https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
gunzip dm6/dm6.fa.gz
chromap -i -t 4 -r dm6/dm6.fa -o dm6/chromap_index/genome.index

From FastQ To Count Matrix#

Now we are ready to map the reads to the genome using chromap:

mkdir -p sci-atac/chromap_outs

# map and generate the fragment file

chromap -t 4 --preset atac \
        -x dm6/chromap_index/genome.index \
        -r dm6/dm6.fa \
        -1 sci-atac/data/SRR5837698_1.fastq.gz \
        -2 sci-atac/data/SRR5837698_2.fastq.gz \
        -b sci-atac/data/SRR5837698_CB_base3-33.fastq.gz \
        --barcode-whitelist sci-atac/data/whitelist_config1_base3-33.txt \
        -o sci-atac/chromap_outs/fragments.tsv

# compress and index the fragment file

bgzip sci-atac/chromap_outs/fragments.tsv
tabix -s 1 -b 2 -e 3 -p bed sci-atac/chromap_outs/fragments.tsv.gz

Two new files fragments.tsv.gz and fragments.tsv.gz.tbi are generated. They will be useful and sometimes required for other programs to perform downstream analysis.

Explain chromap#

If you understand the sci-ATAC-seq experimental procedures described in this GitHub Page and the content in the previous sections, the command above should be straightforward to understand.

-t 4

Use 4 cores for the preprocessing. Change accordingly if using more or less cores.

-x dm6/chromap_index/genome.index

The chromap index file. We are dealing with fly samples in this case.

-r dm6/dm6.fa

Reference genome sequence in fasta format. This is basically the file which you used to create the chromap index file.

-1, -2 and -b

They are Read 1 (genomic), Read 2 (genomic) and cell barcode read, respectively. For ATAC-seq, the sequencing is usually done in pair-end mode. Therefore, you normally have two genomic reads for each genomic fragment: Read 1 and Read 2. For the reason described previously, we are uisng the truncated version (31 bp) of the cell barcode reads here. Multiple input files are supported and they can be listed in a comma-separated manner. In that case, they must be in the same order.

--barcode-whitelist sci-atac/data/whitelist_config1_base3-33.txt

The plain text file containing all possible valid cell barcodes. For the reason described in the previous sections, we are using the truncated version (31 bp) of the whitelist because the sequencing machines used to generate the data was NextSeq based on the publication.

-o sci-atac/chromap_outs/fragments.tsv

Direct the mapped fragments to a file. The format is described in the 10x Genomics website.

Fragments To Reads#

The fragment file is the following format:

Column number

Meaning

1

fragment chromosome

2

fragment start

3

fragment end

4

cell barcode

5

Number of read pairs of this fragment

It is very useful, but we often need the peak-by-cell matrix for the downstream analysis. Therefore, we need to perform a peak calling process to identify open chromatin regions. We need to convert the fragment into reads. For each fragment, we will have two reads, a forward read and a reverse read. The read length is not important, but we could generate a 50-bp read pair for each fragment.

First, we need to create a genome size file, which is a tab delimited file with only two columns. The first column is the chromosome name and the second column is the length of the chromosome in bp:

# we also sort the output by chromosome name
# which will be useful later

faSize -detailed dm6/dm6.fa | \
    sort -k1,1 > dm6/dm6.chrom.sizes

This is the first 5 lines of dm6/dm6.chrom.sizes:

chr2L	23513712
chr2R	25286936
chr3L	28110227
chr3R	32079331
chr4	1348131

Now let’s generate the reads from fragments:

# we use bedClip to remove reads outside the chromosome boundary
# we also remove reads mapped to the mitochondrial genome (chrM)

zcat sci-atac/chromap_outs/fragments.tsv.gz | \
    awk 'BEGIN{OFS="\t"}{print $1, $2, $2+50, $4, ".", "+" "\n" $1, $3-50, $3, $4, ".", "-"}' | \
    sed '/chrM/d' | \
    bedClip stdin dm6/dm6.chrom.sizes stdout | \
    sort -k1,1 -k2,2n | \
    gzip > sci-atac/chromap_outs/reads.bed.gz

Note we also sort the output reads by sort -k1,1 -k2,2n. In this way, the order of chromosomes in the reads.bed.gz is the same as that in dm6.chrom.sizes, which makes downstream processes easier. The output reads.bed.gz are the reads in bed format, with the 4th column holding the cell barcodes.

Peak Calling By MACS2#

Now we can use the newly generated read file for the peak calling using MACS2:

macs2 callpeak -t sci-atac/chromap_outs/reads.bed.gz \
               -g dm -f BED -q 0.01 \
               --nomodel --shift -100 --extsize 200 \
               --keep-dup all \
               -B --SPMR \
               --outdir sci-atac/chromap_outs \
               -n aggregate

Explain MACS2#

The reasons of choosing those specific parameters are a bit more complicated. I have dedicated a post for this a while ago. Please have a look at this post if you are still confused. The following output files are particularly useful:

File

Description

aggregate_peaks.narrowPeak

Open chromatin peak locations in the narrowPeak format

aggregate_peaks.xls

More information about peaks

aggregate_treat_pileup.bdg

Signal tracks. Can be used to generate the bigWig file for visualisation

Getting The Peak-By-Cell Count Matrix#

Now that we have the peak and reads files, we can compute the number of reads in each peak for each cell. Then we could get the peak-by-cell count matrix. There are different ways of doing this. The following is the method I use.

Find Reads In Peaks Per Cell#

First, we use the aggregate_peaks.narrowPeak file. We only need the first 4 columns (chromosome, start, end, peak ID). You can also remove the peaks that overlap the black list regions. The black list is not available for every species and every build, so I’m not doing it here. We also need to sort the peak to make sure the order of the chromosomes in the peak file is the same as that in the dm6.chrom.sizes and reads.bed.gz files. Then we could find the overlap by bedtools. We need to do this in a specific way to get the number of reads in each peak from each cell:

# format and sort peaks

cut -f 1-4 sci-atac/chromap_outs/aggregate_peaks.narrowPeak | \
    sort -k1,1 -k2,2n > sci-atac/chromap_outs/aggregate_peaks_sorted.bed

# prepare the overlap

bedtools intersect \
    -a sci-atac/chromap_outs/aggregate_peaks_sorted.bed \
    -b sci-atac/chromap_outs/reads.bed.gz \
    -wo -sorted -g dm6/dm6.chrom.sizes | \
    sort -k8,8 | \
    bedtools groupby -g 8 -c 4 -o freqdesc | \
    gzip > sci-atac/chromap_outs/peak_read_ov.tsv.gz

Explain Finding Reads In Peaks Per Cell#

We start with the command before the first pipe, that is, the intersection part. If you read the manual of the bedtools intersect, it should be straightforward to understand. The -wo option will output the records in both -a file and -b file. Since the reads.bed.gz file has the cell barcode information at the 4th column, we would get an output with both peak and cell information for the overlap. The -sorted -g dm6/dm6.chrom.sizes options make the program use very little memory. Here is an example (top 5 lines) of the output of this part:

chr2L	5477	6081	aggregate_peak_1	chr2L	5430	5480	ATGCGCATTATGCAAGGCCTCTATAGCGTCA	.	-	3
chr2L	5477	6081	aggregate_peak_1	chr2L	5430	5480	ATGCGCTTATTGAGGCTAAGATTACCGTCGG	.	+	3
chr2L	5477	6081	aggregate_peak_1	chr2L	5430	5480	CGCGAACATAGCGACCGCCTCTATACGCAAC	.	-	3
chr2L	5477	6081	aggregate_peak_1	chr2L	5430	5480	CGCGAATTGGCAAGCCTCAGAGCCAGATCCT	.	+	3
chr2L	5477	6081	aggregate_peak_1	chr2L	5430	5480	CGGAGAAGCCGTAGTTAGGCTATAAACTACC	.	+	3

We see that the 8th column holds the cell barcode and we want to group them using bedtools groupby. Therefore, we need to sort by this column, that is the sort -k8,8. When we group by the 8th column, we are interested in how many times each peak appear per group, so we could gather the information of the peak ID (4th column). That is the -g 8 -c 4 -o freqdesc. The -o freqdesc option returns a value:frequency pair in descending order. Here are some records from peak_read_ov.tsv.gz:

ATGCGCAACGAATTCGACGTCCTGAACCTAG	aggregate_peak_11321:2,aggregate_peak_12286:2,aggregate_peak_12350:2,aggregate_peak_15932:2,aggregate_peak_15948:2,aggregate_peak_17209:2,aggregate_peak_1747:2,aggregate_peak_17908:2,aggregate_peak_18231:2,aggregate_peak_287:2,aggregate_peak_3680:2,aggregate_peak_381:2,aggregate_peak_4107:2,aggregate_peak_4475:2,aggregate_peak_5726:2,aggregate_peak_9261:2,aggregate_peak_12324:1,aggregate_peak_7571:1
ATGCGCAACGAATTCGACGTCCTGAACGGAC	aggregate_peak_15000:2,aggregate_peak_3846:1
ATGCGCAACGAATTCGACGTCCTGAACTACC	aggregate_peak_7461:2,aggregate_peak_2739:1

In a way, that is sort of a count matrix in an awkward format. For example:

  • The first line is too long.

  • The second line means that in cell ATGCGCAACGAATTCGACGTCCTGAACGGAC, the peak aggregate_peak_15000 has 2 counts, and the peak aggregate_peak_3846 has 1 count. All the rest peaks not mentioned here have 0 counts in this cell.

  • The third line means that in cell ATGCGCAACGAATTCGACGTCCTGAACTACC, the peak aggregate_peak_7461 has 2 counts, and the peak aggregate_peak_2739 has 1 count. All the rest peaks not mentioned here have 0 counts in this cell.

Output The Peak-By-Cell Matrix#

At this stage, we pretty much have all the things needed. Those two files aggregate_peaks_sorted.bed and peak_read_ov.tsv.gz contain all information for a peak-by-cell count matrix. We just need a final touch to make the output in a standard format: a market exchange format (MEX). Since most downstream software takes the input from the 10x Genomics Single Cell ATAC results, we are going to generate the MEX and the associated files similar to the output from 10x Genomics.

Here, I’m using a python script for this purpose. You don’t have to do this. Choose whatever works for you. The point here is to just generate similar files as the peak-barcode matrix described from the 10x Genomics website.

First, let’s make a directory to hold the output files and generate the peaks.bed and barcodes.tsv files, which is easy to do:

# create dirctory
mkdir -p sci-atac/chromap_outs/raw_peak_bc_matrix

# The 10x Genomics peaks.bed is a 3-column bed file, so we do
cut -f 1-3 sci-atac/chromap_outs/aggregate_peaks_sorted.bed > \
    sci-atac/chromap_outs/raw_peak_bc_matrix/peaks.bed

# The barcode is basically the first column of the file peak_read_ov.tsv.gz
zcat sci-atac/chromap_outs/peak_read_ov.tsv.gz | \
    cut -f 1 > \
    sci-atac/chromap_outs/raw_peak_bc_matrix/barcodes.tsv

The slightly more difficult file to generate is matrix.mtx. This is the python script generate_csc_mtx.py for this purpose:

# import helper packages
# most entries in the count matrix is 0, so we are going to use a sparse matrix
# since we need to keep updating the sparse matrix, we use lil_matrix from scipy
import sys
import gzip
from scipy.io import mmwrite
from scipy.sparse import lil_matrix

# the unique peak ID is a good renference
# generate a dictionary with peak_id : index_in_the_file
# sys.argv[1] is the 4-column bed file aggregate_peaks_sorted.bed
peak_idx = {}
with open(sys.argv[1]) as fh:
    for i, line in enumerate(fh):
        _, _, _, peak_name = line.strip().split('\t')
        peak_idx[peak_name] = i

# determine and create the dimension of the output matrix
# that is, to calculate the number of peaks and the number of barcodes
# sys.argv[2] is barcodes.tsv
num_peaks = len(peak_idx.keys())
num_cells = len(open(sys.argv[2]).readlines())
mtx = lil_matrix((num_peaks, num_cells), dtype = int)

# read the information from peak_read_ov.tsv.gz
# update the counts into the mtx object
# sys.argv[3] is peak_read_ov.tsv.gz
with gzip.open(sys.argv[3], 'rt') as fh:
    for i, line in enumerate(fh):
        col_idx = i # each column is a cell barcode
        count_info = line.strip().split('\t')[1]
        items = count_info.split(',')
        for pn_count in items:
            pn, count = pn_count.split(':')
            row_idx = peak_idx[pn] # each row is a peak
            mtx[row_idx, col_idx] = int(count)

# get a CSC sparse matrix, which is the same as the 10x Genomics matrix.mtx
mtx = mtx.tocsc()

# sys.argv[4] is the path to the output directory
mmwrite(sys.argv[4] + '/matrix.mtx', mtx, field='integer')

Run that script in the terminal:

python generate_csc_mtx.py \
    sci-atac/chromap_outs/aggregate_peaks_sorted.bed \
    sci-atac/chromap_outs/raw_peak_bc_matrix/barcodes.tsv \
    sci-atac/chromap_outs/peak_read_ov.tsv.gz \
    sci-atac/chromap_outs/raw_peak_bc_matrix

After that, you should have the matrix.mtx in the sci-atac/chromap_outs/raw_peak_bc_matrix directory.

Cell Calling (Filter Cell Barcodes)#

Experiments are never perfect. Even for barcodes that do not capture the molecules inside the cells, you may still get some reads due to various reasons, such as ambient RNA or DNA and leakage. In general, the number of reads from those barcodes should be much smaller, often orders of magnitude smaller, than those barcodes from real cells. In order to identify true cells from the background, we could use starolo. It is used for scRNA-seq in general, but it does have a cell calling function that takes a directory containing raw mtx and associated files, and return the filtered ones. Since starsolo looks for the following three files in the input directory: matrix.mtx, features.tsv and barcodes.tsv. Those are the output from the 10x Genomics scRNA-seq workflow. In this case, we can use peaks.bed as our features.tsv:

# trick starsolo to use peaks.bed as features.tsv by creating symlink

ln -s peaks.bed sci-atac/chromap_outs/raw_peak_bc_matrix/features.tsv

# filter cells using starsolo

STAR --runMode soloCellFiltering \
     sci-atac/chromap_outs/raw_peak_bc_matrix \
     sci-atac/chromap_outs/filtered_peak_bc_matrix/ \
     --soloCellFilter EmptyDrops_CR

# rename the new feature.tsv to peaks.bed or just create symlink
ln -s features.tsv sci-atac/chromap_outs/filtered_peak_bc_matrix/peaks.bed

If everything goes well, your directory should look the same as the following:

scg_prep_test/sci-atac/
├── chromap_outs
│   ├── aggregate_control_lambda.bdg
│   ├── aggregate_peaks.narrowPeak
│   ├── aggregate_peaks_sorted.bed
│   ├── aggregate_peaks.xls
│   ├── aggregate_summits.bed
│   ├── aggregate_treat_pileup.bdg
│   ├── filtered_peak_bc_matrix
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   ├── matrix.mtx
│   │   └── peaks.bed -> features.tsv
│   ├── fragments.tsv.gz
│   ├── fragments.tsv.gz.tbi
│   ├── peak_read_ov.tsv.gz
│   ├── raw_peak_bc_matrix
│   │   ├── barcodes.tsv
│   │   ├── features.tsv -> peaks.bed
│   │   ├── matrix.mtx
│   │   └── peaks.bed
│   └── reads.bed.gz
└── data
    ├── sci-ATAC-C15.csv
    ├── sci-ATAC-D15.csv
    ├── sci-ATAC-P5.csv
    ├── sci-ATAC-P7.csv
    ├── SRR5837698_1.fastq.gz
    ├── SRR5837698_2.fastq.gz
    ├── SRR5837698_CB_base3-33.fastq.gz
    ├── SRR5837698_CB.fastq.gz
    ├── whitelist_config1_base3-33.txt
    ├── whitelist_config1.txt
    └── whitelist_config2.txt

4 directories, 29 files