SNARE-seq#

Check this GitHub page to see how SNARE-seq libraries are generated experimentally. This is a droplet-based method, transposed nuclei are captured inside droplets. At the same time, gel beads with barcoded primer containing a oligo-dT sequence are also captured inside the droplet. Inside the droplet, transposed DNA molecules and mRNA molecules are both captured by the oligo-dT primers on the beads. The capture of the transposed DNA is achieved through the addition of a splint oligo. See the SNARE-seq GitHub page for more details. The cells and gel beads are loaded on the microfluidic device at certain concentrations, such that a fraction of droplets contain only one cell AND one bead. Then, all droplets from one sample is collected. After that, a few cycles of pre-amplification are performed. Then the reaction is split into two portions: one for gene expression library preparation and the other for ATAC library preparation.

For Your Own Experiments#

If you follow the protocol from the paper, you should have two libraries per sample: one for RNA and the other for ATAC. Normally, you prepare them independently and sequence them separately. If you use this assay, you have to run the sequencing for both RNA and ATAC by yourself using a custom sequencing recipe or ask your core facility to do this for you. See below the sequencing read configurations.

RNA Read Configuration#

Order

Read

Cycle

Description

1

Read 1

>20

This yields R1_001.fastq.gz, 12 bp cell barcodes + 8 bp UMI, the rest are poly-T and hence ignored

2

Index 1 (i7)

10

This yields I1_001.fastq.gz, Sample index

3

Index 2 (i5)

10

This yields I2_001.fastq.gz, Sample index (if using dual index)

4

Read 2

>50

This yields R2_001.fastq.gz, cDNA reads

After sequencing, you need to run bcl2fastq by yourself with a SampleSheet.csv. Here is an example of SampleSheet.csv of a NextSeq run with two different samples using the Illumina Nextera N701 and N702 primers as sample index. The i5 index is not used:

[Header],,,,,,,,,,,
IEMFileVersion,5,,,,,,,,,,
Date,17/12/2019,,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,,,,
Instrument Type,NextSeq/MiniSeq,,,,,,,,,,
Assay,AmpliSeq Library PLUS for Illumina,,,,,,,,,,
Index Adapters,AmpliSeq CD Indexes (384),,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,,
,,,,,,,,,,,
[Reads],,,,,,,,,,,
20,,,,,,,,,,,
50,,,,,,,,,,,
,,,,,,,,,,,
[Settings],,,,,,,,,,,
,,,,,,,,,,,
[Data],,,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Sample01_RNA,,,,,,N701,TAAGGCGA,,,,
Sample02_RNA,,,,,,N702,CGTACTAG,,,,

You will get two fastq files per sample:

Sample01_RNA_S1_R1_001.fastq.gz # 20 bp: cell barcodes (12 bp) + UMI (8 bp)
Sample01_RNA_S1_R2_001.fastq.gz # >50 bp: cDNA reads
Sample02_RNA_S2_R1_001.fastq.gz # 20 bp: cell barcodes (12 bp) + UMI (8 bp)
Sample02_RNA_S2_R2_001.fastq.gz # >50 bp: cDNA reads

You are ready to go from here.

ATAC Read Configuration#

The ATAC library is slightly more complicated than the RNA library. Make sure you understand how sequencing is done for this assay by checking the SNARE-seq GitHub page. Then it is relatively straightforward to see there are two configurations.

Order

Read

Cycle

Description

1

Read 1

>20

This normally yields R1_001.fastq.gz, 12 bp cell barcodes + 8 bp UMI, the rest are poly-T and hence ignored

2

Index 1 (i7)

>50

This normally yields I1_001.fastq.gz, Genomic insert

3

Index 2 (i5)

8

This normally yields I2_001.fastq.gz, Sample index

4

Read 2

>50

This normally yields R2_001.fastq.gz, Genomic insert

After sequencing, you need to run bcl2fastq with a SampleSheet.csv in specific way. Here is an example of SampleSheet.csv of a NextSeq run with two different samples using the P501-Tso and P502-Tso primers from the Supplementary Table 4 from the SNARE-seq paper as sample index:

[Header],,,,,,,,,,,
IEMFileVersion,5,,,,,,,,,,
Date,17/12/2019,,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,,
Application,NextSeq FASTQ Only,,,,,,,,,,
Instrument Type,NextSeq/MiniSeq,,,,,,,,,,
Assay,AmpliSeq Library PLUS for Illumina,,,,,,,,,,
Index Adapters,AmpliSeq CD Indexes (384),,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,,
,,,,,,,,,,,
[Reads],,,,,,,,,,,
20,,,,,,,,,,,
50,,,,,,,,,,,
,,,,,,,,,,,
[Settings],,,,,,,,,,,
,,,,,,,,,,,
[Data],,,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Sample01_ATAC,,,,,,,,P501-Tso,GCGATCTA,,
Sample02_ATAC,,,,,,,,P502-Tso,ATAGAGAG,,

If you look at the order of the sequencing read configuration above, as you can see, the first (R1), the 2nd (I1) and the 4th (R2) reads are all important for us. Therefore, we would like to get all of them for each sample based on sample index, that is, the 3rd read (I2). To do this, you should run bcl2fastq in the following way:

bcl2fastq --use-bases-mask=Y20,Y50,I8,Y50 \
          --create-fastq-for-index-reads \
          --no-lane-splitting \
          --ignore-missing-positions \
          --ignore-missing-controls \
          --ignore-missing-filter \
          --ignore-missing-bcls \
          -r 4 -w 4 -p 4

You can check the bcl2fastq manual for more information, but the important bit that needs explanation is --use-bases-mask=Y20,Y50,I8,Y50. We have four reads, and that parameter specify how we treat each read in the stated order:

  1. Y20 at the first position indicates “use the cycle as a real read”, so you will get 20-nt sequences, output as R1_001.fastq.gz, because this is the 1st real read.

  2. Y50 at the second position indicates “use the cycle as a real read”, so you will get 50-nt sequences, output as R2_001.fastq.gz, because this is the 2nd real read.

  3. I8 at the third position indicates “use the cycle as an index read”, so you will get 8-nt sequences, output as I1_001.fastq.gz, because this is the 1st index read, though it is the 3rd read overall.

  4. Y50 at the fourth position indicates “use the cycle as a real read”, so you will get 50-nt sequences, output as R3_001.fastq.gz, because this is the 3rd real read, though it is the 4th read overall.

Therefore, you will get four fastq files per sample. Using the examples above, these are the files you should get:

# files for Sample01

Sample01_ATAC_S1_I1_001.fastq.gz # 8 bp: sample index
Sample01_ATAC_S1_R1_001.fastq.gz # 20 bp: 12 bp cell barcodes + 8 bp UMI
Sample01_ATAC_S1_R2_001.fastq.gz # 50 bp: genomic insert
Sample01_ATAC_S1_R3_001.fastq.gz # 50 bp: genomic insert 

# files for Sample02

Sample02_ATAC_S2_I1_001.fastq.gz # 8 bp: sample index
Sample02_ATAC_S2_R1_001.fastq.gz # 20 bp: 12 bp cell barcodes + 8bp UMI
Sample02_ATAC_S2_R2_001.fastq.gz # 50 bp: genomic insert
Sample02_ATAC_S2_R3_001.fastq.gz # 50 bp: genomic insert

The naming here is really different from our normal usage. The I1 files here actually mean I2 in our normal usage, and we can safely ignore them. The R1 files are good. The R2 files here actually mean I1 in our normal usage. The R3 files here actually means R2 in our normal usage. Anyway, DO NOT get confused. You are ready to go from here.

Public Data#

We will use the data from the following paper:

Note

Chen S, Lake B, Zhang K (2019) High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat Biotechnol. 37: 1452-1457. https://doi.org/10.1038/s41587-019-0290-0

where the authors developed SNARE-seq for the first time. The data is deposited to GEO under the accession code GSE126074. There are quite a few samples there, but we will just use the Adult mouse brain cortex, rep 1 sample. The access number for the RNA modality is SRR9672089, and the accession number for the ATAC modality is SRR9672090. To get the reads, we use fastq-dump from the sratools:

mkdir -p snare-seq/data
fastq-dump --split-files \
           --origfmt \
           --outdir snare-seq/data \
           --defline-seq '@rd.$si:$sg:$sn' \
           SRR9672089 SRR9672090

The reason of using the specific options above is a bit complicated. I wrote a post about this, and you can have a look to see the reason. Once the program finishes running, you will have two plain files (SRR9672089_1.fastq and SRR9672089_2.fastq) for the RNA library, and three plain files (SRR9672090_1.fastq, SRR9672090_2.fastq and SRR9672090_3.fastq) for the ATAC library. Unfortunately, it seems fastq-dump cannot generate gzipped file on the fly. I’m going to gzip them manually to save space:

gzip snare-seq/data/*.fastq

These are the files to start with

scg_prep_test/snare-seq/data/
├── SRR9672089_1.fastq.gz
├── SRR9672089_2.fastq.gz
├── SRR9672090_1.fastq.gz
├── SRR9672090_2.fastq.gz
└── SRR9672090_3.fastq.gz

0 directories, 5 files

We are ready to go from here, but some explanation about the files is needed before we proceed. The SRA has different naming convention, which makes the ATAC file names confusing:

  • SRR9672089_1.fastq.gz is RNA library Read 1. Only 20 bp are needed, but the authors sequenced 30 bp. The first 12 bp are cell barcodes, then the next 8 bp are UMI. The rest of the bases are poly-T and hence can be ignored.

  • SRR9672089_2.fastq.gz is RNA library Read 2, which is the cDNA reads. The author provided 50-bp reads.

  • SRR9672090_1.fastq.gz is the ATAC library genomic insert Read 1, which is equivalent to R2_001.fastq.gz if you do it on your own, as described above. The authors provided 75-bp reads.

  • SRR9672090_2.fastq.gz is the ATAC library genomic insert Read 2, which is equivalent to R3_001.fastq.gz if you do it on your own, as described above. The authors provided 75-bp reads.

  • SRR9672090_3.fastq.gz is the ATAC library technical read, which is equivalent to R1_001.fastq.gz if you do it on your own, as described above. Only 12 bp are needed, but the authors provided 30-bp reads. The first 12 bp are cell barcodes. The next 8 bp are UMI, which is not used in ATAC-seq and hence ignored. The rest bases are poly-T and hence ignored.

Now, we are ready to begin the preprocessing.

Prepare Whitelist#

The cell barcodes on the beads are generated using the Drop-seq method. Actually, the beads are the same as the one from Drop-seq (ChemGenes, cat. no. Macosko2011-10 B). Therefore, there is no well-defined cell barcodes, so no whitelist in this case. The cell barcodes can be any 12-bp DNA sequence.

From FastQ To Count Matrices#

Let’s map the reads to the genome using starsolo for the RNA library and chromap for the ATAC library:

mkdir -p snare-seq/star_outs
mkdir -p snare-seq/chromap_outs

# process the RNA library using starsolo

STAR --runThreadN 4 \
     --genomeDir mm10/star_index \
     --readFilesCommand zcat \
     --outFileNamePrefix snare-seq/star_outs/ \
     --readFilesIn snare-seq/data/SRR9672089_2.fastq.gz snare-seq/data/SRR9672089_1.fastq.gz \
     --soloType CB_UMI_Simple \
     --soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 8 \
     --soloBarcodeReadLength 0 \
     --soloCBwhitelist None \
     --soloCellFilter EmptyDrops_CR \
     --soloStrand Forward \
     --outSAMattributes CB UB \
     --outSAMtype BAM SortedByCoordinate

# process the ATAC library using chromap

## map and generate the fragment file

chromap -t 4 --preset atac \
        -x mm10/chromap_index/genome.index \
        -r mm10/mm10.fa \
        -1 snare-seq/data/SRR9672090_1.fastq.gz \
        -2 snare-seq/data/SRR9672090_2.fastq.gz \
        -b snare-seq/data/SRR9672090_3.fastq.gz \
        --read-format bc:0:11 \
        -o snare-seq/chromap_outs/fragments.tsv

## compress and index the fragment file

bgzip snare-seq/chromap_outs/fragments.tsv
tabix -s 1 -b 2 -e 3 -p bed snare-seq/chromap_outs/fragments.tsv.gz

After this stage, we are done with the RNA library. The count matrix and other useful information can be found in the star_outs directory. For the ATAC library, two new files fragments.tsv.gz and fragments.tsv.gz.tbi are generated. They will be useful and sometimes required for other programs to perform downstream analysis. There are still some extra work.

Explain star and chromap#

If you understand the SNARE-seq experimental procedures described in this GitHub Page, the commands above should be straightforward to understand.

Explain star#

--runThreadN 4

Use 4 cores for the preprocessing. Change accordingly if using more or less cores.

--genomeDir mm10/star_index

Pointing to the directory of the star index. The public data we are analysing is from adult mouse cerebral cortex.

--readFilesCommand zcat

Since the fastq files are in .gz format, we need the zcat command to extract them on the fly.

--outFileNamePrefix snare-seq/star_outs/

We want to keep everything organised. This directs all output files inside the snare-seq/star_outs directory.

--readFilesIn

If you check the manual, we should put two files here. The first file is the reads that come from cDNA, and the second the file should contain cell barcode and UMI. In SNARE-seq, cDNA reads come from Read 2, and the cell barcode and UMI come from Read 1. Check the SNARE-seq GitHub Page if you are not sure. Multiple input files are supported and they can be listed in a comma-separated manner. In that case, they must be in the same order.

--soloType CB_UMI_Simple

Most of the time, you should use this option, and specify the configuration of cell barcodes and UMI in the command line (see immediately below). Sometimes, it is actually easier to prepare the cell barcode and UMI file upfront so that we could use this parameter.

--soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 8

The name of the parameter is pretty much self-explanatory. If using --soloType CB_UMI_Simple, we can specify where the cell barcode and UMI start and how long they are in the reads from the first file passed to --readFilesIn. Note the position is 1-based (the first base of the read is 1, NOT 0).

--soloBarcodeReadLength 0

If we specify --soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 8, starsolo will make sure that the length of Read 1 is 12 + 8 = 20 bp. If not, it will throw an error. In this case, our fastq file contain 30 bp. Therefore, we need to turn off the length check by setting this parameter to 0.

--soloCBwhitelist None

The plain text file containing all possible valid cell barcodes, one per line. SNARE-seq uses the bead oligos from Drop-seq. Due to the way the bead oligos are generated, there is no well-defined cell barcodes, so no whitelist is available.

--soloCellFilter EmptyDrops_CR

Experiments are never perfect. Even for droplets that do not contain any cell, you may still get some reads. In general, the number of reads from those droplets should be much smaller, often orders of magnitude smaller, than those droplets with cells. In order to identify true cells from the background, you can apply different algorithms. Check the star manual for more information. We use EmptyDrops_CR which is the most frequently used parameter.

--soloStrand Forward

The choice of this parameter depends on where the cDNA reads come from, i.e. the reads from the first file passed to --readFilesIn. You need to check the experimental protocol. If the cDNA reads are from the same strand as the mRNA (the coding strand), this parameter will be Forward (this is the default). If they are from the opposite strand as the mRNA, which is often called the first strand, this parameter will be Reverse. In the case of SNARE-seq, the cDNA reads are from the Read 2 file. During the experiment, the mRNA molecules are captured by barcoded oligo-dT primer containing UMI and the Read 1 sequence. Therefore, Read 1 consists of cell barcodes and UMI comes from the first strand, complementary to the coding strand. Read 2 comes from the coding strand. Therefore, use Forward for SNARE-seq data. This Forward parameter is the default, because many protocols generate data like this, but I still specified it here to make it clear. Check the SNARE-seq GitHub Page if you are not sure.

--outSAMattributes CB UB

We want the cell barcode and UMI sequences in the CB and UB attributes of the output, respectively. The information will be very helpful for downstream analysis.

--outSAMtype BAM SortedByCoordinate

We want sorted BAM for easy handling by other programs.

Explain chromap#

-t 4

Use 4 cores for the preprocessing. Change accordingly if using more or less cores.

-x mm10/chromap_index/genome.index

The chromap index file. The public data We are analysing is from adult mouse cerebral cortex.

-r mm10/mm10.fa

Reference genome sequence in fasta format. This is basically the file which you used to create the chromap index file.

-1, -2 and -b

They are Read 1 (genomic), Read 2 (genomic) and cell barcode read, respectively. For ATAC-seq, the sequencing is usually done in pair-end mode. Therefore, you normally have two genomic reads for each genomic fragment: Read 1 and Read 2. For the reason described previously, SRR9672090_1.fastq.gz is the genomic Read 1 and should be passed to -1; SRR9672090_2.fastq.gz is the genomic Read 2 and should be passed to -2; SRR9672090_3.fastq.gz is the cell barcode read and should be passed to -b. Multiple input files are supported and they can be listed in a comma-separated manner. In that case, they must be in the same order.

--read-format bc:0:11

Note that SRR9672090_3.fastq.gz contains the cell barcode. The reads are 30-bp long, but only the first 12 bp are cell barcodes. Therefore, we tell chromap to only use the first 12 bp (0:11) of the barcode file (bc) as the cell barcode. Be aware that the position is 0-based (the first base of the read is 0, NOT 1). Check the chromap manual if you are not sure.

-o snare-seq/chromap_outs/fragments.tsv

Direct the mapped fragments to a file. The format is described in the 10x Genomics website.

From ATAC Fragments To Reads#

The fragment file is the following format:

Column number

Meaning

1

fragment chromosome

2

fragment start

3

fragment end

4

cell barcode

5

Number of read pairs of this fragment

It is very useful, but we often need the peak-by-cell matrix for the downstream analysis. Therefore, we need to perform a peak calling process to identify open chromatin regions. We need to convert the fragment into reads. For each fragment, we will have two reads, a forward read and a reverse read. The read length is not important, but we could generate a 50-bp read pair for each fragment.

First, we need to create a genome size file, which is a tab delimited file with only two columns. The first column is the chromosome name and the second column is the length of the chromosome in bp:

# we also sort the output by chromosome name
# which will be useful later

faSize -detailed mm10/mm10.fa | \
    sort -k1,1 > mm10/mm10.chrom.sizes

This is the first 5 lines of mm10/mm10.chrom.sizes:

chr1	195471971
chr10	130694993
chr11	122082543
chr12	120129022
chr13	120421639

Now let’s generate the reads from fragments:

# we use bedClip to remove reads outside the chromosome boundary
# we also remove reads mapped to the mitochondrial genome (chrM)

zcat snare-seq/chromap_outs/fragments.tsv.gz | \
    awk 'BEGIN{OFS="\t"}{print $1, $2, $2+50, $4, ".", "+" "\n" $1, $3-50, $3, $4, ".", "-"}' | \
    sed '/chrM/d' | \
    bedClip stdin mm10/mm10.chrom.sizes stdout | \
    sort -k1,1 -k2,2n | \
    gzip > snare-seq/chromap_outs/reads.bed.gz

Note we also sort the output reads by sort -k1,1 -k2,2n. In this way, the order of chromosomes in the reads.bed.gz is the same as that in mm10.chrom.sizes, which makes downstream processes easier. The output reads.bed.gz are the reads in bed format, with the 4th column holding the cell barcodes.

Peak Calling By MACS2#

Now we can use the newly generated read file for the peak calling using MACS2:

macs2 callpeak -t snare-seq/chromap_outs/reads.bed.gz \
               -g mm -f BED -q 0.01 \
               --nomodel --shift -100 --extsize 200 \
               --keep-dup all \
               -B --SPMR \
               --outdir snare-seq/chromap_outs \
               -n aggregate

Explain MACS2#

The reasons of choosing those specific parameters are a bit more complicated. I have dedicated a post for this a while ago. Please have a look at this post if you are still confused. The following output files are particularly useful:

File

Description

aggregate_peaks.narrowPeak

Open chromatin peak locations in the narrowPeak format

aggregate_peaks.xls

More information about peaks

aggregate_treat_pileup.bdg

Signal tracks. Can be used to generate the bigWig file for visualisation

Getting The Peak-By-Cell Count Matrix#

Now that we have the peak and reads files, we can compute the number of reads in each peak for each cell. Then we could get the peak-by-cell count matrix. There are different ways of doing this. The following is the method I use.

Find Reads In Peaks Per Cell#

First, we use the aggregate_peaks.narrowPeak file. We only need the first 4 columns (chromosome, start, end, peak ID). You can also remove the peaks that overlap the black list regions. The black list is not available for every species and every build, so I’m not doing it here. We also need to sort the peak to make sure the order of the chromosomes in the peak file is the same as that in the mm10.chrom.sizes and reads.bed.gz files. Then we could find the overlap by bedtools. We need to do this in a specific way to get the number of reads in each peak from each cell:

# format and sort peaks

cut -f 1-4 snare-seq/chromap_outs/aggregate_peaks.narrowPeak | \
    sort -k1,1 -k2,2n > snare-seq/chromap_outs/aggregate_peaks_sorted.bed

# prepare the overlap

bedtools intersect \
    -a snare-seq/chromap_outs/aggregate_peaks_sorted.bed \
    -b snare-seq/chromap_outs/reads.bed.gz \
    -wo -sorted -g mm10/mm10.chrom.sizes | \
    sort -k8,8 | \
    bedtools groupby -g 8 -c 4 -o freqdesc | \
    gzip > snare-seq/chromap_outs/peak_read_ov.tsv.gz
Explain Finding Reads In Peaks Per Cell#

We start with the command before the first pipe, that is, the intersection part. If you read the manual of the bedtools intersect, it should be straightforward to understand. The -wo option will output the records in both -a file and -b file. Since the reads.bed.gz file has the cell barcode information at the 4th column, we would get an output with both peak and cell information for the overlap. The -sorted -g mm10/mm10.chrom.sizes options make the program use very little memory. Here is an example (top 5 lines) of the output of this part:

chr1	3094857	3095425	aggregate_peak_1	chr1	3094908	3094958	ACCAATAAAACC	.	+	50
chr1	3094857	3095425	aggregate_peak_1	chr1	3094908	3094958	ACCTCCATGCGG	.	+	50
chr1	3094857	3095425	aggregate_peak_1	chr1	3094908	3094958	AGAGTGTCGAGG	.	+	50
chr1	3094857	3095425	aggregate_peak_1	chr1	3094908	3094958	AGTCTCAGAGGA	.	+	50
chr1	3094857	3095425	aggregate_peak_1	chr1	3094908	3094958	CAGCTCGAGACA	.	+	50

We see that the 8th column holds the cell barcode and we want to group them using bedtools groupby. Therefore, we need to sort by this column, that is the sort -k8,8. When we group by the 8th column, we are interested in how many times each peak appear per group, so we could gather the information of the peak ID (4th column). That is the -g 8 -c 4 -o freqdesc. The -o freqdesc option returns a value:frequency pair in descending order. Here are some records from peak_read_ov.tsv.gz:

AAAAAAAAAAAA	aggregate_peak_36632:2
AAAAAAAAAGCG	aggregate_peak_208084:2
AAAAAAAAAGGC	aggregate_peak_188604:2,aggregate_peak_209595:2

In a way, that is sort of a count matrix in an awkward format. For example:

  • The first line means that in cell AAAAAAAAAAAA, the peak aggregate_peak_36632 has 2 counts. All the rest peaks not mentioned here have 0 counts in this cell.

  • The second line means that in cell AAAAAAAAAGCG, the peak aggregate_peak_208084 has 2 counts. All the rest peaks not mentioned here have 0 counts in this cell.

Output The Peak-By-Cell Matrix#

At this stage, we pretty much have all the things needed. Those two files aggregate_peaks_sorted.bed and peak_read_ov.tsv.gz contain all information for a peak-by-cell count matrix. We just need a final touch to make the output in a standard format: a market exchange format (MEX). Since most downstream software takes the input from the 10x Genomics Single Cell ATAC results, we are going to generate the MEX and the associated files similar to the output from 10x Genomics.

Here, I’m using a python script for this purpose. You don’t have to do this. Choose whatever works for you. The point here is to just generate similar files as the peak-barcode matrix described from the 10x Genomics website.

First, let’s make a directory to hold the output files and generate the peaks.bed and barcodes.tsv files, which are easy to do:

# create dirctory
mkdir -p snare-seq/chromap_outs/raw_peak_bc_matrix

# The 10x Genomics peaks.bed is a 3-column bed file, so we do
cut -f 1-3 snare-seq/chromap_outs/aggregate_peaks_sorted.bed > \
    snare-seq/chromap_outs/raw_peak_bc_matrix/peaks.bed

# The barcode is basically the first column of the file peak_read_ov.tsv.gz
zcat snare-seq/chromap_outs/peak_read_ov.tsv.gz | \
    cut -f 1 > \
    snare-seq/chromap_outs/raw_peak_bc_matrix/barcodes.tsv

The slightly more difficult file to generate is matrix.mtx. This is the python script generate_csc_mtx.py for this purpose:

# import helper packages
# most entries in the count matrix is 0, so we are going to use a sparse matrix
# since we need to keep updating the sparse matrix, we use lil_matrix from scipy
import sys
import gzip
from scipy.io import mmwrite
from scipy.sparse import lil_matrix

# the unique peak ID is a good renference
# generate a dictionary with peak_id : index_in_the_file
# sys.argv[1] is the 4-column bed file aggregate_peaks_sorted.bed
peak_idx = {}
with open(sys.argv[1]) as fh:
    for i, line in enumerate(fh):
        _, _, _, peak_name = line.strip().split('\t')
        peak_idx[peak_name] = i

# determine and create the dimension of the output matrix
# that is, to calculate the number of peaks and the number of barcodes
# sys.argv[2] is barcodes.tsv
num_peaks = len(peak_idx.keys())
num_cells = len(open(sys.argv[2]).readlines())
mtx = lil_matrix((num_peaks, num_cells), dtype = int)

# read the information from peak_read_ov.tsv.gz
# update the counts into the mtx object
# sys.argv[3] is peak_read_ov.tsv.gz
with gzip.open(sys.argv[3], 'rt') as fh:
    for i, line in enumerate(fh):
        col_idx = i # each column is a cell barcode
        count_info = line.strip().split('\t')[1]
        items = count_info.split(',')
        for pn_count in items:
            pn, count = pn_count.split(':')
            row_idx = peak_idx[pn] # each row is a peak
            mtx[row_idx, col_idx] = int(count)

# get a CSC sparse matrix, which is the same as the 10x Genomics matrix.mtx
mtx = mtx.tocsc()

# sys.argv[4] is the path to the output directory
mmwrite(sys.argv[4] + '/matrix.mtx', mtx, field='integer')

Run that script in the terminal:

python generate_csc_mtx.py \
    snare-seq/chromap_outs/aggregate_peaks_sorted.bed \
    snare-seq/chromap_outs/raw_peak_bc_matrix/barcodes.tsv \
    snare-seq/chromap_outs/peak_read_ov.tsv.gz \
    snare-seq/chromap_outs/raw_peak_bc_matrix

After that, you should have the matrix.mtx in the snare-seq/chromap_outs/raw_peak_bc_matrix directory.

Cell Calling (Filter Cell Barcodes)#

Experiments are never perfect. Even for droplets that do not contain any cell, you may still get some reads. In general, the number of reads from those droplets should be much smaller, often orders of magnitude smaller, than those droplets with cells. In order to identify true cells from the background, we could use starolo. It is used for scRNA-seq in general, but it does have a cell calling function that takes a directory containing raw mtx and associated files, and return the filtered ones. Since starsolo looks for the following three files in the input directory: matrix.mtx, features.tsv and barcodes.tsv. Those are the output from the 10x Genomics scRNA-seq workflow. In this case, we can use peaks.bed as our features.tsv:

# trick starsolo to use peaks.bed as features.tsv by creating symlink

ln -s peaks.bed snare-seq/chromap_outs/raw_peak_bc_matrix/features.tsv

# filter cells using starsolo

STAR --runMode soloCellFiltering \
     snare-seq/chromap_outs/raw_peak_bc_matrix \
     snare-seq/chromap_outs/filtered_peak_bc_matrix/ \
     --soloCellFilter EmptyDrops_CR

# rename the new feature.tsv to peaks.bed or just create symlink
ln -s features.tsv snare-seq/chromap_outs/filtered_peak_bc_matrix/peaks.bed

If everything goes well, your directory should look the same as the following:

scg_prep_test/snare-seq/
├── chromap_outs
│   ├── aggregate_control_lambda.bdg
│   ├── aggregate_peaks.narrowPeak
│   ├── aggregate_peaks_sorted.bed
│   ├── aggregate_peaks.xls
│   ├── aggregate_summits.bed
│   ├── aggregate_treat_pileup.bdg
│   ├── filtered_peak_bc_matrix
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   ├── matrix.mtx
│   │   └── peaks.bed -> features.tsv
│   ├── fragments.tsv.gz
│   ├── fragments.tsv.gz.tbi
│   ├── peak_read_ov.tsv.gz
│   ├── raw_peak_bc_matrix
│   │   ├── barcodes.tsv
│   │   ├── features.tsv -> peaks.bed
│   │   ├── matrix.mtx
│   │   └── peaks.bed
│   └── reads.bed.gz
├── data
│   ├── SRR9672089_1.fastq.gz
│   ├── SRR9672089_2.fastq.gz
│   ├── SRR9672090_1.fastq.gz
│   ├── SRR9672090_2.fastq.gz
│   └── SRR9672090_3.fastq.gz
└── star_outs
    ├── Aligned.sortedByCoord.out.bam
    ├── Log.final.out
    ├── Log.out
    ├── Log.progress.out
    ├── SJ.out.tab
    └── Solo.out
        ├── Barcodes.stats
        └── Gene
            ├── Features.stats
            ├── filtered
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   ├── matrix.mtx
            │   └── metrics.csv
            ├── raw
            │   ├── barcodes.tsv
            │   ├── features.tsv
            │   └── matrix.mtx
            ├── Summary.csv
            └── UMIperCellSorted.txt

9 directories, 39 files