Single Cell Genomics Library Preprocessing Pipelines

As more and more methods are developed, the list in the scg_lib_structs GitHub repository keeps growing. To extract information from the data, the first step is always data preprocessing, a procedure that converts the raw data (fastq files) into count matrices, such as gene-by-cell or peak-by-cell matrices. If you search online, you will find some tutorials on this topic, and different experimental methods have quite different preprocessing pipelines. With the development of computational tools, it is now possible to perform the preprocessing using a unified pipeline (sort of …).

This documentation showcases how to perform the data preprocessing step using just starsolo for ALL scRNA-seq methods and chromap + MACS for ALL scATAC-seq methods. It does not really provide a ready-to-use pipeline for each method. Instead, it documents the commands used in the pipeline and, more importantly, explains the rationale behind the chosen parameters of each program. I chose starsolo and chromap + MACS because I’m familiar with them and they are fast. The point here is to showcase how and why we do this, so that you can build and customise the pipeline on your own using the tools you like, such as zUMIs and kallisto + bustools.
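To make the idea of a gene-by-cell count matrix concrete, here is a minimal toy sketch in Python. It is not what starsolo does internally. It assumes the reads have already been aligned and assigned to genes, and the barcodes, UMIs and gene names (as well as the `count_matrix` helper) are made up for illustration. Real tools also correct barcode and UMI sequencing errors, which is skipped here.

```python
from collections import defaultdict

# Hypothetical toy records: (cell barcode, UMI, gene) for reads that have
# already been aligned and assigned to a gene.
records = [
    ("AAAC", "TTGA", "GeneA"),
    ("AAAC", "TTGA", "GeneA"),  # same cell, UMI and gene: counted only once
    ("AAAC", "CCGT", "GeneA"),
    ("AAAC", "GGAT", "GeneB"),
    ("TTTG", "ACGT", "GeneB"),
]


def count_matrix(records):
    """Count unique UMIs for every (gene, cell) pair."""
    umis = defaultdict(set)
    for cell, umi, gene in records:
        umis[(gene, cell)].add(umi)
    genes = sorted({gene for gene, _ in umis})
    cells = sorted({cell for _, cell in umis})
    # Rows are genes, columns are cells; missing pairs get a count of 0.
    matrix = [[len(umis.get((g, c), ())) for c in cells] for g in genes]
    return genes, cells, matrix


genes, cells, matrix = count_matrix(records)

# Print the matrix as a small tab-separated table.
print("gene\t" + "\t".join(cells))
for gene, row in zip(genes, matrix):
    print(gene + "\t" + "\t".join(str(n) for n in row))
```

The two identical reads from cell `AAAC` collapse into a single count, which is the reason UMIs are used in the first place. A peak-by-cell matrix for scATAC-seq follows the same logic, with peaks in place of genes and fragments in place of UMIs.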

Note

  • Feedback needed! This project is still under development and will be updated as my time allows. If you have questions, spot any errors, see something confusing or have suggestions for improvement, please do get in touch by raising an issue in the scg_lib_structs GitHub repository, or by email: chenx9@sustech.edu.cn.

  • If you go through a few methods, you will find that some text is repeated. The reason is that I want to make each method a self-contained and independent page. I want people to be able to just click a method they are interested in and immediately start reading, without having to read other methods first.

Tip

Make sure you are familiar with different sequencing modes from different Illumina machines by looking at this page.

Required software#

The software needed for the preprocessing is very standard. All of the programs are used routinely in genomics and can be installed via conda or the like, so I’m not going to talk about software installation. Below I state the version of each program I’m currently using (08-Aug-2022). Also make sure they are executable and in your $PATH.

  • General utilities

curl v7.79.1
wget v1.20.3
samtools v1.13
bedtools v2.30.0
tabix v0.2.5
bgzip v0.2.5
bedClip
faSize

  • scRNA-seq

STAR v2.7.9a

  • scATAC-seq

chromap v0.2.3-r424
MACS2 v2.2.7.1

Note that bedClip and faSize are part of the UCSC Genome Browser command-line utilities.

In addition, you might also need the following programs in certain cases:

bcl2fastq v2.20.0.422 (only if you want to practice generating FastQ)
sratoolkit v3.0.0 (only for a few GEO data)
cutadapt v2.10
umi_tools v1.0.1
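A quick way to confirm that the programs listed above are in your $PATH is a small check like the sketch below. It only uses the Python standard library. The executable names are my assumptions about how each package is usually installed (for example `macs2` for MACS2 and `fastq-dump` for sratoolkit), and `find_missing` is just an illustrative helper, so adjust the lists to match your own setup.

```python
import shutil

# Assumed executable names for the required programs.
REQUIRED = [
    "curl", "wget", "samtools", "bedtools", "tabix", "bgzip",
    "bedClip", "faSize", "STAR", "chromap", "macs2",
]

# Assumed executable names for the programs needed only in certain cases.
OPTIONAL = ["bcl2fastq", "fastq-dump", "cutadapt", "umi_tools"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found as executables in $PATH."""
    return [tool for tool in tools if which(tool) is None]


for label, tools in (("required", REQUIRED), ("optional", OPTIONAL)):
    missing = find_missing(tools)
    if missing:
        print(f"Missing {label} programs: {', '.join(missing)}")
    else:
        print(f"All {label} programs found.")
```

This only tells you whether each executable can be found, not which version it is. To compare against the versions stated above, run each program's own version option by hand.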

Before you start:#

Methods:#