Single Cell Genomics Library Preprocessing Pipelines
As more and more methods are developed, the list in the scg_lib_structs GitHub repository keeps growing. To extract information from the data, the first step is always data preprocessing, a procedure that converts the raw data (fastq files) to some sort of count matrix, such as a gene-by-cell or peak-by-cell matrix. If you Google it, you will find some tutorials on this topic, and different experimental methods have quite different preprocessing pipelines. With the development of computational tools, it is now possible to perform the preprocessing using a unified pipeline (sort of …).
This documentation showcases how to perform the data preprocessing step using just starsolo for ALL scRNA-seq methods and chromap + MACS for ALL scATAC-seq methods. It does not really provide a ready-to-use pipeline for each method. Instead, it documents the commands used in the pipeline and, more importantly, explains the rationale behind the chosen parameters of each program. I chose starsolo and chromap + MACS because I’m familiar with them and they are fast. The point here is to showcase how and why we do this so that you can build and customise the pipeline on your own using the tools you like, such as zUMIs and kallisto + bustools.
Feedback needed !!! This project is still under development and will be updated as my time allows. If you have questions, spot any errors, see something confusing, or have suggestions for improvement, please do get in touch by raising an issue in the scg_lib_structs GitHub repository, or by email:
If you go through a few methods, you will find that some text is repeated. The reason is that I want each method to be a self-contained and independent page. I want people to be able to just click a method they are interested in and immediately start reading, without having to read other methods first.
Make sure you are familiar with different sequencing modes from different Illumina machines by looking at this page.
The software needed for the preprocessing is very standard. All of the programs are used routinely in genomics and can be installed via conda or the like, so I’m not going to talk about software installation. I’m stating the version of each program I’m currently using (08-Aug-2022). Also make sure they are executable and in your
- curl v7.79.1
- wget v1.20.3
- samtools v1.13
- bedtools v2.30.0
- tabix v0.2.5
- bgzip v0.2.5
- bedClip
- faSize
- chromap v0.2.3-r424
- MACS2 v188.8.131.52
bedClip and faSize are from the UCSC genome browser executables.
In addition, you might also need the following programs in certain cases:
- bcl2fastq v184.108.40.2062 (only if you want to practice generating FastQ)
- sratoolkit v3.0.0 (only for a few GEO data)
- cutadapt v2.10
- umi_tools v1.0.1
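One way to confirm that the programs listed above can actually be found by your shell is a small check like the one below. This is only a sketch: `missing_tools` is a hypothetical helper name, not part of any of the listed programs, and the executable names (for example `macs2` for MACS2) are assumed to match a typical conda installation. It prints the name of every program that cannot be resolved and prints nothing when all of them are found.

```shell
#!/usr/bin/env bash
# missing_tools: hypothetical helper, not part of any program listed above.
# Prints the name of each argument that the shell cannot resolve to a command.
missing_tools() {
    local tool
    for tool in "$@"; do
        # `command -v` returns non-zero when the name is not found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions based on a typical installation.
missing_tools curl wget samtools bedtools tabix bgzip bedClip faSize chromap macs2
```

If anything is printed, install that program or adjust your shell environment before moving on. The check only tells you that a program can be found, not which version it is, so compare versions separately if you want to match the ones stated above.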
Before you start:
- Gene expression