CN112599199A

CN112599199A - Analysis method suitable for 10x single cell transcriptome sequencing data

Info

Publication number: CN112599199A
Application number: CN202011592574.4A
Authority: CN
Inventors: 沈立; 姜丽荣; 孙子奎
Original assignee: Shanghai Personal Biotechnology Co ltd
Current assignee: Shanghai Personal Biotechnology Co ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-02

Abstract

The invention discloses an analysis method suitable for 10x single cell transcriptome sequencing data, which comprises the steps of 1) sequencing data processing, 2) Seurat data filtering, 3) cell clustering and grouping, and 4) cell group marker gene analysis.

Description

Analysis method suitable for 10x single cell transcriptome sequencing data

Technical Field

The invention relates to the technical field of gene detection, in particular to an analysis method suitable for 10x single cell transcriptome sequencing data.

Background

A transcriptome is the collection of all transcripts produced by a given species or specific cell type. Transcriptome studies examine gene function and gene structure at the whole-transcriptome level, reveal the molecular mechanisms of specific biological processes and of disease development, and have been widely applied in basic research, clinical diagnosis, drug development, and other fields.

A conventional transcriptome captures the transcription of all mRNA in a biological tissue sample at a given time, and is commonly used as an important indicator of the state of the tissue or sample at that time. Differences between samples, tissues, species, and treatments change mRNA expression, which in turn regulates the life state of the organism or carries out particular cell functions.

However, the transcriptome of a sample or tissue is an average of the expression levels across all of its cells and cannot reflect the state of every cell, or of a particular group of cells, in the sample. The transcriptional state of single cells or of specific cell groups therefore needs to be studied in depth so that the state of the tissue can be reflected more finely and accurately. In immune or drug response research, for example, immunotherapy or targeted therapy can then be directed more accurately at specific cells or cell subsets, which is a prerequisite for precision medicine.

Disclosure of Invention

The invention provides an analysis method suitable for 10x single cell transcriptome sequencing data.

The scheme of the invention is as follows:

an analysis method suitable for 10x single cell transcriptome sequencing data, comprising the following steps:

1) sequencing data processing:

the raw data are processed using the official 10x Genomics Cell Ranger (cellranger) count pipeline:

processing the FASTQ files produced by the sequencer using the count program to obtain the number of reads expressed for each gene in each cell;

performing quality control on the sequencing result according to the information in the web page generated by the count program;

2) Seurat data filtering:

further removing low-quality cells from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells;

3) cell clustering and grouping: clustering the filtered cells using the Seurat software package;

4) cell population marker gene analysis: analyzing the marker genes of each cell population using the Seurat software package to obtain results.
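By way of illustration only, the count program of step 1) could be invoked as sketched below in Python. The sketch merely assembles the command line and does not run it. The run name, reference path, FASTQ directory, and sample name are hypothetical placeholders, and depending on the Cell Ranger release further options may be required.

```python
import shlex


def build_cellranger_count_cmd(run_id, transcriptome_dir, fastq_dir, sample_name):
    """Assemble the argument list for a `cellranger count` run (nothing is executed)."""
    return [
        "cellranger", "count",
        f"--id={run_id}",                        # name of the output directory
        f"--transcriptome={transcriptome_dir}",  # Cell Ranger reference package
        f"--fastqs={fastq_dir}",                 # directory holding the FASTQ files
        f"--sample={sample_name}",               # sample prefix of the FASTQ files
    ]


# Hypothetical paths and names, for illustration only.
cmd = build_cellranger_count_cmd(
    "sample1_run", "/path/to/refdata", "/path/to/fastqs", "sample1"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the placeholder paths have been replaced with real ones.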

As a preferred technical scheme, the main quality control indexes in step 1) comprise: the effective cell number, which should be close to the cell number estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.

As a preferred technical scheme, the main quality control indexes in step 1) comprise: the effective cell number, which should be close to the cell number estimated during library construction; the genome alignment rate, which is greater than 80%; and the gene set alignment rate, which is greater than 50%.
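The two sets of alignment thresholds above can be expressed as a simple check, sketched here in Python under stated assumptions. The function names are hypothetical, the rates are assumed to be supplied as percentages, and the 30% tolerance used to decide whether the effective cell number is "close to" the estimate is an illustrative choice, since the description gives no numeric definition.

```python
def passes_mapping_qc(genome_rate, gene_set_rate, poorly_assembled=False):
    """Return True when both alignment rates (in percent) exceed the thresholds.

    Conventional thresholds: genome > 80, gene set > 50.
    Relaxed thresholds (poorly assembled genome): genome > 70, gene set > 30.
    """
    min_genome, min_gene_set = (70.0, 30.0) if poorly_assembled else (80.0, 50.0)
    return genome_rate > min_genome and gene_set_rate > min_gene_set


def cell_count_close(effective_cells, expected_cells, tolerance=0.3):
    """Return True when the effective cell number is within `tolerance`
    (a hypothetical relative margin) of the number estimated at library construction."""
    return abs(effective_cells - expected_cells) <= tolerance * expected_cells


print(passes_mapping_qc(85.0, 55.0))                         # conventional criteria
print(passes_mapping_qc(75.0, 40.0))                         # fails conventional criteria
print(passes_mapping_qc(75.0, 40.0, poorly_assembled=True))  # passes relaxed criteria
print(cell_count_close(4800, 5000))
```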

As a preferred technical scheme, the low-quality cells removed in step 2) comprise cells with an excessively high number of expressed genes, cells with an excessively low number of expressed reads, and cells in which mitochondrial gene reads account for more than 10% of expressed reads.
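A minimal sketch of these three filters follows, written in plain Python rather than with the Seurat package itself. Only the 10% mitochondrial limit comes from the description; the upper limit on gene number (`max_genes`) and the lower limit on read number (`min_reads`) are hypothetical values chosen for illustration, as are the field names of the per-cell records.

```python
def filter_cells(cells, max_genes=6000, min_reads=500, max_mito_fraction=0.10):
    """Return the barcodes of cells that pass all three quality filters."""
    kept = []
    for cell in cells:
        total = cell["total_reads"]
        # Treat a cell with no reads as fully mitochondrial so that it is dropped.
        mito_fraction = cell["mito_reads"] / total if total else 1.0
        if cell["n_genes"] > max_genes:        # too many expressed genes
            continue
        if total < min_reads:                  # too few expressed reads
            continue
        if mito_fraction > max_mito_fraction:  # more than 10% mitochondrial reads
            continue
        kept.append(cell["barcode"])
    return kept


# Toy per-cell summaries (hypothetical values).
cells = [
    {"barcode": "c1", "n_genes": 2000, "total_reads": 8000, "mito_reads": 400},
    {"barcode": "c2", "n_genes": 7500, "total_reads": 40000, "mito_reads": 1000},
    {"barcode": "c3", "n_genes": 150, "total_reads": 300, "mito_reads": 10},
    {"barcode": "c4", "n_genes": 1800, "total_reads": 6000, "mito_reads": 900},
]
kept_barcodes = filter_cells(cells)
print(kept_barcodes)
```

In this toy input only `c1` passes: `c2` exceeds the gene-number limit, `c3` has too few reads, and `c4` has 15% mitochondrial reads.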

As a preferred technical scheme, clustering the filtered cells using the Seurat software package specifically comprises:

normalizing and standardizing (scaling) the filtered cell expression matrix;

selecting the 2000 genes with the highest expression variation among cells as the features for principal component analysis;

performing principal component analysis on the selected genes using the PCA algorithm, and selecting the first 20 principal components as the input to cluster analysis;

constructing a K-nearest neighbor (KNN) graph from the principal components, and clustering the cells using the Louvain algorithm;

and carrying out nonlinear dimensionality reduction on the clustering result using the UMAP algorithm, and visualizing the clustering result in the first two dimensions.
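The first two operations of this scheme (normalization and scaling of the expression matrix, and selection of the most variable genes) can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the Seurat implementation: normalization is taken to be a log transform of counts scaled to 10,000 per cell, variable genes are ranked by plain variance instead of Seurat's own selection method, and the toy example keeps 2 genes where the scheme keeps 2000. PCA, Louvain clustering, and UMAP are omitted because they require numerical libraries.

```python
import math
from statistics import mean, pstdev, pvariance


def log_normalize(counts, scale_factor=10_000):
    """Scale each cell (row) to `scale_factor` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        normalized.append(
            [math.log1p(c / total * scale_factor) if total else 0.0 for c in cell]
        )
    return normalized


def top_variable_genes(normalized, n_top):
    """Return the column indices of the n_top genes with the highest variance."""
    n_genes = len(normalized[0])
    variances = [pvariance([cell[g] for cell in normalized]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


def scale_genes(normalized, gene_idx):
    """Z-score each selected gene across cells; returns a cells x genes matrix."""
    scaled_cols = []
    for g in gene_idx:
        col = [cell[g] for cell in normalized]
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sd if sd else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


# Toy matrix: 3 cells (rows) x 4 genes (columns); genes 2 and 3 do not vary.
counts = [[10, 0, 5, 5], [0, 10, 5, 5], [5, 5, 5, 5]]
normalized = log_normalize(counts)
selected = top_variable_genes(normalized, n_top=2)
scaled = scale_genes(normalized, selected)
print(selected)
```

The scaled matrix restricted to the selected genes is what would be handed to principal component analysis in the next operation.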

As a preferred technical scheme, analyzing the marker genes of each cell population using the Seurat software package specifically comprises:

comparing the expression level in each cell population with the mean expression level in the other cell populations to find genes that are highly expressed in that cell population and lowly expressed in the others;

screening the marker genes found;

displaying the expression of the marker genes in each cell population using the VlnPlot function of the Seurat software package.

As a preferred technical scheme, the marker genes are screened by retaining only results in which the gene is expressed in more than 25% of the cells of the cell population and the logFC is greater than 0.25.
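The screening criteria can be illustrated with the following Python sketch. The 25% and 0.25 cut-offs are those stated above. The fold-change definition (base-2 logarithm of the ratio of mean expression values, with a pseudocount of 1) is an assumption made for the example and may differ from what the Seurat package computes. Function and gene names are hypothetical.

```python
import math


def marker_stats(expr_in, expr_out):
    """Return (fraction of expressing cells in the population, log fold change)."""
    pct_in = sum(1 for v in expr_in if v > 0) / len(expr_in)
    mean_in = sum(expr_in) / len(expr_in)
    mean_out = sum(expr_out) / len(expr_out)
    # Assumed definition: log2 ratio of means with a pseudocount of 1.
    log_fc = math.log2((mean_in + 1) / (mean_out + 1))
    return pct_in, log_fc


def screen_markers(genes, min_pct=0.25, min_logfc=0.25):
    """Keep genes expressed in more than min_pct of cells with logFC above min_logfc."""
    kept = []
    for name, (expr_in, expr_out) in genes.items():
        pct_in, log_fc = marker_stats(expr_in, expr_out)
        if pct_in > min_pct and log_fc > min_logfc:
            kept.append(name)
    return kept


# Toy data: per-gene expression inside the population and in all other cells.
genes = {
    "geneA": ([3, 4, 0, 5], [0, 1, 0, 0]),  # expressed in 75% of cells, clearly enriched
    "geneB": ([0, 0, 0, 8], [0, 0, 0, 0]),  # expressed in exactly 25% of cells
    "geneC": ([1, 1, 1, 1], [1, 1, 1, 1]),  # no difference between groups
}
markers = screen_markers(genes)
print(markers)
```

Because the criterion is "more than 25%", `geneB`, which is expressed in exactly 25% of the cells, is not retained.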

With the above technical scheme, the analysis method suitable for 10x single cell transcriptome sequencing data comprises: 1) sequencing data processing, in which the raw data are processed using the official 10x Genomics Cell Ranger (cellranger) count pipeline: processing the FASTQ files produced by the sequencer using the count program to obtain the number of reads expressed for each gene in each cell, and performing quality control on the sequencing result according to the information in the web page generated by the count program; 2) Seurat data filtering, in which low-quality cells are further removed from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells; 3) cell clustering and grouping, in which the filtered cells are clustered using the Seurat software package; and 4) cell population marker gene analysis, in which the marker genes of each cell population are analyzed using the Seurat software package to obtain results.

The advantages of the invention are: 1. the method reduces the number of data preprocessing steps, increases analysis speed, improves the efficiency of sequencing data analysis, and is convenient to operate and use;

2. downstream analysis of the data is carried out using the R-language Seurat software package, which is currently widely accepted, so that the analysis is more accurate and more refined;

3. quality control of the cells combines multiple parameters, which reduces the influence of low-quality cells on the analysis and improves its accuracy;

4. a choice of several differential analysis methods improves the sensitivity and accuracy of the marker gene search;

5. multiple forms of result display are provided and combined with the UMAP dimensionality reduction result, making it easier to understand how marker gene expression varies across cells.

Drawings

FIG. 1 is an analytical flow chart according to the present invention;

FIG. 2 is a scatter plot of the UMI number distribution of cells of the present invention;

FIG. 3 is a violin plot of single-cell data quality control according to the present invention;

FIG. 4 is a UMAP scatter plot of cells of the present invention;

FIG. 5 is a gene expression distribution scattergram of the present invention;

FIG. 6 is a GO enrichment factor plot of the marker genes of the present invention.

Detailed Description

In order to make up for the above deficiencies, the present invention provides an analysis method suitable for 10x single cell transcriptome sequencing data to solve the problems described in the background art.

1) sequencing data processing;

2) Seurat data filtering:

further removing low-quality cells from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells.

The main quality control indexes in step 1) comprise: the effective cell number, which should be close to the cell number estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.

The main quality control indexes in step 1) comprise: the effective cell number, which should be close to the cell number estimated during library construction; the genome alignment rate, which is greater than 80%; and the gene set alignment rate, which is greater than 50%.

The low-quality cells removed in step 2) comprise cells with an excessively high number of expressed genes, cells with an excessively low number of expressed reads, and cells in which mitochondrial gene reads account for more than 10% of expressed reads.

Clustering the filtered cells using the Seurat software package specifically comprises:

normalizing and standardizing (scaling) the filtered cell expression matrix;

Analyzing the marker genes of each cell population using the Seurat software package specifically comprises:

screening the marker genes found;

The marker genes are screened by retaining only results in which the gene is expressed in more than 25% of the cells of the cell population and the logFC is greater than 0.25.

In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.

Example:

The method comprises the following steps:

step one, sequencing data processing step:

processing the FASTQ files produced by the sequencer using the count program to obtain the number of reads expressed for each gene in each cell, as shown in FIG. 2;

performing quality control on the sequencing result according to the information in the web page generated by the count program, the main indexes being: 1. the effective cell number, which should be close to the cell number estimated during library construction; 2. the genome alignment rate, which should conventionally be greater than 80%; 3. the gene set alignment rate, which should conventionally be greater than 50%. For species with poor genome assembly, the criteria can be relaxed to a genome alignment rate greater than 70% and a gene set alignment rate greater than 30%;

step two, Seurat data filtering step:

low-quality cells are further removed from the expression data obtained in step one using the Seurat software package:

removing cells with an excessively high number of expressed genes;

removing cells with an excessively low number of expressed reads;

removing cells in which mitochondrial gene reads account for more than 10% of expressed reads, as shown in FIG. 3;

step three, cell clustering and grouping step:

the filtered cells are clustered using Seurat, specifically:

normalizing and standardizing (scaling) the filtered cell expression matrix;

performing principal component analysis on the selected genes using the PCA algorithm, and selecting the first 20 principal components as the input to cluster analysis;

performing nonlinear dimensionality reduction on the clustering result using the UMAP algorithm, and visualizing the clustering result in the first two dimensions, as shown in FIG. 4;
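Cluster analysis on the principal components starts from a K-nearest neighbor (KNN) graph, on which the Louvain algorithm then groups the cells. The Python sketch below illustrates only the graph construction, using Euclidean distance and a brute-force search, which are assumptions made for the example. The coordinates are toy values with two dimensions in place of the 20 principal components, and the Louvain step itself is not shown.

```python
import math


def knn_graph(points, k):
    """Map each cell index to the indices of its k nearest cells (Euclidean distance)."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from cell i to every other cell, paired with that cell's index.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()
        graph[i] = [j for _, j in others[:k]]
    return graph


# Toy principal-component coordinates: two well-separated groups of three cells.
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
graph = knn_graph(pcs, k=2)
for cell, neighbors in graph.items():
    print(cell, neighbors)
```

In this toy input every cell's two nearest neighbors lie in its own group, which is the structure a community-detection method such as Louvain would then recover as two clusters.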

step four, cell population marker gene analysis step:

the marker genes of each cell population are analyzed using Seurat, specifically:

screening the marker genes found, and retaining only results in which the gene is expressed in more than 25% of the cells of the cell population and the logFC is greater than 0.25;

displaying the expression of the marker genes in each cell population using the VlnPlot function of Seurat, as shown in FIG. 5 and FIG. 6.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An analysis method suitable for 10x single cell transcriptome sequencing data, which is characterized by comprising the following steps:

1) sequencing data processing;

2) Seurat data filtering;

2. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, wherein:

the main quality control indexes in step 1) comprise: the effective cell number, which should be close to the cell number estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.

3. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 2, wherein:

4. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, wherein:

5. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, wherein clustering the filtered cells using the Seurat software package specifically comprises:

normalizing and standardizing (scaling) the filtered cell expression matrix;

constructing a KNN graph from the principal components, and clustering the cells using the Louvain algorithm;

6. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, wherein analyzing the marker genes of each cell population using the Seurat software package specifically comprises:

screening the marker genes found;

7. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 6, wherein the marker genes are screened by retaining only results in which the gene is expressed in more than 25% of the cells of the cell population and the logFC is greater than 0.25.