CN112599199A - Analysis method suitable for 10x single cell transcriptome sequencing data

Publication number
CN112599199A
Authority
CN
China
Prior art keywords
cells
cell
expression
sequencing data
clustering
Prior art date
Legal status
Pending
Application number
CN202011592574.4A
Other languages
Chinese (zh)
Inventor
沈立
姜丽荣
孙子奎
Current Assignee
Shanghai Personal Biotechnology Co ltd
Original Assignee
Shanghai Personal Biotechnology Co ltd
Priority date
2020-12-29
Filing date
2020-12-29
Publication date
2021-04-02
Application filed by Shanghai Personal Biotechnology Co ltd
Priority to CN202011592574.4A
Publication of CN112599199A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention discloses an analysis method suitable for 10x single cell transcriptome sequencing data, which comprises the steps of 1) sequencing data processing, 2) Seurat data filtering, 3) cell clustering and grouping, and 4) cell group marker gene analysis.

Description

Analysis method suitable for 10x single cell transcriptome sequencing data
Technical Field
The invention relates to the technical field of gene detection, in particular to an analysis method suitable for 10x single cell transcriptome sequencing data.
Background
A transcriptome is the collection of all transcripts produced by a given species or a specific cell type. Transcriptome studies examine gene function and gene structure at a global level and reveal the molecular mechanisms of specific biological processes and of disease development; they are widely applied in basic research, clinical diagnosis, drug development and other fields.
A conventional transcriptome captures the transcription of all mRNA in a biological tissue sample at a given time, and is usually taken as an important indicator of the state of that tissue or sample at that time. Differences between samples, tissues, species and treatments change mRNA expression, which in turn regulates the state of the organism or carries out particular cell functions.
However, the transcriptome of a sample or tissue is an average of the expression levels over all of its cells, and cannot reflect the state of each individual cell or of a particular cell population within the sample. The transcriptional state of single cells or of particular cell populations therefore needs to be studied in depth, so that the state of the tissue can be described more finely and accurately. In immune or drug response research, for example, immunotherapy or targeted therapy can then be directed more precisely at particular cells or cell subsets, which is a prerequisite for precision medicine.
Disclosure of Invention
The invention provides an analysis method suitable for 10x single cell transcriptome sequencing data.
The scheme of the invention is as follows:
an analysis method suitable for 10x single cell transcriptome sequencing data, comprising the following steps:
1) sequencing data processing:
the raw data are processed using the official 10x Genomics Cell Ranger count pipeline (cellranger count):
processing the FASTQ files obtained from the sequencer with the count program to obtain the number of reads expressed for each gene in each cell;
performing quality control on the sequencing result according to the information in the web summary page generated by the count program;
2) Seurat data filtering:
further removing low-quality cells from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells;
3) cell clustering: clustering the filtered cells using the Seurat software package;
4) cell population marker gene analysis: analyzing the marker genes of each cell population using the Seurat software package to obtain the results.
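As an illustration of step 1), the following minimal sketch assembles a `cellranger count` command line in Python. The sample identifier and all paths are hypothetical placeholders, and only commonly used options are shown; the patent does not state which options were used, and newer Cell Ranger releases may require additional flags.

```python
from typing import List


def build_cellranger_count_command(run_id: str, transcriptome_dir: str,
                                   fastq_dir: str, sample_name: str) -> List[str]:
    """Assemble the argument list for a `cellranger count` run (nothing is executed here)."""
    return [
        "cellranger", "count",
        f"--id={run_id}",                        # name of the output directory
        f"--transcriptome={transcriptome_dir}",  # pre-built reference package
        f"--fastqs={fastq_dir}",                 # directory holding the FASTQ files
        f"--sample={sample_name}",               # sample prefix of the FASTQ file names
    ]


# Hypothetical sample name and paths, for illustration only.
cmd = build_cellranger_count_command(
    run_id="sample1",
    transcriptome_dir="/path/to/reference",
    fastq_dir="/path/to/fastqs",
    sample_name="sample1",
)
print(" ".join(cmd))
```

On a machine with Cell Ranger installed, the list could be passed to `subprocess.run(cmd, check=True)`; the web summary used for quality control is then typically found under the run's `outs` directory.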
As a preferred technical scheme, the main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.
As a preferred technical scheme, the main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 80%; and the gene set alignment rate, which is greater than 50%.
Preferably, the low-quality cells removed in step 2) comprise cells expressing too many genes, cells with too few expressed reads, and cells in which mitochondrial genes account for more than 10% of the expressed reads.
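The three filtering criteria can be sketched in plain Python as follows. Only the 10% mitochondrial limit comes from the text; the `max_genes` and `min_reads` cutoffs, the `CellQC` record and the example cells are hypothetical, since the patent does not give values for "too many genes" or "too few reads".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellQC:
    barcode: str
    n_genes: int     # number of genes detected in the cell
    n_reads: int     # total expressed reads counted for the cell
    mito_reads: int  # reads assigned to mitochondrial genes


def filter_cells(cells: List[CellQC], max_genes: int, min_reads: int,
                 max_mito_fraction: float = 0.10) -> List[CellQC]:
    """Keep cells that pass all three criteria; cutoffs other than 10% are assumptions."""
    kept = []
    for cell in cells:
        if cell.n_reads == 0 or cell.n_reads < min_reads:
            continue  # too few expressed reads
        if cell.n_genes > max_genes:
            continue  # too many expressed genes
        if cell.mito_reads / cell.n_reads > max_mito_fraction:
            continue  # mitochondrial reads above 10%
        kept.append(cell)
    return kept


# Made-up cells for illustration.
cells = [
    CellQC("cell_a", n_genes=1500, n_reads=5000, mito_reads=200),
    CellQC("cell_b", n_genes=9000, n_reads=60000, mito_reads=1000),
    CellQC("cell_c", n_genes=150, n_reads=300, mito_reads=10),
    CellQC("cell_d", n_genes=1200, n_reads=4000, mito_reads=800),
]
kept = filter_cells(cells, max_genes=6000, min_reads=500)
kept_barcodes = [c.barcode for c in kept]
print(kept_barcodes)
```

In practice the gene and read cutoffs are usually chosen per dataset by inspecting the distribution of these metrics, for example in the quality control violin plot.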
As a preferred technical scheme, clustering the filtered cells using the Seurat software package specifically comprises:
normalizing and standardizing the filtered cell expression matrix;
selecting the 2000 genes with the highest expression variation among cells as the input to principal component analysis;
performing principal component analysis (PCA) on the selected genes, and selecting the first 20 principal components as the input to the cluster analysis;
constructing a K-nearest neighbor (KNN) graph from the principal components, and clustering the cells using the Louvain algorithm;
and carrying out nonlinear dimensionality reduction on the clustering result by using a UMAP algorithm, and visualizing the clustering result according to the first two dimensions.
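The first two clustering sub-steps can be illustrated with a small pure-Python sketch. It assumes log-normalization with a scale factor of 10,000 and ranks genes by plain variance; both are simplifying assumptions made here, not details given in the patent, and Seurat's own variable-gene selection uses a more elaborate variance-stabilizing procedure.

```python
import math
from typing import List


def log_normalize(counts: List[List[float]], scale_factor: float = 10_000.0) -> List[List[float]]:
    """Rows are cells, columns are genes: scale each cell to a common total, then log1p."""
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(v / total * scale_factor) for v in row])
    return normalized


def top_variable_genes(matrix: List[List[float]], n_top: int) -> List[int]:
    """Return the column indices of the n_top genes with the largest sample variance."""
    n_cells, n_genes = len(matrix), len(matrix[0])
    variances = []
    for g in range(n_genes):
        column = [matrix[c][g] for c in range(n_cells)]
        mean = sum(column) / n_cells
        variances.append(sum((v - mean) ** 2 for v in column) / (n_cells - 1))
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


# Toy matrix: 3 cells x 4 genes; the last two genes do not vary between cells.
counts = [
    [10, 0, 5, 5],
    [0, 10, 5, 5],
    [5, 5, 5, 5],
]
normalized = log_normalize(counts)
print(top_variable_genes(normalized, n_top=2))
```

With real data, `n_top` would be 2000 as stated above, and the selected columns would be passed on to PCA.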
As a preferred technical scheme, analyzing the marker genes of each cell population using the Seurat software package specifically comprises:
comparing the expression level of the cells in each cell population with the mean expression level of the other cell populations, to find genes that are highly expressed in that cell population and lowly expressed in the other cell populations;
screening the marker genes found;
displaying the expression of the marker genes in each cell population using the VlnPlot function of the Seurat software package.
As a preferred technical scheme, the marker genes are screened by retaining the results for which the proportion of expressing cells in the cell population is greater than 25% and the logfc is greater than 0.25.
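The screening rule can be expressed as a short filter. The `MarkerCandidate` record and the gene names below are hypothetical; only the two thresholds (25% and 0.25) are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MarkerCandidate:
    gene: str
    pct_in_cluster: float  # fraction of cells in the population expressing the gene
    log_fc: float          # log fold change versus the other populations


def screen_markers(candidates: List[MarkerCandidate],
                   min_pct: float = 0.25, min_log_fc: float = 0.25) -> List[MarkerCandidate]:
    """Retain candidates exceeding both the expression-ratio and the logfc thresholds."""
    return [c for c in candidates
            if c.pct_in_cluster > min_pct and c.log_fc > min_log_fc]


# Made-up candidates for illustration.
candidates = [
    MarkerCandidate("GeneA", pct_in_cluster=0.80, log_fc=1.20),
    MarkerCandidate("GeneB", pct_in_cluster=0.10, log_fc=2.00),  # expressed in too few cells
    MarkerCandidate("GeneC", pct_in_cluster=0.60, log_fc=0.10),  # fold change too small
]
retained = [c.gene for c in screen_markers(candidates)]
print(retained)
```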
With the above technical scheme, the analysis method suitable for 10x single cell transcriptome sequencing data comprises: 1) sequencing data processing, in which the raw data are processed using the official 10x Genomics Cell Ranger count pipeline: the FASTQ files obtained from the sequencer are processed with the count program to obtain the number of reads expressed for each gene in each cell, and quality control is performed on the sequencing result according to the information in the web summary page generated by the count program; 2) Seurat data filtering, in which low-quality cells are further removed from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells; 3) cell clustering, in which the filtered cells are clustered using the Seurat software package; 4) cell population marker gene analysis, in which the marker genes of each cell population are analyzed using the Seurat software package to obtain the results.
The invention has the following advantages: 1. the method reduces the number of data preprocessing steps, increases the analysis speed, improves sequencing efficiency and is convenient to operate and use;
2. downstream analysis is carried out with the R-language Seurat software package, which is currently widely accepted, so that the analysis is more accurate and more precise;
3. quality control is performed on the cells by combining several parameters, which reduces the influence of low-quality cells on the analysis and improves its accuracy;
4. multiple optional differential analysis methods improve the sensitivity and accuracy of the marker gene search;
5. multiple forms of result display are provided which, combined with the UMAP dimensionality reduction result, make the dynamic changes of marker genes across cells easier to understand.
Drawings
FIG. 1 is an analytical flow chart according to the present invention;
FIG. 2 is a scatter plot of the UMI number distribution of cells of the present invention;
FIG. 3 is a violin plot of single cell data quality control according to the present invention;
FIG. 4 is a UMAP scatter plot of cells of the present invention;
FIG. 5 is a scatter plot of the gene expression distribution of the present invention;
FIG. 6 is a GO enrichment factor plot of the marker genes of the present invention.
Detailed Description
In order to make up for the above deficiencies, the present invention provides an analysis method suitable for 10x single cell transcriptome sequencing data to solve the problems described in the background.
An analysis method suitable for 10x single cell transcriptome sequencing data, comprising the following steps:
1) sequencing data processing:
the raw data are processed using the official 10x Genomics Cell Ranger count pipeline (cellranger count):
processing the FASTQ files obtained from the sequencer with the count program to obtain the number of reads expressed for each gene in each cell;
performing quality control on the sequencing result according to the information in the web summary page generated by the count program;
2) Seurat data filtering:
further removing low-quality cells from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells;
3) cell clustering: clustering the filtered cells using the Seurat software package;
4) cell population marker gene analysis: analyzing the marker genes of each cell population using the Seurat software package to obtain the results.
The main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.
The main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 80%; and the gene set alignment rate, which is greater than 50%.
The low-quality cells removed in step 2) comprise cells expressing too many genes, cells with too few expressed reads, and cells in which mitochondrial genes account for more than 10% of the expressed reads.
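The per-cell quantities used by these criteria can be derived from one cell's count vector, as in the sketch below. Identifying mitochondrial genes by the `MT-` name prefix is an assumption (it is the usual convention for human gene symbols); the gene names and counts are made up.

```python
from typing import Dict, List


def cell_qc_metrics(gene_names: List[str], counts: List[int],
                    mito_prefix: str = "MT-") -> Dict[str, float]:
    """Compute the number of expressed genes, total reads and mitochondrial percentage."""
    n_reads = sum(counts)
    n_genes = sum(1 for c in counts if c > 0)
    mito_reads = sum(c for name, c in zip(gene_names, counts)
                     if name.startswith(mito_prefix))
    percent_mito = 100.0 * mito_reads / n_reads if n_reads else 0.0
    return {"n_genes": n_genes, "n_reads": n_reads, "percent_mito": percent_mito}


# One hypothetical cell with four genes.
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
counts = [4, 0, 30, 16]
metrics = cell_qc_metrics(gene_names, counts)
print(metrics)
```

A cell whose `percent_mito` exceeds 10 would be removed under the criterion stated above.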
Clustering the filtered cells using the Seurat software package specifically comprises:
normalizing and standardizing the filtered cell expression matrix;
selecting the 2000 genes with the highest expression variation among cells as the input to principal component analysis;
performing principal component analysis (PCA) on the selected genes, and selecting the first 20 principal components as the input to the cluster analysis;
constructing a K-nearest neighbor (KNN) graph from the principal components, and clustering the cells using the Louvain algorithm;
and carrying out nonlinear dimensionality reduction on the clustering result by using a UMAP algorithm, and visualizing the clustering result according to the first two dimensions.
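The KNN graph construction can be sketched with a brute-force nearest-neighbor search. The example uses two coordinates per cell in place of the 20 principal components and Euclidean distance; the value of `k` and the points are made up, and Seurat's own neighbor search and graph weighting are more involved than this.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(points: List[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """For each cell, list the indices of its k nearest cells by Euclidean distance."""
    neighbors = {}
    for i, p in enumerate(points):
        distances = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        distances.sort()  # nearest first; ties are broken by cell index
        neighbors[i] = [j for _, j in distances[:k]]
    return neighbors


# Six hypothetical cells forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
for cell, nearest in graph.items():
    print(cell, nearest)
```

Community detection such as the Louvain algorithm is then run on this graph, so cells that share many neighbors tend to end up in the same cluster.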
Analyzing the marker genes of each cell population using the Seurat software package specifically comprises:
comparing the expression level of the cells in each cell population with the mean expression level of the other cell populations, to find genes that are highly expressed in that cell population and lowly expressed in the other cell populations;
screening the marker genes found;
displaying the expression of the marker genes in each cell population using the VlnPlot function of the Seurat software package.
The marker genes are screened by retaining the results for which the proportion of expressing cells in the cell population is greater than 25% and the logfc is greater than 0.25.
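For a single gene, the comparison of one cell population against the others can be sketched as follows. The patent does not state how the logfc is computed, so the formula here (natural log of the ratio of mean expression, with a pseudocount of 1) is an assumption chosen for illustration, and the expression values are made up.

```python
import math
from typing import List, Tuple


def marker_stats(in_population: List[float], other_cells: List[float],
                 pseudocount: float = 1.0) -> Tuple[float, float]:
    """Return (fraction of expressing cells in the population, log fold change)."""
    pct_in = sum(1 for v in in_population if v > 0) / len(in_population)
    mean_in = sum(in_population) / len(in_population)
    mean_out = sum(other_cells) / len(other_cells)
    # Assumed definition: difference of log mean expression, stabilized by a pseudocount.
    log_fc = math.log(mean_in + pseudocount) - math.log(mean_out + pseudocount)
    return pct_in, log_fc


# Hypothetical expression of one gene in a population and in all other cells.
pct_in, log_fc = marker_stats(in_population=[2, 3, 0, 3], other_cells=[0, 0, 1, 1])
is_marker = pct_in > 0.25 and log_fc > 0.25
print(pct_in, round(log_fc, 3), is_marker)
```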
In order to make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to a specific embodiment.
Example:
step one, a sequencing data processing step:
the raw data are processed using the official 10x Genomics Cell Ranger count pipeline (cellranger count):
processing the FASTQ files obtained from the sequencer with the count program to obtain the number of reads expressed for each gene in each cell, as shown in FIG. 2;
performing quality control on the sequencing result according to the information in the web summary page generated by the count program, the main indexes being: 1. the number of valid cells, which should be close to the number of cells estimated during library construction; 2. the genome alignment rate, which should conventionally be greater than 80%; 3. the gene set alignment rate, which should conventionally be greater than 50%. For species with poor genome assembly, the criteria can be relaxed to a genome alignment rate greater than 70% and a gene set alignment rate greater than 30%;
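The conventional and relaxed alignment criteria can be captured in a small check. The function names, the `poorly_assembled_genome` switch and the 20% tolerance used to decide whether the cell number is "close" to the estimate are hypothetical; the patent gives only the alignment thresholds.

```python
def passes_alignment_qc(genome_rate: float, gene_set_rate: float,
                        poorly_assembled_genome: bool = False) -> bool:
    """Apply the 80%/50% criteria, or the relaxed 70%/30% criteria for poor assemblies."""
    if poorly_assembled_genome:
        min_genome, min_gene_set = 0.70, 0.30
    else:
        min_genome, min_gene_set = 0.80, 0.50
    return genome_rate > min_genome and gene_set_rate > min_gene_set


def cell_number_close(valid_cells: int, expected_cells: int,
                      tolerance: float = 0.20) -> bool:
    """Hypothetical reading of 'close': within a relative tolerance of the estimate."""
    return abs(valid_cells - expected_cells) <= tolerance * expected_cells


# Made-up run: 75% genome alignment, 40% gene set alignment.
strict_ok = passes_alignment_qc(0.75, 0.40)
relaxed_ok = passes_alignment_qc(0.75, 0.40, poorly_assembled_genome=True)
print(strict_ok, relaxed_ok, cell_number_close(4500, 5000))
```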
step two, a Seurat data filtering step:
low-quality cells are further removed from the expression data obtained in step one using the Seurat software package:
removing cells expressing too many genes;
removing cells with too few expressed reads;
removing cells in which mitochondrial genes account for more than 10% of the expressed reads, as shown in FIG. 3;
step three, a cell clustering step:
the filtered cells are clustered using Seurat, specifically:
normalizing and standardizing the filtered cell expression matrix;
selecting the 2000 genes with the highest expression variation among cells as the input to principal component analysis;
performing principal component analysis (PCA) on the selected genes, and selecting the first 20 principal components as the input to the cluster analysis;
constructing a K-nearest neighbor (KNN) graph from the principal components, and clustering the cells using the Louvain algorithm;
performing nonlinear dimensionality reduction on the clustering result using the UMAP algorithm, and visualizing the clustering result in the first two dimensions, as shown in FIG. 4;
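The Louvain algorithm searches for a partition of the graph that maximizes modularity. The sketch below does not implement Louvain itself; it only computes the modularity of a given partition of an unweighted, undirected graph, to show the quantity being optimized. The toy graph and community labels are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def modularity(edges: List[Tuple[int, int]], community: Dict[int, str]) -> float:
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    internal_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal_edges[community[u]] += 1
    return sum(internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles of cells joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4), round(q_one, 4))
```

Splitting the toy graph into its two triangles gives a higher modularity than keeping all cells in one group, which is the kind of improvement a modularity-based clustering step looks for.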
step four, a cell population marker gene analysis step:
the marker genes of each cell population are analyzed using Seurat, specifically:
comparing the expression level of the cells in each cell population with the mean expression level of the other cell populations, to find genes that are highly expressed in that cell population and lowly expressed in the other cell populations;
screening the marker genes found, retaining the results for which the proportion of expressing cells in the cell population is greater than 25% and the logfc is greater than 0.25;
displaying the expression of the marker genes in each cell population using the VlnPlot function of Seurat, as shown in FIG. 5 and FIG. 6.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (7)

1. An analysis method suitable for 10x single cell transcriptome sequencing data, characterized by comprising the following steps:
1) sequencing data processing:
the raw data are processed using the official 10x Genomics Cell Ranger count pipeline (cellranger count):
processing the FASTQ files obtained from the sequencer with the count program to obtain the number of reads expressed for each gene in each cell;
performing quality control on the sequencing result according to the information in the web summary page generated by the count program;
2) Seurat data filtering:
further removing low-quality cells from the expression data obtained in step 1) using the Seurat software package to obtain filtered cells;
3) cell clustering: clustering the filtered cells using the Seurat software package;
4) cell population marker gene analysis: analyzing the marker genes of each cell population using the Seurat software package to obtain the results.
2. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, characterized in that:
the main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 70%; and the gene set alignment rate, which is greater than 30%.
3. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 2, characterized in that:
the main quality control indexes in step 1) comprise: the number of valid cells, which is close to the number of cells estimated during library construction; the genome alignment rate, which is greater than 80%; and the gene set alignment rate, which is greater than 50%.
4. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, characterized in that:
the low-quality cells removed in step 2) comprise cells expressing too many genes, cells with too few expressed reads, and cells in which mitochondrial genes account for more than 10% of the expressed reads.
5. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, characterized in that clustering the filtered cells using the Seurat software package specifically comprises:
normalizing and standardizing the filtered cell expression matrix;
selecting the 2000 genes with the highest expression variation among cells as the input to principal component analysis;
performing principal component analysis (PCA) on the selected genes, and selecting the first 20 principal components as the input to the cluster analysis;
constructing a KNN graph from the principal components, and clustering the cells using the Louvain algorithm;
performing nonlinear dimensionality reduction on the clustering result using the UMAP algorithm, and visualizing the clustering result in the first two dimensions.
6. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 1, characterized in that analyzing the marker genes of each cell population using the Seurat software package specifically comprises:
comparing the expression level of the cells in each cell population with the mean expression level of the other cell populations, to find genes that are highly expressed in that cell population and lowly expressed in the other cell populations;
screening the marker genes found;
displaying the expression of the marker genes in each cell population using the VlnPlot function of the Seurat software package.
7. The analysis method suitable for 10x single cell transcriptome sequencing data according to claim 6, characterized in that: the marker genes are screened by retaining the results for which the proportion of expressing cells in the cell population is greater than 25% and the logfc is greater than 0.25.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011592574.4A CN112599199A (en) 2020-12-29 2020-12-29 Analysis method suitable for 10x single cell transcriptome sequencing data

Publications (1)

Publication Number Publication Date
CN112599199A (en) 2021-04-02

Family

ID=75203357

Country Status (1)

Country Link
CN (1) CN112599199A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019200342A1 (en) * 2018-04-12 2019-10-17 The J. David Gladstone Institutes Methods for treating apoe4/4-associated disorders
CN109979538A (en) * 2019-03-28 2019-07-05 广州基迪奥生物科技有限公司 A kind of analysis method based on the unicellular transcript profile sequencing data of 10X
CN110060729A (en) * 2019-03-28 2019-07-26 广州序科码生物技术有限责任公司 A method of cell identity is annotated based on unicellular transcript profile cluster result
CN111863138A (en) * 2020-05-26 2020-10-30 浙江大学 Human uterine tissue cell composition analysis model and establishing method and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YULONG FU;XIAOHU HUANG;PENG ZHANG;JOYCE VAN DE LEEMPUT;ZHE HAN;: "Single-cell RNA sequencing identifies novel cell types in Drosophila blood", JOURNAL OF GENETICS AND GENOMICS, no. 04 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113188981A (en) * 2021-04-30 2021-07-30 天津深析智能科技发展有限公司 Automatic analysis method of multi-factor cytokine
CN113257364A (en) * 2021-05-26 2021-08-13 南开大学 Single cell transcriptome sequencing data clustering method and system based on multi-objective evolution
CN113611359A (en) * 2021-08-13 2021-11-05 江苏先声医学诊断有限公司 Method for improving strain assembly efficiency of metagenome nanopore sequencing data
CN113674800A (en) * 2021-08-25 2021-11-19 中国农业科学院蔬菜花卉研究所 Cell clustering method based on single cell transcriptome sequencing data
CN115424668A (en) * 2022-11-02 2022-12-02 杭州联川基因诊断技术有限公司 Single-cell transcriptome data availability analysis method, medium and equipment
CN117079726A (en) * 2023-10-16 2023-11-17 浙江大学长三角智慧绿洲创新中心 Database visualization method based on single cells and related equipment
CN117079726B (en) * 2023-10-16 2024-01-30 浙江大学长三角智慧绿洲创新中心 Database visualization method based on single cells and related equipment

Similar Documents

Publication Publication Date Title
CN112599199A (en) Analysis method suitable for 10x single cell transcriptome sequencing data
US7860660B2 (en) Characterization of phenotypes by gene expression patterns and classification of samples based thereon
Belacel et al. Clustering methods for microarray gene expression data
US20060259246A1 (en) Methods for efficiently mining broad data sets for biological markers
Kim et al. Effect of data normalization on fuzzy clustering of DNA microarray data
CA2300639A1 (en) Methods and apparatus for analyzing gene expression data
CN113674800B (en) Cell clustering method based on single cell transcriptome sequencing data
US20130304783A1 (en) Computer-implemented method for analyzing multivariate data
Jhajharia et al. A cross-platform evaluation of various decision tree algorithms for prognostic analysis of breast cancer data
US20140058682A1 (en) Nucleic Acid Information Processing Device and Processing Method Thereof
US20140019062A1 (en) Nucleic Acid Information Processing Device and Processing Method Thereof
Bir-Jmel et al. Gene selection via BPSO and Backward generation for cancer classification
TW202121223A (en) Methods for training an artificial neural network to predict whether a subject will exhibit a characteristic gene expression and systems for executing the same
Khalilabad et al. Fully automatic classification of breast cancer microarray images
CN115274136A (en) Tumor cell line drug response prediction method integrating multiomic and essential genes
KR20100001177A (en) Gene selection algorithm using principal component analysis
Schaefer Gene expression analysis based on ant colony optimisation classification
Ma et al. EnsembleKQC: an unsupervised ensemble learning method for quality control of single cell RNA-seq sequencing data
JP3936851B2 (en) Clustering result evaluation method and clustering result display method
Walsh et al. Feature selection using co-occurrence correlation improves cell clustering and embedding in single cell rnaseq data
EP1691311A1 (en) Method, system and software for carrying out biological interpretations of microarray experiments
CN113971984A (en) Classification model construction method and device, electronic equipment and storage medium
CN115527610B (en) Cluster analysis method for single-cell histology data
Muhammad et al. Gvdeepnet: Unsupervised deep learning techniques for effective genetic variant classification
Zhong et al. Controlled Noise: Evidence of Epigenetic Regulation of Single-Cell Expression Variability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination