
CN115440303B - Method, medium and equipment for filtering low-quality cells of unicellular transcriptome

Info

Publication number: CN115440303B
Application number: CN202211367300.4A
Authority: CN
Inventors: 陈哲名; 郎秋蕾; 韩斐然
Original assignee: Hangzhou Lianchuan Biotechnology Co ltd
Current assignee: Hangzhou Lianchuan Biotechnology Co ltd
Priority date: 2022-11-03
Filing date: 2022-11-03
Publication date: 2023-02-10
Anticipated expiration: 2042-11-03
Also published as: CN116805511A; CN116486916A; CN115440303A

Abstract

The invention discloses a method for filtering low-quality cells of a single-cell transcriptome and relates to biological data processing. The method comprises the following steps: grouping the cells; averaging the expression levels gene by gene within each group to generate a characteristic expression profile of each cell population; randomly combining the characteristic expression profiles of the cell populations in pairs to generate artificial multicellular cells; combining the artificial multicellular expression profile with the real cell expression profile and calculating the distance between cells; setting a plurality of equidistant neighborhoods within a specified range and calculating the artificial multicellular proportion of each real cell in each neighborhood; for each neighborhood, computing the distribution of the artificial multicellular proportion and its bimodal coefficient, and taking the neighborhood with the largest bimodal coefficient as the optimal neighborhood; and, in the optimal neighborhood, identifying a prescribed number of real cells having the largest artificial multicellular proportion as multicellular and deleting them from the real cell expression profile. The method improves the filtering standard and accuracy for single-cell transcriptome data and enhances the reliability of the data.

Description

Method, medium and equipment for filtering low-quality cells of unicellular transcriptome

Technical Field

The invention relates to a biological data processing method, in particular to a method, a medium and equipment for filtering low-quality cells of a single-cell transcriptome.

Background

Single-cell transcriptome sequencing based on microfluidic technology can quantify the gene expression of tens of thousands of cells in a single experiment. The method identifies single cells mainly by sequence tags: the core idea is to add a unique sequence tag to each cell and, during sequencing, to treat nucleic acid sequences carrying the same tag as coming from the same cell. The 10x Genomics single-cell transcriptome sequencing platform is currently widely used. It achieves high-throughput cell sorting and capture using microfluidics, oil-droplet encapsulation and barcode labels, can separate and label from 500 to tens of thousands of single cells at a time, and yields transcriptome information for each cell after sequencing. Its advantages include high cell throughput, low library construction cost and a short capture cycle.

A typical single-cell transcriptome sequencing experiment proceeds as follows. A cell suspension is first prepared and, on the corresponding platform instrument, mixed with magnetic beads by means of a microfluidic chip and encapsulated in oil droplets. Each bead carries a unique nucleotide sequence, namely a barcode label, which marks a single cell. Each barcode label is also linked to a unique molecular identifier (UMI) consisting of a nucleotide sequence, and each UMI can tag one mRNA transcript. After reverse transcription, PCR amplification, library construction and sequencing, the barcode and UMI labels in the sequencing data make it possible to determine whether sequences originate from the same cell and the same mRNA molecule, which reduces the influence of PCR amplification bias across molecules. By matching and counting barcodes and UMIs, gene expression information is summarized in a count matrix, yielding the transcriptome expression profile of each individual cell.
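The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(barcode, umi, gene)` read tuples, the function name `count_umis` and the toy reads are hypothetical, and production pipelines additionally correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs for every (barcode, gene) pair.

    reads: iterable of (barcode, umi, gene) tuples, one per aligned read.
    Returns a nested dict: counts[barcode][gene] -> number of distinct UMIs.
    """
    seen = defaultdict(set)  # (barcode, gene) -> set of UMIs observed
    for barcode, umi, gene in reads:
        seen[(barcode, gene)].add(umi)

    counts = defaultdict(dict)
    for (barcode, gene), umis in seen.items():
        counts[barcode][gene] = len(umis)
    return dict(counts)


# Hypothetical toy reads: the first two are PCR duplicates of one transcript,
# so they share barcode, UMI and gene and are counted once.
reads = [
    ("AAAC", "TTG", "G1"),
    ("AAAC", "TTG", "G1"),
    ("AAAC", "CCA", "G1"),
    ("AAAC", "GGT", "G2"),
    ("TTTG", "ACA", "G1"),
]
counts = count_umis(reads)
```

Counting distinct UMIs rather than reads is what removes the PCR duplication effect mentioned in the text.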

Single-cell experiments usually obtain single cells in bulk by dissociating and disrupting biological tissue, which often produces many cell fragments and apoptotic cells. In droplet-based single-cell transcriptome techniques, two or more cells (or a whole cell plus cell debris) may also end up in one droplet. Single-cell transcriptome data contain hundreds of thousands to millions of droplets, but the barcode of a droplet cannot by itself indicate whether the droplet contains a cell, or whether the contents are cell fragments, dead/dying cells or multiple cells; that is, the amount of cytoplasm in the droplet cannot be determined automatically. Cell quality strongly influences the results of downstream analysis, so the type of droplet represented by each barcode needs to be determined before data analysis. Cell Ranger, the official 10x Genomics software, can only determine whether a barcode corresponds to an empty droplet and cannot assess cell quality, which may cause the analysis results of the single-cell transcriptome to deviate substantially from the actual situation, or even lead to biologically opposite conclusions. There is currently no systematic method for identifying and filtering low-quality cells.

Disclosure of Invention

In order to solve at least one technical problem mentioned in the background art, the present invention aims to provide a method, a medium and a device for filtering low-quality cells in a single-cell transcriptome, which identify and filter low-quality cells in the single-cell transcriptome, improve the filtering standard and accuracy of the single-cell transcriptome, and enhance the reliability of data.

In order to achieve the purpose, the invention provides the following technical scheme:

A method for filtering low-quality cells of a single-cell transcriptome, comprising the steps of:

S101, grouping cells based on the real cell expression profile;

S1041, for each cell population, averaging the expression levels of its cells gene by gene to generate the characteristic expression profile of the cell population;

S1042, randomly combining the characteristic expression profiles of the cell populations in pairs to generate a certain number of artificial multicellular cells;

S1043, combining the artificial multicellular expression profile with the real cell expression profile, and calculating the distance between cells;

S1044, setting a plurality of equidistant neighborhoods within a specified range, and calculating the artificial multicellular proportion of each real cell in each neighborhood;

S1045, for each neighborhood, computing the distribution of the artificial multicellular proportion and the bimodal coefficient of that distribution, and taking the neighborhood with the largest bimodal coefficient as the optimal neighborhood;

S1046, in the optimal neighborhood, identifying a prescribed number of real cells having the largest artificial multicellular proportion as multicellular, and deleting them from the real cell expression profile.

Further, the method for pairwise combination of the characteristic expression profiles comprises the following steps:

Y=a1*X1+a2*X2

wherein Y is the generated artificial multicellular, and X1 and X2 are characteristic expression profiles of cell populations; a1 and a2 are scaling coefficients, one of a1 and a2 is set to 1, and the other is set to a random value larger than 0 and smaller than 1.

Further, the distance between the cells is a Euclidean distance or a Manhattan distance.

Further, the artificial multicellular proportion is: the ratio of the number of artificial multicellular cells in the neighborhood to the total number of cells in the neighborhood, within the combined expression profile.

Further, the bimodal coefficient is:

$$BC = \frac{s^{2} + 1}{k + \dfrac{3\,(N_{real} - 1)^{2}}{(N_{real} - 2)(N_{real} - 3)}}$$

wherein BC is the bimodal coefficient; s and k are respectively the skewness and kurtosis of the artificial multicellular proportion distribution; and N_real is the number of real cells.

Further, the prescribed number is determined as follows: a multicellular rate is set, and the product of the number of real cells and the multicellular rate is the prescribed number of real cells identified as multicellular.

Further, in S1044, the neighborhood is set as follows: 100 equidistant neighbourhoods are arranged in the range of 0.0001 to 0.01.

A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method for low quality cell filtration of a single-cell transcriptome as described above.

A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of low quality cell filtration of a single-cell transcriptome as described above when executing the computer program.

Compared with the prior art, the invention has the following beneficial effects: the method generates a certain number of artificial multicellular cells, computes for each configured neighborhood the distribution of the artificial multicellular proportion of the real cells, determines the optimal neighborhood, and then, in the optimal neighborhood, identifies the real cells with the largest artificial multicellular proportion as multicellular and deletes these multicellular (low-quality) cells from the real cell expression profile, thereby improving the filtering standard and accuracy for single-cell transcriptome data and enhancing the reliability of the data.

Drawings

FIG. 1 is an overall flow chart of an embodiment of the present invention.

FIG. 2 is a flow chart of a multi-cell filtration process according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment one:

Referring to FIG. 1, this embodiment provides a method for filtering low-quality cells of a single-cell transcriptome. The implementation mainly includes steps S100 to S104, which are described in detail as follows:

Step S100: the original expression profile (also called the real cell expression profile) may contain non-cell data, so the original expression profile is first filtered with the Cell Ranger software to remove the non-cell data, generating a primary filtered cell expression profile. If the original expression profile does not contain non-cell data, it can be used directly in the next step without filtering.

The original expression profile of this example is shown in Table 1 below.

Table 1: original expression profiles

In Table 1 above, C1, C301, C15001 and so on are cell identifiers, G1, G2, G53 and so on are gene identifiers, and the values in the body of the table are the expression levels of the genes.

The primary filtered cell expression profile of this example is shown in Table 2 below.

Table 2: primary filtered cell expression profile

Compared with Table 1 above, cells C10001 and C15001 have been filtered out of the primary filtered cell expression profile.

Step S101, based on the primary filtered cell expression profile or the original expression profile, normalizing, reducing the dimensionality of, and grouping the cells using Seurat. The normalization method is LogNormalize with the scale factor parameter set to 10000; highly variable genes are selected with the "vst" method; PCA is used to reduce the data to 50 principal components (PCs); clustering uses the FindNeighbors and FindClusters functions, with the resolution parameter of FindClusters set to 0.8. At this point, all cells have been divided into different cell populations. The grouping results for the primary filtered cell expression profile of this example are shown in Table 3 below:

Table 3: clustering results of the primary filtered cell expression profile

In Table 3 above, cells C1 and C601 are assigned to cell population A; similarly, the other cells are assigned to cell populations B, C and D.
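To make the normalization setting concrete, the following NumPy sketch reproduces the arithmetic of the LogNormalize method as documented for Seurat: each cell's counts are divided by that cell's total, multiplied by the scale factor, and transformed with log1p. It is not Seurat code; the function name `log_normalize` and the toy matrix are hypothetical, and the guard for cells with zero total counts is an added assumption.

```python
import numpy as np


def log_normalize(counts, scale_factor=10000.0):
    """Log-normalize an (n_cells, n_genes) matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # total counts per cell
    totals[totals == 0] = 1.0                   # assumption: leave empty cells at 0
    return np.log1p(counts / totals * scale_factor)


# Hypothetical toy matrix: 2 cells x 2 genes.
counts = np.array([[1, 3],
                   [0, 5]])
norm = log_normalize(counts)
```

With a scale factor of 10000, a gene that accounts for a quarter of a cell's counts maps to log(1 + 2500).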

Cell debris and dead/dying cells are filtered in steps S102 to S103. It is worth noting that cell debris and dead/dying cells may be filtered together or removed separately.

Step S102, setting four gene sets A_gene, A_mt, A_active and A_antioxi, and calculating, for each cell population, the expressed gene score S_gene, the mitochondrial score S_mt, the activity score S_active and the antioxidant score S_antioxi.

A_gene is the set of all genes in the primary filtered cell expression profile (G1 to G53 in Table 3 above). S_gene is the average, over all cells in the cell population, of the number of expressed genes (i.e., genes whose expression level is greater than 0).

A_mt is the set of mitochondrial genes of the corresponding species (G1 and G2 in Table 3 above). S_mt is the average, over all cells in the cell population, of the A_mt gene expression ratio, where the A_mt gene expression ratio of a single cell = (sum of the expression levels of the A_mt genes in that cell / sum of the expression levels of the A_gene genes in that cell) × 100%.

A_active is the set of housekeeping genes of the corresponding species (G6 and G7 in Table 3 above), for example the human ACTB and GAPDH genes. S_active is the average, over all cells in the cell population, of the mean A_active gene expression, where the mean A_active gene expression of a single cell = sum of the expression levels of the A_active genes in that cell / number of A_active genes.

A_antioxi is the set of antioxidant genes of the corresponding species (G20 and G21 in Table 3 above), for example the human SOD1, SOD2, SOD3, CAT, GPX1, GPX2, GPX3, GPX4, GPX5, GPX6, GPX7, GPX8, NQO1 and NFE2L2 genes. S_antioxi is the average, over all cells in the cell population, of the mean A_antioxi gene expression, where the mean A_antioxi gene expression of a single cell = sum of the expression levels of the A_antioxi genes in that cell / number of A_antioxi genes.

In summary, for each cell population the corresponding expressed gene score S_gene, mitochondrial score S_mt, activity score S_active and antioxidant score S_antioxi are obtained. Table 4 below shows the resulting scores.

Table 4: the four scores of each cell population
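The four scores defined in step S102 can be computed as in the following sketch. The function name `population_scores`, the column-index arguments and the toy matrix are hypothetical, the input is assumed to hold only the cells of one population, and the guard against cells with zero total expression is an added assumption.

```python
import numpy as np


def population_scores(expr, mt_idx, active_idx, antioxi_idx):
    """Return (S_gene, S_mt, S_active, S_antioxi) for ONE cell population.

    expr: (n_cells, n_genes) expression matrix over the full gene set A_gene.
    mt_idx, active_idx, antioxi_idx: column indices of A_mt, A_active, A_antioxi.
    S_mt is returned in percent.
    """
    expr = np.asarray(expr, dtype=float)
    n_expressed = (expr > 0).sum(axis=1)             # expressed genes per cell
    total = expr.sum(axis=1)                         # A_gene expression sum per cell
    safe_total = np.where(total > 0, total, 1.0)     # assumption: avoid 0/0
    mt_ratio = expr[:, mt_idx].sum(axis=1) / safe_total * 100.0
    active_mean = expr[:, active_idx].mean(axis=1)   # sum / number of A_active genes
    antioxi_mean = expr[:, antioxi_idx].mean(axis=1)
    return (n_expressed.mean(), mt_ratio.mean(),
            active_mean.mean(), antioxi_mean.mean())


# Hypothetical toy population: 2 cells x 4 genes, where column 0 is
# mitochondrial, column 1 housekeeping and column 2 antioxidant.
expr = np.array([[1, 4, 2, 3],
                 [0, 6, 0, 4]])
s_gene, s_mt, s_active, s_antioxi = population_scores(expr, [0], [1], [2])
```

In the toy data the two cells express 4 and 2 genes, so S_gene is 3; the mitochondrial ratios are 10% and 0%, so S_mt is 5%.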

Step S103, setting thresholds G_gene, G_mt, G_active and G_antioxi corresponding to S_gene, S_mt, S_active and S_antioxi, and determining the type of each cell population.

When S_gene is less than G_gene, the cell population is judged to be cell debris.

When S_mt is greater than G_mt and S_antioxi is less than G_antioxi, the cell population is judged to be dead/dying cells.

When S_active is less than G_active, the cell population is judged to be dead/dying cells.

G_gene, G_mt, G_active and G_antioxi are typically set to 500, 25%, 2.

As shown in Table 4 above, cell population D (which contains cell C8001) satisfies all three conditions at the same time, so it is judged to be both cell debris and dead/dying cells.
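The three rules of step S103 amount to a small decision function, sketched below. The function name, the label strings and the example scores are hypothetical. The thresholds are passed in explicitly; the values used in the example are illustrative only, and the G_antioxi value in particular is assumed, because the text lists only three typical values for the four thresholds.

```python
def classify_population(s_gene, s_mt, s_active, s_antioxi,
                        g_gene, g_mt, g_active, g_antioxi):
    """Return the set of low-quality labels that apply to one cell population."""
    labels = set()
    if s_gene < g_gene:
        labels.add("debris")
    # Either condition alone marks the population as dead/dying.
    if (s_mt > g_mt and s_antioxi < g_antioxi) or s_active < g_active:
        labels.add("dead_or_dying")
    return labels


# Illustrative thresholds (g_antioxi is an assumed value, see above).
thresholds = dict(g_gene=500, g_mt=25.0, g_active=2.0, g_antioxi=2.0)

# Hypothetical scores: one population failing all rules, one passing all.
bad = classify_population(300, 40.0, 1.0, 0.5, **thresholds)
good = classify_population(2000, 5.0, 10.0, 3.0, **thresholds)
```

A population with an empty label set is kept; any non-empty set means its cells are removed in the secondary filtering.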

Cells judged to be cell debris or dead/dying cells (meeting either condition is sufficient for deletion) are deleted from the primary filtered cell expression profile, generating a secondary filtered cell expression profile. Table 5 shows the secondary filtered cell expression profile of this example.

Table 5: secondary filtered cell expression profile

The cells in both the primary filtered cell expression profile and the secondary filtered cell expression profile are real cells, so these profiles are also referred to as real cell expression profiles.

Step S104, identifying the multicellular among the real cells based on the simulated multicellular characteristic expression profiles and a k-nearest-neighbor (kNN) algorithm, and filtering the multicellular out of the secondary filtered cell expression profile to generate the final filtered expression profile. The filtering of multicellular may be carried out independently or after the filtering of cell debris or dead/dying cells.

Referring to FIG. 2, step S104 is implemented through the following steps S1041 to S1046 (the grouping step has already been completed in step S101):

S1041, for each cell population, averaging the expression levels of its cells gene by gene (with rounding) to generate the characteristic expression profile of the cell population.

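Taking cell population A as an example, if it contains only the 2 cells shown in Table 5, its characteristic expression profile is as follows:

| Cell population/gene | G1 | G2 | … | G6 | G7 | … | G20 | G21 | … | G51 | G52 | G53 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 2 | 4 | … | 13 | 18 | … | 4 | 6 | … | 740 | 9 | 5 |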
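The averaging of step S1041 can be sketched as follows. The function name and the toy values are hypothetical (they are not taken from Table 5), and "rounding" is assumed here to mean rounding half up, which for non-negative values is `floor(x + 0.5)`; NumPy's own `np.rint` rounds halves to the nearest even number instead.

```python
import numpy as np


def characteristic_profile(expr):
    """Gene-wise mean of an (n_cells, n_genes) matrix, rounded half up.

    Assumes non-negative expression values.
    """
    mean = np.asarray(expr, dtype=float).mean(axis=0)
    return np.floor(mean + 0.5).astype(int)


# Hypothetical two-cell population with three genes.
cells = np.array([[2, 4, 12],
                  [3, 4, 15]])
profile = characteristic_profile(cells)
```

The gene-wise means here are 2.5, 4.0 and 13.5, which round half up to 3, 4 and 14.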

S1042, randomly combining the characteristic expression profiles of all cell populations in pairs to generate artificial multicellular cells at a given proportion P_N, which is usually set to 25%, where P_N = number of artificial multicellular cells / (number of artificial multicellular cells + number of real cells) × 100%.

The characteristic expression profiles are combined pairwise as follows:

Y=a1*X1+a2*X2

wherein Y is the generated artificial multicellular cell, and X1 and X2 are characteristic expression profiles of cell populations; a1 and a2 are scaling coefficients, one of which is set to 1 and the other to a random value greater than 0 and less than 1. That is, when a1 is 1, a2 is a random value between 0 and 1, and when a2 is 1, a1 is a random value between 0 and 1. Each artificial multicellular cell generated in this way consists of one complete cell and one partial cell, which simulates how multicellular droplets arise in actual experiments.
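A minimal sketch of step S1042 is given below. The function name, the fixed seed and the toy profiles are hypothetical. Three points are assumptions not stated in the text: the two combined populations are drawn as distinct populations, which of a1 and a2 equals 1 is chosen with equal probability, and the number of artificial profiles is obtained by solving P_N = n_art / (n_art + n_real) for n_art and rounding.

```python
import numpy as np


def make_artificial_multicellular(profiles, n_real, p_n=0.25, seed=0):
    """Generate artificial multicellular profiles Y = a1*X1 + a2*X2.

    profiles: (n_groups, n_genes) characteristic expression profiles.
    n_real:   number of real cells in the expression profile.
    """
    profiles = np.asarray(profiles, dtype=float)
    n_groups = profiles.shape[0]
    if n_groups < 2:
        raise ValueError("at least two cell populations are required")
    rng = np.random.default_rng(seed)
    # Solve P_N = n_art / (n_art + n_real) for n_art.
    n_art = int(round(p_n * n_real / (1.0 - p_n)))
    tiny = np.nextafter(0.0, 1.0)  # smallest positive float, keeps frac > 0
    out = np.empty((n_art, profiles.shape[1]))
    for i in range(n_art):
        g1, g2 = rng.choice(n_groups, size=2, replace=False)
        frac = rng.uniform(tiny, 1.0)      # random value in (0, 1)
        if rng.random() < 0.5:
            a1, a2 = 1.0, frac             # X1 complete, X2 partial
        else:
            a1, a2 = frac, 1.0             # X1 partial, X2 complete
        out[i] = a1 * profiles[g1] + a2 * profiles[g2]
    return out


# Hypothetical profiles: each population expresses one marker gene.
profiles = np.array([[10, 0, 0],
                     [0, 10, 0],
                     [0, 0, 10]])
artificial = make_artificial_multicellular(profiles, n_real=300)
```

With 300 real cells and P_N = 25%, 100 artificial profiles are produced, each carrying one marker at full strength and a second marker at reduced strength.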

S1043, combining the artificial multicellular expression profile with the secondary filtered cell expression profile, and using Seurat to normalize the combined profile again and reduce its dimensionality by PCA. Based on the PCA result, the pairwise distances between cells are calculated, usually as Euclidean or Manhattan distances; other distance measures can serve the same purpose.

S1044, setting 100 equidistant neighborhoods P_kn within the range of 0.0001 to 0.01, and calculating, for each neighborhood P_kn, the artificial multicellular proportion P_ANN of each real cell within that neighborhood. That is, among the (N_merge * P_kn) cells nearest to the real cell, where N_merge is the total number of cells in the combined expression profile, the number of artificial multicellular cells N_ANN is counted, and P_ANN = N_ANN / (N_merge * P_kn) is calculated.
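Steps S1043 and S1044 can be illustrated with the following brute-force sketch. The function name and toy coordinates are hypothetical, and several details are assumptions: the neighborhood size N_merge * P_kn is rounded to an integer k of at least 1, P_ANN is computed as N_ANN / k, and a cell is not counted as its own neighbor. The dense pairwise distance array is only practical for small inputs.

```python
import numpy as np


def artificial_proportion(coords, is_artificial, p_kn, metric="euclidean"):
    """Return P_ANN for every real cell, in the order the real cells appear.

    coords:        (n_cells, n_dims) PCA coordinates of the combined profile.
    is_artificial: boolean array, True for artificial multicellular profiles.
    p_kn:          neighborhood size as a fraction of all cells.
    """
    coords = np.asarray(coords, dtype=float)
    is_artificial = np.asarray(is_artificial, dtype=bool)
    n_merge = coords.shape[0]
    k = max(1, int(round(n_merge * p_kn)))      # neighborhood size in cells
    diff = coords[:, None, :] - coords[None, :, :]
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=2)
    else:
        raise ValueError("metric must be 'euclidean' or 'manhattan'")
    np.fill_diagonal(dist, np.inf)              # assumption: exclude the cell itself
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest cells
    n_ann = is_artificial[nearest].sum(axis=1)  # artificial neighbors per cell
    return n_ann[~is_artificial] / k


# The 100 equidistant neighborhoods named in the text.
neighborhoods = np.linspace(0.0001, 0.01, 100)

# Hypothetical 1-D toy data; p_kn is deliberately large so that k = 2.
coords = np.array([[0.0], [10.0], [11.0], [0.1], [0.2], [10.5]])
is_artificial = np.array([False, False, False, True, True, True])
p_ann = artificial_proportion(coords, is_artificial, p_kn=0.34)
```

In the toy data the real cell at 0.0 has two artificial neighbors, while the real cells at 10.0 and 11.0 each have one artificial and one real neighbor.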

S1045, for each neighborhood P_kn, collecting the P_ANN values of all real cells into a distribution and calculating the skewness s and the kurtosis k of that distribution. The quantities used in the calculation are: n, the number of P_ANN values under the same neighborhood P_kn; p_i, the i-th P_ANN value under that neighborhood; M, the mean of all P_ANN values under that neighborhood; and SD, the standard deviation of all P_ANN values under that neighborhood.
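The expressions for s and k are not legible in this text. For orientation, one common convention that uses exactly the symbols n, p_i, M and SD defined above is given below. It is an assumption rather than a reproduction of the original expressions; in particular, whether k is the excess kurtosis (the form with the subtracted 3, which bimodality coefficients of this type normally expect) and whether small-sample corrections are applied cannot be determined from this text.

```latex
s = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{p_i - M}{SD}\right)^{3},
\qquad
k = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{p_i - M}{SD}\right)^{4} - 3
```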

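Then, for the P_ANN distribution under each neighborhood, the bimodal coefficient BC is calculated:

$$BC = \frac{s^{2} + 1}{k + \dfrac{3\,(N_{real} - 1)^{2}}{(N_{real} - 2)(N_{real} - 3)}}$$

wherein N_real is the number of real cells. The neighborhood size at which the bimodal coefficient BC is largest is selected as the optimal neighborhood P_K that is finally used.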
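The selection of the optimal neighborhood in step S1045 can be sketched as follows. The function names and toy distributions are hypothetical. The sketch assumes the widely used sample bimodality coefficient (s^2 + 1) / (k + 3(n - 1)^2 / ((n - 2)(n - 3))), with s and k taken as plain moment estimates and k as excess kurtosis, and it assumes that a constant distribution is given a coefficient of 0; none of these choices is confirmed by the text.

```python
import numpy as np


def bimodality_coefficient(values):
    """Sample bimodality coefficient of a 1-D array of P_ANN values."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 4:
        raise ValueError("at least 4 values are required")
    sd = x.std()                 # population standard deviation (assumption)
    if sd == 0:
        return 0.0               # assumption: constant data has no bimodal signal
    z = (x - x.mean()) / sd
    s = (z ** 3).mean()          # skewness
    k = (z ** 4).mean() - 3.0    # excess kurtosis (assumption)
    return (s ** 2 + 1.0) / (k + 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def pick_optimal_neighborhood(p_ann_by_neighborhood):
    """Return the P_kn whose P_ANN distribution has the largest coefficient.

    p_ann_by_neighborhood: dict mapping P_kn -> array of P_ANN values.
    """
    return max(p_ann_by_neighborhood,
               key=lambda p_kn: bimodality_coefficient(p_ann_by_neighborhood[p_kn]))


# Hypothetical distributions: one flat, one sharply two-peaked.
flat = np.linspace(0.0, 1.0, 100)
two_peaks = np.array([0.0] * 50 + [1.0] * 50)
best = pick_optimal_neighborhood({0.001: flat, 0.005: two_peaks})
```

The two-peaked distribution yields the larger coefficient, so its neighborhood is the one selected.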

Under the optimal neighborhood P_K, the boundary between single cells and multicellular is as clear as possible: the cells are separated into single cells and multicellular as fully as possible, and the number of cells in an ambiguous intermediate state is minimized, so that the classification result is as accurate as possible.

S1046, setting the multicellular rate R_doub and calculating the expected number of multicellular cells E_doub = N_real * R_doub. With the neighborhood size set to the optimal neighborhood P_K, the P_ANN of each real cell is calculated, and the E_doub real cells with the largest P_ANN are identified as multicellular and deleted from the secondary filtered cell expression profile to generate the final filtered cell expression profile.
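The final selection in step S1046 can be sketched as follows. The function name, the example rate and the toy data are hypothetical; rounding E_doub to the nearest integer and breaking ties by input order are assumptions.

```python
import numpy as np


def remove_predicted_multicellular(expr, p_ann, r_doub):
    """Drop the E_doub real cells with the largest P_ANN.

    expr:   (n_real, n_genes) real cell expression profile.
    p_ann:  (n_real,) P_ANN of each real cell at the optimal neighborhood.
    r_doub: multicellular rate R_doub, as a fraction.
    Returns (filtered expression profile, boolean mask of removed cells).
    """
    expr = np.asarray(expr)
    p_ann = np.asarray(p_ann, dtype=float)
    n_real = p_ann.size
    e_doub = int(round(n_real * r_doub))        # E_doub = N_real * R_doub
    order = np.argsort(-p_ann, kind="stable")   # descending P_ANN
    flagged = np.zeros(n_real, dtype=bool)
    flagged[order[:e_doub]] = True
    return expr[~flagged], flagged


# Hypothetical data: 5 real cells, 2 genes, an illustrative rate of 40%.
expr = np.arange(10).reshape(5, 2)
p_ann = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
filtered, flagged = remove_predicted_multicellular(expr, p_ann, r_doub=0.4)
```

Here E_doub is 2, so the two cells with P_ANN of 0.9 and 0.8 are removed and the other three are kept.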

Embodiment two:

A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method for filtering low-quality cells of a single-cell transcriptome described in embodiment one.

Embodiment three:

A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for filtering low-quality cells of a single-cell transcriptome described in embodiment one when executing the computer program.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A method for filtering low-quality cells of a single-cell transcriptome, comprising the steps of:

S101, grouping cells based on the real cell expression profile;

setting four gene sets A_gene, A_mt, A_active and A_antioxi, and calculating, for each cell population, the expressed gene score S_gene, the mitochondrial score S_mt, the activity score S_active and the antioxidant score S_antioxi; A_gene is the set of all genes in the primary filtered cell expression profile, and S_gene is the average of the number of expressed genes of all cells in the cell population; A_mt is the set of mitochondrial genes of the corresponding species, and S_mt is the average of the A_mt gene expression ratio of all cells in the cell population; A_active is the set of housekeeping genes of the corresponding species, and S_active is the average of the mean A_active gene expression of all cells in the cell population; A_antioxi is the set of antioxidant genes of the corresponding species, and S_antioxi is the average of the mean A_antioxi gene expression of all cells in the cell population; setting thresholds G_gene, G_mt, G_active and G_antioxi corresponding to S_gene, S_mt, S_active and S_antioxi, and determining the cell population type; deleting the cells judged to be cell debris and dead/dying cells from the real cell expression profile;

S1042, randomly combining the characteristic expression profiles of the cell populations in pairs to generate a certain number of artificial multicellular cells;

S1046, in the optimal neighborhood, identifying the prescribed number of real cells with the largest artificial multicellular proportion as multicellular, and deleting them from the real cell expression profile;

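the bimodal coefficient is:

$$BC = \frac{s^{2} + 1}{k + \dfrac{3\,(N_{real} - 1)^{2}}{(N_{real} - 2)(N_{real} - 3)}}$$

wherein BC is the bimodal coefficient; s and k are respectively the skewness and kurtosis of the artificial multicellular proportion distribution; and N_real is the number of real cells.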

2. The method for filtering low-quality cells of a single-cell transcriptome according to claim 1, wherein the characteristic expression profiles are combined pairwise by the following method:

Y=a1*X1+a2*X2

wherein Y is the generated artificial multicellular, and X1 and X2 are characteristic expression profiles of cell populations; one of a1 and a2 is set to 1, and the other is set to a random value greater than 0 and less than 1.

3. The method for filtering low-quality cells of a single-cell transcriptome according to claim 1, wherein the distance between the cells is a Euclidean distance or a Manhattan distance.

4. The method for filtering low-quality cells of a single-cell transcriptome according to claim 1, wherein the artificial multicellular proportion is: the ratio of the number of artificial multicellular cells in the neighborhood to the total number of cells in the neighborhood, within the combined expression profile.

5. The method for filtering low-quality cells of a single-cell transcriptome according to claim 1, wherein the prescribed number is determined as follows: a multicellular rate is set, and the product of the number of real cells and the multicellular rate is the prescribed number of real cells identified as multicellular.

6. The method for filtering low-quality cells of a single-cell transcriptome according to claim 1, wherein in S1044 the neighborhoods are set as follows: 100 equidistant neighborhoods are set within the range of 0.0001 to 0.01.

7. A computer storage medium on which a computer program is stored which, when being executed by a processor, carries out the method for low quality cell filtration of a single-cell transcriptome of any of claims 1 to 6.

8. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the method for low quality cell filtration in a single-cell transcriptome of any of claims 1 to 6.