CN116486916A

CN116486916A - Single cell transcriptome dying cell and multicellular filtration method, medium and equipment

Info

Publication number: CN116486916A
Application number: CN202310175918.9A
Authority: CN
Inventors: 陈哲名; 郎秋蕾; 韩斐然
Original assignee: Hangzhou Lianchuan Biotechnology Co ltd
Current assignee: Hangzhou Lianchuan Biotechnology Co ltd
Priority date: 2022-11-03
Filing date: 2022-11-03
Publication date: 2023-07-25
Anticipated expiration: 2042-11-03
Also published as: CN116805511A; CN115440303A; CN116486916B; CN115440303B

Abstract

The invention discloses a single-cell transcriptome low-quality cell filtering method and relates to biological data processing. The method comprises the following steps: grouping cells; averaging the expression level of each gene within each cell population to generate a characteristic expression profile of the cell population; randomly combining the characteristic expression profiles of the cell populations in pairs to generate artificial multicellular; merging the artificial multicellular expression profile with the real cell expression profile and calculating the distance between cells; setting a plurality of equidistant neighborhoods within a specified range and, under each neighborhood, calculating the artificial multicellular proportion of each real cell within the neighborhood; computing the distribution of the artificial multicellular proportion under each neighborhood, calculating its bimodal coefficient, and taking the neighborhood with the largest bimodal coefficient as the optimal neighborhood; and, under the optimal neighborhood, identifying the specified number of real cells with the largest artificial multicellular proportion as multicellular and deleting them from the real cell expression profile. The method improves the filtering standard and accuracy for single-cell transcriptome data and enhances the reliability of the data.

Description

Single cell transcriptome dying cell and multicellular filtration method, medium and equipment

Cross Reference to Related Applications

This application is filed on the basis of application No. 2022113673004, filed on November 3, 2022 and entitled "Single cell transcriptome low quality cell filtration method, medium and equipment".

Technical Field

The invention relates to a biological data processing method, in particular to a single-cell transcriptome dying cell and multicellular filtration method, medium and equipment.

Background

Single-cell transcriptome sequencing based on microfluidic technology enables quantification of gene expression in tens of thousands of cells in a single experiment. It identifies single cells mainly by sequence tags: its core technique is to add a unique sequence tag to each cell and, during sequencing, to treat nucleic acid sequences carrying the same tag as originating from the same cell. The 10X Genomics single-cell transcriptome sequencing platform is currently widely used. The platform uses microfluidics, oil-droplet encapsulation, barcode labeling and related techniques to achieve high-throughput cell sorting and capture, can isolate and label from 500 to tens of thousands of single cells at a time, and yields the transcriptome information of each cell after sequencing. Its advantages include high cell throughput, low library construction cost and a short capture cycle.

A typical single-cell transcriptome sequencing protocol is as follows. First, a cell suspension is prepared, mixed with magnetic beads using a microfluidic chip on the corresponding platform instrument, and encapsulated in oil droplets. Each bead carries a unique nucleotide sequence, namely a barcode tag, which labels an individual cell. Each barcode tag is further linked to a unique molecular identifier (UMI) consisting of a nucleotide sequence, and each UMI labels one mRNA transcript. After reverse transcription, PCR amplification, library construction and sequencing, the barcode tag and UMI carried by each sequence in the sequencing data indicate whether sequences originate from the same cell and the same mRNA molecule, which reduces the effect of PCR amplification bias across different molecules. By matching and counting barcodes and UMIs, the gene expression information is summarized in a count matrix, giving the transcriptome expression profile of each individual cell.
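The barcode and UMI counting described above can be illustrated with a minimal sketch. The read tuples, barcode strings and the function name `build_count_matrix` are hypothetical and chosen only for illustration; real pipelines additionally correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict

def build_count_matrix(reads):
    """Collapse reads into UMI counts per (cell barcode, gene).

    `reads` is an iterable of (barcode, umi, gene) tuples. Reads that share
    all three fields are treated as PCR duplicates of one mRNA molecule.
    """
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)

    counts = defaultdict(dict)
    for (barcode, gene), umi_set in umis.items():
        counts[barcode][gene] = len(umi_set)  # one count per distinct UMI
    return dict(counts)

# Hypothetical reads: the second read is a PCR duplicate of the first.
reads = [
    ("AAAC", "UMI1", "G1"),
    ("AAAC", "UMI1", "G1"),
    ("AAAC", "UMI2", "G1"),
    ("AAAC", "UMI3", "G2"),
    ("TTTG", "UMI1", "G1"),
]
matrix = build_count_matrix(reads)
print(matrix)
```

Counting distinct UMIs rather than reads is what removes the PCR duplication effect mentioned above.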

Single-cell experiments usually obtain single cells in bulk by dissociating and disrupting biological tissue, which often produces many cell fragments or apoptotic cells. In droplet-based single-cell transcriptome techniques, two or more cells (or an intact cell plus cell debris) can also end up in one droplet. Single-cell transcriptome data contain information on hundreds of thousands or millions of droplets, but the barcode of a droplet does not by itself indicate whether the droplet contains a cell, or whether that cell is a cell fragment, a dead/dying cell or multicellular; that is, cell quality cannot be determined automatically. Cell quality strongly influences the results of subsequent analyses, so the type of droplet represented by each barcode needs to be determined before data analysis. Cell Ranger, the official software of 10X Genomics, can only determine whether a barcode corresponds to an empty droplet and cannot assess cell quality, which may cause the results of single-cell transcriptome analysis to deviate substantially from the actual situation, or even to yield biologically opposite conclusions. There is currently no systematic method for identifying and filtering low-quality cells.

Disclosure of Invention

In order to solve at least one technical problem mentioned in the background art, the invention aims to provide a single cell transcriptome low quality cell filtering method, medium and equipment, which are used for identifying and filtering low quality cells in single cell transcriptome data, so that the filtering standard and accuracy of the single cell transcriptome data are improved, and the reliability of the data is enhanced.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a single cell transcriptome low quality cell filtration method comprising the steps of:

S101, grouping cells based on a true cell expression profile;

S1041, averaging, gene by gene, the expression levels in the cell expression profile of each cell population to generate the characteristic expression profile of each cell population;

S1042, randomly combining the characteristic expression profiles of the cell populations in pairs to generate a certain number of artificial multicellular;

S1043, merging the artificial multicellular expression profile with the real cell expression profile, and calculating the distance between cells;

S1044, setting a plurality of equidistant neighborhoods within a specified range, and, under each neighborhood, calculating the artificial multicellular proportion of each real cell within the neighborhood;

S1045, computing the distribution of the artificial multicellular proportion under each neighborhood, calculating its bimodal coefficient, and taking the neighborhood with the largest bimodal coefficient as the optimal neighborhood;

S1046, under the optimal neighborhood, identifying the specified number of real cells with the largest artificial multicellular proportion as multicellular, and deleting them from the real cell expression profile.

Further, the method for combining the characteristic expression profiles in pairs comprises the following steps:

Y＝a1*X1+a2*X2

wherein Y is the generated artificial multicellular, X1 and X2 are the characteristic expression profiles of the cell population; a1 and a2 are proportionality coefficients, one of a1 and a2 is set to 1, and the other is set to a random value larger than 0 and smaller than 1.

Further, the distance between the cells is Euclidean distance or Manhattan distance.

Further, the artificial multicellular proportion is: the ratio of the number of artificial multicellular in the neighborhood to the total number of cells in the neighborhood, both counted in the combined expression profile.

Further, the bimodal coefficient is:

wherein BC is the bimodal coefficient; s and k are the skewness and kurtosis of the artificial multicellular proportion distribution, respectively; and N_real is the number of real cells.
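The expression for BC does not appear in this text. A form consistent with the symbols defined above is the standard sample bimodality coefficient. It is given here as an assumption, not as the formula of the original filing, and in this convention k denotes the excess kurtosis:

```latex
BC = \frac{s^{2} + 1}{k + \dfrac{3\,(N_{real} - 1)^{2}}{(N_{real} - 2)(N_{real} - 3)}}
```

Under this convention a uniform distribution gives BC close to 5/9 for large N_real, and larger values suggest a bimodal distribution.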

Further, the specified number is determined as follows: a multicellular rate is set, and the product of the number of real cells and the multicellular rate is taken as the number of real cells to be identified as multicellular.

Further, in S1044, the setting of the neighborhood is as follows: 100 equidistant neighborhoods are arranged in the range of 0.0001-0.01.

A computer storage medium having stored thereon a computer program which when executed by a processor implements a single cell transcriptome low quality cell filtration method as described above.

A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a single cell transcriptome low quality cell filtration method as described above when the computer program is executed.

Compared with the prior art, the invention has the beneficial effects that: according to the invention, a certain amount of artificial multicellular is generated, the artificial multicellular proportion distribution of each real cell in the neighborhood is counted for each set neighborhood, the optimal neighborhood is determined, then a plurality of real cells with the maximum artificial multicellular proportion are identified as multicellular under the optimal neighborhood, and the multicellular (low-quality cells) is deleted from the real cell expression profile, so that the filtering standard and the filtering accuracy of single-cell transcriptome data are improved, and the reliability of the data is enhanced.

Drawings

FIG. 1 is an overall flow chart of an embodiment of the present invention.

FIG. 2 is a flow chart of multi-cell filtration according to an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Embodiment one:

Referring to FIG. 1, this embodiment provides a single cell transcriptome low quality cell filtration method whose implementation mainly comprises steps S100 to S104, described in detail as follows:

In step S100, the original expression profile (also called the true cell expression profile) may contain non-cellular data, so the original expression profile is first filtered using the Cell Ranger software to remove the non-cellular data and generate a primary filtered cell expression profile. If the original expression profile contains no non-cellular data, it can be used directly for the next operation without filtering.

The original expression profile of this example is shown in table 1 below.

Table 1: original expression profile

In Table 1, C301, C15001 and the like are cell numbers, G1, G2, G53 and the like are gene numbers, and the values in the body of the table are gene expression levels.

The primary filtered cell expression profile of this example is shown in table 2 below.

Table 2: Primary filtered cell expression profile

Compared with Table 1 above, cells C10001 and C15001 have been removed from the primary filtered cell expression profile.

Step S101, based on the primary filtered cell expression profile or the original expression profile, the cells are normalized, dimension-reduced and clustered using the Seurat software. The normalization method is 'LogNormalize', with the scale.factor parameter set to 10000; highly variable genes are selected using the 'vst' method; PCA reduces the data to 50 principal components (PCs); grouping is performed using the FindNeighbors and FindClusters functions, with the resolution parameter set to 0.8. At this point, all cells have been divided into different cell populations. The clustering results for the primary filtered cell expression profile of this example are shown in Table 3 below:

Table 3: Grouping results for the primary filtered cell expression profile

In Table 3, cells C1 and C601 are assigned to cell population A, and the other cells are assigned to cell populations B, C and D, respectively.
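The log-normalization used in step S101 can be sketched in a few lines. In Seurat this method divides each count by the total counts of the cell, multiplies by the scale factor and applies log(1 + x). The sketch below is a plain-Python illustration of that arithmetic, not the Seurat implementation; the cell and its counts are hypothetical.

```python
import math

def log_normalize(cell_counts, scale_factor=10000):
    """Log-normalize one cell: log1p(count / total_counts * scale_factor)."""
    total = sum(cell_counts.values())
    if total == 0:
        # A cell without any counts has nothing to normalize.
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }

# Hypothetical raw counts for one cell (total = 10).
cell = {"G1": 2, "G2": 0, "G6": 8}
normalized = log_normalize(cell)
print(normalized)
```

Dividing by the per-cell total makes cells with different sequencing depths comparable before dimension reduction and clustering.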

Filtering cell debris and dead/dying cells is accomplished by steps S102 to S103. It is worth mentioning that cell debris and dead/dying cells can be filtered together or removed separately.

Step S102, setting four gene sets A_gene, A_mt, A_active and A_antioxi respectively, and calculating the expressed gene score S_gene, mitochondrial score S_mt, activity score S_active and antioxidant score S_antioxi of each cell population.

A_gene is the set of all genes in the primary filtered cell expression profile (including G1 to G53 in Table 3 above); S_gene is the average, over all cells in the cell population, of the number of expressed genes (i.e., the number of genes whose expression level is greater than 0).

A_mt is the mitochondrial gene set of the corresponding species (including G1 and G2 in Table 3 above); S_mt is the average, over all cells in the cell population, of the A_mt gene expression proportion, where the A_mt gene expression proportion = (sum of the expression levels of the A_mt genes in a single cell / sum of the expression levels of the A_gene genes in that cell) × 100%.

A_active is the housekeeping gene set of the corresponding species (including G6 and G7 in Table 3 above), such as the human ACTB and GAPDH genes; S_active is the average, over all cells in the cell population, of the average expression level of the A_active genes, where the average expression level of the A_active genes = sum of the expression levels of the A_active genes in a single cell / number of A_active genes.

A_antioxi is the antioxidant gene set of the corresponding species (including G20 and G21 in Table 3 above), such as the human SOD1, SOD2, SOD3, CAT, GPX1, GPX2, GPX3, GPX4, GPX5, GPX6, GPX7, GPX8, NQO1 and NFE2L2 genes. S_antioxi is the average, over all cells in the cell population, of the average expression level of the A_antioxi genes, where the average expression level of the A_antioxi genes = sum of the expression levels of the A_antioxi genes in a single cell / number of A_antioxi genes.

In summary, for each cell population, the corresponding expressed gene score S_gene, mitochondrial score S_mt, activity score S_active and antioxidant score S_antioxi are obtained. Table 4 below shows the scores obtained.

Table 4: Scores of each cell population

Score \ cell population	A	B	C	D
S_gene	3499	2169	1469	320
S_mt	1.37%	3.33%	5.33%	49.55%
S_active	16.9	10.5	21.5	1.7
S_antioxi	4	5	7	0.9
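The four scores of step S102 can be computed as in the following minimal sketch. The function name `population_scores`, the two example cells and their expression values are hypothetical; only the gene-set membership (G1 and G2 mitochondrial, G6 and G7 housekeeping, G20 and G21 antioxidant) follows the example in the text.

```python
def population_scores(cells, mt_genes, active_genes, antioxi_genes):
    """Return (S_gene, S_mt, S_active, S_antioxi) for one cell population.

    `cells` is a list of dicts mapping every gene of A_gene to its
    expression level in one cell. S_mt is returned in percent.
    """
    n = len(cells)
    s_gene = sum(sum(1 for v in c.values() if v > 0) for c in cells) / n

    def mt_percent(cell):
        total = sum(cell.values())
        if total == 0:
            return 0.0
        return 100.0 * sum(cell.get(g, 0) for g in mt_genes) / total

    def mean_expression(cell, genes):
        return sum(cell.get(g, 0) for g in genes) / len(genes)

    s_mt = sum(mt_percent(c) for c in cells) / n
    s_active = sum(mean_expression(c, active_genes) for c in cells) / n
    s_antioxi = sum(mean_expression(c, antioxi_genes) for c in cells) / n
    return s_gene, s_mt, s_active, s_antioxi

# Two hypothetical cells of one population (values are illustrative only).
cells = [
    {"G1": 2, "G2": 2, "G6": 10, "G7": 20, "G20": 4, "G21": 6, "G53": 56},
    {"G1": 0, "G2": 6, "G6": 12, "G7": 16, "G20": 2, "G21": 4, "G53": 160},
]
scores = population_scores(
    cells,
    mt_genes=["G1", "G2"],
    active_genes=["G6", "G7"],
    antioxi_genes=["G20", "G21"],
)
print(scores)
```

Each score is first computed per cell and then averaged over the population, matching the definitions of S_gene, S_mt, S_active and S_antioxi above.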

Step S103, setting thresholds G_gene, G_mt, G_active and G_antioxi corresponding to S_gene, S_mt, S_active and S_antioxi, and determining the cell population type.

When S_gene is less than G_gene, the cell population is judged to be cell debris.

When S_mt is greater than G_mt and S_antioxi is less than G_antioxi, the cell population is judged to be dead/dying cells.

When S_active is less than G_active, the cell population is judged to be dead/dying cells.

G_gene, G_mt, G_active and G_antioxi are typically set to 500, 25%, 2.

As shown in Table 4 above, cell population D (corresponding to cell C8001) satisfies all three of the above conditions simultaneously, and is therefore judged to be both cell debris and dead/dying cells.
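The three decision rules of step S103 can be written as a short sketch and applied to the values of Table 4. The text lists only three typical threshold values for four thresholds, so the assignment below (500 for G_gene, 25% for G_mt, and 2 for both G_active and G_antioxi) is an assumption made for illustration; the function name and label strings are likewise hypothetical.

```python
# Assumed thresholds: the source lists "500, 25%, 2"; using 2 for both
# G_active and G_antioxi is an assumption of this sketch.
THRESHOLDS = {"gene": 500, "mt": 25.0, "active": 2, "antioxi": 2}

def classify_population(s_gene, s_mt, s_active, s_antioxi, t=THRESHOLDS):
    """Return the set of low-quality labels that apply to a cell population."""
    labels = set()
    if s_gene < t["gene"]:
        labels.add("cell debris")
    if s_mt > t["mt"] and s_antioxi < t["antioxi"]:
        labels.add("dead/dying")
    if s_active < t["active"]:
        labels.add("dead/dying")
    return labels

# Scores from Table 4: (S_gene, S_mt in percent, S_active, S_antioxi).
table4 = {
    "A": (3499, 1.37, 16.9, 4),
    "B": (2169, 3.33, 10.5, 5),
    "C": (1469, 5.33, 21.5, 7),
    "D": (320, 49.55, 1.7, 0.9),
}
result = {name: classify_population(*s) for name, s in table4.items()}
print(result)
```

With these assumed thresholds only population D receives labels, which agrees with the judgment stated for Table 4.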

In the primary filtered cell expression profile, cells judged as cell debris and dead/dying cells (either of which can be deleted) are deleted, and a secondary filtered cell expression profile is generated. The secondary filtered cell expression profile of this example is shown in Table 5.

Table 5: secondary filtration cell expression profile

The cells in the above primary and secondary filtered cell expression profiles are both truly present and therefore are also called true cell expression profiles.

Step S104, based on simulated multicellular characteristic expression profiles, the multicellular among the real cells are identified using the k-nearest-neighbor (kNN) algorithm and filtered out of the secondary filtered cell expression profile to generate the final filtered expression profile. Multicellular filtration may be performed independently, or after the filtration of cell debris or dead/dying cells.

Referring to FIG. 2, this is implemented by steps S1041 to S1046 (the grouping step has already been completed in the aforementioned step S101):

S1041, taking the gene-wise average (rounded) of the expression levels of each cell population to generate the characteristic expression profile of the cell population.

Taking cell population A as an example, if it contains only the 2 cells shown in Table 5, its characteristic expression profile is as shown in the table below:

Cell population \ Gene	G1	G2	…	G6	G7	…	G20	G21	…	G51	G52	G53
A	2	4	…	13	18	…	4	6	…	740	9	5
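Step S1041 amounts to a rounded gene-wise mean, which the following sketch illustrates. The per-cell values for C1 and C601 are hypothetical (the source tables are not available in this text) and were chosen so that the result matches the first four entries of population A above; rounding half up is also an assumption, since the text only says "rounded".

```python
def characteristic_profile(cells, genes):
    """Gene-wise mean expression of a cell population, rounded half up."""
    n = len(cells)
    profile = {}
    for gene in genes:
        mean = sum(cell.get(gene, 0) for cell in cells) / n
        profile[gene] = int(mean + 0.5)  # assumed rounding rule (half up)
    return profile

# Hypothetical expression values for the two cells of population A.
c1 = {"G1": 1, "G2": 5, "G6": 12, "G7": 20}
c601 = {"G1": 3, "G2": 3, "G6": 14, "G7": 16}
profile_a = characteristic_profile([c1, c601], ["G1", "G2", "G6", "G7"])
print(profile_a)
```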

S1042, randomly combining the characteristic expression profiles of all cell populations in pairs to generate artificial multicellular at a certain proportion P_N, which is typically set to 25%; the proportion of artificial multicellular is P_N = number of artificial multicellular / (number of artificial multicellular + number of real cells) × 100%.

The characteristic expression profiles are combined as follows:

Y = a1*X1 + a2*X2

wherein Y is the generated artificial multicellular, and X1 and X2 are the characteristic expression profiles of two cell populations; a1 and a2 are proportionality coefficients, one of which is set to 1 and the other to a random value greater than 0 and less than 1. That is, when a1 is 1, a2 is a random value between 0 and 1, and when a2 is 1, a1 is a random value between 0 and 1, so that 1 < a1 + a2 < 2. Each artificial multicellular generated in this way contains one intact cell and one incomplete cell, which simulates how multicellular arise in an actual experiment.
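Step S1042 can be sketched as follows. The number of artificial multicellular follows from rearranging the definition of P_N; the function names, the example profiles and the choice to always pair two different cell populations are assumptions of this sketch.

```python
import random

def artificial_count(n_real, p_n=0.25):
    """Solve P_N = n_art / (n_art + n_real) for n_art."""
    return round(p_n * n_real / (1.0 - p_n))

def make_artificial(profiles, count, rng):
    """Generate `count` artificial multicellular as Y = a1*X1 + a2*X2.

    `profiles` maps a population name to its characteristic expression
    profile (list of values in a fixed gene order).
    """
    names = sorted(profiles)
    generated = []
    for _ in range(count):
        name1, name2 = rng.sample(names, 2)  # assumed: two distinct populations
        minor = rng.random()                 # value in [0, 1)
        while minor == 0.0:                  # enforce strictly greater than 0
            minor = rng.random()
        # One coefficient is 1, the other is the random value in (0, 1).
        a1, a2 = (1.0, minor) if rng.random() < 0.5 else (minor, 1.0)
        y = [a1 * x1 + a2 * x2
             for x1, x2 in zip(profiles[name1], profiles[name2])]
        generated.append(
            {"a1": a1, "a2": a2, "parents": (name1, name2), "profile": y}
        )
    return generated

# Hypothetical characteristic profiles over four genes.
profiles = {"A": [2, 4, 13, 18], "B": [0, 1, 30, 2], "C": [5, 0, 7, 9]}
rng = random.Random(0)
artificial = make_artificial(profiles, artificial_count(300), rng)
print(len(artificial), artificial[0])
```

For 300 real cells and P_N = 25%, this yields 100 artificial multicellular, since 100 / (100 + 300) = 25%.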

S1043, merging the artificial multicellular expression profile with the secondary filtered cell expression profile, and re-normalizing and performing PCA dimension reduction using the Seurat software. Based on the PCA result, the distance between cells is calculated; the Euclidean distance or the Manhattan distance is commonly used, and other distance metrics can serve a similar purpose.
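The two distance measures named in S1043 can be sketched as below. The coordinates stand for hypothetical PCA embeddings of two cells; the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two PCA coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan (L1) distance between two PCA coordinate vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical 2-dimensional PCA coordinates of two cells.
cell_a = (0.0, 0.0)
cell_b = (3.0, 4.0)
print(euclidean(cell_a, cell_b), manhattan(cell_a, cell_b))
```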

S1044, setting 100 equidistant neighborhoods P_kn in the range of 0.0001-0.01, and, under each neighborhood P_kn, calculating the artificial multicellular proportion P_ANN of each real cell within the neighborhood P_kn; that is, counting the number N_ANN of artificial multicellular among the (N_merge × P_kn) cells nearest to the real cell, where N_merge is the total number of cells in the combined expression profile, and calculating P_ANN = N_ANN / (N_merge × P_kn).
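Step S1044 can be sketched with a brute-force nearest-neighbor search. Several details are assumptions of this sketch and not stated in the text: both end points of the range are included in the 100 neighborhoods, the neighbor count N_merge × P_kn is rounded to an integer of at least 1, the cell itself is excluded from its own neighbors, and Euclidean distance is used. The example coordinates are hypothetical.

```python
import math

def neighborhoods(lo=0.0001, hi=0.01, count=100):
    """Equidistant neighborhood sizes P_kn (assumed to include both ends)."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]

def p_ann(points, is_artificial, p_kn):
    """Artificial multicellular proportion P_ANN for every real cell.

    `points` are coordinates of all cells in the combined profile and
    `is_artificial[i]` marks the artificial multicellular among them.
    """
    n_merge = len(points)
    k = max(1, round(n_merge * p_kn))  # assumed rounding of N_merge * P_kn
    proportions = {}
    for i, point in enumerate(points):
        if is_artificial[i]:
            continue  # P_ANN is only computed for real cells
        ranked = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points) if j != i
        )
        n_ann = sum(1 for _, j in ranked[:k] if is_artificial[j])
        proportions[i] = n_ann / k
    return proportions

# Hypothetical embedding: cells 0-2 are real, cells 3-4 are artificial.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (10, 9)]
flags = [False, False, False, True, True]
example = p_ann(points, flags, p_kn=0.2)  # k = round(5 * 0.2) = 1
print(example)
```

In this toy example, real cell 2 sits among the artificial multicellular and receives a P_ANN of 1.0, while cells 0 and 1 receive 0.0.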

S1045, computing the P_ANN distribution under each neighborhood P_kn, and calculating the skewness s and kurtosis k of the distribution from the following quantities: n is the number of P_ANN values under the same neighborhood P_kn, p_i is the i-th P_ANN value under the same neighborhood P_kn, M is the mean of all P_ANN values under the same neighborhood P_kn, and SD is the standard deviation of all P_ANN values under the same neighborhood P_kn.

Then the bimodal coefficient BC of each P_ANN distribution is calculated from s, k and N_real, where N_real is the number of real cells. The neighborhood size at which the bimodal coefficient BC is largest is selected as the optimal neighborhood P_K that is finally used.
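The selection of the optimal neighborhood in S1045 can be sketched as follows. The exact expressions for s, k and BC are not available in this text, so the sketch assumes the conventional moment-based skewness, the excess kurtosis, and the standard sample bimodality coefficient; all function names and the example distributions are hypothetical.

```python
import math

def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (assumed definitions)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    s = sum((v - m) ** 3 for v in values) / (n * sd ** 3)
    k = sum((v - m) ** 4 for v in values) / (n * sd ** 4) - 3.0
    return s, k

def bimodality_coefficient(values):
    """Standard sample bimodality coefficient (assumed form of BC).

    Requires at least 4 values with non-zero spread.
    """
    n = len(values)
    s, k = skewness_kurtosis(values)
    return (s * s + 1.0) / (k + 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def best_neighborhood(p_ann_by_pkn):
    """Return the P_kn whose P_ANN distribution has the largest BC."""
    return max(
        p_ann_by_pkn,
        key=lambda p_kn: bimodality_coefficient(p_ann_by_pkn[p_kn]),
    )

# Hypothetical P_ANN distributions under two neighborhood sizes.
p_ann_by_pkn = {
    0.001: [0.1, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6, 0.9],  # one peak
    0.005: [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # two peaks
}
optimal = best_neighborhood(p_ann_by_pkn)
print(optimal)
```

The two-peaked distribution yields the larger coefficient, so its neighborhood size is selected.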

Under the optimal neighborhood P_K, the boundary between single cells and multicellular is as clear as possible; that is, all cells can be divided into single cells and multicellular as far as possible and the number of ambiguously classified intermediate-state cells is reduced, so that the classification result is as accurate as possible.

S1046, setting the multicellular rate R_doub and calculating the expected number of multicellular E_doub = N_real × R_doub. With the neighborhood size set to the optimal neighborhood P_K, the P_ANN of each real cell is calculated, and the E_doub cells with the largest P_ANN are identified as multicellular and deleted from the secondary filtered cell expression profile, yielding the final filtered cell expression profile.
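Step S1046 can be sketched as a ranking followed by a cut at E_doub. The multicellular rate of 0.2, the cell names and the P_ANN values are hypothetical, and rounding E_doub to an integer is an assumption of this sketch.

```python
def filter_multicellular(p_ann, multicellular_rate):
    """Flag the E_doub real cells with the largest P_ANN as multicellular.

    `p_ann` maps each real cell to its P_ANN under the optimal neighborhood.
    Returns (flagged cells, remaining cells in their original order).
    """
    n_real = len(p_ann)
    e_doub = round(n_real * multicellular_rate)  # E_doub = N_real * R_doub
    ranked = sorted(p_ann, key=lambda cell: p_ann[cell], reverse=True)
    flagged = set(ranked[:e_doub])
    kept = [cell for cell in p_ann if cell not in flagged]
    return flagged, kept

# Hypothetical P_ANN values for ten real cells.
p_ann = {
    "C1": 0.00, "C2": 0.80, "C3": 0.10, "C4": 0.05, "C5": 0.60,
    "C6": 0.02, "C7": 0.00, "C8": 0.15, "C9": 0.01, "C10": 0.03,
}
flagged, kept = filter_multicellular(p_ann, multicellular_rate=0.2)
print(sorted(flagged), kept)
```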

Embodiment two:

a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the single cell transcriptome low quality cell filtration method of embodiment one.

Embodiment III:

a terminal device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the single cell transcriptome low quality cell filtration method of embodiment one when the computer program is executed.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A single cell transcriptome dying cell and multicellular filtration method comprising the steps of:

S101, grouping cells based on a true cell expression profile;

setting gene sets A_mt, A_active and A_antioxi respectively, and calculating the mitochondrial score S_mt, activity score S_active and antioxidant score S_antioxi of each cell population; A_mt is the mitochondrial gene set of the corresponding species, and S_mt is the average, over all cells in the cell population, of the A_mt gene expression proportion; A_active is the housekeeping gene set of the corresponding species, and S_active is the average, over all cells in the cell population, of the average expression level of the A_active genes; A_antioxi is the antioxidant gene set of the corresponding species, and S_antioxi is the average, over all cells in the cell population, of the average expression level of the A_antioxi genes; setting thresholds G_mt, G_active and G_antioxi corresponding to S_mt, S_active and S_antioxi, and determining the cell population type; deleting, from the true cell expression profile, the cells judged to be dead/dying cells;

S1046, under the optimal neighborhood, identifying the specified number of real cells with the largest artificial multicellular proportion as multicellular, and deleting them from the real cell expression profile;

the bimodal coefficient is:

2. The method for filtering dying cells and multiple cells from a single cell transcriptome according to claim 1, wherein said characteristic expression profile is obtained by a two-by-two combination method comprising:

Y＝a1*X1+a2*X2

wherein Y is the generated artificial multicellular, X1 and X2 are the characteristic expression profiles of the cell population; one of a1 and a2 is set to 1, and the other is set to a random value greater than 0 and less than 1.

3. The method of claim 1, wherein the distance between cells is euclidean distance or manhattan distance.

4. The single cell transcriptome dying cell and multicellular filtration method of claim 1, wherein said artificial multicellular ratio is: the ratio of the number of artificial multicellular to the total number of cells in the neighborhood for which the combined expression profile is located.

5. The method for filtering dying cells and multiple cells from a single cell transcriptome according to claim 1, wherein said predetermined number is determined as follows: the multicellular rate is set so that the product of the actual cell number and the multicellular rate is defined as the actual number of cells identified as multicellular.

6. The method for filtering dying cells and multiple cells from a single cell transcriptome according to claim 1, wherein in said S1044, the neighborhood is set as follows: 100 equidistant neighborhoods are arranged in the range of 0.0001-0.01.

7. A computer storage medium having stored thereon a computer program which when executed by a processor implements the single cell transcriptome dying cell and multicellular filtration method of any one of claims 1 to 6.

8. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the single cell transcriptome dying cell and multicellular filtration method of any one of claims 1 to 6.