CN116312786B - Single cell expression pattern difference evaluation method based on multi-group comparison

Info

Publication number: CN116312786B
Application number: CN202310083585.7A
Authority: CN
Inventors: 陈哲名
Original assignee: Hangzhou Lianchuan Biotechnology Co ltd
Current assignee: Hangzhou Lianchuan Biotechnology Co ltd
Priority date: 2023-02-08
Filing date: 2023-02-08
Publication date: 2023-11-28
Anticipated expiration: 2043-02-08
Also published as: CN117457072A; CN116312786A

Abstract

The invention discloses a single-cell expression pattern difference evaluation method based on multi-group comparison, and relates to the technical field of bioinformatics analysis. The method comprises the following steps: S101, merging the single-cell transcriptome expression profiles of a plurality of groups; S102, clustering all cells in the merged single-cell transcriptome expression profile into cell populations; S103, identifying the cell type of each cell population; S104, screening out the cell types that are present in two or more groups simultaneously, and extracting the coexisting expression profile of each such cell type; S105, calculating a differential enrichment score for the corresponding cell type from its coexisting expression profile; and S106, ranking the cell types by differential enrichment score from largest to smallest. The method provides a quantitative index of the magnitude of the differences and of the degree of differential enrichment of single-cell expression profiles under multi-group comparison, and thereby provides a basis for scientifically selecting the direction of subsequent research.

Description

Single cell expression pattern difference evaluation method based on multi-group comparison

Technical Field

The invention relates to the technical field of biological information analysis, in particular to a single cell expression pattern difference evaluation method based on multi-group comparison.

Background

Single-cell sequencing technologies, such as single-cell transcriptomics, single-cell ATAC-seq (high-throughput assay for transposase-accessible chromatin in single cells) and single-cell epigenomics, can obtain RNA and chromatin information for thousands or tens of thousands of genes in a single cell, and comprehensively reveal the differences in gene expression between individual cells. High-throughput single-cell sequencing platforms (such as those of 10X Genomics) use technologies such as microfluidics, oil-droplet encapsulation and barcode labeling to sort and capture cells at high throughput, and can isolate and label hundreds or even tens of thousands of cells at a time. After steps such as amplification and sequencing, information such as the transcriptome, chromatin state or site methylation of each cell can be obtained. These platforms offer high cell throughput, low library construction cost and high efficiency. Combined with the marker genes of different cell types or with a cell type annotation algorithm (such as SingleR), the technology can be used to analyze the expression, chromatin or methylation characteristics of different cell types, and further to study biological development, disease progression, immune changes and the like.

Analysis of single-cell omics data typically involves the following steps: filtering out low-quality cells, identifying cell types, preliminarily analyzing the overall characteristics or expression patterns of each cell type, selecting cell types for in-depth analysis, and performing personalized analysis of the target cell types (e.g., pathway enrichment analysis, prediction of differentiation trajectories, transcription factor activity, cell-cell communication, etc.). The last step, personalized analysis, can in theory be carried out with an almost unlimited number of existing software tools, and therefore usually consumes the most manpower, effort and time. The preceding step, selecting cell types, is the foundation of personalized analysis: if key and appropriate cell types are selected, the subsequent personalized analysis can mine important and meaningful information; if the cell types are selected improperly, a great deal of time and effort may be spent while only worthless or low-value information is mined. In single-cell data involving multi-group comparison in particular, studying every cell type in depth multiplies the research cost compared with data involving no comparison or a two-group comparison.

Currently, when choosing directions for subsequent research on single-cell data that contain multiple treatment groups (three or more), researchers usually rely on existing biological knowledge rather than on the characteristics and differences of the data themselves. As a result, many potentially valuable research directions may be missed. Whether different cell types differ between groups, and the magnitude and significance of those differences, indicate the extent to which the treatment or experimental conditions affect the expression patterns of the different cells. However, because single-cell omics involves huge volumes of data and cells and complex data conditions, it is difficult for the prior art to evaluate differences in cell expression patterns directly and accurately.

Disclosure of Invention

In order to solve at least one of the technical problems described in the background, the invention aims to provide a single-cell expression pattern difference evaluation method based on multi-group comparison, which provides a quantitative index of the magnitude of the differences and of the degree of differential enrichment of single-cell expression patterns under multi-group comparison, and provides a basis for scientifically selecting the direction of subsequent research.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A single-cell expression pattern difference evaluation method based on multi-group comparison comprises the following steps:

S101, merging the single-cell transcriptome expression profiles of a plurality of groups;

S102, clustering all cells in the merged single-cell transcriptome expression profile into cell populations;

S103, identifying the cell type of each cell population;

S104, screening out the cell types that are present in two or more groups simultaneously, and extracting the coexisting expression profile of each such cell type;

S105, calculating a differential enrichment score for the corresponding cell type from its coexisting expression profile;

S106, ranking the cell types by differential enrichment score from largest to smallest.

Further, the method for calculating the differential enrichment score is as follows:

S1051, obtaining the total characteristic expression profile of the cell type based on the coexisting expression profile;

S1052, dividing the total characteristic expression profile into a plurality of sub-characteristic expression profiles according to groups;

S1053, calculating the distance between each sub-characteristic expression profile and the corresponding total characteristic expression profile;

S1054, repeating S1051 to S1053 on the remaining coexisting expression profiles to obtain the corresponding distances;

S1055, sorting all distances from largest to smallest and assigning rankings;

S1056, obtaining the differential enrichment score of the cell type based on the rankings:

[Formula not reproduced in this text.]

wherein the symbols of the formula denote, in order: the differential enrichment score; a ranking; the set of rankings corresponding to the distances of the same cell type; the subscript denoting the cell type; the subscript denoting the group; the total number of rankings; the number of groups containing the same cell type; and the enrichment weight, which is subject to a constraint that is likewise not reproduced.

Further, the enrichment weight is set to a specific value (the value is not reproduced in this text).

Further, the distance is a Euclidean distance or a Manhattan distance.

Further, the total characteristic expression profile is obtained as follows: for the cell population corresponding to the screened cell type, the average expression level of each gene is taken as the characteristic value of that gene, and the characteristic values of the plurality of genes together form the total characteristic expression profile.

Further, in S106, reliability screening is performed on the cell types:

S1061, comparing the rankings of all distances of the cell type with the rankings of all distances of the other cell types, and calculating a p value using the Wilcoxon rank-sum test;

S1062, obtaining the corresponding p values for the other cell types in the same way;

S1063, deleting the cell types whose p values are higher than a preset threshold.

Further, in S102, the cells are normalized and/or reduced in dimension by PCA before being clustered.

Further, in S103, the cell types are identified using marker genes with cell type specificity, a cell population that highly expresses a marker gene being identified as the corresponding cell type; alternatively, each cell type is identified using the SingleR software.

A computer storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements a single cell expression pattern difference assessment method based on a plurality of sets of comparisons as described above.

A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a single cell expression pattern difference evaluation method based on a plurality of sets of comparisons as described above when executing the computer program.

Compared with the prior art, the invention has the beneficial effects that:

According to the invention, for each cell type, the distances between the characteristic expression profile of each group and the characteristic expression profile of the cell type as a whole are calculated; all distances are ranked; and the differential enrichment score of each cell type is calculated from the distance rankings. The differences between groups within each cell type can thus be compared through the differential enrichment scores, yielding an evaluation of the magnitude of the expression pattern differences and providing a basis for scientifically selecting the direction of subsequent research.

Drawings

FIG. 1 is an overall flow chart of an embodiment of the present invention.

FIG. 2 is a flow chart of differential enrichment scoring according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Embodiment one:

Referring to FIG. 1, this embodiment provides a single-cell expression pattern difference evaluation method based on multi-group comparison, which includes the following steps:

S101, merging the single-cell transcriptome expression profiles of a plurality of groups, as shown in Table 1 below:

TABLE 1 Merged expression profiles

In Table 1 above, Group1 to Group3 denote the groups; Samp1 to Samp6 denote the sample names; C1 to C22001 denote the cells; Gene1 to Gene3 denote the genes; and the numbers in the body of the table are gene expression levels.
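The merging in S101 can be sketched as follows. This is a minimal illustration only: the nested-dictionary input layout, the function name `merge_profiles`, the zero-filling of genes absent from a cell, and the example values are all assumptions made for the sketch and are not taken from Table 1.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: {group: {sample: {cell_id: {gene: expression}}}}
Profile = Dict[str, Dict[str, Dict[str, Dict[str, float]]]]


def merge_profiles(profiles: Profile) -> Tuple[List[dict], List[str]]:
    """Flatten per-group profiles into one list of cell rows plus a shared gene list."""
    genes = sorted({gene
                    for samples in profiles.values()
                    for cells in samples.values()
                    for expr in cells.values()
                    for gene in expr})
    merged = []
    for group, samples in profiles.items():
        for sample, cells in samples.items():
            for cell_id, expr in cells.items():
                row = {"group": group, "sample": sample, "cell": cell_id}
                # Assumption: a gene not measured in a cell is filled with 0.
                row.update({gene: expr.get(gene, 0.0) for gene in genes})
                merged.append(row)
    return merged, genes


# Invented example values, for illustration only.
profiles = {
    "Group1": {"Samp1": {"C1": {"Gene1": 5.0, "Gene2": 0.0}}},
    "Group2": {"Samp3": {"C2": {"Gene1": 2.0, "Gene3": 7.0}}},
}
merged, genes = merge_profiles(profiles)
```

Each row keeps its group and sample labels, so the later steps can still split cells by group after the profiles have been combined.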

S102, clustering all cells in the merged single-cell transcriptome expression profile using the Seurat software. Normalization and dimension reduction may be performed at the same time, with "LogNormalize" selected as the normalization method and "vst" as the method for selecting highly variable genes.
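Seurat is an R package, so the embodiment itself would be run in R. Purely to illustrate what the log-normalization step does, the following Python sketch mirrors the documented behaviour of Seurat's log-normalization (each count divided by the cell's total count, multiplied by a scale factor of 10000 by default, then transformed with log1p). The function name and the example counts are hypothetical; clustering and "vst" feature selection are not shown.

```python
import math
from typing import Dict


def log_normalize(counts: Dict[str, float], scale_factor: float = 10000.0) -> Dict[str, float]:
    """Per-cell log-normalization: log1p(count / total_count * scale_factor)."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: a cell with no counts is returned as all zeros.
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in counts.items()}


# Invented raw counts for a single cell.
cell = {"Gene1": 3, "Gene2": 1, "Gene3": 0}
normalized = log_normalize(cell)
```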

S103, identifying the cell type of each cell population. There are two common identification methods: one uses marker genes with cell type specificity (such as the CD4 and CD8 genes of human T cells) and identifies a cell population that highly expresses a marker gene as the corresponding cell type; the other applies cell type identification software (e.g., SingleR, scCATCH) to each cell population. The identification results are shown in Table 2 below:

TABLE 2 cell identification results

In Table 2 above, Type1 to Type4 are the identified cell types.
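The marker-gene route of S103 can be sketched as below. The scoring rule (assign each cell population the cell type whose marker genes have the highest mean expression) is an assumed simplification of "highly expresses a marker gene"; the function name, the cluster names, the marker lists and the expression values are all hypothetical.

```python
from typing import Dict, List


def identify_by_markers(cluster_means: Dict[str, Dict[str, float]],
                        markers: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each cluster with the cell type whose markers it expresses most highly."""
    labels = {}
    for cluster, means in cluster_means.items():
        def marker_score(cell_type: str) -> float:
            marker_genes = markers[cell_type]
            return sum(means.get(g, 0.0) for g in marker_genes) / len(marker_genes)
        labels[cluster] = max(markers, key=marker_score)
    return labels


# Invented mean expression per cluster; CD8A is the gene symbol of the CD8 alpha chain.
cluster_means = {
    "cluster0": {"CD4": 4.2, "CD8A": 0.1},
    "cluster1": {"CD4": 0.3, "CD8A": 3.9},
}
markers = {"CD4 T cell": ["CD4"], "CD8 T cell": ["CD8A"]}
labels = identify_by_markers(cluster_means, markers)
```

In practice a threshold on the marker score would probably also be wanted, so that a population expressing none of the markers is left unlabeled rather than forced into the best-scoring type.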

S104, counting the number of groups in which each cell type is present, screening out the cell types that are present in two or more groups simultaneously, and extracting the coexisting expression profile of each such cell type.

For example, if three groups contain cell type Type1, the number of groups for Type1 is 3.

Cell types present in fewer than 2 groups are removed. The remaining n cell types are labeled T1 to Tn, their group numbers are labeled correspondingly, and the coexisting expression profile of each is extracted. Table 3 below shows the coexisting expression profile corresponding to cell type Type1:

TABLE 3 Expression profile corresponding to cell type Type1

In Table 3 above, T1 denotes the original cell type Type1, which is present in three groups simultaneously, so its number of groups is 3.
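The screening in S104 amounts to counting distinct groups per cell type and keeping the types found in at least two. A minimal sketch, assuming each cell is a dictionary with hypothetical `group` and `cell_type` keys:

```python
from collections import defaultdict
from typing import Dict, List, Set


def coexisting_profiles(cells: List[dict], min_groups: int = 2) -> Dict[str, List[dict]]:
    """Keep only cell types present in at least `min_groups` groups."""
    groups_by_type: Dict[str, Set[str]] = defaultdict(set)
    for cell in cells:
        groups_by_type[cell["cell_type"]].add(cell["group"])
    kept = {t for t, groups in groups_by_type.items() if len(groups) >= min_groups}

    profiles: Dict[str, List[dict]] = defaultdict(list)
    for cell in cells:
        if cell["cell_type"] in kept:
            profiles[cell["cell_type"]].append(cell)
    return dict(profiles)


# Invented cells: Type1 occurs in three groups, Type4 in only one.
cells = [
    {"cell": "C1", "group": "Group1", "cell_type": "Type1"},
    {"cell": "C2", "group": "Group2", "cell_type": "Type1"},
    {"cell": "C3", "group": "Group3", "cell_type": "Type1"},
    {"cell": "C4", "group": "Group1", "cell_type": "Type4"},
]
coexisting = coexisting_profiles(cells)
group_counts = {t: len({c["group"] for c in rows}) for t, rows in coexisting.items()}
```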

S105, calculating a differential enrichment score of the corresponding cell type according to the coexisting expression profile so as to measure the differential enrichment degree of the cell type.

Specifically, as shown in fig. 2, the method for calculating the differential enrichment score is as follows:

S1051, obtaining the total characteristic expression profile of the cell type based on its coexisting expression profile (a subscript of 0 indicates the characteristic expression profile of the cell type as a whole).

The total characteristic expression profile is obtained as follows: for the cell population corresponding to the screened cell type, the average expression level of each gene is taken as the characteristic value of that gene, and the characteristic values of the plurality of genes together form the total characteristic expression profile. Table 4 below shows the total characteristic expression profile of cell type Type1, obtained from its coexisting expression profile:

TABLE 4 Total characteristic expression profile of cell type Type1

In Table 4, the value 17.6, for example, is the average expression level of gene Gene1 in the coexisting expression profile.

In other embodiments, the method of calculating the characteristic expression profile may be adjusted according to the distribution of the data. When the gene expression levels follow a normal or uniform distribution, the profile may be obtained by averaging the expression level of each gene; for other distributions, it may be obtained by averaging the square (or square root, or logarithm) of the expression levels, which can eliminate the bias introduced by the distribution.
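The per-gene averaging, with the optional transform mentioned above, can be sketched as follows. The function name `characteristic_profile`, its `transform` parameter and the example values are hypothetical and are not the values of Tables 3 and 4.

```python
import math
from statistics import mean
from typing import Callable, Dict, List, Optional


def characteristic_profile(cells: List[Dict[str, float]],
                           genes: List[str],
                           transform: Optional[Callable[[float], float]] = None) -> Dict[str, float]:
    """Per-gene average over the given cells, optionally after a transform."""
    f = transform if transform is not None else (lambda x: x)
    return {gene: mean(f(cell[gene]) for cell in cells) for gene in genes}


# Invented expression values for three cells of one cell type.
cells = [
    {"Gene1": 1.0, "Gene2": 1.0},
    {"Gene1": 2.0, "Gene2": 4.0},
    {"Gene1": 6.0, "Gene2": 9.0},
]
plain = characteristic_profile(cells, ["Gene1", "Gene2"])
rooted = characteristic_profile(cells, ["Gene1", "Gene2"], transform=math.sqrt)
```

Passing `math.sqrt` (or, for example, `math.log1p`) as `transform` gives the square-root or logarithm variant; the default gives the plain mean.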

S1052, dividing the total characteristic expression profile into a plurality of sub-characteristic expression profiles according to groups. Specifically, for the groups that contain cell type T1, the sub-characteristic expression profile of the T1 cells in each group, from Group1 to the last such group, is calculated and labeled with two subscripts: the first subscript, 1, denotes the cell type T1, and the second denotes the group. Table 5 below shows the sub-characteristic expression profiles into which the total characteristic expression profile of cell type T1 is divided:

TABLE 5 Sub-characteristic expression profiles into which the total characteristic expression profile of cell type T1 is divided

S1053, calculating the distance between each sub-characteristic expression profile and the total characteristic expression profile of the cell type. As shown in Table 5 above, the coexisting expression profile of cell type T1 is first split by group into one expression profile per group; the sub-characteristic expression profile of each of these is then calculated using the same method as for the total characteristic expression profile in step S1051 (e.g., the average expression level of each gene, or the median expression level, or a calculation on the square, square root or logarithm of the expression levels). The Euclidean distance between each sub-characteristic expression profile and the total characteristic expression profile is then calculated (the Manhattan distance may be used instead), and the resulting distances are merged into one set for the cell type.
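The split-by-group and distance calculation described above can be sketched as follows for a single cell type. The per-gene mean is used for both the total and the sub-characteristic profiles; the function name, the `group` key and the example values are assumptions made for the illustration.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def group_distances(cells: List[dict], genes: List[str],
                    metric: str = "euclidean") -> Dict[str, float]:
    """Distance from each group's mean profile to the mean profile of all cells."""
    total = {g: mean(cell[g] for cell in cells) for g in genes}

    by_group: Dict[str, List[dict]] = defaultdict(list)
    for cell in cells:
        by_group[cell["group"]].append(cell)

    distances = {}
    for group, members in by_group.items():
        sub = {g: mean(cell[g] for cell in members) for g in genes}
        diffs = [sub[g] - total[g] for g in genes]
        if metric == "euclidean":
            distances[group] = math.sqrt(sum(d * d for d in diffs))
        elif metric == "manhattan":
            distances[group] = sum(abs(d) for d in diffs)
        else:
            raise ValueError(f"unknown metric: {metric}")
    return distances


# Invented cells of one cell type, one cell per group for brevity.
cells = [
    {"group": "Group1", "Gene1": 0.0, "Gene2": 0.0},
    {"group": "Group2", "Gene1": 6.0, "Gene2": 8.0},
]
euclidean = group_distances(cells, ["Gene1", "Gene2"])
manhattan = group_distances(cells, ["Gene1", "Gene2"], metric="manhattan")
```

Here the total profile is (3, 4), so each group lies at Euclidean distance 5 and Manhattan distance 7 from it.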

S1054, repeating S1051 to S1053 for the remaining cell types (T2 to Tn) to obtain the corresponding distances and thus one distance set per cell type. The distance sets of all cell types are merged into a set B of all Euclidean distances, which contains every distance of every cell type.

S1055, ranking all distances in order from largest to smallest. Specifically, all distances in set B are sorted from largest to smallest and assigned the corresponding rankings, and the rankings corresponding to the distances of each cell type are defined as the ranking set of that cell type.
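The pooling and ranking of S1054 and S1055 can be sketched as below. The text does not say how tied distances are handled, so this sketch assumes ordinal rankings (1 for the largest distance, ties broken by input order); the function name and the distances are hypothetical.

```python
from typing import Dict, List


def rank_distances(distance_sets: Dict[str, Dict[str, float]]) -> Dict[str, List[int]]:
    """Pool all distances, rank them from largest (1) to smallest, and
    return the list of rankings belonging to each cell type."""
    pooled = [(distance, cell_type)
              for cell_type, per_group in distance_sets.items()
              for distance in per_group.values()]
    # Stable sort: equal distances keep their input order (an assumption about ties).
    pooled.sort(key=lambda item: item[0], reverse=True)

    rank_sets: Dict[str, List[int]] = {cell_type: [] for cell_type in distance_sets}
    for rank, (_, cell_type) in enumerate(pooled, start=1):
        rank_sets[cell_type].append(rank)
    return rank_sets


# Invented distances: {cell type: {group: distance}}.
distance_sets = {
    "T1": {"Group1": 9.0, "Group2": 7.5, "Group3": 8.2},
    "T2": {"Group1": 1.0, "Group2": 3.1},
}
rank_sets = rank_distances(distance_sets)
```

A cell type whose groups all sit far from its overall profile ends up with rankings concentrated near the top, which is the pattern the enrichment score is meant to capture.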

S1056, obtaining the differential enrichment score of the cell type based on the rankings:

[Formula not reproduced in this text.]

wherein the symbols of the formula denote, in order: the differential enrichment score; a ranking; the set of rankings corresponding to the distances of the same cell type; the subscript denoting the cell type; the subscript denoting the group; the total number of rankings; the number of groups containing the same cell type; and the enrichment weight, which can be adjusted according to the situation and is usually set to 0.75.

S106, ranking the cell types by differential enrichment score from largest to smallest. The ranking serves as the evaluation of the degree of differential enrichment of the expression pattern of each cell type under multi-group comparison, and as a ranking of potential research value: the larger the differential enrichment score, the higher the corresponding potential research value.

Embodiment two:

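Based on the first embodiment, in the second embodiment, reliability screening is further performed on the cell types in S106, which specifically includes the following steps:

S1061, comparing the rankings of all distances of the cell type with the rankings of all distances of the other cell types, and calculating a p value using the Wilcoxon rank-sum test;

S1062, obtaining the corresponding p values for the other cell types in the same way;

S1063, deleting the cell types whose p values are higher than a preset threshold G, which is typically set to 0.05.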
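The reliability screening can be sketched as follows. Several choices here are assumptions that the text does not fix: the test is taken as two-sided, the p value uses the normal approximation to the rank-sum statistic without continuity correction (rough for very small sets), and the rankings are assumed to be the tie-free global rankings 1 to N from S1055, so a cell type's rank sum is simply the sum of its rankings. Function names and example rankings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict, List


def rank_sum_p_value(ranks_in: List[int], ranks_out: List[int]) -> float:
    """Two-sided Wilcoxon rank-sum p value via the normal approximation."""
    n1, n2 = len(ranks_in), len(ranks_out)
    u = sum(ranks_in) - n1 * (n1 + 1) / 2          # Mann-Whitney U of the first set
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    if sigma == 0:
        return 1.0                                  # nothing to compare against
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


def reliable_cell_types(rank_sets: Dict[str, List[int]],
                        threshold: float = 0.05) -> List[str]:
    """Drop cell types whose p value is higher than the threshold."""
    kept = []
    for cell_type, ranks in rank_sets.items():
        others = [r for other, rs in rank_sets.items() if other != cell_type for r in rs]
        if rank_sum_p_value(ranks, others) <= threshold:
            kept.append(cell_type)
    return kept


# Invented ranking sets over 12 pooled distances.
rank_sets = {"T1": [1, 2, 3, 4], "T2": [9, 10, 11, 12], "T3": [5, 6, 7, 8]}
kept = reliable_cell_types(rank_sets)
```

In this example T1 and T2 have rankings concentrated at one end and pass the 0.05 threshold, while T3 sits in the middle of the pooled rankings and is removed. A statistics library's exact rank-sum test would be preferable when the ranking sets are small.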

Embodiment III:

The present embodiment provides a computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the single-cell expression pattern difference evaluation method based on multi-group comparison described in the first or second embodiment.

Embodiment four:

The present embodiment provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the single-cell expression pattern difference evaluation method based on multi-group comparison described in the first or second embodiment.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

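1. A single-cell expression pattern difference evaluation method based on multi-group comparison, characterized by comprising the following steps:

S101, merging the single-cell transcriptome expression profiles of a plurality of groups;

S102, clustering all cells in the merged single-cell transcriptome expression profile into cell populations;

S103, identifying the cell type of each cell population;

S104, screening out the cell types that are present in two or more groups simultaneously, and extracting the coexisting expression profile of each such cell type;

S105, calculating a differential enrichment score for the corresponding cell type from its coexisting expression profile;

S106, ranking the cell types by differential enrichment score from largest to smallest;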

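wherein the differential enrichment score is calculated as follows:

S1051, obtaining the total characteristic expression profile of the cell type based on the coexisting expression profile;

S1052, dividing the total characteristic expression profile into a plurality of sub-characteristic expression profiles according to groups;

S1053, calculating the distance between each sub-characteristic expression profile and the corresponding total characteristic expression profile;

S1054, repeating S1051 to S1053 on the remaining coexisting expression profiles to obtain the corresponding distances;

S1055, sorting all distances from largest to smallest and assigning rankings;

S1056, obtaining the differential enrichment score of the cell type based on the rankings:

[Formula not reproduced in this text.]

wherein the symbols of the formula denote, in order: the differential enrichment score; a ranking; the set of rankings corresponding to the distances of the same cell type; the subscript denoting the cell type; the subscript denoting the group; the total number of rankings; the number of groups containing the same cell type; and the enrichment weight, which is subject to a constraint that is likewise not reproduced.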

2. The single-cell expression pattern difference evaluation method based on multi-group comparison according to claim 1, wherein the enrichment weight is set to a specific value (the value is not reproduced in this text).

3. The single-cell expression pattern difference evaluation method based on multi-group comparison according to claim 1, wherein the distance is a Euclidean distance or a Manhattan distance.

4. The single-cell expression pattern difference evaluation method based on multi-group comparison according to claim 1, wherein the total characteristic expression profile is obtained as follows: for the cell population corresponding to the screened cell type, the average expression level of each gene is taken as the characteristic value of that gene, and the characteristic values of the plurality of genes together form the total characteristic expression profile.

5. The single-cell expression pattern difference evaluation method based on multi-group comparison according to claim 1, wherein in S102, the cells are normalized and/or reduced in dimension by PCA before being clustered.

6. The single-cell expression pattern difference evaluation method based on multi-group comparison according to claim 1, wherein in S103, the cell types are identified using marker genes with cell type specificity, a cell population that highly expresses a marker gene being identified as the corresponding cell type; alternatively, each cell type is identified using the SingleR software.

7. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the single cell expression pattern difference assessment method based on a multi-group comparison as claimed in any one of claims 1 to 6.

8. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the single cell expression pattern difference assessment method based on a plurality of sets of comparisons according to any one of claims 1 to 6 when executing the computer program.