CN117316295A

CN117316295A - Endocrine disease cell identification method based on cell heterogeneity gene and pathway function

Info

Publication number: CN117316295A
Application number: CN202311177280.9A
Authority: CN
Inventors: 张凝一; 臧天仪; 赵飞
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2023-09-13
Filing date: 2023-09-13
Publication date: 2023-12-29

Abstract

The invention relates to an endocrine disease cell identification method based on cell heterogeneity genes and pathway functions, and belongs to the technical field of endocrine disease cell function identification. The invention aims to overcome the limitations of existing cell function identification methods. The method comprises: step 1, extracting cell-associated gene features; step 2, amplifying cell-associated genes; step 3, predicting cell heterogeneity genes; and step 4, identifying the cell function of the endocrine disease.

Description

Endocrine disease cell identification method based on cell heterogeneity gene and pathway function

Technical Field

The invention relates to a cell identification method for endocrine diseases, and belongs to the technical field of cell function identification for endocrine diseases.

Background

Various endocrine cell subtypes have been found to be associated with major human diseases: for example, β cells are closely related to type 2 diabetes, α cells to type 1 diabetes, and T cells to polycystic ovary syndrome. The main approach to cell function identification is to analyze differences in gene expression between cells using gene sequencing. Conventional bulk (population-level) sequencing provides the average gene expression of the whole cell population and may therefore mask cell-to-cell heterogeneity and individual cell differences, whereas single-cell sequencing yields detailed gene expression information at the level of individual cells, making it possible to analyze heterogeneity between cells and to reveal differences between cell types and subtypes.

At present, cell function identification based on single-cell sequencing mainly annotates and analyzes cells using differentially expressed genes and known biomarkers. For example, SCDE identifies cell function by using a Bayesian mixture model to compute the similarity of the expression levels of differentially expressed genes in different cells, and PUSeqCluster identifies cell function by clustering gene expression data with a joint mixture t-distribution clustering model. However, these approaches still have limitations: 1) because prior knowledge is lacking, annotation by known biomarkers and gene expression profiles alone cannot accurately and effectively identify the association between disease and cell function; 2) because of the technical barriers of single-cell sequencing and the imbalance in cell numbers in endocrine tissues, identifying cell heterogeneity genes by analyzing only cell-function-related gene features is affected by cell imbalance.

Identifying cell heterogeneity function depends not only on gene feature extraction: analyzing the imbalance of cell-associated genes can also improve the accuracy of cell classification, so that cell function can be identified effectively on the basis of endocrine disease pathway functions. The invention therefore provides a cell function identification method based on cell heterogeneity genes and pathway functions, which uses an improved SMOTE algorithm to amplify cell heterogeneity gene features and labels, uses a RAkEL multi-label classification model to identify cell heterogeneity genes, and analyzes the associated endocrine disease pathway functions to improve cell function identification.

Disclosure of Invention

The invention aims to overcome the limitations of existing cell function identification methods, and to this end provides an endocrine disease cell identification method based on cell heterogeneity genes and pathway functions.

The technical scheme adopted by the invention for solving the problems is as follows: the method comprises the following steps:

step 1, extracting cell associated gene characteristics;

step 2, amplifying cell-associated genes;

step 3, predicting a cell heterogeneity gene;

step 4, identifying the cell function of the endocrine disease.

Further, the step of extracting the cell-associated gene feature in step 1 includes:

step 101, obtaining the significantly differentially expressed genes of the cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the positional information of the genes, wherein for each gene the significance p-values of its 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;

step 102, using the p-value and log₂FC value of the matched SNP as the differential expression information of the gene in the tissue, the eQTL data being derived from pancreatic, adipose, blood and muscle tissue;

step 103, expressing the KEGG pathway association information of the gene as a 343-dimensional binary vector, wherein each dimension represents the association between the gene and one pathway;

step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization information of the genes, and generating the cell-function-related gene features using a generative adversarial network.
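The feature groups listed in steps 101 to 104 can be illustrated with a minimal sketch. It assumes, as an illustration only, that the groups are simply concatenated into one vector; the function names, the padding and default values, and the integer indexing of pathways and subcellular structures are hypothetical, and the generative adversarial network of step 104 is not modelled.

```python
from typing import Dict, List, Sequence, Set, Tuple

N_SNPS = 5          # most significant mutation sites kept per gene (step 101)
TISSUES = ("pancreas", "adipose", "blood", "muscle")  # eQTL tissues (step 102)
N_PATHWAYS = 343    # KEGG pathway dimensions (step 103)
N_LOCATIONS = 10    # subcellular structures kept (step 104)


def top_snp_pvalues(pvalues: Sequence[float]) -> List[float]:
    """Keep the N_SNPS smallest p-values; pad with 1.0 (assumed 'no signal')."""
    top = sorted(pvalues)[:N_SNPS]
    return top + [1.0] * (N_SNPS - len(top))


def binary_vector(active: Set[int], size: int) -> List[int]:
    """Position j is 1 when index j is associated with the gene, else 0."""
    return [1 if j in active else 0 for j in range(size)]


def gene_features(
    pvalues: Sequence[float],
    eqtl: Dict[str, Tuple[float, float]],
    pathways: Set[int],
    locations: Set[int],
) -> List[float]:
    """Concatenate the four feature groups into one flat vector (assumption)."""
    features: List[float] = top_snp_pvalues(pvalues)
    for tissue in TISSUES:
        # Assumed defaults when the gene has no matched eQTL SNP in a tissue.
        p_value, log2fc = eqtl.get(tissue, (1.0, 0.0))
        features += [p_value, log2fc]
    features += binary_vector(pathways, N_PATHWAYS)
    features += binary_vector(locations, N_LOCATIONS)
    return features


# Hypothetical gene with six candidate SNPs and one matched pancreatic eQTL.
vec = gene_features(
    pvalues=[0.04, 1e-8, 0.2, 3e-5, 0.01, 0.5],
    eqtl={"pancreas": (1e-4, 1.2)},
    pathways={0, 17},
    locations={2},
)
print(len(vec))
```

Under these assumptions each gene is described by 5 + 8 + 343 + 10 = 366 values; the dimensionality actually used by the invention is not stated in the text.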

Furthermore, in step 2, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating each label combination of the samples as a new single label, and the specific steps include:

step 201, analyzing the distribution of the gene label combinations, and selecting the small-sample genes based on the imbalance ratio IR of the labels, wherein IR is defined by the following formula:

in formula (1), $L$ denotes the label set, $L_1$ denotes the 1st label, $|L|$ denotes the number of labels, $N$ denotes the number of genes, and $Y_i$ denotes the label set corresponding to the i-th gene;

the mean of IR over all labels, mean(IR), represents the degree of imbalance of the data set. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is called a small-sample gene;

step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, its set of k nearest neighbor nodes is selected, the distance between small-sample gene feature vectors being measured by Euclidean distance; to generate the gene labels, the number of times each label appears in the small-sample gene and its neighbor nodes is counted, and a threshold is set to synthesize the new labels, the labels of the synthetic gene being expressed as:

$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2)$$

in formula (2), when the number of occurrences of the i-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthetic gene is 1, otherwise 0;

step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the synthetic sample features, and synthesizing the gene feature $F_{\mathrm{syn}}$ by interpolation, expressed as:

$$F_{\mathrm{syn}} = F_{\mathrm{seed}} + r \times (F_{\mathrm{seed}} - F_{\mathrm{ref}}) \quad (3)$$

in formula (3), $r$ is a random number in (0, 1), the feature of the small-sample gene is denoted $F_{\mathrm{seed}}$, and the feature of the reference neighbor gene node is denoted $F_{\mathrm{ref}}$;

step 204, selecting the amplification multiple of the small-sample genes by comparing the mean(IR) values and the similarity of the gene label and label-combination distributions under different amplification multiples, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
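The selection of small-sample genes in step 201 can be sketched as follows. Formula (1) itself is not reproduced in this text, so the sketch assumes the per-label imbalance ratio commonly used for multi-label data: the occurrence count of the most frequent label divided by the occurrence count of the label in question. That definition, the function names and the toy data are assumptions, not the patent's wording.

```python
from typing import Dict, List, Set


def imbalance_ratios(label_sets: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    """Assumed IR(l): count of the most frequent label / count of label l."""
    counts = {l: sum(1 for y in label_sets if l in y) for l in labels}
    max_count = max(counts.values())
    # A label that never occurs is given an infinite ratio.
    return {l: (max_count / c if c else float("inf")) for l, c in counts.items()}


def small_sample_genes(
    label_sets: List[Set[str]], labels: List[str], threshold: float = 10.0
) -> List[int]:
    """Indices of genes carrying at least one label with IR(l) > threshold."""
    ir = imbalance_ratios(label_sets, labels)
    rare = {l for l, value in ir.items() if value > threshold}
    return [i for i, y in enumerate(label_sets) if y & rare]


# Toy data: 25 genes labelled with hypothetical cell types.
labels = ["beta", "alpha", "delta"]
label_sets = [{"beta"}] * 20 + [{"alpha"}] * 4 + [{"beta", "delta"}]

ir = imbalance_ratios(label_sets, labels)
mean_ir = sum(ir.values()) / len(ir)
rare_genes = small_sample_genes(label_sets, labels)
print(ir, mean_ir, rare_genes)
```

In this toy set only "delta" exceeds the threshold of 10, so only the last gene is a small-sample gene. Step 204 would then compare mean(IR) before and after amplification for several candidate multiples.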

Further, the step of predicting a cell heterogeneity gene in step 3 comprises:

step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;

step 302, constructing m binary base classifiers, each of which classifies one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the i-th classifier $P_i$ is $L_i$, each label $l_j$ in it obtains a score; after all the classifiers are trained, the final score of each label $l_j$ is computed as the average of its scores; if the final score is above the threshold, the gene is considered to be functionally expressed in the j-th cell type $\mathrm{Cell}_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;

step 303, testing the test-set data on the trained classifiers: each sample obtains a result from each classifier, the final labels of the sample are obtained by voting, and the endocrine disease genes are thereby classified by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
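The averaging and thresholding described in steps 302 and 303 can be sketched as below. The base classifiers themselves are left out: the sketch takes their per-label scores as given. It also assumes that the m label subsets are drawn at random and may overlap, which the text does not specify; the function names, the total of 20 labels and the 0.5 threshold are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def random_labelsets(n_labels: int, k: int, m: int, rng: random.Random) -> List[List[int]]:
    """Draw m label subsets of k distinct label indices each (assumed random)."""
    return [sorted(rng.sample(range(n_labels), k)) for _ in range(m)]


def aggregate(
    labelsets: Sequence[Sequence[int]],
    scores: Sequence[Sequence[float]],
    n_labels: int,
    threshold: float = 0.5,
) -> Tuple[List[float], List[int]]:
    """Average each label's scores over the classifiers that cover it,
    then mark the label as present when the average exceeds the threshold."""
    totals = [0.0] * n_labels
    counts = [0] * n_labels
    for subset, subset_scores in zip(labelsets, scores):
        for label, score in zip(subset, subset_scores):
            totals[label] += score
            counts[label] += 1
    final = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    return final, [1 if s > threshold else 0 for s in final]


# Two toy classifiers over three labels; label 1 is covered by both.
final_scores, predicted = aggregate(
    labelsets=[[0, 1], [1, 2]],
    scores=[[1.0, 0.0], [1.0, 1.0]],
    n_labels=3,
)
print(final_scores, predicted)

# Subsets with the stated parameters k = 5 and m = 14 (20 labels is hypothetical).
subsets = random_labelsets(n_labels=20, k=5, m=14, rng=random.Random(0))
```

Label 1 receives one vote for and one against, so its average of 0.5 does not exceed the threshold and it is not assigned.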

Further, after the cell-associated gene set is obtained in step 4, it is integrated with the cell genes in the original data set to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways related to each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, the two sets being the pathway sets obtained by enrichment analysis, in the i-th cell type, of the gene set identified by the invention and of the original gene set, respectively. New cell-associated pathways can thereby be identified. Because cell function is ultimately embodied in cell-related biological pathways, cell function is identified by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes. The pathway analysis is implemented using the enrichR package in the R language.
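The difference set between the two pathway sets can be sketched in a few lines. The enrichment analysis itself (performed with enrichR in R according to the text) is not reproduced here; the sketch starts from pathway sets that are assumed to have been computed already. The function name, the cell type keys and the pathway identifiers P1 to P4 are placeholders, not results.

```python
from typing import Dict, List, Set


def new_pathways(
    enriched_new: Dict[str, Set[str]], enriched_original: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """For each cell type, pathways enriched in the new gene set but not in
    the original one (set difference), sorted for stable output."""
    return {
        cell: sorted(paths - enriched_original.get(cell, set()))
        for cell, paths in enriched_new.items()
    }


# Placeholder enrichment results per cell type.
enriched_new = {"beta": {"P1", "P2", "P3"}, "alpha": {"P2", "P4"}}
enriched_original = {"beta": {"P1"}, "alpha": {"P2", "P4"}}

diff = new_pathways(enriched_new, enriched_original)
print(diff)
```

An empty list for a cell type means the integrated gene set revealed no additional pathway for that cell type.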

The beneficial effects of the invention are as follows: first, multiple gene features affecting cell function are extracted, and the features of genes related to cell endocrine function are generated using a generative adversarial network; then, a SMOTE-based gene amplification method is provided, which amplifies the small-sample genes in the gene label combinations and thereby addresses the low classification accuracy caused by too many small-sample genes; finally, a RAkEL-based cell function identification method is provided, in which cell function is identified by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes. Comparison with existing similar methods shows that the accuracy of identifying endocrine disease cell heterogeneity genes and cell functions is effectively improved.

Drawings

FIG. 1 is a flow diagram of the present invention;

FIG. 2 is a diagram showing the comparison of the identification result of the present invention and the reference method.

Detailed Description

The first embodiment is as follows: referring to fig. 1 and 2, the steps of the endocrine disease cell recognition method based on the cell heterogeneity gene and pathway function according to the present embodiment include:

step 1, extracting cell associated gene characteristics;

step 2, amplifying cell-associated genes;

step 3, predicting a cell heterogeneity gene;

step 4, identifying the cell function of the endocrine disease.

The second embodiment is as follows: referring to fig. 1 and 2, the steps for extracting the characteristics of the cell-associated gene in step 1 of the endocrine-disease cell recognition method based on the cell heterogeneity gene and the pathway function according to the present embodiment include:

The third embodiment is as follows: referring to fig. 1 and 2, in step 2 of the endocrine disease cell identification method based on the cell heterogeneity gene and pathway function according to the present embodiment, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating each label combination of the samples as a new single label, and the specific steps include:

$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2)$$

$$F_{\mathrm{syn}} = F_{\mathrm{seed}} + r \times (F_{\mathrm{seed}} - F_{\mathrm{ref}}) \quad (3)$$

in formula (3), $r$ is a random number in (0, 1), the feature of the small-sample gene is denoted $F_{\mathrm{seed}}$, and the feature of the reference neighbor gene node is denoted $F_{\mathrm{ref}}$;
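Formulas (2) and (3) can be sketched as follows. The sketch implements formula (3) with the sign exactly as printed, which places the synthetic point on the line through the seed and the reference but on the side of the seed facing away from the reference; conventional SMOTE interpolation instead uses $F_{\mathrm{ref}} - F_{\mathrm{seed}}$ in the bracket. The function names, the choice to count labels over the seed together with its neighbors, and the toy data are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def k_nearest(seed: int, features: Sequence[Sequence[float]],
              candidates: Sequence[int], k: int) -> List[int]:
    """Indices of the k candidates closest to the seed (Euclidean distance)."""
    others = [i for i in candidates if i != seed]
    others.sort(key=lambda i: math.dist(features[seed], features[i]))
    return others[:k]


def synth_labels(seed: int, neighbours: Sequence[int],
                 labels: Sequence[Sequence[int]], threshold: int) -> List[int]:
    """Formula (2): L_i = 1 when label i occurs more than `threshold` times
    among the seed and its neighbours (assumed counting scope), else 0."""
    group = [seed, *neighbours]
    return [
        1 if sum(labels[g][i] for g in group) > threshold else 0
        for i in range(len(labels[seed]))
    ]


def synth_features(f_seed: Sequence[float], f_ref: Sequence[float], r: float) -> List[float]:
    """Formula (3) as printed: F_syn = F_seed + r * (F_seed - F_ref)."""
    return [s + r * (s - t) for s, t in zip(f_seed, f_ref)]


def synthesize(seed: int, features, labels, candidates, k: int,
               threshold: int, rng: random.Random) -> Tuple[List[float], List[int]]:
    """Create one synthetic gene from a small-sample seed gene."""
    neighbours = k_nearest(seed, features, candidates, k)
    ref = rng.choice(neighbours)   # random reference neighbour (step 203)
    r = rng.random()               # uniform in [0, 1); the text specifies (0, 1)
    return (synth_features(features[seed], features[ref], r),
            synth_labels(seed, neighbours, labels, threshold))


# Toy small-sample genes: 2-D features and 3 binary labels each.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
labels = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 1]]

neighbours = k_nearest(0, features, [0, 1, 2, 3], k=2)
new_labels = synth_labels(0, neighbours, labels, threshold=1)
new_point = synth_features(features[0], features[1], r=0.5)
f_syn, l_syn = synthesize(0, features, labels, [0, 1, 2, 3], 2, 1, random.Random(0))
```

Repeating `synthesize` for each small-sample gene, as many times as the chosen amplification multiple requires, yields the amplified data set.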

The fourth embodiment is as follows: referring to fig. 1 and 2, the step of predicting a cell heterogeneity gene in step 3 of the endocrine disease cell identification method based on the cell heterogeneity gene and pathway function according to the present embodiment includes:

The fifth embodiment is as follows: referring to fig. 1 and 2, after the cell-associated gene set is obtained in step 4 of the endocrine disease cell identification method based on the cell heterogeneity gene and pathway function according to the present embodiment, it is integrated with the cell genes in the original data set to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways related to each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, the two sets being the pathway sets obtained by enrichment analysis, in the i-th cell type, of the gene set identified by the invention and of the original gene set, respectively. New cell-associated pathways can thereby be identified. Because cell function is ultimately embodied in cell-related biological pathways, cell function is identified by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes. The pathway analysis is implemented using the enrichR package in the R language.

The present invention is not limited to the preferred embodiments described above; modifications and variations in detail, and other embodiments obtained by making various modifications and equivalent substitutions, fall within the spirit and scope of the present invention.

Claims

1. A cell identification method of endocrine disease based on cell heterogeneity gene and pathway function is characterized in that: the endocrine disease cell identification method based on the cell heterogeneity gene and the pathway function comprises the following steps:

step 1, extracting cell associated gene characteristics;

step 2, amplifying cell-associated genes;

step 3, predicting a cell heterogeneity gene;

step 4, identifying the cell function of the endocrine disease.

2. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: the step of extracting the cell-associated gene characteristic in the step 1 comprises the following steps:

3. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: in step 2, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating each label combination of the samples as a new single label, and the specific steps include:

in formula (1), $L$ denotes the label set, $L_1$ denotes the 1st label, $|L|$ denotes the number of labels, $N$ denotes the number of genes, and $Y_i$ denotes the label set corresponding to the i-th gene;

$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2)$$

$$F_{\mathrm{syn}} = F_{\mathrm{seed}} + r \times (F_{\mathrm{seed}} - F_{\mathrm{ref}}) \quad (3)$$

in formula (3), $r$ is a random number in (0, 1), the feature of the small-sample gene is denoted $F_{\mathrm{seed}}$, and the feature of the reference neighbor gene node is denoted $F_{\mathrm{ref}}$;

4. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: the step of predicting a cellular heterogeneity gene in step 3 comprises:

5. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: after the cell-associated gene set is obtained in step 4, it is integrated with the cell genes in the original data set to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways related to each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, the two sets being the pathway sets obtained by enrichment analysis, in the i-th cell type, of the gene set identified by the invention and of the original gene set, respectively; new cell-associated pathways can thereby be identified; because cell function is ultimately embodied in cell-related biological pathways, cell function is identified by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes; and the pathway analysis is implemented using the enrichR package in the R language.