CN117316295A - Endocrine disease cell identification method based on cell heterogeneity gene and pathway function - Google Patents
- Publication number
- CN117316295A (CN202311177280.9A)
- Authority
- CN
- China
- Prior art keywords
- gene
- cell
- genes
- pathway
- tag
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
An endocrine disease cell identification method based on cell heterogeneity genes and pathway function. The invention aims to overcome the limitations of existing cell function identification methods. The method comprises: step 1, extracting cell-associated gene features; step 2, amplifying cell-associated genes; step 3, predicting cell heterogeneity genes; and step 4, identifying the cell function of the endocrine disease. The invention belongs to the technical field of endocrine disease cell function identification.
Description
Technical Field
The invention relates to a cell identification method for endocrine diseases, and belongs to the technical field of cell function identification for endocrine diseases.
Background
Various endocrine cell subtypes have been found to be associated with major human diseases: for example, β cells are closely related to type 2 diabetes, α cells to type 1 diabetes, and T cells to polycystic ovary syndrome. The main approach to cell function identification is to analyze differences in gene expression between cells using gene sequencing technology. Conventional bulk sequencing methods provide the average gene expression at the level of the whole cell population, which may mask cell-to-cell heterogeneity and individual cell differences, whereas single-cell sequencing provides gene expression information at the level of individual cells, enabling analysis of cell-to-cell heterogeneity and revealing differences between cell types and subtypes.
At present, cell function identification based on single-cell sequencing mainly annotates and analyzes cells using differentially expressed genes and known biomarkers. For example, SCDE identifies cell function by using a Bayesian mixture model to calculate the similarity of the expression levels of differentially expressed genes in different cells, and PUSeqCluster identifies cell function by clustering gene expression data with a combined mixture t-distribution clustering model. However, these approaches still have limitations: 1) for lack of prior knowledge, annotation by known biomarkers and gene expression profiles alone cannot accurately and effectively identify the association between disease and cell function; 2) owing to the technical barriers of single-cell sequencing and the imbalance in cell numbers across endocrine tissues, identifying cell heterogeneity genes by analyzing only cell function-related gene features is affected by this imbalance.
Cell heterogeneity function identification depends not only on gene feature extraction; analyzing the imbalance of cell-associated genes can also improve the accuracy of cell classification, so that cell function identification based on endocrine disease pathway function can be performed effectively. A cell function identification method based on cell heterogeneity genes and pathway function is therefore provided, which uses an improved SMOTE algorithm to amplify cell heterogeneity gene features and labels, uses the RAkEL multi-label classification model to identify cell heterogeneity genes, and analyzes the functions of the associated endocrine disease pathways to improve cell function identification.
Disclosure of Invention
The invention aims to solve the problem that the existing cell function recognition method has limitation, and further provides an endocrine disease cell recognition method based on cell heterogeneity genes and pathway functions.
The technical scheme adopted by the invention for solving the problems is as follows: the method comprises the following steps:
step 1, extracting cell associated gene characteristics;
step 2, amplifying cell-associated genes;
step 3, predicting a cell heterogeneity gene;
step 4, identifying the cell function of the endocrine disease.
Further, the step of extracting the cell-associated gene feature in step 1 includes:
step 101, obtaining the significantly differentially expressed genes of cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, taking the p-values and log2FC values of the matched SNPs as the differential expression information of the gene in the given tissue, the eQTL data being derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension represents whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the first 10 subcellular structures to represent the subcellular localization of each gene, and generating cell function-related gene features using a generative adversarial network.
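The binary pathway encoding of step 103 can be illustrated with a minimal Python sketch. The pathway identifiers and the helper name `pathway_vector` are hypothetical placeholders; the text only states that 343 KEGG pathways are used, one dimension per pathway.

```python
from typing import Dict, List, Set


def pathway_vector(gene_pathways: Set[str], pathway_index: Dict[str, int]) -> List[int]:
    """Return a binary vector with a 1 at every pathway the gene is associated with."""
    vec = [0] * len(pathway_index)
    for pathway in gene_pathways:
        # Pathways outside the fixed index are ignored (an assumption of this sketch).
        if pathway in pathway_index:
            vec[pathway_index[pathway]] = 1
    return vec


# Placeholder pathway list; the method described above would use 343 KEGG pathways.
pathways = ["pathway_A", "pathway_B", "pathway_C"]
pathway_index = {name: i for i, name in enumerate(pathways)}

example = pathway_vector({"pathway_A", "pathway_B", "unknown"}, pathway_index)
print(example)
```

With the full pathway list, the same function would yield the 343-dimensional vector that is concatenated with the other gene features.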
Furthermore, in step 2, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating the label combination of each sample as a new single label, and the specific steps include:
step 201, analyzing the distribution of gene label combinations, and selecting small sample genes based on the imbalance ratio IR of the label combinations, wherein IR is defined by formula (1):
in formula (1), $L$ represents the label set, $L_1$ represents the 1st label, $|L|$ represents the number of labels, $N$ represents the number of genes, and $Y_i$ represents the label set corresponding to the i-th gene;
the mean of the IR values of all labels, mean(IR), indicates the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small sample label, and a gene carrying such a label is referred to as a small sample gene;
step 202, synthesizing gene samples from the small sample genes: for each small sample gene, its set of k nearest neighbor nodes is selected, with the Euclidean distance used to measure the distance between gene feature vectors; to generate the labels of a synthesized gene, the number of times each label appears in the small sample gene and its neighbor nodes is counted and compared with a set threshold, and the labels of the synthesized gene are expressed as:
$\mathrm{Label}_{synthGene} = \{L_1, L_2, \ldots, L_{|L|}\}$ (2),
in formula (2), when the number of occurrences of the i-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthesized gene is 1, and otherwise 0;
step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the features of the synthetic sample, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref})$ (3),
in formula (3), $r$ is a random number in the interval (0, 1), $F_{seed}$ denotes the features of the small sample gene, and $F_{ref}$ denotes the features of the reference neighbor gene node;
step 204, selecting the amplification number of small sample genes by comparing, under different amplification multiples, the mean(IR) value and the similarity of the gene label and label-combination distributions, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
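Formula (1) is not reproduced in this text. The sketch below therefore rests on an assumption: that IR follows the commonly used per-label imbalance ratio, namely the occurrence count of the most frequent label divided by the occurrence count of label l. The symbols described for formula (1) are consistent with that definition, but it is not confirmed by the source. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def imbalance_ratios(label_sets: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    """Assumed IR(l): count of the most frequent label / count of label l."""
    counts = {l: sum(1 for y in label_sets if l in y) for l in labels}
    max_count = max(counts.values())
    return {l: (max_count / c if c > 0 else float("inf")) for l, c in counts.items()}


def small_sample_genes(label_sets: List[Set[str]], ratios: Dict[str, float],
                       threshold: float = 10.0) -> List[int]:
    """Indices of genes carrying at least one label with IR(l) above the threshold."""
    rare = {l for l, ir in ratios.items() if ir > threshold}
    return [i for i, y in enumerate(label_sets) if y & rare]


# Toy data: 21 genes carry "beta", only 2 carry "alpha".
label_sets = [{"beta"}] * 20 + [{"beta", "alpha"}, {"alpha"}]
ratios = imbalance_ratios(label_sets, ["alpha", "beta"])
mean_ir = sum(ratios.values()) / len(ratios)
small = small_sample_genes(label_sets, ratios)
print(ratios, mean_ir, small)
```

Under this assumed definition, mean(IR) can be recomputed after each candidate amplification multiple to compare how much the imbalance has been reduced, as step 204 describes.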
Further, the step of predicting a cell heterogeneity gene in step 3 comprises:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which classifies one label subset and thereby gives a preliminary prediction for the labels in it: if the label subset corresponding to the i-th classifier $P_i$ is $L_i$, each label $l_j$ in it obtains a score; after all classifiers are trained, the final score of each label $l_j$ is computed as the average of its scores. If the final score is above the threshold, the gene is considered to function in the j-th cell type $\mathrm{Cell}_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, applying the trained classifiers to the test set data, so that each sample obtains a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying endocrine disease genes by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
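The score aggregation in steps 301 to 303 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: label subsets are drawn at random and may overlap, the per-classifier scores are supplied by the caller rather than produced by trained models, and the 0.5 threshold and all names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def draw_label_subsets(labels: Sequence[str], k: int, m: int, seed: int = 0) -> List[List[str]]:
    """Draw m random label subsets of size k (subsets may overlap)."""
    rng = random.Random(seed)
    return [rng.sample(list(labels), k) for _ in range(m)]


def aggregate_scores(subsets: List[List[str]],
                     per_classifier_scores: List[Dict[str, float]],
                     threshold: float = 0.5) -> Tuple[Dict[str, float], List[str]]:
    """Average each label's scores over the classifiers that cover it, then threshold."""
    totals: Dict[str, float] = {}
    votes: Dict[str, int] = {}
    for subset, scores in zip(subsets, per_classifier_scores):
        for label in subset:
            totals[label] = totals.get(label, 0.0) + scores[label]
            votes[label] = votes.get(label, 0) + 1
    final = {label: totals[label] / votes[label] for label in totals}
    predicted = sorted(label for label, s in final.items() if s > threshold)
    return final, predicted


# k = 5 and m = 14 as stated in the text; the cell-type names are placeholders.
cell_types = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
subsets = draw_label_subsets(cell_types, k=5, m=14)

# Small hand-made example with two classifiers and hypothetical scores.
toy_subsets = [["alpha", "beta"], ["beta", "delta"]]
toy_scores = [{"alpha": 0.9, "beta": 0.4}, {"beta": 0.8, "delta": 0.1}]
final, predicted = aggregate_scores(toy_subsets, toy_scores)
print(final, predicted)
```

Averaging only over the classifiers whose subset contains a label keeps labels that appear in few subsets from being penalized for low coverage.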
Further, in step 4, after the cell-associated gene set is obtained, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the pathway set related to each cell type. The pathway enrichment results of the new cell gene set and the original cell gene set are compared, and the result is expressed as the difference set between the two pathway sets, namely the pathway set obtained by enrichment analysis of the gene set identified by the invention in the i-th cell type and the pathway set obtained from the original gene set; this difference set identifies new cell-associated pathways. Because cell function is ultimately reflected in the biological pathways related to the cell, cell function identification is performed by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes. The pathway analysis is implemented with the enrichR package in R.
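The comparison of pathway sets can be sketched in Python as below. The text names the enrichR package in R for the actual analysis; this sketch swaps in a plain hypergeometric over-representation test as a stand-in, and the gene names, pathway names, background size and 0.05 cutoff are all hypothetical.

```python
from math import comb
from typing import Dict, Set


def hypergeom_pvalue(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    upper = min(K, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


def enriched_pathways(genes: Set[str], pathways: Dict[str, Set[str]],
                      background_size: int, alpha: float = 0.05) -> Set[str]:
    """Pathways whose overlap with the gene set is significant at level alpha."""
    hits = set()
    for name, members in pathways.items():
        k = len(genes & members)
        if k and hypergeom_pvalue(k, len(members), len(genes), background_size) < alpha:
            hits.add(name)
    return hits


# Toy pathways over a hypothetical background of 100 genes.
pathways = {
    "P1": {"g1", "g2", "g3", "g4", "g5"},
    "P2": {"g6", "g7", "g8", "g9", "g10"},
}
original_genes = {"g1", "g2", "g3", "g50"}
new_genes = original_genes | {"g6", "g7", "g8"}  # original genes plus newly identified ones

original_set = enriched_pathways(original_genes, pathways, background_size=100)
new_set = enriched_pathways(new_genes, pathways, background_size=100)
newly_identified = new_set - original_set  # difference set between the two pathway sets
print(newly_identified)
```

The difference set contains only the pathways that become enriched once the identified genes are added, which is the quantity described above for each cell type.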
The beneficial effects of the invention are as follows: first, multiple gene features affecting cell function are extracted, and features of genes related to cell endocrine function are generated using a generative adversarial network; then, a SMOTE-based gene amplification method is provided, which amplifies the small sample genes in the gene label combinations and thereby addresses the low classification accuracy caused by too many small sample genes; finally, a RAkEL-based cell function identification method is provided, in which cell function identification is performed by analyzing the endocrine disease pathway functions related to cell heterogeneity genes. Compared with existing similar methods, the accuracy of identifying endocrine disease cell heterogeneity genes and cell functions is effectively improved.
Drawings
FIG. 1 is a flow diagram of the present invention;
FIG. 2 is a diagram showing the comparison of the identification result of the present invention and the reference method.
Detailed Description
The first embodiment is as follows: referring to fig. 1 and 2, the steps of the endocrine disease cell recognition method based on the cell heterogeneity gene and pathway function according to the present embodiment include:
step 1, extracting cell associated gene characteristics;
step 2, amplifying cell-associated genes;
step 3, predicting a cell heterogeneity gene;
step 4, identifying the cell function of the endocrine disease.
The second embodiment is as follows: referring to fig. 1 and 2, the steps for extracting the characteristics of the cell-associated gene in step 1 of the endocrine-disease cell recognition method based on the cell heterogeneity gene and the pathway function according to the present embodiment include:
step 101, obtaining the significantly differentially expressed genes of cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, taking the p-values and log2FC values of the matched SNPs as the differential expression information of the gene in the given tissue, the eQTL data being derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension represents whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the first 10 subcellular structures to represent the subcellular localization of each gene, and generating cell function-related gene features using a generative adversarial network.
The third embodiment is as follows: referring to fig. 1 and 2, in step 2 of the endocrine disease cell identification method based on cell heterogeneity genes and pathway function according to the present embodiment, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating the label combination of each sample as a new single label, and the specific steps include:
step 201, analyzing the distribution of gene label combinations, and selecting small sample genes based on the imbalance ratio IR of the label combinations, wherein IR is defined by formula (1):
in formula (1), $L$ represents the label set, $L_1$ represents the 1st label, $|L|$ represents the number of labels, $N$ represents the number of genes, and $Y_i$ represents the label set corresponding to the i-th gene;
the mean of the IR values of all labels, mean(IR), indicates the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small sample label, and a gene carrying such a label is referred to as a small sample gene;
step 202, synthesizing gene samples from the small sample genes: for each small sample gene, its set of k nearest neighbor nodes is selected, with the Euclidean distance used to measure the distance between gene feature vectors; to generate the labels of a synthesized gene, the number of times each label appears in the small sample gene and its neighbor nodes is counted and compared with a set threshold, and the labels of the synthesized gene are expressed as:
$\mathrm{Label}_{synthGene} = \{L_1, L_2, \ldots, L_{|L|}\}$ (2),
in formula (2), when the number of occurrences of the i-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthesized gene is 1, and otherwise 0;
step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the features of the synthetic sample, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref})$ (3),
in formula (3), $r$ is a random number in the interval (0, 1), $F_{seed}$ denotes the features of the small sample gene, and $F_{ref}$ denotes the features of the reference neighbor gene node;
step 204, selecting the amplification number of small sample genes by comparing, under different amplification multiples, the mean(IR) value and the similarity of the gene label and label-combination distributions, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
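Steps 202 and 203 of this embodiment can be sketched in Python as follows. The sketch applies formula (3) exactly as written, $F_{seed} + r \times (F_{seed} - F_{ref})$; note that classic SMOTE instead interpolates toward the neighbor using $(F_{ref} - F_{seed})$. The function names, the toy data and the label threshold are hypothetical.

```python
import math
import random
from typing import List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def synthesize(seed_idx: int, features: List[List[float]], labels: List[List[int]],
               k: int, label_threshold: int, rng: random.Random) -> Tuple[List[float], List[int]]:
    """Create one synthetic gene from a small sample gene and its k nearest neighbors."""
    seed = features[seed_idx]
    others = [i for i in range(len(features)) if i != seed_idx]
    neighbors = sorted(others, key=lambda i: euclidean(seed, features[i]))[:k]

    # Step 202: count each label over the seed gene and its neighbors, then threshold.
    group = [seed_idx] + neighbors
    counts = [sum(labels[i][j] for i in group) for j in range(len(labels[seed_idx]))]
    syn_labels = [1 if c > label_threshold else 0 for c in counts]

    # Step 203: pick a random reference neighbor and apply formula (3) as written.
    ref = features[rng.choice(neighbors)]
    r = rng.random()  # drawn from [0, 1); the text specifies the open interval (0, 1)
    syn_features = [s + r * (s - f) for s, f in zip(seed, ref)]
    return syn_features, syn_labels


# Toy data: four genes with 2-dimensional features and two binary labels.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
labels = [[1, 0], [1, 0], [1, 1], [0, 1]]
syn_features, syn_labels = synthesize(0, features, labels, k=2, label_threshold=1,
                                      rng=random.Random(0))
print(syn_features, syn_labels)
```

Repeating the call for each small sample gene, as many times as the chosen amplification multiple, would give the amplified gene set evaluated in step 204.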
The fourth embodiment is as follows: referring to fig. 1 and 2, the step of predicting a cell heterogeneity gene in step 3 of the endocrine disease cell identification method based on cell heterogeneity genes and pathway function according to the present embodiment includes:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which classifies one label subset and thereby gives a preliminary prediction for the labels in it: if the label subset corresponding to the i-th classifier $P_i$ is $L_i$, each label $l_j$ in it obtains a score; after all classifiers are trained, the final score of each label $l_j$ is computed as the average of its scores. If the final score is above the threshold, the gene is considered to function in the j-th cell type $\mathrm{Cell}_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, applying the trained classifiers to the test set data, so that each sample obtains a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying endocrine disease genes by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
The fifth embodiment is as follows: referring to fig. 1 and 2, in step 4 of the endocrine disease cell identification method based on cell heterogeneity genes and pathway function according to the present embodiment, after the cell-associated gene set is obtained, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the pathway set related to each cell type. The pathway enrichment results of the new cell gene set and the original cell gene set are compared, and the result is expressed as the difference set between the two pathway sets, namely the pathway set obtained by enrichment analysis of the gene set identified by the invention in the i-th cell type and the pathway set obtained from the original gene set; this difference set identifies new cell-associated pathways. Because cell function is ultimately reflected in the biological pathways related to the cell, cell function identification is performed by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes. The pathway analysis is implemented with the enrichR package in R.
The present invention is not limited to the preferred embodiments described above; modifications and variations in detail, as well as other embodiments and equivalents, fall within the spirit and scope of the present invention.
Claims (5)
1. A cell identification method of endocrine disease based on cell heterogeneity gene and pathway function is characterized in that: the endocrine disease cell identification method based on the cell heterogeneity gene and the pathway function comprises the following steps:
step 1, extracting cell associated gene characteristics;
step 2, amplifying cell-associated genes;
step 3, predicting a cell heterogeneity gene;
step 4, identifying the cell function of the endocrine disease.
2. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: the step of extracting the cell-associated gene characteristic in the step 1 comprises the following steps:
step 101, obtaining the significantly differentially expressed genes of cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, taking the p-values and log2FC values of the matched SNPs as the differential expression information of the gene in the given tissue, the eQTL data being derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension represents whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the first 10 subcellular structures to represent the subcellular localization of each gene, and generating cell function-related gene features using a generative adversarial network.
3. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: step 2, constructing an integrated multi-target classification model by adopting a RAkEL frame based on a problem conversion method; raklel converts a multi-objective classification problem into a single-label classification problem by treating the label combination of the samples as a new single label, and the specific steps include:
step 201, analyzing distribution of gene tag combinations, selecting small sample genes based on unbalanced proportion IR of the tag combinations, wherein the definition of IR is shown as the following formula:
in the formula (1), L represents a tag set, L 1 Represents the 1 st tag, |L| represents the number of tags, N represents the number of genes, Y i Representing the tag set corresponding to the ith gene;
the average mean (IR) of the IR of all tags may represent the degree of imbalance of the dataset. It is generally considered that a tag having an IR (l) > 10 can be regarded as a small sample tag, and a gene containing such a tag is referred to as a small sample gene;
step 202, synthesizing gene samples according to small sample genes, selecting k neighbor node sets of the small sample genes according to each small sample gene, and measuring the distance between small sample gene feature vectors by utilizing Euclidean distance; in order to generate gene tags, the number of times each tag appears in a small sample gene and neighbor nodes thereof is counted, and a threshold value is set to synthesize new tags, and the tags of the synthesized genes are expressed as follows:
Label synthGene ={L 1 ,L 2 ,...,L |L| } (2),
in the formula (2), when the number of times of occurrence of the ith tag in the neighbor node is greater than the set threshold value, the tag L of the gene is synthesized i 1, otherwise 0;
step 203, randomly selecting a neighbor node as the reference neighbor gene for generating the synthetic sample features, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref})$ (3),
in formula (3), r is a random number in the interval (0, 1), $F_{seed}$ is the feature of the small sample gene, and $F_{ref}$ is the feature of the reference neighbor gene node;
step 204, comparing the mean(IR) values and the similarity of the gene tag and tag combination distributions under different amplification multiples of the small sample genes, and selecting the amplification number of the small sample genes accordingly, so that the uniformity of the gene distribution is improved while the main information of the tag distribution is retained.
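The selection and synthesis steps of claim 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. Formula (1) is not reproduced in the text of the claim, so the sketch assumes the commonly used per-tag imbalance ratio (count of the most frequent tag divided by the count of the tag in question). The function names, the `vote_fraction` parameter, and the toy data are hypothetical. The feature synthesis applies formula (3) exactly as written, with the term (F_seed - F_ref); standard SMOTE-style interpolation uses the opposite sign.

```python
import math
import random

def imbalance_ratios(tag_sets, tags):
    """Assumed IR(l): count of the most frequent tag / count of tag l."""
    counts = {t: sum(t in ys for ys in tag_sets) for t in tags}
    top = max(counts.values())
    # A tag that never occurs gets an infinite ratio.
    return {t: (top / c if c else math.inf) for t, c in counts.items()}

def small_sample_genes(tag_sets, tags, ir_threshold=10.0):
    """Indices of genes carrying at least one tag with IR above the threshold."""
    ir = imbalance_ratios(tag_sets, tags)
    rare = {t for t, value in ir.items() if value > ir_threshold}
    return [i for i, ys in enumerate(tag_sets) if rare & set(ys)]

def synthesize_gene(features, tag_sets, tags, seed, k=3, vote_fraction=0.5, rng=None):
    """Create one synthetic gene (features, tags) from a small sample seed gene."""
    rng = rng or random.Random()
    # Step 202: k nearest neighbours of the seed by Euclidean distance.
    others = [i for i in range(len(features)) if i != seed]
    neighbors = sorted(others, key=lambda i: math.dist(features[seed], features[i]))[:k]
    # Count each tag over the seed and its neighbours; keep tags above the threshold.
    group = neighbors + [seed]
    new_tags = {t for t in tags
                if sum(t in tag_sets[i] for i in group) > vote_fraction * len(group)}
    # Step 203: pick a random reference neighbour and apply formula (3) as written.
    ref = rng.choice(neighbors)
    r = rng.random()  # drawn from [0, 1); the claim specifies the open interval (0, 1)
    new_features = [s + r * (s - t) for s, t in zip(features[seed], features[ref])]
    return new_features, new_tags

# Toy data (hypothetical): four genes, two features each, two tags.
tags = ["alpha_cell", "beta_cell"]
tag_sets = [{"alpha_cell"}, {"alpha_cell"}, {"alpha_cell", "beta_cell"}, {"alpha_cell"}]
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

ir = imbalance_ratios(tag_sets, tags)
mean_ir = sum(ir.values()) / len(ir)
# A low threshold is used only because the toy data set is tiny.
seeds = small_sample_genes(tag_sets, tags, ir_threshold=3.0)
synthetic = synthesize_gene(features, tag_sets, tags, seed=seeds[0], k=2,
                            rng=random.Random(0))
```

Step 204 would then repeat the synthesis for several amplification multiples and compare `mean_ir` and the tag distributions before and after; that comparison is omitted here because the claim does not specify the similarity measure.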
4. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: the step of predicting a cellular heterogeneity gene in step 3 comprises:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which performs binary classification on one label subset and gives a preliminary prediction for the labels in it: if the label subset corresponding to the i-th classifier $P_i$ is $L_i$, each label $l_j$ in it obtains a score; after all classifiers are trained, the final score of each label $l_j$ is calculated as the average of its scores; if the final score is above the threshold, the gene is considered to have a functional manifestation in the j-th cell type $\mathrm{Cell}_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, testing the test set data on the trained classifiers: each sample obtains a result from each classifier, the final label of the sample is obtained by voting, and the endocrine disease genes are thereby classified by cell, wherein the parameters of RAkEL are k = 5 and m = 14.
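The grouping, scoring, and voting of steps 301 to 303 can be sketched as below. This is an illustrative Python sketch of a RAkEL-style ensemble, not the classifier used in the patent: the base learner is a hypothetical 1-nearest-neighbour model over label combinations, chosen only to keep the example self-contained, and the function names, the 0.5 threshold, and the toy data are assumptions. The claim itself uses k = 5 and m = 14; smaller values are used here for the toy label set.

```python
import math
import random

def train_rakel(X, Y, labels, k=3, m=4, rng=None):
    """Return m (subset, memory) pairs, one per random k-label subset."""
    rng = rng or random.Random()
    models = []
    for _ in range(m):
        subset = tuple(sorted(rng.sample(labels, k)))
        # The combination of subset labels a gene carries becomes one single class.
        classes = [frozenset(ys) & frozenset(subset) for ys in Y]
        models.append((subset, list(zip(X, classes))))
    return models

def predict_rakel(models, x, labels, threshold=0.5):
    """Average the per-classifier votes for each label and apply the threshold."""
    votes = {l: 0 for l in labels}
    seen = {l: 0 for l in labels}
    for subset, memory in models:
        # Stand-in base learner: class of the nearest stored gene.
        _, predicted = min(memory, key=lambda item: math.dist(x, item[0]))
        for l in subset:
            seen[l] += 1
            votes[l] += l in predicted
    # Labels never covered by any subset receive no score.
    scores = {l: votes[l] / seen[l] for l in labels if seen[l]}
    return {l for l, s in scores.items() if s > threshold}, scores

# Toy data (hypothetical): four genes, two features each, four cell-type labels.
labels = ["alpha", "beta", "delta", "gamma"]
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
Y = [{"alpha", "beta"}, {"alpha", "beta"}, {"delta"}, {"delta", "gamma"}]

models = train_rakel(X, Y, labels, k=2, m=6, rng=random.Random(1))
predicted, scores = predict_rakel(models, [0.0, 0.2], labels)
```

Because the subsets are drawn at random, a label may be covered by several classifiers or by none; averaging over only the classifiers that cover a label keeps the scores comparable across labels.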
5. The method for identifying endocrine disease cells based on cell heterogeneity gene and pathway functions according to claim 1, wherein: after the cell-associated gene set is obtained in step 4, the cell-associated gene set and the cell genes in the original dataset are integrated into a new cell-associated gene set, and KEGG pathway enrichment analysis is performed on the new gene set to obtain the pathway set related to each cell type; the result of comparing the pathway enrichment analyses of the new cell gene set and the original cell gene set is expressed as the difference set between the two pathway sets, which are, respectively, the pathway set obtained by enrichment analysis of the gene set identified by the invention in the i-th cell type and the pathway set obtained from the original gene set, so that new cell-associated pathways can be identified; because cell function is ultimately embodied in the biological pathways related to the cell, cell function identification is performed by analyzing the endocrine disease pathway functions related to the cell heterogeneity genes; the pathway analysis is implemented using the enrichR package in the R language.
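The difference set described in claim 5 can be illustrated with a short Python sketch. It assumes the enrichment results have already been obtained (for example from enrichR) and reduced to one set of pathway names per cell type; the function name `new_pathways` and the example entries are hypothetical placeholders, not results from the patent.

```python
def new_pathways(enriched_new, enriched_original):
    """Per cell type: pathways enriched for the new gene set but not the original."""
    return {cell: set(paths) - set(enriched_original.get(cell, ()))
            for cell, paths in enriched_new.items()}

# Placeholder pathway sets for one cell type (illustrative only).
enriched_original = {"beta_cell": {"Insulin secretion"}}
enriched_new = {"beta_cell": {"Insulin secretion", "Type II diabetes mellitus"}}

added = new_pathways(enriched_new, enriched_original)
```

Here `added["beta_cell"]` holds only the pathway that appears after the predicted genes are merged into the original gene set.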
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311177280.9A CN117316295A (en) | 2023-09-13 | 2023-09-13 | Endocrine disease cell identification method based on cell heterogeneity gene and pathway function |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117316295A (en) | 2023-12-29 |
Family
ID=89261267
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311177280.9A Pending CN117316295A (en) | 2023-09-13 | 2023-09-13 | Endocrine disease cell identification method based on cell heterogeneity gene and pathway function |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117316295A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2239205A1 (en) * | 1998-05-29 | 1999-11-29 | Gabrielle Boulianne | Extension of lifespan by overexpression of a gene that increases reactive oxygen metabolism |
CN106874706A (en) * | 2017-01-18 | 2017-06-20 | 湖南大学 | Disease association factor identification method and system based on functional module |
CN113470743A (en) * | 2021-07-16 | 2021-10-01 | 哈尔滨星云医学检验所有限公司 | Differential gene analysis method based on BD single cell transcriptome and proteome sequencing data |
WO2022198761A1 (en) * | 2021-03-22 | 2022-09-29 | 江苏大学 | Asthma diagnosis system based on decision tree and improved smote algorithms |
CN115148286A (en) * | 2022-06-24 | 2022-10-04 | 山东大学 | Cancer collaborative driving module identification system based on single cell data |
CN115427585A (en) * | 2020-02-20 | 2022-12-02 | 居里研究所 | Method for identifying functional disease-specific regulatory T cells |
CN115798593A (en) * | 2022-12-02 | 2023-03-14 | 中国科学院深圳先进技术研究院 | Single cell identification method and equipment based on graph neural network self-supervision clustering |
CN116564410A (en) * | 2023-05-23 | 2023-08-08 | 浙江大学 | Method, equipment and medium for predicting mutation site cis-regulatory gene |
Non-Patent Citations (2)
Title |
---|
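ZANG, TIANYI et al.: "Identification of Alzheimer's Disease-Related Genes Based on Data Integration Method", vol. 9, no. 703, 13 February 2019 (2019-02-13), pages 1 - 6 * |
YAO Chen; LI Hongdong; GUO Zheng: "Reproducibility of DNA methylation marker detection in renal cell carcinoma and its correlation with gene expression changes", 生物信息学 (Chinese Journal of Bioinformatics), no. 02, 15 June 2011 (2011-06-15), pages 102 - 105 * |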
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||