CN117316295A - Endocrine disease cell identification method based on cell heterogeneity gene and pathway function - Google Patents

Publication number
CN117316295A
Authority
CN
China
Prior art keywords
gene
cell
genes
pathway
tag
Prior art date
Legal status
Pending
Application number
CN202311177280.9A
Other languages
Chinese (zh)
Inventor
张凝一
臧天仪
赵飞
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
2023-09-13
Filing date
2023-09-13
Publication date
2023-12-29

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133: Distances to prototypes
    • G06F18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/2431: Multiple classes
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Organic Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Biochemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

An endocrine disease cell identification method based on cell heterogeneity genes and pathway functions, relating to the identification of endocrine disease cells. The invention aims to overcome the limitations of existing cell function identification methods. The method comprises: step 1, extracting cell-associated gene features; step 2, amplifying cell-associated genes; step 3, predicting cell heterogeneity genes; and step 4, identifying endocrine disease cell function. The invention belongs to the technical field of endocrine disease cell function identification.

Description

Endocrine disease cell identification method based on cell heterogeneity gene and pathway function
Technical Field
The invention relates to a cell identification method for endocrine diseases, and belongs to the technical field of cell function identification for endocrine diseases.
Background
Various endocrine cell subtypes have been found to be associated with major human diseases: for example, β cells are closely related to type 2 diabetes, α cells to type 1 diabetes, and T cells to polycystic ovary syndrome. The main approach to cell function identification is to analyze differences in gene expression between cells using gene sequencing. Conventional bulk (population) sequencing provides an average of gene expression at the level of the whole cell population and may therefore mask cell-to-cell heterogeneity and individual cell differences, whereas single-cell sequencing obtains detailed gene expression information at the level of individual cells, enabling analysis of cell-to-cell heterogeneity and revealing differences between cell types and subtypes.
At present, cell function identification based on single-cell sequencing mainly annotates and analyzes cells using differentially expressed genes and known biomarkers. For example, SCDE identifies cell function by using a Bayesian mixture model to calculate the similarity of the expression levels of differentially expressed genes across cells, and PUSeqCluster identifies cell function by clustering gene expression data with a joint mixed t-distribution clustering model. However, such approaches still have limitations: 1) for lack of prior knowledge, annotation by known biomarkers and gene expression profiles alone cannot accurately and effectively identify the association between disease and cell function; 2) owing to the technical barriers of single-cell sequencing and the imbalance in cell numbers in endocrine tissues, identifying cell heterogeneity genes by analyzing only cell-function-related gene features is affected by this imbalance.
Cell heterogeneity function identification depends not only on gene feature extraction; the accuracy of cell classification can also be improved by addressing the imbalance of cell-associated genes, so that cell function can be identified effectively on the basis of endocrine disease pathway functions. The invention therefore provides a cell function identification method based on cell heterogeneity genes and pathway functions, which uses an improved SMOTE algorithm to amplify cell heterogeneity gene features and labels, uses a RAkEL multi-label classification model to identify cell heterogeneity genes, and analyzes the associated endocrine disease pathway functions to improve cell function identification.
Disclosure of Invention
The invention aims to solve the problem that the existing cell function recognition method has limitation, and further provides an endocrine disease cell recognition method based on cell heterogeneity genes and pathway functions.
The technical scheme adopted by the invention for solving the problems is as follows: the method comprises the following steps:
step 1, extracting cell associated gene characteristics;
step 2, amplifying cell-associated genes;
step 3, predicting a cell heterogeneity gene;
step 4, identifying the cell function of the endocrine disease.
Further, the step of extracting the cell-associated gene features in step 1 includes:
step 101, obtaining the significantly differentially expressed genes of each cell type from single-cell sequencing data, and matching these genes against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, using the p-value and log2FC value of the matched SNP as the differential expression information of the gene in the corresponding tissue, wherein the eQTL data are derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network.
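The feature encoding of steps 101 and 103 can be sketched as follows. This is a minimal illustration, not the implementation of the original filing: the function names, the padding value used when a gene has fewer than 5 matched sites, the four-pathway list (the text uses 343 pathways) and all example values are assumptions.

```python
def top_k_pvalues(pvalues, k=5, pad=1.0):
    """Return the k smallest (most significant) p-values, smallest first.

    Padding with `pad` when fewer than k sites are matched is an assumption;
    the text does not say how such genes are handled.
    """
    best = sorted(pvalues)[:k]
    return best + [pad] * (k - len(best))


def pathway_vector(gene_pathways, all_pathways):
    """Binary vector with one dimension per pathway: 1 if the gene is annotated to it."""
    members = set(gene_pathways)
    return [1 if pathway in members else 0 for pathway in all_pathways]


# Hypothetical inputs: a short pathway list stands in for the 343 KEGG pathways.
ALL_PATHWAYS = ["pathway_A", "pathway_B", "pathway_C", "pathway_D"]
snp_pvalues = [0.03, 1e-8, 0.2, 5e-4, 0.01, 1e-3, 0.5]
gene_pathways = ["pathway_B", "pathway_D"]

# Concatenate the mutation-site block and the pathway block into one feature vector.
features = top_k_pvalues(snp_pvalues) + pathway_vector(gene_pathways, ALL_PATHWAYS)
print(features)
```

With the full pathway list the pathway block would have 343 dimensions; the eQTL and subcellular localization blocks of steps 102 and 104 could be appended in the same way.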
Further, in step 2, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating the label combination of each sample as a new single label, and the specific steps include:
step 201, analyzing the distribution of gene label combinations, and selecting small-sample genes based on the imbalance ratio IR of the label combinations, wherein IR is defined by formula (1):
in formula (1), $L$ represents the label set, $L_1$ represents the 1st label, $|L|$ represents the number of labels, $N$ represents the number of genes, and $Y_i$ represents the label set corresponding to the $i$-th gene;
the mean of the IR over all labels, mean(IR), represents the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, selecting its set of k nearest neighbor nodes, the distance between gene feature vectors being measured by Euclidean distance; to generate the labels of a synthetic gene, counting the number of times each label appears in the small-sample gene and its neighbor nodes, and applying a set threshold to synthesize the new labels, the labels of the synthetic gene being expressed as:
$$\mathrm{Label}_{synthGene} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2),$$
in formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthetic gene is 1, and otherwise 0;
step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the synthetic sample features, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \quad (3),$$
in formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the feature vector of the small-sample gene, and $F_{ref}$ denotes the feature vector of the reference neighbor gene node;
step 204, selecting the amplification number of the small-sample genes by comparing the mean(IR) values and the similarity of the gene label and label-combination distributions under different amplification multiples, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
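Steps 201 to 203 can be sketched in a few lines of Python. This is a hedged illustration only: formula (1) is not reproduced in this text, so the per-label imbalance ratio below (count of the most frequent label divided by the count of the given label) is an assumed definition; the function names and toy data are hypothetical; and the feature synthesis follows formula (3) exactly as written, whereas conventional SMOTE interpolates toward the neighbor using (F_ref - F_seed).

```python
import math
import random


def imbalance_ratios(label_sets, labels):
    """Assumed IR: count of the most frequent label / count of this label."""
    counts = {l: sum(1 for y in label_sets if l in y) for l in labels}
    top = max(counts.values())
    return {l: (top / c if c else math.inf) for l, c in counts.items()}


def k_nearest(seed_idx, features, k):
    """Indices of the k genes closest to the seed by Euclidean distance."""
    seed = features[seed_idx]
    others = [i for i in range(len(features)) if i != seed_idx]
    others.sort(key=lambda i: math.dist(seed, features[i]))
    return others[:k]


def synth_labels(seed_idx, neighbors, label_sets, labels, threshold):
    """Formula (2): a label is 1 when its count in seed + neighbors exceeds threshold."""
    group = [label_sets[seed_idx]] + [label_sets[i] for i in neighbors]
    return [1 if sum(l in y for y in group) > threshold else 0 for l in labels]


def synth_features(f_seed, f_ref, rng):
    """Formula (3) as written: F_seed + r * (F_seed - F_ref)."""
    r = rng.random()  # drawn from [0, 1); the text specifies the open interval (0, 1)
    return [s + r * (s - t) for s, t in zip(f_seed, f_ref)]


# Hypothetical toy data: four genes, three cell-type labels, 2-D features.
labels = ["alpha", "beta", "delta"]
label_sets = [{"beta", "delta"}, {"beta", "delta"}, {"beta"}, {"alpha"}]
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

rng = random.Random(0)
ir = imbalance_ratios(label_sets, labels)
neighbors = k_nearest(0, features, k=2)
new_labels = synth_labels(0, neighbors, label_sets, labels, threshold=1)
reference = rng.choice(neighbors)  # step 203: random reference neighbor
new_features = synth_features(features[0], features[reference], rng)
print(ir, neighbors, new_labels, new_features)
```

In a full pipeline the seed genes would be those carrying a label with IR above 10, and step 204 would repeat the synthesis at several amplification multiples and compare the resulting mean(IR) values.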
Further, the step of predicting cell heterogeneity genes in step 3 includes:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which classifies one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $l_j$ in it receives a score; after all classifiers are trained, the final score of each label $l_j$ is calculated by taking the average of its scores; if the final score is above the threshold, the gene is considered to be functionally active in the $j$-th cell type $Cell_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, applying the trained classifiers to the test set data, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the endocrine disease genes by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
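The label grouping, per-subset prediction and vote averaging of steps 301 to 303 can be sketched as below. This is an illustration under stated assumptions rather than the classifier of the original filing: the base learner is a toy 1-nearest-neighbour label-powerset classifier chosen only to keep the example self-contained, the class and function names are hypothetical, and the toy data use k = 2 instead of the k = 5, m = 14 given in the text.

```python
import random
from collections import defaultdict


def partition_labels(labels, k, rng):
    """Step 301: shuffle the labels and split them into disjoint subsets of up to k labels."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]


class NearestNeighborLP:
    """Toy base classifier: 1-nearest-neighbour over one label subset (an assumption)."""

    def __init__(self, subset):
        self.subset = subset

    def fit(self, X, Y):
        self.X = X
        # The combination of subset labels carried by a sample acts as one class.
        self.classes = [frozenset(y & set(self.subset)) for y in Y]
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.classes[dists.index(min(dists))]


def rakel_predict(classifiers, x, threshold=0.5):
    """Steps 302-303: average each label's votes and keep labels above the threshold."""
    votes, seen = defaultdict(int), defaultdict(int)
    for clf in classifiers:
        predicted = clf.predict(x)
        for label in clf.subset:
            seen[label] += 1
            votes[label] += label in predicted
    return {l for l in seen if votes[l] / seen[l] > threshold}


# Hypothetical toy data: four genes, four cell-type labels.
labels = ["alpha", "beta", "delta", "gamma"]
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
Y = [{"alpha", "beta"}, {"alpha"}, {"delta", "gamma"}, {"delta"}]

subsets = partition_labels(labels, k=2, rng=random.Random(0))
classifiers = [NearestNeighborLP(s).fit(X, Y) for s in subsets]
print(rakel_predict(classifiers, [0.1, 0.1]))
```

Because the subsets here are disjoint, each label is scored by exactly one classifier; with overlapping subsets the same averaging would combine several votes per label.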
Further, after the cell-associated gene set is obtained in step 4, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analyses of the new and original cell gene sets is expressed as the difference set between two pathway sets, namely the pathway set obtained by enrichment analysis of the gene set identified by the invention in the $i$-th cell type and the pathway set obtained from the original gene set. New cell-associated pathways can thus be identified. Because cell function is ultimately embodied in the biological pathways associated with the cell, cell function identification is performed by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes; the pathway analysis is implemented using the enrichR package in the R language.
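The comparison in step 4 reduces to a per-cell-type set difference once the two enrichment results are available. The sketch below shows only that final step; the enrichment analysis itself (done in R according to the text) is not reproduced, and the function name, cell types and pathway names are hypothetical placeholders.

```python
def new_pathways_by_cell_type(new_enriched, original_enriched):
    """Pathways enriched for the new gene set but not for the original one, per cell type."""
    return {
        cell_type: set(pathways) - set(original_enriched.get(cell_type, ()))
        for cell_type, pathways in new_enriched.items()
    }


# Hypothetical enrichment results (cell type -> enriched pathway names).
original = {
    "beta": {"pathway_A"},
    "alpha": {"pathway_C"},
}
new = {
    "beta": {"pathway_A", "pathway_B"},
    "alpha": {"pathway_C"},
}

gained = new_pathways_by_cell_type(new, original)
print(gained)
```

An empty set for a cell type means the predicted genes added no newly enriched pathway for it.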
The beneficial effects of the invention are as follows: first, multiple gene features affecting cell function are extracted, and features of genes related to endocrine cell function are generated using a generative adversarial network; then, a SMOTE-based gene amplification method is provided, which amplifies the small-sample genes in the gene label combinations and thereby addresses the low classification accuracy caused by an excess of small-sample genes; finally, a RAkEL-based cell function identification method is provided, in which cell function is identified by analyzing the endocrine disease pathway functions associated with cell heterogeneity genes. Compared with existing similar methods, the method effectively improves the accuracy of identifying endocrine disease cell heterogeneity genes and cell function.
Drawings
FIG. 1 is a flow diagram of the present invention;
FIG. 2 is a diagram showing the comparison of the identification result of the present invention and the reference method.
Detailed Description
Embodiment 1: this embodiment is described with reference to figs. 1 and 2. The endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to this embodiment includes the following steps:
step 1, extracting cell-associated gene features;
step 2, amplifying cell-associated genes;
step 3, predicting cell heterogeneity genes;
step 4, identifying the cell function of the endocrine disease.
Embodiment 2: this embodiment is described with reference to figs. 1 and 2. In the endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to this embodiment, the step of extracting the cell-associated gene features in step 1 includes:
step 101, obtaining the significantly differentially expressed genes of each cell type from single-cell sequencing data, and matching these genes against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, using the p-value and log2FC value of the matched SNP as the differential expression information of the gene in the corresponding tissue, wherein the eQTL data are derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network.
Embodiment 3: this embodiment is described with reference to figs. 1 and 2. In step 2 of the endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to this embodiment, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating the label combination of each sample as a new single label, and the specific steps include:
step 201, analyzing the distribution of gene label combinations, and selecting small-sample genes based on the imbalance ratio IR of the label combinations, wherein IR is defined by formula (1):
in formula (1), $L$ represents the label set, $L_1$ represents the 1st label, $|L|$ represents the number of labels, $N$ represents the number of genes, and $Y_i$ represents the label set corresponding to the $i$-th gene;
the mean of the IR over all labels, mean(IR), represents the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, selecting its set of k nearest neighbor nodes, the distance between gene feature vectors being measured by Euclidean distance; to generate the labels of a synthetic gene, counting the number of times each label appears in the small-sample gene and its neighbor nodes, and applying a set threshold to synthesize the new labels, the labels of the synthetic gene being expressed as:
$$\mathrm{Label}_{synthGene} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2),$$
in formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthetic gene is 1, and otherwise 0;
step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the synthetic sample features, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \quad (3),$$
in formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the feature vector of the small-sample gene, and $F_{ref}$ denotes the feature vector of the reference neighbor gene node;
step 204, selecting the amplification number of the small-sample genes by comparing the mean(IR) values and the similarity of the gene label and label-combination distributions under different amplification multiples, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
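Formula (1) of step 201 is not reproduced in this text. One definition consistent with the symbols listed for it (the label set $L$, the gene count $N$ and the per-gene label sets $Y_i$) is the per-label imbalance ratio commonly used with multi-label oversampling. It is given here as an assumption, not as the formula of the original filing, and the indicator $h$ is a helper symbol introduced for illustration:

```latex
\mathrm{IR}(l) \;=\;
  \frac{\max_{l' \in L} \sum_{i=1}^{N} h(l', Y_i)}
       {\sum_{i=1}^{N} h(l, Y_i)},
\qquad
h(l, Y_i) =
  \begin{cases}
    1, & l \in Y_i \\
    0, & \text{otherwise}
  \end{cases}
\qquad
\mathrm{mean(IR)} \;=\; \frac{1}{|L|} \sum_{l \in L} \mathrm{IR}(l)
```

Under this reading, IR(l) equals 1 for the most frequent label and grows as a label becomes rarer, which agrees with the criterion that a label with IR(l) > 10 is a small-sample label.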
Embodiment 4: this embodiment is described with reference to figs. 1 and 2. In the endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to this embodiment, the step of predicting cell heterogeneity genes in step 3 includes:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which classifies one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $l_j$ in it receives a score; after all classifiers are trained, the final score of each label $l_j$ is calculated by taking the average of its scores; if the final score is above the threshold, the gene is considered to be functionally active in the $j$-th cell type $Cell_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, applying the trained classifiers to the test set data, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the endocrine disease genes by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
Embodiment 5: this embodiment is described with reference to figs. 1 and 2. In the endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to this embodiment, after the cell-associated gene set is obtained in step 4, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analyses of the new and original cell gene sets is expressed as the difference set between two pathway sets, namely the pathway set obtained by enrichment analysis of the gene set identified by the invention in the $i$-th cell type and the pathway set obtained from the original gene set. New cell-associated pathways can thus be identified. Because cell function is ultimately embodied in the biological pathways associated with the cell, cell function identification is performed by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes; the pathway analysis is implemented using the enrichR package in the R language.
The present invention is not limited to the preferred embodiments described above; modifications and variations in detail, and other embodiments obtained through modifications and equivalent substitutions, fall within the spirit and scope of the present invention.

Claims (5)

1. An endocrine disease cell identification method based on cell heterogeneity genes and pathway functions, characterized in that the method comprises the following steps:
step 1, extracting cell-associated gene features;
step 2, amplifying cell-associated genes;
step 3, predicting cell heterogeneity genes;
step 4, identifying the cell function of the endocrine disease.
2. The endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to claim 1, characterized in that the step of extracting the cell-associated gene features in step 1 comprises:
step 101, obtaining the significantly differentially expressed genes of each cell type from single-cell sequencing data, and matching these genes against disease mutation site data and tissue-specific differential expression data according to the position information of the genes, wherein for each gene the p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
step 102, using the p-value and log2FC value of the matched SNP as the differential expression information of the gene in the corresponding tissue, wherein the eQTL data are derived from pancreatic, adipose, blood and muscle tissue;
step 103, representing the KEGG pathway associations of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network.
3. The endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to claim 1, characterized in that in step 2 an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts a multi-label classification problem into a single-label classification problem by treating the label combination of each sample as a new single label, and the specific steps comprise:
step 201, analyzing the distribution of gene label combinations, and selecting small-sample genes based on the imbalance ratio IR of the label combinations, wherein IR is defined by formula (1):
in formula (1), $L$ represents the label set, $L_1$ represents the 1st label, $|L|$ represents the number of labels, $N$ represents the number of genes, and $Y_i$ represents the label set corresponding to the $i$-th gene;
the mean of the IR over all labels, mean(IR), represents the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, selecting its set of k nearest neighbor nodes, the distance between gene feature vectors being measured by Euclidean distance; to generate the labels of a synthetic gene, counting the number of times each label appears in the small-sample gene and its neighbor nodes, and applying a set threshold to synthesize the new labels, the labels of the synthetic gene being expressed as:
$$\mathrm{Label}_{synthGene} = \{L_1, L_2, \ldots, L_{|L|}\} \quad (2),$$
in formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthetic gene is 1, and otherwise 0;
step 203, randomly selecting one neighbor node as the reference neighbor gene for generating the synthetic sample features, and synthesizing the gene feature $F_{syn}$ by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \quad (3),$$
in formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the feature vector of the small-sample gene, and $F_{ref}$ denotes the feature vector of the reference neighbor gene node;
step 204, selecting the amplification number of the small-sample genes by comparing the mean(IR) values and the similarity of the gene label and label-combination distributions under different amplification multiples, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
4. The endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to claim 1, characterized in that the step of predicting cell heterogeneity genes in step 3 comprises:
step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
step 302, constructing m binary base classifiers, each of which classifies one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $l_j$ in it receives a score; after all classifiers are trained, the final score of each label $l_j$ is calculated by taking the average of its scores; if the final score is above the threshold, the gene is considered to be functionally active in the $j$-th cell type $Cell_j$; otherwise, the gene is considered to have no obvious functional manifestation in that cell type;
step 303, applying the trained classifiers to the test set data, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the endocrine disease genes by cell type, wherein the RAkEL parameters are k = 5 and m = 14.
5. The endocrine disease cell identification method based on cell heterogeneity genes and pathway functions according to claim 1, characterized in that after the cell-associated gene set is obtained in step 4, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway-based enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analyses of the new and original cell gene sets is expressed as the difference set between two pathway sets, namely the pathway set obtained by enrichment analysis of the gene set identified by the invention in the $i$-th cell type and the pathway set obtained from the original gene set. New cell-associated pathways can thus be identified. Because cell function is ultimately embodied in the biological pathways associated with the cell, cell function identification is performed by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes; the pathway analysis is implemented using the enrichR package in the R language.
CN202311177280.9A 2023-09-13 2023-09-13 Endocrine disease cell identification method based on cell heterogeneity gene and pathway function Pending CN117316295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311177280.9A CN117316295A (en) 2023-09-13 2023-09-13 Endocrine disease cell identification method based on cell heterogeneity gene and pathway function

Publications (1)

Publication Number Publication Date
CN117316295A true CN117316295A (en) 2023-12-29

Family

ID=89261267

Country Status (1)

Country Link
CN (1) CN117316295A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2239205A1 (en) * 1998-05-29 1999-11-29 Gabrielle Boulianne Extension of lifespan by overexpression of a gene that increases reactive oxygen metabolism
CN106874706A (en) * 2017-01-18 2017-06-20 湖南大学 Disease association factor identification method and system based on functional module
CN113470743A (en) * 2021-07-16 2021-10-01 哈尔滨星云医学检验所有限公司 Differential gene analysis method based on BD single cell transcriptome and proteome sequencing data
WO2022198761A1 (en) * 2021-03-22 2022-09-29 江苏大学 Asthma diagnosis system based on decision tree and improved smote algorithms
CN115148286A (en) * 2022-06-24 2022-10-04 山东大学 Cancer collaborative driving module identification system based on single cell data
CN115427585A (en) * 2020-02-20 2022-12-02 居里研究所 Method for identifying functional disease-specific regulatory T cells
CN115798593A (en) * 2022-12-02 2023-03-14 中国科学院深圳先进技术研究院 Single cell identification method and equipment based on graph neural network self-supervision clustering
CN116564410A (en) * 2023-05-23 2023-08-08 浙江大学 Method, equipment and medium for predicting mutation site cis-regulatory gene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZANG, TIANYI et al.: "Identification of Alzheimer's Disease-Related Genes Based on Data Integration Method", vol. 9, no. 703, 13 February 2019 (2019-02-13), pages 1 - 6 *
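YAO Chen; LI Hongdong; GUO Zheng: "Reproducibility of DNA methylation marker detection in renal cell carcinoma and its correlation with gene expression changes", 生物信息学 (Chinese Journal of Bioinformatics), no. 02, 15 June 2011 (2011-06-15), pages 102 - 105 *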

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination