CN116486911A - Processing method and system for respiratory disease data - Google Patents
- Publication number
- CN116486911A (CN202211277916.2A)
- Authority
- CN
- China
- Prior art keywords
- cell
- pathway
- data
- pas
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a method, a system, a device and a computer-readable storage medium for processing respiratory disease data. The method comprises the following steps: acquiring single-cell sequencing data to be analyzed; processing the single-cell sequencing data with a machine learning method to obtain a pathway activity score (PAS) matrix of the cell pathways and the PAS of each cell pathway; acquiring genetic association data for a respiratory disease and processing the genetic association data to obtain SNP-annotated pathway data; performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimation coefficients; multiplying the estimation coefficients by the PAS and summing the products to obtain the genetically related pathway activity score gPAS of each cell; and outputting the genetically related pathway activity score gPAS.
Description
Technical Field
The invention relates to the technical field of gene sequencing, in particular to a method and a system for processing respiratory disease data.
Background
Respiratory diseases are common, frequently occurring diseases whose lesions mainly affect the trachea, bronchi, lungs and chest. Mild cases present with cough, chest pain and impaired breathing; severe cases present with dyspnea, hypoxia and even respiratory failure. Respiratory diseases rank third among causes of death in cities and first in rural areas. Because of air pollution, smoking, population aging and other factors, changes in the incidence and mortality of chronic obstructive pulmonary disease (a collective term for chronic bronchitis, emphysema and pulmonary heart disease), bronchial asthma, lung cancer, diffuse interstitial pulmonary fibrosis, pulmonary infection and other diseases, at home and abroad, deserve particular attention.
Coronaviruses are a large virus family known to cause illnesses ranging from the common cold to more serious diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). Common signs of coronavirus infection include respiratory symptoms, fever, cough, shortness of breath and dyspnea. In more severe cases, the infection can lead to pneumonia, severe acute respiratory syndrome, renal failure and even death. Many symptoms of coronavirus-induced disease can be treated, so treatment should follow the clinical condition of the patient. Supportive care for infected persons can also be very effective. Self-protection measures include maintaining basic hand and respiratory hygiene and adhering to safe eating habits. Understanding the effects of host genetic components on the immune response to severe infection helps to develop effective vaccines and therapies to control pandemics of related respiratory diseases. With the rapid development of sequencing technology, single-cell sequencing offers broader opportunities for revealing the mechanisms of these respiratory diseases.
Using single-cell RNA sequencing (scRNA-seq) to identify the key cell subpopulations associated with complex diseases or traits is critical to understanding the mechanisms of those diseases. However, the high cost and low throughput of scRNA-seq prevent large-scale sequencing: most single-cell studies currently include no more than 20 samples, which limits statistical power and makes it impossible to accurately reveal the risk subsets associated with a disease or trait within a cell subpopulation. In addition, scRNA-seq data are characterized by high sparsity, technical noise and unstable variance at the gene level. Genetic association data, such as those from genome-wide association studies (GWAS), are widely used to study complex diseases or traits, and linking scRNA-seq data with the phenotype-associated genetic information of GWAS from large-scale samples is considered a practical and effective way to reveal the genetic molecular mechanisms of complex diseases or traits at single-cell resolution.
Existing methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases, such as LDSC-SEG, MAGMA and RolyPoly, require extensive parameter tuning, depend on annotating cell types with known marker genes, and largely ignore the heterogeneity within each cell type. Furthermore, the prior art can identify genes with high expression levels, but excessive focus on highly expressed genes may underestimate the functional role of genes whose expression is relatively low yet important for revealing cell fate.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a method and a system for processing respiratory disease data. By means of a scoring method based on single-cell pathways, the method combines scRNA-seq data with genetic association data to infer the genes, cells and the like related to respiratory diseases, and mines the biological regularities underlying single-cell sequencing data to determine the potential relationships between genes, cells, cell subpopulations, biological pathways and the like and respiratory diseases.
The application discloses a processing method of respiratory disease data, comprising the following steps:
acquiring single cell sequencing sequence data to be analyzed;
processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain SNP-annotated pathway data;
performing statistical analysis processing on PAS of the cell pathway and the SNP-annotated pathway data to obtain an estimation coefficient;
multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
outputting the genetically related pathway activity score gPAS.
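The last two steps above (multiplying the estimation coefficients by the PAS and summing) can be sketched as follows. This is a minimal illustration only: the function name `gpas_per_cell`, the pathways-by-cells array layout and the toy numbers are assumptions made for the example and are not taken from the application.

```python
import numpy as np

def gpas_per_cell(tau: np.ndarray, pas: np.ndarray) -> np.ndarray:
    """Sum coefficient-weighted pathway activity scores over pathways.

    tau, pas: arrays of shape (n_pathways, n_cells); entry [i, j] refers to
    pathway i in cell j (layout assumed for this sketch).
    Returns one gPAS value per cell, shape (n_cells,).
    """
    if tau.shape != pas.shape:
        raise ValueError("tau and pas must have the same shape")
    # Multiply each estimation coefficient by its PAS, then sum over pathways.
    return (tau * pas).sum(axis=0)

# Toy input: 2 pathways x 2 cells (hypothetical values).
tau = np.array([[0.5, 0.0],
                [1.0, 2.0]])
pas = np.array([[2.0, 4.0],
                [1.0, 3.0]])
scores = gpas_per_cell(tau, pas)
print(scores)
```

Keeping the coefficients and scores in arrays of identical shape makes the per-cell sum a single vectorized operation, which matters when the number of cells is large.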
The step of performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient comprises the following steps:
obtaining genetic effect values of all SNPs in single path data based on the path data with the SNPs annotation;
based on a polygene regression model of genetic association data of respiratory diseases, carrying out parameter estimation on the distribution of the genetic effect values based on the PAS and the genetic effect values to obtain estimation coefficients;
optionally, in the formula for obtaining the genetic effect values, $\beta$ represents the theoretical effect-size vector of the m SNPs, $\varepsilon$ represents the random environmental error, $R$ represents the LD matrix, and $X^{T}$ represents the standardized genotypes of the SNPs in the genetic association data sample;
optionally, in the formula for obtaining the estimation coefficient, $\tau_{i,j}$ represents the estimation coefficient of pathway i in cell j, $\tau_{0}$ represents the intercept term, $\sigma^{2}$ represents the variance of the SNP effect sizes in the pathway, and a further term represents the weighted PAS.
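The genetically related pathway activity score gPAS of cell j ($gP_{j}$) is obtained by the following formula:
$gP_{j} = \sum_{i} \hat{\tau}_{i,j} \cdot \mathrm{PAS}_{i,j}$
wherein $\hat{\tau}_{i,j}$ is the optimized estimation coefficient of pathway i in cell j and $\mathrm{PAS}_{i,j}$ is the PAS of pathway i in cell j.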
The method further comprises the steps of: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening N trait-related genes;
optionally, the correlation analysis and ranking comprises: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes according to the correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked in descending or ascending order of correlation;
optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB.
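The PCC-based ranking described above can be illustrated with a short sketch. The function name `rank_genes_by_pcc`, the genes-by-cells layout, the handling of constant genes (assigned a correlation of 0) and the toy gene names are assumptions for illustration and do not come from the application.

```python
import numpy as np

def rank_genes_by_pcc(expr, gpas, gene_names, n_top):
    """Rank genes by Pearson correlation between expression and gPAS.

    expr: (n_genes, n_cells) expression matrix; gpas: (n_cells,) scores.
    Returns the n_top (gene, PCC) pairs in descending order of PCC.
    """
    expr_c = expr - expr.mean(axis=1, keepdims=True)   # center each gene
    g_c = gpas - gpas.mean()                           # center the gPAS
    denom = np.sqrt((expr_c ** 2).sum(axis=1) * (g_c ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        # Genes with zero variance get PCC 0 (an assumption of this sketch).
        pcc = np.where(denom > 0, (expr_c @ g_c) / denom, 0.0)
    order = np.argsort(-pcc, kind="stable")[:n_top]
    return [(gene_names[i], float(pcc[i])) for i in order]

# Toy data: 3 hypothetical genes measured in 4 cells.
expr = np.array([[1.0, 2.0, 3.0, 4.0],    # rises with gPAS
                 [4.0, 3.0, 2.0, 1.0],    # falls with gPAS
                 [1.0, 1.0, 1.0, 1.0]])   # constant
gpas = np.array([0.1, 0.2, 0.3, 0.4])
top = rank_genes_by_pcc(expr, gpas, ["GENE_A", "GENE_B", "GENE_C"], n_top=2)
print(top)
```

Selecting the last N genes in ascending order would follow the same pattern with the sort direction reversed.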
The method further comprises the steps of: calculating a trait-related score TRS for each cell according to the N trait-related genes; clustering according to the trait-related score TRS and the single-cell-level P value to obtain trait-related cells associated with respiratory diseases of different levels of severity;
optionally, calculating the trait-related score TRS from the N trait-related genes using a cell scoring method;
optionally, the different levels of severity of the respiratory disease include mild, moderate and severe.
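The application does not specify which cell scoring method is used. As one possible reading, the sketch below scores each cell by the mean expression of the trait-related genes minus the mean expression of randomly drawn control genes. The control-gene subtraction, the function name `trait_related_score` and all parameters are assumptions for illustration only.

```python
import numpy as np

def trait_related_score(expr, gene_idx, n_control=None, seed=0):
    """Per-cell score: mean of target genes minus mean of control genes.

    expr: (n_genes, n_cells) expression matrix.
    gene_idx: row indices of the trait-related genes.
    n_control: number of control genes drawn from the remaining genes
               (defaults to as many as there are target genes, if available).
    """
    rng = np.random.default_rng(seed)
    gene_idx = np.asarray(gene_idx)
    background = np.setdiff1d(np.arange(expr.shape[0]), gene_idx)
    if background.size == 0:
        raise ValueError("no genes left to use as controls")
    if n_control is None:
        n_control = min(gene_idx.size, background.size)
    control_idx = rng.choice(background, size=n_control, replace=False)
    return expr[gene_idx].mean(axis=0) - expr[control_idx].mean(axis=0)

# Toy data: 4 genes x 3 cells; genes 0 and 1 are the hypothetical target set.
expr = np.array([[2.0, 0.0, 4.0],
                 [4.0, 0.0, 2.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0]])
trs = trait_related_score(expr, gene_idx=[0, 1])
print(trs)
```

Subtracting a control set is a common way to reduce the influence of overall expression level on a gene-set score; other scoring schemes would fit the same step.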
The method further comprises the steps of: obtaining the trait-related cell types or subpopulations based on the block bootstrap method.
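The application names the block bootstrap but does not detail how it is applied. The sketch below shows the general technique under stated assumptions: a vector of per-unit values for one cell type (hypothetical here) is split into contiguous blocks, blocks are resampled with replacement, and the spread of the resampled means gives a standard error and a one-sided normal-approximation P value. The function name, block count and test statistic are all illustrative choices.

```python
import math
import numpy as np

def block_bootstrap_mean(values, n_blocks=10, n_boot=200, seed=0):
    """Block bootstrap of the mean of `values`.

    Returns (estimate, standard_error, one_sided_p), where the P value tests
    mean > 0 using a normal approximation (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_blocks = min(n_blocks, values.size)
    blocks = np.array_split(values, n_blocks)      # contiguous blocks
    estimate = values.mean()
    boot = np.empty(n_boot)
    for b in range(n_boot):
        picks = rng.integers(0, n_blocks, size=n_blocks)
        boot[b] = np.concatenate([blocks[k] for k in picks]).mean()
    se = boot.std(ddof=1)
    if se == 0:
        return estimate, se, float("nan")
    z = estimate / se
    p = 0.5 * math.erfc(z / math.sqrt(2))          # upper-tail normal P value
    return estimate, se, p

# Hypothetical per-unit values for one cell type.
values = np.linspace(0.5, 1.5, 50)
estimate, se, p = block_bootstrap_mean(values)
print(estimate, se, p)
```

Resampling whole blocks rather than individual values preserves dependence between neighbouring units, which is the reason for choosing a block bootstrap over an ordinary one.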
The method further comprises the steps of: ranking the genetically related pathway activity scores gPAS, and determining the trait-related pathways by statistical significance according to the ranking results and the cell-type-level P values of the pathways.
Use of a product for detecting a new CD8+ T cell subpopulation in the manufacture of a product for diagnosing a respiratory disease.
A device for processing respiratory disease data, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is configured to invoke the program instructions, which when executed, are configured to perform the above-described method of processing respiratory disease data.
A system for processing respiratory disease data, comprising:
an acquisition unit for acquiring single-cell sequencing sequence data to be analyzed;
the first processing unit is used for processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
the second processing unit is used for acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain SNP-annotated pathway data;
a third processing unit for performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
a fourth processing unit for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
and an output unit for outputting the genetically related pathway activity score gPAS.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of processing respiratory disease data described above.
The application has the following beneficial effects:
1. The application discloses a method for processing respiratory disease data that combines single-cell sequencing data with genetic association data. It can infer, in greater depth and across more dimensions, the genes, cells, cell subpopulations, related biological pathways and the like associated with respiratory diseases. Understanding the influence of host genetic components on the immune response to severe infection helps to develop effective vaccines and treatments to control disease pandemics and contributes to research on respiratory diseases. The method is a scoring method based on single-cell pathways and has a strong capability for finding disease-risk cell types: it integrates the functional effects of different genes participating in the same biological pathway to obtain a stable cell state, and markedly increases statistical power, biological interpretability and reproducibility of results. It overcomes the limitation of relying on known annotated cell types and can discover new genetically related subpopulations and the key genes or pathways of cell types, so it is widely applicable and highly practical. For example, the scheme can prioritize a new gene-driven CD8+ T cell subpopulation that may play an important role in mediating the immune response in patients with severe respiratory diseases.
2. The application discloses a scoring method based on single-cell pathways that uses a polygenic regression model to reveal trait-related genes and cell subpopulations from pathway-activity-transformed scRNA-seq data and genetic association study data. It addresses the problem that small sample sizes and high sparsity in scRNA-seq data greatly hinder the identification of genes and cell subpopulations related to the polygenic risk of complex diseases, which limits statistical power and prevents the risk subsets related to diseases or traits from being accurately revealed within cell subpopulations. The method mines the biological regularities underlying single-cell sequencing data and analyzes multiple dimensions, such as the relationship between population genetic variation and disease and single-cell gene abundance information, which greatly improves the accuracy and depth of data analysis.
3. Based on large-scale simulated and real data, the method combines scRNA-seq data with genetic association data, and thereby overcomes the problems of the prior art, which requires extensive parameter tuning in order to annotate cell types with known marker genes and largely ignores the heterogeneity within each cell type. The functional role of genes whose expression is relatively low but which are important for revealing cell fate is no longer underestimated through excessive focus on highly expressed genes; by aggregating the effects of genes with low average expression, the method helps to identify disease-related early developmental events or progenitor cells, for example through key transcription factors related to cell development. At the same time, the sparsity and technical noise of scRNA-seq data are effectively reduced, and the method is robust and effective in identifying trait-related cell types and subpopulations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an analytical schematic flow chart of a method for processing respiratory disease data provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a device for processing respiratory disease data according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a processing system for respiratory disease data provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of obtaining the gPAS by the scoring method based on the single-cell pathway and of using the gPAS to output the TRS, the trait-related genes, the trait-related cells, the trait-related cell types/subpopulations and the trait-related pathways according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments according to the invention without any creative effort, are within the protection scope of the invention.
Fig. 1 is a schematic flow chart of a method for processing respiratory disease data according to an embodiment of the present invention, specifically, the method includes the following steps:
101: acquiring single cell sequencing sequence data to be analyzed;
in one embodiment, the single cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from humans (Homo sapiens) and mice (Mus musculus). For blood cells, two scRNA-seq datasets based on human BMMCs (n = 35,582 cells) and human PBMCs (n = 97,039 cells) were collected to reveal trait-related cell subsets or types. For diseases/traits related to immunity/metabolism, the scRNA-seq dataset from the Human Cell Landscape (HCL, n = 513,707 cells in 35 adult tissues) was used to construct a pseudo-bulk expression profile for each tissue and to prioritize the risk tissues related to each disease/trait.
In one example, to discover immune cell populations associated with severe respiratory disease, a large-scale PBMC scRNA-seq dataset (n = 469,453 cells) was collected, containing 254 peripheral blood samples with varying respiratory disease severity (mild n = 109 samples, moderate n = 102 samples, severe n = 50 samples) and 16 healthy controls. Optionally, the single cell sequencing sequence data to be analyzed include single cell sequencing sequence data of healthy controls and of respiratory diseases of varying severity grades.
102: processing single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
in one embodiment, the step of processing single cell sequencing sequence data to be analyzed by a machine learning method to obtain a PAS scoring matrix of a cell pathway and obtaining PAS of the cell pathway comprises the steps of:
acquiring pathway data of respiratory diseases;
standardizing the gene-cell matrix in the single-cell sequencing sequence data to obtain a standardized gene-cell matrix; specifically, the sparse gene-cell matrix in the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, yielding the normalized expression of a single gene in a single cell; the normalization formula is: wherein $a_{g,j}$ represents the raw expression of gene g in cell j, and $e_{g,j}$ represents the normalized expression of gene g in cell j;
based on the pathway data of respiratory diseases, converting the standardized gene-cell matrix into a pathway-cell matrix by using a machine learning method, and obtaining a PAS scoring matrix of the cell-pathway by using the pathway-cell matrix, wherein the PAS scoring matrix comprises pathway activity scoring PAS of single cells in a single pathway;
in one embodiment, the pathway data are KEGG pathway data: pathways from the KEGG database are used as the default gene sets for evaluating PAS, and the standardized gene-cell matrix is converted into a pathway-cell matrix by singular value decomposition (SVD); let $P_i$ denote the gene set of pathway i; for each pathway i, a submatrix $A_i$ is selected from the standardized gene-cell matrix A, wherein $A_i$ spans all N cells and the $|P_i|$ genes of the pathway gene set $P_i$ (an $N \times |P_i|$ matrix), and its SVD is $A_i = U \Sigma V^T$, wherein U represents an $N \times N$ orthogonal matrix, $\Sigma$ represents an $N \times |P_i|$ rectangular diagonal matrix whose entries are all zero except on the main diagonal, and $V^T$ represents a $|P_i| \times |P_i|$ orthogonal matrix; the t-th column vector $v_t$ of the right orthogonal matrix V represents the t-th principal component, reflecting the co-expression variability of the pathway genes in the single cell data; since the first principal component PC1 captures the greatest variance, the projection of the features of cell j onto PC1 represents the PAS $s_{i,j}$ of pathway i in cell j; for cell j, the original PAS $s_{i,j}$ is adjusted using all the expression variances in pathway i as weights; for gene g in pathway i, the gene expression $e_{g,j}$ is rescaled by min-max scaling to give the adjusted gene expression $\tilde{e}_{g,j}$;
In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS;
the weighted PAS is obtained by the following formula:
wherein $\tilde{s}_{i,j}$ represents the weighted PAS, $\tilde{e}_{g,j}$ represents the optimized normalized expression of gene g in cell j, and $s_{i,j}$ represents the pathway activity score PAS of pathway i in cell j;
in one embodiment, $\tilde{e}_{g,j}$ is obtained by min-max scaling: $\tilde{e}_{g,j} = \frac{e_{g,j} - \mathrm{MIN}(e_{g,j})}{\mathrm{MAX}(e_{g,j}) - \mathrm{MIN}(e_{g,j})}$,
wherein $e_{g,j}$ represents the normalized expression of gene g in cell j, $\mathrm{MAX}(e_{g,j})$ represents the maximum value of gene expression in pathway i, and $\mathrm{MIN}(e_{g,j})$ represents the minimum value of gene expression in pathway i.
Optionally, the machine learning method includes singular value decomposition (SVD); the SVD method greatly improves the computational efficiency of analyzing sparse matrices and can obtain the eigenvalues without computing the covariance matrix; the standardized gene-cell matrix is reduced to a pathway-cell matrix in a low-dimensional space by singular value decomposition.
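The conversion from a gene-cell matrix to a pathway-cell matrix can be illustrated with a minimal NumPy sketch. It assumes a dense cells-by-genes array, column-centers each pathway submatrix before the decomposition (centering is an assumption, not stated in the text), and omits the variance weighting and min-max adjustment. The function name `pathway_activity_scores` and the toy pathway names are hypothetical.

```python
import numpy as np

def pathway_activity_scores(expr, pathways):
    """Return pathway names and a (n_pathways, n_cells) matrix of PC1 projections.

    expr     : (n_cells, n_genes) normalized expression matrix
    pathways : dict mapping pathway name -> list of gene column indices
    """
    names = sorted(pathways)
    pas = np.zeros((len(names), expr.shape[0]))
    for row, name in enumerate(names):
        a_i = expr[:, pathways[name]]                # N x |P_i| submatrix
        a_i = a_i - a_i.mean(axis=0, keepdims=True)  # centering (assumption)
        # Thin SVD: a_i = U @ diag(S) @ Vt, singular values in descending order
        _, _, vt = np.linalg.svd(a_i, full_matrices=False)
        pas[row] = a_i @ vt[0]                       # projection of each cell onto PC1
    return names, pas

# Toy data: 50 cells, 12 genes, two hypothetical pathways
rng = np.random.default_rng(0)
expr = rng.random((50, 12))
names, pas = pathway_activity_scores(
    expr, {"pathway_a": [0, 1, 2, 3], "pathway_b": [4, 5, 6, 7, 8]}
)
```

The sign of a principal component is arbitrary, so in practice the scores of a pathway may need to be oriented (for example, so that they correlate positively with mean pathway expression) before they are compared across pathways.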
103: acquiring genetic association data of the respiratory tract diseases, and processing the genetic association data to obtain path data with SNPs annotation;
in one embodiment, the genetic association data for respiratory disease comprises genetic association data for severe respiratory disease;
in one embodiment, the step of processing the genetic association data to obtain pathway data with SNPs annotations comprises:
screening from the genetic association data to obtain SNPs of a single gene, and mapping the SNPs of the single gene into corresponding paths based on path data of respiratory diseases to obtain path data with SNPs annotation;
optionally, the SNPs of a single gene are obtained by the following steps: after the SNPs of the genes in the genetic association data are obtained, SNP-gene pairs are assigned to obtain an assignment result;
where a single SNP corresponds to several genes, each resulting SNP-gene pair is treated as an independent pair; SNPs with a minor allele frequency (MAF) greater than 0.1 in the assignment result are retained; SNPs on sex chromosomes are deleted; the SNPs of a single gene are thereby obtained;
the SNPs of the single genes are collected to obtain the SNPs of all genes. Specifically, the genetic association data are GWAS data, and SNPs in the GWAS summary statistics are assigned to related genes using 20 kb as the default window parameter; the symbol g(k) denotes the gene g to which SNP k is assigned, and the SNP-gene pair assignment may map a single SNP to several genes; because the whole process infers parameters from thousands of SNPs, SNPs that correspond to several genes have no appreciable effect on the inference, so these repeated assignments are treated as independent SNP-gene pairs; SNPs with a minor allele frequency (MAF) greater than 0.1 are retained and SNPs on sex chromosomes are deleted, finally giving the SNPs of the related genes;
genes with associated SNPs are annotated into pathways based on the pathways in the KEGG database, and the set of SNPs in pathway i is denoted $S_i = \{k : g(k) \in P_i\}$; linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using the phase 3 data of the 1000 Genomes Project; the present scheme also provides functional gene sets such as GO, Reactome, and MSigDB as alternative options. In addition, the major histocompatibility complex region, chr6:25-35 Mbp, where extensive LD exists, is deleted.
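The SNP filtering and pathway annotation described above can be sketched in plain Python. The record layout (dicts with `id`, `chrom`, `pos`, `maf`), the gene coordinates, and the function name `annotate_pathways` are hypothetical; the thresholds (20 kb window, MAF greater than 0.1, removal of sex chromosomes and of chr6:25-35 Mbp) are taken from the text.

```python
SEX_CHROMS = {"X", "Y"}
MHC_REGION = ("6", 25_000_000, 35_000_000)  # chr6:25-35 Mbp, removed because of extensive LD

def annotate_pathways(snps, genes, pathways, window=20_000, maf_min=0.1):
    """Map SNPs to pathways and return {pathway: sorted SNP ids}.

    snps     : list of dicts with keys id, chrom, pos, maf
    genes    : dict gene -> (chrom, start, end)
    pathways : dict pathway -> set of gene names
    """
    pairs = []  # each (SNP, gene) pair is kept as an independent assignment
    for snp in snps:
        if snp["maf"] <= maf_min:
            continue  # keep only SNPs with MAF > 0.1
        if snp["chrom"] in SEX_CHROMS:
            continue  # drop SNPs on sex chromosomes
        if snp["chrom"] == MHC_REGION[0] and MHC_REGION[1] <= snp["pos"] <= MHC_REGION[2]:
            continue  # drop the MHC region
        for gene, (chrom, start, end) in genes.items():
            if chrom == snp["chrom"] and start - window <= snp["pos"] <= end + window:
                pairs.append((snp["id"], gene))
    return {
        name: sorted({snp_id for snp_id, gene in pairs if gene in gene_set})
        for name, gene_set in pathways.items()
    }

# Toy example with hypothetical identifiers and coordinates
snps = [
    {"id": "rs1", "chrom": "1", "pos": 105_000, "maf": 0.25},
    {"id": "rs2", "chrom": "1", "pos": 125_000, "maf": 0.30},     # within 20 kb of two genes
    {"id": "rs3", "chrom": "1", "pos": 145_000, "maf": 0.05},     # fails the MAF filter
    {"id": "rs4", "chrom": "X", "pos": 105_000, "maf": 0.40},     # sex chromosome
    {"id": "rs5", "chrom": "6", "pos": 30_000_000, "maf": 0.40},  # MHC region
]
genes = {
    "GENE_A": ("1", 100_000, 110_000),
    "GENE_B": ("1", 140_000, 150_000),
    "GENE_C": ("6", 29_990_000, 30_010_000),
    "GENE_X": ("X", 100_000, 110_000),
}
pathways = {"pathway_1": {"GENE_A"}, "pathway_2": {"GENE_B", "GENE_C", "GENE_X"}}
snp_sets = annotate_pathways(snps, genes, pathways)
```

The nested loop is quadratic in the number of SNPs and genes; for real GWAS summary statistics an interval index would be the natural replacement.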
In one embodiment, the GWAS data are for a given phenotype, and the given phenotype may be a dichotomous trait, a continuous trait, or an endophenotype or intermediate measurement.
104: performing statistical analysis processing on PAS of the cell pathway and pathway data with SNP annotation to obtain an estimation coefficient;
in one embodiment, the step of statistically analyzing PAS of the cellular pathway and pathway data annotated with SNP to obtain the estimated coefficients comprises:
obtaining genetic effect values of all SNPs in single path data based on path data with SNPs annotation;
based on a polygene regression model of genetic association data of respiratory diseases, carrying out parameter estimation on the distribution of genetic effect values based on PAS and the genetic effect values to obtain estimation coefficients;
optionally, the genetic effect value is obtained by the following formula: wherein $\beta$ represents the vector of theoretical effect sizes of the m SNPs, $\varepsilon$ represents the random environmental error, R represents the LD matrix, and $X^T$ represents the standardized genotypes of the SNPs in the genetic association data sample;
optionally, the estimation coefficient is obtained by the following formula:
wherein $\tau_{i,j}$ represents the estimation coefficient of pathway i in cell j, which reflects the effect of the cell-specific PAS on the variance of the GWAS effect sizes, i.e., the influence of genetics on the trait; $\tau_0$ represents the intercept term, $\sigma^2$ represents the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ represents the weighted PAS.
In one embodiment, $S_i$ denotes the set of all SNPs located in the genes of pathway i, and the polygenic model assumes that the a priori effect sizes of all SNPs in pathway i follow a multivariate normal distribution with covariance $\sigma^2 I$, wherein $\sigma^2$ represents the variance of the SNP effect sizes in the pathway and I represents the $|S_i| \times |S_i|$ identity matrix;
in one embodiment, based on the preceding assumption, the distribution of the genetic effect values is estimated using the following formula: and the estimation coefficients are optimized using this formula;
in one embodiment, to optimize the estimation coefficient of each pathway in the polygenic regression model, a method-of-moments approach, which significantly improves computational efficiency and estimation consistency, is adopted; the observed and expected squared effects of the SNPs associated with each pathway are then fitted, and the expected values are estimated by the following formula: where Tr denotes the trace of a matrix.
105: multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
in one embodiment, the genetically related pathway activity score gPAS ($gP_j$) is a respiratory disease related gPAS, obtained by the following formula:
wherein $\hat{\tau}_{i,j}$ is the optimized estimation coefficient.
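Step 105 (multiply each estimation coefficient by the corresponding PAS, then sum over pathways) can be written as a short NumPy sketch. The array layout (pathways by cells) and the function name `gpas_per_cell` are assumptions made for illustration; whether the raw or the weighted PAS enters the product is left to the caller.

```python
import numpy as np

def gpas_per_cell(tau, pas):
    """Sum over pathways of coefficient * PAS, giving one gPAS value per cell.

    tau : (n_pathways, n_cells) estimation coefficients
    pas : (n_pathways, n_cells) pathway activity scores
    """
    tau = np.asarray(tau, dtype=float)
    pas = np.asarray(pas, dtype=float)
    if tau.shape != pas.shape:
        raise ValueError("tau and pas must have the same shape")
    return (tau * pas).sum(axis=0)

# Two pathways, two cells (toy numbers)
tau = np.array([[0.5, 1.0],
                [2.0, 0.0]])
pas = np.array([[1.0, 3.0],
                [0.25, 4.0]])
scores = gpas_per_cell(tau, pas)  # cell 0: 0.5*1.0 + 2.0*0.25; cell 1: 1.0*3.0 + 0.0*4.0
```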
In one embodiment, the method further comprises: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain N trait-related genes related to respiratory diseases; specifically, to maximize statistical power, the expression of each gene g is inversely weighted by its gene-specific technical noise level, which is estimated by modeling the mean-variance relationship across genes in the scRNA-seq data;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order; N is not limited to 1000 and may be any natural number. The N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB;
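The PCC-based ranking can be sketched as follows. This is a simplified illustration that omits the inverse weighting by technical noise; the function name `rank_genes_by_pcc` and the toy gene names are hypothetical, and genes with zero variance are given a correlation of 0 by convention.

```python
import numpy as np

def rank_genes_by_pcc(expr, gpas, gene_names, top_n=1000):
    """Rank genes by Pearson correlation between their expression and gPAS.

    expr : (n_cells, n_genes) expression matrix
    gpas : (n_cells,) genetically related pathway activity scores
    """
    expr = np.asarray(expr, dtype=float)
    gpas = np.asarray(gpas, dtype=float)
    xc = expr - expr.mean(axis=0)          # center each gene
    yc = gpas - gpas.mean()                # center gPAS
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        pcc = np.where(denom > 0, (xc * yc[:, None]).sum(axis=0) / denom, 0.0)
    order = np.argsort(-pcc, kind="stable")[:top_n]  # descending correlation
    return [(gene_names[i], float(pcc[i])) for i in order]

# Toy data: 4 cells, 3 genes
gpas = [1.0, 2.0, 3.0, 4.0]
expr = np.array([
    [2.0, 4.0, 1.0],
    [4.0, 3.0, 1.0],
    [6.0, 2.0, 1.0],
    [8.0, 1.0, 1.0],
])
ranked = rank_genes_by_pcc(expr, gpas, ["GENE_A", "GENE_B", "GENE_C"], top_n=3)
```

Taking the head of the returned list gives the most positively correlated genes; reversing the sort order would give the most negatively correlated ones.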
in one embodiment, the method further comprises: calculating a trait-related score TRS for each cell from the N trait-related genes; and clustering according to the TRS and the single-cell-level P value to obtain trait-related cells associated with respiratory diseases of different severity levels; the trait-related genes are significantly enriched in the following trait-related cells: Hay bone marrow naive T16 cells, lung naive CD8+ T cells, liver NKT cells, and brain naive T-like cells; the TRS is obtained by the formula: TRS = average RE(GS) - average RE(CG); wherein average RE(GS) is the average relative expression of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression of a control gene set of the same size randomly drawn from the existing gene library; RE denotes relative expression, GS the gene set, and CG the control gene set;
optionally, the different levels of severity of respiratory disease include mild, moderate and severe.
Optionally, the trait-related score TRS of the N genes is calculated using the cell scoring method of the AddModuleScore function in Seurat.
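The formula TRS = average RE(GS) - average RE(CG) can be illustrated with a simplified NumPy sketch. It is not a reimplementation of Seurat's AddModuleScore: here the control genes are drawn uniformly at random from all genes outside the gene set, which is an assumption made for brevity, and the function name `trait_related_score` is hypothetical.

```python
import numpy as np

def trait_related_score(rel_expr, gene_set, seed=0):
    """Per-cell TRS: mean expression of the gene set minus mean of random controls.

    rel_expr : (n_cells, n_genes) relative expression matrix
    gene_set : column indices of the N trait-related genes
    """
    rel_expr = np.asarray(rel_expr, dtype=float)
    gene_set = np.asarray(gene_set)
    pool = np.setdiff1d(np.arange(rel_expr.shape[1]), gene_set)  # candidate controls
    if len(pool) < len(gene_set):
        raise ValueError("not enough genes outside the gene set to draw controls")
    rng = np.random.default_rng(seed)
    control = rng.choice(pool, size=len(gene_set), replace=False)
    return rel_expr[:, gene_set].mean(axis=1) - rel_expr[:, control].mean(axis=1)

# Toy data: 3 cells, 6 genes; genes 0 and 1 form the trait-related set,
# and all remaining genes have relative expression 1.0 in every cell.
rel_expr = np.array([
    [3.0, 5.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
])
trs = trait_related_score(rel_expr, [0, 1])
```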
In one embodiment, the method further comprises: obtaining the cell types or subpopulations related to the trait of the respiratory disease based on the block bootstrap method, and determining whether the cell type of a single cell is trait-related. The trait-related cell types or subpopulations (associated with severe respiratory disease) include one or more of the following: naive CD8+ T cells, megakaryocytes, and CD16+ monocytes; genes highly expressed in the naive CD8+ T cells include memory effector marker genes (GZMK, AQP3, GZMA, PRF1, and GNLY) and exhaustion effector marker genes (LAG3, TIGIT, GZMA, GZMB, PRDM1, and IFNG); specifically, a group of cells is treated as a pseudo-bulk transcriptome profile by averaging the gene expression across the cells of a given cell type; for the cell type association, the standard error is estimated with the block bootstrap method, and a t statistic with its corresponding P value is calculated for each cell type. Because the goal of the bootstrap is to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to divide the genome into multiple biologically meaningful blocks, and these pathway-based blocks are sampled with replacement. Under the default parameters, 200 iterations are performed for each cell type association analysis, and the default parameters may be modified for a particular run.
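The resampling scheme can be sketched as below. Whole pathway blocks are drawn with replacement and a statistic is recomputed on each resample; the statistic used here (the mean of all values in the resampled blocks) is a placeholder chosen for illustration, not the regression coefficient of the method, and the function name `block_bootstrap_se` is hypothetical. The default of 200 iterations follows the text.

```python
import numpy as np

def block_bootstrap_se(block_values, n_iter=200, seed=0):
    """Estimate, standard error, and t statistic via a pathway-block bootstrap.

    block_values : list of 1-D arrays, one array of per-SNP values per pathway block
    """
    rng = np.random.default_rng(seed)
    n_blocks = len(block_values)
    stats = np.empty(n_iter)
    for b in range(n_iter):
        # Draw whole blocks with replacement so within-block structure is preserved
        picked = rng.integers(0, n_blocks, size=n_blocks)
        stats[b] = np.concatenate([block_values[k] for k in picked]).mean()
    estimate = np.concatenate(block_values).mean()
    se = stats.std(ddof=1)
    t_stat = estimate / se if se > 0 else float("inf")
    return estimate, se, t_stat

# Three hypothetical pathway blocks of unequal size
blocks = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]
estimate, se, t_stat = block_bootstrap_se(blocks)
```

Fixing the seed makes the standard error reproducible between runs, which is convenient when the number of iterations is changed from its default.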
In one embodiment, the method further comprises: ranking the genetically related pathway activity scores gPAS, and obtaining the trait-related pathways associated with the respiratory disease from the ranking result (the top-ranked pathways are selected) and the P value of each pathway at the cell type level; the trait-related pathways include ribosome, T cell receptor signaling pathway, primary immunodeficiency, natural killer cell mediated cytotoxicity, and platelet activation.
Specifically, the gPAS are ranked based on the central limit theorem; let $C_t$ denote cell type t; the pathway percent rank for each cell j within $C_t$ is calculated using the following formula: wherein the rank term is the gPAS rank of pathway i in cell j, and M represents the total number of pathways; similarly, the statistical significance $T_i^t$ of each pathway i in cell type t is calculated using the following formula: the hypotheses are $H_0: T_i^t = 0$ vs. $H_1: T_i^t > 0$; the P value of each pathway i in cell type t is:
In one embodiment, the statistical significance of individual cells is determined by calculating the rank distribution of the trait-related genes, to further evaluate whether each cell is significantly associated with the trait of interest; specifically, the percent rank of each trait-related gene in the cell is obtained, wherein $r_{g,j}$ denotes the expression rank of gene g in cell j and G denotes the number of genes associated with the specified trait; the gene percent ranks follow a uniform distribution U(0, 1), and under the null hypothesis that the gene percent ranks are uncorrelated, a statistic $T_j$ is obtained for each cell by the following formula:
The distribution of $T_j$ is derived from the central limit theorem based on the number of cells in the single cell data: wherein N is the total number of cells; the hypotheses of the significance test are $H_0: T_j = 0$ vs. $H_1: T_j > 0$; the P value of each cell j is $p_j = \Pr(T_j \le t)$.
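The idea of testing percent ranks against their uniform null distribution can be illustrated with a generic sketch. The exact expression for $T_j$ is not reproduced in the text, so the standardized mean used below is an assumption: under U(0, 1) each percent rank has mean 1/2 and variance 1/12, so the mean of G independent ranks is approximately normal with mean 1/2 and variance 1/(12G). The upper-tail P value is chosen to match the one-sided alternative $H_1: T_j > 0$; the function name `cell_rank_statistic` is hypothetical.

```python
import math

def cell_rank_statistic(percent_ranks):
    """Standardized mean percent rank of one cell and its upper-tail normal P value."""
    g = len(percent_ranks)
    mean_rank = sum(percent_ranks) / g
    # Mean of G independent U(0, 1) variables: expectation 1/2, variance 1/(12 G)
    t = (mean_rank - 0.5) / math.sqrt(1.0 / (12.0 * g))
    p_upper = 0.5 * math.erfc(t / math.sqrt(2.0))  # Pr(Z >= t) for standard normal Z
    return t, p_upper

# A cell whose trait-related genes sit exactly at the median rank
t_null, p_null = cell_rank_statistic([0.5] * 12)
# A cell whose 12 trait-related genes are all ranked near the top
t_high, p_high = cell_rank_statistic([0.9] * 12)
```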
106: outputting a genetically related pathway activity score gPAS;
Application: use of a product for detecting a new naive CD8+ T cell subpopulation in the manufacture of a product for diagnosing a respiratory disease; the new naive CD8+ T cell subpopulation is a newly discovered subpopulation functionally associated with respiratory disease.
FIG. 2 is a schematic diagram of a device for processing respiratory disease data provided by an embodiment of the present invention; the device includes a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke the program instructions, which, when executed, perform the above-described method of processing respiratory disease data.
FIG. 3 is a schematic flow chart of a system for processing respiratory disease data provided by an embodiment of the present invention, the system comprising:
an acquisition unit 301 for acquiring single-cell sequencing sequence data to be analyzed;
a first processing unit 302, configured to process single-cell sequencing sequence data to be analyzed by using a machine learning method, so as to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
a second processing unit 303, configured to obtain genetic association data of respiratory diseases, and process the genetic association data to obtain path data with SNPs annotations;
a third processing unit 304, configured to perform statistical analysis processing on PAS of the cell pathway and pathway data with SNP annotation, to obtain an estimation coefficient;
a fourth processing unit 305 for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
an output unit 306 for outputting the genetically related pathway activity score gPAS.
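The division into units 301 to 306 can be sketched as a small Python class that wires six callables together in the order of the method steps. The class name, field names, and toy callables are hypothetical and only illustrate the data flow between the units; they are not the described implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RespiratoryDataSystem:
    acquire: Callable[[], Any]                        # acquisition unit 301
    compute_pas: Callable[[Any], Any]                 # first processing unit 302
    annotate_snps: Callable[[], Any]                  # second processing unit 303
    estimate_coefficients: Callable[[Any, Any], Any]  # third processing unit 304
    compute_gpas: Callable[[Any, Any], Any]           # fourth processing unit 305
    output: Callable[[Any], Any]                      # output unit 306

    def run(self) -> Any:
        data = self.acquire()
        pas = self.compute_pas(data)
        pathway_snps = self.annotate_snps()
        tau = self.estimate_coefficients(pas, pathway_snps)
        return self.output(self.compute_gpas(tau, pas))

# Toy wiring with placeholder callables
system = RespiratoryDataSystem(
    acquire=lambda: [1.0, 2.0],
    compute_pas=lambda data: [2.0 * x for x in data],
    annotate_snps=lambda: {"pathway_1": ["rs1"]},
    estimate_coefficients=lambda pas, snps: [0.5] * len(pas),
    compute_gpas=lambda tau, pas: sum(t * s for t, s in zip(tau, pas)),
    output=lambda gpas: gpas,
)
result = system.run()
```

Keeping each unit as an injectable callable mirrors the statement that the units are a logical division and may be combined or distributed differently in an actual implementation.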
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of processing respiratory disease data described above.
FIG. 4 is a schematic overview of the single-cell pathway-based scoring method provided by an embodiment of the present invention, which obtains the gPAS and uses it to output the TRS, the trait-related genes, the trait-related cells, the trait-related cell types/subpopulations, and the trait-related pathways;
wherein A represents converting the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 representing the PAS of each pathway; B represents annotating the SNPs in the GWAS data into the corresponding pathways; C represents the polygenic regression model, wherein the top panel shows estimating the coefficient of each pathway with the polygenic regression model and then calculating the gPAS from the estimation coefficients and the corresponding PAS, and the bottom panel shows a Pearson correlation model that correlates the gPAS of each cell with the gene expression of all single cells so as to rank the trait-related genes; the top N trait-related genes (top 1,000 by default) are selected, and the AddModuleScore function in Seurat is used to calculate a trait-related score TRS for each cell; D represents the output, comprising four outputs: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.
The verification results of this embodiment show that assigning an inherent weight to an indication may moderately improve the performance of the present method relative to the default settings.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
While the foregoing describes a computer device provided by the present invention in detail, those skilled in the art will appreciate that the foregoing description is not meant to limit the invention thereto, as long as the scope of the invention is defined by the claims appended hereto.
Claims (10)
1. A method of processing respiratory disease data, comprising:
acquiring single cell sequencing sequence data to be analyzed;
processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain path data with SNPs annotation;
performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
outputting the genetically related pathway activity score gPAS.
2. The method of claim 1, wherein the step of statistically analyzing the PAS of the cellular pathway and the pathway data with SNP annotations to obtain estimated coefficients comprises:
obtaining genetic effect values of all SNPs in single path data based on the path data with the SNPs annotation;
based on a polygene regression model of genetic association data of respiratory diseases, carrying out parameter estimation on the distribution of the genetic effect values based on the PAS and the genetic effect values to obtain estimation coefficients;
optionally, the genetic effect value is obtained by the following formula: wherein $\beta$ represents the vector of theoretical effect sizes of the m SNPs, $\varepsilon$ represents the random environmental error, R represents the LD matrix, and $X^T$ represents the standardized genotypes of the SNPs in the genetic association data sample;
optionally, the estimation coefficient is obtained by the following formula:
wherein $\tau_{i,j}$ represents the estimation coefficient of pathway i in cell j, $\tau_0$ represents the intercept term, $\sigma^2$ represents the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ represents the weighted PAS.
3. The method of claim 1, wherein the genetically related pathway activity score gPAS ($gP_j$) is obtained by the following formula:
wherein $gP_j$ is the gPAS and $\hat{\tau}_{i,j}$ is the optimized estimation coefficient.
4. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening out N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by this correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order;
optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB.
5. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: calculating a trait-related score TRS for each cell from the N trait-related genes; and clustering according to the trait-related score TRS and the single-cell-level P value to obtain trait-related cells associated with respiratory diseases of different severity levels;
optionally, the trait-related score TRS of the N trait-related genes is calculated using a cell scoring method;
optionally, the different levels of severity of the respiratory disease include mild, moderate and severe.
6. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: obtaining a trait-related cell type or subpopulation based on the block bootstrap method;
optionally, the method further comprises: and sequencing the genetic related pathway activity scores gPAS, and obtaining the trait related pathway according to the sequencing result and the P value of the pathway on the cell type level.
7. Use of a product for detecting a new cell subpopulation in the preparation of a product for diagnosing a respiratory disease.
8. A device for processing respiratory disease data, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is adapted to invoke program instructions for performing the method of processing respiratory disease data according to any of claims 1-6 when the program instructions are executed.
9. A system for processing respiratory disease data, comprising:
an acquisition unit for acquiring single-cell sequencing sequence data to be analyzed;
the first processing unit is used for processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
the second processing unit is used for acquiring genetic association data of the respiratory tract diseases, and processing the genetic association data to obtain path data with SNPs annotation;
a third processing unit for performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
a fourth processing unit for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
and an output unit for outputting the genetically related pathway activity score gPAS.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of processing respiratory disease data according to any of the preceding claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211277916.2A CN116486911A (en) | 2022-10-19 | 2022-10-19 | Processing method and system for respiratory disease data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211277916.2A CN116486911A (en) | 2022-10-19 | 2022-10-19 | Processing method and system for respiratory disease data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486911A true CN116486911A (en) | 2023-07-25 |
Family
ID=87225583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211277916.2A Pending CN116486911A (en) | 2022-10-19 | 2022-10-19 | Processing method and system for respiratory disease data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486911A (en) |
-
2022
- 2022-10-19 CN CN202211277916.2A patent/CN116486911A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||