CN116486911A - Processing method and system for respiratory disease data - Google Patents

Processing method and system for respiratory disease data

Info

Publication number
CN116486911A
Authority
CN
China
Prior art keywords
cell
pathway
data
pas
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211277916.2A
Other languages
Chinese (zh)
Inventor
马云龙
苏建忠
邓春玉
瞿佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Medical University
Original Assignee
Wenzhou Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Medical University filed Critical Wenzhou Medical University
Priority to CN202211277916.2A priority Critical patent/CN116486911A/en
Publication of CN116486911A publication Critical patent/CN116486911A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method, a system, a device and a computer-readable storage medium for processing respiratory disease data. The method comprises: acquiring single-cell sequencing data to be analyzed; processing the single-cell sequencing data using a machine learning method to obtain a pathway activity score (PAS) matrix of the cell pathways and the PAS of each cell pathway; acquiring genetic association data for a respiratory disease and processing the genetic association data to obtain pathway data annotated with SNPs; performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients; multiplying the estimated coefficients by the PAS and summing to obtain a genetically associated pathway activity score (gPAS) for each cell; and outputting the gPAS.

Description

Processing method and system for respiratory disease data
Technical Field
The invention relates to the technical field of gene sequencing, in particular to a method and a system for processing respiratory disease data.
Background
Respiratory diseases are common and frequently occurring diseases whose lesions mainly involve the trachea, bronchi, lungs and thoracic cavity. Mild cases present with cough, chest pain and impaired breathing; severe cases present with dyspnea and hypoxia and may even die of respiratory failure. Respiratory diseases rank third among causes of death in cities and first in rural areas. More importantly, owing to air pollution, smoking, population aging and other factors, the incidence and mortality of chronic obstructive pulmonary disease (COPD for short, including chronic bronchitis, emphysema and pulmonary heart disease), bronchial asthma, lung cancer, diffuse pulmonary interstitial fibrosis, pulmonary infection and other diseases keep rising rather than falling, both at home and abroad.
Coronaviruses are a large family of viruses known to cause illnesses ranging from the common cold to more serious diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). Common signs of coronavirus infection are respiratory symptoms, fever, cough, shortness of breath, dyspnea, and the like. In more severe cases, the infection can lead to pneumonia, severe acute respiratory syndrome, renal failure, and even death. Many symptoms of coronavirus-induced diseases can be treated, so treatment should follow the clinical condition of the patient. In addition, supportive care for infected persons can be very effective. Self-protection measures include maintaining basic hand and respiratory hygiene and adhering to safe eating habits. Understanding the effect of host genetic components on the immune response to severe infection helps in developing effective vaccines and therapies to control pandemics of the related respiratory diseases. With the rapid development of sequencing technology, single-cell sequencing has created broader opportunities for revealing the mechanisms of these respiratory diseases.
Using single-cell RNA sequencing (scRNA-seq) to identify the key cell subpopulations associated with complex diseases or traits is critical to understanding the mechanisms of complex diseases. However, because of its high cost and low throughput, scRNA-seq cannot be applied at large scale, and most single-cell studies currently include no more than 20 samples. This limits statistical power and prevents the disease- or trait-associated risk subsets within a cell subpopulation from being revealed accurately. In addition, scRNA-seq data are characterized by high sparsity, technical noise and unstable variance at the gene level. Genetic association data such as genome-wide association studies (GWAS) are widely used to study complex diseases or traits, and linking scRNA-seq data with the phenotype-associated genetic information of GWAS from large samples is considered a practical and effective way to reveal the genetic and molecular mechanisms of complex diseases or traits at single-cell resolution.
Existing methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases, such as LDSC-SEG, MAGMA and RolyPoly, require extensive parameter tuning in order to annotate cell types with known marker genes, and they largely ignore the heterogeneity within each cell type. Furthermore, the prior art can identify genes with high expression levels, but excessive focus on highly expressed genes can underestimate the functional role of genes whose expression is relatively low yet important for revealing cell fate.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a method and a system for processing respiratory disease data. Using a scoring method based on single-cell pathways, the method combines scRNA-seq data with genetic association data to infer the genes, cells and the like associated with respiratory diseases, and mines the biological principles underlying single-cell sequencing data to determine the potential relationships between genes, cells, cell subpopulations, biological pathways and the like and respiratory diseases.
The application discloses a method for processing respiratory disease data, comprising the following steps:
acquiring single-cell sequencing data to be analyzed;
processing the single-cell sequencing data to be analyzed using a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
acquiring genetic association data for a respiratory disease, and processing the genetic association data to obtain pathway data annotated with SNPs;
performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients;
multiplying the estimated coefficients by the PAS and summing to obtain the genetically associated pathway activity score gPAS of each cell;
outputting the genetically associated pathway activity score gPAS.
The step of performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain the estimated coefficients comprises:
obtaining the genetic effect values of all SNPs in each single pathway based on the SNP-annotated pathway data;
using a polygenic regression model for the genetic association data of the respiratory disease, estimating the parameters of the distribution of the genetic effect values from the PAS and the genetic effect values to obtain the estimated coefficients;
optionally, in the formula for obtaining the genetic effect values, β denotes the vector of theoretical effect sizes of the m SNPs, ε denotes the random environmental error, R denotes the LD matrix, and $X^{T}$ denotes the transposed standardized genotypes of the SNPs in the genetic association data samples;
optionally, in the formula for obtaining the estimated coefficients,
$\tau_{i,j}$ denotes the estimated coefficient of pathway i in cell j, $\tau_0$ denotes the intercept term, $\sigma^2$ denotes the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ denotes the weighted PAS.
The genetically associated pathway activity score of cell j, $\mathrm{gPAS}_j$, is obtained as $\mathrm{gPAS}_j = \sum_i \hat{\tau}_{i,j}\,\tilde{s}_{i,j}$,
wherein $\hat{\tau}_{i,j}$ is the optimized estimated coefficient.
The method further comprises: performing correlation analysis between the genetically associated pathway activity score gPAS and the gene expression levels across cells, ranking the genes, and screening N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes by that correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order;
optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB.
The method further comprises: calculating a trait-related score (TRS) for each cell from the N trait-related genes; clustering according to the TRS and the single-cell-level P values to obtain trait-related cells associated with respiratory disease of different severity grades;
optionally, the TRS of the N trait-related genes is calculated using a cell scoring method;
optionally, the different severity grades of the respiratory disease include mild, moderate and severe.
The method further comprises: obtaining trait-related cell types or subpopulations based on a block bootstrap method.
The method further comprises: ranking the genetically associated pathway activity scores gPAS and, based on the ranking results and the cell-type-level P values of the pathways, obtaining trait-related pathways according to their statistical significance.
Use of a product for detecting a naïve CD8+ T cell subpopulation in the manufacture of a product for diagnosing a respiratory disease.
A device for processing respiratory disease data, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is configured to invoke the program instructions, which when executed, are configured to perform the above-described method of processing respiratory disease data.
A system for processing respiratory disease data, comprising:
an acquisition unit for acquiring single-cell sequencing sequence data to be analyzed;
a first processing unit for processing the single-cell sequencing data to be analyzed using a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of each cell pathway;
a second processing unit for acquiring genetic association data for the respiratory disease and processing the genetic association data to obtain pathway data annotated with SNPs;
a third processing unit for performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients;
a fourth processing unit for multiplying the estimated coefficients by the PAS and summing to obtain the genetically associated pathway activity score gPAS of each cell;
and an output unit for outputting the genetically related pathway activity score gPAS.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of processing respiratory disease data described above.
The application has the following beneficial effects:
1. The application discloses a method for processing respiratory disease data that combines single-cell sequencing data with genetic association data. It can infer the genes, cells, cell subpopulations, related biological pathways and the like associated with respiratory diseases in greater depth and across more dimensions. Understanding the effect of host genetic components on the immune response to severe infection helps in developing effective vaccines and therapies to control disease pandemics and contributes to research on respiratory diseases. The method is a scoring method based on single-cell pathways and has a strong ability to discover disease-risk cell types; it integrates the functional effects of the different genes participating in the same biological pathway to obtain a stable cell state, and markedly increases statistical power, biological interpretability and reproducibility of results. It overcomes the limitation of relying on known annotated cell types and can discover new genetically associated subpopulations and the key genes or pathways of cell types, so it is widely applicable and highly practical. For example, the scheme can prioritize a genetically driven naïve CD8+ T cell subpopulation that may play an important role in mediating the immune response in patients with severe respiratory disease.
2. The application discloses a scoring method based on single-cell pathways that uses a polygenic regression model, together with scRNA-seq data transformed into pathway activities and genetic association study data, to reveal trait-related genes and cell subpopulations. The method effectively addresses the problem that small sample sizes and high sparsity in scRNA-seq data greatly hinder the identification of genes and cell subpopulations related to the polygenic risk of complex diseases, which limits statistical power and prevents the disease- or trait-associated risk subsets within a cell subpopulation from being revealed accurately. The method mines the biological principles hidden in single-cell sequencing data and analyzes multiple dimensions in depth, such as the relationship between population genetic variation and disease and the gene abundance information of single-cell sequencing, thereby greatly improving the accuracy and depth of data analysis.
3. Based on large-scale simulated and real data, the method combines scRNA-seq data with genetic association data and thereby overcomes the prior-art problems of requiring extensive parameter tuning to annotate cell types with known marker genes and of largely ignoring the heterogeneity within each cell type. It avoids underestimating, through excessive focus on highly expressed genes, the functional role of genes whose expression is relatively low but which are important for revealing cell fate; by aggregating the effects of genes with low average expression, it helps identify disease-related early developmental events or progenitor cells, for example key transcription factors involved in cell development. At the same time, it effectively reduces the sparsity and technical noise of scRNA-seq data and is robust and powerful in identifying trait-related cell types and subpopulations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an analytical schematic flow chart of a method for processing respiratory disease data provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a device for processing respiratory disease data according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a processing system for respiratory disease data provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of obtaining the gPAS by the single-cell pathway-based scoring method and using the gPAS to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations and trait-related pathways, according to an embodiment of the invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments according to the invention without any creative effort, are within the protection scope of the invention.
Fig. 1 is a schematic flow chart of a method for processing respiratory disease data according to an embodiment of the present invention, specifically, the method includes the following steps:
101: acquiring single cell sequencing sequence data to be analyzed;
in one embodiment, the single-cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from human (Homo sapiens) and mouse (Mus musculus). For blood cells, two scRNA-seq datasets, based on human BMMCs (n = 35,582 cells) and human PBMCs (n = 97,039 cells), were collected to reveal trait-related cell subpopulations or types. For immune- or metabolism-related diseases/traits, the scRNA-seq dataset from the Human Cell Landscape (HCL, n = 513,707 cells in 35 adult tissues) was used to construct a pseudo-bulk expression profile for each tissue and to prioritize the tissues at risk for each disease/trait.
In one example, to discover immune cell populations associated with severe respiratory disease, a large-scale PBMC scRNA-seq dataset (n = 469,453 cells) was collected, containing 254 peripheral blood samples of varying respiratory disease severity (mild n = 109 samples, moderate n = 102 samples, severe n = 50 samples) and 16 healthy controls. Optionally, the single-cell sequencing data to be analyzed include single-cell sequencing data from healthy controls and from respiratory disease of different severity grades.
102: processing single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
in one embodiment, the step of processing the single-cell sequencing data to be analyzed using a machine learning method to obtain the PAS scoring matrix of the cell pathways and the PAS of each cell pathway comprises:
acquiring pathway data for respiratory diseases;
normalizing the gene-cell matrix in the single-cell sequencing data to obtain a normalized gene-cell matrix; specifically, the sparse gene-cell matrix in the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, giving the normalized expression of each single gene in each single cell; in the normalization formula, $a_{g,j}$ denotes the raw expression of gene g in cell j and $e_{g,j}$ denotes the normalized expression of gene g in cell j;
based on the pathway data for respiratory diseases, converting the normalized gene-cell matrix into a pathway-cell matrix using a machine learning method, and obtaining the PAS scoring matrix of the cell pathways from the pathway-cell matrix, wherein the PAS scoring matrix contains the pathway activity score PAS of each single cell in each single pathway;
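The normalization step above can be sketched as follows. The text names a variance-stabilizing transformation with a scale factor of 10,000 but its formula is not reproduced, so this sketch assumes the widely used library-size log-normalization, $e_{g,j} = \log(1 + 10{,}000 \cdot a_{g,j} / \sum_{g'} a_{g',j})$; the function name `log_normalize` and the toy matrix are hypothetical.

```python
import numpy as np


def log_normalize(counts, scale_factor=10_000):
    """Library-size normalise a gene x cell count matrix, then apply log(1 + x).

    Assumed form only: the source does not reproduce its normalization formula.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)  # total counts per cell (column)
    totals[totals == 0] = 1.0                   # leave empty cells at zero
    return np.log1p(counts / totals * scale_factor)


# Toy example: 3 genes (rows) x 2 cells (columns).
raw = np.array([[1, 0],
                [3, 5],
                [0, 5]])
norm = log_normalize(raw)
```

Zero counts stay at zero after the transformation, so the sparsity pattern of the matrix is preserved.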
in one embodiment, the pathway data are KEGG pathway data: pathways from the KEGG database are used as the default gene sets for evaluating the PAS, and the normalized gene-cell matrix is converted into a pathway-cell matrix by singular value decomposition (SVD). Let $P_i$ denote the gene set of pathway i. For each pathway i, a matrix $A_i$ is selected from the normalized gene-cell matrix A, wherein $A_i$ covers all N cells and the $|P_i|$ genes of the pathway gene set $P_i$, and its SVD is $A_i = U \Sigma V^{T}$, wherein U denotes an N × N orthogonal matrix, Σ denotes a diagonal matrix whose entries are all zero except on the main diagonal, and $V^{T}$ denotes a $|P_i| \times |P_i|$ orthogonal matrix. The t-th column vector $v_t$ of the right orthogonal matrix V represents the t-th principal component and reflects the co-expression variability of the pathway genes in the single-cell data. Because the first principal component PC1 captures the largest share of the variance, the projection of the features of cell j onto PC1 gives the PAS of pathway i, $s_{i,j}$. For cell j, the raw PAS $s_{i,j}$ is adjusted using all the expression variances in pathway i as weights. For gene g in pathway i, the gene expression $e_{g,j}$ is rescaled by min-max scaling to give the adjusted gene expression $\tilde{e}_{g,j}$.
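A minimal sketch of the PC1 projection described above, using NumPy's SVD. The function name, the toy data, and the gene-wise centring step are assumptions made for illustration; the text does not say whether the matrix is centred before decomposition. The sign of a principal component is arbitrary, so scores are defined only up to a global sign flip.

```python
import numpy as np


def pathway_activity_scores(expr, gene_names, pathway_genes):
    """Return one raw PAS per cell: the projection of each cell onto PC1.

    expr is a normalised gene x cell matrix; pathway_genes is a set of names.
    """
    idx = [i for i, g in enumerate(gene_names) if g in pathway_genes]
    sub = np.asarray(expr, dtype=float)[idx, :].T   # cells x pathway genes
    sub = sub - sub.mean(axis=0, keepdims=True)     # centre each gene (assumption)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    return sub @ vt[0]                              # project onto first right singular vector


rng = np.random.default_rng(0)
expr = rng.random((6, 5))                           # toy data: 6 genes x 5 cells
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pas = pathway_activity_scores(expr, genes, {"g1", "g3", "g4"})
```

Repeating the call for every pathway and stacking the results row by row gives a pathway-cell matrix of the kind the text describes.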
In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS;
in the formula for obtaining the weighted PAS,
$\tilde{s}_{i,j}$ denotes the weighted PAS, $\tilde{e}_{g,j}$ denotes the optimized normalized expression of gene g in cell j, and $s_{i,j}$ denotes the pathway activity score PAS of pathway i in cell j;
in one embodiment, $\tilde{e}_{g,j}$ is obtained as $\tilde{e}_{g,j} = \dfrac{e_{g,j} - \mathrm{MIN}(e_{g,j})}{\mathrm{MAX}(e_{g,j}) - \mathrm{MIN}(e_{g,j})}$,
wherein $e_{g,j}$ denotes the normalized expression of gene g in cell j, $\mathrm{MAX}(e_{g,j})$ denotes the maximum gene expression in pathway i, and $\mathrm{MIN}(e_{g,j})$ denotes the minimum gene expression in pathway i.
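The min-max rescaling of gene expression within a pathway can be illustrated with a short sketch. The helper name `min_max_scale` and the handling of a constant input (returning zeros) are assumptions; the text does not cover that edge case.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: the ratio is undefined, so return zeros (assumption).
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


# Normalised expression of three pathway genes in one cell (toy values).
scaled = min_max_scale([2.0, 4.0, 6.0])
```

After rescaling, the lowest-expressed gene of the pathway maps to 0 and the highest to 1, which puts genes with different dynamic ranges on a common scale before they are used as weights.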
Optionally, the machine learning method includes singular value decomposition (SVD). SVD greatly improves the computational efficiency of analyzing sparse matrices and yields the eigenvalues without computing the covariance matrix. Using SVD, the normalized gene-cell matrix is reduced to a pathway-cell matrix in a low-dimensional space.
103: acquiring genetic association data for the respiratory disease, and processing the genetic association data to obtain pathway data annotated with SNPs;
in one embodiment, the genetic association data for the respiratory disease comprise genetic association data for severe respiratory disease;
in one embodiment, the step of processing the genetic association data to obtain pathway data annotated with SNPs comprises:
screening the genetic association data to obtain the SNPs of each single gene, and mapping the SNPs of each single gene to the corresponding pathways based on the pathway data for respiratory diseases, to obtain the pathway data annotated with SNPs;
optionally, the SNPs of a single gene are obtained by the following steps: after the SNPs of the genes in the genetic association data are obtained, assigning SNP-gene pairs to obtain assignment results;
where a single SNP corresponds to multiple genes, treating each duplicated SNP-gene pair as an independent SNP-gene association; retaining the SNPs with minor allele frequency (MAF) greater than 0.1 in the assignment results; deleting the SNPs on the sex chromosomes; and thereby obtaining the SNPs of a single gene;
the SNPs of the single genes are pooled to obtain the SNPs of all genes. Specifically, the genetic association data are GWAS data, and the SNPs in the GWAS summary statistics are assigned to their associated genes with 20 kb as the default window parameter. The notation g(k) denotes the gene g to which SNP k is assigned. Through this SNP-gene pair assignment, some single SNPs correspond to multiple genes. Because the whole procedure infers its parameters from thousands of SNPs, the SNPs that correspond to multiple genes have no material effect on the inference, and the duplicated pairs are therefore treated as independent SNP-gene associations. SNPs with MAF greater than 0.1 are retained and SNPs on the sex chromosomes are deleted, finally giving the SNPs of the relevant genes;
based on the pathways in the KEGG database, the genes with their associated SNPs are annotated to the pathways, and the set of SNPs in pathway i is denoted $S_i = \{k : g(k) \in P_i\}$; linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using the Phase 3 data of the 1000 Genomes Project; this scheme also provides functional gene sets such as GO, Reactome and MSigDB as alternative options. In addition, the major histocompatibility complex region (chr6: 25-35 Mbp), where extensive LD exists, is deleted.
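The filtering and SNP-to-pathway annotation described above can be sketched in plain Python. The thresholds (MAF greater than 0.1, a 20 kb window, removal of sex chromosomes and of chr6: 25-35 Mbp) come from the text; everything else is hypothetical, including the `Snp` and `Gene` records, the function names, the toy coordinates, and the assumption that the 20 kb window extends on both sides of the gene body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    maf: float


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


WINDOW = 20_000                       # 20 kb default window (both sides: assumption)
MHC = ("6", 25_000_000, 35_000_000)   # chr6: 25-35 Mbp region removed in the text


def keep_snp(snp, min_maf=0.1):
    """Apply the MAF, sex-chromosome and MHC filters."""
    if snp.chrom in {"X", "Y"} or snp.maf <= min_maf:
        return False
    chrom, lo, hi = MHC
    return not (snp.chrom == chrom and lo <= snp.pos <= hi)


def snps_per_pathway(snps, genes, pathways):
    """Return {pathway: set of rsids}; a SNP near several genes is kept for each gene."""
    gene_to_snps = {}
    for snp in filter(keep_snp, snps):
        for gene in genes:
            near = gene.start - WINDOW <= snp.pos <= gene.end + WINDOW
            if gene.chrom == snp.chrom and near:
                gene_to_snps.setdefault(gene.name, set()).add(snp.rsid)
    return {
        name: set().union(*(gene_to_snps.get(g, set()) for g in members))
        for name, members in pathways.items()
    }


snps = [
    Snp("rs1", "1", 105_000, 0.30),    # near genes A and B
    Snp("rs2", "1", 500_000, 0.30),    # too far from any gene
    Snp("rs3", "X", 100, 0.40),        # sex chromosome
    Snp("rs4", "6", 30_000_000, 0.40), # inside the MHC region
    Snp("rs5", "1", 110_000, 0.05),    # MAF too low
]
genes = [
    Gene("A", "1", 110_000, 120_000),
    Gene("B", "1", 90_000, 100_000),
    Gene("C", "6", 29_999_000, 30_001_000),
]
annotated = snps_per_pathway(snps, genes, {"p1": {"A", "C"}, "p2": {"B"}})
```

In this toy input only `rs1` survives the filters, and because it lies within 20 kb of both gene A and gene B it is annotated to both pathways, mirroring the treatment of duplicated SNP-gene pairs as independent associations.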
In one embodiment, the GWAS data are for a given phenotype, and the phenotypic annotation of the given phenotype includes dichotomous traits, continuous traits, or endophenotypes and intermediate measurements.
104: performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients;
in one embodiment, the step of performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain the estimated coefficients comprises:
obtaining the genetic effect values of all SNPs in each single pathway based on the SNP-annotated pathway data;
using a polygenic regression model for the genetic association data of the respiratory disease, estimating the parameters of the distribution of the genetic effect values from the PAS and the genetic effect values to obtain the estimated coefficients;
optionally, in the formula for obtaining the genetic effect values, β denotes the vector of theoretical effect sizes of the m SNPs, ε denotes the random environmental error, R denotes the LD matrix, and $X^{T}$ denotes the transposed standardized genotypes of the SNPs in the genetic association data samples;
optionally, in the formula for obtaining the estimated coefficients,
$\tau_{i,j}$ denotes the estimated coefficient of pathway i in cell j, which reflects the effect of the cell-specific PAS on the variance of the GWAS effect sizes, i.e. the strength of the genetic association; $\tau_0$ denotes the intercept term, $\sigma^2$ denotes the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ denotes the weighted PAS.
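For orientation, one way to relate the quantities β, ε, R and $X^{T}$ defined above is the standard summary-statistics argument sketched below. This is an assumption-based derivation and not necessarily the formula of the application: the sample size n, the phenotype vector y, and the linear model $y = X\beta + \varepsilon$ with standardized genotypes X do not appear in the text and are introduced only for illustration.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad R = n^{-1} X^{T} X \\
\hat{\beta} &= n^{-1} X^{T} y
             = n^{-1} X^{T} X \beta + n^{-1} X^{T} \varepsilon
             = R\beta + n^{-1} X^{T} \varepsilon
\end{aligned}
```

Under these assumptions the marginal GWAS estimates $\hat{\beta}$ are the true effects β smeared by the LD matrix R, plus a noise term that shrinks as the sample size grows.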
In one embodiment, $S_i$ denotes the set of all SNPs contained in the genes mapped to pathway i. The polygenic model assumes a priori that the effect sizes of all SNPs in pathway i follow a multivariate normal distribution, wherein $\sigma^2$ denotes the variance of the SNP effect sizes in the pathway and I denotes the $|S_i| \times |S_i|$ identity matrix;
in one embodiment, based on the foregoing assumption, the distribution of the genetic effect values is estimated by a formula, and that formula is used to optimize the estimated coefficients;
in one embodiment, to optimize the estimated coefficient of each pathway in the polygenic regression model, a method-of-moments approach, which significantly improves computational efficiency and estimation consistency, is used to fit the polygenic regression model; the observed and expected squared effects of the SNPs associated with each pathway are then fitted, and the expected value is estimated by a formula in which Tr denotes the matrix trace.
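To show why a matrix trace appears in a method-of-moments fit of this kind, the following worked derivation computes the expected squared effect under stated assumptions. It is a sketch, not the formula of the application: it assumes a zero-mean prior $\beta \sim \mathcal{N}(0, \sigma^2 I)$, marginal estimates of the form $\hat{\beta} = R\beta + n^{-1}X^{T}\varepsilon$ with $R = n^{-1}X^{T}X$, and independent noise $\varepsilon \sim \mathcal{N}(0, \sigma_e^2 I)$; the symbols n and $\sigma_e^2$ are introduced here for illustration, and all matrices are restricted to the SNPs in $S_i$.

```latex
\begin{aligned}
\mathbb{E}\!\left[\hat{\beta}\hat{\beta}^{T}\right]
  &= R\,\mathbb{E}\!\left[\beta\beta^{T}\right] R^{T}
     + n^{-2} X^{T}\,\mathbb{E}\!\left[\varepsilon\varepsilon^{T}\right] X
   = \sigma^{2} R R^{T} + \sigma_e^{2}\, n^{-1} R \\
\mathbb{E}\!\left[\hat{\beta}^{T}\hat{\beta}\right]
  &= \operatorname{Tr}\!\left(\mathbb{E}\!\left[\hat{\beta}\hat{\beta}^{T}\right]\right)
   = \sigma^{2}\operatorname{Tr}\!\left(R R^{T}\right)
     + \sigma_e^{2}\, n^{-1}\operatorname{Tr}(R)
\end{aligned}
```

Matching this expectation to the observed sum of squared GWAS effects for each pathway gives one moment equation per pathway, which is what makes the estimation cheap compared with maximizing a full likelihood.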
105: multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
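in one embodiment, the genetically related pathway activity score gPAS ($gP_j$) is a respiratory-disease-related gPAS, obtained by the formula: $gP_j = \sum_{i} \hat{\tau}_{i,j}\,\tilde{s}_{i,j}$,
wherein $\hat{\tau}_{i,j}$ is the optimized estimation coefficient of pathway i in cell j and $\tilde{s}_{i,j}$ is the weighted PAS of pathway i in cell j.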
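Step 105 (multiply each estimation coefficient by the corresponding PAS, then sum over pathways) can be sketched in a few lines of Python. The layout of the inputs as pathway-by-cell arrays and the name `gpas_per_cell` are assumptions made for illustration.

```python
import numpy as np


def gpas_per_cell(tau_hat, pas):
    """Sum over pathways of (estimation coefficient x PAS) for every cell.

    tau_hat, pas: arrays of shape (n_pathways, n_cells).
    Returns an array of length n_cells holding one gPAS per cell.
    """
    tau_hat = np.asarray(tau_hat, dtype=float)
    pas = np.asarray(pas, dtype=float)
    if tau_hat.shape != pas.shape:
        raise ValueError("tau_hat and pas must have the same shape")
    return (tau_hat * pas).sum(axis=0)
```

For two pathways and two cells with coefficients [[1, 2], [3, 4]] and PAS [[0.5, 1], [1, 0]], the per-cell scores are 1*0.5 + 3*1 = 3.5 and 2*1 + 4*0 = 2.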
In one embodiment, the method further comprises: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each single gene and gPAS by the Pearson correlation coefficient (PCC), and ranking the genes according to the correlation to obtain N trait-related genes associated with respiratory diseases; specifically, to maximize power, the expression of each gene g is inversely weighted by its gene-specific technical noise level, which is estimated by modeling the mean-variance relationship across genes in the scRNA-seq data;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order; N is not limited to 1000 and may be any natural number. The N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, RPS29, and GZMB;
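The correlation-and-ranking step can be sketched as follows. This is a simplified illustration in Python: it computes a plain Pearson correlation between each gene and gPAS and omits the inverse weighting by technical noise described above. The gene-by-cell layout of `expr` and the function name are assumptions.

```python
import numpy as np


def rank_genes_by_gpas(expr, gpas, gene_names, top_n=1000):
    """Rank genes by Pearson correlation between expression and gPAS.

    expr: (n_genes, n_cells) expression matrix; gpas: length n_cells.
    Returns the top_n (gene, PCC) pairs in descending order of PCC.
    Genes with zero variance get a correlation of 0.
    """
    expr = np.asarray(expr, dtype=float)
    gpas = np.asarray(gpas, dtype=float)
    xc = expr - expr.mean(axis=1, keepdims=True)   # center each gene
    yc = gpas - gpas.mean()                        # center gPAS
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        pcc = np.where(denom > 0, (xc @ yc) / denom, 0.0)
    order = np.argsort(-pcc, kind="stable")[:top_n]
    return [(gene_names[i], float(pcc[i])) for i in order]
```

Taking the last entries of the full ranking instead of the first gives the ascending-order variant mentioned in the text.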
in one embodiment, the method further comprises: calculating a trait-related score TRS of each cell according to the N trait-related genes; and clustering according to the trait-related score TRS and the single-cell-level P value to obtain trait-related cells associated with respiratory diseases of different levels of severity; the trait-related genes are significantly enriched in the following trait-related cells: hay bone marrow naïve T16 cells, lung naïve CD8+ T cells, liver NKT cells and brain naïve T-like cells; the trait-related score TRS is obtained by the formula: TRS = average RE(GS) - average RE(CG), wherein average RE(GS) is the average relative expression value of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression value of a control gene set of the same size randomly drawn from the existing gene library; RE denotes relative expression, GS the gene set, and CG the control gene set;
optionally, the different levels of severity of respiratory disease include mild, moderate and severe.
Optionally, the trait-related score TRS of the N genes is calculated using the cell scoring method of the AddModuleScore function in Seurat.
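The TRS formula above can be illustrated with a short Python sketch. It is a simplification and not a reimplementation of Seurat's AddModuleScore, which, as far as we are aware, draws control genes from expression-matched bins; here the control set is drawn uniformly at random from the remaining genes. The function name, the gene-by-cell layout and the `seed` parameter are assumptions.

```python
import numpy as np


def trait_relevant_score(rel_expr, gene_set_idx, seed=0):
    """TRS = average RE(GS) - average RE(CG), computed for every cell.

    rel_expr: (n_genes, n_cells) relative expression matrix.
    gene_set_idx: row indices of the N trait-related genes (GS).
    The control set (CG) has the same size as GS and is sampled without
    replacement from the genes outside GS.
    """
    rel_expr = np.asarray(rel_expr, dtype=float)
    gene_set_idx = np.asarray(gene_set_idx, dtype=int)
    pool = np.setdiff1d(np.arange(rel_expr.shape[0]), gene_set_idx)
    if pool.size < gene_set_idx.size:
        raise ValueError("not enough genes outside the set to draw controls")
    rng = np.random.default_rng(seed)
    control_idx = rng.choice(pool, size=gene_set_idx.size, replace=False)
    return rel_expr[gene_set_idx].mean(axis=0) - rel_expr[control_idx].mean(axis=0)
```

Cells with a high score express the trait-related gene set more strongly than a random background of the same size.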
In one embodiment, the method further comprises: obtaining, based on the block bootstrap method, the trait-related cell types or subpopulations associated with the respiratory disease, and determining whether each cell type is associated with the trait. The trait-related cell types or subpopulations (associated with severe respiratory disease) include one or more of the following: naïve CD8+ T cells, megakaryocytes, and CD16+ monocytes. Genes highly expressed in naïve CD8+ T cells include memory effector marker genes (GZMK, AQP3, GZMA, PRF1, and GNLY) and exhaustion effector marker genes (LAG3, TIGIT, GZMA, GZMB, PRDM1, and IFNG). Specifically, a group of cells is treated as a pseudo-bulk transcriptome profile, and the gene expression levels are averaged across the cells within a given cell type. For the associated cell types, the standard error is estimated with the block bootstrap method, and a t statistic with its corresponding P value is calculated for each cell type. Because the block bootstrap aims to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to partition the genome into multiple biologically meaningful blocks, and these pathway-based blocks are sampled with replacement. Under the default parameters, 200 iterations are performed for each cell type association analysis; the default parameters may be modified for a particular implementation.
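The block bootstrap step can be sketched in Python under a simplifying assumption: each pathway block has already been reduced to a single number (for example its contribution to the cell-type association), and the statistic of interest is the mean over blocks. A fuller implementation would refit the regression on each resample. The function name and the `block_values` input are hypothetical; the default of 200 iterations follows the text.

```python
import numpy as np


def block_bootstrap_se(block_values, n_iter=200, seed=0):
    """Estimate and bootstrap standard error of the mean over blocks.

    block_values: one value per pathway-based block.
    Blocks are resampled with replacement n_iter times, and the standard
    deviation of the resampled means is used as the standard error.
    """
    block_values = np.asarray(block_values, dtype=float)
    rng = np.random.default_rng(seed)
    n = block_values.size
    means = np.empty(n_iter)
    for b in range(n_iter):
        idx = rng.integers(0, n, size=n)     # sample blocks with replacement
        means[b] = block_values[idx].mean()
    return float(block_values.mean()), float(means.std(ddof=1))
```

A t statistic for the cell type is then the estimate divided by the standard error, provided the standard error is non-zero.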
In one embodiment, the method further comprises: ranking the genetically related pathway activity scores gPAS, and obtaining the trait-related pathways associated with the respiratory disease according to the ranking result (the top-ranked pathways are selected) and the P value of each pathway at the cell-type level; the trait-related pathways include ribosome, T cell receptor signaling pathway, primary immunodeficiency, natural killer cell mediated cytotoxicity, and platelet activation.
Specifically, the gPAS values are ranked and tested on the basis of the central limit theorem. Let $C_t$ denote cell type t; the percent rank of each pathway in each cell j within $C_t$ is calculated using the following formula: [formula missing in the source text], wherein the ranked quantity is the gPAS rank of pathway i in cell j and M represents the total number of pathways. Similarly, the statistical significance $T_i^t$ of each pathway i in cell type t is calculated using the following formula: [formula missing in the source text]. The hypotheses are $H_0: T_i^t = 0$ versus $H_1: T_i^t > 0$; the P value of each pathway i in cell type t is: [formula missing in the source text].
In one embodiment, the statistical significance of individual cells is determined by calculating the rank distribution of the trait-related genes, in order to further evaluate whether a cell is significantly associated with the trait of interest. Specifically, the percent rank of the trait-related genes in the cell is obtained: [formula missing in the source text], wherein $r_{g,j}$ represents the expression level of gene g in cell j and G represents the number of genes associated with the specified trait. The gene percent ranks follow the uniform distribution U(0, 1); under the null hypothesis that there is no correlation between the gene percent ranks, a statistic $T_j$ is obtained for each cell by the formula: [formula missing in the source text].
The distribution of $T_j$ is derived using the central limit theorem based on the number of cells in the single-cell data: [formula missing in the source text], wherein N is the total number of cells. The hypotheses of the significance test are $H_0: T_j = 0$ versus $H_1: T_j > 0$; the P value of each cell j is: $p_j = \Pr(T_j \le t)$.
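The rank-based significance test can be illustrated with one possible construction in Python. It is an assumption, not the formula of this application: the percent ranks of the trait-related genes within a cell are averaged, and because the mean of G independent U(0, 1) variables is approximately normal with mean 0.5 and variance 1/(12G), the centered and scaled mean is compared with a standard normal. The upper-tail P value is chosen to match the alternative hypothesis that the statistic is greater than 0. All names are hypothetical.

```python
import math

import numpy as np


def cell_rank_test(cell_expr, trait_idx):
    """Illustrative central-limit test on gene percent ranks in one cell.

    cell_expr: expression of all genes in the cell.
    trait_idx: indices of the G trait-related genes.
    Returns (z, p) where z is the standardized mean percent rank and p is
    the one-sided upper-tail normal P value. Ties are broken by position,
    and 0.5 is used as an approximation of the null mean.
    """
    cell_expr = np.asarray(cell_expr, dtype=float)
    n = cell_expr.size
    ranks = np.argsort(np.argsort(cell_expr, kind="stable"), kind="stable") + 1
    pct = ranks / n                                   # percent ranks in (0, 1]
    g = len(trait_idx)
    mean_pct = float(pct[np.asarray(trait_idx, dtype=int)].mean())
    z = (mean_pct - 0.5) / math.sqrt(1.0 / (12.0 * g))
    p = 0.5 * math.erfc(z / math.sqrt(2.0))           # Pr(Z >= z)
    return z, p
```

A cell in which the trait-related genes sit near the top of the expression ranking yields a large positive statistic and a small P value.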
106: outputting a genetically related pathway activity score gPAS;
Application: use of a product that detects the naïve CD8+ T cell subpopulation in the manufacture of a product for diagnosing a respiratory disease; the naïve CD8+ T cell subpopulation has a newly discovered function associated with respiratory disease.
FIG. 2 is a schematic diagram of a device for processing respiratory disease data provided by an embodiment of the present invention, the device comprising: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke the program instructions, which, when executed, perform the above-described method of processing respiratory disease data.
FIG. 3 is a schematic diagram of a system for processing respiratory disease data provided by an embodiment of the present invention, the system comprising:
an acquisition unit 301 for acquiring single-cell sequencing sequence data to be analyzed;
a first processing unit 302, configured to process single-cell sequencing sequence data to be analyzed by using a machine learning method, so as to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
a second processing unit 303, configured to obtain genetic association data of respiratory diseases, and process the genetic association data to obtain pathway data with SNP annotations;
a third processing unit 304, configured to perform statistical analysis processing on PAS of the cell pathway and pathway data with SNP annotation, to obtain an estimation coefficient;
a fourth processing unit 305 for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
an output unit 306 for outputting the genetically related pathway activity score gPAS.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of processing respiratory disease data described above.
FIG. 4 is an overview diagram, provided by an embodiment of the present invention, of obtaining gPAS with the single-cell pathway-based scoring method and using gPAS to output the TRS, the trait-related genes, the trait-related cells, the trait-related cell types/subpopulations, and the trait-related pathways;
wherein A represents converting the gene-cell matrix into a pathway-cell matrix using singular value decomposition, with PC1 representing the PAS of each pathway; B represents annotating the SNPs in the GWAS data to the corresponding pathways; C represents the polygenic regression model, wherein the top panel represents estimating the coefficient of each pathway using the polygenic regression model and then calculating gPAS from the estimated coefficients and the corresponding PAS, and the bottom panel represents a Pearson correlation model that correlates the gPAS of each cell with the gene expression of all single cells in order to rank the trait-related genes; the top N trait-related genes (top 1,000 by default) are then scored with the AddModuleScore function in Seurat to calculate the trait-related score TRS of each cell; D represents the output, comprising four outputs: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.
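Panel A (turning a gene-cell matrix into a pathway-cell matrix with PC1 as the PAS) can be sketched in Python as follows. This is a minimal illustration, assuming a gene-by-cell expression array and a dictionary that maps each pathway name to the row indices of its genes; both the layout and the function name are assumptions, and the sign of a principal component is arbitrary.

```python
import numpy as np


def pathway_activity_scores(expr, pathways):
    """First principal component of each pathway's genes, per cell.

    expr: (n_genes, n_cells) expression matrix.
    pathways: dict mapping pathway name -> list of gene row indices.
    Returns a dict mapping pathway name -> array of length n_cells.
    """
    expr = np.asarray(expr, dtype=float)
    scores = {}
    for name, gene_idx in pathways.items():
        sub = expr[np.asarray(gene_idx, dtype=int)]
        centered = sub - sub.mean(axis=1, keepdims=True)   # center each gene
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        scores[name] = s[0] * vt[0]                        # PC1 scores of the cells
    return scores
```

Stacking the returned vectors row by row gives the pathway-cell PAS matrix used in the later steps.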
The verification results of this embodiment show that assigning an inherent weight to an indication may moderately improve the performance of the method relative to the default settings.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
While the foregoing describes in detail a computer device provided by the present invention, those skilled in the art will appreciate that the foregoing description is not intended to limit the invention, the scope of which is defined by the appended claims.

Claims (10)

1. A method of processing respiratory disease data, comprising:
acquiring single cell sequencing sequence data to be analyzed;
processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain pathway data with SNP annotations;
performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
outputting the genetically related pathway activity score gPAS.
2. The method of claim 1, wherein the step of statistically analyzing the PAS of the cellular pathway and the pathway data with SNP annotations to obtain estimated coefficients comprises:
obtaining genetic effect values of all SNPs in the data of a single pathway based on the pathway data with SNP annotations;
performing, based on a polygenic regression model of the genetic association data of respiratory diseases, parameter estimation on the distribution of the genetic effect values based on the PAS and the genetic effect values to obtain the estimation coefficients;
optionally, the genetic effect value is obtained by the formula: $\hat{\beta} = R\beta + X^{T}\varepsilon$, wherein $\hat{\beta}$ represents the genetic effect values, $\beta$ represents the theoretical effect size vector of the m SNPs, $\varepsilon$ represents the random environmental error, $R$ represents the LD matrix, and $X^{T}$ represents the standardized genotypes of the SNPs in the genetic association data sample;
optionally, the estimation coefficient is obtained as follows: $\sigma^{2} = \tau_{0} + \tau_{i,j}\,\tilde{s}_{i,j}$,
wherein $\tau_{i,j}$ represents the estimation coefficient of pathway i in cell j, $\tau_{0}$ represents the intercept term, $\sigma^{2}$ represents the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ represents the weighted PAS.
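3. The method of claim 1, wherein the genetically related pathway activity score gPAS ($gP_j$) is obtained by the formula: $gP_j = \sum_{i} \hat{\tau}_{i,j}\,\tilde{s}_{i,j}$,
wherein $gP_j$ is the gPAS of cell j, $\hat{\tau}_{i,j}$ is the optimized estimation coefficient, and $\tilde{s}_{i,j}$ is the weighted PAS.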
4. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression level of each cell, ranking the genes, and screening N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each single gene and the gPAS by the Pearson correlation coefficient (PCC), and ranking the genes according to the correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the first 1000 or the last 1000 genes when ranked by correlation in descending or ascending order;
optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, RPS29, and GZMB.
5. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: calculating a trait-related score TRS of each cell according to the N trait-related genes; and clustering according to the trait-related score TRS and the single-cell-level P value to obtain trait-related cells associated with respiratory diseases of different levels of severity;
optionally, the trait-related score TRS of the N trait-related genes is calculated using a cell scoring method;
optionally, the different levels of severity of the respiratory disease include mild, moderate and severe.
6. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: obtaining a trait-related cell type or subpopulation based on the block bootstrap method;
optionally, the method further comprises: ranking the genetically related pathway activity scores gPAS, and obtaining the trait-related pathways according to the ranking result and the P value of each pathway at the cell-type level.
7. Use of a product that detects the naïve CD8+ T cell subpopulation in the preparation of a product for diagnosing a respiratory disease.
8. A device for processing respiratory disease data, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is adapted to invoke program instructions for performing the method of processing respiratory disease data according to any of claims 1-6 when the program instructions are executed.
9. A system for processing respiratory disease data, comprising:
an acquisition unit for acquiring single-cell sequencing sequence data to be analyzed;
the first processing unit is used for processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
a second processing unit for acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain pathway data with SNP annotations;
a third processing unit for performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
a fourth processing unit for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
and an output unit for outputting the genetically related pathway activity score gPAS.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of processing respiratory disease data according to any of the preceding claims 1-6.
CN202211277916.2A 2022-10-19 2022-10-19 Processing method and system for respiratory disease data Pending CN116486911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211277916.2A CN116486911A (en) 2022-10-19 2022-10-19 Processing method and system for respiratory disease data

Publications (1)

Publication Number Publication Date
CN116486911A true CN116486911A (en) 2023-07-25

Family

ID=87225583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211277916.2A Pending CN116486911A (en) 2022-10-19 2022-10-19 Processing method and system for respiratory disease data

Country Status (1)

Country Link
CN (1) CN116486911A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination