CN116486911A - Processing method and system for respiratory disease data - Google Patents

Publication number
CN116486911A
Authority
CN
China
Prior art keywords
cell
pathway
data
pas
genetic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211277916.2A
Other languages
Chinese (zh)
Inventor
马云龙
苏建忠
邓春玉
瞿佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Medical University
Original Assignee
Wenzhou Medical University
Application filed by Wenzhou Medical University
Priority to CN202211277916.2A
Publication of CN116486911A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method, a system, a device and a computer-readable storage medium for processing respiratory disease data. The method comprises: acquiring single-cell sequencing data to be analyzed; processing the single-cell sequencing data with a machine learning method to obtain a pathway activity score (PAS) matrix of cell pathways and the PAS of each cell pathway; acquiring genetic association data for respiratory diseases and processing it to obtain SNP-annotated pathway data; performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients; multiplying the estimated coefficients by the PAS and summing the products to obtain a genetically associated pathway activity score (gPAS) for each cell; and outputting the gPAS.

Description

A method and system for processing respiratory disease data

Technical field

The present invention relates to the technical field of gene sequencing and, more specifically, to a method and system for processing respiratory disease data.

Background

Respiratory diseases are common and frequently occurring. The lesions are mainly located in the trachea, bronchi, lungs and thoracic cavity. Mild cases typically present with cough, chest pain and impaired breathing; severe cases present with dyspnea and hypoxia and may end in death from respiratory failure. Respiratory disease ranks third among causes of death in urban areas and first in rural areas. Of greater concern, air pollution, smoking, population aging and other factors continue to drive up the morbidity and mortality, domestically and abroad, of chronic obstructive pulmonary disease (COPD, including chronic bronchitis, emphysema and pulmonary heart disease), bronchial asthma, lung cancer, diffuse interstitial pulmonary fibrosis and pulmonary infections.

Coronaviruses are a large family of viruses known to cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). Common signs of coronavirus infection in humans include respiratory symptoms, fever, cough, shortness of breath and dyspnea. In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney failure and even death. Many symptoms of coronavirus disease can be managed, so treatment should be based on the patient's clinical condition. Supportive care for infected people can also be highly effective, as can self-protection measures such as maintaining basic hand and respiratory hygiene and following safe food practices. Understanding how host genetic factors influence the immune response to severe infection helps in developing effective vaccines and therapies to control pandemics of related respiratory diseases. With the rapid development of sequencing technology, single-cell sequencing offers a more comprehensive opportunity to uncover the mechanisms underlying these diseases.

Using single-cell RNA sequencing (scRNA-seq) to identify key cell subpopulations associated with complex diseases or traits is essential for understanding the mechanisms of complex disease. However, the high cost and low throughput of scRNA-seq preclude sequencing at very large scale, and most current single-cell studies include no more than 20 samples; the resulting statistical power is limited and cannot accurately reveal the disease- or trait-associated risk subsets within cell subpopulations. In addition, at the gene level scRNA-seq data are highly sparse, technically noisy and unstable in variance. Genetic association data, for example from genome-wide association studies (GWAS), are widely used to study complex diseases and traits. Linking scRNA-seq data to the phenotype-associated genetic information from large-sample GWAS is considered a practical and effective way to reveal the genetic and molecular mechanisms of complex diseases or traits at single-cell resolution.

Methods that combine GWAS with scRNA-seq data to identify cell types associated with complex diseases include LDSC-SEG, MAGMA and RolyPoly. These methods, however, require extensive parameter tuning so that cell types can be annotated with known marker genes, and they largely ignore the heterogeneity within each cell type. Moreover, existing techniques identify highly expressed genes, but an excessive focus on such genes can underestimate the functional role of genes that are expressed at relatively low levels yet are important for revealing cell fate.

Summary of the invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, it provides a method and system for processing respiratory disease data. The method introduces a single-cell pathway-based scoring approach that combines scRNA-seq data with genetic association data to infer the genes, cells and other features associated with respiratory diseases. It mines the biological regularities underlying single-cell sequencing data and identifies potential links between respiratory diseases and genes, cells, cell subpopulations and biological pathways.

The present application discloses a method for processing respiratory disease data, comprising:

acquiring single-cell sequencing data to be analyzed;

processing the single-cell sequencing data to be analyzed with a machine learning method to obtain a pathway activity score (PAS) matrix of cell pathways and the PAS of each cell pathway;

acquiring genetic association data for respiratory diseases and processing the genetic association data to obtain SNP-annotated pathway data;

performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients;

multiplying the estimated coefficients by the PAS and summing the products to obtain a genetically associated pathway activity score (gPAS) for each cell;

outputting the genetically associated pathway activity score gPAS.

The step of performing statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients comprises:

obtaining, from the SNP-annotated pathway data, the genetic effect values of all SNPs in each individual pathway;

using a polygenic regression model of the genetic association data for respiratory diseases, together with the PAS and the genetic effect values, to estimate the parameters of the distribution of the genetic effect values and thereby obtain the estimated coefficients;

optionally, the genetic effect values are obtained from a formula in which β denotes the vector of theoretical effect sizes of the m SNPs, ε denotes random environmental error, R denotes the linkage disequilibrium (LD) matrix, and $X^T$ denotes the standardized genotypes of the SNPs in the genetic association data sample;

optionally, the estimated coefficients are obtained from a formula in which $\tau_{i,j}$ denotes the estimated coefficient of pathway i in cell j, $\tau_0$ denotes the intercept term, $\sigma^2$ denotes the variance of the SNP effect sizes in the pathway, and the remaining term is the weighted PAS.
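One plausible reading of the two relations described above, assuming a RolyPoly-style polygenic regression, is sketched here with the symbols defined in the text. The hat on the observed effect sizes, the phenotype vector y, the sample size n, the symbol $\tilde{s}_{i,j}$ for the weighted PAS, and the linear form of the variance model are all assumptions made for illustration; they are not confirmed by the filing.

```latex
% Assumed: y = X\beta + \varepsilon with standardized genotypes X (n samples, m SNPs)
% and R = X^{T}X / n the LD matrix.
\hat{\beta} = \frac{1}{n} X^{T} y = R\,\beta + \frac{1}{n} X^{T}\varepsilon
\qquad\text{and}\qquad
\sigma^{2}_{i,j} = \tau_{0} + \tau_{i,j}\,\tilde{s}_{i,j}
```

Under this reading, the coefficient $\tau_{i,j}$ measures how strongly the activity of pathway i in cell j scales the variance of the SNP effects annotated to that pathway.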

The genetically associated pathway activity score gPAS ($gP_j$) is obtained from a formula in which the coefficient applied to the PAS is the optimized estimated coefficient.
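The multiply-and-sum step can be sketched as follows. This is a minimal illustration, assuming the coefficients and scores are stored as pathway-by-cell nested lists; the function name `gpas_per_cell` and the toy numbers are hypothetical and do not come from the filing.

```python
def gpas_per_cell(tau, pas):
    """Return one gPAS per cell.

    tau[i][j]: estimated coefficient of pathway i in cell j (assumed layout).
    pas[i][j]: pathway activity score of pathway i in cell j (assumed layout).
    """
    n_pathways = len(tau)
    n_cells = len(tau[0])
    # For each cell j, sum coefficient * score over all pathways i.
    return [
        sum(tau[i][j] * pas[i][j] for i in range(n_pathways))
        for j in range(n_cells)
    ]


# Toy example: 2 pathways, 2 cells.
tau = [[0.5, 0.1], [0.2, 0.4]]
pas = [[2.0, 1.0], [1.0, 3.0]]
scores = gpas_per_cell(tau, pas)
```

Each entry of `scores` is the gPAS of one cell, so the list can be passed directly to the downstream ranking and correlation steps.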

The method further comprises: performing a correlation analysis between the gPAS and the gene expression of each cell, ranking the genes, and selecting N trait-related genes;

optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each individual gene and the gPAS using the Pearson correlation coefficient (PCC), and ranking the genes by that correlation to obtain the N trait-related genes;

optionally, the N trait-related genes are the top 1000 or the bottom 1000 trait-related genes after ranking by correlation in descending or ascending order;

optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29 and GZMB.
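The ranking of genes by their Pearson correlation with gPAS can be sketched as below. This is a minimal illustration, assuming expression is available as a mapping from gene name to per-cell values; the function names, the placeholder gene names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_genes_by_gpas(expression, gpas, top_n=1000):
    """Rank genes by PCC between per-cell expression and per-cell gPAS.

    expression: dict mapping gene name -> list of per-cell expression values.
    gpas: list of per-cell gPAS values, in the same cell order.
    Returns the top_n (gene, pcc) pairs in descending order of correlation.
    """
    scored = [(gene, pearson(values, gpas)) for gene, values in expression.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy example with placeholder gene names and four cells.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [1.0, 1.0, 1.0, 1.0],
}
gpas = [0.1, 0.2, 0.3, 0.4]
top = rank_genes_by_gpas(expression, gpas, top_n=2)
```

Sorting in ascending order instead (`reverse=False`) would give the bottom-ranked genes mentioned in the optional embodiment.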

The method further comprises: calculating a trait-related score (TRS) for each cell from the N trait-related genes; and clustering on the TRS and the cell-level P values to obtain trait-related cells associated with respiratory diseases of different severity grades;

optionally, the TRS for the N trait-related genes is calculated with a cell scoring method;

optionally, the severity grades of the respiratory disease include mild, moderate and severe.

The method further comprises: obtaining trait-related cell types or subpopulations using the block bootstrap method.

The method further comprises: ranking the gPAS and, based on the ranking and the P values of the pathways at the cell-type level, obtaining trait-related pathways according to their statistical significance.

Use of a product for detecting the new CD8+ T cell subpopulation in the preparation of a product for diagnosing respiratory diseases.

A device for processing respiratory disease data, the device comprising a memory and a processor;

the memory is configured to store program instructions; the processor is configured to invoke the program instructions, which, when executed, carry out the above method for processing respiratory disease data.

A system for processing respiratory disease data, comprising:

an acquisition unit, configured to acquire single-cell sequencing data to be analyzed;

a first processing unit, configured to process the single-cell sequencing data to be analyzed with a machine learning method to obtain the PAS matrix of cell pathways and the PAS of each cell pathway;

a second processing unit, configured to acquire genetic association data for respiratory diseases and to process the genetic association data to obtain SNP-annotated pathway data;

a third processing unit, configured to perform statistical analysis on the PAS of the cell pathways and the SNP-annotated pathway data to obtain estimated coefficients;

a fourth processing unit, configured to multiply the estimated coefficients by the PAS and sum the products to obtain the genetically associated pathway activity score gPAS of each cell;

an output unit, configured to output the genetically associated pathway activity score gPAS.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for processing respiratory disease data.

The present application has the following beneficial effects:

1. The present application discloses a method for processing respiratory disease data that combines single-cell sequencing data with genetic association data. It can infer, in depth and along more dimensions, the genes, cells, cell subpopulations and biological pathways associated with respiratory diseases. Understanding how these host genetic factors influence the immune response to severe infection helps in developing effective vaccines and therapies to control pandemics and contributes to research on respiratory diseases. Being based on single-cell pathway scoring, the method has strong power to discover disease-risk cell types: it integrates the functional effects of different genes participating in the same biological pathway to obtain a stable cell state, which markedly increases statistical power, biological interpretability and reproducibility of results. It overcomes the limitation of relying on known annotated cell types and may uncover new genetically associated subpopulations as well as key genes or pathways of cell types, giving it broad applicability and strong practical value. For example, the scheme indicates that a new, genetically driven CD8+ T cell subpopulation that can be prioritized may play an important role in mediating the immune response of patients with severe respiratory disease.

2. The present application discloses a single-cell pathway-based scoring method that uses a polygenic regression model to reveal trait-related genes and cell subpopulations from pathway-activity-transformed scRNA-seq data and genetic association study data. It effectively addresses the problem that the identification of genes and cell subpopulations associated with the polygenic risk of complex diseases is largely hindered by the small sample sizes and high sparsity of scRNA-seq data, which limit statistical power and prevent accurate detection of disease- or trait-associated risk subsets within cell subpopulations. The method mines the biological regularities underlying single-cell sequencing data and analyzes them along multiple dimensions, including the relationship between population-genetic variants and disease and the gene abundance information from single-cell sequencing, thereby greatly improving the precision and depth of data analysis.

3. Based on large-scale simulations and real data, the present application uses the above scoring method to combine scRNA-seq data with genetic association data. This effectively overcomes two problems of the prior art: the need for extensive parameter tuning in order to annotate cell types with known marker genes, and the tendency to largely ignore the heterogeneity within each cell type. It also avoids underestimating, through excessive focus on highly expressed genes, the functional role of genes that are expressed at relatively low levels but are important for revealing cell fate. By aggregating the effects of genes with low average expression, it helps identify disease-related early developmental events or progenitor cells, for example key transcription factors involved in cell development. At the same time, it effectively reduces the sparsity and technical noise of scRNA-seq data and shows good robustness and power in identifying trait-related cell types and subpopulations.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

Figure 1 is a schematic flowchart of the method for processing respiratory disease data provided by an embodiment of the present invention;

Figure 2 is a schematic diagram of the device for processing respiratory disease data provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of the system for processing respiratory disease data provided by an embodiment of the present invention;

Figure 4 is an overview of how gPAS is obtained by the single-cell pathway-based scoring method provided by an embodiment of the present invention, and of how gPAS is used to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations and trait-related pathways.

Detailed description

To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.

Some of the flows described in the specification, the claims and the above drawings contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations from one another and do not in themselves imply any order of execution. The flows may also include more or fewer operations, and these may be executed sequentially or in parallel. Note that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not imply a sequence, nor do they require that the "first" and the "second" be of different types.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Figure 1 is a schematic flowchart of a method for processing respiratory disease data provided by an embodiment of the present invention. Specifically, the method comprises the following steps:

101: acquiring single-cell sequencing data to be analyzed;

In one embodiment, the single-cell sequencing data comprise seven independent single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) datasets covering 1.39 million cells from human (Homo sapiens) and mouse (Mus musculus). For blood cells, two scRNA-seq datasets, one of human BMMCs (N = 35,582 cells) and one of human PBMCs (N = 97,039 cells), were collected to reveal trait-related cell subpopulations or types. For immune- and metabolism-related diseases and traits, an scRNA-seq dataset of human cells (HCL, N = 513,707 cells from 35 adult tissues) was used to build a pseudo-bulk expression profile for each tissue and to prioritize the risk tissues associated with each disease or trait.

In one embodiment, to discover immune cell populations associated with severe respiratory disease, a large-scale PBMC scRNA-seq dataset (N = 469,453 cells) was collected, comprising 254 peripheral blood samples from patients with respiratory disease of differing severity (mild N = 109 samples, moderate N = 102 samples, severe N = 50) and 16 healthy controls. Optionally, the single-cell sequencing data to be analyzed include single-cell sequencing data from a healthy control group and from respiratory diseases of different severity grades.

102: processing the single-cell sequencing data to be analyzed with a machine learning method to obtain the PAS matrix of cell pathways and the PAS of each cell pathway;

In one embodiment, this step comprises:

acquiring pathway data for respiratory diseases;

normalizing the gene-cell matrix of the single-cell sequencing data to obtain a normalized gene-cell matrix. Specifically, the sparse gene-cell matrix of the scRNA-seq data is normalized using a variance-stabilizing transformation with a scale factor of 10,000, which yields the normalized expression of each gene in each cell. In the normalization formula, $a_{g,j}$ denotes the raw expression of gene g in cell j and $e_{g,j}$ denotes the normalized expression of gene g in cell j;
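A normalization with a scale factor of 10,000 can be sketched as follows. The text does not reproduce the formula, so this sketch assumes the widely used log-normalization, $e_{g,j} = \ln(1 + 10{,}000 \cdot a_{g,j} / \sum_{g'} a_{g',j})$; that choice, the function name `log_normalize` and the toy counts are assumptions, not the filing's confirmed procedure.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Normalize a genes x cells count matrix (nested lists).

    Assumed formula: e[g][j] = ln(1 + counts[g][j] / total_j * scale_factor),
    where total_j is the total count of cell j.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    totals = [sum(counts[g][j] for g in range(n_genes)) for j in range(n_cells)]
    return [
        [
            # Cells with no counts stay at zero to avoid dividing by zero.
            math.log1p(counts[g][j] / totals[j] * scale_factor) if totals[j] > 0 else 0.0
            for j in range(n_cells)
        ]
        for g in range(n_genes)
    ]


# Toy example: 2 genes (rows) x 2 cells (columns).
counts = [[1, 0], [3, 5]]
normalized = log_normalize(counts)
```

Zero counts remain zero after this transformation, so the sparsity pattern of the gene-cell matrix is preserved.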

Based on the pathway data for respiratory diseases, a machine learning method is used to convert the normalized gene-cell matrix into a pathway-cell matrix, from which the cell-pathway PAS matrix is obtained; the PAS matrix contains the pathway activity score (PAS) of each individual cell in each individual pathway.

In one embodiment, the pathway data are KEGG pathway data: pathways from the KEGG database serve as the default gene sets for evaluating PAS, and singular value decomposition (SVD) is used to convert the normalized gene-cell matrix into a pathway-cell matrix. Let $P_i$ denote the gene set of pathway i. For each pathway i, a matrix $A_i$ is selected from the normalized gene-cell matrix A; the columns of $A_i$ are all N cells and its rows are the $|P_i|$ genes of the pathway gene set $P_i$. SVD gives $A_i^T = U \Sigma V^T$, where U is an N×N orthogonal matrix, Σ is a diagonal matrix whose entries off the main diagonal are all zero, and $V^T$ is a $|P_i|$×$|P_i|$ orthogonal matrix. In the right orthogonal matrix V, the t-th column vector $v_t$ represents the t-th principal component and reflects the coordinated expression variability of the genes of the pathway in the single-cell data. Because the first principal component PC1 captures the largest share of variance, the projection of the features of cell j onto PC1 gives the PAS $s_{i,j}$ of pathway i. For cell j, the raw PAS $s_{i,j}$ is adjusted using all of the expression variance in pathway i as a weight: for each gene g in pathway i, the gene expression $e_{g,j}$ is rescaled by min-max scaling to give the adjusted gene expression.
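The SVD-based score can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: the cells-by-genes submatrix is mean-centered per gene before decomposition (standard for principal components but not stated in the text), and the function name `pathway_activity_scores` and the toy matrix are hypothetical. The sign of a principal component is arbitrary, so the scores are defined only up to a global sign flip.

```python
import numpy as np


def pathway_activity_scores(expr, pathway_rows):
    """Project every cell onto PC1 of one pathway's genes.

    expr: normalized genes x cells matrix A.
    pathway_rows: row indices of the genes in pathway gene set P_i.
    Returns a length-N array with one raw PAS per cell.
    """
    a_i = expr[pathway_rows, :]                 # |P_i| x N submatrix A_i
    cells_by_genes = a_i.T                      # N x |P_i|, the matrix that is decomposed
    centered = cells_by_genes - cells_by_genes.mean(axis=0)  # assumption: center each gene
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]                                 # first right singular vector
    return centered @ pc1                       # projection of each cell onto PC1


# Toy example: 3 genes x 4 cells; the pathway contains genes 0 and 1.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 5.0, 5.0, 5.0],
])
pas = pathway_activity_scores(expr, [0, 1])
```

Repeating the call for every pathway and stacking the results row by row would give the pathway-cell matrix described above.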

In one embodiment, the pathway activity score PAS is optimized to obtain a weighted PAS.

The weighted PAS is obtained from the optimized normalized expression of each gene g in cell j and from $s_{i,j}$, the pathway activity score PAS of pathway i in cell j.

In one embodiment, the optimized normalized expression is obtained by min-max scaling, $\tilde{e}_{g,j} = \dfrac{e_{g,j} - \mathrm{MIN}(e_{g,j})}{\mathrm{MAX}(e_{g,j}) - \mathrm{MIN}(e_{g,j})}$, where $e_{g,j}$ denotes the normalized expression of gene g in cell j, $\mathrm{MAX}(e_{g,j})$ denotes the maximum gene expression in pathway i, and $\mathrm{MIN}(e_{g,j})$ denotes the minimum gene expression in pathway i.
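The rescaling and weighting can be sketched as follows. Two points are assumptions made for illustration: the minimum and maximum are taken over the genes of pathway i within a single cell j, and the weight applied to the raw PAS is the sum of the rescaled expression values. The text does not reproduce the weighting formula, so the function `weighted_pas` shows one possible form rather than the filing's confirmed one.

```python
def min_max_rescale(values):
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_pas(pathway_expr_in_cell, raw_pas):
    """Weight a raw PAS by the pathway's rescaled expression in one cell.

    pathway_expr_in_cell: normalized expression e[g][j] of the pathway's genes in cell j.
    raw_pas: raw score s[i][j] of pathway i in cell j.
    Assumption: the weight is the sum of the min-max rescaled expression values.
    """
    rescaled = min_max_rescale(pathway_expr_in_cell)
    return sum(rescaled) * raw_pas


# Toy example: three pathway genes in one cell.
expr_in_cell = [0.0, 1.0, 4.0]
score = weighted_pas(expr_in_cell, raw_pas=2.0)
```

Guarding the constant-vector case avoids a division by zero when every gene of a pathway has the same expression in a cell, which is common in sparse data.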

Optionally, the machine learning method includes singular value decomposition (SVD). SVD greatly improves the computational efficiency of analyzing sparse matrices and yields the eigenvalues without computing the covariance matrix. The SVD method converts the normalized gene-cell matrix into a pathway-cell matrix in a low-dimensional space.

103: Acquire genetic association data for respiratory diseases and process the genetic association data to obtain pathway data with SNP annotations.

In one embodiment, the genetic association data for respiratory diseases includes genetic association data for severe respiratory diseases.

In one embodiment, the step of processing the genetic association data to obtain pathway data with SNP annotations includes:

screening the genetic association data to obtain the SNPs of individual genes and, based on pathway data for respiratory diseases, mapping the SNPs of individual genes to the corresponding pathways to obtain pathway data with SNP annotations.

Optionally, the SNPs of an individual gene are obtained as follows: after the SNPs of the genes in the genetic association data are acquired, SNP-gene pairs are assigned to give an assignment result;

in the assignment result, each case in which a single SNP corresponds to multiple genes is treated as a set of independent SNP-gene associations; SNPs with a minor allele frequency (MAF) greater than 0.1 are retained; SNPs on the sex chromosomes are removed; the SNPs of individual genes are thereby obtained.

Pooling the SNPs of the individual genes gives the SNPs of all genes. Specifically, the genetic association data is GWAS data, and SNPs in the GWAS summary statistics are assigned to the relevant genes using 20 kb as the default parameter. The notation $g(k)$ denotes the gene $g$ carrying SNP $k$. After SNP-gene assignment, some single SNPs correspond to multiple genes. Because the procedure infers parameters from thousands of SNPs, and SNPs that correspond to multiple genes have no effect on the inference, such duplicates are treated as independent SNP-gene associations. SNPs with a minor allele frequency (MAF) greater than 0.1 are retained and SNPs on the sex chromosomes are removed, finally giving the SNPs of the relevant genes.
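
The assignment and filtering rules can be sketched as below. This is an illustration under stated assumptions: the 20 kb default is interpreted as a symmetric window around the gene body, chromosomes are plain strings with `"X"` and `"Y"` for the sex chromosomes, and the `Snp` and `Gene` records and the `assign_snps` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    maf: float

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int

def assign_snps(snps, genes, window=20_000, min_maf=0.1):
    """Return (rsid, gene) pairs; a SNP near several genes yields several pairs."""
    pairs = []
    for snp in snps:
        # Keep only autosomal SNPs with MAF strictly greater than the threshold.
        if snp.chrom in {"X", "Y"} or snp.maf <= min_maf:
            continue
        for gene in genes:
            near = gene.start - window <= snp.pos <= gene.end + window
            if gene.chrom == snp.chrom and near:
                pairs.append((snp.rsid, gene.name))
    return pairs

genes = [Gene("GENE_A", "1", 100_000, 120_000), Gene("GENE_B", "1", 130_000, 150_000)]
snps = [
    Snp("rs1", "1", 125_000, 0.30),  # within 20 kb of both genes
    Snp("rs2", "1", 200_000, 0.40),  # too far from either gene
    Snp("rs3", "X", 110_000, 0.30),  # sex chromosome, removed
    Snp("rs4", "1", 110_000, 0.05),  # MAF too low, removed
]
pairs = assign_snps(snps, genes)
```

Here `rs1` appears twice, once per gene, which mirrors the treatment of multi-gene SNPs as independent SNP-gene associations.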

Based on the pathways in the KEGG database, genes with associated SNPs are annotated to pathways, and $S_i = \{k : g(k) \in P_i\}$ denotes the set of SNPs in pathway $i$. Linkage disequilibrium (LD) among the SNPs extracted from the GWAS summary data is calculated using 1000 Genomes Project Phase 3 data. The scheme also offers functional gene sets such as GO, Reactome, and MSigDB as alternatives. In addition, the major histocompatibility complex region (Chr6: 25-35 Mbp), which shows extensive LD, is removed.
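
A minimal sketch of the pathway annotation step, assuming SNP-gene pairs are already available. It builds, for each pathway, the set of SNPs whose assigned gene belongs to the pathway gene set, and skips SNPs inside the Chr6: 25-35 Mbp region. The function names, the dictionary layout, and the toy identifiers are hypothetical.

```python
def in_mhc(chrom, pos, start=25_000_000, end=35_000_000):
    """True if the position lies in the excluded MHC region on chromosome 6."""
    return chrom == "6" and start <= pos <= end

def pathway_snp_sets(pathways, snp_gene_pairs, snp_positions):
    """Collect, per pathway, the SNPs whose assigned gene is in the pathway."""
    sets = {name: set() for name in pathways}
    for rsid, gene in snp_gene_pairs:
        chrom, pos = snp_positions[rsid]
        if in_mhc(chrom, pos):
            continue  # extensive LD: drop SNPs in the MHC region
        for name, gene_set in pathways.items():
            if gene in gene_set:
                sets[name].add(rsid)
    return sets

pathways = {"pathway_1": {"GENE_A", "GENE_B"}, "pathway_2": {"GENE_C"}}
snp_gene_pairs = [("rs1", "GENE_A"), ("rs2", "GENE_C"), ("rs3", "GENE_B")]
snp_positions = {"rs1": ("1", 125_000), "rs2": ("6", 30_000_000), "rs3": ("2", 5_000)}
snp_sets = pathway_snp_sets(pathways, snp_gene_pairs, snp_positions)
```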

In one embodiment, the GWAS data relate to a given phenotype, which may be a dichotomous trait, a continuous trait, or an endophenotype and central measure.

104: Perform statistical analysis on the PAS of the cell pathways and the pathway data with SNP annotations to obtain estimated coefficients.

In one embodiment, the step of performing statistical analysis on the PAS of the cell pathways and the pathway data with SNP annotations to obtain estimated coefficients includes:

obtaining the genetic effect values of all SNPs in a single pathway based on the pathway data with SNP annotations;

performing, based on a polygenic regression model of the genetic association data of respiratory diseases and on the PAS and the genetic effect values, parameter estimation on the distribution of the genetic effect values to obtain the estimated coefficients.

Optionally, the genetic effect values are given by a formula in which $\beta$ denotes the vector of theoretical effect sizes of the $m$ SNPs, $\varepsilon$ denotes the random environmental error, $R$ denotes the LD matrix, and $X^{T}$ denotes the standardized genotypes of the SNPs in the genetic association data sample.

Optionally, the estimated coefficients are obtained from a regression of the effect-size variance on the weighted PAS,

where $\tau_{i,j}$ denotes the estimated coefficient of pathway $i$ in cell $j$, which reflects the influence of the cell-specific PAS on the variance of the GWAS effect sizes, that is, the genetic influence on the response; $\tau_0$ denotes the intercept term; $\sigma^{2}$ denotes the variance of the SNP effect sizes in the pathway; and $\tilde{s}_{i,j}$ denotes the weighted PAS.

In one embodiment, $S_i$ denotes the SNP set containing all SNPs mapped to the genes of pathway $i$. The polygenic model assumes a priori that the effect sizes of all SNPs in pathway $i$ follow a multivariate normal distribution, $\beta_{S_i} \sim \mathrm{MVN}(0, \sigma^{2} I)$, where $\sigma^{2}$ denotes the variance of the SNP effect sizes in the pathway and $I$ denotes the $|S_i| \times |S_i|$ identity matrix.

In one embodiment, based on the preceding assumption, the distribution of the genetic effect values is estimated, and the resulting formula is used to optimize the estimated coefficients.

In one embodiment, to optimize the estimated coefficient of each pathway in the polygenic regression model, a method-of-moments approach is used, which markedly improves computational efficiency and the consistency of convergence of the estimates. The observed and expected squared effects of the SNPs associated with each pathway are then fitted, with the expected value estimated from a formula in which Tr denotes the matrix trace.

105: Multiply the estimated coefficients by the PAS and sum the products to obtain the genetically related pathway activity score gPAS of the cell.

In one embodiment, the genetically related pathway activity score gPAS ($gP_j$) is the gPAS related to respiratory disease and is obtained as:

$$gP_j = \sum_{i} \hat{\tau}_{i,j}\,\mathrm{PAS}_{i,j}$$

where $\hat{\tau}_{i,j}$ is the optimized estimated coefficient of pathway $i$ in cell $j$ and $\mathrm{PAS}_{i,j}$ is the PAS of pathway $i$ in cell $j$.
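
Step 105 (multiply each estimated coefficient by the corresponding PAS, then sum over pathways) can be sketched as follows. The nested-list layout, with pathways as rows and cells as columns, and the numeric values are assumptions chosen for illustration.

```python
def gpas_per_cell(tau_hat, pas):
    """tau_hat[i][j] and pas[i][j] refer to pathway i and cell j.

    Returns one gPAS value per cell: the sum over pathways of
    coefficient times PAS.
    """
    n_pathways, n_cells = len(pas), len(pas[0])
    return [
        sum(tau_hat[i][j] * pas[i][j] for i in range(n_pathways))
        for j in range(n_cells)
    ]

# Hypothetical coefficients and scores: 2 pathways x 2 cells.
tau_hat = [[0.5, 1.0], [2.0, 0.5]]
pas = [[2.0, 3.0], [1.0, 4.0]]
gpas = gpas_per_cell(tau_hat, pas)
```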

In one embodiment, the method further includes: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression of each cell, ranking the genes, and screening out N trait-related genes.

Optionally, the correlation analysis and ranking include: determining the correlation between the expression of each individual gene and the gPAS using the Pearson correlation coefficient (PCC), and ranking the genes by correlation to obtain the N trait-related genes associated with respiratory disease. Specifically, to maximize power, the expression of each gene $g$ is inversely weighted by its gene-specific technical noise level, which is estimated by modelling the mean-variance relationship across genes in the scRNA-seq data.
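
A minimal sketch of the correlation-and-ranking step, assuming per-gene expression vectors across cells and one gPAS value per cell. It omits the inverse weighting by technical noise described above, and the helper names and toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_genes(expression, gpas, top_n):
    """Rank genes by correlation with gPAS (descending) and keep the top N."""
    scores = {gene: pearson(values, gpas) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

gpas = [1.0, 2.0, 3.0, 4.0]  # one value per cell
expression = {
    "gene_up": [2.0, 4.0, 6.0, 8.0],
    "gene_down": [4.0, 3.0, 2.0, 1.0],
    "gene_flat": [1.0, 1.0, 1.0, 1.0],
}
top = rank_genes(expression, gpas, top_n=2)
```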

Optionally, the N trait-related genes are the top 1000 or bottom 1000 trait-related genes after sorting by correlation in descending or ascending order; N is not limited to 1000 and is a natural number. The N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB.

In one embodiment, the method further includes: calculating a trait-related score TRS for each cell from the N trait-related genes; and clustering according to the TRS and the single-cell-level P values to obtain trait-related cells associated with respiratory diseases of different severity levels. The trait-related genes are significantly enriched in the following trait-related cells: hay bone marrow naïve T16 cells, lung naïve CD8+ T cells, liver NKT cells, and brain naïve-like T cells. The TRS is obtained as:

TRS = average RE(GS) - average RE(CG)

where average RE(GS) is the average relative expression of the set of N trait-related genes in a given cell, and average RE(CG) is the average relative expression of a control gene set of the same size drawn at random from the existing gene pool; RE stands for relative expression, GS for gene set, and CG for control gene set.

Optionally, the severity levels of respiratory disease include mild, moderate, and severe.

Optionally, the trait-related score TRS of the N genes is calculated with the cell scoring method of the AddModuleScore function in Seurat.
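
The TRS definition can be illustrated with a simplified sketch. It is not Seurat's AddModuleScore, which additionally bins control genes by expression level; here the control set is simply a random draw of the same size from the remaining genes. All names and values are hypothetical.

```python
import random

def mean_expression(cell_expr, genes):
    """Average relative expression of the given genes in one cell."""
    return sum(cell_expr[g] for g in genes) / len(genes)

def trait_related_score(cell_expr, gene_set, background, seed=0):
    """TRS = average RE(GS) - average RE(CG) for a single cell."""
    rng = random.Random(seed)
    pool = [g for g in background if g not in gene_set]
    control = rng.sample(pool, len(gene_set))  # same-sized control gene set
    return mean_expression(cell_expr, gene_set) - mean_expression(cell_expr, control)

# Hypothetical relative expression of four genes in one cell.
cell = {"g1": 3.0, "g2": 5.0, "g3": 1.0, "g4": 1.0}
trs = trait_related_score(cell, ["g1", "g2"], list(cell))
```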

In one embodiment, the method further includes: obtaining trait-related cell types or subpopulations associated with respiratory disease using the block bootstrap method, thereby determining whether the cell type to which an individual cell belongs is trait-related. The trait-related cell types or subpopulations (associated with severe respiratory disease) include one or more of the following: CD8+ T cells, megakaryocytes, and CD16+ monocytes. Genes highly expressed in naïve CD8+ T cells include memory effector marker genes (GZMK, AQP3, GZMA, PRF1, and GNLY) and exhaustive effector marker genes (LAG3, TIGIT, GZMA, GZMB, PRDM1, and IFNG). Specifically, a group of cells is treated as a pseudo-bulk transcriptome profile, and gene expression is averaged across the cells within a given cell type. For each associated cell type, the block bootstrap method is used to estimate the standard error and to compute a t-statistic with its corresponding P value. Because the block bootstrap aims to preserve the data structure when sampling from the empirical distribution, the pathways of the KEGG database are used to partition the genome into multiple biologically meaningful blocks, and these pathway-based blocks are sampled with replacement. By default, 200 iterations are performed for each cell-type association analysis; the default can be changed at run time.
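
The resampling idea can be sketched as follows, assuming each pathway block has already been reduced to a single numeric contribution and that the statistic of interest is the mean over blocks. Both are simplifying assumptions, and the function name is hypothetical; only the sampling of blocks with replacement and the default of 200 iterations come from the description.

```python
import random
import statistics

def block_bootstrap_se(block_values, n_iter=200, seed=0):
    """Resample pathway blocks with replacement and return the standard
    error (standard deviation of the bootstrap means)."""
    rng = random.Random(seed)
    n = len(block_values)
    means = []
    for _ in range(n_iter):
        sample = [block_values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

# Hypothetical per-block contributions for one cell type.
se_constant = block_bootstrap_se([2.0, 2.0, 2.0])
se_varied = block_bootstrap_se([1.0, 2.0, 3.0, 10.0])
```

A t-statistic would then be the point estimate divided by this standard error.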

In one embodiment, the method further includes: ranking the genetically related pathway activity scores gPAS and, from the ranking result (taking the top-ranked pathways) and the P values of the pathways at the cell-type level, obtaining the trait-related pathways associated with respiratory disease. The trait-related pathways include ribosome, T cell receptor signaling pathway, primary immunodeficiency, natural killer cell-mediated cytotoxicity, and platelet activation.

Specifically, the gPAS are ranked on the basis of the central limit theorem. Let $C_t$ denote cell type $t$. The pathway percentile rank of each cell $j$ within $C_t$ is computed from $r_{i,j}$, the gPAS rank of pathway $i$ in cell $j$, and $M$, the total number of pathways. Similarly, a statistical significance value $T_i^{t}$ is computed for each pathway $i$ in cell type $t$. The hypotheses are $H_0: T_i^{t} = 0$ versus $H_1: T_i^{t} > 0$, and a P value is obtained for each pathway $i$ in cell type $t$.

In one embodiment, the statistical significance of individual cells is determined from the rank distribution of the trait-related genes, to further assess whether a cell is significantly associated with the trait of interest. Specifically, the percentile rank of each trait-related gene in a cell is derived from $r_{g,j}$, the expression rank of gene $g$ in cell $j$, and $G$, the number of specified trait-related genes. The gene percentile ranks follow the uniform distribution $U(0,1)$. Under the null hypothesis of no association among the percentile ranks of the genes, a statistic $T_j$ is obtained for each cell.

Given the large number of cells in single-cell data, the distribution of $T_j$ is derived using the central limit theorem, where $N$ is the total number of cells. The hypotheses of the significance test are $H_0: T_j = 0$ versus $H_1: T_j > 0$, and the P value of each cell $j$ is $p_j = \Pr(T_j \le t)$.

106: Output the genetically related pathway activity score gPAS.

A use: use of a product for detecting a novel naïve CD8+ T cell subpopulation in the preparation of a product for diagnosing respiratory disease; the novel naïve CD8+ T cell subpopulation has a newly discovered function associated with respiratory disease.

Fig. 2 is a schematic flowchart of a device for processing respiratory disease data provided by an embodiment of the present invention. The device includes a memory and a processor; the memory is used to store program instructions; the processor is used to call the program instructions and, when the program instructions are executed, to carry out the above method for processing respiratory disease data.

Fig. 3 is a schematic flowchart of a system for processing respiratory disease data provided by an embodiment of the present invention, including:

an acquisition unit 301, configured to acquire the single-cell sequencing data to be analyzed;

a first processing unit 302, configured to process the single-cell sequencing data to be analyzed using a machine learning method to obtain a PAS scoring matrix of the cell pathways and the PAS of the cell pathways;

a second processing unit 303, configured to acquire genetic association data for respiratory diseases and process the genetic association data to obtain pathway data with SNP annotations;

a third processing unit 304, configured to perform statistical analysis on the PAS of the cell pathways and the pathway data with SNP annotations to obtain estimated coefficients;

a fourth processing unit 305, configured to multiply the estimated coefficients by the PAS and sum the products to obtain the genetically related pathway activity score gPAS of the cell;

an output unit 306, configured to output the genetically related pathway activity score gPAS.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the above method for processing respiratory disease data.

Fig. 4 is an overview of how gPAS is obtained by the single-cell pathway-based scoring method provided by an embodiment of the present invention, and of how gPAS is used to output the TRS, trait-related genes, trait-related cells, trait-related cell types/subpopulations, and trait-related pathways.

In the figure, A shows the conversion of the gene-cell matrix into a pathway-cell matrix by singular value decomposition, with PC1 representing the PAS of each pathway; B shows the annotation of SNPs from the GWAS data to the corresponding pathways; C shows the polygenic regression model: the upper panel shows the polygenic regression model inferring the estimated coefficient for each pathway, after which the estimated coefficients and the corresponding PAS are used to compute the gPAS, and the lower panel shows the Pearson correlation model used to correlate the gPAS of each cell with gene expression across all individual cells in order to rank the trait-related genes; the top N trait-related genes (the top 1,000 by default) are passed to the AddModuleScore function in Seurat to calculate the trait-related score TRS of each cell; D shows the outputs, of which there are four: trait-related cells, trait-related cell types, trait-related pathways, and trait-related genes.

The validation results of this validation example show that assigning intrinsic weights to indications moderately improves the performance of the method relative to the default settings.

Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Those of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.

Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The computer device provided by the present invention has been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, vary the specific implementation and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of processing respiratory disease data, comprising:
acquiring single cell sequencing sequence data to be analyzed;
processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain PAS scoring matrix of a cell pathway and PAS of the cell pathway;
acquiring genetic association data of respiratory diseases, and processing the genetic association data to obtain pathway data with SNP annotations;
performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
outputting the genetically related pathway activity score gPAS.
2. The method of claim 1, wherein the step of statistically analyzing the PAS of the cellular pathway and the pathway data with SNP annotations to obtain estimated coefficients comprises:
obtaining genetic effect values of all SNPs in a single pathway based on the pathway data with SNP annotations;
performing, based on a polygenic regression model of the genetic association data of respiratory diseases and on the PAS and the genetic effect values, parameter estimation on the distribution of the genetic effect values to obtain the estimated coefficients;
optionally, the genetic effect values are given by a formula in which $\beta$ denotes the vector of theoretical effect sizes of the $m$ SNPs, $\varepsilon$ denotes the random environmental error, $R$ denotes the LD matrix, and $X^{T}$ denotes the standardized genotypes of the SNPs in the genetic association data sample;
optionally, the estimated coefficients are obtained from a formula
in which $\tau_{i,j}$ denotes the estimated coefficient of pathway $i$ in cell $j$, $\tau_{0}$ denotes the intercept term, $\sigma^{2}$ denotes the variance of the SNP effect sizes in the pathway, and $\tilde{s}_{i,j}$ denotes the weighted PAS.
3. The method of claim 1, wherein the genetically related pathway activity score gPAS ($gP_j$) is obtained as: $gP_j = \sum_{i} \hat{\tau}_{i,j}\,\mathrm{PAS}_{i,j}$,
wherein $gP_j$ is the gPAS, $\hat{\tau}_{i,j}$ is the optimized estimated coefficient, and $\mathrm{PAS}_{i,j}$ is the PAS of pathway $i$ in cell $j$.
4. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: performing correlation analysis between the genetically related pathway activity score gPAS and the gene expression of each cell, ranking the genes, and screening out N trait-related genes;
optionally, the correlation analysis and ranking comprise: determining the correlation between the expression of each individual gene and the gPAS using the Pearson correlation coefficient (PCC), and ranking the genes by correlation to obtain the N trait-related genes;
optionally, the N trait-related genes are the top 1000 or the bottom 1000 trait-related genes after sorting by correlation in descending or ascending order;
optionally, the N trait-related genes include one or more of the following: CALM3, PIK3R1, IL32, CD3E, B2M, PRS29, and GZMB.
5. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: calculating a trait-related score TRS for each cell according to the N trait-related genes; clustering according to the trait-related score TRS and the single-cell-level P values to obtain trait-related cells associated with respiratory diseases of different severity levels;
optionally, calculating the trait-related score TRS of the N trait-related genes using a cell scoring method;
optionally, the severity levels of the respiratory disease include mild, moderate, and severe.
6. The method of processing respiratory disease data according to claim 1, wherein the method further comprises: obtaining trait-related cell types or subpopulations based on the block bootstrap method;
optionally, the method further comprises: ranking the genetically related pathway activity scores gPAS, and obtaining the trait-related pathways according to the ranking result and the P values of the pathways at the cell-type level.
7. Use of a product for detecting a novel naïve CD8+ T cell subpopulation in the preparation of a product for diagnosing a respiratory disease.
8. A device for processing respiratory disease data, the device comprising: a memory and a processor;
the memory is used for storing program instructions; the processor is adapted to invoke program instructions for performing the method of processing respiratory disease data according to any of claims 1-6 when the program instructions are executed.
9. A system for processing respiratory disease data, comprising:
an acquisition unit for acquiring single-cell sequencing sequence data to be analyzed;
the first processing unit is used for processing the single-cell sequencing sequence data to be analyzed by adopting a machine learning method to obtain a PAS scoring matrix of a cell pathway and PAS of the cell pathway;
the second processing unit is used for acquiring genetic association data of the respiratory tract diseases, and processing the genetic association data to obtain pathway data with SNP annotations;
a third processing unit for performing statistical analysis processing on PAS of the cell pathway and the pathway data with SNP annotation to obtain an estimation coefficient;
a fourth processing unit for multiplying the estimation coefficient by PAS and then summing to obtain a genetic related pathway activity score gPAS of the cell;
and an output unit for outputting the genetically related pathway activity score gPAS.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of processing respiratory disease data according to any of the preceding claims 1-6.
CN202211277916.2A 2022-10-19 2022-10-19 Processing method and system for respiratory disease data Pending CN116486911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211277916.2A CN116486911A (en) 2022-10-19 2022-10-19 Processing method and system for respiratory disease data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211277916.2A CN116486911A (en) 2022-10-19 2022-10-19 Processing method and system for respiratory disease data

Publications (1)

Publication Number Publication Date
CN116486911A true CN116486911A (en) 2023-07-25

Family

ID=87225583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211277916.2A Pending CN116486911A (en) 2022-10-19 2022-10-19 Processing method and system for respiratory disease data

Country Status (1)

Country Link
CN (1) CN116486911A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination