CN118335319A - Early prediction method for common major diseases based on virtual person simulation

Info

Publication number: CN118335319A
Application number: CN202410749228.4A
Authority: CN
Inventors: 张韬; 李佳圆; 温晓玲; 伍东升; 陈苗双; 胡琳; 陈馨
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2024-06-12
Filing date: 2024-06-12
Publication date: 2024-07-12
Anticipated expiration: 2044-06-12
Also published as: CN118335319B

Abstract

The invention belongs to the technical field of disease prediction and discloses a method for early prediction of common major diseases based on virtual human simulation. The method comprises constructing, through structure learning and parameter learning, a virtual human prediction model based on a dynamic Bayesian network, and performing early prediction of common major diseases with that model. The invention combines the multidimensional dynamic molecular change characteristics observed over the course of a longitudinal healthy-aging cohort, builds a "virtual human" simulation prediction model based on a machine-learning-enhanced dynamic Bayesian network model, group learning, and related techniques, reveals the "multi-cause, multi-effect" joint effects of complex exposures and phenotypic characteristics on multiple health outcomes, screens novel, biologically meaningful markers of healthy aging, and improves the sensitivity and accuracy of early risk prediction for major diseases during aging through the complementarity of omics information at different levels.

Description

Early prediction method of common major diseases based on virtual human simulation

Technical Field

The present invention belongs to the technical field of disease prediction, and in particular relates to a method for early prediction of common major diseases based on virtual human simulation.

Background Art

Risk factors for common major diseases include age, genetic factors, lifestyle, environmental factors, history of chronic disease, and psychological factors; the factors influencing disease onset are wide-ranging and their relationships are complex. The analytic framework shown in FIG. 2 provides a systematic method for evaluating the evidence for preventive medical services and offers important guidance for building accurate risk assessment models. The rapid development of high-throughput multi-omics, such as genomics, proteomics, and metabolomics, has made it possible to elucidate at the molecular level the complex mechanisms underlying the occurrence and development of common major diseases. Studies have shown that genetic factors also play an important role in the occurrence and development of common major diseases. Therefore, studying the risk of common major diseases at the molecular level, and building a multi-omics-based risk assessment model on that basis, can help predict disease phenotypes and stratify the risk of onset, laying a foundation for precise prevention, diagnosis, and treatment.

Traditional single-omics studies have provided important information for screening novel molecular markers of aging and of the common major diseases that arise during aging, but the biological-process information they provide is often quite limited, and they explain only part of the complex aging process as a whole. Integrating information across different levels, such as genes, proteins, metabolism, lipids, and microbiota, not only provides more evidence for explaining the biological mechanisms of aging but also helps uncover novel molecular markers related to common major diseases during aging. It is therefore desirable to integrate multi-dimensional information such as genomics, metabolomics, dietary patterns, and nutritional status in order to construct a risk assessment model for the onset of common major diseases, identify high-risk populations, dynamically assess an individual's future risk of developing such diseases, and provide individualized, targeted early preventive interventions for high-risk populations.

Summary of the Invention

The purpose of the present invention is to provide a method for early prediction of common major diseases based on virtual human simulation. The method uses a dynamic Bayesian network model based on real causal relationships and machine learning as the virtual human prediction model; it can predict the stage of major diseases and provides an important reference for the risk assessment, prevention, and treatment of common major diseases.

To achieve the above purpose, the present invention adopts the following technical solution:

A method for early prediction of common major diseases based on virtual human simulation, comprising constructing, through structure learning and parameter learning, a virtual human prediction model based on a dynamic Bayesian network, and performing early prediction of common major diseases with the virtual human prediction model;

wherein the structure learning is used to screen, from among potential factors, the factors relevant to risk prediction, and to determine the topology of the dynamic Bayesian network so as to construct the virtual human prediction model;

the structure learning first uses fuzzy theory to convert knowledge into normalized fuzzy membership degrees, determines the causal relationship between two variables from the membership degrees, and uses the relationship between the two variables as a strong constraint or a weak constraint on the search of the Bayesian network structure space, thereby constructing multiple initial virtual human prediction models; the partition MCMC method is then used to find the optimal virtual human prediction model among the multiple initial virtual human prediction models;

the parameter learning is used to determine the probability distribution of each variable in the optimal virtual human prediction model conditional on its parent node set: the conditional probability relationships between the nodes of the optimal virtual human prediction model are estimated from an existing data set, and probability parameters are assigned to each node; the parameter learning takes the stage of the common major disease as the categorical outcome indicator for risk assessment, first clusters the data by the EKM method, and then obtains prediction results with the expanded two-stage stacking algorithm; finally, the optimal parameters of the virtual human prediction model are determined by grid search and 5-fold cross-validation, thereby constructing the final virtual human prediction model;

the expanded two-stage stacking algorithm takes the class posterior probabilities output by N different primary learners for the same data set as N separate fixed-dimension input vectors of the meta-layer classifier, and adds one primary layer to the stacking algorithm.

Further, the causal relationship between two variables is determined from the membership degree as follows: when the membership degree is 0, there is no causal relationship between the two variables; when the membership degree is 1, there is a causal relationship between the two variables; when the membership degree lies between 0 and 1, the causal relationship between the two variables cannot be determined.

Further, the strong constraint refers to the causal relationship between two variables when the membership degree is 0 or 1, and the weak constraint refers to the causal relationship between two variables when the membership degree lies between 0 and 1.

Further, the partition MCMC method comprises:

According to the partitioning requirements and the partitioning rules, all variables in the initial topology of the Bayesian network are divided into m zones, which are numbered in sequence. Let the number of variables in the i-th zone (i=1,2,…,m) be k_i, so that the total number of variables is the sum of the k_i; let λ=(k_1,k_2,…,k_m) record the number of variables in each zone and π_λ record the specific variables in each zone, and denote the labeled partition by Λ=(λ,π_λ). Given the data D, the posterior distribution of the labeled partition Λ is proportional to the total score obtained by combining the scores of each node X_i and its parent node set Pa_i in the Bayesian network structures under Λ. The total score reflects the equivalence between the labeled partition space and the Bayesian network structure space, and the optimal network structure in the Bayesian network structure space is determined from the labeled partition with the largest total score;

In each iteration, the current labeled partition is Λ and the proposed labeled partition is Λ^*; the proposal is accepted according to an acceptance probability defined in terms of the neighborhoods of the labeled partitions, where the neighborhood of a labeled partition consists of the labeled partitions obtained from it by splitting one zone into two zones or by merging two adjacent zones into one zone.

Further, the partitioning requirements are that there are no arrows between variables in the same zone, that the variables in zone 1 have no parent nodes, and that every variable in each zone other than zone 1 has at least one parent node in the preceding zone;

the partitioning rules are determined by the strong constraints, and variables not explicitly covered by the strong constraints are assigned to zones by random simulation.

Further, different primary learners are selected according to the type of data set. The primary learners for image data include the convolutional neural network (CNN), the fully convolutional network (FCN), and the deep Boltzmann machine (DBM); the primary learners for text data include the recurrent neural network (RNN), the long short-term memory network (LSTM), and the gated recurrent unit (GRU); the primary learners for numerical data include logistic regression, the support vector machine (SVM), and naive Bayes.

The present invention combines the multidimensional dynamic molecular change characteristics observed over the course of a longitudinal healthy-aging cohort and builds a "virtual human" simulation prediction model based on a machine-learning-enhanced dynamic Bayesian network model, group learning, and related techniques. It reveals the "multi-cause, multi-effect" joint effects of complex exposures and phenotypic characteristics on multiple health outcomes, screens novel, biologically meaningful markers of healthy aging, and improves the sensitivity and accuracy of early risk prediction for major diseases during aging through the complementarity of omics information at different levels. On this basis, early risk screening tools suitable for clinical application can be developed, promoting the translation of research results into practice and providing a scientific basis for building a comprehensive prevention and treatment system for healthy aging.

Brief Description of the Drawings

FIG. 1 is a schematic flow chart of the present invention.

FIG. 2 shows the USPSTF analytic framework.

FIG. 3 is a schematic diagram of primary learner selection in the present invention.

FIG. 4 is a comparison chart of model accuracy for the present invention.

FIG. 5 is a comparison chart of model sensitivity for the present invention.

FIG. 6 is a comparison chart of model AUC for the present invention.

FIG. 7 is a comparison chart of model specificity for the present invention.

Detailed Description

As shown in FIG. 1, this embodiment provides a method for early prediction of common major diseases based on virtual human simulation. The method builds a virtual human simulation prediction model based on a machine-learning-enhanced dynamic Bayesian network model, group learning, and related techniques, and effectively represents and fuses multi-source information.

This embodiment constructs a dynamic Bayesian network as the virtual human prediction model through structure learning and parameter learning, and uses the virtual human prediction model for early prediction of common major diseases.

The structure learning is used to screen, from among many potential factors, the factors relevant to risk prediction, and to determine the topology of the dynamic Bayesian network model so as to construct the virtual human prediction model. This embodiment aims to identify the factors that influence the target variable, so as to build a more accurate and interpretable Bayesian network structure. This involves analyzing and mining the data to identify correlations and causal relationships between variables. The result of structure learning directly affects the quality and predictive performance of the virtual human prediction model.

This embodiment implements structure learning with fuzzy theory and the partition MCMC method. Structure learning comprises two stages. The first stage is knowledge-driven. Fuzzy theory is a mathematical tool for handling fuzzy and uncertain information and can effectively deal with the fuzziness and uncertainty common in medicine. The goal of the first stage is to convert knowledge into normalized fuzzy membership degrees; put simply, fuzzification turns uncertain medical knowledge into a formal representation for use in subsequent learning and inference. A fuzzy membership degree can be regarded as a prior probability reflecting the causal relationship between variables; it helps limit the number of variables and the complexity of the Bayesian network structure, which eases structure learning to some extent. In the first stage, this embodiment draws on rich knowledge and experience obtained from multiple sources such as medical databases and expert consultation, including but not limited to case data, medical literature, and expert opinion. Fuzzy theory is then applied to process this knowledge, combining all information sources to examine the causal relationships between the variables.

In general, the causal relationship between two variables A and B falls into one of three types: Type 1, A→B, that is, A is a cause of B; Type 2, B→A, that is, B is a cause of A; Type 3, there is no causal relationship between A and B.

These three types are taken as three fuzzy subsets describing the relationship between A and B, and, for each information source, the normalized fuzzy membership degree of the A-B relationship in each of the three fuzzy subsets is calculated.

Depending on the membership degree, three cases arise. Case 1: if the fuzzy membership degree of the A-B relationship for a given type equals 0, A and B definitely do not have that relationship; for example, if the Type 1 membership degree equals 0, A is definitely not a cause of B. Case 2: if the fuzzy membership degree of the A-B relationship for a given type equals 1, A and B definitely have that relationship; for example, if the Type 1 membership degree equals 1, A is definitely a cause of B. Case 3: if the membership degrees for every type are strictly between 0 and 1, the relationship between A and B is difficult to determine from knowledge and experience alone and must be judged further in light of the data.

Case 1 and Case 2 are used as strong constraints on the search of the Bayesian network structure space, that is, the presence (or absence) of an arrow between nodes A and B in the Bayesian network is enforced. Case 3 is used as a weak constraint on the search of the Bayesian network structure space and enters the next stage, structure learning by the partition MCMC method, in the form of a prior probability. Together, the strong and weak constraints narrow the search range of the Bayesian network structure space and thereby simplify the task of finding the optimal Bayesian network structure with the partition MCMC method.
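The mapping from membership degrees to constraints described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names `normalize` and `to_constraints`, the type labels, and the tolerance `tol` are all hypothetical, and the raw scores are assumed to come from a single information source.

```python
from typing import Dict, List

# Hypothetical labels for the three relationship types (Type 1, 2, 3).
TYPES = ("A->B", "B->A", "none")


def normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale raw membership scores so that the three types sum to 1."""
    total = sum(raw[t] for t in TYPES)
    if total <= 0:
        raise ValueError("at least one membership score must be positive")
    return {t: raw[t] / total for t in TYPES}


def to_constraints(membership: Dict[str, float], tol: float = 1e-9) -> dict:
    """Split memberships into strong constraints and a weak-constraint prior.

    required  : types with membership 1 (Case 2, strong constraint)
    forbidden : types with membership 0 (Case 1, strong constraint)
    prior     : types strictly between 0 and 1 (Case 3, weak constraint)
    """
    required: List[str] = [t for t in TYPES if membership[t] >= 1.0 - tol]
    forbidden: List[str] = [t for t in TYPES if membership[t] <= tol]
    prior = {t: membership[t] for t in TYPES if tol < membership[t] < 1.0 - tol}
    return {"required": required, "forbidden": forbidden, "prior": prior}


certain = to_constraints(normalize({"A->B": 2.0, "B->A": 0.0, "none": 0.0}))
uncertain = to_constraints(normalize({"A->B": 3.0, "B->A": 1.0, "none": 1.0}))
```

In the first call the arrow A→B would be enforced during the structure search; in the second, the three memberships would only be passed on as a prior.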

The second stage uses the partition MCMC method to search for the optimal Bayesian network structure, with the Bayesian network serving as the model that describes the latent relationships between variables. A Bayesian network is a probabilistic graphical model that represents the dependencies between variables with a directed acyclic graph and describes those dependencies with probability distributions. The goal of the second stage is to find the optimal network structure within the constrained Bayesian network structure space and to use that structure as the virtual human prediction model. The partition MCMC method explores large search spaces more efficiently and thus speeds up the learning of the Bayesian network structure.

The partition MCMC method comprises the following steps:

First, all variables in the initial topology of the Bayesian network are divided into m zones, numbered in sequence as zone 1, zone 2, ..., zone m, subject to the requirements that (a) there are no arrows between variables in the same zone; (b) the variables in zone 1 have no parent nodes; and (c) every variable in each zone other than zone 1 has at least one parent node in the preceding zone. The partitioning rules are determined mainly by the strong constraints from the first stage; variables not explicitly covered by those strong constraints are assigned to zones by random simulation.
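Requirements (a) to (c) can be checked mechanically. The sketch below is illustrative only; the function name `is_valid_partition` and the toy variable names are hypothetical, every parent is assumed to appear in some zone, and only the three stated requirements are checked.

```python
from typing import Dict, List, Set


def is_valid_partition(partition: List[Set[str]],
                       parents: Dict[str, Set[str]]) -> bool:
    """Check requirements (a)-(c) for an ordered list of zones.

    partition : zones in order; partition[0] is zone 1
    parents   : maps each variable to its set of parent variables
    """
    zone_of = {v: i for i, zone in enumerate(partition) for v in zone}
    for i, zone in enumerate(partition):
        for v in zone:
            pa = parents.get(v, set())
            # (a) no arrows between variables in the same zone
            if any(zone_of[p] == i for p in pa):
                return False
            if i == 0:
                # (b) variables in zone 1 have no parents
                if pa:
                    return False
            # (c) at least one parent lies in the preceding zone
            elif not any(zone_of[p] == i - 1 for p in pa):
                return False
    return True


zones = [{"age", "gene"}, {"bmi"}, {"disease"}]
ok = is_valid_partition(zones, {"bmi": {"age"}, "disease": {"bmi", "gene"}})
bad = is_valid_partition(zones, {"bmi": {"age"}, "disease": {"gene"}})
```

Here `ok` is valid, while `bad` fails requirement (c) because "disease" has no parent in the preceding zone.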

Next, let the number of variables in the i-th zone (i=1,2,…,m) be k_i, so that the total number of variables is the sum of the k_i. Each way of partitioning is written λ=(k_1,k_2,…,k_m), and π_λ denotes the variable ordering corresponding to the partitioning λ; that is, under the partitioning rules, λ records the number of variables in each zone and π_λ records the specific variables in each zone. The labeled partition is denoted Λ=(λ,π_λ). Given the data D, the posterior distribution of the labeled partition Λ is proportional to the total score obtained by combining the scores of each node X_i and its parent node set Pa_i in the Bayesian network structures under Λ. The total score expresses the equivalence between the labeled partition space and the Bayesian network structure space: once the labeled partition with the largest score has been found in the labeled partition space, the corresponding optimal network structure in the Bayesian network structure space can be determined from it.

Then, an MCMC method over the labeled partition space is constructed. In each iteration of the MCMC method, the current labeled partition is denoted Λ and the proposed labeled partition Λ^*; the proposal is accepted according to an acceptance probability defined in terms of the neighborhoods of the labeled partitions, where the neighborhood of a labeled partition consists of the labeled partitions reached from it by operations such as splitting one zone into two zones or merging two adjacent zones into one zone. Given a training data set, the partition MCMC method yields a series of sampled Bayesian network structures that follow a specific stationary posterior distribution. Bayesian theory is then used for statistical inference on these sampled structures, which handles well the uncertainty in the Bayesian network structure that arises from high-dimensional, complex data structures, sampling error, and similar causes.
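The formulas for the total score and the acceptance probability are not legible in this text. As a hedged reconstruction, assuming a standard Metropolis-Hastings scheme with a proposal drawn uniformly from the neighborhood, they could take the following form. The symbols nbd(·) for the neighborhood, S(·) for the per-node score, and P(·|D) for the posterior are notational assumptions, not taken from the source.

```latex
% Assumed form of the total score: a product over nodes of the summed
% scores of all parent sets permitted by the labeled partition.
P(\Lambda \mid D) \;\propto\; \prod_{i=1}^{n} \;\sum_{Pa_i \,\text{permitted by}\, \Lambda} S(X_i, Pa_i \mid D)

% Assumed Metropolis-Hastings acceptance probability for a move from
% the current labeled partition to the proposed one.
\rho(\Lambda \to \Lambda^{*}) \;=\; \min\left\{ 1,\;
  \frac{\lvert \mathrm{nbd}(\Lambda) \rvert \; P(\Lambda^{*} \mid D)}
       {\lvert \mathrm{nbd}(\Lambda^{*}) \rvert \; P(\Lambda \mid D)} \right\}
```

The ratio of neighborhood sizes corrects for the fact that splitting and merging moves generally do not give Λ and Λ^* the same number of neighbors.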

Structure learning in this embodiment combines fuzzy theory with the partition MCMC method and draws on the strengths of both in handling uncertain information. Restricting the search for the optimal Bayesian network structure to the range that is medically meaningful ensures that the resulting structure has practical significance and improves the efficiency of the algorithm.

The parameter learning is used to determine the probability distribution of each variable in the network conditional on its parent node set. This embodiment uses existing data to estimate the conditional probability relationships between the nodes of the network and assigns appropriate probability parameters to each node of the Bayesian network, so that the network more accurately reflects the statistical characteristics and probability distribution of the data. Structure learning and parameter learning together constitute the basic construction process of the machine-learning-enhanced dynamic Bayesian network and yield the final virtual human prediction model. By combining real causal relationships with machine learning techniques, the virtual human prediction model makes better use of multi-source information for risk prediction and decision support, and provides effective tools and methods for data analysis and prediction tasks in practical settings.
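For discrete variables, one common way to assign such probability parameters is to count frequencies in the data. The sketch below assumes this approach, with optional Laplace smoothing `alpha`; the source does not state which estimator it uses, and the function name `estimate_cpt` and the toy variables are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def estimate_cpt(records: List[dict], child: str, parents: Sequence[str],
                 alpha: float = 1.0) -> Dict[Tuple, Dict[str, float]]:
    """Estimate P(child | parents) from records by smoothed frequency counts.

    alpha is a Laplace smoothing constant (an assumption of this sketch);
    alpha=0 gives the plain maximum-likelihood estimate.
    """
    states = sorted({r[child] for r in records})
    counts: Dict[Tuple, Counter] = defaultdict(Counter)
    for r in records:
        key = tuple(r[p] for p in parents)  # one row per parent configuration
        counts[key][r[child]] += 1
    cpt = {}
    for key, c in counts.items():
        total = sum(c.values()) + alpha * len(states)
        cpt[key] = {s: (c[s] + alpha) / total for s in states}
    return cpt


records = [
    {"smoker": 1, "stage": "high"},
    {"smoker": 1, "stage": "high"},
    {"smoker": 1, "stage": "low"},
    {"smoker": 0, "stage": "low"},
]
cpt = estimate_cpt(records, child="stage", parents=["smoker"], alpha=0.0)
```

Each key of `cpt` is one configuration of the parent nodes, and each value is the conditional distribution of the child node under that configuration.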

In parameter learning, this embodiment takes the stage of the common major disease as the categorical outcome indicator for risk assessment. The proportions of the different categories of this outcome indicator differ markedly. This embodiment combines the grades of the different categories of common major diseases into a single indicator, which further widens the differences in category proportions within the composite indicator and makes the data set imbalanced. When a disease prediction model is trained on an imbalanced data set, traditional machine learning algorithms tend to produce models that maximize overall classification accuracy while neglecting the minority classes; model performance suffers as a result, and prediction accuracy for the minority classes in particular is seriously affected. Because the imbalance between data for different disease stages may affect both the estimation of model parameters and the running time of the algorithm, this embodiment follows the idea of decision-level fusion: after the primary classifiers have each made decisions on the different types of data, the results of the base classifiers are fused by ensemble learning.

This embodiment uses the expanded two-stage stacking algorithm to handle the data imbalance problem. The algorithm differs from traditional stacking in two respects. The first is the way input attributes are represented between layers. Traditional stacking usually takes the outputs of N different primary learners for the same sample as the feature elements of a single input vector to the meta-learner, so the feature dimension of the meta layer grows as the number of primary learners increases, which lengthens the running time of the algorithm. This embodiment instead takes the class posterior probabilities output by the N different primary learners for the same sample as N separate fixed-dimension input vectors of the meta-layer classifier. This prevents the dimension of the meta-layer classifier's training data from growing with the number of primary classifiers and also increases the number of training samples available to the meta-layer learner, saving running time while avoiding the data sparsity caused by excessively high dimensionality.
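The difference between the two input representations can be made concrete with a small sketch. The helper names `concat_meta_features` and `row_meta_features` are hypothetical, and the posterior values are made-up toy numbers; with N learners, S samples, and C classes, the traditional layout gives S rows of length N×C, whereas the layout described above gives N×S rows of fixed length C.

```python
from typing import List, Tuple

# posteriors[n][s] is the class-posterior vector that primary learner n
# outputs for sample s (N learners, S samples, C classes).
Posteriors = List[List[List[float]]]


def concat_meta_features(posteriors: Posteriors) -> List[List[float]]:
    """Traditional stacking: one meta-row per sample, of length N * C."""
    n_samples = len(posteriors[0])
    return [[p for learner in posteriors for p in learner[s]]
            for s in range(n_samples)]


def row_meta_features(posteriors: Posteriors,
                      labels: List[int]) -> Tuple[List[List[float]], List[int]]:
    """Layout described above: each learner's posterior vector is its own
    meta-row, giving N * S rows of fixed length C."""
    rows, row_labels = [], []
    for learner in posteriors:
        for s, probs in enumerate(learner):
            rows.append(list(probs))
            row_labels.append(labels[s])  # the row inherits its sample's label
    return rows, row_labels


toy = [
    [[0.9, 0.1], [0.2, 0.8]],  # learner 1
    [[0.7, 0.3], [0.4, 0.6]],  # learner 2
    [[0.6, 0.4], [0.1, 0.9]],  # learner 3
]
wide = concat_meta_features(toy)            # 2 rows x 6 columns
rows, row_labels = row_meta_features(toy, labels=[0, 1])  # 6 rows x 2 columns
```

Adding a fourth learner would widen `wide` to 8 columns but would only add rows to `rows`, which keeps the meta-layer input dimension fixed.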

The second respect is the number of layers. This embodiment adds one primary layer to the stacking algorithm, expanding the traditional two-layer stacking algorithm into a three-layer algorithm. Because the change in input attribute representation between layers already saves running time, the stacking algorithm with the added primary layer runs no longer than a traditional stacking algorithm using the same number of primary learners. Adding a further primary layer also helps improve the generalization ability of the ensemble. By adding a layer to traditional stacking, the expanded two-stage stacking algorithm of this embodiment supports both the accuracy of the diagnostic model's estimates and their reliability under extrapolation. It addresses the challenges posed by imbalanced data sets more effectively and improves the accuracy and robustness of minority-class prediction.

In parameter learning, the EKM (Ensemble K-modes) method is used to cluster the majority-class samples; the EKM method effectively divides the majority-class samples into distinct clusters. After clustering, two data combination strategies are used: (1) the majority-class samples are clustered into K clusters, and each cluster is combined with the minority-class samples to form a new data set s₁; (2) the majority-class samples are clustered into K clusters, each cluster is divided into K parts, and one part is selected from each cluster and combined with the minority-class samples to form a new balanced sample s₂.
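The two combination strategies can be sketched as below. The clustering itself is not implemented here: the clusters are assumed to be given (for example, produced by EKM), the function names `strategy_one` and `strategy_two` are hypothetical, and the random split of each cluster into K parts is one possible reading of "divided into K parts".

```python
import random
from typing import List


def strategy_one(clusters: List[list], minority: list) -> List[list]:
    """Strategy (1): pair each majority cluster with all minority samples."""
    return [cluster + minority for cluster in clusters]


def strategy_two(clusters: List[list], minority: list, seed: int = 0) -> list:
    """Strategy (2): split each cluster into K parts, pick one part per
    cluster, and combine the picked parts with all minority samples."""
    rng = random.Random(seed)
    k = len(clusters)
    picked = []
    for cluster in clusters:
        shuffled = cluster[:]
        rng.shuffle(shuffled)
        parts = [shuffled[i::k] for i in range(k)]  # K roughly equal parts
        picked.extend(rng.choice(parts))
    return picked + minority


# Toy majority clusters (K = 3) and minority samples, identified by integers.
clusters = [list(range(0, 6)), list(range(10, 16)), list(range(20, 26))]
minority = [100, 101, 102, 103, 104, 105]

s1 = strategy_one(clusters, minority)  # three data sets of 12 samples each
s2 = strategy_two(clusters, minority)  # one data set: 6 majority + 6 minority
```

With 18 majority and 6 minority samples, strategy (2) yields a sample in which the two classes are equally represented while every cluster still contributes.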

在开展集成学习时，本实施例首先根据不同的数据类型选择相应的若干种初级学习器，如图3所示，针对图像型数据的初级学习器包括卷积神经网络CNN、全卷积网络FCN和深度玻尔兹曼机DBM等；针对文本型数据的初级学习器包括循环神经网络RNN、长短期记忆网络LSTM和门控循环单元GRU等；而针对数值型数据的初级学习器包括logistic回归、支持向量机SVM和朴素贝叶斯等。When carrying out ensemble learning, this embodiment first selects several corresponding primary learners according to different data types. As shown in Figure 3, the primary learners for image data include convolutional neural network CNN, full convolutional network FCN and deep Boltzmann machine DBM, etc.; the primary learners for text data include recurrent neural network RNN, long short-term memory network LSTM and gated recurrent unit GRU, etc.; and the primary learners for numerical data include logistic regression, support vector machine SVM and naive Bayes, etc.

Then, the optimal parameters of the model are selected by grid search with 5-fold cross-validation, and a conditional probability table is computed for each node in the network. Finally, these conditional probability tables are used for risk assessment, giving a staged assessment of common major diseases. The assessment results can provide an important reference for the risk assessment, prevention and treatment of common major diseases.
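A minimal sketch of this step for a single node with one parent is given below. The text does not state which parameters are tuned, so the smoothing constant `alpha`, the log-likelihood scoring and all function names are assumptions chosen only to show grid search, 5-fold cross-validation and a counted conditional probability table working together.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, str]  # (parent_state, child_state)
CPT = Dict[str, Dict[str, float]]


def fit_cpt(rows: Sequence[Row], child_states: Sequence[str],
            alpha: float) -> CPT:
    """Estimate P(child | parent) by counting with additive smoothing."""
    joint = Counter(rows)
    parent_totals = Counter(parent for parent, _ in rows)
    cpt: CPT = {}
    for parent, total in parent_totals.items():
        denom = total + alpha * len(child_states)
        cpt[parent] = {c: (joint[(parent, c)] + alpha) / denom
                       for c in child_states}
    return cpt


def log_likelihood(cpt: CPT, rows: Sequence[Row],
                   child_states: Sequence[str]) -> float:
    """Score held-out rows; unseen parent states fall back to uniform."""
    uniform = 1.0 / len(child_states)
    return sum(math.log(cpt.get(p, {}).get(c, uniform)) for p, c in rows)


def grid_search_cv(rows: List[Row], child_states: Sequence[str],
                   alphas: Sequence[float], k: int = 5) -> float:
    """Pick the alpha with the best total held-out log-likelihood."""
    folds = [rows[i::k] for i in range(k)]
    best_alpha, best_score = alphas[0], -math.inf
    for alpha in alphas:  # alphas must be > 0 so that log() is defined
        score = 0.0
        for i in range(k):
            train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            score += log_likelihood(fit_cpt(train, child_states, alpha),
                                    folds[i], child_states)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha


states = ["high", "low"]
data = ([("smoker", "high")] * 6 + [("smoker", "low")] * 4
        + [("non-smoker", "high")] * 2 + [("non-smoker", "low")] * 8)
best = grid_search_cv(data, states, alphas=[0.1, 1.0, 10.0])
print(best, fit_cpt(data, states, best))
```

The toy variables (smoking status and a two-level risk state) are invented for the example; in the described method each node of the dynamic Bayesian network would receive its own table conditioned on its parent set.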

To verify the effectiveness of the prediction method provided in this embodiment, the virtual human prediction model obtained after structure learning and parameter learning is evaluated. In this embodiment, evaluation means assessing the predicted disease status of patients. Because a patient's actual disease status can be obtained from medical records and follow-up, it is only necessary to check how well the predicted status agrees with the actual status.

Evaluation indicators: Many indicators are available for evaluating early prediction models for major diseases, such as the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) for quantitative variables, and the error rate and ROC analysis for categorical variables. Because these indicators are conventional, they are not described further here. Note, however, that the traditional ROC curve applies only to binary outcomes (such as death or survival), whereas the outcome in this embodiment has more than two health states (such as extremely low, low, medium, high and extremely high risk). The traditional way of handling a multi-class outcome variable is to recode it as a binary outcome, but this may seriously bias the estimated classification performance. This embodiment therefore adopts high-dimensional ROC analysis, in which the correct classification rate (CCR) of each category defines one dimension of a high-dimensional ROC surface.

Specifically, the correct classification rates of the categories serve as the coordinate axes of a coordinate system; the positions corresponding to different combinations of critical values (the operating points) are marked in this coordinate system, and the high-dimensional ROC surface is formed by connecting these points. As with the two-dimensional ROC curve, high-dimensional ROC analysis uses the volume under the high-dimensional ROC surface (VUS) to measure how accurately the treatment selection marker discriminates among all subjects. In probabilistic terms, VUS is "the probability that, after one individual is drawn at random from each population to form a new sample, every individual in that sample is correctly assigned to the group it actually belongs to." This embodiment estimates VUS with both a non-parametric and a semi-parametric method, whose accuracy and precision have been verified in previous studies.
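The probabilistic definition of VUS suggests a direct non-parametric estimate, sketched below for the simplified case of a single scalar marker and ordered classes: draw one subject from every class and count how often the marker values are in the correct order. This is a brute-force illustration, not necessarily the estimator used in the embodiment; the function name is hypothetical, and ties are counted as incorrect for simplicity.

```python
from itertools import product
from typing import Sequence


def vus_nonparametric(groups: Sequence[Sequence[float]]) -> float:
    """Estimate VUS for ordered classes from one scalar marker.

    groups[0] holds the marker values of the lowest-risk class and
    groups[-1] those of the highest-risk class. The estimate is the
    fraction of tuples (one subject per class) whose marker values are
    strictly increasing. Cost grows with the product of the group sizes.
    """
    total = 0
    ordered = 0
    for combo in product(*groups):
        total += 1
        if all(a < b for a, b in zip(combo, combo[1:])):
            ordered += 1
    return ordered / total


# Three perfectly separated risk classes.
print(vus_nonparametric([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
# Three overlapping classes.
print(vus_nonparametric([[1, 4], [2, 5], [3, 6]]))
```

For M classes a marker with no discriminating ability gives an expected VUS of 1/M!, so values should be read against that baseline rather than against 0.5.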

The data set is clustered with the EKM method. The similarity between a new sample and the training set is denoted mad, and the similarity between a new sample and the correctly classified samples of the training set is denoted mac. This gives four different clustered sample data sets: s1 mad, s1 mac, s2 mad and s2 mac. For each of them, the predicted probabilities from four primary learners (LR, C4.5, SVM, KNN) are corrected, and the accuracy, sensitivity, specificity and AUC are calculated, as shown in Figures 4 to 7.

(1) Accuracy: As shown in Figure 4, accuracy tends to increase as IR increases. Overall, s1 mad and s1 mac outperform s2 mad and s2 mac.

(2) Sensitivity: As shown in Figure 5, the trend is similar. When IR is small (2, 4), s1 mad and s1 mac perform better than s2 mad and s2 mac; when IR is large (16, 32), the result is reversed and s2 mad and s2 mac perform better than s1 mad and s1 mac.

(3) AUC: As shown in Figure 6, as with accuracy and specificity, the performance of all four methods on all four learners tends to increase as IR increases. Overall, s1 mad and s1 mac outperform s2 mad and s2 mac.

(4) Specificity: As shown in Figure 7, specificity tends to increase as IR increases, as accuracy does. Overall, except when IR = 2, s1 mad and s1 mac outperform s2 mad and s2 mac on all four classifiers.
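For reference, the four indicators reported above can be computed for a binary outcome as in the following sketch. The function names, the 0.5 decision threshold and the toy data are assumptions; the probability correction step mentioned in the text is not reproduced here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 0, 1]
probabilities = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
print(binary_metrics(labels, predictions), auc(labels, probabilities))
```

On imbalanced data, accuracy can stay high while sensitivity for the minority class is poor, which is why the results report sensitivity and specificity separately.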

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or replacement based on the technical solution and inventive concept provided by the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for early prediction of common major diseases based on virtual human simulation, characterized by comprising the following steps: constructing, through structure learning and parameter learning, a virtual human prediction model based on a dynamic Bayesian network, and carrying out early prediction of common major diseases with the virtual human prediction model;

The structure learning is used for screening out factors related to risk prediction from potential factors, and determining the topological structure of a dynamic Bayesian network to construct a virtual human prediction model;

The structure learning first converts knowledge into normalized fuzzy membership degrees by using fuzzy theory, determines the causal relationship between two variables from the membership degree, and constructs a plurality of initial virtual human prediction models by taking the relationship between the two variables as a strong constraint or a weak constraint on the search of the Bayesian network structure space; an optimal virtual human prediction model is then found among the plurality of initial virtual human prediction models by using a partition MCMC method;

The parameter learning is used to determine the probability distribution of each variable in the optimal virtual human prediction model given its set of parent nodes, to estimate the conditional probability relationships among all nodes in the optimal virtual human prediction model from the existing data set, and to assign probability parameters to each node; the parameter learning takes the stage of the common major diseases as the categorical outcome indicator of risk assessment, performs clustering on the major diseases by an EKM method, and predicts the outcome with an expanded two-stage stacking algorithm; finally, the optimal parameters of the virtual human prediction model are determined by grid search and 5-fold cross-validation, so as to construct the final virtual human prediction model;

the expanded two-stage stacking algorithm takes the posterior probabilities of the output classes of N different primary learners on the same data set as N fixed-dimension input vectors of the meta-layer classifier, and adds one primary layer to the stacking algorithm.

2. The method for early prediction of common major diseases based on virtual human simulation according to claim 1, characterized in that the causal relationship between two variables is determined from the membership degree as follows: when the membership degree is 0, there is no causal relationship between the two variables; when the membership degree is 1, there is a causal relationship between the two variables; and when the membership degree is between 0 and 1, the causal relationship between the two variables cannot be determined.

3. The method for early prediction of common major diseases based on virtual human simulation according to claim 2, characterized in that the strong constraint is the causal relationship between two variables when the membership degree is 0 or 1, and the weak constraint is the causal relationship between two variables when the membership degree is between 0 and 1.

4. The method for early prediction of common major diseases based on virtual human simulation according to claim 1, characterized in that the partition MCMC method comprises the following steps:

Dividing all variables in the initial topological structure of the Bayesian network into m partitions according to the partition requirements and partition rules, and numbering the m partitions in sequence; letting the number of variables in the i-th partition (i = 1, 2, …, m) be k_i [formula not reproduced in the source text]; denoting the numbers of variables in the partitions by λ and the specific variables in each partition by π_λ, so that the labeled partition is Λ = (λ, π_λ); the Bayesian network structure under the labeled partition Λ is denoted by [symbol not reproduced in the source text]; given data D, the posterior distribution of the labeled partition Λ is proportional to the total score obtained by combining the scores of each node X_i and its parent set Pa_i in the Bayesian network structure [formula not reproduced in the source text]; because the labeled partition space is equivalent to the Bayesian network structure space, the optimal network structure in the Bayesian network structure space is determined from the labeled partition with the largest total score;

In each iteration, the current labeled partition is Λ, the proposed labeled partition is Λ*, and the acceptance probability is [formula not reproduced in the source text], wherein the symbols in the formula are defined with respect to the labeled partitions [symbols not reproduced in the source text], and a labeled partition is modified by splitting one partition into two partitions or by merging two adjacent partitions into one partition.

5. The method for early prediction of common major diseases based on virtual human simulation according to claim 4, characterized in that the partition requirements comprise: no arrow connects variables within the same partition, the variables in the 1st partition have no parent node, and each variable in every partition other than the 1st partition has at least one parent node from the preceding partition;

the partition rules are determined by the strong constraints, and variables not explicitly specified in the strong constraints are assigned to partitions by random simulation.

6. The method for early prediction of common major diseases based on virtual human simulation according to claim 1, characterized in that different primary learners are selected according to the type of data set, wherein the primary learners for image data comprise a convolutional neural network (CNN), a fully convolutional network (FCN) and a deep Boltzmann machine (DBM); the primary learners for text data comprise a recurrent neural network (RNN), a long short-term memory network (LSTM) and a gated recurrent unit (GRU); and the primary learners for numerical data comprise logistic regression, a support vector machine (SVM) and naive Bayes.