CN115101133A

CN115101133A - Ensemble learning-based SNP interaction detection system

Info

Publication number: CN115101133A
Application number: CN202210860224.4A
Authority: CN
Inventors: 王峻; 王昕�; 余国先; 郭茂祖; 何伟
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-09-23

Abstract

The invention provides a SNP interaction detection system based on ensemble learning, belonging to the technical field of artificial intelligence data mining, classification and bioinformatics. The system comprises: a data acquisition module configured to acquire SNP sequence information of diseased and non-diseased samples and preprocess it to construct a genome-wide SNP set; a SNP subset division and combination generation module configured to divide the genome-wide SNP set into a plurality of SNP subsets and construct SNP combinations based on the SNP subsets; a multi-classifier parallel evaluation module configured to evaluate, in parallel, the association of the SNP combinations with the disease using a plurality of classifiers; and a result verification module configured to use the chi-square test to verify the statistical significance of the disease-associated SNP combinations identified by the classifiers.

Description

A SNP Interaction Detection System Based on Ensemble Learning

Technical Field

The present disclosure belongs to the technical field of artificial intelligence data mining, classification and bioinformatics, and in particular relates to a SNP interaction detection system based on ensemble learning.

Background

The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

With the development of whole-genome sequencing and high-throughput technologies, researchers have been able to obtain information on hundreds of millions of single nucleotide polymorphisms (SNPs). However, how to use suitable machine learning techniques to find disease-related combinations of multiple SNPs in massive SNP data remains an open difficulty in applying machine learning to association detection.

Existing methods for finding SNP combinations associated with disease include: using statistical methods such as hypothesis testing to study the significance of the association between each SNP combination and the disease; using SNP information to divide the samples into multiple subsets and evaluating the association with the disease from the resulting partition; and using a clustering algorithm to first cluster the genome-wide SNP data and then searching for relevant SNP combinations within each cluster, which can significantly reduce the candidate set and the computational burden.

The inventors found that SNP data are high-dimensional and that the number of combinations grows exponentially with the dimension. Testing the combinations one by one is therefore too costly to be practical and is prone to a high false positive rate, so machine learning techniques for studying the association between multi-SNP interactions and diseases or traits still leave considerable room for improvement.

Summary of the Invention

To solve the above problems, the present disclosure provides a SNP interaction detection system based on ensemble learning. The scheme divides the SNP set into multiple subsets, selects SNP combinations that may be related to the disease within the subsets, and then iteratively selects more relevant SNP combinations, which reduces the required memory and running time. The system uses multiple classifiers for joint evaluation, which reduces the impact that each classifier's preference for particular disease models has on the overall result. Running the classifiers in parallel increases detection speed and lowers the hardware requirements of the system.

According to a first aspect of the embodiments of the present disclosure, a SNP interaction detection system based on ensemble learning is provided, comprising:

a data acquisition module, which is configured to: acquire the SNP sequence information of diseased samples and non-diseased samples, and perform preprocessing to construct a genome-wide SNP set;

a SNP subset division and combination generation module, which is configured to: divide the genome-wide SNP set into a plurality of SNP subsets, and construct SNP combinations based on the SNP subsets;

a multi-classifier parallel evaluation module, which is configured to: use multiple classifiers to evaluate, in parallel, the association between the SNP combinations and the disease;

a result verification module, which is configured to: use the chi-square test to verify the statistical significance of the disease-related SNP combinations identified by the classifiers.

Further, dividing the genome-wide SNP set into multiple SNP subsets and constructing high-dimensional SNP combinations based on the SNP subsets is, taking two-locus SNP combinations as an example, specifically:

dividing the genome-wide SNP set evenly into multiple SNP subsets;

in the first iteration, for each SNP subset, selecting two different SNPs to form a two-locus SNP combination, all possible SNP combinations within the subset forming one set;

in the second and subsequent iterations, for every two different SNP subsets, selecting one SNP from each subset to form a two-locus SNP combination, all possible SNP combinations between the two subsets forming one set; and adding the possibly disease-related SNP combinations output by the previous iteration to a set of SNP combinations that has not yet been evaluated, as the classifier input for the current iteration.

Further, the multi-classifier parallel evaluation module comprises a scoring and voting module, an exchange voting module and a screening module, wherein:

the scoring and voting module is configured to: use each classifier to score the input SNP combinations, and vote according to the scores;

the exchange voting module is configured to: exchange the SNP combinations that each classifier considers possibly related to the disease to all other classifiers, and repeat the scoring and voting;

the screening module is configured to: count the votes of all classifiers, screen out the SNP combinations whose total number of votes is greater than a preset threshold, and input them to the result verification module.

Further, using the chi-square test to verify the statistical significance of the disease-related SNP combinations identified by the classifiers is specifically:

using the chi-square test to calculate a p-value for every SNP combination that the multiple classifiers consider disease-related;

sorting these SNP combinations in ascending order of p-value;

finding an inflection point in the p-values, and outputting the SNP combinations before the inflection point as the final detection result.
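The chi-square test in the first step above starts from a contingency table of disease status against genotype combination. The following is a minimal sketch of the Pearson chi-square statistic for one SNP combination. The function name `chi_square_statistic` and the input layout (one genotype tuple and one 0/1 label per sample) are assumptions made for illustration; converting the statistic to a p-value needs a chi-square distribution function and is left out here.

```python
from collections import Counter

def chi_square_statistic(genotypes, labels):
    """Pearson chi-square statistic and degrees of freedom.

    genotypes: one hashable genotype (e.g. a tuple of codes) per sample.
    labels:    one disease label per sample (1 = diseased, 0 = not diseased).
    """
    cells = Counter(zip(labels, genotypes))   # observed count per table cell
    row_totals = Counter(labels)              # samples per disease status
    col_totals = Counter(genotypes)           # samples per genotype
    n = len(labels)
    stat = 0.0
    for y in row_totals:
        for g in col_totals:
            expected = row_totals[y] * col_totals[g] / n
            stat += (cells[(y, g)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

Only genotypes that actually occur in the data become table columns, so every expected count is positive and the degrees of freedom reflect the observed table rather than the full 3 x 3 genotype grid.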

Further, acquiring the SNP sequence information of the diseased samples and the non-diseased samples and performing preprocessing to construct the genome-wide SNP set is specifically:

Diseased samples are labeled 1 and non-diseased samples are labeled 0. For the mutation status of a sample at each SNP locus: if neither allele is mutated, the locus is coded 0; if one of the two alleles is mutated, it is coded 1; if both alleles are mutated, it is coded 2; if the data at the locus are missing, it is coded 3. In addition, samples with more than 5% of SNPs missing are deleted; SNPs missing in more than 5% of samples are deleted; a p-value is calculated for each SNP using the chi-square test, and SNPs with a p-value > 0.0001 are deleted; SNPs with a minor allele frequency below 0.1 are deleted.

Further, the plurality of classifiers include classifiers based on the Gini index, K2-score, entropy, information gain and APDS.

According to a second aspect of the embodiments of the present disclosure, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program:

acquiring SNP data of diseased individuals and non-diseased individuals, and performing preprocessing to construct a genome-wide SNP set;

dividing the genome-wide SNP set into multiple SNP subsets, and constructing SNP combinations based on the SNP subsets;

inputting a portion of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that are strongly associated with the disease under each classifier to all other classifiers, and evaluating and voting again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, resetting their vote counts to zero, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;

using the chi-square test to verify the SNP combinations screened out in the last iteration whose total number of votes is higher than the preset threshold, finding the inflection point in the sequence of p-values, and outputting the SNP combinations before the inflection point.

According to a third aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the following steps:

According to a fourth aspect of the embodiments of the present disclosure, a computer program product is provided, comprising a computer program that, when run on one or more processors, performs the following steps:

Compared with the prior art, the beneficial effects of the present disclosure are:

(1) The present disclosure provides a SNP interaction detection system based on ensemble learning. The scheme divides the SNP set into multiple subsets, selects SNP combinations that may be related to the disease within the subsets, and then iteratively selects more relevant SNP combinations, which reduces the required memory and running time. The system uses multiple classifiers for joint evaluation, which reduces the impact that each classifier's preference for particular disease models has on the overall result. Running the classifiers in parallel increases detection speed and lowers the hardware requirements of the system.

(2) The scheme of the present disclosure includes all possible SNP combinations in the association evaluation, which avoids missing SNP combinations that are significantly related to the disease and makes the results more credible. At the same time, evaluating the association of the SNP combinations with multiple classifiers lessens the impact of any single classifier's model preference on the overall result, and the classifiers can run in parallel on multiple devices, which reduces the computational burden and the demands on the experimental environment.

(3) The scheme of the present disclosure divides the genome-wide SNP set into multiple SNP subsets and evaluates the SNP combinations step by step over multiple iterations, rather than evaluating all possible SNP combinations in a single pass as is commonly done, which reduces the storage required on the device. The set of SNP combinations significantly related to the disease is delimited by the inflection point of the chi-square p-values rather than by a hard cutoff, which reduces the influence of parameter settings on the experimental results.

Advantages of additional aspects of the present disclosure will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the disclosure.

Description of the Drawings

The accompanying drawings, which form a part of the present disclosure, are provided for further understanding of the disclosure. The exemplary embodiments of the disclosure and their descriptions are used to explain the disclosure and do not unduly limit it.

FIG. 1 is a schematic diagram of the overall flow of the SNP interaction detection system based on ensemble learning described in an embodiment of the present disclosure.

Detailed Description

The present disclosure is further described below with reference to the accompanying drawings and embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It should also be noted that the terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the exemplary embodiments according to the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

The embodiments of the present disclosure and the features of the embodiments may be combined with each other provided they do not conflict.

Embodiment 1:

The purpose of this embodiment is to provide a SNP interaction detection system based on ensemble learning.

A SNP interaction detection system based on ensemble learning, comprising:

Further, the SNP subset division and combination generation module specifically comprises a SNP subset division module and a combination generation module, wherein:

the SNP subset division module is configured to:

divide the genome-wide SNP set into multiple SNP subsets;

the combination generation module is configured to:

construct high-dimensional SNP combination sets from the SNP subsets; and merge the SNP combinations considered disease-related in the previous iteration with SNP combinations that have not yet been evaluated, as the input of the current iteration.

Further, the multi-classifier parallel evaluation module comprises a scoring and voting module, an exchange voting module and a screening module, wherein:

the scoring and voting module is configured to: use a single classifier to score the SNP combinations, and vote according to the scores;

the exchange voting module is configured to: exchange the SNP combinations that a single classifier considers possibly related to the disease to all other classifiers, and repeat the scoring and voting;

the screening module is configured to: count the votes of all classifiers, screen out the SNP combinations with a high total number of votes, and input them to the combination generation module.

Further, the SNP sequence information of the diseased samples and the non-diseased samples is acquired, and data preprocessing is performed;

wherein diseased samples are labeled 1 and non-diseased samples are labeled 0;

wherein, for the mutation status of a sample at each SNP locus in the SNP data: if neither allele is mutated, the locus is coded 0; if one of the two alleles is mutated, it is coded 1; if both alleles are mutated, it is coded 2; if the data at the locus are missing, it is coded 3;

wherein the specific data preprocessing steps are: delete samples with more than 5% of SNPs missing; delete SNPs missing in more than 5% of samples; calculate a p-value for each SNP using the chi-square test and delete SNPs with a p-value > 0.0001; delete SNPs with a minor allele frequency below 0.1.
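The missing-data and minor-allele-frequency filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function name `filter_genotypes`, the list-of-lists data layout, and the choice to compute per-SNP missingness only over the samples that survive the first filter are all assumptions. The chi-square p-value filter is omitted from the sketch.

```python
MISSING = 3  # genotype code the text uses for missing data

def filter_genotypes(genotypes, max_missing=0.05, min_maf=0.1):
    """Return (kept_sample_indices, kept_snp_indices).

    genotypes: list of samples, each a list of codes 0, 1, 2 or 3 (missing).
    """
    n_snps = len(genotypes[0])
    # Drop samples in which more than max_missing of the SNPs are missing.
    kept_samples = [
        s for s, row in enumerate(genotypes)
        if row.count(MISSING) / n_snps <= max_missing
    ]
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[s][j] for s in kept_samples]
        observed = [g for g in column if g != MISSING]
        if not observed:
            continue
        # Drop SNPs missing in more than max_missing of the remaining samples.
        if (len(column) - len(observed)) / len(column) > max_missing:
            continue
        # Codes 0/1/2 count mutated alleles, so mean / 2 is an allele frequency.
        alt_freq = sum(observed) / (2 * len(observed))
        if min(alt_freq, 1 - alt_freq) >= min_maf:
            kept_snps.append(j)
    return kept_samples, kept_snps
```

The default thresholds mirror the 5% and 0.1 values given in the text; they are exposed as parameters only so the sketch can be tried on small inputs.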

Further, dividing the genome-wide SNP set into multiple SNP subsets and generating SNP combinations, taking two-locus SNP combinations as an example, specifically includes:

S1021, dividing the genome-wide SNP set evenly into multiple SNP subsets;

S1022, in the first iteration, for each SNP subset, selecting two different SNPs to form a two-locus SNP combination, all possible SNP combinations within the subset forming one set;

S1023, in the second and subsequent iterations, for every two different SNP subsets, selecting one SNP from each subset to form a two-locus SNP combination, all possible SNP combinations between the two subsets forming one set; and adding the possibly disease-related SNP combinations output by the previous iteration to one SNP combination set that has not yet been evaluated, as the input of one of the classifiers in the current iteration.
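Steps S1021 to S1023 can be illustrated with a short sketch of how within-subset and between-subset pairs together cover every two-locus combination. The function names and the round-robin way of splitting are assumptions chosen for illustration; the text only requires that the split be even.

```python
from itertools import combinations, product

def split_evenly(snp_ids, n_subsets):
    """Split SNP ids into n_subsets groups whose sizes differ by at most one."""
    return [snp_ids[k::n_subsets] for k in range(n_subsets)]

def within_subset_pairs(subset):
    """First iteration: every pair of distinct SNPs inside one subset."""
    return [tuple(sorted(pair)) for pair in combinations(subset, 2)]

def between_subset_pairs(subset_a, subset_b):
    """Later iterations: one SNP from each of two different subsets."""
    return [tuple(sorted(pair)) for pair in product(subset_a, subset_b)]

def all_candidate_sets(snp_ids, n_subsets):
    """Yield each candidate set of two-locus combinations in turn."""
    subsets = split_evenly(snp_ids, n_subsets)
    for subset in subsets:
        yield within_subset_pairs(subset)
    for a, b in combinations(subsets, 2):
        yield between_subset_pairs(a, b)
```

Because each candidate set is produced lazily, only one set needs to be held in memory at a time, which is the storage saving the text attributes to the subset division.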

Further, multiple classifiers are used to evaluate multiple SNP combination sets in parallel, which specifically includes:

Let [symbol not reproduced] denote the SNP combinations input to the i-th classifier in the t-th iteration, i ∈ {1, …, |D|}. |D| = 5 classifiers are used for detection: Gini index, K2-score, entropy, information gain, and APDS (Absolute Probability Different Score). Specifically:

S1031, the Gini index (GI) calculates the score of each SNP combination as GI = Gini(parent) - Gini(split), with [formula not reproduced], where N is the total number of samples, N_case(m) is the number of diseased individuals with genotype m, N_control(m) is the number of non-diseased individuals with genotype m, and N_total(m) is the number of samples with genotype m. A higher score indicates a stronger association between the SNP combination and the disease;
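As an illustration of the Gini score, the sketch below assumes the standard Gini impurity, 1 - p_case^2 - p_control^2, for both Gini(parent) and the per-genotype terms of Gini(split), with each genotype weighted by N_total(m)/N. The exact definitions are not reproduced in the text, so this is an assumption, as are the function names and the input layout.

```python
from collections import Counter

def gini_impurity(n_case, n_control):
    """Standard Gini impurity of a group with the given class counts."""
    total = n_case + n_control
    if total == 0:
        return 0.0
    p_case = n_case / total
    p_control = n_control / total
    return 1.0 - p_case ** 2 - p_control ** 2

def gini_score(genotypes, labels):
    """GI = Gini(parent) - Gini(split) for one SNP combination.

    genotypes: one genotype m per sample, e.g. a tuple (g_i, g_j).
    labels:    1 for diseased samples, 0 for non-diseased samples.
    """
    n = len(labels)
    n_total = Counter(genotypes)                                   # N_total(m)
    n_case = Counter(g for g, y in zip(genotypes, labels) if y == 1)  # N_case(m)
    cases = sum(labels)
    parent = gini_impurity(cases, n - cases)
    split = sum(
        (n_total[m] / n) * gini_impurity(n_case[m], n_total[m] - n_case[m])
        for m in n_total
    )
    return parent - split
```

Under these assumptions the score is largest when each genotype contains only diseased or only non-diseased samples, and zero when the genotype carries no information about disease status.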

S1032, the K2-score calculates the score of each SNP combination as [formula not reproduced]. To simplify the calculation, its logarithmic form is used: [formula not reproduced]. A higher score indicates a stronger association between the SNP combination and the disease;

S1033, the entropy score (ES) calculates the score of each SNP combination as [formula not reproduced], where [formula not reproduced]. A higher score indicates a stronger association between the SNP combination and the disease;

S1034, the information gain (IG) calculates the score of each SNP combination as IG = [H(S_i|Y) + H(S_j|Y) - H(S_i,S_j|Y)] - [H(S_i) + H(S_j) - H(S_i,S_j)], with [formula not reproduced]. A higher score indicates a stronger association between the SNP combination and the disease;
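The information gain expression in S1034 can be evaluated directly from empirical frequencies. The sketch below assumes that H is the Shannon entropy in bits and that conditional entropy is computed as H(X|Y) = H(X,Y) - H(Y); the definition of H is not reproduced in the text, so both choices, along with the function names, are assumptions.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(values, given):
    """H(values | given) = H(values, given) - H(given)."""
    return entropy(list(zip(values, given))) - entropy(given)

def information_gain(s_i, s_j, y):
    """IG = [H(Si|Y) + H(Sj|Y) - H(Si,Sj|Y)] - [H(Si) + H(Sj) - H(Si,Sj)]."""
    pair = list(zip(s_i, s_j))
    conditional = (conditional_entropy(s_i, y)
                   + conditional_entropy(s_j, y)
                   - conditional_entropy(pair, y))
    marginal = entropy(s_i) + entropy(s_j) - entropy(pair)
    return conditional - marginal
```

The first bracket is the dependence between the two SNPs given disease status and the second is their dependence overall, so the score is positive when the two loci become more dependent once disease status is known, as in a pure interaction (XOR-like) model.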

S1035, APDS calculates the score of each SNP combination as [formula not reproduced]. A higher score indicates a stronger association between the SNP combination and the disease;

S1036, after each classifier has scored all the SNP combinations input to it, the SNP combinations are sorted in descending order of score. The SNP combinations whose rank is smaller than [formula not reproduced] are considered possibly related to the disease, and their vote counts under the current classifier are updated as [formula not reproduced], where b_u is a preset parameter, o is the rank after sorting, and [symbol not reproduced] is the number of SNP combinations input to that classifier in the current iteration. Because the classifiers are independent of one another and their input combinations do not affect each other, this process can be executed in parallel;

S1037, the SNP combinations that each classifier considers possibly related to the disease are exchanged to all other classifiers. The set of SNP combinations to be evaluated after the exchange is [formula not reproduced], where j ∈ D \ {i} and j_1 ∪ j_2 ∪ … ∪ j_{|D|-1} ∪ i = D. The new combination set is scored and voted on in the same way as in the previous step;

S1038, the votes that each SNP combination received from the classifiers are counted, and the combinations whose total number of votes is greater than V_S form the output of the current iteration. If there is still a SNP combination set that has not been evaluated by any classifier, this output is input to the combination generation module and the next iteration begins; otherwise, it is output to the result verification module;
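The ranking, voting and screening logic of S1036 to S1038 can be sketched in simplified form. The vote-update formula involving b_u and the rank o is not reproduced in the text, so this sketch substitutes a plain rule in which each classifier casts one vote for each of its top-ranked combinations. It also skips the exchange step by assuming every classifier has already scored the same candidates. The function names, the `keep` parameter and the one-vote rule are all assumptions for illustration only.

```python
from collections import Counter

def top_ranked(scores, keep):
    """Return the `keep` highest-scoring combinations of one classifier.

    scores: dict mapping a SNP combination to that classifier's score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

def screen_by_votes(per_classifier_scores, keep, vote_threshold):
    """Keep combinations whose total votes exceed vote_threshold (V_S).

    Assumed voting rule: each classifier gives one vote to each of its
    `keep` best combinations. The weighted rule of the text is not used.
    """
    votes = Counter()
    for scores in per_classifier_scores:
        for combo in top_ranked(scores, keep):
            votes[combo] += 1
    return sorted(combo for combo, v in votes.items() if v > vote_threshold)
```

Since each classifier's ranking depends only on its own scores, the loop over `per_classifier_scores` is the part that could be distributed across processes or devices.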

Further, the chi-square test is used to verify the evaluation results of the multiple classifiers and to select the SNP combinations that are significantly associated with the disease, which specifically includes:

S1041, using the chi-square test to calculate a p-value for every SNP combination that the multiple classifiers consider disease-related;

S1042, sorting these SNP combinations in ascending order of p-value;

S1043, finding an inflection point in the p-values, and taking the SNP combinations before the inflection point as significantly associated with the disease; these form the final output of the algorithm.
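The text does not say how the inflection point in S1043 is located. One simple possibility, shown here purely as an assumption, is to sort the combinations by p-value and cut at the largest drop in -log10(p) between neighbours. The function name `select_before_inflection` and the floor of 1e-300 used to avoid taking the logarithm of zero are also illustrative choices.

```python
from math import log10

def select_before_inflection(p_values):
    """Return the combinations ranked before the largest gap in -log10(p).

    p_values: dict mapping a SNP combination to its chi-square p-value.
    """
    ranked = sorted(p_values, key=p_values.get)   # ascending p-value
    if len(ranked) < 2:
        return ranked
    # Work on a log scale so gaps between tiny p-values remain visible.
    strength = [-log10(max(p_values[c], 1e-300)) for c in ranked]
    gaps = [strength[k] - strength[k + 1] for k in range(len(strength) - 1)]
    cut = gaps.index(max(gaps)) + 1               # first index after the gap
    return ranked[:cut]
```

For example, with p-values 1e-12, 1e-11, 1e-3 and 1e-2 the largest gap lies between the second and third entries, so the first two combinations are returned. No fixed significance threshold is needed, which matches the stated goal of avoiding a hard cutoff.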

The present disclosure divides the genome-wide SNP set into multiple SNP subsets and generates multiple high-dimensional SNP combination sets from these subsets. Several SNP combination sets are then input to the multiple classifiers, which score every SNP combination in the sets in parallel and vote according to the scores. The high-scoring SNP combinations of each classifier are exchanged to all other classifiers and scored and voted on again. The SNP combinations with a high total number of votes across the classifiers are considered possibly related to the disease and are either passed to the next iteration or, after the last iteration, to the next step, where the chi-square test verifies how strongly these SNP combinations are statistically associated with the disease. Finally, the SNP combinations significantly associated with the disease are output as the final result of the algorithm. By dividing the SNP set into multiple subsets and iterating until all SNP combinations have been evaluated, the present disclosure avoids missing truly relevant combinations. In each iteration the classifiers can run in parallel, which reduces the computational burden and makes an exhaustive search over all possible combinations in the whole genome feasible. In addition, the algorithm uses multiple classifiers to decide jointly whether a SNP combination is related to the disease, which lessens the impact of any single classifier's preference for particular data models on the result. The final chi-square verification makes the result more credible, and because no hard threshold is used to decide whether a combination is disease-related, the influence of parameter settings on the result is reduced.

Embodiment 2:

The purpose of this embodiment is to provide an electronic device.

An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program:

Further, the steps executed by the electronic device of this embodiment are consistent with the scheme executed by the system of Embodiment 1. The technical details have been described in Embodiment 1 and are not repeated here.

Embodiment 3:

The purpose of this embodiment is to provide a non-transitory computer-readable storage medium.

A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the following steps:

Further, the steps executed from the non-transitory computer-readable storage medium of this embodiment are consistent with the scheme executed by the system of Embodiment 1. The technical details have been described in Embodiment 1 and are not repeated here.

Embodiment 4:

The purpose of this embodiment is to provide a computer program product.

A computer program product, comprising a computer program that, when run on one or more processors, performs the following steps:

Further, the steps performed by the computer program product of this embodiment are consistent with the scheme executed by the system of Embodiment 1. The technical details have been described in Embodiment 1 and are not repeated here.

The SNP interaction detection system based on ensemble learning provided by the above embodiments can be implemented and has broad application prospects.

The above are only preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the present disclosure. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims

1. a SNP interaction detection system based on ensemble learning, is characterized in that, comprises:

a data acquisition module, which is configured to: acquire the SNP sequence information of the diseased samples and the non-diseased samples, and perform preprocessing to realize the construction of a genome-wide SNP set;

A SNP subset division and combination generation module is configured to: divide a genome-wide SNP set into a plurality of SNP subsets, and construct a SNP combination based on the SNP subsets;

A multi-classifier parallel evaluation module, which is configured to: use multiple classifiers to evaluate the correlation between the SNP combination and the disease in parallel;

A result validation module configured to: use a chi-square test to validate the statistical significance of combinations of disease-related SNPs evaluated using several classifiers.

2. The SNP interaction detection system based on ensemble learning according to claim 1, characterized in that dividing the genome-wide SNP set into a plurality of SNP subsets and constructing high-dimensional SNP combinations based on the SNP subsets is performed, taking two-site SNP combinations as an example, as follows:

dividing the genome-wide SNP set evenly into a plurality of SNP subsets;

in the first iteration, for each SNP subset, selecting two different SNPs to form a two-site SNP combination, all possible SNP combinations within the subset forming a set;

in the second and subsequent iterations, for every two different SNP subsets, selecting one SNP from each subset to form a two-site SNP combination, all possible SNP combinations between the two subsets forming a set; and adding the potentially disease-related SNP combinations output by the previous iteration to the set of SNP combinations not yet examined, as the classifier input for the current iteration.
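
As a non-limiting illustration, the subset division and two-site combination generation of claim 2 can be sketched in Python as follows. The function names, the contiguous way of splitting, and the toy SNP identifiers are assumptions made for this sketch only; the claim requires an even division but does not fix how it is carried out.

```python
from itertools import combinations, product


def split_evenly(snp_ids, n_subsets):
    """Split SNP identifiers into n_subsets contiguous, near-equal parts (assumed scheme)."""
    base, extra = divmod(len(snp_ids), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        size = base + (1 if i < extra else 0)
        subsets.append(snp_ids[start:start + size])
        start += size
    return subsets


def within_subset_pairs(subset):
    """First iteration: every pair of two different SNPs inside one subset."""
    return list(combinations(subset, 2))


def between_subset_pairs(subset_a, subset_b):
    """Later iterations: one SNP taken from each of two different subsets."""
    return list(product(subset_a, subset_b))


def iteration_input(untested, carried_over):
    """Merge candidates kept from the previous iteration with untested pairs."""
    return list(dict.fromkeys(list(carried_over) + list(untested)))


# Toy data: six hypothetical SNP identifiers split into three subsets.
snps = ["rs1", "rs2", "rs3", "rs4", "rs5", "rs6"]
subsets = split_evenly(snps, 3)
first_round = [within_subset_pairs(s) for s in subsets]
later_rounds = [between_subset_pairs(a, b) for a, b in combinations(subsets, 2)]
```

With this scheme the within-subset and between-subset sets together cover every two-site combination exactly once, so no pair is skipped or evaluated twice across iterations.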

3. The SNP interaction detection system based on ensemble learning according to claim 1, characterized in that the multi-classifier parallel evaluation module comprises a scoring voting module, an exchange voting module and a screening module, wherein:

the scoring voting module is configured to: use each classifier to score the input SNP combinations, and vote according to the scores;

the exchange voting module is configured to: pass the SNP combinations that each classifier considers potentially disease-related to all other classifiers, and repeat the scoring vote;

the screening module is configured to: tally the votes of the classifiers, select the SNP combinations whose total number of votes is greater than a preset threshold, and input them to the result verification module.
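
As a non-limiting illustration, one round of scoring vote, exchange vote and screening as recited in claim 3 might look like the following Python sketch. Several points are assumptions of the sketch and are not stated in the claim: each classifier is modelled as a function returning a score, a classifier votes when its score reaches a per-classifier cutoff, each classifier starts from its own disjoint batch of combinations, and the scores come from hypothetical lookup tables.

```python
from collections import Counter


def vote_round(scorers, cutoffs, batches, vote_threshold):
    """Run one scoring vote, one exchange vote, and the screening step.

    scorers[i](combo) returns a score; classifier i votes when the score
    is >= cutoffs[i] (assumed voting rule). batches[i] holds the
    combinations first handed to classifier i (assumed to be disjoint).
    Returns the combinations whose total votes exceed vote_threshold.
    """
    votes = Counter()
    nominated = []  # (index of the nominating classifier, combination)

    # Scoring vote: each classifier votes on its own batch.
    for i, batch in enumerate(batches):
        for combo in batch:
            if scorers[i](combo) >= cutoffs[i]:
                votes[combo] += 1
                nominated.append((i, combo))

    # Exchange vote: nominated combinations are re-scored by all other classifiers.
    for i, combo in nominated:
        for j, scorer in enumerate(scorers):
            if j != i and scorer(combo) >= cutoffs[j]:
                votes[combo] += 1

    # Screening: keep combinations with more votes than the preset threshold.
    return sorted(c for c, n in votes.items() if n > vote_threshold)


# Hypothetical score tables standing in for three real classifiers.
table_a = {("rs1", "rs2"): 0.9, ("rs3", "rs4"): 0.2, ("rs5", "rs6"): 0.8}
table_b = {("rs1", "rs2"): 0.7, ("rs3", "rs4"): 0.1, ("rs5", "rs6"): 0.3}
table_c = {("rs1", "rs2"): 0.6, ("rs3", "rs4"): 0.9, ("rs5", "rs6"): 0.4}
scorers = [table_a.get, table_b.get, table_c.get]
batches = [[("rs1", "rs2")], [("rs3", "rs4")], [("rs5", "rs6")]]
selected = vote_round(scorers, [0.5, 0.5, 0.5], batches, vote_threshold=1)
```

In this toy run only the combination nominated by the first classifier is exchanged, and it is kept because the other two classifiers also vote for it.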

4. The SNP interaction detection system based on ensemble learning according to claim 1, characterized in that using the chi-square test to verify the statistical significance of the SNP combinations that the classifiers evaluated as disease-related specifically comprises:

calculating, using the chi-square test, p-values for all SNP combinations considered disease-related by the multiple classifiers;

sorting these SNP combinations in ascending order of p-value;

finding the inflection point of the p-values and outputting the SNP combinations before the inflection point as the final detection result.
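
As a non-limiting illustration, the verification steps of claim 4 can be sketched in Python using only the standard library. The claim does not define how the inflection point is located; the sketch assumes it is the largest drop in -log10(p) between neighbouring entries of the sorted sequence, which is only one possible heuristic. The function names and the example values are likewise assumptions, and the contingency table is assumed to have no empty row.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, via closed-form series."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    total = 0.0
    for i in range(df // 2):
        total += y ** (i + 0.5) / math.gamma(i + 1.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


def chi2_p_value(table):
    """table[r][c] holds counts: rows are disease status, columns are genotype combinations."""
    cols = [col for col in zip(*table) if sum(col) > 0]  # drop empty columns
    row_sums = [sum(col[r] for col in cols) for r in range(len(table))]
    n = sum(row_sums)
    stat = 0.0
    for col in cols:
        col_sum = sum(col)
        for r, observed in enumerate(col):
            expected = row_sums[r] * col_sum / n
            stat += (observed - expected) ** 2 / expected
    df = (len(cols) - 1) * (len(table) - 1)
    return chi2_sf(stat, df)


def select_before_inflection(p_values):
    """Sort by ascending p-value and cut at the largest gap in -log10(p) (assumed rule)."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    if len(ranked) < 2:
        return [combo for combo, _ in ranked]
    logs = [-math.log10(max(p, 1e-300)) for _, p in ranked]
    gaps = [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [combo for combo, _ in ranked[:cut]]


# Hypothetical p-values for four candidate combinations.
p_values = {"A": 1e-12, "B": 3e-11, "C": 0.02, "D": 0.04}
final = select_before_inflection(p_values)  # keeps "A" and "B"
```

For a two-site combination the table passed to `chi2_p_value` would have up to nine genotype columns and two status rows, giving at most eight degrees of freedom.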

5. The SNP interaction detection system based on ensemble learning according to claim 1, characterized in that acquiring the SNP sequence information of the diseased samples and non-diseased samples and preprocessing it to construct the genome-wide SNP set specifically comprises:

marking diseased samples as 1 and non-diseased samples as 0; and, for the mutation status of each sample at each SNP site, marking the site as 0 if neither allele is mutated, as 1 if one of the alleles is mutated, as 2 if both alleles are mutated, and as 3 if the data for the site is missing.
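
As a non-limiting illustration, the encoding of claim 5 can be sketched in Python as follows. The input format is an assumption of this sketch: a genotype is taken to be a two-character string of nucleotide letters, an allele counts as mutated when it differs from a given reference allele, and anything else is treated as missing.

```python
MISSING = 3  # code used when the genotype at a site is missing


def encode_genotype(genotype, ref_allele):
    """Return 0, 1 or 2 mutated alleles, or 3 when the call is missing."""
    if genotype is None or len(genotype) != 2:
        return MISSING
    if any(allele not in "ACGT" for allele in genotype):
        return MISSING
    return sum(1 for allele in genotype if allele != ref_allele)


def encode_status(is_diseased):
    """Diseased samples are marked 1, non-diseased samples 0."""
    return 1 if is_diseased else 0


# Hypothetical calls for one sample at five sites, all with reference allele "A".
calls = ["AA", "AG", "GG", "N0", None]
codes = [encode_genotype(g, "A") for g in calls]
```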

6. The SNP interaction detection system based on ensemble learning according to claim 5, wherein the preprocessing further comprises: deleting samples in which more than 5% of the SNPs are missing; deleting SNPs that are missing in more than 5% of the samples; calculating a p-value for each SNP using the chi-square test and deleting SNPs with p-values > 0.0001; and deleting SNPs with a minor allele frequency less than 0.1.
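
As a non-limiting illustration, the filters of claim 6 can be sketched in Python on a matrix of genotype codes (0, 1, 2, with 3 for missing). The order in which the filters are applied, the function names, and the toy data are assumptions of this sketch; it also assumes that both diseased and non-diseased samples remain after sample filtering. The thresholds are those recited in the claim.

```python
import math

MISSING = 3


def missing_rate(values):
    """Fraction of entries carrying the missing code."""
    if not values:
        return 1.0
    return sum(v == MISSING for v in values) / len(values)


def minor_allele_frequency(column):
    called = [g for g in column if g != MISSING]
    freq = sum(called) / (2 * len(called))  # frequency of the mutated allele
    return min(freq, 1 - freq)


def single_snp_p_value(column, labels):
    """Chi-square test of genotype code against disease status for one SNP."""
    counts = {}
    for g, y in zip(column, labels):
        if g != MISSING:
            counts.setdefault(g, [0, 0])[y] += 1
    row = [sum(c[y] for c in counts.values()) for y in (0, 1)]
    n = sum(row)
    stat = sum((c[y] - row[y] * sum(c) / n) ** 2 / (row[y] * sum(c) / n)
               for c in counts.values() for y in (0, 1))
    df = len(counts) - 1  # two status rows, so df = genotype columns - 1
    if df == 0:
        return 1.0
    return math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))


def quality_control(genotypes, labels, max_missing=0.05, max_p=1e-4, min_maf=0.1):
    """genotypes[s][j] is the code of sample s at SNP j; returns kept indices."""
    samples = [s for s, row in enumerate(genotypes) if missing_rate(row) <= max_missing]
    kept_labels = [labels[s] for s in samples]
    snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[s][j] for s in samples]
        if missing_rate(column) > max_missing:
            continue  # SNP missing in too many samples
        if minor_allele_frequency(column) < min_maf:
            continue  # minor allele too rare
        if single_snp_p_value(column, kept_labels) > max_p:
            continue  # p-value above the claimed cutoff
        snps.append(j)
    return samples, snps


# Toy data: 20 diseased and 20 non-diseased samples, plus one all-missing sample.
labels = [1] * 20 + [0] * 20
genotypes = [[y, 0, s % 2] for s, y in enumerate(labels)]
genotypes.append([MISSING, MISSING, MISSING])
labels.append(1)
kept_samples, kept_snps = quality_control(genotypes, labels)
```

In the toy data the first SNP tracks disease status and is kept, the second never varies and fails the minor allele frequency filter, and the third is unrelated to status and fails the p-value filter.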

7. The SNP interaction detection system based on ensemble learning according to claim 1, wherein the multiple classifiers comprise classifiers based on gini index, k2-score, entropy, information gain and APDS.
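
As a non-limiting illustration, four of the measures named in claim 7 can be computed for a two-site SNP combination as sketched below in Python. The formulas are the standard textbook definitions of Gini impurity, Shannon entropy and information gain, and the logarithmic form of the K2 score commonly used in the epistasis-detection literature; whether the claimed classifiers use exactly these forms is an assumption. APDS is not sketched because the claims do not define it. Genotype codes are assumed to contain no missing values.

```python
import math


def class_counts_by_genotype(geno_a, geno_b, labels):
    """Map each observed genotype pair to [n_non_diseased, n_diseased]."""
    cells = {}
    for a, b, y in zip(geno_a, geno_b, labels):
        cells.setdefault((a, b), [0, 0])[y] += 1
    return cells


def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_impurity(cells, impurity):
    """Sample-weighted impurity of disease status within each genotype cell."""
    n = sum(sum(c) for c in cells.values())
    return sum(sum(c) / n * impurity(c) for c in cells.values())


def information_gain(cells):
    """Entropy of disease status minus its entropy given the genotype pair."""
    totals = [sum(c[y] for c in cells.values()) for y in (0, 1)]
    return entropy(totals) - weighted_impurity(cells, entropy)


def k2_score(cells):
    """Log-form K2 score: sum over cells of log((r_i + 1)! / prod_j r_ij!)."""
    return sum(math.lgamma(sum(c) + 2) - sum(math.lgamma(v + 1) for v in c)
               for c in cells.values())


# Toy XOR-like interaction: status depends on the pair, not on either SNP alone.
geno_a = [0, 0, 1, 1]
geno_b = [0, 1, 0, 1]
labels = [0, 1, 1, 0]
cells = class_counts_by_genotype(geno_a, geno_b, labels)
scores = {
    "gini": weighted_impurity(cells, gini),
    "entropy": weighted_impurity(cells, entropy),
    "information_gain": information_gain(cells),
    "k2": k2_score(cells),
}
```

Lower Gini, entropy and K2 values and higher information gain indicate a stronger association, so a voting rule built on these scores needs a direction-aware cutoff for each measure.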

8. An electronic device, comprising a memory, a processor and a computer program stored on the memory, wherein the processor implements the following steps when executing the program:

obtaining SNP data of diseased individuals and non-diseased individuals, and preprocessing it to construct a genome-wide SNP set;

dividing the genome-wide SNP set into multiple SNP subsets, and constructing SNP combinations based on the SNP subsets;

inputting some of the SNP combinations into the classifiers for voting; at the same time, passing the SNP combinations that each classifier finds strongly associated with the disease to all other classifiers, and voting on them again; selecting the SNP combinations whose total number of votes is higher than a preset threshold, resetting their vote counts to zero, mixing them with the SNP combinations that have not yet been evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;

applying the chi-square test to the SNP combinations selected in the last iteration whose total number of votes is higher than the preset threshold, finding the inflection point in the sequence of p-values, and outputting the SNP combinations before the inflection point.

9. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the following steps are implemented:

inputting some of the SNP combinations into the classifiers for voting; at the same time, passing the SNP combinations that each classifier finds strongly associated with the disease to all other classifiers, and voting on them again; selecting the SNP combinations whose total number of votes is higher than a preset threshold, resetting their vote counts to zero, mixing them with the SNP combinations that have not yet been evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;

10. A computer program product, comprising a computer program, characterized in that the computer program, when run on one or more processors, performs the following steps: