CN107301323B

CN107301323B - A method for constructing a classification model related to psoriasis

Info

Publication number: CN107301323B
Application number: CN201710692864.8A
Authority: CN
Inventors: 孙良丹; 张涛; 甄琪; 王文俊; 钱文君; 莫晓东; 吴静; 郑晓冬; 李报
Original assignee: BGI Shenzhen Co Ltd; First Affiliated Hospital of Anhui Medical University
Current assignee: BGI Shenzhen Co Ltd; First Affiliated Hospital of Anhui Medical University
Priority date: 2017-08-14
Filing date: 2017-08-14
Publication date: 2020-11-03
Anticipated expiration: 2037-08-14
Also published as: CN107301323A

Abstract

The invention relates to the technical field of medical detection, and in particular to a method for constructing a classification model related to psoriasis, comprising the following steps: (1) selecting psoriasis susceptibility loci; (2) converting the susceptibility loci into input data according to their type; and (3) classifying the data with an Adaboost-SVM model. At present there is no established technique for classifying and predicting psoriasis data; existing practice goes no further than inferring disease status from the presence or absence of individual loci. The invention classifies with SVM, an effective machine learning classifier, and integrates the SVMs through the Adaboost framework to improve classifier accuracy. The model can integrate SNP, amino acid, and HLA type data for classification, taking the information in each dimension into account and improving the accuracy of the classification result.

Description

A method for constructing a classification model related to psoriasis

Technical field

The invention relates to the technical field of medical detection, and in particular to a method for constructing a classification model related to psoriasis.

Background

Psoriasis is a common complex disease. Its occurrence has been reported to be associated with genetic factors, particularly the human leukocyte antigen (HLA) region, but the truly associated loci have not been known.

With the development of sequencing technology and the deepening of genome research, a study published last year in Nature Genetics reported high-depth sequencing and precise variant detection of the MHC region in Chinese individuals, and its genomic association analysis located several psoriasis susceptibility loci. However, classification and prediction models based on susceptibility loci in the HLA region are still lacking. There is therefore an urgent need for classification and prediction tools that use HLA-region susceptibility loci to classify and predict data.

Psoriasis is most significantly associated with HLA, yet current techniques make no targeted use of the HLA region. Precise variant detection in the HLA region has recently achieved a breakthrough, accurately locating the psoriasis-associated susceptibility loci on HLA. The present invention encodes these susceptibility loci and then classifies them with the machine learning model Adaboost, which makes it possible to integrate the susceptibility locus information found in the HLA region. Analyzing the data comprehensively with a machine learning model improves classification accuracy and provides a basis for preventive screening of psoriasis.

Summary of the invention

The purpose of the present invention is to remedy the above deficiencies of the prior art. Biomarkers associated with psoriasis are found on the basis of full coverage of the MHC region, and a psoriasis classification model is built with SVM-Adaboost from the independently associated susceptibility loci in the HLA region. The invention thus provides a method for constructing a classification model related to psoriasis and a basis for preventive screening of psoriasis.

The present invention is achieved through the following technical solution:

1 Data processing and transformation

The variants of each sample are encoded. Variant information is obtained from high-throughput sequencing data and includes HLA types (C*06:02, C*07:04, DPB1*05:01) as well as single nucleotide polymorphism (SNP) sites and amino acids (snp31443520, B:Y33Y, B:Y91C, B:Y140S, snp32472030).

Each sample is then converted, according to the susceptibility loci, into the input data required by the invention. HLA types are scored by edit distance; SNPs and amino acids are scored 0/1. Specifically: ① for each susceptible HLA type, the edit distance between the individual's type and the susceptible type is calculated and scored; ② for each SNP site, the mutation is recorded as 1 if present and 0 if absent; ③ for each amino acid mutation, the mutation is recorded as 1 if present and 0 if absent.
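The encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`edit_distance`, `encode_sample`), the input format (one called type per gene plus a set of observed site names), and the feature order are hypothetical, and the raw Levenshtein distance is used as the HLA score because the scoring matrix that the patent applies to the edit distance is not reproduced in this text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Susceptibility loci named in the text.
SUSCEPTIBLE_HLA_TYPES = ["C*06:02", "C*07:04", "DPB1*05:01"]
BINARY_SITES = ["snp31443520", "B:Y33Y", "B:Y91C", "B:Y140S", "snp32472030"]


def encode_sample(hla_calls, present_sites):
    """Turn one sample into a feature vector.

    hla_calls: dict mapping gene name to the called type (assumed format).
    present_sites: set of SNP / amino acid site names observed in the sample.
    """
    features = []
    for target in SUSCEPTIBLE_HLA_TYPES:
        gene = target.split("*")[0]
        called = hla_calls.get(gene, "")  # missing call: distance is len(target)
        features.append(edit_distance(called, target))
    # SNP and amino acid sites: 1 if present, 0 if absent.
    features.extend(1 if site in present_sites else 0 for site in BINARY_SITES)
    return features
```

For example, `encode_sample({"C": "C*06:02", "DPB1": "DPB1*05:01"}, {"B:Y91C"})` yields three HLA distances followed by five 0/1 flags.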

After scoring, the data are randomly split into a training set and a test set, which must not overlap. When the number of samples is small, 5-fold (or 10-fold) cross-validation can be used: the data are divided into 5 (or 10) parts, and each part in turn serves as the test set while the remaining parts form the training set.
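A minimal sketch of both splitting strategies, using only the Python standard library. The function names and the fixed random seed are illustrative assumptions; the patent does not specify how the random split is implemented.

```python
import random


def holdout_split(n_samples, n_train, seed=0):
    """Randomly split sample indices into disjoint training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Because every index is placed in exactly one fold, the training and test sets of each split never overlap.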

2 Classifying the data with the Adaboost-SVM model

The invention uses the Adaboost method to integrate support vector machine (SVM) classifiers, making use of all the susceptibility locus information and improving the accuracy of data classification.

2.1 Construction of the classification model

2.1.1 Sub-classifier: SVM

The support vector machine (SVM) is a classic machine learning classification method and belongs to supervised learning. The invention first uses a Gaussian kernel function (Equation 1) to project the data into a high-dimensional space.

Here x is any point in the space, y is the selected center of the space, σ is the width parameter, and K(x, y) is the spatial distance from x to y.
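The formula for Equation 1 is not reproduced in this text. For reference, the Gaussian (radial basis function) kernel is commonly written as below, using the symbols just defined. The factor of 2 in the denominator is a convention assumed here; some implementations omit it.

```latex
K(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}} \right)
```

In this form K(x, y) equals 1 when x = y and decays toward 0 as the distance between x and y grows, with σ controlling how quickly it decays.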

The SVM model then constructs a separating plane in the high-dimensional space. The separating plane is determined mainly by the few points closest to it (point A in Figure 1 is one of these closest points). The line from each of these nearest points to the separating plane is called a support vector, and the plane for which the support vectors are maximized is taken as the separating plane; that is, the separating plane separates the data by the largest possible margin. The invention uses a Python 2 based SVM model (see https://www.manning.com/books/machine-learning-in-action).
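To make the role of the kernel and the support vectors concrete, the sketch below evaluates the decision function of an already trained kernel SVM. It is not the patent's implementation: the function names, the `2 * sigma ** 2` convention in the kernel, and the assumption that training has already produced the support vectors, their labels, their Lagrange multipliers, and a bias term are all illustrative.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, sv_labels, multipliers, bias, sigma):
    """Classify x as +1 or -1 from the support vectors of a trained SVM.

    multipliers are the per-support-vector Lagrange multipliers found by
    training (not the Adaboost classifier weights).
    """
    score = bias
    for sv, label, m in zip(support_vectors, sv_labels, multipliers):
        score += m * label * gaussian_kernel(sv, x, sigma)
    return 1 if score >= 0 else -1
```

Only the support vectors enter the sum, which is why the points closest to the separating plane determine the classifier.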

2.1.2 Classifier ensemble algorithm: Adaboost

Adaboost is an ensemble method that improves classifier performance on the basis of errors: the classifier is trained on the samples multiple times, repeatedly corrected according to its error rate, and the results are finally combined into the ensemble output. Specifically, all samples are first assigned equal weights. An SVM is then trained on the training data set and the error rate of this classifier (ε, Equation 2) is calculated.

Error rate ε = number of misclassified samples / total number of samples (Equation 2)

The σ of the Gaussian kernel function is then adjusted, and an SVM is trained again on the same data set. In this second training of the classifier, the weight of each sample is readjusted (the weights here form a multi-dimensional vector): the weight of a correctly classified sample is decreased for the next round, and the weight of a misclassified sample is increased. As a result, classifiers that classify correctly end up carrying more weight in the final result than those that classify incorrectly. Concretely, the weight α of each classifier is calculated from its error rate.

Once α has been calculated, the sample weights can be updated.

For a correctly classified sample:

For a misclassified sample:

Here α is the weight of the base classifier in the final classifier and ε is the error rate of the classifier; (t) denotes the iteration, with t the current round and t+1 the next round; D_i is the weight of the i-th training sample.
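The formulas for α and for the weight update are not reproduced in this text. In the standard Adaboost formulation, which is assumed here to be the one intended since it matches the symbols defined above, they read:

```latex
\alpha = \frac{1}{2} \ln\!\left( \frac{1 - \varepsilon}{\varepsilon} \right)

D_i^{(t+1)} = \frac{D_i^{(t)} \, e^{-\alpha}}{\sum_j D_j}
\quad \text{(sample } i \text{ classified correctly)}

D_i^{(t+1)} = \frac{D_i^{(t)} \, e^{\alpha}}{\sum_j D_j}
\quad \text{(sample } i \text{ misclassified)}
```

The denominator normalizes the updated weights so that they sum to 1. When ε < 0.5, α is positive, so the weight of a correctly classified sample shrinks and the weight of a misclassified sample grows.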

After the weights D have been calculated, the next iteration begins. Training and weight adjustment are repeated until the training error rate reaches 0 or the number of weak classifiers reaches a specified value. The invention uses a Python 2 based Adaboost ensemble framework (see https://www.manning.com/books/machine-learning-in-action).
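One boosting round and the final weighted vote can be sketched as follows. This is a minimal illustration, not the patent's code: the function names are hypothetical, labels are assumed to be +1/-1, and the error rate is computed as the weighted share of misclassified samples, which reduces to Equation 2 when all weights are equal.

```python
import math


def adaboost_round(weights, predictions, labels):
    """Run one boosting update; return (alpha, updated sample weights).

    weights must sum to 1; predictions and labels are +1 / -1.
    """
    # Weighted error rate of the current weak classifier.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # Clamp to avoid log(0) or division by zero for perfect/fully wrong cases.
    error = min(max(error, 1e-16), 1.0 - 1e-16)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink weights of correct samples, grow weights of misclassified ones.
    updated = [w * math.exp(-alpha if p == y else alpha)
               for w, p, y in zip(weights, predictions, labels)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def ensemble_predict(alphas, votes):
    """Combine weak classifier votes (+1 / -1) for one sample, weighted by alpha."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1
```

In a full training loop, each round would train a new SVM on the reweighted samples, call `adaboost_round`, and stop once the training error reaches 0 or the chosen number of weak classifiers has been built.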

3 Classifying and evaluating the data

Once the training set and test set inputs have been constructed, they are fed into the Adaboost-SVM model for classification. The output of the classification model is compared with the actual disease status, and the result is evaluated by calculating the accuracy and plotting the ROC curve.

The ROC curve is a method for selecting the best signal model. The area under the ROC curve (AUC) is usually calculated to judge the quality of a classification model; see Table 1 for details.

Table 1
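The two evaluation measures named above can be computed as in the following sketch. The function names are illustrative, and the AUC is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control, counting ties as one half), which equals the area under the ROC curve; this is one of several equivalent ways to obtain it and is not necessarily the one used in the patent.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def auc(scores, labels):
    """Area under the ROC curve; labels are 1 for cases, anything else for controls."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.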

The beneficial effects of the present invention are as follows:

At present there is no established technique for classifying and predicting psoriasis data; existing practice goes no further than inferring disease status from the presence or absence of individual loci. The invention classifies with SVM, an effective machine learning classifier, and integrates the SVMs through the Adaboost framework to improve classifier accuracy. The model can integrate SNP, amino acid, and HLA type data for classification, taking the information in each dimension into account and improving the accuracy of the classification result.

Brief description of the drawings

Figure 1 is a schematic diagram of constructing a separating plane with the SVM model in a high-dimensional space;

Figure 2 is the ROC curve of the training set classification result of the invention;

Figure 3 is the ROC curve of the test set classification result of the invention.

Detailed description

For a better understanding of the present invention, it is further described below with reference to an example and the accompanying drawings. The following example only illustrates the invention and does not limit it.

Example 1

A total of 5,168 psoriasis samples from individuals under 30 years of age were selected for study. An Adaboost-SVM model written in Python 2 was built on the susceptibility loci and used for classification.

1 Data processing and transformation

In this example, variant detection is first used to obtain the variant information of the samples as ped and map files. The HLA-region variant information is then extracted according to the susceptibility loci (Table 2). The HLA types (entries 1, 2, and 7) are scored by edit distance (see Table 3 for the scoring matrix), and the amino acid and SNP sites (entries 3, 4, 5, 6, and 8) are scored by presence or absence: 1 if present, 0 if absent.

Table 2 Susceptibility loci

Table 3 Edit distance scoring matrix

This yields the data list. Of the 5,168 samples, 2,000 are selected as the training set in this example and the remaining samples serve as the test set.

2 Applying the model

The processed data are fed into the Adaboost-SVM model constructed by the invention for calculation. In this example 9 SVM classifiers are used, with σ decreasing step by step from 30 to 3.
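The text gives only the endpoints of the σ sequence and the number of classifiers. One possible schedule, assuming equal steps between 30 and 3 (an assumption; the actual step sizes are not stated), is sketched below with a hypothetical helper name.

```python
def sigma_schedule(start=30.0, stop=3.0, n_classifiers=9):
    """Return n_classifiers sigma values decreasing linearly from start to stop."""
    step = (start - stop) / (n_classifiers - 1)
    return [start - i * step for i in range(n_classifiers)]
```

Each value would be used as the kernel width of one SVM in the ensemble, so later classifiers use narrower kernels than earlier ones.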

3 Results

As shown in Figures 2 and 3, the classification error rate in this example is 23.9%, the training set AUC (area under the ROC curve) is 0.833, and the test set AUC is 0.868, indicating that the invention achieves good results in this example.

The embodiment described above only describes a preferred embodiment of the present invention and does not limit its scope. Without departing from the design spirit of the invention, all variations and improvements made to its technical solution by those of ordinary skill in the art shall fall within the scope of protection defined by the claims of the invention.

Claims

1. A method for constructing a classification model related to psoriasis, characterized in that it comprises the following steps:

(1) selecting psoriasis susceptibility loci;

(2) converting the susceptibility loci into input data according to their type;

(3) classifying the data with an Adaboost-SVM model;

wherein the psoriasis susceptibility loci of step (1) include at least one of HLA types, SNP sites, and amino acids;

the HLA type susceptibility loci include at least one of C*06:02, C*07:04, and DPB1*05:01;

the SNP site and amino acid susceptibility loci include at least one of snp31443520, B:Y33Y, B:Y91C, B:Y140S, and snp32472030;

the conversion method of step (2) is: edit distance scoring for HLA types and 0/1 scoring for SNPs and amino acids, specifically: ① for each susceptible HLA type, calculating the edit distance between each individual's type and the susceptible type and scoring it; ② for each SNP site, recording 1 if the mutation is present and 0 if it is absent; ③ for each amino acid mutation, recording 1 if the mutation is present and 0 if it is absent;

the classification of step (3) comprises the following steps:

(31) using a Gaussian kernel function to project the data into a high-dimensional space, and then using the SVM model to construct a separating plane in the high-dimensional space;

(32) assigning equal weights to the samples, then training an SVM on the training data set and calculating the error rate of the classifier so as to train a weak classifier, and then combining the weak classifiers obtained from each round of training into a strong classifier;

(33) classifying and evaluating the data.

2. The method for constructing a classification model related to psoriasis according to claim 1, characterized in that the formula of the Gaussian kernel function in step (31) is:

where x is any point in the space, y is the selected center of the space, σ is the width parameter, and K(x, y) is the spatial distance from x to y.

3. The method for constructing a classification model related to psoriasis according to claim 1, wherein the evaluation method in step (33) is to calculate the area under the ROC curve.