CN107301323B - Method for constructing classification model related to psoriasis - Google Patents

Publication number
CN107301323B
Authority
CN
China
Prior art keywords
mutation
psoriasis
susceptible
data
svm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710692864.8A
Other languages
Chinese (zh)
Other versions
CN107301323A (en
Inventor
孙良丹
张涛
甄琪
王文俊
钱文君
莫晓东
吴静
郑晓冬
李报
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
First Affiliated Hospital of Anhui Medical University
Original Assignee
BGI Shenzhen Co Ltd
First Affiliated Hospital of Anhui Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd and First Affiliated Hospital of Anhui Medical University
Priority to CN201710692864.8A
Publication of CN107301323A
Application granted
Publication of CN107301323B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of medical detection, in particular to a method for constructing a classification model related to psoriasis, comprising the following steps: (1) selecting psoriasis susceptible sites; (2) converting the susceptible sites into input data according to the type of each site; and (3) classifying the data using an Adaboost-SVM model. At present there is no technique for classifying and predicting psoriasis data; disease status is inferred only from whether individual susceptible sites are present. The invention classifies with SVM, an effective machine-learning classifier, and integrates multiple SVMs within the adaboost framework to improve the accuracy of the classifier. The model can combine SNP, amino acid and HLA type data for classification, so it takes the information of every dimension into account and improves the accuracy of the classification result.

Description

Method for constructing classification model related to psoriasis
Technical Field
The invention relates to the technical field of medical detection, in particular to a method for constructing a classification model related to psoriasis.
Background
Psoriasis is a common complex disease. Its occurrence has been reported to be associated with genetic factors, particularly the human leukocyte antigen (HLA) region, but the truly associated sites have been unknown.
With the development of sequencing technology and the deepening of genome research, high-depth sequencing and accurate variation detection of the MHC (major histocompatibility complex) region in Chinese individuals were reported last year in Nature Genetics, and several psoriasis susceptibility sites were located by genome association analysis. However, classification and prediction models based on the susceptibility sites of the HLA region are currently lacking. There is therefore an urgent need for classification and prediction tools that use HLA region susceptibility sites to classify and predict data.
Psoriasis is most significantly associated with HLA, but current technology lacks applications that target the HLA region. The recent breakthrough in accurate variation detection of the HLA region has made it possible to locate precisely the psoriasis-related susceptible sites on HLA. The invention encodes each sample according to these susceptible sites and classifies the encoded data with the machine learning model Adaboost, so that the information from all susceptible sites found in the HLA region can be integrated. Analyzing the data comprehensively with a machine learning model improves classification accuracy and provides a basis for the preventive screening of psoriasis.
Disclosure of Invention
The invention aims to overcome the defects of the prior art. Based on complete coverage of the MHC (major histocompatibility complex) region, it finds biomarkers related to psoriasis, and it constructs a psoriasis classification model from the independently associated susceptible sites of the HLA region using SVM (support vector machine)-Adaboost. It thereby provides a method for constructing a classification model related to psoriasis and a basis for the preventive screening of psoriasis.
The invention is realized by the following technical scheme:
1 data processing and conversion
The variation of each sample is encoded. Variation information is obtained from high-throughput sequencing data and includes HLA types (C*06:02, C*07:04, DPB1*05:01) together with single nucleotide polymorphism (SNP) sites and amino acids (SNP31443520, B: Y33Y, B: Y91C, B: Y140S, SNP32472030).
Then, for each sample, the susceptible sites are converted into the input data required by the invention. HLA types are scored by edit distance, and SNPs and amino acids are scored 0/1. The specific method is as follows: ① for each susceptible HLA type, calculate the edit distance between the individual's type and the susceptible type and use it as the score; ② for each SNP site, record 1 if the mutation is present and 0 if it is absent; ③ for each amino acid mutation, record 1 if the mutation is present and 0 if it is absent.
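The encoding step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes plain Levenshtein distance on allele strings written in standard HLA nomenclature (for example C*06:02), whereas the patent scores HLA types with its own matrix (Table 3). The function names, the choice to take the minimum distance over an individual's alleles, and the feature order are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Susceptible sites named in the text; this ordering is an assumption.
HLA_TYPES = ["C*06:02", "C*07:04", "DPB1*05:01"]
BINARY_SITES = ["SNP31443520", "B: Y33Y", "B: Y91C", "B: Y140S", "SNP32472030"]


def encode_sample(alleles, mutations):
    """Return 3 edit-distance scores followed by 5 presence/absence flags."""
    # Assumption: score each susceptible type against the closest allele carried.
    hla_scores = [min(edit_distance(a, t) for a in alleles) for t in HLA_TYPES]
    flags = [1 if site in mutations else 0 for site in BINARY_SITES]
    return hla_scores + flags


features = encode_sample(["C*06:02", "C*01:02", "DPB1*05:01"],
                         {"SNP31443520", "B: Y91C"})
```

Each sample becomes an 8-element vector, which is the form the classifier consumes.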
After scoring, the data are randomly split into a training set and a test set, taking care that the two sets do not overlap. When the number of samples is small, the data may instead be divided into 5 (or 10) parts by 5-fold (or 10-fold) cross-validation; each time, one part serves as the test set and the rest as the training set.
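A k-fold split of this kind might look like the sketch below. It works on sample indices only; the function name, the fixed seed and the interleaved way of forming folds are illustrative choices, not details from the patent.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random, reproducible order
    folds = [indices[i::k] for i in range(k)]   # k disjoint parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(23, k=5))
```

Because the folds are disjoint by construction, the training and test sets of any one round never overlap.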
2 classifying data by utilizing adaboost-SVM model
The invention integrates a Support Vector Machine (SVM) classifier by using an adaboost method, integrates and utilizes all susceptible site information, and improves the accuracy of data classification.
2.1 construction of the Classification model
2.1.1 sub-classification model SVM
The support vector machine (SVM) is a classic machine-learning classification method and belongs to supervised learning. The invention first projects the data into a high-dimensional space using a Gaussian kernel function (Equation 1).
$$K(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right)$$ (Equation 1)
Wherein x is any point in space, y is the selected space center, σ is the width parameter, and K (x, y) is the spatial distance from x to y.
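As a small illustration, the kernel can be evaluated in Python as below. The sketch assumes the common form exp(-||x - y||² / (2σ²)); the exact constant in the patent's own formula image is not recoverable from this text, and the function name is hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel value for two equal-length numeric vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


same_point = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=3.0)
narrow = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=3.0)
wide = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=30.0)
```

The value is 1 when the two points coincide and falls toward 0 as they move apart; a larger σ makes the fall-off slower, which is why σ is called the width parameter.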
A separation plane is then constructed in the high-dimensional space using the SVM model. The separation plane is determined mainly by the few points closest to it (in FIG. 1, point A is one of these closest points). The line connecting a closest point to the separation plane is called a support vector, and the plane for which this distance is maximal is taken as the separation plane; that is, the separation plane separates the data with the largest margin. The method uses a Python 2 based SVM model (reference website: https://www.manning.com/books/machine-learning-in-action).
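Once a kernel SVM has been trained, classifying a new sample only requires the support vectors, their labels and multipliers, and a bias term. The sketch below shows that standard decision rule in Python; it does not perform training, and the support vectors and coefficients in the example are hand-picked for illustration rather than learned from data.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel, assuming the form exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, labels, multipliers, bias, sigma):
    """f(x) = sum_i a_i * y_i * K(sv_i, x) + b."""
    return sum(a * y * gaussian_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, labels, multipliers)) + bias


def svm_predict(x, support_vectors, labels, multipliers, bias, sigma):
    """Class +1 on or above the separation plane, -1 below it."""
    f = svm_decision(x, support_vectors, labels, multipliers, bias, sigma)
    return 1 if f >= 0 else -1


# Hypothetical toy model: one support vector per class.
svs = [[0.0, 0.0], [2.0, 2.0]]
ys = [-1, 1]
alphas = [1.0, 1.0]
near_positive = svm_predict([1.8, 2.1], svs, ys, alphas, bias=0.0, sigma=1.0)
near_negative = svm_predict([0.2, -0.1], svs, ys, alphas, bias=0.0, sigma=1.0)
```

Only the points nearest the separation plane appear in the sum, which matches the description of the plane being determined by its closest points.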
2.1.2 Classification model integration Algorithm Adaboost
Adaboost is an ensemble method that improves classifier performance on the basis of errors: the classifier is trained on the samples several times, repeatedly corrected according to its error rate, and the results are finally combined into an integrated result. The specific method is as follows. All samples are first given equal weights. An SVM is then trained on the training set and the error rate of the classifier is calculated (Equation 2).
Error rate ε = number of misclassified samples / total number of samples (Equation 2)
The width parameter σ of the Gaussian kernel is then adjusted and another SVM is trained on the same data set. During this second training of the classifier, the weight of each sample is readjusted (the weights form a multidimensional vector): the weight of a correctly classified sample is decreased for the next round and the weight of a misclassified sample is increased. That is, after the update a misclassified sample carries a larger weight than a correctly classified one. The specific method is to calculate the weight α of each classifier from its error rate.
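$$\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)$$
The weights may be updated after α is calculated.
If the sample is classified correctly:
$$D_i^{(t+1)} = \frac{D_i^{(t)}\,e^{-\alpha}}{\mathrm{Sum}(D)}$$
If the sample is classified incorrectly:
$$D_i^{(t+1)} = \frac{D_i^{(t)}\,e^{\alpha}}{\mathrm{Sum}(D)}$$
Here α is the weight of the base classifier in the final classifier and ε is the error rate of the classifier; the superscript (t) denotes the iteration, with t the current iteration and t+1 the next; D_i is the weight of the i-th training sample; and Sum(D) is the sum of the updated weights over all samples, so that the new weights sum to 1.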
After the weights D have been calculated, the next iteration begins. Training and weight adjustment are repeated until the training error rate is 0 or the number of weak classifiers reaches a specified value. The invention uses a Python 2 based adaboost ensemble framework (reference website: https://www.manning.com/books/machine-learning-in-action).
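One round of the weight update described above can be sketched in Python as follows. The helper names are hypothetical. The sketch uses the weighted error (the summed weight of the misclassified samples), which is the usual AdaBoost choice and equals the plain error rate whenever all weights are equal; labels and predictions are assumed to be +1 or -1.

```python
import math


def weighted_error(predictions, labels, weights):
    """Sum of the weights of the misclassified samples."""
    return sum(w for p, y, w in zip(predictions, labels, weights) if p != y)


def classifier_weight(error, floor=1e-16):
    """alpha = 0.5 * ln((1 - error) / error); the floor avoids division by zero."""
    return 0.5 * math.log((1.0 - error) / max(error, floor))


def update_weights(weights, predictions, labels, alpha):
    """Shrink the weights of correct samples, grow the wrong ones, renormalize."""
    raw = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_predict(alphas, votes):
    """Sign of the alpha-weighted sum of the weak classifiers' +1/-1 votes."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1


labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]      # the last sample is misclassified
weights = [0.25] * 4             # equal starting weights
error = weighted_error(predictions, labels, weights)
alpha = classifier_weight(error)
new_weights = update_weights(weights, predictions, labels, alpha)
```

In this toy round the single misclassified sample ends up with half of the total weight, so the next weak classifier is pushed to get it right.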
3 classifying and evaluating data
After the training set and the test set have been constructed, they are fed into the adaboost-SVM model for classification. The output of the classification model is compared with each sample's actual disease status. The results are evaluated by calculating the accuracy and plotting ROC curves.
The ROC curve is a tool for selecting the best model. The area under the ROC curve (AUC) can be calculated to judge the quality of the classification model, with reference to Table 1.
TABLE 1
Figure BDA0001378328710000041
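Both evaluation measures can be computed without plotting. The Python sketch below is illustrative only: it computes accuracy directly and obtains the AUC through its rank interpretation, the fraction of (diseased, healthy) pairs in which the diseased sample receives the higher score, with ties counted as one half. The function names and the +1/-1 label convention are assumptions.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def roc_auc(scores, labels):
    """Area under the ROC curve from classifier scores and +1/-1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


acc = accuracy([1, -1, 1, 1], [1, -1, -1, 1])
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1])
auc_partial = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.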
The invention has the beneficial effects that:
At present there is no technique for classifying and predicting psoriasis data; disease status is inferred only from whether individual susceptible sites are present. The invention classifies with SVM, an effective machine-learning classifier, and integrates multiple SVMs within the adaboost framework to improve the accuracy of the classifier. The model can combine SNP, amino acid and HLA type data for classification, so it takes the information of every dimension into account and improves the accuracy of the classification result.
Drawings
FIG. 1 is a schematic diagram of the construction of a separation plane using an SVM model in a high dimensional space;
FIG. 2 is a ROC curve of the classification results of the training set of the present invention;
FIG. 3 is a ROC curve of the test set classification results of the present invention.
Detailed Description
For a better understanding of the present invention, the present invention will be further described with reference to the following examples and the accompanying drawings, which are illustrative of the present invention and are not to be construed as limiting thereof.
Example 1
A total of 5168 psoriasis samples from individuals under 30 years of age were selected for the study. The data constructed from the susceptible sites were classified with an adaboost-SVM model implemented in Python 2.
1 processing and conversion of data
In this embodiment, the variation information of the samples (ped and map files) is obtained through variation detection. HLA region variation information is then extracted according to the susceptible sites (Table 2). The types (sites 1, 2, 7) are scored by edit distance (see the scoring matrix in Table 3); the amino acid sites and SNP sites (sites 3, 4, 5, 6, 8) are scored by presence, 1 if present and 0 if absent.
TABLE 2 susceptible sites
Figure BDA0001378328710000051
TABLE 3 edit distance scoring matrix
Figure BDA0001378328710000052
The data list is thus obtained. Of the 5168 cases, 2000 are selected as the training set and the remaining samples are used as the test set.
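A split of this size can be sketched as below, assuming the 2000 training cases are drawn at random as in the general method; the function name and the seed are illustrative.

```python
import random


def split_train_test(n_samples, n_train, seed=0):
    """Shuffle the sample indices and cut them into two disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_train_test(5168, 2000)
```

With 5168 cases this leaves 3168 samples in the test set.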
2 Substituting into the model
The processed data are substituted into the adaboost-SVM model constructed by the invention for calculation. This scheme uses 9 SVM classifiers, with the value of σ decreasing step by step from 30 to 3.
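The text gives only the end points of the σ schedule, not the individual values. The sketch below assumes evenly spaced steps, which is one plausible reading and not something the patent states; the function name is hypothetical.

```python
def sigma_schedule(start=30.0, stop=3.0, n_classifiers=9):
    """Evenly spaced width parameters from start down to stop (an assumption)."""
    step = (start - stop) / (n_classifiers - 1)
    return [start - i * step for i in range(n_classifiers)]


sigmas = sigma_schedule()
```

Under this assumption each successive SVM uses a narrower kernel than the one before it, so later classifiers fit finer structure in the data.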
3 obtaining the result
As shown in fig. 2 and 3, the classification error rate of the present invention is 23.9%, the training set AUC (area under ROC curve) is 0.833, and the test set AUC is 0.868, which indicates that the present invention achieves a good effect in this embodiment.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (3)

1. A method for constructing a classification model related to psoriasis is characterized by comprising the following steps:
(1) selecting psoriasis susceptible sites;
(2) converting the susceptible loci into input data according to different types of susceptible loci;
(3) classifying data by using an Adaboost-SVM model;
the psoriasis susceptibility site in the step (1) comprises at least one of HLA type, SNP site and amino acid;
the susceptible sites of the HLA type comprise at least one of C*06:02, C*07:04 and DPB1*05:01;
the susceptible sites of the SNP site and the amino acid comprise at least one of SNP31443520, B: Y33Y, B: Y91C, B: Y140S and SNP32472030;
the transformation method in the step (2) comprises the following steps: scoring HLA types by edit distance, and scoring SNPs and amino acids by 0/1; the specific method comprises the following steps: ① for each susceptible HLA type, calculating the edit distance between the individual's type and the susceptible type and using it as the score; ② for each SNP site, recording 1 if the mutation is present and 0 if it is absent; ③ for each amino acid mutation, recording 1 if the mutation is present and 0 if it is absent;
the classification of step (3) includes the steps of:
(31) projecting data to a high-dimensional space by using a Gaussian kernel function, and constructing a separation plane by using an SVM (support vector machine) model in the high-dimensional space;
(32) assigning equal weights to the samples, training the SVM on the training set, calculating the error rate of the classifier, training weak classifiers, and then combining the trained weak classifiers into a strong classifier;
(33) the data is classified and evaluated.
2. The method for constructing a classification model related to psoriasis according to claim 1, wherein the gaussian kernel function of step (31) is formulated as:
$$K(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right)$$
wherein x is any point in space, y is the selected space center, σ is the width parameter, and K (x, y) is the spatial distance from x to y.
3. The method of claim 1, wherein the step (33) comprises calculating an area under the ROC curve.
CN201710692864.8A 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis Active CN107301323B (en)

Publications (2)

Publication Number Publication Date
CN107301323A CN107301323A (en) 2017-10-27
CN107301323B true CN107301323B (en) 2020-11-03

Family

ID=60131823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710692864.8A Active CN107301323B (en) 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis

Country Status (1)

Country Link
CN (1) CN107301323B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052796B (en) * 2017-12-26 2021-07-13 云南大学 Global human mtDNA development tree classification query method based on ensemble learning
CN108961207B (en) * 2018-05-02 2022-11-04 上海大学 Auxiliary diagnosis method for benign and malignant lymph node lesion based on multi-modal ultrasound images
CN114371135B (en) * 2021-10-25 2024-01-30 孙良丹 Evaluation system for evaluating psoriasis and application

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183348A1 (en) * 2015-05-12 2016-11-17 The Johns Hopkins University Methods, systems and devices comprising support vector machine for regulatory sequence features
CN106778065A (en) * 2016-12-30 2017-05-31 同济大学 A kind of Forecasting Methodology based on multivariate data prediction DNA mutation influence interactions between protein

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030032395A (en) * 2001-10-24 2003-04-26 김명호 Method for Analyzing Correlation between Multiple SNP and Disease
GB2477705B (en) * 2008-11-17 2014-04-23 Veracyte Inc Methods and compositions of molecular profiling for disease diagnostics
CN106202936A (en) * 2016-07-13 2016-12-07 为朔医学数据科技(北京)有限公司 A kind of disease risks Forecasting Methodology and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant