CN107301323A - Method for constructing a classification model related to psoriasis - Google Patents

Method for constructing a classification model related to psoriasis

Info

Publication number
CN107301323A
Authority
CN
China
Prior art keywords
psoriasis
data
classification model
susceptibility loci
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710692864.8A
Other languages
Chinese (zh)
Other versions
CN107301323B (en)
Inventor
孙良丹
张涛
甄琪
王文俊
钱文君
莫晓东
吴静
郑晓冬
李报
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
First Affiliated Hospital of Anhui Medical University
Original Assignee
BGI Shenzhen Co Ltd
First Affiliated Hospital of Anhui Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd and First Affiliated Hospital of Anhui Medical University
Priority to CN201710692864.8A
Publication of CN107301323A
Application granted
Publication of CN107301323B
Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of medical detection, and in particular to a method for constructing a classification model related to psoriasis, comprising the following steps: (1) selecting psoriasis susceptibility loci; (2) converting the susceptibility loci into input data according to their different types; (3) classifying the data using an Adaboost-SVM model. At present there is a lack of techniques for classifying and predicting psoriasis data; existing practice merely infers disease status from the presence or absence of loci. The present invention classifies with the effective machine learning classifier SVM and integrates SVMs through the adaboost framework, which improves classifier accuracy. The model can combine SNP, amino acid and type data for classification, takes the information of every dimension into account, and improves the accuracy of the classification results.

Description

Method for constructing a classification model related to psoriasis
Technical field
The present invention relates to the technical field of medical detection, and in particular to a method for constructing a classification model related to psoriasis.
Background art
Psoriasis is a common complex disease. Its onset has been reported to be associated with genetic factors, especially the human leukocyte antigen (HLA) region, but the truly associated loci are not yet well understood.
With the development of sequencing technology and the deepening of genome research, a study published last year in Nature Genetics reported high-depth sequencing and accurate variant detection of the MHC region in Chinese individuals, and its genome association analysis located several psoriasis susceptibility loci. However, classification and prediction models based on susceptibility loci in the HLA region are still lacking. A classification and prediction tool that uses HLA-region susceptibility loci to classify data is therefore urgently needed.
Psoriasis is most significantly associated with HLA, yet current techniques make no targeted use of the HLA region. Recent breakthroughs in accurate variant detection for the HLA region have precisely located psoriasis-related susceptibility loci on HLA. The present invention encodes these susceptibility loci and then classifies the data with the machine learning model Adaboost, so that the susceptibility locus information found in the HLA region can be integrated and used. Comprehensive analysis of the data with a machine learning model improves classification accuracy and provides a basis for psoriasis prevention and screening.
Summary of the invention
The purpose of the present invention is to overcome the above deficiencies of the prior art. Using psoriasis-related biomarkers found through full coverage of the MHC region, and based on independently associated susceptibility loci in the HLA region, a psoriasis classification model is built with SVM-Adaboost. The invention thus provides a method for constructing a classification model related to psoriasis, which provides a basis for psoriasis prevention and screening.
The present invention is achieved by the following technical solutions:
1 Data processing and conversion
The variants of each sample are encoded. Variant information is obtained from high-throughput sequencing data and includes HLA types (C*06:02, C*07:04, DPB1*05:01) as well as single nucleotide polymorphism sites (SNP sites) and amino acids (snp31443520, B:Y33Y, B:Y91C, B:Y140S, snp32472030).
Each sample is then converted, according to the susceptibility loci, into the input data required by the present invention. HLA types are scored by edit distance; SNPs and amino acids are scored 0/1. The specific method is as follows: 1. for each susceptible HLA type, calculate the edit distance between each individual's type and the susceptible type and use it as the score; 2. for each SNP site, record 1 if the mutation is present and 0 if it is absent; 3. for each amino acid mutation, record 1 if the mutation is present and 0 if it is absent.
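The encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the raw Levenshtein edit distance stands in for the score (the scoring matrix of Table 3 is not reproduced here), each sample carries a single typed allele per HLA gene, and the names `edit_distance`, `encode_sample`, `hla_calls` and `mutations` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Susceptibility loci named in the description.
HLA_TYPES = ["C*06:02", "C*07:04", "DPB1*05:01"]
BINARY_SITES = ["snp31443520", "B:Y33Y", "B:Y91C", "B:Y140S", "snp32472030"]


def encode_sample(hla_calls: dict, mutations: set) -> list:
    """Turn one sample into a feature vector.

    hla_calls: gene name -> typed allele, e.g. {"C": "C*06:02"} (assumed layout).
    mutations: names of the SNP / amino acid mutations present in the sample.
    """
    features = []
    for susceptible in HLA_TYPES:
        gene = susceptible.split("*")[0]
        # Assumption: the score is the raw edit distance to the susceptible type.
        features.append(edit_distance(hla_calls[gene], susceptible))
    for site in BINARY_SITES:
        features.append(1 if site in mutations else 0)
    return features
```

For instance, a sample typed as C*06:02 and DPB1*05:01 that carries only B:Y91C would be encoded by calling `encode_sample({"C": "C*06:02", "DPB1": "DPB1*05:01"}, {"B:Y91C"})`.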
After scoring, the data are split at random into a test set and a training set, taking care that the two sets do not overlap. When the number of samples is small, the data can be divided into 5 (or 10) parts by 5-fold (or 10-fold) cross-validation; each time, 1 part is taken out as the test set and the rest are used as the training set.
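A minimal sketch of such a non-overlapping k-fold split is given below. The function name `k_fold_splits`, the fixed random seed and the index-based interface are assumptions made for illustration and are not taken from the patent.

```python
import random


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random split
    folds = [indices[i::k] for i in range(k)]   # k disjoint, nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample index appears in exactly one test fold, so the training and test sets of any given round never overlap.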
2 Classification of the data using the adaboost-SVM model
The present invention uses the adaboost method to integrate support vector machine (SVM) classifiers, so that the information of all susceptibility loci is combined and used, improving the accuracy of data classification.
2.1 Construction of the classification model
2.1.1 Sub-classification model: SVM
The support vector machine (SVM) model is a classical machine learning classifier and belongs to the supervised learning methods. The present invention first uses a Gaussian kernel function (Formula 1) to project the data into a high-dimensional space.
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \quad \text{(Formula 1)}$$
where x is any point in the space, y is the selected center of the space, σ is the width parameter, and K(x, y) measures the spatial distance from x to y.
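The Gaussian kernel of Formula 1, in the form given in claim 6, K(x, y) = exp(-||x - y||² / (2σ²)), can be written as a short Python function. The function name `gaussian_kernel` and the plain-list inputs are illustrative choices only.

```python
import math


def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The value is 1 when x and y coincide and decays towards 0 as they move apart; a larger σ makes the decay slower, which is why σ is called the width parameter.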
An SVM model is then used to construct the separating plane in the high-dimensional space. The separating plane is determined mainly by the few points nearest to it (point A in Fig. 1 is one such nearest point). The line from a nearest point to the separating plane is called a support vector, and the plane for which the support vectors are maximized is taken as the separating plane; that is, the separating plane separates the data to the greatest extent. The present invention uses an SVM model based on Python 2 (reference site https://www.manning.com/books/machine-learning-in-action).
2.1.2 Classification model ensemble algorithm: Adaboost
Adaboost is an ensemble method that boosts classifier performance on the basis of errors: each sample is trained on repeatedly, the classifier is corrected repeatedly according to the error rate, and the results are finally integrated. The specific method is as follows. First, every sample is assigned an equal weight. An SVM is then trained on the training set and the error rate of the classifier (ε, Formula 2) is calculated.
Error rate ε = number of misclassified samples / total number of samples (Formula 2)
The σ of the Gaussian kernel function is then adjusted, and an SVM is trained again on the same data set. In this second training of the classifier, the weight of each sample is readjusted (the weights here form a multi-dimensional vector): the weight of a correctly classified sample is reduced for the next round, and the weight of a misclassified sample is increased. In other words, in the final result a classifier that classifies correctly carries a larger share of the weight than one that classifies incorrectly. Specifically, the weight α of each classifier is calculated from its error rate.
Once α has been calculated, the sample weights can be updated.
If the sample is classified correctly:
If the sample is misclassified:
α is the weight of the base classifier in the final classifier, and ε is the error rate of the classifier; (t) denotes the iteration order, with t the current round and t+1 the next round; D_i is the weight of the i-th training sample.
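For reference, the textbook AdaBoost expressions that match the symbols defined above are sketched below. They are an assumption: the text names α, ε and D_i but does not itself show these expressions, so the forms used in the standard algorithm are given rather than formulas confirmed by the patent.

```latex
% Classifier weight from its error rate
\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)

% Weight update for a correctly classified sample
D_i^{(t+1)} = \frac{D_i^{(t)}\, e^{-\alpha}}{Z_t}

% Weight update for a misclassified sample
D_i^{(t+1)} = \frac{D_i^{(t)}\, e^{\alpha}}{Z_t}

% Z_t is the sum of all updated, unnormalized weights, so that the new weights sum to 1
```

With ε below 0.5, α is positive, so correctly classified samples lose weight and misclassified samples gain weight, as the description requires.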
After the weights D have been calculated, the next iteration begins. Training and weight adjustment are repeated until the training error rate is 0 or the number of weak classifiers reaches the specified value. The present invention uses an adaboost ensemble framework based on Python 2 (reference site https://www.manning.com/books/machine-learning-in-action).
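One round of this loop can be sketched as below, assuming the standard AdaBoost update (α = ½ ln((1 − ε)/ε), weights scaled by e^(−α) or e^(α) and then normalized). The function name `adaboost_round`, the clamping constant `eps` and the boolean `is_correct` input are hypothetical; the weak classifier itself (the SVM) is left out.

```python
import math


def adaboost_round(weights, is_correct, eps=1e-10):
    """Run one boosting round.

    weights:    current sample weights D_i, assumed to sum to 1.
    is_correct: one bool per sample, True if the weak classifier got it right.
    Returns (error_rate, alpha, new_weights).
    """
    # Weighted error rate; with equal weights this equals misclassified / total.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    clamped = min(max(error, eps), 1.0 - eps)   # avoid log(0) and division by 0
    alpha = 0.5 * math.log((1.0 - clamped) / clamped)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return error, alpha, [w / total for w in updated]
```

Calling this repeatedly, with a new weak classifier trained between calls, reproduces the stopping rule above: the loop ends when the returned error is 0 or when the chosen number of weak classifiers has been reached.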
3 Classification and evaluation of the data
Once the input training set and test set have been built, they are fed into the constructed adaboost-SVM model for classification. The results of the classification model are compared with the actual disease status, and are evaluated by calculating the accuracy and drawing the ROC curve.
The ROC curve is a method for selecting the optimal signal model. The area under the ROC curve (AUC) is usually calculated to judge the quality of a classification model; see Table 1 for details.
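As an illustration of the AUC calculation, the sketch below uses the rank interpretation of the AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the 0/1 label convention are assumptions; the patent does not state how its AUC was computed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and classifier scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.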
Table 1
The beneficial effects of the present invention are:
At present there is a lack of techniques for classifying and predicting psoriasis data; existing practice merely infers disease status from the presence or absence of loci. The present invention classifies with the effective machine learning classifier SVM and integrates SVMs through the adaboost framework, which improves classifier accuracy. The model can combine SNP, amino acid and type data for classification, takes the information of every dimension into account, and improves the accuracy of the classification results.
Brief description of the drawings
Fig. 1 is a schematic diagram of constructing the separating plane with an SVM model in the high-dimensional space;
Fig. 2 is the ROC curve of the training set classification results of the present invention;
Fig. 3 is the ROC curve of the test set classification results of the present invention.
Detailed description of the embodiments
For a better understanding of the present invention, it is further described below with reference to an embodiment and the accompanying drawings. The following embodiment only illustrates the present invention and does not limit it.
Embodiment 1
A total of 5168 psoriasis study samples from individuals under 30 years of age were selected. An adaboost-SVM model written in Python 2 was built on the susceptibility loci and used for classification.
1 Data processing and conversion
In this embodiment, the variant information of the samples is first obtained by variant detection as ped and map files. HLA-region variant information is then extracted according to the susceptibility loci (Table 2). Types (loci 1, 2 and 7) are scored by edit distance (the scoring matrix is shown in Table 3); amino acid sites and SNP sites (loci 3, 4, 5, 6 and 8) are scored by presence or absence, with 1 for present and 0 for absent.
Table 2 Susceptibility loci
Table 3 Edit distance scoring matrix
A data list is thus obtained. Because the data set contains 5168 samples, this embodiment selects 2000 as the training set and uses the remaining samples as the test set.
2 Applying the model
The processed data are fed into the adaboost-SVM model built by the present invention for computation. This embodiment uses 9 SVM classifiers, with σ decreasing step by step from 30 to 3.
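The σ schedule can be sketched as below. The embodiment only states that σ decreases from 30 to 3 across 9 classifiers; the even spacing used here, and the function name `sigma_schedule`, are assumptions.

```python
def sigma_schedule(start=30.0, stop=3.0, n=9):
    """Return n width parameters decreasing from start to stop in equal steps."""
    step = (start - stop) / (n - 1)
    return [start - i * step for i in range(n)]
```

Under this assumption each successive SVM uses a σ that is 3.375 smaller than the previous one, so later weak classifiers use narrower kernels.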
3 Results
As shown in Figs. 2 and 3, the classification error rate in this embodiment is 23.9%, the training set AUC (area under the ROC curve) is 0.833 and the test set AUC is 0.868, which shows that the present invention achieves good results in this embodiment.
The embodiment described above only describes a preferred embodiment of the present invention and does not limit its scope. Without departing from the design spirit of the present invention, all modifications and improvements made to the technical solution of the present invention by those of ordinary skill in the art shall fall within the scope of protection determined by the claims of the present invention.

Claims (7)

1. A method for constructing a classification model related to psoriasis, characterized by comprising the following steps:
(1) selecting psoriasis susceptibility loci;
(2) converting the susceptibility loci into input data according to their different types;
(3) classifying the data using an Adaboost-SVM model.
2. The method for constructing a classification model related to psoriasis according to claim 1, characterized in that the psoriasis susceptibility loci of step (1) include at least one of HLA types, SNP sites and amino acids.
3. The method for constructing a classification model related to psoriasis according to claim 2, characterized in that the susceptibility loci of the HLA types include at least one of C*06:02, C*07:04 and DPB1*05:01; and the susceptibility loci of the SNP sites and amino acids include at least one of snp31443520, B:Y33Y, B:Y91C, B:Y140S and snp32472030.
4. The method for constructing a classification model related to psoriasis according to claim 1, characterized in that the conversion method of step (2) is: if the susceptibility locus is a region, it is scored according to its similarity; if the susceptibility locus is a single site, it is scored according to its presence or absence.
5. The method for constructing a classification model related to psoriasis according to claim 1, characterized in that the classification of step (3) comprises the following steps:
(31) projecting the data into a high-dimensional space using a Gaussian kernel function, then constructing the separating plane in the high-dimensional space using an SVM model;
(32) assigning an equal weight to every sample, then training an SVM on the training set and calculating the error rate of the classifier to train weak classifiers, and then combining the weak classifiers obtained from each training into a strong classifier;
(33) classifying and evaluating the data.
6. The method for constructing a classification model related to psoriasis according to claim 5, characterized in that the formula of the Gaussian kernel function of step (31) is:
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right),$$
where x is any point in the space, y is the selected center of the space, σ is the width parameter, and K(x, y) measures the spatial distance from x to y.
7. The method for constructing a classification model related to psoriasis according to claim 5, characterized in that the evaluation method of step (33) is calculating the area under the ROC curve.
CN201710692864.8A 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis Active CN107301323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710692864.8A CN107301323B (en) 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710692864.8A CN107301323B (en) 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis

Publications (2)

Publication Number Publication Date
CN107301323A (en) 2017-10-27
CN107301323B (en) 2020-11-03

Family

ID=60131823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710692864.8A Active CN107301323B (en) 2017-08-14 2017-08-14 Method for constructing classification model related to psoriasis

Country Status (1)

Country Link
CN (1) CN107301323B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052796A (en) * 2017-12-26 2018-05-18 云南大学 Global human mtDNA development tree classification querying methods based on integrated study
CN108961207A (en) * 2018-05-02 2018-12-07 上海大学 Lymph node Malignant and benign lesions aided diagnosis method based on multi-modal ultrasound image
CN114371135A (en) * 2021-10-25 2022-04-19 孙良丹 Evaluation system for evaluating psoriasis and application

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030077617A1 (en) * 2001-10-24 2003-04-24 Myungho Kim Method for diagnosis of a disease by using multiple SNP (single nucleotide polymorphism) variations and clinical data
US20130225662A1 (en) * 2008-11-17 2013-08-29 Veracyte, Inc. Methods and compositions of molecular profiling for disease diagnostics
WO2016183348A1 (en) * 2015-05-12 2016-11-17 The Johns Hopkins University Methods, systems and devices comprising support vector machine for regulatory sequence features
CN106202936A (en) * 2016-07-13 2016-12-07 为朔医学数据科技(北京)有限公司 A kind of disease risks Forecasting Methodology and system
CN106778065A (en) * 2016-12-30 2017-05-31 同济大学 A kind of Forecasting Methodology based on multivariate data prediction DNA mutation influence interactions between protein

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
VIMAL K. SHRIVASTAVA ET AL: "A novel and robust Bayesian approach for segmentation of psoriasis lesions and its risk stratification", 《COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE》 *
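刘杰: "Identification of gene polymorphism loci associated with lung cancer and construction of a prediction model", 《China Master's Theses Full-text Database, Medicine and Health Sciences》 *
王文俊: "Fine mapping of the HLA region for psoriasis in the Han Chinese population", 《China Doctoral Dissertations Full-text Database, Medicine and Health Sciences》 *
王晓丹: "An SVM classifier based on AdaBoost", 《Journal of Air Force Engineering University (Natural Science Edition)》 *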

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052796A (en) * 2017-12-26 2018-05-18 云南大学 Global human mtDNA development tree classification querying methods based on integrated study
CN108052796B (en) * 2017-12-26 2021-07-13 云南大学 Global human mtDNA development tree classification query method based on ensemble learning
CN108961207A (en) * 2018-05-02 2018-12-07 上海大学 Lymph node Malignant and benign lesions aided diagnosis method based on multi-modal ultrasound image
CN108961207B (en) * 2018-05-02 2022-11-04 上海大学 Auxiliary diagnosis method for benign and malignant lymph node lesion based on multi-modal ultrasound images
CN114371135A (en) * 2021-10-25 2022-04-19 孙良丹 Evaluation system for evaluating psoriasis and application
CN114371135B (en) * 2021-10-25 2024-01-30 孙良丹 Evaluation system for evaluating psoriasis and application

Also Published As

Publication number Publication date
CN107301323B (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN105426842B (en) Multiclass hand motion recognition method based on support vector machines and surface electromyogram signal
CN104408478B (en) A kind of hyperspectral image classification method based on the sparse differentiation feature learning of layering
CN107301323A (en) A kind of construction method of the disaggregated model related to psoriasis
CN106778853A (en) Unbalanced data sorting technique based on weight cluster and sub- sampling
CN105975992A (en) Unbalanced data classification method based on adaptive upsampling
CN104008375B (en) The integrated face identification method of feature based fusion
CN104462409B (en) Across language affection resources data identification method based on AdaBoost
CN103793694B (en) Human face recognition method based on multiple-feature space sparse classifiers
CN103400144A (en) Active learning method based on K-neighbor for support vector machine (SVM)
CN105069774A (en) Object segmentation method based on multiple-instance learning and graph cuts optimization
CN111369045A (en) Method for predicting short-term photovoltaic power generation power
CN107767387A (en) Profile testing method based on the global modulation of changeable reception field yardstick
CN103426004B (en) Model recognizing method based on error correcting output codes
CN107943830A (en) A kind of data classification method suitable for higher-dimension large data sets
CN106251362A (en) A kind of sliding window method for tracking target based on fast correlation neighborhood characteristics point and system
Aldhlan et al. Novel mechanism to improve hadith classifier performance
CN103631753A (en) Progressively-decreased subspace ensemble learning algorithm
Ozkok et al. Convolutional neural network analysis of recurrence plots for high resolution melting classification
CN116821698A (en) Wheat scab spore detection method based on semi-supervised learning
CN103810482A (en) Multi-information fusion classification and identification method
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN111815609A (en) Pathological image classification method and system based on context awareness and multi-model fusion
CN102930291A (en) Automatic K adjacent local search heredity clustering method for graphic image
CN107451538A (en) Human face data separability feature extracting method based on weighting maximum margin criterion
CN103246897B (en) A kind of Weak Classifier inner structure method of adjustment based on AdaBoost

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant