CN103559423B

CN103559423B - Method and device for predicting methylation

Info

Publication number: CN103559423B
Application number: CN201310534661.8A
Authority: CN
Inventors: 周丰丰; 赵苗苗; 张召; 刘记奎; 葛瑞泉
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-10-31
Filing date: 2013-10-31
Publication date: 2017-02-15
Anticipated expiration: 2033-10-31
Also published as: CN103559423A

Abstract

The invention belongs to the technical field of bioinformatics and provides a method and a device for predicting methylation. The method comprises the following steps: downloading methylation data; acquiring raw protein sequence data according to the methylation data; preprocessing the raw protein sequence data to obtain a positive data set and a negative data set; encoding the string data in the positive data set and the negative data set to obtain numeric data; modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data. With the invention, no manual participation and no map drawing are needed, time is saved, and cost is low.

Description

Method and device for predicting methylation

Technical field

The invention belongs to the technical field of bioinformatics, and more particularly relates to a method and a device for predicting methylation.

Background art

Methylation is an important modification of proteins and nucleic acids that regulates the switching on and off of gene expression. It is closely related to many diseases such as cancer, aging, and senile dementia, and is one of the important research topics of epigenetics. Understanding the specific mechanisms of methylation will therefore affect many fields of current molecular biology and is also very helpful for disease research and drug design.

Joseph Ecker of the Salk Institute for Biological Studies in the United States and his colleagues used high-throughput sequencing to present the complete map of all methylcytosines in a human embryonic stem cell. Meissner et al. of the Whitehead Institute in the United States also drew a similar map. They used high-throughput bisulfite sequencing and single-molecule sequencing to create a DNA methylation map covering most CpG islands.

In addition, two independent research groups, namely George Church et al. of Harvard University, and Kun Zhang of the University of California together with Yuan Gao of Virginia Commonwealth University et al., combined the traditional tool of bisulfite conversion of methylated DNA with targeted genome capture and high-throughput sequencing to quantify methylation in the human genome.

Although these methods of drawing methylation maps differ slightly, they all use bisulfite conversion, which turns unmethylated cytosines into uracil, which in turn becomes thymine in the subsequent amplification step. Although this way of measuring methylation is very effective, it requires some manual operations to ensure complete conversion, and the map has to be drawn through computational analysis.

In short, measuring methylation by the above laboratory means, whether with in vivo or in vitro techniques, is very time-consuming and expensive, and is also limited by enzymatic reactions.

Summary of the invention

Embodiments of the invention provide a method and a device for predicting methylation, intended to solve the problem that the methylation assays provided by the prior art are very time-consuming, expensive, and limited by enzymatic reactions.

In one aspect, a method for predicting methylation is provided, the method comprising:

downloading methylation data;

acquiring raw protein sequence data according to the methylation data;

preprocessing the raw protein sequence data to obtain a positive data set and a negative data set;

encoding the string data in the positive data set and the negative data set to obtain numeric data;

modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

Further, acquiring the raw protein sequence data according to the methylation data comprises:

reading the names of the methylated proteins one by one from the methylation data;

looking up, for each protein name in turn, the corresponding data on the web page http://www.uniprot.org/uniprot/;

assembling from these data the raw protein sequence corresponding to each protein name, wherein the raw protein sequence data include, for each protein name in the methylation data, both the methylated data and the unmethylated data.

Further, preprocessing the raw protein sequence data to obtain the positive data set and the negative data set comprises:

selecting, from the raw protein sequence data, strings of a preset length centered on K or R;

taking the strings that are methylated as positive controls and the other, unmethylated strings as negative controls;

adding the positive controls to the positive data set and the negative controls to the negative data set.

Further, the coding method used to encode the string data in the positive data set and the negative data set into numeric data is one of probability coding, numeric index coding, orthogonal coding, and binary coding.

Further, the classification algorithm is one of random forest and random tree (RandomTree).

In another aspect, a device for predicting methylation is provided, the device comprising:

a data download unit for downloading methylation data;

a raw data acquisition unit for acquiring raw protein sequence data according to the methylation data;

a preprocessing unit for preprocessing the raw protein sequence data to obtain a positive data set and a negative data set;

a coding unit for encoding the string data in the positive data set and the negative data set to obtain numeric data;

a classification unit for modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

Further, the raw data acquisition unit comprises:

a protein name acquisition module for reading the names of the methylated proteins one by one from the methylation data;

a data lookup module for looking up, for each protein name in turn, the corresponding data on the web page http://www.uniprot.org/uniprot/;

a data assembly module for assembling from these data the raw protein sequence corresponding to each protein name, wherein the raw protein sequence data include, for each protein name in the methylation data, both the methylated data and the unmethylated data.

Further, the preprocessing unit comprises:

a string selection module for selecting, from the raw protein sequence data, strings of a preset length centered on K or R;

a positive and negative control acquisition module for taking the strings that are methylated as positive controls and the other, unmethylated strings as negative controls;

a data set acquisition module for adding the positive controls to the positive data set and the negative controls to the negative data set.

Further, the coding method used by the coding unit is one of probability coding, numeric index coding, orthogonal coding, and binary coding.

Further, the classification algorithm used by the classification unit is one of random forest and random tree.

In the embodiments of the invention, the whole methylation prediction process is completed automatically by computer. Compared with the prior art, this saves time, requires neither manual participation nor map drawing, and costs little.

Brief description of the drawings

Fig. 1 is a flowchart of the method for predicting methylation provided by Embodiment 1 of the invention;

Fig. 2 is a structural block diagram of the device for predicting methylation provided by Embodiment 2 of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

In the embodiments of the invention, methylation data are first downloaded and raw protein sequence data are acquired according to the methylation data; the raw protein sequence data are then preprocessed to obtain a positive data set and a negative data set; the string data in the positive data set and the negative data set are then encoded to obtain numeric data; finally, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the resulting model, and the data in the data set whose methylation status is to be predicted are divided, according to that partition, into two classes, one class being methylated data and the other being unmethylated data.

The implementation of the invention is described in detail below with reference to specific embodiments:

Embodiment one

Fig. 1 shows the flow of the method for predicting methylation provided by Embodiment 1 of the invention, detailed as follows:

In step S101, methylation data are downloaded.

In this embodiment, the methylation data can be obtained from the download address http://dbptm.mbc.nctu.edu.tw/download.php; the data come from the dbPTM database. The downloaded data are stored in the documents Methylation_K.txt and Methylation_R.txt: Methylation_K.txt contains the data on lysine (K) methylation, and Methylation_R.txt contains the data on arginine (R) methylation. In a specific implementation, classification and prediction are carried out separately on the data in the two documents.

In step S102, raw protein sequence data are acquired according to the methylation data.

In this embodiment, the process of acquiring the raw protein sequence data comprises:

Step 1: reading the names of the methylated proteins one by one from the methylation data;

Step 2: looking up, for each protein name in turn, the corresponding data on the web page http://www.uniprot.org/uniprot/;

Step 3: assembling from these data the raw protein sequence corresponding to each protein name, wherein the raw protein sequence data include, for each protein name in the methylation data, both the methylated data and the unmethylated data.

The document Methylation_K.txt does not contain raw protein sequence data; it contains only the protein sequence data of a total of 1013 methylated sites. The unmethylated data of each protein therefore have to be obtained using the names of the methylated proteins contained in Methylation_K.txt.

In a specific implementation, the above three steps can be implemented in Python code, obtaining from the web page http://www.uniprot.org/uniprot/ the raw protein sequence data corresponding to the 1013 methylated sites in Methylation_K.txt.

The document Methylation_R.txt is processed in the same way: Python code can be used to obtain from the web page http://www.uniprot.org/uniprot/ the raw protein sequence data corresponding to the 1192 methylated sites in Methylation_R.txt.
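The three acquisition steps can be sketched in Python as follows. This is only an illustrative sketch and not the code used in the embodiment: it assumes that each protein is identified by a UniProt identifier, that a record can be addressed by appending that identifier and ".fasta" to the base address given above, and that the reply is in FASTA format. The function names are hypothetical.

```python
from urllib.request import urlopen

# Base address taken from the text. The ".fasta" suffix and the lookup by
# identifier are assumptions about how a single record is addressed.
UNIPROT_BASE = "http://www.uniprot.org/uniprot/"


def parse_fasta(text):
    """Return the sequence of the first FASTA record in text as one string."""
    residues = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here, stop reading
            seen_header = True
        elif line and seen_header:
            residues.append(line)
    return "".join(residues)


def fetch_sequence(protein_id, timeout=30):
    """Download one record (needs network access) and return its sequence."""
    url = UNIPROT_BASE + protein_id + ".fasta"
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


def fetch_all(protein_ids, fetch=fetch_sequence):
    """Map each distinct protein identifier to its raw sequence."""
    sequences = {}
    for protein_id in protein_ids:
        if protein_id not in sequences:  # one protein may carry several sites
            sequences[protein_id] = fetch(protein_id)
    return sequences
```

Passing the download function as the `fetch` parameter keeps the lookup loop independent of the network, so it can be exercised with a stand-in function.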

In step S103, the raw protein sequence data are preprocessed to obtain a positive data set and a negative data set.

In this embodiment, a basic concept is defined: the peptide segment of a methylation site. For a given methylation site, the peptide formed by the m residues preceding the site, the site itself, and the n residues following it is called the methylation-site peptide PSP(m, n). Its biological significance is that the properties of a methylation site are largely determined by the biochemical properties of its neighboring amino acids. In practice, the 11-peptide PSP(5, 5) is mainly considered, i.e. m and n are both 5; when fewer than 5 residues are available on either side of the methylation site, the missing positions are filled with '-'.
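The definition of PSP(m, n), including the '-' padding at the ends of a sequence, can be illustrated with a short Python sketch. The function name and the use of 1-based site positions are assumptions made for the illustration.

```python
def psp(sequence, site, m=5, n=5, pad="-"):
    """Return the peptide PSP(m, n) around a site given as a 1-based position.

    Missing residues at either end of the sequence are replaced by pad.
    """
    idx = site - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("site lies outside the sequence")
    left = sequence[max(0, idx - m):idx]      # up to m residues before the site
    right = sequence[idx + 1:idx + 1 + n]     # up to n residues after the site
    return left.rjust(m, pad) + sequence[idx] + right.ljust(n, pad)
```

For the sequence MKTAYIAKQRQISFVK, the K at position 2 has only one residue to its left, so `psp("MKTAYIAKQRQISFVK", 2)` is padded with four '-' characters on the left.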

The process of preprocessing the raw protein sequence data to obtain the positive data set and the negative data set comprises:

Step 11: dividing the raw protein sequence data into positive controls and negative controls.

Specifically, strings of a preset length centered on K or R are first selected from the raw protein sequence data; the strings that are methylated are then taken as positive controls and the other, unmethylated strings as negative controls.

Step 12: adding the positive controls to the positive data set and the negative controls to the negative data set.

For example, for lysine (K) methylation, the 1013 methylated sites are taken as positive controls, and all other K sites on the protein sequences containing those 1013 sites are taken as negative controls.

Here, strings of a preset length centered on K are selected from the raw protein sequence data. Some of these strings have been shown experimentally to be methylated; these are taken as positive controls, and the other, unmethylated strings are taken as negative controls.

In addition, to increase prediction accuracy, a data validation step is added to the preprocessing of the raw protein sequence data: it is judged whether each of the 1013 methylated sites is truly methylated; if so, it is judged whether the 1013 sites contain duplicates, and if they do, the redundant entries are removed. Finally, the resulting positive controls are added to the positive data set P and the resulting negative controls to the negative data set N.

After this step, the positive data set P contains 1012 methylation-positive entries and the negative data set N contains 23915 methylation-negative entries. All entries in the data sets are represented in the 11-peptide PSP(5, 5) form.
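The construction of P and N can be sketched as below. The sketch makes several assumptions that the text does not state: site positions are 1-based, the check that a site is "truly methylated" is reduced to checking that the annotated position actually holds the target residue, and duplicates are removed only from the positive set. All names are hypothetical.

```python
def window(sequence, idx, flank=5, pad="-"):
    """PSP(flank, flank) around the 0-based index idx, padded with pad."""
    left = sequence[max(0, idx - flank):idx].rjust(flank, pad)
    right = sequence[idx + 1:idx + 1 + flank].ljust(flank, pad)
    return left + sequence[idx] + right


def build_datasets(sequences, methylated_sites, residue="K"):
    """Split every residue-centered peptide into positives and negatives.

    sequences:        {protein name: raw sequence}
    methylated_sites: {protein name: set of 1-based methylated positions}
    """
    positives, negatives = [], []
    seen_positives = set()
    for protein, sequence in sequences.items():
        sites = methylated_sites.get(protein, set())
        for idx, amino_acid in enumerate(sequence):
            if amino_acid != residue:
                continue  # annotated sites that are not K (or R) are ignored
            peptide = window(sequence, idx)
            if idx + 1 in sites:
                if peptide not in seen_positives:  # remove redundant positives
                    seen_positives.add(peptide)
                    positives.append(peptide)
            else:
                negatives.append(peptide)
    return positives, negatives
```

Calling the function once with `residue="K"` and once with `residue="R"` mirrors the separate handling of the two documents.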

Arginine (R) methylation is handled in the same way as lysine (K) methylation above. The 1192 methylated sites are taken as positive controls, and all other R sites on the protein sequences containing those 1192 sites are taken as negative controls.

Likewise, it can first be judged whether each of the 1192 methylated sites is truly methylated, and the redundant entries among the truly methylated sites can then be removed. The resulting positive data set P and negative data set N contain 1189 and 32505 11-peptides PSP(5, 5), respectively.

In step S104, the string data in the positive data set and the negative data set are encoded to obtain numeric data.

In this embodiment, the string data in the data sets can be converted into numeric data by coding methods such as probability coding, numeric index coding, and orthogonal coding. Each is described in detail below:

(1) Probability coding

First, the probability with which each letter occurs at each position is computed from the data in data set P, yielding a probability matrix. The string data in data set P are then converted into numeric data using this probability matrix.

For example, there are 1012 positive entries for lysine (K) methylation, each of length 11 (i.e. 11 positions), and together they contain 21 distinct symbols (the letters plus '-'). Counting, over these 1012 entries, the probability of each symbol at each position gives a 21 × 11 probability matrix M. The string data in data sets P and N are then converted, according to M, into numeric data with 11 features.
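Probability coding can be sketched in Python as follows. The ordering of the 21 symbols and the treatment of a symbol that never occurs in P (it is given probability 0) are assumptions of the sketch, as are the function names.

```python
# 20 amino-acid letters plus the padding symbol; the ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def probability_matrix(positives, alphabet=ALPHABET):
    """Return {symbol: [probability at position 0, 1, ...]} computed from P."""
    length = len(positives[0])
    counts = {symbol: [0] * length for symbol in alphabet}
    for peptide in positives:
        for position, symbol in enumerate(peptide):
            if symbol in counts:
                counts[symbol][position] += 1
    total = len(positives)
    return {symbol: [c / total for c in row] for symbol, row in counts.items()}


def encode_probability(peptide, matrix):
    """Replace each symbol by its positional probability (one feature each)."""
    return [
        matrix[symbol][position] if symbol in matrix else 0.0
        for position, symbol in enumerate(peptide)
    ]
```

With 11-residue peptides the matrix has 21 rows of 11 values, matching the 21 × 11 matrix M, and every peptide in P or N becomes a vector of 11 numbers.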

(2) Numeric index coding

Each character in the string data is uniquely identified by a decimal number. In this embodiment the string data contain 21 distinct characters, so the numbers 1 to 21 can be used to represent them, which converts the string data into numeric data.
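A minimal Python sketch of numeric index coding follows. Which character receives which number is not specified in the text, so the alphabetical ordering used here is an assumption.

```python
# 20 amino-acid letters plus the padding symbol; the ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
CODE = {symbol: number for number, symbol in enumerate(ALPHABET, start=1)}


def encode_numeric(peptide):
    """Replace each character by its unique decimal number in 1..21."""
    return [CODE[symbol] for symbol in peptide]
```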

(3) Orthogonal coding

In this embodiment, each of the 21 characters in the string data is replaced by a 21-bit binary code in which exactly one bit is 1 and all other bits are 0. Suppose the 20 amino acids are {A, C, D, ...}; with the character '-' added there are 21 characters in total. Then A is encoded as 000000000000000000001 (twenty 0s), C as 000000000000000000010 (twenty 0s), D as 000000000000000000100 (twenty 0s), and so on. The binary value at each position is then treated as one feature, so the 11 characters of each string are encoded as 11 × 21 = 231 binary (0/1) features.
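Orthogonal coding can be sketched as below. The sketch follows the bit order of the example above (A has its single 1 in the last bit, C in the second-to-last, and so on); the ordering of the remaining characters is an assumption.

```python
# 20 amino-acid letters plus the padding symbol; the ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_orthogonal(peptide):
    """Encode each character as 21 bits with exactly one bit set to 1."""
    features = []
    for symbol in peptide:
        bits = [0] * len(ALPHABET)
        # The i-th symbol sets the i-th bit counted from the right.
        bits[len(ALPHABET) - 1 - ALPHABET.index(symbol)] = 1
        features.extend(bits)
    return features
```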

(4) Binary coding

Each letter in the string data is first uniquely identified by a 5-bit binary value. The binary value at each bit position is then treated as one feature, so each string is converted into 11 × 5 = 55 features.
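A Python sketch of binary coding is given below. The text only requires that each character receive a unique 5-bit value; assigning the values 1 to 21 in alphabetical order is an assumption of the sketch.

```python
# 20 amino-acid letters plus the padding symbol; the ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_binary(peptide):
    """Encode each character as the 5 bits of its number in 1..21."""
    features = []
    for symbol in peptide:
        number = ALPHABET.index(symbol) + 1      # 1..21 fits into 5 bits
        features.extend(int(bit) for bit in format(number, "05b"))
    return features
```

Five bits are enough because 21 is smaller than 2^5 = 32, which is why this coding needs 55 features per peptide instead of the 231 needed by orthogonal coding.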

Of course, to improve the accuracy of methylation prediction, it is also possible, after the protein sequence data of the methylated sites have been obtained, to derive from those protein sequence data the data on the unstable structural regions of the protein and thereby obtain a new string of length 11. This new string is appended to the original 11-peptide PSP(5, 5), giving a string of total length 22. The string of length 22 is then processed with one of the coding methods mentioned in step S104, converted into numeric data, and passed on to the subsequent classification and prediction.

In step S105, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the resulting model, and the data in the data set whose methylation status is to be predicted are finally divided, according to that partition, into two classes, one class being methylated data and the other being unmethylated data.

In the embodiments of the invention, the classification algorithm can be either random forest (RandomForest) or random tree (RandomTree). Of course, the classification algorithm can also be a neural network algorithm, a nearest-neighbor algorithm, a Bayesian algorithm, the WebLogo algorithm, a genetic algorithm, the Lasso algorithm, the LibSVM algorithm, a support vector machine, a clustering algorithm, the decision stump (Decision Stump) algorithm, the Logistic algorithm, and so on. Which classification algorithm is used is not limited in this embodiment.
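To illustrate what modeling and calculating a partition can look like, the sketch below implements a decision stump, one of the alternative algorithms named above, in plain Python. It is not the random forest or random tree of the preferred embodiment; it is chosen only because it fits in a few lines. The function names are hypothetical, and labels are assumed to be 1 for methylated and 0 for unmethylated.

```python
def train_stump(features, labels):
    """Find the single-feature threshold split with the fewest training errors.

    Returns (feature index, threshold, polarity). With polarity 1 a sample is
    predicted methylated when its feature value is >= threshold; with
    polarity -1 when its feature value is <= threshold.
    """
    best = None
    for f in range(len(features[0])):
        for threshold in sorted({row[f] for row in features}):
            for polarity in (1, -1):
                errors = sum(
                    (1 if polarity * (row[f] - threshold) >= 0 else 0) != label
                    for row, label in zip(features, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, polarity)
    return best[1:]


def predict_stump(stump, row):
    """Assign one encoded sample to class 1 (methylated) or 0 (unmethylated)."""
    f, threshold, polarity = stump
    return 1 if polarity * (row[f] - threshold) >= 0 else 0
```

The exhaustive search over features and thresholds plays the role of "calculating the optimal partition"; a random forest combines many deeper trees built on random subsets of the samples and features.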

The data set whose methylation status is to be predicted can be a new data set different from the positive data set and the negative data set, or it can be all or part of the data in the positive data set and the negative data set.

In practice, the ways of carrying out methylation prediction on the data include:

1) All training set: self-validation (the model is built on the whole data set and then used to predict the whole data set)

The model can be built on all the data in the positive data set and the negative data set and then used to predict all the data in the positive data set and the negative data set.

2) 30% test: the model can be built on part of the data in the positive data set and the negative data set and used to predict the remaining part.

For example, the model can be built on 70% of the data in the positive data set and the negative data set and used to predict the remaining 30%.

3) 3-fold cross-validation (Fold-3)

4) 10-fold cross-validation (Fold-10)
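The 70%/30% split and the k-fold cross-validation schemes listed above can be sketched in Python as follows. The text does not say how samples are assigned to the parts; random assignment with a fixed seed is an assumption, and the function names are hypothetical.

```python
import random


def holdout_split(samples, labels, test_fraction=0.3, seed=0):
    """Return ((train samples, train labels), (test samples, test labels))."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train indices, test indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With `k=3` or `k=10` the generator corresponds to Fold-3 and Fold-10: every sample is used for testing exactly once and for modeling in the remaining folds.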

Because the performance of the prediction is to be judged, the actual class of the data whose methylation status is predicted must be known. The class of these data is then predicted with the classification algorithm; if the predicted class is the same as the known class, the prediction is correct, otherwise the prediction is wrong.
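Comparing predicted classes with known classes amounts to tallying four kinds of outcome. A minimal Python sketch, assuming labels of 1 for methylated and 0 for unmethylated and a hypothetical function name, is:

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) from known and predicted class labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for known, guess in zip(actual, predicted):
        if known == 1 and guess == 1:
            tp += 1   # methylated, predicted methylated
        elif known == 0 and guess == 0:
            tn += 1   # unmethylated, predicted unmethylated
        elif known == 0 and guess == 1:
            fp += 1   # unmethylated, wrongly predicted methylated
        else:
            fn += 1   # methylated, wrongly predicted unmethylated
    return tp, tn, fp, fn
```

The correct predictions are TP + TN and the wrong ones are FP + FN.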

As a preferred embodiment of the invention, the prediction results can further be evaluated to verify their reliability.

In a specific application, to check the reliability of each prediction method, four evaluation criteria are used to assess the performance of the prediction results: sensitivity (Sn), specificity (Sp), accuracy (Ac), and the correlation coefficient (MCC). The larger the values of the four criteria, the more accurate and the more stable the prediction results.

Sn, Sp, Ac and MCC satisfy the following formulas, respectively:
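Assuming the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient (an assumption, since the formulas are not reproduced in this text), the four measures can be written as:

```latex
\begin{align*}
Sn  &= \frac{TP}{TP + FN} \\
Sp  &= \frac{TN}{TN + FP} \\
Ac  &= \frac{TP + TN}{TP + TN + FP + FN} \\
MCC &= \frac{TP \times TN - FP \times FN}
            {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align*}
```

Sn, Sp and Ac lie between 0 and 1, and MCC lies between -1 and 1. MCC is the more informative measure when, as here, the negative data set is much larger than the positive one.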

Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives obtained in the test, respectively.

Specifically, the performance of the prediction results obtained with the various codings is shown in Tables 1 to 4. Table 1 gives the performance of predicting whether lysine (K) methylation occurs using probability coding; Table 2 gives the performance of predicting whether lysine (K) methylation occurs using numeric index coding; Table 3 gives the performance of predicting whether arginine (R) methylation occurs using probability coding; Table 4 gives the performance of predicting whether arginine (R) methylation occurs using numeric index coding.

Table 1

Table 2

Table 3

Table 4

In this embodiment, the whole methylation prediction process is completed automatically by computer. Compared with the prior art, neither manual participation nor map drawing is needed, which saves time and keeps the cost low. In addition, predicting from a classification perspective whether a protein is methylated can give better prediction accuracy and can better support disease research, drug design, and related work.

A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.

Embodiment two

Fig. 2 shows a structural block diagram of the device for predicting methylation provided by Embodiment 2 of the invention. For ease of description, only the parts relevant to this embodiment are shown. The device 2 for predicting methylation comprises a data download unit 21, a raw data acquisition unit 22, a preprocessing unit 23, a coding unit 24, and a classification unit 25.

The data download unit 21 is used for downloading methylation data;

the raw data acquisition unit 22 is used for acquiring raw protein sequence data according to the methylation data;

the preprocessing unit 23 is used for preprocessing the raw protein sequence data to obtain a positive data set and a negative data set;

the coding unit 24 is used for encoding the string data in the positive data set and the negative data set to obtain numeric data;

the classification unit 25 is used for modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

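Specifically, the raw data acquisition unit 22 comprises the protein name acquisition module, the data lookup module, and the data assembly module described in the summary of the invention above.

Specifically, the preprocessing unit 23 comprises the string selection module, the positive and negative control acquisition module, and the data set acquisition module described in the summary of the invention above.

Specifically, the coding method used by the coding unit 24 is one of probability coding, numeric index coding, orthogonal coding, and binary coding.

Specifically, the classification algorithm used by the classification unit 25 is one of random forest and random tree.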

The device for predicting methylation provided by this embodiment can be applied in the corresponding method of Embodiment 1 above. For details, see the description of Embodiment 1, which is not repeated here.

It should be noted that, in the above system embodiment, the units are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for predicting methylation, characterized in that the method comprises:

Step 1: downloading methylation data;

Step 2: acquiring raw protein sequence data according to the methylation data;

Step 3: preprocessing the raw protein sequence data to obtain a positive data set and a negative data set;

Step 4: encoding the string data in the positive data set and the negative data set to obtain numeric data;

Step 5: modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data;

wherein the method further comprises the following step:

after the protein sequence data of the methylated sites have been obtained, deriving from those protein sequence data the data on the unstable structural regions of the protein and thereby obtaining a new string of length 11, appending the new string of length 11 to the original 11-peptide PSP(5, 5) to obtain a string of total length 22, encoding the string of length 22 with the coding method mentioned in step 4 to obtain numeric data, and then carrying out the subsequent classification and prediction.

2. The method of claim 1, characterized in that acquiring the raw protein sequence data according to the methylation data comprises:

looking up, for each protein name in turn, the corresponding data on the web page http://www.uniprot.org/uniprot/;

assembling, from the data found for each protein name, the raw protein sequence corresponding to each protein name, wherein the raw protein sequence data include, for each protein name in the methylation data, both the methylated data and the unmethylated data.

3. The method of claim 1, characterized in that preprocessing the raw protein sequence data to obtain the positive data set and the negative data set comprises:

selecting, from the raw protein sequence data, strings of a preset length centered on K or R, K being lysine and R being arginine;

taking the strings that are methylated as positive controls and the other, unmethylated strings as negative controls;

4. The method of claim 1, characterized in that the coding method used to encode the string data in the positive data set and the negative data set into numeric data is one of probability coding, numeric index coding, orthogonal coding, and binary coding.

5. The method of claim 1, characterized in that the classification algorithm is one of random forest and random tree.

6. A device for predicting methylation, characterized in that the device comprises:

a data download unit for downloading methylation data;

a raw data acquisition unit for acquiring raw protein sequence data according to the methylation data;

a preprocessing unit for preprocessing the raw protein sequence data to obtain a positive data set and a negative data set;

a coding unit for encoding the string data in the positive data set and the negative data set to obtain numeric data;

a classification unit for modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing, according to that partition, the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data;

wherein, after the raw data acquisition unit has obtained the protein sequence data of the methylated sites, the data on the unstable structural regions of the protein are derived from those protein sequence data and a new string of length 11 is thereby obtained, the new string of length 11 is appended to the original 11-peptide PSP(5, 5) to obtain a string of total length 22, the string of length 22 is encoded with the coding method of the coding unit to obtain numeric data, and the subsequent classification and prediction are then carried out.

7. The device of claim 6, characterized in that the raw data acquisition unit comprises:

a protein name acquisition module for reading the names of the methylated proteins one by one from the methylation data;

a data lookup module for looking up, for each protein name in turn, the corresponding data on the web page http://www.uniprot.org/uniprot/;

a data assembly module for assembling, from the data found for each protein name, the raw protein sequence corresponding to each protein name, wherein the raw protein sequence data include, for each protein name in the methylation data, both the methylated data and the unmethylated data.

8. The device of claim 6, characterized in that the preprocessing unit comprises:

a string selection module for selecting, from the raw protein sequence data, strings of a preset length centered on K or R, K being lysine and R being arginine;

a positive and negative control acquisition module for taking the strings that are methylated as positive controls and the other, unmethylated strings as negative controls;

a data set acquisition module for adding the positive controls to the positive data set and the negative controls to the negative data set.

9. The device of claim 6, characterized in that the coding method used by the coding unit is one of probability coding, numeric index coding, orthogonal coding, and binary coding.

10. The device of claim 6, characterized in that the classification algorithm used by the classification unit is one of random forest and random tree.