CN103559423A

CN103559423A - Method and device for predicting methylation

Info

Publication number: CN103559423A
Application number: CN201310534661.8A
Authority: CN
Inventors: 周丰丰; 赵苗苗; 张召; 刘记奎; 葛瑞泉
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-10-31
Filing date: 2013-10-31
Publication date: 2014-02-05
Anticipated expiration: 2033-10-31
Also published as: CN103559423B

Abstract

The invention applies to the technical field of bioinformatics and provides a method and a device for predicting methylation. The method comprises: downloading methylated data; acquiring original protein sequence data according to the methylated data; preprocessing the original protein sequence data to obtain a positive data set and a negative data set; encoding the character string data in the positive data set and the negative data set to obtain numeric data; and modeling the numeric data in the positive data set and the negative data set using a classification algorithm, calculating the optimal partitioning scheme from the resulting model, and finally, according to the partitioning scheme, dividing the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data. The invention requires no manual participation and no map drawing, saves time, and is inexpensive.

Description

A method and device for predicting methylation

Technical field

The invention belongs to the field of bioinformatics, and in particular relates to a method and device for predicting methylation.

Background art

Methylation is an important modification of proteins and nucleic acids. It regulates the switching on and off of gene expression, is closely associated with many diseases and conditions such as cancer, aging, and senile dementia, and is one of the major research topics in epigenetics. Understanding specific methylation mechanisms will therefore influence many areas of current molecular biology and will also greatly benefit disease research, drug design, and related fields.

Joseph Ecker of the Salk Institute for Biological Studies in the United States and his colleagues used high-throughput sequencing to present a complete map of all methylcytosines in a human embryonic stem cell. Meissner et al. of the Whitehead Institute in the United States had also drawn a similar map. They used high-throughput bisulfite sequencing and single-molecule sequencing to produce DNA methylation maps covering most CpG islands.

In addition, two independent research groups, namely George Church et al. of Harvard University, and Kun Zhang of the University of California together with Yuan Gao of Virginia Commonwealth University et al., combined bisulfite conversion of DNA, the traditional tool for methylation analysis, with targeted genome capture and high-throughput sequencing to quantitatively measure methylation in the human genome.

Although these methods of drawing methylation maps differ slightly, all of them use bisulfite conversion, which converts unmethylated cytosine into uracil; the uracil is in turn converted into thymine in the subsequent amplification step. Although this way of measuring methylation is very effective, it requires some manual operations to ensure complete conversion, and the map must then be drawn through computational analysis.

In short, measuring methylation by the above laboratory means, whether the technique is in vivo or in vitro, is not only very time-consuming and expensive but is also limited by the enzyme reactions involved.

Summary of the invention

The embodiments of the present invention provide a method and device for predicting methylation, intended to solve the problem that the methods for measuring methylation provided by the prior art are not only very time-consuming and expensive but also limited by enzyme reactions.

In one aspect, a method for predicting methylation is provided, the method comprising:

downloading methylated data;

acquiring original protein sequence data according to the methylated data;

preprocessing the original protein sequence data to obtain a positive data set and a negative data set;

encoding the character string data in the positive data set and the negative data set to obtain numeric data;

modeling the numeric data in the positive data set and the negative data set using a classification algorithm, calculating the optimal partitioning scheme from the resulting model, and finally, according to the partitioning scheme, dividing the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

Further, acquiring original protein sequence data according to the methylated data comprises:

reading the names of the methylated proteins one by one from the methylated data;

looking up, one by one according to the protein names, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;

forming from these data the original protein sequence corresponding to each protein name, the original protein sequence data comprising both the methylated data and the unmethylated data corresponding to each protein name in the methylated data.

Further, preprocessing the original protein sequence data to obtain a positive data set and a negative data set comprises:

selecting, from the original protein sequence data, character strings of a preset length centered on K or R;

taking the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;

adding the positive controls to the positive data set and the negative controls to the negative data set.

Further, the encoding method used to encode the character string data in the positive data set and the negative data set into numeric data is one of probabilistic encoding, numeric-index encoding, orthogonal encoding, and binary encoding.

Further, the classification algorithm is one of random forest and random tree (RandomTree).

In another aspect, a device for predicting methylation is provided, the device comprising:

a data download unit, configured to download methylated data;

a raw data acquisition unit, configured to acquire original protein sequence data according to the methylated data;

a preprocessing unit, configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;

an encoding unit, configured to encode the character string data in the positive data set and the negative data set to obtain numeric data;

a classification unit, configured to model the numeric data in the positive data set and the negative data set using a classification algorithm, calculate the optimal partitioning scheme from the resulting model, and finally, according to the partitioning scheme, divide the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

Further, the raw data acquisition unit comprises:

a protein name acquisition module, configured to read the names of the methylated proteins one by one from the methylated data;

a data lookup module, configured to look up, one by one according to the protein names, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;

a data assembly module, configured to form from these data the original protein sequence corresponding to each protein name, the original protein sequence data comprising both the methylated data and the unmethylated data corresponding to each protein name in the methylated data.

Further, the preprocessing unit comprises:

a character string selection module, configured to select, from the original protein sequence data, character strings of a preset length centered on K or R;

a positive and negative control acquisition module, configured to take the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;

a data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.

Further, the encoding method used by the encoding unit is one of probabilistic encoding, numeric-index encoding, orthogonal encoding, and binary encoding.

Further, the classification algorithm used by the classification unit is one of random forest and random tree.

In the embodiments of the present invention, the whole methylation prediction process is completed automatically by a computer. Compared with the prior art, it requires no manual participation and no map drawing, saves time, and is inexpensive.

Brief description of the drawings

Fig. 1 is a flowchart of the method for predicting methylation provided by Embodiment 1 of the present invention;

Fig. 2 is a structural block diagram of the device for predicting methylation provided by Embodiment 2 of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

In the embodiments of the present invention, methylated data are first downloaded, and original protein sequence data are acquired according to the methylated data. The original protein sequence data are then preprocessed to obtain a positive data set and a negative data set. Next, the character string data in the positive data set and the negative data set are encoded to obtain numeric data. Finally, the numeric data in the positive data set and the negative data set are modeled using a classification algorithm, the optimal partitioning scheme is calculated from the resulting model, and, according to the partitioning scheme, the data in the data set whose methylation status is to be predicted are divided into two classes, one class being methylated data and the other being unmethylated data.

The implementation of the present invention is described in detail below with reference to specific embodiments:

Embodiment 1

Fig. 1 shows the flow of the method for predicting methylation provided by Embodiment 1 of the present invention, detailed as follows:

In step S101, methylated data are downloaded.

In this embodiment, the methylated data can be obtained from the download address http://dbptm.mbc.nctu.edu.tw/download.php; the data come from the dbPTM database. The downloaded data are stored in the files Methylation_K.txt and Methylation_R.txt: Methylation_K.txt contains data on methylation at lysine (K), and Methylation_R.txt contains data on methylation at arginine (R). In a specific implementation, classification and prediction are performed separately on the data in Methylation_K.txt and in Methylation_R.txt.

In step S102, original protein sequence data are acquired according to the methylated data.

In this embodiment, the process of acquiring the original protein sequence data comprises:

Step 1: reading the names of the methylated proteins one by one from the methylated data;

Step 2: looking up, one by one according to the protein names, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;

Step 3: forming from these data the original protein sequence corresponding to each protein name, the original protein sequence data comprising both the methylated data and the unmethylated data corresponding to each protein name in the methylated data.

Because the file Methylation_K.txt does not contain the original protein sequence data but only a total of 1013 methylation-site records, the names of the methylated proteins in Methylation_K.txt must be used to obtain, for each protein, the data that are not methylated.

In a specific implementation, the above three steps can be implemented in Python code, so that the original protein sequence data corresponding to the 1013 methylation-site records in Methylation_K.txt are obtained from the web page http://www.uniprot.org/uniprot/.

The file Methylation_R.txt is processed in the same way: the Python code can be used to obtain, from the web page http://www.uniprot.org/uniprot/, the original protein sequence data corresponding to the 1192 methylation-site records in Methylation_R.txt.
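The lookup in steps 1 to 3 might be sketched in Python as follows. This is only an illustrative sketch, not the code used in the embodiment: it assumes that each protein is identified by a UniProt accession and that a FASTA record is served at the base address quoted above followed by the accession and `.fasta`. The helper names `fasta_url`, `parse_fasta`, and `fetch_sequence`, and the sample record, are hypothetical.

```python
from urllib.request import urlopen

BASE_URL = "http://www.uniprot.org/uniprot/"  # address quoted in the description


def fasta_url(accession: str) -> str:
    # Assumption: the FASTA record lives at <base><accession>.fasta
    return f"{BASE_URL}{accession}.fasta"


def parse_fasta(text: str) -> str:
    """Join the sequence lines of a single FASTA record, skipping '>' headers."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def fetch_sequence(accession: str, timeout: float = 30.0) -> str:
    """Download one record and return its sequence (needs network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Offline demonstration with a made-up record
record = ">sp|EXAMPLE|HYPOTHETICAL placeholder header\nMKTAYIAKQR\nQISFVK\n"
print(parse_fasta(record))  # MKTAYIAKQRQISFVK
```

Only `parse_fasta` is exercised here; `fetch_sequence` would be called once per protein name read from Methylation_K.txt or Methylation_R.txt.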

In step S103, the original protein sequence data are preprocessed to obtain a positive data set and a negative data set.

This embodiment defines a key concept: the peptide segment of a methylation site. The peptide formed by a methylation site, the m residues before it, and the n residues after it is called the peptide segment PSP(m, n) of the methylation site. Its biological significance is that the properties of a methylation site are usually determined by the biochemical characteristics of its neighboring amino acids. In practice, the 11-peptide PSP(5,5) is mainly considered, that is, m and n are both 5; when fewer than 5 residues are available on either side of the methylation site, '-' is used to fill the missing positions.
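The PSP(m, n) definition can be illustrated with a short Python sketch. The function name `psp`, the 0-based site index, and the example sequence are assumptions made for illustration only.

```python
def psp(sequence: str, site: int, m: int = 5, n: int = 5, pad: str = "-") -> str:
    """Return the PSP(m, n) window around a 0-based site index, padded with `pad`."""
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    left = sequence[max(0, site - m):site]    # up to m residues before the site
    right = sequence[site + 1:site + 1 + n]   # up to n residues after the site
    return pad * (m - len(left)) + left + sequence[site] + right + pad * (n - len(right))


example = "MKTAYIAKQRQISFVK"   # made-up sequence
print(psp(example, 1))    # ----MKTAYIA  (K near the start, padded on the left)
print(psp(example, 7))    # TAYIAKQRQIS  (K with five residues on each side)
print(psp(example, 15))   # QISFVK-----  (K at the end, padded on the right)
```

Every window has length m + n + 1, so PSP(5,5) always yields an 11-character string with the candidate site at the center.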

The process of preprocessing the original protein sequence data to obtain a positive data set and a negative data set comprises:

Step 11: dividing the original protein sequence data into positive controls and negative controls.

Specifically, character strings of a preset length centered on K or R are first selected from the original protein sequence data; the methylated character strings are then taken as positive controls, and the other, unmethylated character strings as negative controls.

Step 12: adding the positive controls to the positive data set and the negative controls to the negative data set.

For example, for methylation at lysine (K), the 1013 methylation-site records are regarded as positive controls, and all other K sites on the protein sequences to which the 1013 methylation-site records belong are regarded as negative controls.

Here, character strings of a preset length centered on K are selected from the original protein sequence data. Some of these character strings have been experimentally shown to be methylated; these are taken as positive controls, and the other, unmethylated character strings are taken as negative controls.

In addition, to increase prediction accuracy, a data verification process is added to the preprocessing of the original protein sequence data: it is determined whether the 1013 methylation-site records are really methylated; if so, it is determined whether these records contain duplicates; if they do, the redundant duplicates are removed. Finally, the resulting positive controls are added to the positive data set P, and the resulting negative controls are added to the negative data set N.

The positive data set P and the negative data set N then contain 1012 positive methylation records and 23915 negative methylation records, respectively. All data in the data sets are represented in the form of the 11-peptide PSP(5,5).

Methylation at arginine (R) is processed in the same way as methylation at lysine (K) above. The 1192 methylation-site records are regarded as positive controls, and all other R sites on the protein sequences to which the 1192 methylation-site records belong are regarded as negative controls.

Likewise, the 1192 methylation-site records can first be checked to determine whether they are really methylated, and the redundant records among the truly methylated ones are then removed. The resulting positive data set P and negative data set N contain 1189 and 32505 11-peptides PSP(5,5), respectively.

In step S104, the character string data in the positive data set and the negative data set are encoded to obtain numeric data.

In this embodiment, the character string data in the data sets can be converted into numeric data by encoding methods such as probabilistic encoding, numeric-index encoding, and orthogonal encoding. Each is described in detail below:

(1) Probabilistic encoding

First, the probability of each letter occurring at each position is counted from the data in data set P, which yields a probability statistics matrix; this matrix is then used to convert the character string data in data set P into the corresponding numeric data.

For example, there are 1012 positive methylation records for lysine (K), each of length 11 (regarded as 11 positions), and all the data together contain 21 distinct characters (the letters plus '-'). Counting, for these 1012 records, the probability of each character at each position gives a 21*11 probability statistics matrix M, and the character string data in data sets P and N are then converted into numeric data with 11 features according to M.
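The probabilistic encoding might be sketched in Python as follows. The sketch uses toy 3-character strings instead of 11-peptides to keep the numbers easy to check, stores the matrix as one dictionary per position, and assigns probability 0 to a character never seen at a position; these choices, and the names `position_probabilities` and `encode_probability`, are assumptions for illustration.

```python
from collections import Counter


def position_probabilities(positives):
    """For each position, map every observed character to its relative frequency."""
    total = len(positives)
    length = len(positives[0])
    return [
        {ch: count / total for ch, count in Counter(p[i] for p in positives).items()}
        for i in range(length)
    ]


def encode_probability(peptide, matrix):
    """Replace each character by its probability at that position (0.0 if unseen)."""
    return [matrix[i].get(ch, 0.0) for i, ch in enumerate(peptide)]


positives = ["AKC", "AKD", "CKD", "AKD"]     # toy positive data set
M = position_probabilities(positives)
print(encode_probability("AKD", M))   # [0.75, 1.0, 0.75]
print(encode_probability("DKA", M))   # [0.0, 1.0, 0.0]
```

With real 11-peptides the same functions return 11 features per string, matching the 11-feature numeric data described above.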

(2) Numeric-index encoding

Each character in the character string data is uniquely identified by a decimal number. For example, the character string data in this embodiment contain 21 distinct characters, so the numbers 1 to 21 can be used to represent the characters, thereby converting the character string data into numeric data.
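A short Python sketch of this mapping follows. The description does not say which character receives which number, so the ordering in `ALPHABET` (the 20 standard amino acid letters in alphabetical order, then '-') is an assumption, as are the names `INDEX` and `encode_index`.

```python
# Assumed ordering: 20 standard amino acids, then the padding character
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {ch: number for number, ch in enumerate(ALPHABET, start=1)}  # 1..21


def encode_index(peptide):
    """Replace each character by its number between 1 and 21."""
    return [INDEX[ch] for ch in peptide]


print(encode_index("----MKTAYIA"))
# [21, 21, 21, 21, 11, 9, 17, 1, 20, 8, 1]
```

Each 11-peptide therefore becomes 11 integer features.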

(3) Orthogonal encoding

In this embodiment, each of the 21 characters in the character string data is replaced by a 21-bit binary code in which exactly one bit is 1 and all other bits are 0. Suppose the 20 amino acids are {A, C, D, ...}; adding the character '-' gives 21 characters in total. Then A is encoded as 000000000000000000001, C as 000000000000000000010, D as 000000000000000000100, and so on (each code contains twenty 0s and a single 1). The binary value at each bit position is then treated as one feature, so the 11 characters of each character string are encoded as 11*21 = 231 0/1 features.
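The orthogonal encoding can be sketched in Python as shown below. The bit layout follows the examples above (A has its 1 in the last bit, C in the second-to-last bit); the ordering of the remaining characters in `ALPHABET` is an assumption, and the names `one_hot` and `encode_orthogonal` are hypothetical.

```python
# Assumed ordering: 20 standard amino acids, then the padding character
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(ch):
    """Return 21 bits with a single 1; the first character sets the last bit."""
    bits = [0] * len(ALPHABET)
    bits[len(ALPHABET) - 1 - ALPHABET.index(ch)] = 1
    return bits


def encode_orthogonal(peptide):
    """Concatenate the 21-bit codes of all characters in the peptide."""
    return [bit for ch in peptide for bit in one_hot(ch)]


print("".join(map(str, one_hot("A"))))   # 000000000000000000001
print("".join(map(str, one_hot("C"))))   # 000000000000000000010
features = encode_orthogonal("----MKTAYIA")
print(len(features), sum(features))       # 231 11
```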

(4) Binary encoding

First, each character in the character string data is uniquely identified by a 5-bit binary value; the binary value at each bit position is then treated as one feature, so each character string is converted into 11*5 = 55 features.
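A possible Python sketch of the 5-bit binary encoding is given below. The description does not specify which 5-bit value belongs to which character; the sketch simply writes an assumed index between 1 and 21 in binary. The names `ALPHABET`, `CODE`, and `encode_binary` are hypothetical.

```python
# Assumed ordering: 20 standard amino acids, then the padding character
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
# Each character gets the 5-bit binary form of its index 1..21 (21 < 2**5)
CODE = {
    ch: [int(bit) for bit in format(number, "05b")]
    for number, ch in enumerate(ALPHABET, start=1)
}


def encode_binary(peptide):
    """Concatenate the 5-bit codes of all characters in the peptide."""
    return [bit for ch in peptide for bit in CODE[ch]]


print(CODE["A"])                          # [0, 0, 0, 0, 1]
print(CODE["-"])                          # [1, 0, 1, 0, 1]
print(len(encode_binary("----MKTAYIA")))  # 55
```

Five bits suffice because 21 distinct characters fit within the 32 values a 5-bit number can take.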

Of course, to improve the accuracy of methylation prediction, after the protein sequence data of the methylation-site data have been obtained, data on the unstable structural regions of the protein can also be obtained from those protein sequence data, yielding a new character string of length 11. This new string is appended to the original 11-peptide PSP(5,5) to give a character string of total length 22, which is then processed with the encoding methods described in step S104, converted into numeric data, and passed on to the subsequent classification and prediction.

In step S105, the numeric data in the positive data set and the negative data set are modeled using a classification algorithm, the optimal partitioning scheme is calculated from the resulting model, and finally, according to the partitioning scheme, the data in the data set whose methylation status is to be predicted are divided into two classes, one class being methylated data and the other being unmethylated data.

In the embodiments of the present invention, the classification algorithm may be either random forest (RandomForest) or random tree (RandomTree). Of course, the classification algorithm may also be a neural network algorithm, a nearest neighbor algorithm, a Bayesian algorithm, the WebLogo algorithm, a genetic algorithm, the Lasso algorithm, the LibSVM algorithm, a support vector machine, a clustering algorithm, the decision stump (Decision Stump) algorithm, the Logistic algorithm, and so on; this embodiment does not limit which classification algorithm is used.

The data set whose methylation status is to be predicted may be a new data set different from the positive data set and the negative data set, or it may be all or part of the data in the positive data set and the negative data set.

In practical application, the methods of performing methylation prediction on the data include:

1) All training set: self-validation (the model is built on the whole data set and then used to predict the whole data set)

All the data in the positive data set and the negative data set can be used for modeling, and all the data in the positive data set and the negative data set are then predicted.

2) 30% test: part of the data in the positive data set and the negative data set can be used to build the model, and the remaining data are predicted.

For example, 70% of the data in the positive data set and the negative data set can be used for modeling, and the remaining 30% of the data are predicted.

3) 3-fold cross-validation (Fold-3)

4) 10-fold cross-validation (Fold-10)
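The 70%/30% split and the k-fold cross-validation schemes listed above might be sketched in plain Python as follows. The shuffling, the fixed seed, and the function names `holdout_split` and `k_fold` are assumptions for illustration; the description does not state how the data are assigned to the parts or folds.

```python
import random


def holdout_split(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of `items` and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k, seed=0):
    """Yield (train, test) pairs; each item is used for testing exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


samples = list(range(10))               # stand-ins for encoded peptides
train, test = holdout_split(samples)
print(len(train), len(test))            # 7 3
for fold_train, fold_test in k_fold(samples, k=3):
    print(len(fold_train), len(fold_test))
```

To mirror the example in which 70% of both the positive and the negative data set are used for modeling, `holdout_split` could be applied to each data set separately and the resulting parts combined. Calling `k_fold` with `k=3` or `k=10` corresponds to Fold-3 and Fold-10.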

Because the performance of the prediction has to be judged, the actual class of the data whose methylation status is to be predicted must be known. The classification algorithm is then used to predict the class of these data; if the predicted class is the same as the known class, the prediction is correct, and otherwise the prediction is wrong.

As a preferred embodiment of the present invention, the prediction results can be further evaluated to verify their reliability.

In a specific application, to check the reliability of each prediction method, four evaluation criteria are used to assess the performance of the prediction results: sensitivity (Sn), specificity (Sp), accuracy (Ac), and the Matthews correlation coefficient (MCC). The larger the values of the four criteria, the higher the accuracy and the better the stability of the prediction results.

Sn, Sp, Ac, and MCC are given by the following formulas:

Sn = \frac{TP}{TP + FN};

Sp = \frac{TN}{TN + FP};

Ac = \frac{TP + TN}{TP + FP + TN + FN};

MCC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} .

Here, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives obtained in the test, respectively.
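The four formulas can be computed directly from the confusion counts, as in the following Python sketch. The counts in the example are made up for illustration and are not results of the embodiment; returning 0.0 when a denominator is zero is a convention assumed here, and the name `evaluate` is hypothetical.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return Sn, Sp, Ac and MCC computed from the confusion counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero
        return numerator / denominator if denominator else 0.0

    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "Ac": ratio(tp + tn, tp + fp + tn + fn),
        "MCC": ratio(
            tp * tn - fn * fp,
            math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
        ),
    }


# Hypothetical counts, for illustration only
scores = evaluate(tp=80, tn=900, fp=100, fn=20)
for name, value in scores.items():
    print(f"{name} = {value:.3f}")
```

For highly imbalanced data such as the positive and negative data sets described above, MCC is often more informative than Ac, because a classifier that predicts every site as unmethylated still reaches a high Ac but an MCC of 0.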

Specifically, the performance of the prediction results obtained with the various encodings is shown in Table 1, Table 2, Table 3, and Table 4. Table 1 shows the performance of predicting, with probabilistic encoding, whether methylation occurs at lysine (K); Table 2 shows the performance of predicting, with numeric-index encoding, whether methylation occurs at lysine (K); Table 3 shows the performance of predicting, with probabilistic encoding, whether methylation occurs at arginine (R); and Table 4 shows the performance of predicting, with numeric-index encoding, whether methylation occurs at arginine (R).

Table 1

Table 2

Table 3

Table 4

In this embodiment, the whole methylation prediction process is completed automatically by a computer. Compared with the prior art, it requires no manual participation and no map drawing, saves time, and is inexpensive. In addition, predicting whether a protein is methylated from a classification perspective can give better prediction accuracy and can provide better support for disease research, drug design, and related fields.

A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.

Embodiment 2

Fig. 2 shows a detailed structural block diagram of the device for predicting methylation provided by Embodiment 2 of the present invention; for ease of explanation, only the parts relevant to the embodiment are shown. The device 2 for predicting methylation comprises: a data download unit 21, a raw data acquisition unit 22, a preprocessing unit 23, an encoding unit 24, and a classification unit 25.

Here, the data download unit 21 is configured to download methylated data;

the raw data acquisition unit 22 is configured to acquire original protein sequence data according to the methylated data;

the preprocessing unit 23 is configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;

the encoding unit 24 is configured to encode the character string data in the positive data set and the negative data set to obtain numeric data;

the classification unit 25 is configured to model the numeric data in the positive data set and the negative data set using a classification algorithm, calculate the optimal partitioning scheme from the resulting model, and finally, according to the partitioning scheme, divide the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.

Specifically, the raw data acquisition unit 22 comprises:

Specifically, the preprocessing unit 23 comprises:

Specifically, the encoding method used by the encoding unit 24 is one of probabilistic encoding, numeric-index encoding, orthogonal encoding, and binary encoding.

Specifically, the classification algorithm used by the classification unit 25 is one of random forest and random tree.

The device for predicting methylation provided by this embodiment of the present invention can be applied in the corresponding method of Embodiment 1 above; for details, refer to the description of Embodiment 1, which is not repeated here.

It should be noted that in the above system embodiment the units are divided only according to functional logic; the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the present invention.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for predicting methylation, characterized in that the method comprises:

downloading methylated data;

2. The method of claim 1, characterized in that acquiring original protein sequence data according to the methylated data comprises:

3. The method of claim 1, characterized in that preprocessing the original protein sequence data to obtain a positive data set and a negative data set comprises:

4. The method of claim 1, characterized in that the encoding method used to encode the character string data in the positive data set and the negative data set into numeric data is one of probabilistic encoding, numeric-index encoding, orthogonal encoding, and binary encoding.

5. The method of claim 1, characterized in that the classification algorithm is one of random forest and random tree.

6. A device for predicting methylation, characterized in that the device comprises:

a data download unit, configured to download methylated data;

7. The device of claim 6, characterized in that the raw data acquisition unit comprises:

8. The device of claim 6, characterized in that the preprocessing unit comprises:

9. The device of claim 6, characterized in that the encoding method used by the encoding unit is one of probabilistic encoding, numeric-index encoding, orthogonal encoding, and binary encoding.

10. The device of claim 6, characterized in that the classification algorithm used by the classification unit is one of random forest and random tree.