CN103559423A - Method and device for predicting methylation - Google Patents

Method and device for predicting methylation

Info

Publication number
CN103559423A
Authority
CN
China
Prior art keywords
data
methylation
negative
positive
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310534661.8A
Other languages
Chinese (zh)
Other versions
CN103559423B (en
Inventor
周丰丰
赵苗苗
张召
刘记奎
葛瑞泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310534661.8A priority Critical patent/CN103559423B/en
Publication of CN103559423A publication Critical patent/CN103559423A/en
Application granted granted Critical
Publication of CN103559423B publication Critical patent/CN103559423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention belongs to the technical field of bioinformatics and provides a method and a device for predicting methylation. The method comprises: downloading methylation data; acquiring original protein sequence data according to the methylation data; preprocessing the original protein sequence data to obtain a positive data set and a negative data set; encoding the character-string data in the positive data set and the negative data set to obtain numeric data; modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally dividing the data in the data set whose methylation is to be predicted into two classes according to that partition, one class being methylated data and the other being unmethylated data. The invention requires no manual participation and no map drawing, which saves time and keeps costs low.

Description

A method and device for predicting methylation
Technical field
The invention belongs to the technical field of bioinformatics, and in particular relates to a method and device for predicting methylation.
Background
Methylation is an important modification of proteins and nucleic acids. It regulates the switching on and off of gene expression, is closely associated with many conditions such as cancer, aging and senile dementia, and is one of the major research topics in epigenetics. Understanding specific methylation mechanisms would therefore influence many areas of molecular biology and would also be of great help in disease research, drug design and related fields.
Joseph Ecker and colleagues at the Salk Institute for Biological Studies in the United States used high-throughput sequencing to produce a complete map of all methylcytosines in a human embryonic stem cell. Meissner and colleagues at the Whitehead Institute in the United States had also drawn a similar map; using high-throughput bisulfite sequencing and single-molecule sequencing, they produced a DNA methylation map covering most CpG islands.
In addition, two independent research groups, one led by George Church at Harvard University and the other by Kun Zhang at the University of California together with Yuan Gao at Virginia Commonwealth University, combined bisulfite conversion of DNA, the traditional tool for methylation analysis, with targeted genome capture and high-throughput sequencing to quantitatively measure methylation in the human genome.
Although the methods for drawing these methylation maps differ slightly, they all use bisulfite conversion, which converts unmethylated cytosine to uracil; the uracil is then converted to thymine in the subsequent amplification step. Although this way of measuring methylation is very effective, it requires some manual operations to ensure complete conversion, and the map has to be drawn through computational analysis.
In short, measuring methylation by the laboratory means described above, whether the technique is in vivo or in vitro, is not only time-consuming and expensive but also limited by enzymatic reactions.
Summary of the invention
Embodiments of the present invention provide a method and device for predicting methylation, intended to solve the problem that the methylation measurement methods of the prior art are time-consuming, expensive and limited by enzymatic reactions.
In one aspect, a method for predicting methylation is provided, the method comprising:
Downloading methylation data;
Acquiring original protein sequence data according to the methylation data;
Preprocessing the original protein sequence data to obtain a positive data set and a negative data set;
Encoding the character-string data in the positive data set and the negative data set to obtain numeric data;
Modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the model obtained, and finally dividing the data in the data set whose methylation is to be predicted into two classes according to the partition, one class being methylated data and the other being unmethylated data.
Further, acquiring the original protein sequence data according to the methylation data comprises:
Reading the names of the methylated proteins from the methylation data in turn;
Looking up the data corresponding to each protein name in turn on the web page http://www.uniprot.org/uniprot/;
Assembling these data into the original protein sequence corresponding to each protein name, where the original protein sequence data comprise, for each protein name in the methylation data, both the methylated data and the unmethylated data.
Further, preprocessing the original protein sequence data to obtain the positive data set and the negative data set comprises:
Extracting character strings of a preset length, centered on K or R, from the original protein sequence data;
Taking the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;
Adding the positive controls to the positive data set and the negative controls to the negative data set.
Further, the encoding method used to encode the character-string data in the positive data set and the negative data set into numeric data is one of probabilistic encoding, numerical numbering encoding, orthogonal encoding and binary encoding.
Further, the classification algorithm is one of random forest (RandomForest) and random tree (RandomTree).
In another aspect, a device for predicting methylation is provided, the device comprising:
A data download unit, configured to download methylation data;
An original data acquisition unit, configured to acquire original protein sequence data according to the methylation data;
A preprocessing unit, configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
An encoding unit, configured to encode the character-string data in the positive data set and the negative data set to obtain numeric data;
A classification unit, configured to model the numeric data in the positive data set and the negative data set with a classification algorithm, calculate the optimal partition from the model obtained, and finally divide the data in the data set whose methylation is to be predicted into two classes according to the partition, one class being methylated data and the other being unmethylated data.
Further, the original data acquisition unit comprises:
A protein name acquisition module, configured to read the names of the methylated proteins from the methylation data in turn;
A data lookup module, configured to look up the data corresponding to each protein name in turn on the web page http://www.uniprot.org/uniprot/;
A data assembly module, configured to assemble these data into the original protein sequence corresponding to each protein name, where the original protein sequence data comprise, for each protein name in the methylation data, both the methylated data and the unmethylated data.
Further, the preprocessing unit comprises:
A character string selection module, configured to extract character strings of a preset length, centered on K or R, from the original protein sequence data;
A positive and negative control acquisition module, configured to take the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;
A data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
Further, the encoding method used by the encoding unit is one of probabilistic encoding, numerical numbering encoding, orthogonal encoding and binary encoding.
Further, the classification algorithm used by the classification unit is one of random forest and random tree.
In the embodiments of the present invention, the whole methylation prediction process is completed automatically by a computer. Compared with the prior art, it requires no manual participation and no map drawing, which saves time and reduces cost.
Brief description of the drawings
Fig. 1 is a flowchart of the methylation prediction method provided by Embodiment 1 of the present invention;
Fig. 2 is a structural block diagram of the methylation prediction device provided by Embodiment 2 of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In the embodiments of the present invention, methylation data are first downloaded, and original protein sequence data are acquired according to the methylation data. The original protein sequence data are then preprocessed to obtain a positive data set and a negative data set. Next, the character-string data in the positive data set and the negative data set are encoded to obtain numeric data. Finally, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the model obtained, and the data in the data set whose methylation is to be predicted are divided into two classes according to the partition, one class being methylated data and the other being unmethylated data.
The implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment 1
Fig. 1 shows the flow of the methylation prediction method provided by Embodiment 1 of the present invention, described in detail as follows:
In step S101, methylation data are downloaded.
In this embodiment, the methylation data can be downloaded from http://dbptm.mbc.nctu.edu.tw/download.php; the data come from the dbPTM database. The downloaded data are stored in the files Methylation_K.txt and Methylation_R.txt: Methylation_K.txt contains the data for methylation at lysine (K), and Methylation_R.txt contains the data for methylation at arginine (R). In a specific implementation, classification and prediction are carried out separately for the data in Methylation_K.txt and in Methylation_R.txt.
In step S102, original protein sequence data are acquired according to the methylation data.
In this embodiment, the process of acquiring the original protein sequence data comprises:
Step 1: read the names of the methylated proteins from the methylation data in turn;
Step 2: look up the data corresponding to each protein name in turn on the web page http://www.uniprot.org/uniprot/;
Step 3: assemble these data into the original protein sequence corresponding to each protein name, where the original protein sequence data comprise, for each protein name in the methylation data, both the methylated data and the unmethylated data.
Because Methylation_K.txt contains no original protein sequence data, only protein sequence data for a total of 1013 methylation sites, the names of the methylated proteins listed in Methylation_K.txt are needed to obtain the unmethylated data of each protein.
In a specific implementation, the above three steps can be implemented in Python code, which obtains from the web page http://www.uniprot.org/uniprot/ the original protein sequence data corresponding to the 1013 methylation sites in Methylation_K.txt.
Methylation_R.txt is processed in the same way: the Python code obtains from the web page http://www.uniprot.org/uniprot/ the original protein sequence data corresponding to the 1192 methylation sites in Methylation_R.txt.
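The patent does not reproduce its Python code. The following is only a minimal sketch of the lookup step, assuming that a protein's sequence can be retrieved in FASTA format by appending a hypothetical `<accession>.fasta` suffix to the base URL quoted above; that URL pattern, the accession-based lookup and all function names are assumptions for illustration, and the current UniProt service may use a different address.

```python
from urllib.request import urlopen

UNIPROT_BASE = "http://www.uniprot.org/uniprot/"  # base URL quoted in the text


def fasta_url(accession):
    # Assumption: "<accession>.fasta" appended to the base URL returns a FASTA record.
    return UNIPROT_BASE + accession + ".fasta"


def parse_fasta(text):
    """Return the sequence of a single-record FASTA string (header line dropped)."""
    lines = [ln.strip() for ln in text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def fetch_sequence(accession):
    # Performs a network request; shown for completeness only.
    with urlopen(fasta_url(accession)) as resp:
        return parse_fasta(resp.read().decode("utf-8"))
```

Only `parse_fasta` and `fasta_url` are pure functions; `fetch_sequence` depends on the remote service being reachable under the assumed address.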
In step S103, the original protein sequence data are preprocessed to obtain a positive data set and a negative data set.
This embodiment defines a key concept: the methylation-site peptide. The peptide formed by a methylation site, the m residues before it and the n residues after it is called the methylation-site peptide PSP(m, n). Its biological rationale is that the properties of a methylation site are usually determined by the biochemical properties of its neighboring amino acids. In practice, the 11-residue peptide PSP(5, 5) is mainly considered, i.e. m and n are both 5; when fewer than 5 residues are available on either side of the methylation site, the missing positions are filled with '-'.
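The PSP(m, n) definition can be sketched as a short function. This is an illustrative sketch, not the patent's code: the function name `psp` and the use of a 0-based site index are assumptions.

```python
def psp(sequence, site, m=5, n=5, pad="-"):
    """Return the peptide PSP(m, n) around the 0-based index `site`.

    Positions that fall outside the sequence are filled with `pad`.
    """
    left = sequence[max(0, site - m):site]
    right = sequence[site + 1:site + 1 + n]
    return pad * (m - len(left)) + left + sequence[site] + right + pad * (n - len(right))
```

For a site near the start of a protein the left side is padded, and for a site near the end the right side is padded, so every peptide has length m + n + 1.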
The process of preprocessing the original protein sequence data to obtain the positive data set and the negative data set comprises:
Step 11: divide the original protein sequence data into positive controls and negative controls.
Specifically, character strings of a preset length, centered on K or R, are first extracted from the original protein sequence data; the methylated character strings are then taken as positive controls and the other, unmethylated character strings as negative controls.
Step 12: add the positive controls to the positive data set and the negative controls to the negative data set.
For example, for methylation at lysine (K), the 1013 methylation sites are treated as positive controls, and all other K sites on the protein sequences containing these 1013 methylation sites are treated as negative controls.
Here, character strings of the preset length centered on K are extracted from the original protein sequence data. Some of these strings have been shown experimentally to be methylated; these are taken as positive controls, and the other, unmethylated strings are taken as negative controls.
In addition, to increase prediction accuracy, a data verification step is added to the preprocessing of the original protein sequence data: it is checked whether each of the 1013 methylation sites is genuinely methylated; for those that are, it is checked whether the 1013 methylation sites contain duplicates, and any duplicates are removed. Finally, the resulting positive controls are added to the positive data set P and the resulting negative controls to the negative data set N.
The positive data set P and the negative data set N then contain 1012 positive methylation records and 23915 negative methylation records, respectively. All data in the data sets are represented as 11-residue peptides PSP(5, 5).
Methylation at arginine (R) is handled in the same way as methylation at lysine (K) above. The 1192 methylation sites are treated as positive controls, and all other R sites on the protein sequences containing these 1192 methylation sites are treated as negative controls.
Likewise, it can first be checked whether the 1192 methylation sites are genuinely methylated, and duplicates among the genuinely methylated sites are then removed; the positive data set P and the negative data set N finally obtained contain 1189 and 32505 11-residue peptides PSP(5, 5), respectively.
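The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions: each protein is given as a hypothetical pair of its sequence and a set of 0-based methylated positions, a reported site that is not actually the target residue is simply ignored, and only duplicate positive peptides are removed. The function names are not from the patent.

```python
def window(seq, i, k=5):
    """11-residue peptide centered on index i, padded with '-' at the ends."""
    left = seq[max(0, i - k):i].rjust(k, "-")
    right = seq[i + 1:i + 1 + k].ljust(k, "-")
    return left + seq[i] + right


def build_sets(proteins, residue="K"):
    """proteins: iterable of (sequence, set of 0-based methylated positions)."""
    positives, negatives = [], []
    seen = set()  # used to drop duplicate positive peptides
    for seq, sites in proteins:
        for i, aa in enumerate(seq):
            if aa != residue:
                continue  # only K (or R) sites are candidates
            peptide = window(seq, i)
            if i in sites:
                if peptide not in seen:
                    seen.add(peptide)
                    positives.append(peptide)
            else:
                negatives.append(peptide)
    return positives, negatives
```

Calling `build_sets(proteins, residue="R")` gives the corresponding sets for arginine.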
In step S104, the character-string data in the positive data set and the negative data set are encoded to obtain numeric data.
In this embodiment, the character-string data in the data sets can be converted into numeric data by encoding methods such as probabilistic encoding, numerical numbering encoding and orthogonal encoding. Each is described in detail below:
(1) Probabilistic encoding
First, the probability of each letter occurring at each position is computed from the data in data set P, giving a probability matrix; this matrix is then used to convert the character-string data in data set P into numeric data.
For example, there are 1012 positive records for methylation at lysine (K). Each record has length 11 (regarded as 11 positions), and the records contain 21 distinct characters in total, namely the letters and '-'. Counting the probability of each character at each position over these 1012 records gives a 21*11 probability matrix M, and the character-string data in data sets P and N are then converted into numeric data with 11 features according to M.
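A minimal sketch of probabilistic encoding follows. It stores the probability matrix as one dictionary per position rather than as a 21*11 array, and it assumes that a character never observed at a position in P is encoded as 0.0; both choices and the function names are assumptions, since the text does not specify them.

```python
from collections import Counter


def position_probabilities(positives):
    """One dict per position: character -> relative frequency within P."""
    length = len(positives[0])
    total = len(positives)
    table = []
    for pos in range(length):
        counts = Counter(p[pos] for p in positives)
        table.append({ch: c / total for ch, c in counts.items()})
    return table


def encode_probabilistic(peptide, table):
    # Assumption: characters unseen at a position in P get probability 0.0.
    return [table[pos].get(ch, 0.0) for pos, ch in enumerate(peptide)]
```

Each 11-character peptide from P or N is thereby mapped to 11 numeric features.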
(2) Numerical numbering encoding
Each character in the character-string data is uniquely identified by a decimal number. For example, the character-string data in this embodiment contain 21 characters, so the numbers 1 to 21 can be used to represent the characters, thereby converting the character-string data into numeric data.
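A sketch of numerical numbering encoding is given below. The text does not say which character receives which number, so the alphabetical ordering of the 20 amino acid letters followed by '-' is an assumption, as are the names `ALPHABET` and `encode_numeric`.

```python
# Assumed ordering: 20 amino acid letters, then the padding character '-'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
CODE = {ch: i for i, ch in enumerate(ALPHABET, start=1)}  # values 1..21


def encode_numeric(peptide):
    """Map each character to its number, giving one feature per position."""
    return [CODE[ch] for ch in peptide]
```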
(3) Orthogonal encoding
In this embodiment, each of the 21 characters in the character-string data is replaced by a 21-bit binary code in which exactly one bit is 1 and all other bits are 0. Suppose the 20 amino acids are {A, C, D, ...}; adding the character '-' gives 21 characters in total. Then A is encoded as 000000000000000000001 (twenty 0s), C as 000000000000000000010 (twenty 0s), D as 000000000000000000100 (twenty 0s), and so on. Each binary digit is then treated as one feature, so the 11 characters of each string are encoded as 11*21=231 0/1 features.
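Orthogonal (one-hot) encoding can be sketched as below, following the bit layout of the example, where A sets the rightmost bit and C the second bit from the right. The ordering of the remaining characters and the function names are assumptions.

```python
# Assumed ordering: 20 amino acid letters, then the padding character '-'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(ch):
    """21-bit code with a single 1; 'A' sets the last bit, 'C' the one before it."""
    bits = [0] * len(ALPHABET)
    bits[len(ALPHABET) - 1 - ALPHABET.index(ch)] = 1
    return bits


def encode_orthogonal(peptide):
    """Concatenate the 21-bit codes of all characters (11 * 21 = 231 features)."""
    return [b for ch in peptide for b in one_hot(ch)]
```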
(4) Binary encoding
First, each letter in the character-string data is uniquely identified by a 5-bit binary value; each binary digit is then treated as one feature, so each string is converted into 11*5=55 features.
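A sketch of binary encoding follows. Five bits suffice because 21 characters need values no larger than 21 (binary 10101). Which character maps to which value is not stated in the text, so the ordering below and the function name are assumptions.

```python
# Assumed ordering: 20 amino acid letters, then the padding character '-'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_binary(peptide, width=5):
    """Encode each character as a fixed-width binary number, one feature per bit."""
    features = []
    for ch in peptide:
        code = ALPHABET.index(ch) + 1  # values 1..21 fit into 5 bits
        features.extend(int(b) for b in format(code, "0{}b".format(width)))
    return features
```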
To further improve the accuracy of methylation prediction, after the protein sequence data of the methylation sites have been obtained, data on the structurally unstable regions of the proteins can also be derived from them, giving a new character string of length 11. This new string is appended to the original 11-residue peptide PSP(5, 5) to give a string of total length 22, which is then encoded by one of the methods described in step S104 and converted into numeric data for the subsequent classification and prediction.
In step S105, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the model obtained, and the data in the data set whose methylation is to be predicted are finally divided into two classes according to the partition, one class being methylated data and the other being unmethylated data.
In the embodiments of the present invention, the classification algorithm can be either random forest (RandomForest) or random tree (RandomTree). The classification algorithm may also be a neural network algorithm, nearest-neighbor algorithm, Bayesian algorithm, WebLogo algorithm, genetic algorithm, Lasso algorithm, LibSVM algorithm, support vector machine, clustering algorithm, decision stump (Decision Stump) algorithm, Logistic algorithm, etc.; this embodiment does not limit which classification algorithm is used.
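To illustrate what calculating an optimal partition from a model can look like, the sketch below fits a decision stump, one of the alternative classifiers listed above. It is deliberately not the random forest or random tree preferred by the embodiment, and it is a hypothetical pure-Python illustration rather than the patent's implementation: it searches for the single feature and threshold that split the training data with the fewest errors.

```python
def fit_stump(X, y):
    """Return (feature, threshold, polarity) with the fewest training errors.

    X is a list of numeric feature rows; y holds labels 1 (methylated) or 0.
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [1 if polarity * (row[f] - t) >= 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]


def predict_stump(model, X):
    """Assign each row to class 1 or 0 according to the learned split."""
    f, t, polarity = model
    return [1 if polarity * (row[f] - t) >= 0 else 0 for row in X]
```

A random forest combines many deeper trees built on random subsets of samples and features; the stump only shows the basic idea of learning a split and then applying it to new data.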
The data set whose methylation is to be predicted can be a new data set different from the positive data set and the negative data set, or it can be all or part of the data in the positive data set and the negative data set.
In practice, the ways of carrying out methylation prediction on the data include:
1) All training set: self-validation (the model is built on the whole data set and then used to predict the whole data set)
The model can be built with all the data in the positive data set and the negative data set, and all the data in the positive data set and the negative data set are then predicted.
2) 30% test: the model can be built with part of the data in the positive data set and the negative data set, and the remaining part is predicted.
For example, the model can be built with 70% of the data in the positive data set and the negative data set, and the remaining 30% is predicted.
3) 3-fold cross-validation (Fold-3)
4) 10-fold cross-validation (Fold-10)
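The hold-out and k-fold schemes listed above can be sketched as index splits. The shuffling with a fixed seed and the function names are assumptions for illustration; the text does not say how the samples are assigned to the parts.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), e.g. a 70% / 30% split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once per fold; k=3 or k=10 in the text."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

In k-fold cross-validation every sample is used for testing exactly once, and the model is rebuilt k times on the remaining folds.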
To judge prediction performance, the actual type of the data to be predicted must be known. The classification algorithm then predicts the type of these data; if the predicted type matches the known type, the prediction is correct, otherwise it is wrong.
As a preferred embodiment of the present invention, the prediction results can be further evaluated to verify their reliability.
In a specific application, to check the reliability of each prediction method, four evaluation criteria are used to assess the performance of the prediction results: sensitivity (Sn), specificity (Sp), accuracy (Ac) and the Matthews correlation coefficient (MCC). The larger the values of the four criteria, the more accurate and the more stable the prediction results.
Sn, Sp, Ac and MCC are given by the following formulas:
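$$\mathrm{Sn} = \frac{TP}{TP + FN};$$
$$\mathrm{Sp} = \frac{TN}{TN + FP};$$
$$\mathrm{Ac} = \frac{TP + TN}{TP + FP + TN + FN};$$
$$\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}.$$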
Here TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives obtained in testing, respectively.
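The four criteria can be computed directly from the confusion counts. The sketch below uses the standard definition of the Matthews correlation coefficient, with a square root in the denominator; the function name and the convention of returning 0.0 when the denominator is zero are assumptions, and both classes are assumed to be non-empty.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0.0 when the denominator vanishes.
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

Because the negative set is far larger than the positive set in this document (for example 23915 versus 1012 peptides for lysine), accuracy alone can look high even for a poor classifier, which is one reason to report MCC alongside it.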
Specifically, the performance of the prediction results obtained with the various encodings is shown in Tables 1 to 4. Table 1 gives the performance of predicting whether methylation occurs at lysine (K) using probabilistic encoding; Table 2 gives the performance of predicting whether methylation occurs at lysine (K) using numerical numbering encoding; Table 3 gives the performance of predicting whether methylation occurs at arginine (R) using probabilistic encoding; and Table 4 gives the performance of predicting whether methylation occurs at arginine (R) using numerical numbering encoding.
[Table 1: image not reproduced in this text]
[Table 2: image not reproduced in this text]
[Table 3: image not reproduced in this text]
[Table 4: image not reproduced in this text]
In this embodiment, the whole methylation prediction process is completed automatically by a computer. Compared with the prior art, it requires no manual participation and no map drawing, which saves time and reduces cost. In addition, predicting whether a protein is methylated from a classification perspective gives better prediction accuracy and can better assist disease research, drug design and related work.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc.
Embodiment 2
Fig. 2 shows the structural block diagram of the methylation prediction device provided by Embodiment 2 of the present invention; for ease of explanation, only the parts relevant to the embodiment of the present invention are shown. The methylation prediction device 2 comprises: a data download unit 21, an original data acquisition unit 22, a preprocessing unit 23, an encoding unit 24 and a classification unit 25.
The data download unit 21 is configured to download methylation data;
The original data acquisition unit 22 is configured to acquire original protein sequence data according to the methylation data;
The preprocessing unit 23 is configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
The encoding unit 24 is configured to encode the character-string data in the positive data set and the negative data set to obtain numeric data;
The classification unit 25 is configured to model the numeric data in the positive data set and the negative data set with a classification algorithm, calculate the optimal partition from the model obtained, and finally divide the data in the data set whose methylation is to be predicted into two classes according to the partition, one class being methylated data and the other being unmethylated data.
Specifically, the original data acquisition unit 22 comprises:
A protein name acquisition module, configured to read the names of the methylated proteins from the methylation data in turn;
A data lookup module, configured to look up the data corresponding to each protein name in turn on the web page http://www.uniprot.org/uniprot/;
A data assembly module, configured to assemble these data into the original protein sequence corresponding to each protein name, where the original protein sequence data comprise, for each protein name in the methylation data, both the methylated data and the unmethylated data.
Specifically, the preprocessing unit 23 comprises:
A character string selection module, configured to extract character strings of a preset length, centered on K or R, from the original protein sequence data;
A positive and negative control acquisition module, configured to take the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;
A data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
Specifically, the encoding method used by the encoding unit 24 is one of probabilistic encoding, numerical numbering encoding, orthogonal encoding and binary encoding.
Specifically, the classification algorithm used by the classification unit 25 is one of random forest and random tree.
The methylation prediction device provided by this embodiment of the present invention can be applied in the corresponding method Embodiment 1 above; for details, see the description of Embodiment 1, which is not repeated here.
It should be noted that the units included in the above device embodiment are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for predicting methylation, characterized in that the method comprises:
Downloading methylation data;
Acquiring original protein sequence data according to the methylation data;
Preprocessing the original protein sequence data to obtain a positive data set and a negative data set;
Encoding the character-string data in the positive data set and the negative data set to obtain numeric data;
Modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the model obtained, and finally dividing the data in the data set whose methylation is to be predicted into two classes according to the partition, one class being methylated data and the other being unmethylated data.
2. The method of claim 1, characterized in that acquiring the original protein sequence data according to the methylation data comprises:
Reading the names of the methylated proteins from the methylation data in turn;
Looking up the data corresponding to each protein name in turn on the web page http://www.uniprot.org/uniprot/;
Assembling these data into the original protein sequence corresponding to each protein name, where the original protein sequence data comprise, for each protein name in the methylation data, both the methylated data and the unmethylated data.
3. the method for claim 1, is characterized in that, described described urporotein sequence data is carried out to pre-service, obtains positive data set and negative data set comprises:
Centered by K or R, from described urporotein sequence data, choose the character string of preseting length;
Using by the character string of methylation as positive control, and other not by the character string of methylation as negative control;
Positive control is added into positive data centralization, negative control is added into negative data centralization.
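The window selection in claim 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_windows`, the flank size, the 1-based position convention, and the use of "X" to pad windows that run past the sequence ends are all assumptions, since the claim only states that strings of a preset length centered on K or R are selected.

```python
def extract_windows(sequence, methylated_positions, flank=7, pad="X"):
    """Split K/R-centered windows into positive and negative sets.

    sequence: protein sequence as a string of one-letter residue codes.
    methylated_positions: set of 1-based positions known to be methylated
        (assumed input format, not specified by the claim).
    flank: residues kept on each side, so each window has 2 * flank + 1 letters.
    pad: filler character for windows near the termini (an assumption).
    """
    positives, negatives = [], []
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue not in ("K", "R"):
            continue
        # In the padded string, residue i sits at index i + flank,
        # so this slice is centered on it.
        window = padded[i : i + 2 * flank + 1]
        if (i + 1) in methylated_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives
```

For example, `extract_windows("MKARTK", {2}, flank=2)` places the window around the lysine at position 2 in the positive set and the windows around the remaining K and R residues in the negative set.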
4. the method for claim 1, it is characterized in that, the described string data to described positive data set and described negative data centralization is encoded, and obtains coding method in numeric type data and comprises a kind of in probabilistic type coding, value number type coding, orthogonal type coding and binary coding.
5. the method for claim 1, is characterized in that, described sorting algorithm is a kind of in random forest, random tree.
6. A device for predicting methylation, characterized in that the device comprises:
A data download unit, configured to download methylated data;
A raw data acquisition unit, configured to acquire original protein sequence data according to the methylated data;
A preprocessing unit, configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
A coding unit, configured to encode the character string data in the positive data set and the negative data set to obtain numeric data;
A classification unit, configured to model the numeric data in the positive data set and the negative data set using a classification algorithm, calculate the optimal division mode from the resulting model, and finally, according to the division mode, divide the data in the data set whose methylation status is to be predicted into two classes, one class being methylated data and the other being unmethylated data.
7. The device of claim 6, characterized in that the raw data acquisition unit comprises:
A protein name acquisition module, configured to read the names of the methylated proteins one by one from the methylated data;
A data search module, configured to look up, one by one according to the protein names, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;
A data concatenation module, configured to compose from these data the original protein sequence corresponding to each protein name, wherein the original protein sequence data comprises, for each protein name in the methylated data, the corresponding methylated data and unmethylated data.
8. The device of claim 6, characterized in that the preprocessing unit comprises:
A character string selection module, configured to select, from the original protein sequence data, character strings of a preset length centered on K or R;
A positive/negative control acquisition module, configured to take the methylated character strings as positive controls and the other, unmethylated character strings as negative controls;
A data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
9. The device of claim 6, characterized in that the coding method adopted by the coding unit is one of probabilistic coding, numeric-value coding, orthogonal coding, and binary coding.
10. The device of claim 6, characterized in that the classification algorithm adopted by the classification unit is one of random forest and random tree.
CN201310534661.8A 2013-10-31 2013-10-31 Method and device for predicting methylation Active CN103559423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310534661.8A CN103559423B (en) 2013-10-31 2013-10-31 Method and device for predicting methylation

Publications (2)

Publication Number Publication Date
CN103559423A CN103559423A (en) 2014-02-05
CN103559423B CN103559423B (en) 2017-02-15

Family

ID=50013669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310534661.8A Active CN103559423B (en) 2013-10-31 2013-10-31 Method and device for predicting methylation

Country Status (1)

Country Link
CN (1) CN103559423B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102220412A (en) * 2004-11-29 2011-10-19 塞昆纳姆股份有限公司 Kits and methods for detecting methylated DNA
US20120122088A1 (en) * 2010-11-15 2012-05-17 Hongzhi Zou Methylation assay
CN103310126A (en) * 2013-07-04 2013-09-18 中国人民解放军国防科学技术大学 Classification-model building method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
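ADRIAN BIRD: "The Essentials of DNA Methylation", Cell *
凡时财 et al.: "Prediction of CpG island methylation profiles in the human genome", Chinese Science Bulletin *
施绍萍: "Research on new methods for protein function prediction based on support vector machines", China Doctoral Dissertations Full-text Database, Basic Sciences *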

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247873A (en) * 2017-03-29 2017-10-13 电子科技大学 Method for identifying differential methylation sites
CN107247873B (en) * 2017-03-29 2020-04-14 电子科技大学 Differential methylation site recognition method
CN107506600A (en) * 2017-09-04 2017-12-22 上海美吉生物医药科技有限公司 Method and device for predicting cancer type based on methylation data
CN111180012A (en) * 2019-12-27 2020-05-19 哈尔滨工业大学 Gene identification method based on empirical Bayes and Mendelian randomized fusion
CN111627499A (en) * 2020-05-27 2020-09-04 广州市基准医疗有限责任公司 Methylation level vectorization representation and specific sequencing interval detection method and device
CN111627499B (en) * 2020-05-27 2020-12-08 广州市基准医疗有限责任公司 Methylation level vectorization representation and specific sequencing interval detection method and device

Also Published As

Publication number Publication date
CN103559423B (en) 2017-02-15

Similar Documents

Publication Publication Date Title
US20190156915A1 (en) Method, apparatus, device and storage medium for predicting protein binding site
Bertolazzi et al. Learning to classify species with barcodes
CN103559423A (en) Method and device for predicting methylation
Wang et al. A brief review of machine learning methods for RNA methylation sites prediction
Wang et al. EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction
Wang et al. Its2vec: fungal species identification using sequence embedding and random forest classification
Volkovich et al. The method of N-grams in large-scale clustering of DNA texts
Mu et al. iPseU-Layer: identifying RNA pseudouridine sites using layered ensemble model
Prezza et al. Detecting mutations by ebwt
Bhattacharyya et al. Mining the largest quasi-clique in human protein interactome
Nguyen et al. Efficient agglomerative hierarchical clustering for biological sequence analysis
Wang et al. Fusang: a framework for phylogenetic tree inference via deep learning
Li et al. Extracting DNA words based on the sequence features: non-uniform distribution and integrity
Bickmann et al. TEclass2: Classification of transposable elements using Transformers
Park A short report on the markov property of DNA sequences on 200-bp genomic units of roadmap genomics ChromHMM annotations: a computational perspective
Xu et al. The wide and deep flexible neural tree and its ensemble in predicting long non-coding RNA subcellular localization
Bhat et al. OTU clustering: A window to analyse uncultured microbial world
Gong et al. BDLR: lncRNA identification using ensemble learning
Achawanantakun et al. ncRNA consensus secondary structure derivation using grammar strings
Kishk et al. AmpliconNet: Sequence Based Multi-layer Perceptron for Amplicon Read Classification Using Real-time Data Augmentation
Tang et al. Sequence fusion algorithm of tumor gene sequencing and alignment based on machine learning
Al-Khafaji et al. A New Approach to Motif Templates Analysis via Compilation Technique
Liu et al. RMDGCN: Prediction of RNA methylation and disease associations based on graph convolutional network with attention mechanism
CN117976040A (en) Mutation pathogenicity annotation method, prediction mutation effect map construction method and system
Beknazarov et al. DeepZ: A Deep Learning Approach for Z-DNA Prediction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant