CN103559423B - Method and device for predicting methylation - Google Patents

Method and device for predicting methylation

Info

Publication number
CN103559423B
CN103559423B CN201310534661.8A CN201310534661A CN103559423B CN 103559423 B CN103559423 B CN 103559423B CN 201310534661 A CN201310534661 A CN 201310534661A CN 103559423 B CN103559423 B CN 103559423B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310534661.8A
Other languages
Chinese (zh)
Other versions
CN103559423A (en)
Inventor
周丰丰
赵苗苗
张召
刘记奎
葛瑞泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310534661.8A
Publication of CN103559423A
Application granted
Publication of CN103559423B

Abstract

The invention belongs to the technical field of bioinformatics and provides a method and a device for predicting methylation. The method comprises the following steps: downloading methylation data; obtaining original protein sequence data according to the methylation data; preprocessing the original protein sequence data to obtain a positive data set and a negative data set; encoding the string data in the positive data set and the negative data set to obtain numeric data; modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally, according to that partition, dividing the data in the data set whose methylation status is to be predicted into two classes, one being methylated data and the other being unmethylated data. With the invention, no manual participation is needed and no map needs to be drawn, so time is saved and the cost is low.

Description

Method and device for predicting methylation
Technical field
The invention belongs to the technical field of bioinformatics, and in particular relates to a method and a device for predicting methylation.
Background
Methylation is an important modification of proteins and nucleic acids. It regulates the switching on and off of gene expression, is closely associated with many diseases such as cancer, aging and senile dementia, and is one of the major research topics of epigenetics. Understanding the specific mechanisms of methylation will therefore influence many areas of current molecular biology, and is also of great help to disease research and drug design.
Joseph Ecker of the Salk Institute for Biological Studies in the United States and his colleagues used high-throughput sequencing to present a complete map of all methylated cytosines in a human embryonic stem cell. Meissner et al. of the Whitehead Institute in the United States also produced a similar map: using high-throughput bisulfite sequencing and single-molecule sequencing, they created a DNA methylation map covering most CpG islands.
In addition, two independent research groups, namely George Church et al. of Harvard University, and Kun Zhang of the University of California together with Yuan Gao of Virginia Commonwealth University et al., combined bisulfite conversion of DNA, a traditional tool for studying methylation, with targeted genome capture and high-throughput sequencing to quantitatively measure methylation in the human genome.
Although these methylation mapping methods differ slightly, all of them use bisulfite conversion, which turns unmethylated cytosines into uracil; the uracil is then converted into thymine in the subsequent amplification step. Although this kind of methylation assay is very effective, it requires some manual operations to ensure complete conversion, and the map must be drawn through computational analysis.
In short, measuring methylation by the above laboratory means, whether based on in vivo or in vitro techniques, is not only time-consuming and expensive but also limited by enzymatic reactions.
Summary of the invention
Embodiments of the present invention provide a method and a device for predicting methylation, intended to solve the problem that the methylation measurement methods provided by the prior art are time-consuming, expensive, and limited by enzymatic reactions.
In one aspect, a method for predicting methylation is provided, the method comprising:
downloading methylation data;
obtaining original protein sequence data according to the methylation data;
preprocessing the original protein sequence data to obtain a positive data set and a negative data set;
encoding the string data in the positive data set and the negative data set to obtain numeric data;
modeling the numeric data in the positive data set and the negative data set with a classification algorithm, calculating the optimal partition from the resulting model, and finally, according to that partition, dividing the data in the data set whose methylation status is to be predicted into two classes, one being methylated data and the other being unmethylated data.
Further, obtaining original protein sequence data according to the methylation data comprises:
reading the names of the methylated proteins one by one from the methylation data;
searching the web page http://www.uniprot.org/uniprot/ by protein name, in turn, for the data corresponding to each protein name;
assembling from these data the original protein sequence corresponding to each protein name, wherein the original protein sequence data include both the methylated data and the unmethylated data corresponding to each protein name in the methylation data.
Further, preprocessing the original protein sequence data to obtain a positive data set and a negative data set comprises:
selecting, from the original protein sequence data, strings of a preset length centered on K or R;
taking the strings that are methylated as positive controls and the other strings, which are not methylated, as negative controls;
adding the positive controls to the positive data set and the negative controls to the negative data set.
Further, the coding method used to encode the string data in the positive data set and the negative data set into numeric data is one of probability encoding, numeric-numbering encoding, orthogonal encoding and binary encoding.
Further, the classification algorithm is one of random forest and random tree (RandomTree).
In another aspect, a device for predicting methylation is provided, the device comprising:
a data download unit, configured to download methylation data;
an original data acquisition unit, configured to obtain original protein sequence data according to the methylation data;
a preprocessing unit, configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
a coding unit, configured to encode the string data in the positive data set and the negative data set to obtain numeric data;
a classification unit, configured to model the numeric data in the positive data set and the negative data set with a classification algorithm, calculate the optimal partition from the resulting model, and finally, according to that partition, divide the data in the data set whose methylation status is to be predicted into two classes, one being methylated data and the other being unmethylated data.
Further, the original data acquisition unit comprises:
a protein name acquisition module, configured to read the names of the methylated proteins one by one from the methylation data;
a data search module, configured to search the web page http://www.uniprot.org/uniprot/ by protein name, in turn, for the data corresponding to each protein name;
a data assembly module, configured to assemble from these data the original protein sequence corresponding to each protein name, wherein the original protein sequence data include both the methylated data and the unmethylated data corresponding to each protein name in the methylation data.
Further, the preprocessing unit comprises:
a string selection module, configured to select, from the original protein sequence data, strings of a preset length centered on K or R;
a positive/negative control acquisition module, configured to take the strings that are methylated as positive controls and the other strings, which are not methylated, as negative controls;
a data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
Further, the coding method used by the coding unit is one of probability encoding, numeric-numbering encoding, orthogonal encoding and binary encoding.
Further, the classification algorithm used by the classification unit is one of random forest and random tree.
In the embodiments of the present invention, the whole methylation prediction process is completed automatically by computer. Compared with the prior art, it saves time, requires no manual participation and no map drawing, and is inexpensive.
Brief description of the drawings
Fig. 1 is a flowchart of the methylation prediction method provided by Embodiment One of the present invention;
Fig. 2 is a structural block diagram of the methylation prediction device provided by Embodiment Two of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
In the embodiments of the present invention, methylation data are first downloaded and original protein sequence data are obtained according to the methylation data; the original protein sequence data are then preprocessed to obtain a positive data set and a negative data set; next, the string data in the positive data set and the negative data set are encoded to obtain numeric data; finally, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the resulting model, and according to that partition the data in the data set whose methylation status is to be predicted are divided into two classes, one being methylated data and the other being unmethylated data.
The implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment one
Fig. 1 shows the flow of the methylation prediction method provided by Embodiment One of the present invention, described in detail as follows:
In step S101, methylation data are downloaded.
In this embodiment, the methylation data can be obtained from the download address http://dbptm.mbc.nctu.edu.tw/download.php; the data come from the database dbPTM. The downloaded data are stored in the files Methylation_K.txt and Methylation_R.txt: Methylation_K.txt contains the lysine (K) methylation data, and Methylation_R.txt contains the arginine (R) methylation data. In a specific implementation, classification prediction must be carried out separately on the data in the two files Methylation_K.txt and Methylation_R.txt.
In step S102, original protein sequence data are obtained according to the methylation data.
In this embodiment, the process of obtaining the original protein sequence data comprises:
Step 1, reading the names of the methylated proteins one by one from the methylation data;
Step 2, searching the web page http://www.uniprot.org/uniprot/ by protein name, in turn, for the data corresponding to each protein name;
Step 3, assembling from these data the original protein sequence corresponding to each protein name, wherein the original protein sequence data include both the methylated data and the unmethylated data corresponding to each protein name in the methylation data.
The file Methylation_K.txt contains no original protein sequence data; it contains only the protein sequence data of a total of 1013 methylation site records. It is therefore necessary to use the names of the methylated proteins in the protein sequence data of Methylation_K.txt to obtain, for each protein, the data that are not methylated.
In a specific implementation, Python code can be written to carry out the above three steps, obtaining from the web page http://www.uniprot.org/uniprot/ the original protein sequence data corresponding to the 1013 methylation site records in the file Methylation_K.txt.
The file Methylation_R.txt is processed in the same way: Python code can be written to obtain from the web page http://www.uniprot.org/uniprot/ the original protein sequence data corresponding to the 1192 methylation site records in Methylation_R.txt.
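The three retrieval steps above might be sketched in Python as follows. This is only an illustration under stated assumptions: the patent names the page http://www.uniprot.org/uniprot/ but does not give a query format, so the REST URL template below, the use of UniProt accessions as protein identifiers, and the function names `parse_fasta`, `fetch_sequence` and `fetch_all` are all assumptions made for this sketch.

```python
import urllib.request

# Assumption: UniProt's REST FASTA endpoint. The patent only names
# http://www.uniprot.org/uniprot/ and does not specify the query format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return the sequence of a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(accession, timeout=30):
    """Download one protein sequence; requires network access."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


def fetch_all(accessions):
    """Map each distinct identifier to its sequence, querying each only once."""
    return {acc: fetch_sequence(acc) for acc in dict.fromkeys(accessions)}
```

Because many methylation sites lie on the same protein, `fetch_all` removes repeated identifiers before querying, so each sequence is downloaded a single time.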
In step S103, the original protein sequence data are preprocessed to obtain a positive data set and a negative data set.
In this embodiment, a basic concept is defined: the peptide segment of a methylation site. For a given methylation site, the peptide segment formed by the m residues before the site, the site itself, and the n residues after the site is called the methylation-site peptide PSP(m, n). Its biological significance is that the characteristics of a methylation site are generally determined by the biochemical properties of its neighboring amino acids. In practice, the 11-peptide PSP(5, 5) is mainly considered, that is, m and n are both 5; when fewer than 5 residues exist to the left or right of the methylation site, the missing positions are filled with '-'.
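A minimal Python sketch of the PSP(m, n) definition follows. The function name `psp` and the choice of 0-based site positions are assumptions made for illustration; the patent does not specify an indexing convention.

```python
def psp(sequence, index, m=5, n=5, pad="-"):
    """Return PSP(m, n) around the 0-based residue position `index`.

    Positions that fall outside the sequence are filled with `pad`.
    """
    if not 0 <= index < len(sequence):
        raise IndexError("site index outside the sequence")
    left = sequence[max(0, index - m):index]      # up to m residues before
    right = sequence[index + 1:index + 1 + n]     # up to n residues after
    return left.rjust(m, pad) + sequence[index] + right.ljust(n, pad)
```

For a site near the N-terminus, `psp("MKRAAKLLG", 1)` pads the left flank and keeps the K at the center of the 11-character string.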
Preprocessing the original protein sequence data to obtain a positive data set and a negative data set comprises:
Step 11, dividing the original protein sequence data into positive controls and negative controls.
Specifically, strings of a preset length centered on K or R are first selected from the original protein sequence data; the strings that are methylated are then taken as positive controls, and the other strings, which are not methylated, as negative controls.
Step 12, adding the positive controls to the positive data set and the negative controls to the negative data set.
For example, for lysine (K) methylation, the 1013 methylation site records are treated as positive controls, and all other K sites on the protein sequences of those 1013 records are treated as negative controls.
Here, strings of a preset length centered on K are selected from the original protein sequence data. Some of these strings have been experimentally shown to be methylated; these are taken as positive controls, and the other strings, which are not methylated, are taken as negative controls.
In addition, to increase prediction accuracy, a data validation step is added to the preprocessing of the original protein sequence data: it is checked whether the 1013 methylation site records are truly methylated; if so, it is checked whether these records contain duplicates, and any duplicates are removed. Finally, the resulting positive controls are added to the positive data set P and the resulting negative controls to the negative data set N.
The positive data set P and the negative data set N contain 1012 methylation-positive records and 23915 methylation-negative records, respectively. All records in the data sets are represented as 11-peptides PSP(5, 5).
Arginine (R) methylation is processed in the same way as lysine (K) methylation above. The 1192 methylation site records are treated as positive controls, and all other R sites on the protein sequences of those 1192 records are treated as negative controls.
Likewise, the 1192 methylation site records can first be checked for whether they are truly methylated, and redundant records among the truly methylated sites are then removed; the resulting positive data set P and negative data set N contain 1189 and 32505 11-peptides PSP(5, 5), respectively.
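The construction of the two data sets can be sketched as below. The input layout (a dictionary mapping a protein identifier to its sequence and a set of 0-based methylated positions) and the names `window` and `build_sets` are assumptions for this illustration; redundancy removal is applied to the positive peptides only, as in the validation step described above.

```python
def window(sequence, index, flank=5, pad="-"):
    """11-peptide PSP(5, 5) centered on the 0-based position `index`."""
    padded = pad * flank + sequence + pad * flank
    return padded[index:index + 2 * flank + 1]


def build_sets(proteins, residue="K", flank=5):
    """Split every `residue` site into positive (P) and negative (N) peptides.

    `proteins` maps a protein id to (sequence, set of methylated positions);
    this layout is hypothetical, chosen only for the sketch.
    """
    positives, negatives = [], []
    for sequence, sites in proteins.values():
        for i, amino_acid in enumerate(sequence):
            if amino_acid != residue:
                continue
            target = positives if i in sites else negatives
            target.append(window(sequence, i, flank))
    # Redundancy removal: keep the first copy of each positive peptide.
    positives = list(dict.fromkeys(positives))
    return positives, negatives
```

Calling `build_sets(proteins, residue="R")` would give the corresponding sets for arginine.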
In step S104, the string data in the positive data set and the negative data set are encoded to obtain numeric data.
In this embodiment, the string data in the data sets can be converted into numeric data by coding methods such as probability encoding, numeric-numbering encoding and orthogonal encoding. Each is described in detail below:
(1) Probability encoding
First, the probability of each letter occurring at each position is counted from the data in data set P, giving a probability statistics matrix; the string data in the data set are then converted into numeric data using this matrix.
For example, there are 1012 positive records for lysine (K) methylation, each of length 11 (regarded as 11 positions), and all records together contain 21 distinct symbols (the letters plus '-'). Counting, over these 1012 records, the probability of each symbol at each position gives a 21*11 probability statistics matrix M; the string data in data sets P and N are then converted, according to M, into numeric data with 11 features.
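A small Python sketch of probability encoding follows. The 21-symbol alphabet ordering and the function names are assumptions; symbols never seen at a position (or outside the alphabet) are given probability 0.0, which is a choice made for this sketch rather than something the patent states.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the pad character


def probability_matrix(peptides, alphabet=ALPHABET):
    """Return M[symbol][position]: frequency of each symbol at each position."""
    length = len(peptides[0])
    counts = {symbol: [0] * length for symbol in alphabet}
    for peptide in peptides:
        for position, symbol in enumerate(peptide):
            counts.setdefault(symbol, [0] * length)[position] += 1
    total = len(peptides)
    return {s: [c / total for c in row] for s, row in counts.items()}


def encode_probability(peptide, matrix):
    """Replace each symbol by its probability at that position."""
    return [
        matrix[symbol][position] if symbol in matrix else 0.0
        for position, symbol in enumerate(peptide)
    ]
```

The matrix is built from the positive set P only and is then applied to peptides from both P and N, so each 11-peptide becomes 11 numeric features.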
(2) Numeric-numbering encoding
Each character in the string data is uniquely identified by a decimal number. In this embodiment, for example, the string data contain 21 characters, so the numbers 1 to 21 can be used to represent each character, thereby converting the string data into numeric data.
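As an illustration, numeric-numbering encoding can be written in a few lines of Python. Which character receives which number is not specified in the text, so the alphabetical ordering used here is an assumption.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed ordering; 21 symbols
CODE = {symbol: number for number, symbol in enumerate(ALPHABET, start=1)}


def encode_numbering(peptide, code=CODE):
    """Map each character to its unique decimal number in 1..21."""
    return [code[symbol] for symbol in peptide]
```

Each 11-peptide therefore yields 11 integer features.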
(3) Orthogonal encoding
In this embodiment, each of the 21 characters in the string data is replaced by a 21-bit binary code in which exactly one bit is 1 and all other bits are 0. Suppose the 20 amino acids are {A, C, D, ...}; adding the character '-' gives 21 characters in total. Then A is encoded as 000000000000000000001 (20 zeros), C as 000000000000000000010 (20 zeros), D as 000000000000000000100 (20 zeros), and so on. The binary value at each position is then regarded as one feature, so the 11 characters of each string are encoded as 11*21=231 0/1 features.
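The orthogonal (one-hot) scheme can be sketched as follows. The bit positions follow the example in the text (A has its 1 in the rightmost bit, C in the next one to the left); the alphabet ordering beyond A, C, D is an assumption.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed ordering; 21 symbols


def one_hot(symbol, alphabet=ALPHABET):
    """21-bit code with a single 1, counted from the right as in the text."""
    bits = [0] * len(alphabet)
    bits[len(alphabet) - 1 - alphabet.index(symbol)] = 1
    return bits


def encode_orthogonal(peptide, alphabet=ALPHABET):
    """Concatenate the one-hot codes of all characters in the peptide."""
    features = []
    for symbol in peptide:
        features.extend(one_hot(symbol, alphabet))
    return features
```

An 11-peptide is turned into 231 binary features, exactly 11 of which are 1.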
(4) Binary encoding
Each letter in the string data is first uniquely identified by a 5-bit binary value; the binary value at each position is then regarded as one feature, so each string is converted into 11*5=55 features.
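A possible Python rendering of the 5-bit binary encoding is shown below. The text does not say which 5-bit value is assigned to which letter; this sketch assumes the numbers 1 to 21 in alphabetical order, all of which fit in 5 bits.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed ordering; 21 symbols


def encode_binary(peptide, alphabet=ALPHABET, width=5):
    """Encode each character as `width` bits and concatenate the bits."""
    features = []
    for symbol in peptide:
        number = alphabet.index(symbol) + 1  # 1..21, fits in 5 bits
        features.extend(int(bit) for bit in format(number, f"0{width}b"))
    return features
```

Compared with orthogonal encoding, this gives a much shorter feature vector (55 instead of 231 features) at the cost of bits that are shared between unrelated characters.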
Of course, to improve the accuracy of methylation prediction, it is also possible, after the protein sequence data of the methylation site data are obtained, to obtain data on the protein's unstable structural regions from those protein sequence data, and thereby obtain a new string of length 11. This new string of length 11 is appended to the original 11-peptide PSP(5, 5), giving a string of total length 22, which is then processed by the coding methods mentioned in step S104 and converted into numeric data for the subsequent classification prediction.
In step S105, the numeric data in the positive data set and the negative data set are modeled with a classification algorithm, the optimal partition is calculated from the resulting model, and finally, according to that partition, the data in the data set whose methylation status is to be predicted are divided into two classes, one being methylated data and the other being unmethylated data.
In the embodiments of the present invention, the classification algorithm can be either random forest (RandomForest) or random tree (RandomTree). The classification algorithm can of course also be a neural network algorithm, nearest neighbor algorithm, Bayesian algorithm, WebLogo algorithm, genetic algorithm, Lasso algorithm, LibSVM algorithm, support vector machine, clustering algorithm, decision stump (Decision Stump) algorithm, Logistic algorithm, and so on; which classification algorithm is used is not limited in this embodiment.
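To make the modeling step concrete, the sketch below implements a decision stump, the simplest of the algorithms listed above, in plain Python. It is not the random forest or random tree named as the preferred classifiers, and the function names and the exhaustive threshold search are choices made only for this illustration. The "optimal partition" here is the single feature threshold that misclassifies the fewest training records.

```python
def train_stump(rows, labels):
    """Fit a one-feature threshold rule that minimizes training errors.

    rows: equal-length lists of numeric features; labels: list of 0/1
    (1 = methylated). Returns (feature index, threshold, label predicted
    when the feature value exceeds the threshold).
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for high_label in (0, 1):
                errors = sum(
                    (high_label if row[feature] > threshold else 1 - high_label) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, high_label)
    return best[1:]


def predict_stump(stump, row):
    """Assign a record to one of the two classes using the fitted rule."""
    feature, threshold, high_label = stump
    return high_label if row[feature] > threshold else 1 - high_label
```

The exhaustive search is quadratic in the number of records, which is acceptable for a demonstration but would be slow on data sets of the size described in this embodiment.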
The data set whose methylation status is to be predicted may be a new data set different from the positive data set and the negative data set, or may be all or part of the data in the positive data set and the negative data set.
In practical applications, methods for carrying out methylation prediction on the data include:
1) All training set: self-validation (the model is built on the whole data set, and predictions are then made on the whole data set)
The model can be built with all the data in the positive data set and the negative data set, and all the data in the positive data set and the negative data set are then predicted.
2) 30% test: the model can be built with part of the data in the positive data set and the negative data set, and the remaining part of the data is predicted.
For example, the model can be built with 70% of the data in the positive data set and the negative data set, and the remaining 30% of the data is predicted.
3) 3-fold cross-validation (Fold-3)
4) 10-fold cross-validation (Fold-10)
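The hold-out and cross-validation schemes listed above can be sketched with the Python standard library as follows. The shuffling, the fixed random seed and the function names are assumptions; the text does not say how records are assigned to the parts.

```python
import random


def holdout_split(items, test_fraction=0.3, seed=0):
    """Return (train, test) with roughly `test_fraction` of items in test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k=10, seed=0):
    """Yield (train, test) pairs; every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With these helpers, `holdout_split(data)` corresponds to the 30% test, while `k_fold(data, k=3)` and `k_fold(data, k=10)` correspond to Fold-3 and Fold-10; self-validation simply trains and predicts on the same full list.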
To judge the performance of the prediction, the actual classes of the portion of the data set to be predicted must be known. The classification algorithm then predicts the classes of that portion; if a predicted class is the same as the known class, the prediction is correct, and otherwise it is an error.
As a preferred embodiment of the present invention, the prediction results can further be evaluated to verify their reliability.
In a specific application, to check the reliability of each prediction method, four evaluation measures are used to assess the performance of the prediction results: sensitivity (Sn), specificity (Sp), accuracy (Ac) and the Matthews correlation coefficient (MCC). The larger the values of the four measures, the more accurate and the more stable the prediction results.
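Sn, Sp, Ac and MCC satisfy the following formulas, respectively:
$$Sn = \frac{TP}{TP + FN}$$
$$Sp = \frac{TN}{TN + FP}$$
$$Ac = \frac{TP + TN}{TP + TN + FP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives obtained in the test, respectively.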
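The four measures can be computed from the counts TP, TN, FP and FN as in the sketch below, which uses the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The function name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
import math


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Return Sn, Sp, Ac and MCC for the given confusion-matrix counts."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "Sn": _ratio(tp, tp + fn),                  # sensitivity
        "Sp": _ratio(tn, tn + fp),                  # specificity
        "Ac": _ratio(tp + tn, tp + tn + fp + fn),   # accuracy
        "MCC": _ratio(tp * tn - fp * fn, denominator),
    }
```

Because the negative data set is much larger than the positive one in this embodiment, Ac alone can look high even for a weak classifier; MCC takes all four counts into account and is more informative in that situation.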
Specifically, the performance of the prediction results obtained with the various encodings is shown in Table 1, Table 2, Table 3 and Table 4. Table 1 gives the performance of predicting lysine (K) methylation with probability encoding; Table 2 gives the performance of predicting lysine (K) methylation with numeric-numbering encoding; Table 3 gives the performance of predicting arginine (R) methylation with probability encoding; and Table 4 gives the performance of predicting arginine (R) methylation with numeric-numbering encoding.
Table 1
Table 2
Table 3
Table 4
In this embodiment, the whole methylation prediction process is completed automatically by computer. Compared with the prior art, it requires no manual participation and no map drawing, saves time, and is inexpensive. In addition, predicting whether a protein is methylated from a classification perspective can give better prediction accuracy and can better assist disease research, drug design and related work.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
Embodiment two
Fig. 2 shows a structural block diagram of the methylation prediction device provided by Embodiment Two of the present invention; for ease of explanation, only the parts related to this embodiment are shown. The methylation prediction device 2 comprises: a data download unit 21, an original data acquisition unit 22, a preprocessing unit 23, a coding unit 24 and a classification unit 25.
The data download unit 21 is configured to download methylation data;
the original data acquisition unit 22 is configured to obtain original protein sequence data according to the methylation data;
the preprocessing unit 23 is configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
the coding unit 24 is configured to encode the string data in the positive data set and the negative data set to obtain numeric data;
the classification unit 25 is configured to model the numeric data in the positive data set and the negative data set with a classification algorithm, calculate the optimal partition from the resulting model, and finally, according to that partition, divide the data in the data set whose methylation status is to be predicted into two classes, one being methylated data and the other being unmethylated data.
Specifically, the original data acquisition unit 22 comprises:
a protein name acquisition module, configured to read the names of the methylated proteins one by one from the methylation data;
a data search module, configured to search the web page http://www.uniprot.org/uniprot/ by protein name, in turn, for the data corresponding to each protein name;
a data assembly module, configured to assemble from these data the original protein sequence corresponding to each protein name, wherein the original protein sequence data include both the methylated data and the unmethylated data corresponding to each protein name in the methylation data.
Specifically, the preprocessing unit 23 comprises:
a string selection module, configured to select, from the original protein sequence data, strings of a preset length centered on K or R;
a positive/negative control acquisition module, configured to take the strings that are methylated as positive controls and the other strings, which are not methylated, as negative controls;
a data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
Specifically, the coding method used by the coding unit 24 is one of probability encoding, numeric-numbering encoding, orthogonal encoding and binary encoding.
Specifically, the classification algorithm used by the classification unit 25 is one of random forest and random tree.
The methylation prediction device provided by this embodiment can be applied in the corresponding method Embodiment One described above; for details, see the description of Embodiment One, which is not repeated here.
It should be noted that, in the above device embodiment, the units are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the present invention.
The above are only preferred embodiments of the present invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. a kind of Forecasting Methodology of methylation is it is characterised in that methods described includes:
Step 1, download obtain the data of methylated effect;
Step 2, the data acquisition urporotein sequence data according to described methylated effect;
Step 3, described urporotein sequence data is pre-processed, obtain positive data collection and negative data set;
Step 4, to described positive data collection and described feminine gender data set in string data encode, obtain numeric type number According to;
Step 5, to described positive data collection and described feminine gender data set in numeric type data be modeled using sorting algorithm, Optimal partitioning scheme is calculated according to the model that modeling obtains, will need to predict whether by methyl finally according to described partitioning scheme Data in the data set changed is divided into two classes, and a class is the data of methylated effect, and another kind of is not have methylated work Data;
Wherein, methods described also comprises the steps:
After the protein sequence data obtaining methylated site data, according to the protein sequence of this methylated site data Column data obtains the interval data of protein unstable structure, and then obtains the string data of new 11 length, then right Former 11 peptide PSP (5,5) add the string data of described 11 new length, obtain amounting to the string data that length is 22, According still further to the coding method mentioned in step 4, the string data that described length is 22 is encoded, obtain numeric type data Carry out follow-up classification prediction again.
2. The method of claim 1, characterized in that acquiring the original protein sequence data according to the methylated data comprises:
reading, in turn, the names of the methylated proteins from the methylated data;
looking up, in turn, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;
forming, from the data found for each protein name, the original protein sequence corresponding to that protein name, wherein the original protein sequence data includes, for each protein name in the methylated data, both the methylated data and the unmethylated data.
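The lookup in claim 2 can be sketched as follows. The claim only names the base web page, so the idea of appending an identifier and a ".fasta" suffix to that base is an assumption made for illustration, and the current UniProt site may redirect such requests or expect a different address. The sketch performs no network access; it only builds the address and parses FASTA text that has already been retrieved. The function names and the sample record are hypothetical.

```python
UNIPROT_BASE = "http://www.uniprot.org/uniprot/"  # base page named in the claim


def entry_url(identifier: str) -> str:
    """Build a candidate address for one entry (the '.fasta' suffix is assumed)."""
    return f"{UNIPROT_BASE}{identifier}.fasta"


def parse_fasta(text: str) -> str:
    """Join the sequence lines of a single-record FASTA text, dropping the header."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Made-up FASTA record, used only to show the parsing step.
sample = ">sp|X00000|EXAMPLE hypothetical protein\nMAKR\nTGKS\n"
print(entry_url("X00000"))
print(parse_fasta(sample))
```

Keeping address construction separate from parsing lets the parsing step be checked without any dependence on the remote site.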
3. The method of claim 1, characterized in that preprocessing the original protein sequence data to obtain the positive data set and the negative data set comprises:
selecting, from the original protein sequence data, strings of a preset length centered on K or R, where K is lysine and R is arginine;
taking the methylated strings as positive controls and the other, unmethylated strings as negative controls;
adding the positive controls to the positive data set and the negative controls to the negative data set.
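The preprocessing in claim 3 can be sketched in Python as below. The sketch assumes that methylated sites are given as 0-based positions in the sequence and that windows running past either terminus are padded with "X". Both choices, along with the function name and the toy sequence, are assumptions rather than details taken from the claims.

```python
from typing import List, Set, Tuple

PAD = "X"  # hypothetical padding symbol for positions beyond the termini


def extract_windows(sequence: str, methylated_positions: Set[int],
                    flank: int = 5, targets: str = "KR"
                    ) -> Tuple[List[str], List[str]]:
    """Split K/R-centered windows into positive and negative sets.

    Each window has 2 * flank + 1 residues. A window is positive when its
    center position is listed in `methylated_positions`, negative otherwise.
    """
    positives: List[str] = []
    negatives: List[str] = []
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue not in targets:
            continue
        # sequence[i] sits at padded[i + flank], so this slice is centered on it
        win = padded[i:i + 2 * flank + 1]
        if i in methylated_positions:
            positives.append(win)
        else:
            negatives.append(win)
    return positives, negatives


# Toy example with flank=2: K at index 2 is methylated, R at index 3 is not.
pos, neg = extract_windows("MAKRTG", {2}, flank=2)
print(pos, neg)
```

With flank=5 each window is an 11-residue peptide, which matches the PSP (5,5) peptide referred to in claim 1.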
4. The method of claim 1, characterized in that the encoding method used to encode the string data in the positive data set and the negative data set into numeric data is one of probabilistic encoding, numerical-value encoding, orthogonal encoding, and binary encoding.
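Of the encoding options listed in claim 4, orthogonal encoding is the simplest to illustrate. The sketch below assumes the common reading of the term, in which each residue becomes a 20-element vector with a single 1 at the position of that amino acid. The alphabet order and the all-zero vector for any symbol outside the alphabet (for example a padding symbol) are assumptions, not details from the patent.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet order (20 standard residues)


def orthogonal_encode(peptide: str) -> List[int]:
    """Encode a peptide as concatenated 20-element indicator vectors.

    Symbols outside AMINO_ACIDS (such as padding) become all-zero vectors.
    """
    vector: List[int] = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            bits[index] = 1
        vector.extend(bits)
    return vector


encoded = orthogonal_encode("AK")
print(len(encoded), sum(encoded))
```

An 11-residue window encoded this way yields 220 numeric features, which could then be passed to a classifier such as the random forest named in claim 5.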
5. The method of claim 1, characterized in that the classification algorithm is one of random forest and random tree.
6. A device for predicting methylation, characterized in that the device comprises:
a data download unit, configured to download methylated data;
an original data acquisition unit, configured to acquire original protein sequence data according to the methylated data;
a preprocessing unit, configured to preprocess the original protein sequence data to obtain a positive data set and a negative data set;
an encoding unit, configured to encode the string data in the positive data set and the negative data set to obtain numeric data;
a classification unit, configured to model the numeric data in the positive data set and the negative data set using a classification algorithm, calculate an optimal partitioning scheme from the resulting model, and finally, according to the partitioning scheme, divide the data in the data set whose methylation is to be predicted into two classes, one class being methylated data and the other being unmethylated data;
wherein, after the original data acquisition unit obtains the protein sequence data of the methylated site data, interval data of protein unstable structure is obtained according to that protein sequence data, and a new string of length 11 is thereby obtained; the new length-11 string is appended to the original 11-peptide PSP (5,5) to obtain a string of total length 22; the length-22 string is encoded using the encoding method of the encoding unit to obtain numeric data; and the subsequent classification prediction is then performed.
7. The device of claim 6, characterized in that the original data acquisition unit comprises:
a protein name acquisition module, configured to read, in turn, the names of the methylated proteins from the methylated data;
a data lookup module, configured to look up, in turn, the data corresponding to each protein name on the web page http://www.uniprot.org/uniprot/;
a data concatenation module, configured to form, from the data found for each protein name, the original protein sequence corresponding to that protein name, wherein the original protein sequence data includes, for each protein name in the methylated data, both the methylated data and the unmethylated data.
8. The device of claim 6, characterized in that the preprocessing unit comprises:
a string selection module, configured to select, from the original protein sequence data, strings of a preset length centered on K or R, where K is lysine and R is arginine;
a positive and negative control acquisition module, configured to take the methylated strings as positive controls and the other, unmethylated strings as negative controls;
a data set acquisition module, configured to add the positive controls to the positive data set and the negative controls to the negative data set.
9. The device of claim 6, characterized in that the encoding method used by the encoding unit is one of probabilistic encoding, numerical-value encoding, orthogonal encoding, and binary encoding.
10. The device of claim 6, characterized in that the classification algorithm used by the classification unit is one of random forest and random tree.
CN201310534661.8A 2013-10-31 2013-10-31 Method and device for predicting methylation Active CN103559423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310534661.8A CN103559423B (en) 2013-10-31 2013-10-31 Method and device for predicting methylation

Publications (2)

Publication Number Publication Date
CN103559423A CN103559423A (en) 2014-02-05
CN103559423B true CN103559423B (en) 2017-02-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant