CN106599615B - A sequence feature analysis method for predicting miRNA target genes - Google Patents
- Publication number: CN106599615B (application CN201611081932.9A)
- Authority: CN (China)
- Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G: PHYSICS
- G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The invention discloses a sequence feature analysis method for predicting miRNA target genes. Based on the CLASH experimental data set, the method constructs 27 feature values that describe miRNA-target site sequence matching and combines them with traditional features to form a feature set of 84 feature values. A random forest model is then trained on this feature set to build a miRNA target gene prediction model and identify miRNA target genes. The resulting model achieves good accuracy, sensitivity, specificity, and precision, and can predict miRNA target genes relatively accurately.
Description
Technical field
The invention belongs to the fields of molecular biology and bioinformatics. More specifically, it relates to a sequence feature analysis method for predicting miRNA target genes.
Background
MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs about 23 nucleotides (nt) long. They act mainly through complete or incomplete complementary base pairing with the 3' UTR of an mRNA, which leads to cleavage of the mRNA or inhibition of its translation into protein, and they therefore play an important gene-regulatory role at the post-transcriptional and translational levels. To date, more than 2000 human miRNAs have been found. These miRNAs may regulate about 80% of human genes and play a critical role in many life processes and in disease regulation. Because the specific mechanism by which a miRNA recognizes its target genes is still unclear and the interaction between miRNAs and their target genes is highly complex, effective identification of miRNA target genes has long been a central problem in miRNA research.
Identifying miRNA target genes purely with experimental methods such as Western blot and microarray is time-consuming and costly. Mining potential miRNA target genes with chemo- and bioinformatics methods, as a basis for further study of miRNA mechanisms and miRNA gene regulatory networks, therefore has important theoretical significance and practical value. Over the past decade, researchers have proposed a variety of computational methods for identifying miRNA target genes. miRanda scores the pairing between a miRNA and its target gene, then calculates the minimum free energy of the duplex formed by the miRNA and the target, and introduces conservation of the target site as the last condition, obtaining potential miRNA target genes by successive filtering. TargetScan introduced the concept of the "seed" region (nucleotides 2 to 8 counted from the 5' end of the miRNA) and found that pairing in the seed region has a significant influence on target gene identification. PITA considers the secondary structure of the target gene and introduced the concept of target site accessibility, on the view that the ability of a miRNA to bind its target gene is affected by secondary structure. Although these first-generation computational methods uncovered many useful features, studies have shown that the features do not fully apply to all cases of miRNA-target gene binding. Using them as screening conditions greatly increases the false negative rate of prediction, which led to second-generation computational methods based on machine learning.
Machine learning methods for predicting miRNA target genes share one basic principle: starting from a reliable data set, the binding sequence features of miRNAs and target genes are digitized according to the proposed features, the features are combined to train a model, and the model is then used to predict target genes. Huang extracted samples from expression profile data for model training, and another method used CLIP (crosslinking and immunoprecipitation) data for model training. More recently, the CLASH (crosslinking, ligation, and sequencing of hybrids) method of Helwak directly provides sequence data for miRNAs and their corresponding target sites, giving researchers a good platform for further study of the interaction between miRNAs and their target site sequences.
In recent years, many studies have used common features such as the minimum free energy of the miRNA-target site duplex, the number of pairings in the miRNA seed region, target site conservation, and target site accessibility, but these methods suffer from low specificity. Constructing features that describe miRNA-target gene binding is therefore of great importance for identifying miRNA target genes.
Summary of the invention
The technical problem to be solved by the invention is to overcome the defects and deficiencies of the prior art described above by providing a method for identifying miRNA target genes that builds a model with the random forest algorithm from features based on miRNA-target site pairing, combined with a series of traditional features.
The object of the invention is to provide a sequence feature analysis method for predicting miRNA target genes.
This object is achieved through the following technical solution:
A sequence feature analysis method for predicting miRNA target genes, comprising the following steps:
S1: Collect the data set and construct positive and negative samples
Select the CLASH data set as the positive samples and construct negative samples from the same data set: randomly pair the miRNAs and target site sequences in the CLASH data set, delete the pairs that are positive samples, and randomly select 18514 of the remaining pairs as negative samples;
S2: Calculate the traditional feature values of each sample
Calculate the value of each traditional feature for every sample according to that feature's calculation method, and combine the traditional feature values into the sample feature vector;
S3: Calculate the miRNA-target site binding sequence features and construct the sample feature vectors
Align the sequences of the positive and negative samples with an improved Smith-Waterman method and convert each alignment into a binary sequence; construct a weight vector w from the sequence matching of the positive samples, and use this vector to calculate sequence matching score features for the positive and negative samples; the resulting miRNA-target site matching sequence features, combined with the traditional features, form a feature set of 84 feature values;
S4: Build the model and identify miRNA target genes
Build the miRNA target gene prediction model with the random forest method and train its parameters;
S5: Test the model.
The CLASH data set of step S1 is the data set provided in Helwak A, Kudla G, Dudnakova T, et al. Mapping the Human miRNA Interactome by CLASH Reveals Frequent Noncanonical Binding [J]. Cell, 2013, 153(3): 654-65. It is publicly available for download from that paper's supplemental information.
In addition, step S1 is preferably carried out as follows:
S11. Select the positive sample data from the CLASH data set. Each positive sample comprises the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence;
the mRNA to which the target site belongs is taken from the ENSEMBL database;
S12. Randomly pair the miRNAs and target sites that appear in the positive samples, remove the pairs that are positive samples, and randomly select 18514 of the remaining pairs as negative samples, so that the ratio of positive to negative samples is 1:1.
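The negative-sample construction in S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_negative_samples`, the representation of a sample as a `(mirna_id, target_site_id)` tuple, and the use of rejection sampling are all assumptions made for the example.

```python
import random

def build_negative_samples(positives, n_negative, seed=0):
    """Randomly pair miRNAs and target sites, skipping known positive pairs.

    positives: list of (mirna_id, target_site_id) tuples (hypothetical format).
    Returns n_negative distinct pairs, none of which is a positive sample.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    mirnas = sorted({m for m, _ in positives})
    sites = sorted({s for _, s in positives})

    # Every positive pair lies inside the miRNA x site grid, so this is the
    # number of pairs that are still available as negatives.
    available = len(mirnas) * len(sites) - len(positive_set)
    if n_negative > available:
        raise ValueError("not enough non-positive pairs to sample from")

    negatives = set()
    while len(negatives) < n_negative:
        pair = (rng.choice(mirnas), rng.choice(sites))
        if pair not in positive_set:  # delete the positive samples
            negatives.add(pair)
    return sorted(negatives)
```

Calling it with `n_negative=len(positives)` gives the 1:1 ratio described above (18514 of each in the patent's data set).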
Preferably, step S1 collects data on high-confidence miRNAs and the target sites that can bind to them.
Preferably, step S2 is carried out as follows:
Traditional features of miRNA-target gene binding are selected from published reports, and their values are calculated according to the feature descriptions. The traditional features are: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed region pairing, target site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking sequences, number of pairings in the duplex, target site length, length of the longest contiguous pairing, position of the longest contiguous pairing, number of pairings at the miRNA 3' end, difference between the pairing of the miRNA seed region and of the 3' end, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site.
Preferably, step S3 is carried out as follows:
S31. Use an improved Smith-Waterman algorithm, that is, one that follows the A:U and G:C complementary base pairing rules and also allows G:U wobble pairs, to align the miRNA sequence with the target site sequence in each sample;
S32. Based on the alignment from S31, compare each nucleotide of the miRNA, starting from the first nucleotide at its 5' end, with the corresponding nucleotide of the target site sequence, writing "1" for a match and "0" for a mismatch. Because most miRNAs in the CLASH data set are 23 nt long, the duplex formed by each miRNA and its target site is converted into a binary sequence of 23 "0" or "1" values; if the miRNA is shorter than 23 nt, the missing values are filled with 0, and if it is longer than 23 nt, the extra values are ignored. Finally, these 23 feature values are added to the feature set;
S33. From the binary sequences of the positive samples, calculate the probability of successful pairing at each nucleotide position of the miRNA in the positive samples, and use these probabilities to construct the weight vector w;
S34. Calculate the sequence matching scores as described below and add them to the feature set;
For the match state x_i at position i of the miRNA there is a corresponding weight w_i. "Full sequence matching feature 1" is constructed as the average of the matching scores over all positions, where N (N = 23) is the sequence length.
In view of the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a separate feature, "seed region matching feature 1".
"Full sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on miRNA target gene identification;
For the match state x_i at position i of the miRNA, the corresponding weight is w_i if x_i = 1 and q_i = 1 - w_i if x_i = 0. "Full sequence matching feature 2" is constructed as the average s_3 of the matching scores over the whole sequence, where N (N = 23) is the sequence length.
"Seed region matching feature 2" is the corresponding average matching score over the seed region.
These two features take into account both successful and unsuccessful matches.
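The weight vector of S33 and the four score features of S34 can be sketched as below. The formulas themselves did not survive in this text, so the code follows the verbal description only: each score is taken to be a plain average of per-position scores, and a mismatched position is assumed to contribute q_i = 1 - w_i as a positive term in the "feature 2" variants. All function names are hypothetical.

```python
# 0-based indices of miRNA positions 2..8 (the seed region).
SEED_POSITIONS = range(1, 8)

def position_weights(positive_codes):
    """w[i] = fraction of positive samples that are paired at position i."""
    n_samples = len(positive_codes)
    n_positions = len(positive_codes[0])
    return [sum(code[i] for code in positive_codes) / n_samples
            for i in range(n_positions)]

def position_score(x_i, w_i, count_mismatch):
    """Score of one position: w_i if paired; otherwise q_i = 1 - w_i or 0."""
    if x_i == 1:
        return w_i
    # Assumption: an unpaired position adds q_i (not subtracts it).
    return (1.0 - w_i) if count_mismatch else 0.0

def matching_score(x, w, positions, count_mismatch=False):
    """Average per-position score over the given positions."""
    total = sum(position_score(x[i], w[i], count_mismatch) for i in positions)
    return total / len(positions)

def matching_features(x, w):
    """The four score features for one binary matching sequence x."""
    everything = range(len(x))
    return {
        "full_1": matching_score(x, w, everything),
        "seed_1": matching_score(x, w, SEED_POSITIONS),
        "full_2": matching_score(x, w, everything, count_mismatch=True),
        "seed_2": matching_score(x, w, SEED_POSITIONS, count_mismatch=True),
    }
```

Note that `w` is estimated from the positive samples only and then applied to positive and negative samples alike, as S33 and step S3 describe.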
Preferably, the parameter optimization scheme and result for the random forest model of step S4 are as follows:
The random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated. For n_estimators, every multiple of 100 from 100 to 1000 was tried (100, 200, ..., 1000); for max_feature, all values available in the scikit-learn toolkit were examined. The final model parameters are n_estimators = 400 and max_feature = 4.
Step S4 optimizes the feature set and the random forest parameters and builds the optimal model for identifying miRNA target genes.
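A parameter search of this kind might look like the sketch below. Several things here are assumptions rather than details from the patent: the candidate values for the second parameter are passed in by the caller, models are compared by cross-validated accuracy, and the helper names are hypothetical. In scikit-learn the parameter the text calls max_feature is spelled `max_features`.

```python
def candidate_parameters(max_features_values):
    """All (n_estimators, max_features) combinations to try."""
    n_estimators_values = range(100, 1001, 100)  # 100, 200, ..., 1000
    return [{"n_estimators": n, "max_features": m}
            for n in n_estimators_values
            for m in max_features_values]

def tune_random_forest(X, y, max_features_values, cv=5):
    """Return (best_score, best_params) over the candidate grid.

    Scoring by cross-validated accuracy is an assumption; the source does
    not say how candidate models were compared.
    """
    # Imported here so that candidate_parameters works without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    best = None
    for params in candidate_parameters(max_features_values):
        model = RandomForestClassifier(random_state=0, **params)
        score = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        if best is None or score > best[0]:
            best = (score, params)
    return best
```

The combination reported in the text, n_estimators = 400 with 4 features per split, is one point of this grid whenever 4 is among the candidate values.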
The method is based on the CLASH data set, proposes miRNA-target site matching sequence features, combines them with a series of traditional features, and builds a random forest model to identify miRNA target genes. It was compared with two other models reported in the literature that were built on the same data set. The experimental results show that the accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient of this model reach 90.05%, 89.47%, 90.56%, 90.43%, and 0.7998, and that the areas under the ROC and PRC curves are 0.954 and 0.958, respectively. Compared with existing methods, this method performs better, which indicates that the newly introduced miRNA-target site matching sequence features strongly influence the identification of miRNA target genes.
The invention has the following beneficial effects:
The method is based on the CLASH data set, proposes miRNA-target site matching sequence features, and models with a random forest, so that miRNA target genes can be predicted relatively accurately. Compared with existing methods, it has the following clear advantages:
(1) It uses the CLASH data set, in which every sample provides both the miRNA and the sequence of the target site it actually binds. The data sets used by traditional methods often do not provide the exact target site sequence, so a sequence matching method must first be used to find potential binding sites in the mRNA, and only then can samples be formed and their features calculated. Because such potential target sites may be inaccurate, the CLASH data set used here is more reliable than the sample sets constructed by conventional methods.
(2) Based on the pairing between the miRNA and its target site, each sample is converted into a binary sequence from which matching scores are calculated. The resulting miRNA-target site matching sequence features give a better measure of the likelihood that a miRNA binds a target site.
(3) Modeling with the random forest algorithm can handle very high-dimensional data, and training and prediction are fast. When the classes are imbalanced, the random forest algorithm can effectively balance the error across the data set. It can maintain good prediction accuracy even when a large proportion of the data is missing, can reveal the interactions between features and their importance, and is not prone to overfitting.
Brief description of the drawings
Fig. 1 Flow chart of the experiment.
Fig. 2 Binary representation of sequence matching.
Fig. 3 Comparison of matching in positive and negative samples.
Fig. 4 Difference in matching between positive and negative samples.
Fig. 5 Prediction results based on different feature subsets.
Fig. 6 ROC and PRC curves of the experimental results.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments, but the embodiments do not limit the invention in any form.
Unless otherwise stated, the reagents, methods, and equipment used in the invention are conventional in the art.
Unless otherwise stated, the reagents and materials used in the following embodiments are commercially available.
Embodiment 1: Experimental method
1. Experimental environment
Hardware: ASUS N551JM computer
Programming software: Anaconda3 Spyder, Visual Studio 2013
Programming languages: Python 3.5, C++
2. Positive and negative samples and their format
The positive samples are taken from the CLASH experimental data set, 18514 records in total. Each record contains the following information: the miRNA, the miRNA sequence, the mRNA to which the target site belongs (taken from the ENSEMBL database), the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence.
Because the number of target sites that a miRNA can bind is certainly far smaller than the number it cannot bind, the negative samples are constructed by randomly pairing the miRNAs and target sites that appear in the positive samples, removing the pairs that are positive samples, and randomly selecting 18514 of the remaining pairs.
Taking a positive sample as an example, the sample format is shown in Table 1:
Table 1 Sample format
3. Feature set
A total of 26 kinds of features (84 feature values) were selected; the full feature set is shown in Table 2. The first 21 kinds of features, 57 feature values in total, have been reported in the literature. The last 5 kinds of features (shaded in the table), constructed by this method, comprise 27 feature values and take full account of how a miRNA interacts with its target gene.
Table 2 miRNA-target site binding feature set
4. Feature selection
Feature selection addresses the computational problems of high-dimensional data: by removing redundant and irrelevant features, it improves the generalization performance and running efficiency of machine learning algorithms. This method uses the minimal redundancy maximal relevance (mRMR) criterion to rank the 84 features and selects the optimal feature subset to build the model.
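For orientation, a greedy mRMR ranking can be sketched as follows. The patent does not state which mRMR variant or which discretization was used, so this example assumes already-discrete feature values and the difference form of the criterion (relevance to the label minus mean redundancy with the features chosen so far). The function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )

def mrmr_rank(features, labels):
    """Rank feature names by greedy mRMR (difference criterion, assumed).

    features: dict mapping feature name -> list of discrete values.
    """
    relevance = {name: mutual_information(values, labels)
                 for name, values in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Continuous features such as the minimum free energy would need to be binned before being passed to this sketch.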
5. Random forest
A random forest is an ensemble method made up of many decision trees. Because these trees are built by a randomized procedure, they are also called random decision trees. The trees in a random forest are independent of one another. When test data enter the forest, every decision tree classifies them, and the class chosen by the most trees is taken as the final result. This method uses the random forest as its machine learning model; the algorithm comes from the scikit-learn toolkit (http://scikit-learn.org/stable/), and the whole program was developed in Python. Two parameters were optimized: the number of trees in the forest and the number of features used by each tree.
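The voting step described above can be illustrated in a few lines. This sketch shows only the majority vote over per-tree class labels as the text describes it; the helper name is hypothetical, and a library implementation may combine trees differently (for example by averaging class probabilities).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```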
6. Performance indicators
The performance of a classifier can be assessed with several independent indicators. Five are used here to evaluate the model: accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), and the Matthews correlation coefficient (Mcc). They are calculated as follows:
Acc = (TP + TN) / (TP + TN + FP + FN)
Sen = TP / (TP + FN)
Spe = TN / (TN + FP)
Pre = TP / (TP + FP)
Mcc = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Here TP is the number of samples predicted positive that are in fact positive; TN is the number predicted negative that are in fact negative; FN is the number predicted negative that are in fact positive; and FP is the number predicted positive that are in fact negative. In addition, the receiver operating characteristic curve (ROC curve) and the precision-recall curve (PRC curve) are used to assess the model. The ROC curve is a composite indicator of sensitivity and specificity as continuous variables: a series of thresholds is applied to the continuous prediction score, sensitivity and specificity are calculated at each threshold, and the curve is plotted with sensitivity as the ordinate and (1 - specificity) as the abscissa. The closer the area under the curve is to 1, the better the model. The PRC curve is the corresponding composite indicator of precision and recall (sensitivity): precision and sensitivity are calculated at each threshold, and the curve is plotted with precision as the ordinate and sensitivity as the abscissa. Again, the closer the area under the curve is to 1, the better the model.
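The five indicators can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below is illustrative; the function name and the choice to return 0.0 for Mcc when its denominator is zero are assumptions.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Acc, Sen, Spe, Pre and Mcc from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),          # recall on the positive class
        "Spe": tn / (tn + fp),          # recall on the negative class
        "Pre": tp / (tp + fp),
        # Convention (assumed): report 0.0 when Mcc is undefined.
        "Mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }
```

With equal numbers of positive and negative samples, as in this data set, Acc is the mean of Sen and Spe.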
7. Experimental procedure
Predicting miRNA target genes is a machine learning problem; the overall procedure is shown in Fig. 1.
Step 1: Select the CLASH data set as the positive samples and construct negative samples from the same data set: randomly pair the miRNAs and target site sequences in the CLASH data set, delete the pairs that are positive samples, and randomly select 18514 of the remaining pairs as negative samples, giving a positive-to-negative ratio of 1:1;
Step 2: Calculate the traditional feature values of each sample according to the calculation methods of the traditional features;
Step 3: Align the sequences of the positive and negative samples with the improved Smith-Waterman method and convert each alignment into a binary sequence. Construct the weight vector w from the sequence matching of the positive samples, and use this vector to calculate the sequence matching score features of the positive and negative samples;
Step 4: Build the model with the random forest method and train its parameters;
Step 5: Test the model;
Step 6: Compare the model with other models and analyze the results.
Embodiment 2: Sequence feature analysis for predicting miRNA target genes
1. Features based on miRNA-target site matching
A miRNA and its target site do not match perfectly, and the degree of matching varies widely. Based on the pairing between miRNA and target site in the sample set, this method represents the duplex formed by each miRNA and its target site as a binary sequence of "0" and "1" and analyzes these sequences. The procedure is illustrated in Fig. 2, where the shaded part is the "seed region".
In Fig. 2, the BEYLA sequence is the target site sequence corresponding to miR-149. First, the sequences are aligned with the improved Smith-Waterman method according to the A:U and G:C complementary pairing rules, with G:U wobble pairs allowed. Starting from the first nucleotide at the 5' end of the miR-149 sequence, each nucleotide is compared with the corresponding nucleotide of the BEYLA sequence. A match is written as "1", and the paired nucleotides are joined by a vertical bar "|"; a mismatch is written as "0". Either sequence may contain dashes "-", which indicate that no nucleotide is present at that position. The match between the miR-149 sequence and the BEYLA target site sequence is thus converted into the binary sequence "11111111011110111110010", which contains 23 "0" or "1" feature values. Because most miRNAs in the CLASH data set are 23 nt long, the duplex formed by each miRNA and its target site is converted into a sequence of 23 "0" or "1" feature values; if the miRNA is shorter than 23 nt, the missing values are filled with 0, and if it is longer than 23 nt, the extra values are ignored. Finally, these 23 feature values are added to the feature set.
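The conversion from an alignment to the 23-value binary sequence can be sketched as below. The alignment itself (the improved Smith-Waterman step) is assumed to have been done already; the input format of two equal-length gapped strings, with the target written so that column i of one string faces column i of the other, and the name `encode_alignment` are assumptions for illustration.

```python
# Watson-Crick pairs plus the G:U wobble pair, in both orientations.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def encode_alignment(mirna_aligned, target_aligned, length=23):
    """Binary matching sequence for the miRNA positions of one alignment.

    mirna_aligned reads 5' -> 3'; '-' marks a gap in either string.
    """
    if len(mirna_aligned) != len(target_aligned):
        raise ValueError("aligned strings must have the same length")

    code = []
    for m, t in zip(mirna_aligned.upper(), target_aligned.upper()):
        if m == "-":
            continue  # gap in the miRNA: there is no miRNA position to score
        code.append(1 if (m, t) in PAIRS else 0)

    code = code[:length]                  # ignore positions beyond `length`
    code += [0] * (length - len(code))    # pad short miRNAs with 0
    return code
```

Joining the returned values into a string gives the same kind of representation as the "11111111011110111110010" example for miR-149 above.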
Using this encoding, the CLASH data set and the randomly constructed negative samples were compared. Each sample was first aligned and converted into a binary sequence of "0" and "1", and the probability of successful pairing at each position was then counted. The result is shown in Fig. 3.
In Fig. 3, the horizontal axis gives the position of each nucleotide in the miRNA, and the vertical axis gives the probability of successful pairing at that position. The upper curve shows the probability of successful pairing at each miRNA position in the positive samples, and the lower curve shows the same probability in the negative samples. The figure shows that matching is better overall in the positive samples than in the negative samples, and clearly so before position 20. It also shows that, in both positive and negative samples, the probability of pairing at the two ends of the sequence is much lower than at the middle positions. To show the difference between positive and negative samples more directly, the difference at each position was calculated; the result is shown in Fig. 4.
In Fig. 4, the horizontal axis gives the miRNA nucleotide position, and the vertical axis gives the difference in matching between positive and negative samples at that position. The difference is much larger at positions 2 to 8 than at other positions, which agrees with earlier findings that pairing in the miRNA seed region plays a very important role in miRNA target gene identification.
Based on these findings, a weight vector w was constructed from the pairing success rate at each position in the positive samples. Using this vector, several ways of scoring the matching sequence of a miRNA were proposed, giving 4 key features.
Feature 1. For the match state x_i at position i of the miRNA there is a corresponding weight w_i. "Full sequence matching feature 1" is the average of the matching scores over all positions and is calculated by formula (6), where N (N = 23) is the sequence length.
Feature 2. In view of the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a separate feature, "seed region matching feature 1", calculated by formula (7).
"Full sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on miRNA target gene identification.
Feature 3. For the match state x_i at position i of the miRNA, the corresponding weight is w_i if x_i = 1 and q_i = 1 - w_i if x_i = 0. "Full sequence matching feature 2" is the average s_3 of the matching scores over the whole sequence and is calculated by formulas (8)-(9), where N (N = 23) is the sequence length.
Feature 4. "Seed region matching feature 2" is the corresponding average matching score over the seed region and is calculated by formula (10).
Features 3 and 4 take into account both successful and unsuccessful matches.
Thus, based on miRNA-target site pairing, 23 sequence features and 4 sequence score features ("full sequence matching feature 1", "seed region matching feature 1", "full sequence matching feature 2", and "seed region matching feature 2") were constructed, 27 feature values in total.
2. Feature selection
To study the contribution of each feature in the 84-feature set constructed according to Table 2, the features were ranked with the mRMR method; the ranking of the top 29 features is shown in Table 3.
Table 3 Ranking of the top 29 features
The table shows that the constructed "seed region matching feature 1" ranks 4th, "full sequence matching feature 1" ranks 5th, "seed region matching feature 2" ranks 8th, and "full sequence matching feature 2" ranks 9th. This indicates that the newly constructed features contribute considerably to miRNA target gene identification. It can also be seen that traditional features such as minimum free energy, conservation, and seed region pairing all play an important role in miRNA target gene identification.
Following the feature ranking and stepping down by 1 each time, feature subsets were formed from the top 84, ..., 3, 2, 1 ranked features. A model was built on each subset, and Acc, Sen, Spe, Pre, and Mcc were calculated to examine the performance of each model. The results are shown in Fig. 5.
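The subset sweep described above can be sketched generically. Here `evaluate` is a hypothetical callable that builds a model on a list of feature names and returns one performance score, and the plateau rule in `smallest_plateau_subset` (with its `tolerance` value) is an assumption about how "no further improvement" might be operationalized, not the patent's stated rule.

```python
def evaluate_feature_subsets(ranked_features, evaluate):
    """Score the top-k features for every k from len(ranked_features) down to 1.

    ranked_features: feature names, best first (for example an mRMR ranking).
    evaluate: callable(list_of_names) -> score (hypothetical; e.g. accuracy).
    """
    results = []
    for k in range(len(ranked_features), 0, -1):
        results.append((k, evaluate(ranked_features[:k])))
    return results

def smallest_plateau_subset(results, tolerance=0.001):
    """Smallest k whose score is within `tolerance` of the best score."""
    best = max(score for _, score in results)
    return min(k for k, score in results if best - score <= tolerance)
```

In practice `evaluate` would train and test a random forest on the chosen columns and could return any of the five indicators.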
Fig. 5 shows that model performance is essentially unchanged once the feature subset contains more than 29 features, so the top 29 features were finally selected as the feature subset. Of these top 29 features, 13 are features proposed by this method (as shaded in Table 2), which shows that the proposed features are effective.
3. Parameter training
The random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated. For n_estimators, every multiple of 100 from 100 to 1000 was tried (100, 200, ..., 1000). For max_feature, all values available in the scikit-learn package were examined. The results show that the model performs best with n_estimators = 400 and max_feature = 4.
4, robustness is assessed
According to above-mentioned step, the model based on random forests algorithm algorithm is established, miRNA target gene has been carried out pre-
It surveys.For the robustness of research model, negative sample has carried out 10 stochastical samplings, according to the data set established, constructs model
With calculate each performance indicator, concrete outcome is as shown in table 4.
Table 4. Model robustness assessment results
As Table 4 shows, the mean accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), and Matthews correlation coefficient (Mcc) are 90.05%, 89.47%, 90.56%, 90.43%, and 0.7998, respectively, and the relative standard deviations (RSD%) are all below 1.6%. The results show that the model built by this method is highly robust. In addition, based on the run with the highest accuracy, ROC and PRC curves were plotted (Fig. 6); the areas under the curves are 0.9537 and 0.9584, respectively, which indicates that the model performs well for miRNA target gene prediction.
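For reference, the five metrics and the RSD can be computed from confusion-matrix counts as sketched below. The formulas are the standard textbook definitions; that the patent uses exactly these definitions, and the sample (n - 1) standard deviation for the RSD, is an assumption. The counts and values in the example are made up.

```python
import math
from statistics import mean, stdev


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Acc, Sen, Spe, Pre and Mcc from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)  # sensitivity (recall)
    spe = tn / (tn + fp)  # specificity
    pre = tp / (tp + fp)  # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Spe": spe, "Pre": pre, "Mcc": mcc}


def rsd_percent(values) -> float:
    """Relative standard deviation in percent: 100 * sample std / mean."""
    return 100 * stdev(values) / mean(values)


# Made-up counts and repeated-run accuracies, for illustration only.
m = classification_metrics(tp=90, fp=10, tn=90, fn=10)
spread = rsd_percent([0.90, 0.91, 0.89])
```

In a robustness check like the one above, `rsd_percent` would be applied to the 10 values of each metric obtained from the 10 negative-sample draws.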
Embodiment 3: Model construction and method for predicting miRNA target genes
Based on the above analysis, the method and model for predicting miRNA target genes were constructed as follows:
1. Collect the dataset (high-confidence data on miRNAs and the target sites they bind) and construct positive and negative samples
The CLASH dataset is selected as the positive samples, and negative samples are constructed from it: the miRNAs and target-site sequences in the CLASH dataset are randomly paired, the pairs that are positive samples are deleted, and 18514 pairs are then randomly selected from the remainder as negative samples;
(1) Select positive sample data from the CLASH dataset. Each positive sample includes the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target-site sequence;
wherein the mRNA to which the target site belongs is taken from the ENSEMBL database;
(2) Randomly pair the miRNAs and target-site information from the positive samples, remove the pairs that are positive samples, and then randomly select 18514 of the remaining pairs as negative samples, so that the ratio of positive to negative samples is 1:1.
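The negative-sample construction in steps (1) and (2) can be sketched as below. The function name, the fixed random seed, and the toy identifiers are hypothetical; the sketch draws random miRNA/target-site pairs and rejects any pair that is a known positive, which is one straightforward reading of the description.

```python
import random


def build_negative_samples(positives, n_negatives, seed=0):
    """Randomly pair miRNAs with target sites, skipping known positive pairs."""
    rng = random.Random(seed)
    positive_set = set(positives)
    mirnas = sorted({m for m, _ in positives})
    sites = sorted({s for _, s in positives})
    max_possible = len(mirnas) * len(sites) - len(positive_set)
    if n_negatives > max_possible:
        raise ValueError("not enough non-positive pairs to sample from")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(mirnas), rng.choice(sites))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical toy identifiers, not real CLASH records.
positives = [("miR-a", "site-1"), ("miR-b", "site-2"), ("miR-c", "site-3")]
negatives = build_negative_samples(positives, n_negatives=len(positives))
```

Requesting as many negatives as there are positives gives the 1:1 ratio used in the text (18514 of each in the actual dataset).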
2. Select the traditional features of miRNA binding to its target genes, calculate the traditional feature values of each sample according to the calculation method of each feature, and combine these values into sample feature vectors;
Based on literature reports, traditional features of miRNA binding to its target genes are selected, and their values are calculated according to the feature descriptions. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed-region pairing, target-site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking regions, number of paired positions in the duplex, target-site length, length of the longest consecutive pairing, position of the longest consecutive pairing, number of pairings at the miRNA 3' end, difference between miRNA seed-region pairing and 3'-end pairing, miRNA pseudo-dinucleotide features, target-site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site.
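Several of the composition features listed above are simple to compute directly from the target-site sequence. The sketch below covers only target-site length, GC content, and the AC/UG/AG/CG counts; counting overlapping dinucleotide occurrences is an assumption, and the function names and example sequence are hypothetical. Features such as minimum free energy, accessibility, and conservation need external tools or data and are not shown.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in an RNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def dinucleotide_count(seq: str, pair: str) -> int:
    """Count occurrences of a dinucleotide, allowing overlaps."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)


def composition_features(target_site: str) -> dict:
    """Length, GC content and selected dinucleotide counts of a target site."""
    features = {"length": len(target_site), "GC": gc_content(target_site)}
    for pair in ("AC", "UG", "AG", "CG"):
        features[pair] = dinucleotide_count(target_site, pair)
    return features


# Hypothetical 10-nt target-site sequence used only for illustration.
feats = composition_features("ACGUGCAGCG")
```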
3. Calculate the miRNA-target-site binding sequence features and construct sample feature vectors
The positive and negative samples are sequence-matched using a modified Smith-Waterman method and converted into binary sequences. A weight vector w is then constructed from the matching results of the positive samples, and this vector is used to calculate the sequence matching score features of the positive and negative samples. The proposed miRNA-target-site matched sequence features, combined with the traditional features, constitute a feature set containing 84 feature values. The specific method is as follows:
(1) Use a modified Smith-Waterman algorithm, i.e., one that follows the A:U and G:C complementary base-pairing rules and additionally allows G:U mismatch (wobble) pairing, to match the miRNA sequence against the target-site sequence in each sample;
(2) Based on the sequence matching from (1), starting from the first nucleotide at the 5' end of the miRNA sequence, compare each miRNA nucleotide with the corresponding nucleotide of the target-site sequence; a match is recorded as "1" and a non-match as "0". Because most miRNAs in the CLASH dataset are 23 nucleotides long, this method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 "0" or "1" values. If the miRNA is shorter than 23, the missing feature values are filled with 0; if the miRNA is longer than 23, the extra feature values are ignored. Finally, these 23 feature values are added to the feature set;
(3) From the binary sequences of the positive samples, calculate the probability that each miRNA nucleotide position is successfully paired in the positive samples, and use these probabilities to construct the weight vector w;
(4) Calculate the sequence matching scores as described below and add them to the feature set;
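Steps (1) and (2) can be sketched as a local alignment followed by a traceback that records which miRNA positions are paired. The text does not give the scoring scheme of the modified Smith-Waterman algorithm, so the scores below (+2 for A:U or G:C, +1 for G:U, -1 for other pairs, -2 per gap) are assumptions, as are the function names and the choice to reverse the target site so that the antiparallel duplex can be aligned position by position.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_score(a: str, b: str) -> int:
    """Score one miRNA/target base pair (values are illustrative assumptions)."""
    if (a, b) in WATSON_CRICK:
        return 2
    if (a, b) in WOBBLE:
        return 1
    return -1


def paired_positions(mirna: str, target: str, gap: int = -2, width: int = 23):
    """Local alignment of the miRNA against the reversed target site.

    Returns `width` 0/1 flags, one per miRNA position from the 5' end:
    1 if that nucleotide is paired in the best local alignment, else 0.
    """
    rev = target[::-1]  # the duplex is antiparallel
    n, m = len(mirna), len(rev)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + pair_score(mirna[i - 1], rev[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    flags = [0] * n
    i, j = best_cell
    # Trace back from the best-scoring cell until the local alignment ends.
    while i > 0 and j > 0 and H[i][j] > 0:
        s = pair_score(mirna[i - 1], rev[j - 1])
        if H[i][j] == H[i - 1][j - 1] + s:
            if s > 0:
                flags[i - 1] = 1
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    # Pad with 0 or truncate so every sample yields exactly `width` values.
    return (flags + [0] * width)[:width]


# An 8-nt miRNA fragment against its perfect reverse complement.
flags = paired_positions("UGAGGUAG", "CUACCUCA")
```

For this fully complementary example the first eight flags are 1 and the remaining fifteen are the zero padding described in step (2).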
For the match state $x_i$ at position i of the miRNA, there is a corresponding weight $w_i$. "Full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, where N (N = 23) is the sequence length:

$$ s_1 = \frac{1}{N} \sum_{i=1}^{N} w_i x_i $$
In view of the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a feature, "seed-region matching feature 1", calculated as follows:

$$ s_2 = \frac{1}{7} \sum_{i=2}^{8} w_i x_i $$

"Full-sequence matching feature 1" and "seed-region matching feature 1" capture the influence of successful pairing on miRNA target gene identification;
For the match state $x_i$ at position i of the miRNA, if $x_i = 1$ the corresponding weight is $w_i$; if $x_i = 0$ the corresponding weight is $q_i = 1 - w_i$. "Full-sequence matching feature 2" is constructed as the average $s_3$ of the matching scores over the whole sequence, where N (N = 23) is the sequence length:

$$ s_3 = \frac{1}{N} \sum_{i=1}^{N} \left[ x_i w_i + (1 - x_i)\, q_i \right] $$
"Seed-region matching feature 2" is the average matching score over the seed region (positions 2 to 8), with $q_i = 1 - w_i$:

$$ s_4 = \frac{1}{7} \sum_{i=2}^{8} \left[ x_i w_i + (1 - x_i)\, q_i \right] $$

These features take into account both successful and unsuccessful pairing.
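The weight vector and the four matching scores can be sketched as follows. The sketch assumes that each score is a plain average (over all positions for the full-sequence features, over the seven seed positions 2 to 8 for the seed-region features) and that a position's score is w_i when paired and 1 - w_i when unpaired; the function names and the 8-position toy vectors are hypothetical, and real inputs would have 23 positions.

```python
SEED = range(1, 8)  # 0-based indices of miRNA positions 2..8


def weight_vector(positive_flags):
    """w[i] = fraction of positive samples whose miRNA position i is paired."""
    n = len(positive_flags)
    return [sum(col) / n for col in zip(*positive_flags)]


def matching_scores(x, w):
    """Return (s1, s2, s3, s4) for one binary match vector x and weights w."""
    hit = [wi * xi for wi, xi in zip(w, x)]                # w_i * x_i
    both = [wi if xi else 1 - wi for wi, xi in zip(w, x)]  # w_i or 1 - w_i
    s1 = sum(hit) / len(x)                      # full-sequence feature 1
    s2 = sum(hit[i] for i in SEED) / len(SEED)  # seed-region feature 1
    s3 = sum(both) / len(x)                     # full-sequence feature 2
    s4 = sum(both[i] for i in SEED) / len(SEED) # seed-region feature 2
    return s1, s2, s3, s4


# Toy data with 8 positions instead of 23, for illustration only.
w = weight_vector([[1] * 8, [1, 1, 1, 1, 0, 0, 0, 0]])
s1, s2, s3, s4 = matching_scores([1, 1, 1, 1, 1, 1, 0, 0], w)
```

Because the weights are estimated from positive samples only, the same `w` is reused when scoring both positive and negative samples.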
4. Build the miRNA target gene prediction model using the random forest method, identify miRNA target genes, and train the model parameters; the feature set and the random forest parameters are optimized, and the optimal model is built to identify miRNA target genes.
The parameter optimization scheme and results for the random forest model are as follows:
Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated. For n_estimators, all multiples of 100 from 100 to 1000 were tried (100, 200, ..., 1000); for max_feature, all values available in the scikit-learn toolkit were examined. Finally, n_estimators = 400 and max_feature = 4 were used as the model parameters.
5. Model testing.
Embodiment 4: Comparison with other methods
1. To verify the effectiveness of the newly constructed features, a miRNA target gene prediction model was built on the traditional feature set alone and compared with the model used in this method.
In addition, to further verify model performance, this method was compared with two other models built on the same dataset, MirTarget and TarPmiR.
2. The results are shown in Table 5.
Table 5. Comparison of different methods
The results show that model performance improved substantially after the newly constructed features were added: accuracy improved by 6%, specificity and precision by nearly 5%, sensitivity by nearly 9%, and the areas under the ROC and PRC curves by about 10%, which further demonstrates the effectiveness of the newly constructed features. In addition, comparison with the existing TarPmiR and MirTarget methods shows that the model used by this method performs better overall. In particular, the accuracy of this method is 8% and 5% higher than that of TarPmiR and MirTarget, respectively, a clear improvement. The areas under the ROC and PRC curves of this model both exceed 0.95, which also demonstrates the stability of the model's performance.
Claims (5)
1. A sequence feature analysis method for predicting miRNA target genes, characterized by comprising the following steps:
S1: Collect the dataset and construct positive and negative samples
The CLASH dataset is selected as the positive samples, and negative samples are constructed from it: the miRNAs and target-site sequences in the CLASH dataset are randomly paired, the pairs that are positive samples are deleted, and 18514 pairs are then randomly selected from the remainder as negative samples;
S2: Calculate the traditional feature values of the samples according to the calculation method of each traditional feature
According to the calculation method of each traditional feature, the traditional feature values of each sample are calculated and combined into sample feature vectors. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed-region pairing, target-site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking regions, number of paired positions in the duplex, target-site length, length of the longest consecutive pairing, position of the longest consecutive pairing, number of pairings at the miRNA 3' end, difference between miRNA seed-region pairing and 3'-end pairing, miRNA pseudo-dinucleotide features, target-site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site;
S3: Calculate the miRNA-target-site binding sequence features and construct sample feature vectors
The positive and negative samples are sequence-matched using a modified Smith-Waterman method and converted into binary sequences; a weight vector w is then constructed from the matching results of the positive samples, and this vector is used to calculate the sequence matching score features of the positive and negative samples; the proposed miRNA-target-site matched sequence features, combined with the traditional features, constitute a feature set containing 84 feature values;
S4: Build the model and identify miRNA target genes
The miRNA target gene prediction model is built using the random forest method, and the model parameters are trained;
S5: Model testing.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
S11. Selecting positive sample data from the CLASH dataset, the positive sample data including the miRNA, the miRNA sequence, the name of the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target-site sequence;
wherein the mRNA to which the target site belongs is taken from the ENSEMBL database;
S12. Randomly pairing the miRNAs and target-site information from the positive samples, removing the pairs that are positive samples, and then randomly selecting 18514 of the remaining pairs as negative samples, wherein the ratio of positive to negative samples is 1:1.
3. The method according to claim 1, characterized in that step S2 specifically comprises: selecting the traditional features of miRNA binding to its target genes, and calculating their feature values according to the feature descriptions.
4. The method according to claim 1, characterized in that step S3 specifically comprises:
S31. Using a modified Smith-Waterman algorithm, i.e., one that follows the A:U and G:C complementary base-pairing rules and additionally allows G:U mismatch (wobble) pairing, to match the miRNA sequence against the target-site sequence in each sample;
S32. Based on the sequence matching from S31, starting from the first nucleotide at the 5' end of the miRNA sequence, comparing each miRNA nucleotide with the corresponding nucleotide of the target-site sequence, a match being recorded as "1" and a non-match as "0"; because most miRNAs in the CLASH dataset are 23 nucleotides long, the method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 "0" or "1" values; if the miRNA is shorter than 23, the missing feature values are filled with 0; if the miRNA is longer than 23, the extra feature values are ignored; finally, these 23 feature values are added to the feature set;
S33. From the binary sequences of the positive samples, calculating the probability that each miRNA nucleotide position is successfully paired in the positive samples, and using these probabilities to construct the weight vector w;
S34. Calculating the sequence matching scores as described below and adding them to the feature set;
For the match state $x_i$ at position i of the miRNA, there is a corresponding weight $w_i$; "full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, where N is the sequence length, N = 23, and $s_1$ is "full-sequence matching feature 1":

$$ s_1 = \frac{1}{N} \sum_{i=1}^{N} w_i x_i $$

In view of the importance of the miRNA seed sequence, which refers to positions 2 to 8 counted from the 5' end, the matching score of the miRNA seed region is taken as a feature, "seed-region matching feature 1", calculated as follows, where $s_2$ is "seed-region matching feature 1":

$$ s_2 = \frac{1}{7} \sum_{i=2}^{8} w_i x_i $$

"Full-sequence matching feature 1" and "seed-region matching feature 1" capture the influence of successful pairing on miRNA target gene identification;
For the match state $x_i$ at position i of the miRNA, if $x_i = 1$ the corresponding weight is $w_i$; if $x_i = 0$ the corresponding weight is $q_i = 1 - w_i$; "full-sequence matching feature 2" is constructed as the average $s_3$ of the matching scores over the whole sequence, where N is the sequence length, N = 23, and $q_i$ is the weight of position i when it is unpaired:

$$ s_3 = \frac{1}{N} \sum_{i=1}^{N} \left[ x_i w_i + (1 - x_i)\, q_i \right] $$

"Seed-region matching feature 2" is the average matching score over the seed region, where $s_4$ is "seed-region matching feature 2" and $t_i = x_i w_i + (1 - x_i)\, q_i$ is the matching score of position i:

$$ s_4 = \frac{1}{7} \sum_{i=2}^{8} t_i $$
5. The sequence feature analysis method for predicting miRNA target genes according to claim 1, characterized in that the parameter optimization scheme and results for the random forest model of step S4 are as follows:
Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated; for n_estimators, all multiples of 100 from 100 to 1000 are tried; for max_feature, all values available in the scikit-learn toolkit are examined; finally, n_estimators = 400 and max_feature = 4 are used as the model parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081932.9A CN106599615B (en) | 2016-11-30 | 2016-11-30 | A kind of sequence signature analysis method for predicting miRNA target gene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599615A CN106599615A (en) | 2017-04-26 |
CN106599615B true CN106599615B (en) | 2019-04-05 |
Family
ID=58594491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611081932.9A Active CN106599615B (en) | 2016-11-30 | 2016-11-30 | A kind of sequence signature analysis method for predicting miRNA target gene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599615B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090327B (en) * | 2017-12-20 | 2022-03-29 | 吉林大学 | Prediction method for exogenous miRNA (micro ribonucleic acid) regulation and control target gene containing three-dimensional free energy |
CN110164505B (en) * | 2018-02-07 | 2021-06-22 | 深圳华大基因科技服务有限公司 | Method for rapidly predicting target gene of target miRNA |
CN108707663B (en) * | 2018-04-19 | 2022-03-08 | 深圳华大基因股份有限公司 | Reagent for cancer sample miRNA sequencing quantitative result evaluation, preparation method and application |
CN110021361B (en) * | 2018-06-27 | 2023-04-07 | 中山大学 | miRNA target gene prediction method based on convolutional neural network |
CN109272056B (en) * | 2018-10-30 | 2021-09-21 | 成都信息工程大学 | Data balancing method based on pseudo negative sample and method for improving data classification performance |
CN109859798B (en) * | 2019-01-21 | 2023-06-23 | 桂林电子科技大学 | Prediction method for interaction of sRNA and target mRNA in bacteria |
CN110517727B (en) * | 2019-08-23 | 2022-03-08 | 苏州浪潮智能科技有限公司 | Sequence alignment method and system |
CN111192629A (en) * | 2019-12-23 | 2020-05-22 | 苏州金唯智生物科技有限公司 | Construction method and application of gene sequence difficulty analysis model |
CN112599196B (en) * | 2020-12-21 | 2021-11-05 | 北京诺赛基因组研究中心有限公司 | Method for constructing model for classifying nucleic acid sequences and application thereof |
CN113409889A (en) * | 2021-05-25 | 2021-09-17 | 电子科技大学长三角研究院(衢州) | Target activity prediction method, device, equipment and storage medium of sgRNA |
CN113838527B (en) * | 2021-09-26 | 2023-09-01 | 平安科技(深圳)有限公司 | Method and device for generating target gene prediction model and storage medium |
CN116798513B (en) * | 2023-02-21 | 2023-12-15 | 苏州赛赋新药技术服务有限责任公司 | Method and system for screening siRNA sequence to reduce off-target effect |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710362A (en) * | 2009-12-10 | 2010-05-19 | 浙江大学 | microRNA target position point prediction method based on support vector machine |
CN102597257A (en) * | 2009-09-04 | 2012-07-18 | 国立大学法人富山大学 | Specific method for preparing joined DNA fragments including sequences derived from target genes |
CN103218544A (en) * | 2013-04-03 | 2013-07-24 | 河海大学 | Gene identification method based on sequence similarity and periodicity of frequency spectrum 3 |
CN106032532A (en) * | 2015-03-17 | 2016-10-19 | 中国医学科学院北京协和医院 | Small-activating RNA, preparation method and applications thereof |
CN106909806A (en) * | 2015-12-22 | 2017-06-30 | 广州华大基因医学检验所有限公司 | The method and apparatus of fixed point detection variation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2824533A1 (en) * | 2011-01-13 | 2012-07-19 | Laboratory Corporation Of America Holdings | Methods and systems for predictive modeling of hiv-1 replication capacity |
US9047559B2 (en) * | 2011-07-22 | 2015-06-02 | Sas Institute Inc. | Computer-implemented systems and methods for testing large scale automatic forecast combinations |
Also Published As
Publication number | Publication date |
---|---|
CN106599615A (en) | 2017-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||