CN106599615B

CN106599615B - A sequence feature analysis method for predicting miRNA target genes

Info

Publication number: CN106599615B
Application number: CN201611081932.9A
Authority: CN
Inventors: 邹小勇; 夏飞迪; 王洋; 戴宗
Original assignee: Guangdong University of Technology; SYSU CMU Shunde International Joint Research Institute; National Sun Yat Sen University
Current assignee: Guangdong University of Technology; SYSU CMU Shunde International Joint Research Institute; National Sun Yat Sen University
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2019-04-05
Anticipated expiration: 2036-11-30
Also published as: CN106599615A

Abstract

The invention discloses a sequence feature analysis method for predicting miRNA target genes. Based on the CLASH experimental data set, the method constructs 27 features derived from miRNA-target site matched sequences and combines them with traditional features to form a feature set of 84 feature values. A random forest model is then trained on this feature set to build a miRNA target prediction model and identify miRNA target genes. The resulting model shows good accuracy, sensitivity, specificity and precision, and can predict miRNA target genes relatively accurately.

Description

A sequence feature analysis method for predicting miRNA target genes

Technical field

The invention belongs to the technical fields of molecular biology and bioinformatics. More particularly, it relates to a sequence feature analysis method for predicting miRNA target genes.

Background art

MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs about 23 nucleotides (nt) long. They act mainly through complete or incomplete base-pair complementarity with the 3' UTR of an mRNA, which leads to cleavage of the mRNA or inhibits its translation into protein, and they play an important gene-regulatory role at the post-transcriptional and translational levels. To date, more than 2000 human miRNAs have been found. These miRNAs may regulate 80% of human genes and play a crucial role in various life activities and in disease regulation. Because the specific mechanism by which miRNAs recognize their target genes remains unclear and the interaction between a miRNA and its target genes is complex, the effective identification of miRNA target genes has long been a hot topic in miRNA research.

Identifying miRNA target genes purely with biological experimental methods such as Western blot and microarray is time-consuming and costly. Mining the potential target genes of miRNAs with chemo- and bioinformatics methods, and thereby further exploring the mechanism of action of miRNAs and miRNA gene regulatory networks, is therefore of great theoretical significance and practical value. Over the past decade, researchers have proposed a variety of computational methods for identifying miRNA target genes. miRanda scores the pairing between a miRNA and its target gene, then calculates the minimum free energy of the duplex formed by the miRNA and the target gene, and introduces conservation of the target site as a final condition; potential miRNA target genes are obtained by filtering layer by layer. TargetScan proposed the concept of the "seed" region (nucleotides 2 to 8 counted from the 5' end of the miRNA) and found that pairing in the seed region has a significant influence on the identification of miRNA target genes. PITA takes the secondary structure of the target gene into account and proposed the concept of target site accessibility, holding that the ability of a miRNA to bind its target gene is affected by different secondary structures. Although these first-generation computational methods uncovered a number of useful features, studies have shown that the features do not apply to every case of miRNA-target binding. Using them as screening conditions greatly increases the false-negative rate of prediction, and second-generation computational methods based on machine learning emerged in response.

The basic principle of predicting miRNA target genes by machine learning is to take a reliable data set, digitize the binding sequence features of miRNAs and target genes according to the proposed features, and then combine these features to train a model that is used to predict target genes. Huang extracted samples from expression profile data to train a model, and another method used CLIP (crosslinking and immunoprecipitation) data for model training. More recently, the CLASH (crosslinking, ligation and sequencing of hybrids) method of Helwak directly provides miRNAs together with the sequences of their corresponding target sites, giving researchers a good platform for further study of the interaction between miRNAs and their target site sequences.

In recent years, many studies have used common features such as the minimum free energy of the miRNA-target site duplex, the number of pairings in the miRNA seed region, the conservation of the target site and the accessibility of the target site, but these methods suffer from low specificity. Constructing features that characterize the binding of a miRNA to its target gene is therefore of great significance for identifying miRNA target genes.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the above defects and deficiencies of the prior art and to provide a method for identifying miRNA target genes that builds a model with the random forest algorithm from features based on miRNA-target site pairing, combined with a series of traditional features.

The object of the present invention is to provide a sequence feature analysis method for predicting miRNA target genes.

The above object of the present invention is achieved through the following technical solution:

A sequence feature analysis method for predicting miRNA target genes, comprising the following steps:

S1: Collect the data set and construct positive and negative samples

The CLASH data set is selected as the positive samples, and negative samples are constructed from the same data set: the miRNAs and target site sequences in the CLASH data set are paired at random, the pairs that are positive samples are deleted, and 18514 pairs are then randomly selected from the remainder as negative samples;

S2: Calculate the values of the traditional features of each sample according to the calculation methods of those features

For the traditional features used, the feature values of each sample are calculated, and the traditional feature values are combined to construct the sample feature vector;

S3: Calculate the miRNA-target site binding sequence features and construct the sample feature vectors

The positive and negative samples are sequence-matched using an improved Smith-Waterman method and converted into binary sequences; a weight vector w is then constructed from the sequence matching of the positive samples, and this vector is used to calculate the sequence matching score features of the positive and negative samples; the miRNA-target site matched sequence features thus proposed are combined with the traditional features to form a feature set of 84 feature values;

S4: Build the model and identify miRNA target genes

A miRNA target prediction model is built with the random forest method, and the parameters of the model are trained;

S5: Test the model.

The CLASH data set described in step S1 is the data set provided in the literature (Helwak A, Kudla G, Dudnakova T, et al. Mapping the Human miRNA Interactome by CLASH Reveals Frequent Noncanonical Binding [J]. Cell, 2013, 153(3): 654-65.), which the public can download from its supplemental information.

In addition, preferably, the specific method of step S1 is:

S11. Select positive sample data from the CLASH data set; each positive sample includes the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence;

wherein the mRNA to which the target site belongs is taken from the ENSEMBL database;

S12. Randomly pair the miRNAs and target sites that appear in the positive samples, remove the pairs that are positive samples, and then randomly select 18514 pairs as negative samples, so that the ratio of positive to negative samples is 1:1.
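The negative-sample construction in S12 can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes each sample is reduced to a hypothetical `(miRNA id, target site id)` tuple, and it draws random pairs and rejects known positives instead of first materializing every random pair. The function name and the `seed` argument are introduced here for illustration only.

```python
import random


def build_negative_samples(positives, n_negatives, seed=0):
    """Randomly pair miRNAs with target sites, skipping known positive pairs.

    positives   -- list of (mirna_id, site_id) tuples taken from the positive set
    n_negatives -- number of negative pairs to draw (18514 in the text)
    """
    positive_set = set(positives)
    mirnas = sorted({m for m, _ in positive_set})
    sites = sorted({s for _, s in positive_set})

    # Number of distinct pairs that are not positive samples.
    available = len(mirnas) * len(sites) - len(positive_set)
    if n_negatives > available:
        raise ValueError("not enough non-positive pairs to sample from")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(mirnas), rng.choice(sites))
        if pair not in positive_set:  # delete pairs that are positive samples
            negatives.add(pair)
    return sorted(negatives)
```

Drawing as many negatives as there are positives gives the 1:1 ratio stated above.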

Preferably, step S1 collects data on high-confidence miRNAs and the target sites they can bind.

Preferably, the specific method of step S2 is:

Based on literature reports, traditional features of the binding of a miRNA to its target gene are selected, and their values are calculated according to the feature descriptions. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed region pairing, target site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking strand, number of pairings in the duplex, target site length, longest continuous pairing length, position of the longest continuous pairing, number of pairings at the 3' end of the miRNA, difference between the pairing of the miRNA seed region and that of the 3' end, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site.

Preferably, the specific method of step S3 is:

S31. Use an improved Smith-Waterman algorithm, that is, one that follows the A:U and G:C complementary base-pairing rules and additionally allows G:U mismatches, to sequence-match the miRNA sequence and the target site sequence in each sample;

S32. Based on the sequence matching from S31, starting from the first nucleotide at the 5' end of the miRNA sequence, compare each nucleotide with the corresponding nucleotide of the target site sequence; a match is denoted by "1" and a mismatch by "0". Because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 "0" or "1" values; if the miRNA is shorter than 23 nt, the missing values are filled with 0, and if it is longer than 23 nt, the extra values are discarded. Finally, these 23 feature values are added to the feature set;
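The encoding in S31 and S32 can be sketched as follows, assuming the alignment itself has already been produced. The sketch does not implement the improved Smith-Waterman step; it takes two hypothetical pre-aligned strings of equal length, written in the RNA alphabet with "-" for a gap, and applies the pairing rule (A:U, G:C, plus G:U) column by column. Skipping columns where the miRNA has a gap is an assumption, since the text does not say how such columns are handled.

```python
# Watson-Crick pairs plus the G:U pair that the method allows.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def encode_duplex(mirna_aln, target_aln, length=23):
    """Turn an aligned miRNA/target pair into a fixed-length list of 0/1 values.

    Column i of mirna_aln (read from the 5' end) is assumed to face column i
    of target_aln.
    """
    if len(mirna_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")

    bits = []
    for m, t in zip(mirna_aln.upper(), target_aln.upper()):
        if m == "-":
            continue  # gap in the miRNA: there is no miRNA position to score
        bits.append(1 if (m, t) in PAIRS else 0)

    bits = bits[:length]                 # longer than 23: extra values dropped
    bits += [0] * (length - len(bits))   # shorter than 23: filled with 0
    return bits
```

A miRNA nucleotide that faces a gap in the target is scored 0, because the pair `(m, "-")` is not in `PAIRS`.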

S33. From the binary sequences of the positive samples, calculate the probability of successful pairing at each miRNA nucleotide position in the positive samples, and construct the weight vector w from these probabilities;
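A minimal sketch of S33, assuming the positive samples are already available as equal-length lists of 0/1 values: each weight is simply the fraction of positive samples that are paired at that position. The function name is introduced here for illustration.

```python
def weight_vector(positive_bits):
    """Per-position fraction of positive samples that are paired (w_i)."""
    if not positive_bits:
        raise ValueError("need at least one positive sample")
    n = len(positive_bits[0])
    if any(len(row) != n for row in positive_bits):
        raise ValueError("all binary sequences must have the same length")

    total = len(positive_bits)
    return [sum(row[i] for row in positive_bits) / total for i in range(n)]
```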

S34. Calculate the sequence matching scores as described below and add them to the feature set;

For the match state x_i at position i of the miRNA there is a corresponding weight w_i. "Full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, with the calculation formula as follows, where N (N=23) is the sequence length:

Considering the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a feature, "seed region matching feature 1", with the calculation formula as follows:

"Full-sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on miRNA target gene identification;

For the match state x_i at position i of the miRNA, if x_i=1 the corresponding weight is w_i, and if x_i=0 the corresponding weight is q_i=1-w_i. "Full-sequence matching feature 2" is constructed as the average s₃ of the matching scores over the whole sequence, with the calculation formula as follows, where N (N=23) is the sequence length:

"Seed region matching feature 2" is the average matching score over the seed region, with the formula as follows:

These features take both matched and unmatched positions into account.
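The formulas themselves are not reproduced in this text, so the following is only one reading of the four score features that is consistent with the descriptions above. It assumes that feature 1 averages w_i over paired positions (unpaired positions contribute 0), that feature 2 credits an unpaired position with q_i = 1 - w_i, and that the seed-region variants average over the 7 positions 2 to 8. The function name and the return order are illustrative.

```python
SEED = range(1, 8)  # 0-based indices of miRNA positions 2..8


def matching_scores(bits, w):
    """Return (full 1, seed 1, full 2, seed 2) for one binary sequence.

    bits -- list of 0/1 match states x_i (length N, at least 8)
    w    -- weight vector of the same length, built from the positive samples
    """
    if len(bits) != len(w):
        raise ValueError("bits and w must have the same length")
    n = len(bits)

    # Score that counts only successful pairing: w_i if paired, else 0.
    hit = [x * wi for x, wi in zip(bits, w)]
    # Score that also credits unpaired positions with q_i = 1 - w_i.
    both = [wi if x == 1 else 1.0 - wi for x, wi in zip(bits, w)]

    full_1 = sum(hit) / n
    seed_1 = sum(hit[i] for i in SEED) / len(SEED)
    full_2 = sum(both) / n
    seed_2 = sum(both[i] for i in SEED) / len(SEED)
    return full_1, seed_1, full_2, seed_2
```

Together with the 23 binary values, these four scores make up the 27 feature values contributed by the method.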

Preferably, the parameter optimization scheme and result for the random forest model built in step S4 are as follows:

Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features selected each time a decision tree is generated. For n_estimators, all multiples of 100 from 100 to 1000 were tried (100, 200, ..., 1000); for max_feature, all values available in the scikit-learn toolkit were examined. n_estimators=400 and max_feature=4 were finally adopted as the model parameters.
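The parameter sweep can be sketched as below. Several points are assumptions rather than details from the text: scikit-learn spells the second parameter `max_features`; "all values" is read here as the named options plus every integer count; and 5-fold cross-validated accuracy is used as the selection criterion, which the text does not specify. The scikit-learn import is kept inside `search` so that the grid itself can be inspected without the library installed.

```python
from itertools import product


def parameter_grid(n_features):
    """Candidate (n_estimators, max_features) pairs to evaluate."""
    n_estimators = range(100, 1001, 100)  # 100, 200, ..., 1000
    # Assumed reading of "all values": named options plus each integer count.
    max_features = ["sqrt", "log2", None] + list(range(1, n_features + 1))
    return list(product(n_estimators, max_features))


def search(X, y, n_features, folds=5):
    """Cross-validated accuracy for every grid point (needs scikit-learn)."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    results = {}
    for n_est, max_f in parameter_grid(n_features):
        model = RandomForestClassifier(
            n_estimators=n_est, max_features=max_f, random_state=0
        )
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[(n_est, max_f)] = scores.mean()

    best = max(results, key=results.get)
    return best, results
```

The reported setting, 400 trees with 4 features, corresponds to the grid point `(400, 4)`.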

Step S4 optimizes the feature set and the random forest parameters and builds the optimal model for identifying miRNA target genes.

The method is based on the CLASH data set, proposes miRNA-target site matched sequence features, combines them with a series of traditional features, and builds a random forest model to identify miRNA target genes. The model was compared with two other models reported in the literature that were built on the same data set. The experimental results show that the accuracy, sensitivity, specificity, precision and Matthews correlation coefficient of the model reach 90.05%, 89.47%, 90.56%, 90.43% and 0.7998, and that the AUCs of the ROC and PRC curves are 0.954 and 0.958, respectively. Compared with existing methods, the method performs better, which indicates that the newly introduced miRNA-target site matched sequence features have a substantial influence on the identification of miRNA target genes.

The invention has the following beneficial effects:

The method is based on the CLASH data set, proposes miRNA-target site matched sequence features and models them with random forest, and can predict miRNA target genes relatively accurately. Compared with existing methods, it has the following clear advantages:

(1) It uses the CLASH data set, in which every sample provides a miRNA and the sequence of the target site it actually binds. The data sets used by traditional methods often cannot provide the exact target site sequence bound by a miRNA, so sequence matching must first be used to find potential target sites in the mRNA that could bind the miRNA, and samples are then assembled and their features calculated. Because such potential target sites may be inaccurate, the CLASH data set used here is more reliable than the sample sets constructed by conventional methods.

(2) Based on the pairing between a miRNA and its target site, each sample is converted into a binary sequence, matching scores are calculated from that sequence, and miRNA-target site matched sequence features are constructed, which better measure the likelihood that a miRNA binds a target site.

(3) Modeling with the random forest algorithm can handle very high-dimensional data, with fast training and prediction. When the classes are imbalanced, the random forest algorithm can effectively balance the differences in the data set. It maintains good prediction accuracy even when a large proportion of the data is missing, reveals the interactions between features and their degrees of importance, and is not prone to over-fitting.

Brief description of the drawings

Fig. 1 Experimental flow chart.

Fig. 2 Binary representation of sequence matching.

Fig. 3 Comparison of matching in positive and negative samples.

Fig. 4 Difference in matching between positive and negative samples.

Fig. 5 Prediction results based on different feature subsets.

Fig. 6 ROC and PRC curves of the experimental results.

Specific embodiments

The present invention is further described below with reference to the drawings and specific embodiments, but the embodiments do not limit the invention in any way.

Unless otherwise specified, the reagents, methods and equipment used in the present invention are conventional in the art.

Unless otherwise specified, the reagents and materials used in the following embodiments are commercially available.

Embodiment 1 Experimental method

1. Experimental environment

Instrument: ASUS N551JM computer

Programming software: Anaconda3 Spyder, Visual Studio 2013

Programming language: Python 3.5, C++

2. Positive and negative samples and their format

The positive samples are taken from the CLASH experimental data set, 18514 records in total. Each record contains the following information: the miRNA, the miRNA sequence, the mRNA to which the target site belongs (taken from the ENSEMBL database), the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence.

Because the number of target sites that a miRNA can bind is certainly far smaller than the number it cannot bind, the negative samples are obtained by randomly pairing the miRNAs and target sites that appear in the positive samples, removing the pairs that are positive samples, and then randomly selecting 18514 pairs.

Taking a positive sample as an example, the sample format is shown in Table 1:

Table 1 Sample format

3. Feature set

A total of 26 kinds of features (84 feature values) were selected; the specific feature set is shown in Table 2. The first 21 kinds of features, 57 feature values in total, have been reported in the literature. The last 5 kinds of features (shaded), constructed by this method, comprise 27 feature values and take full account of how a miRNA interacts with its target gene.

Table 2 miRNA-target site binding feature set

4. Feature selection

Feature selection addresses the computational problems of high-dimensional data: by removing redundant and irrelevant features, it improves the generalization performance and running efficiency of machine learning algorithms. The method uses the minimal-redundancy-maximal-relevance criterion (mRMR) to rank the 84 features and selects the optimal feature subset to build the model.
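A greedy mRMR ranking can be sketched as follows. The text names the criterion but not the variant, so this sketch assumes the common "difference" form (relevance to the label minus mean redundancy with the features already chosen) and assumes discrete feature values, for which mutual information can be estimated from simple counts. Continuous features such as free energies would need to be discretized first. All names are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log2(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def mrmr_rank(columns, labels):
    """Rank feature names by greedy mRMR (relevance minus mean redundancy).

    columns -- dict mapping feature name to its list of discrete values
    labels  -- list of class labels, same length as every column
    """
    relevance = {name: mutual_information(col, labels) for name, col in columns.items()}
    remaining = sorted(columns)
    ranked = []

    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            redundancy = sum(
                mutual_information(columns[name], columns[r]) for r in ranked
            ) / len(ranked)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In the small check below, an exact copy of an already selected feature is ranked after an independent feature of equal relevance, which is the redundancy penalty at work.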

5. Random forest

Random forest is an ensemble method composed of many decision trees; because these trees are grown by a randomized procedure, they are also called random decision trees. The trees in a random forest are independent of one another. When test data enter the forest, every decision tree classifies them, and the class that receives the most votes across all trees is taken as the final result. The method uses random forest as the machine learning method for training the model; the algorithm comes from the scikit-learn (http://scikit-learn.org/stable/) toolkit, and the entire program is developed in Python. Two parameters were optimized: the number of trees in the forest and the number of features per tree.

6. Performance indicators

The performance of a classifier can be assessed with several independent indicators. Five indicators are used to evaluate the model: accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre) and the Matthews correlation coefficient (Mcc). They are calculated as follows:

Acc = (TP + TN) / (TP + TN + FP + FN)    (1)

Sen = TP / (TP + FN)    (2)

Spe = TN / (TN + FP)    (3)

Pre = TP / (TP + FP)    (4)

Mcc = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (5)

Here, TP is the number of samples predicted as positive that are in fact positive; TN is the number predicted as negative that are in fact negative; FN is the number predicted as negative that are in fact positive; and FP is the number predicted as positive that are in fact negative. In addition, the receiver operating characteristic curve (ROC curve) and the precision-recall curve (PRC curve) are used to evaluate the model. The ROC curve is a composite indicator of sensitivity and specificity as continuous variables: a series of sensitivity and specificity values is calculated by setting multiple different thresholds on the continuous variable, and the curve is drawn with sensitivity on the vertical axis and (1 - specificity) on the horizontal axis. The closer the area under the curve is to 1, the better the model. The PRC curve is a composite indicator of precision and recall (sensitivity): a series of precision and sensitivity values is calculated over multiple thresholds, and the curve is drawn with precision on the vertical axis and sensitivity on the horizontal axis. Again, the closer the area under the curve is to 1, the better the model.
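The five indicators can be computed from the four confusion-matrix counts as sketched below, using their standard definitions. The function name is illustrative, and returning 0.0 for Mcc when its denominator is zero is a convention chosen here, not something stated in the text.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Acc, Sen, Spe, Pre and Mcc from the four confusion-matrix counts.

    Assumes at least one actual positive, one actual negative and one
    predicted positive, so that none of the ratios divides by zero.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),   # sensitivity, also called recall
        "Spe": tn / (tn + fp),
        "Pre": tp / (tp + fp),   # precision
        "Mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```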

7. Experimental procedure

miRNA target prediction is a machine learning problem; the overall experimental procedure is shown in Fig. 1.

Step 1: Select the CLASH data set as the positive samples and construct negative samples from it: randomly pair the miRNAs and target site sequences in the CLASH data set, delete the pairs that are positive samples, and randomly select 18514 pairs from the remainder as negative samples, so that the ratio of positive to negative samples is 1:1;

Step 2: Calculate the values of the traditional features of each sample according to the calculation methods of those features;

Step 3: Sequence-match the positive and negative samples using the improved Smith-Waterman method and convert them into binary sequences. Then construct the weight vector w from the sequence matching of the positive samples, and use this vector to calculate the sequence matching score features of the positive and negative samples;

Step 4: Build the model with the random forest method and train its parameters;

Step 5: Test the model;

Step 6: Compare with other models and analyze the results.

Embodiment 2 Sequence feature analysis for predicting miRNA target genes

1. Features based on miRNA-target site matching

A miRNA and its target site do not match exactly, and the matching varies widely. Based on the pairing between each miRNA and its target site in the sample set, the method represents the duplex formed by each miRNA and its target site as a binary sequence of "0" and "1" and analyzes the resulting binary sequences. The procedure is shown in Fig. 2, where the shaded part is the "seed region".

In Fig. 2, the BEYLA sequence is the target site sequence corresponding to miR-149. Sequence matching is first performed with the improved Smith-Waterman method according to the A:U and G:C complementary pairing rules, with G:U mismatches allowed. Starting from the first nucleotide at the 5' end of the miR-149 sequence, each nucleotide is compared with the corresponding nucleotide of the BEYLA sequence. A match is denoted by "1", and the paired nucleotides are joined by a vertical line "|" at that position; a mismatch is denoted by "0". Either sequence may contain hyphens "-", which indicate that there is no nucleotide at that position. The match between the miR-149 sequence and the BEYLA target site sequence is thus converted into the binary sequence "11111111011110111110010", containing 23 "0" and "1" feature values in total. Because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a sequence of 23 "0" or "1" feature values; if the miRNA is shorter than 23 nt, the missing values are filled with 0, and if it is longer than 23 nt, the extra values are discarded. Finally, the method adds these 23 feature values to the feature set.

Using the encoding described above, the CLASH data set and the randomly constructed negative samples were compared. Each sample was first sequence-matched and converted into a binary "0" and "1" sequence, and the probability of successful pairing at each position was then counted. The results are shown in Fig. 3.

In Fig. 3, the horizontal axis is the position of each nucleotide in the miRNA, and the vertical axis is the probability of successful pairing at that position. The upper curve shows the probability of successful pairing at each miRNA position in the positive samples, and the lower curve shows the same for the negative samples. The figure shows that matching in the positive samples is better overall than in the negative samples, and clearly so before position 20. It also shows that, in both positive and negative samples, the probability of successful pairing at the two ends of the sequence is far lower than at the middle positions. To display the difference between positive and negative samples more directly, the method calculates the difference at each position; the results are shown in Fig. 4.

In Fig. 4, the horizontal axis is the miRNA nucleotide position, and the vertical axis is the difference in matching between positive and negative samples at each position. The difference in pairing at positions 2 to 8 is much larger than at other positions, which is consistent with earlier findings that pairing in the miRNA seed region plays a very important role in the identification of miRNA target genes.

Based on these findings, the method constructs a weight vector w from the pairing success rate at each position in the positive samples. Using this vector, several ways of scoring the matching sequence of a miRNA are proposed, yielding 4 key features.

Feature 1. For the match state x_i at position i of the miRNA there is a corresponding weight w_i. "Full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, calculated by formula (6), where N (N=23) is the sequence length:

Feature 2. Considering the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a feature, "seed region matching feature 1", calculated by formula (7):

"Full-sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on miRNA target gene identification.

Feature 3. For the match state x_i at position i of the miRNA, if x_i=1 the corresponding weight is w_i, and if x_i=0 the corresponding weight is q_i=1-w_i. "Full-sequence matching feature 2" is constructed as the average s₃ of the matching scores over the whole sequence, calculated by formulas (8)-(9), where N (N=23) is the sequence length:

Feature 4. "Seed region matching feature 2" is the average matching score over the seed region, calculated by formula (10):

Feature 3 and feature 4 take both matched and unmatched positions into account.
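Formulas (6) to (10) are not reproduced in this text. One reconstruction that is consistent with the verbal descriptions of features 1 to 4 is given below as a sketch. The symbols s_1, s_2 and s_4, the helper f, the mapping of equations to formula numbers, and the choice to average the seed region over its 7 positions are all assumptions; only x_i, w_i, q_i, s_3 and N come from the text.

```latex
\begin{align}
s_1 &= \frac{1}{N}\sum_{i=1}^{N} w_i\, x_i
      && \text{(full-sequence matching feature 1)} \\
s_2 &= \frac{1}{7}\sum_{i=2}^{8} w_i\, x_i
      && \text{(seed region matching feature 1)} \\
f(x_i) &= \begin{cases} w_i, & x_i = 1 \\ q_i = 1 - w_i, & x_i = 0 \end{cases} \\
s_3 &= \frac{1}{N}\sum_{i=1}^{N} f(x_i)
      && \text{(full-sequence matching feature 2)} \\
s_4 &= \frac{1}{7}\sum_{i=2}^{8} f(x_i)
      && \text{(seed region matching feature 2)}
\end{align}
```

Under this reading, s_1 and s_2 reward only paired positions, while s_3 and s_4 also give credit when a position that is rarely paired in the positive samples is unpaired.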

Thus, from the miRNA-target site pairing, 23 sequence features and 4 sequence score features ("full-sequence matching feature 1", "seed region matching feature 1", "full-sequence matching feature 2" and "seed region matching feature 2") are constructed, 27 feature values in total.

2. Feature selection

For the feature set of 84 features constructed according to Table 2, the mRMR method was used to rank the features in order to study the contribution of each; the ranking of the top 29 features is shown in Table 3.

Table 3 Ranking of the top 29 features

The table shows that the constructed "seed region matching feature 1" ranks 4th, "full-sequence matching feature 1" ranks 5th, "seed region matching feature 2" ranks 8th and "full-sequence matching feature 2" ranks 9th, indicating that the newly constructed features contribute substantially to the identification of miRNA target genes. It also shows that traditional features such as minimum free energy, conservation and seed region pairing play an important role in the identification of miRNA target genes.

In steps of 1 according to the feature ranking, feature subsets were formed from the top 85, 84, ..., 3, 2, 1 features. A model was built on each feature subset, and Acc, Sen, Spe, Pre and Mcc were calculated to examine the performance of the model. The results are shown in Fig. 5.
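The subset sweep can be sketched as a loop over the top-k features of the ranking. The `evaluate` callback is hypothetical: it stands in for "build a model on these features and return one performance value", for example accuracy. Picking the smallest k whose score lies within a fixed tolerance of the best score is also an assumption, used here as a stand-in for reading the plateau off a plot.

```python
def smallest_sufficient_subset(ranked, evaluate, tolerance=0.005):
    """Score every top-k prefix of a ranking and pick the smallest good one.

    ranked    -- feature names, best first
    evaluate  -- callable taking a list of feature names, returning a score
    tolerance -- how far below the best score a subset may fall
    """
    scores = {k: evaluate(ranked[:k]) for k in range(1, len(ranked) + 1)}
    best = max(scores.values())
    k = min(k for k, s in scores.items() if s >= best - tolerance)
    return k, scores
```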

Fig. 5 shows that model performance is essentially unchanged once the number of features in the subset exceeds 29, so the method finally selects the top 29 features as the feature subset. Of these 29 features, 13 were proposed by this method (shaded in Table 2), which shows that the proposed features are effective.

3. Parameter training

Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features selected each time a decision tree is generated. For n_estimators, all multiples of 100 from 100 to 1000 were tried (100, 200, ..., 1000). For max_feature, all values available in the scikit-learn package were examined. The results show that the model performs best when n_estimators=400 and max_feature=4.

4. Robustness assessment

Following the steps above, a model based on the random forest algorithm was established and used to predict miRNA target genes. To study the robustness of the model, the negative samples were randomly sampled 10 times; for each resulting data set a model was built and each performance indicator was calculated. The results are shown in Table 4.

Table 4 Model robustness assessment results

Table 4 shows that the mean accuracy, sensitivity, specificity, precision and Matthews correlation coefficient are 90.05%, 89.47%, 90.56%, 90.43% and 0.7998, respectively, and that the relative standard deviations (RSD%) are all below 1.6%.
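The mean and relative standard deviation over repeated runs can be computed as sketched below. Using the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which form was used, and the function name is illustrative.

```python
from statistics import mean, stdev


def mean_and_rsd(values):
    """Mean and relative standard deviation (in percent) of repeated runs."""
    m = mean(values)
    return m, 100.0 * stdev(values) / m
```

For example, `mean_and_rsd` applied to the 10 accuracy values from the 10 resampled data sets would give the Acc mean and its RSD%.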

These results show that the model established by the method is highly robust. In addition, for the run with the highest accuracy, the ROC and PRC curves were drawn (Fig. 6); the areas under the curves are 0.9537 and 0.9584, respectively, showing that the model performs well for miRNA target prediction.
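The area under the ROC curve can be sketched without plotting, using the fact that it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This illustrates the quantity only and is not the procedure used to obtain the reported values; it covers the ROC curve, not the PRC curve, and the pairwise loop is quadratic in the number of samples.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels -- 1 for positive samples, 0 for negative samples
    scores -- predicted score for the positive class, same order as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    # Count positive/negative pairs ranked correctly; a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```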

3 model construction of embodiment and prediction miRNA target gene method

Based on researching and analysing above, prediction miRNA target gene method and model are constructed, specific as follows:

1. Collect the data set (highly reliable data on miRNAs and the target sites they bind) and construct positive and negative samples

(1) Select positive sample data from the CLASH data set. The positive sample data include the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence.

The mRNA to which each target site belongs is taken from the ENSEMBL database.

(2) Randomly pair the miRNAs and target sites involved in the positive samples, remove the pairs that are positive samples, and then randomly select 18514 of the remaining pairs as negative samples, so that the ratio of positive to negative samples is 1:1.
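The negative-sample construction in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions: samples are represented as hypothetical (miRNA, target site) identifier pairs, and all non-positive pairs are enumerated before sampling, which is simple but memory-hungry for a data set of the size described.

```python
import random


def build_negative_samples(positive_pairs, n_negative, seed=0):
    """Randomly pair miRNAs with target sites, skipping known positive pairs.

    positive_pairs: iterable of (mirna_id, site_id) tuples (hypothetical format).
    Returns a list of n_negative distinct pairs, none of which is a positive.
    """
    positives = set(positive_pairs)
    mirnas = sorted({m for m, _ in positives})
    sites = sorted({s for _, s in positives})
    # Every miRNA/site combination that is not a known positive pair.
    candidates = [(m, s) for m in mirnas for s in sites if (m, s) not in positives]
    if n_negative > len(candidates):
        raise ValueError("not enough non-positive pairs to sample from")
    return random.Random(seed).sample(candidates, n_negative)
```

Passing `n_negative=len(positive_pairs)` keeps the 1:1 ratio of positive to negative samples, and varying `seed` gives the repeated random draws used in the robustness assessment.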

2. Select traditional features of the binding between a miRNA and its target gene, calculate the value of each traditional feature for every sample according to that feature's calculation method, and combine the feature values to construct the sample feature vectors.
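A few of the simpler traditional features can be sketched as below. The exact definitions are not given in this section, so these are assumptions: GC content is taken as the fraction of G and C nucleotides, the seed region as miRNA positions 2 to 8, and `x` as a list of 0/1 values marking whether each miRNA position is paired. The function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence (0.0 for an empty one)."""
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def seed_pair_count(x):
    """Number of paired positions in the seed region (miRNA positions 2-8)."""
    return sum(x[1:8])


def longest_paired_run(x):
    """Return (length, 1-based start position) of the longest run of 1s.

    Returns (0, 0) when no position is paired.
    """
    best_len, best_start, run_len = 0, 0, 0
    for pos, bit in enumerate(x, start=1):
        run_len = run_len + 1 if bit == 1 else 0
        if run_len > best_len:
            best_len, best_start = run_len, pos - run_len + 1
    return best_len, best_start
```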

3. Calculate the miRNA-target site binding sequence features and construct the sample feature vectors

Sequence matching is performed on the positive and negative samples using an improved Smith-Waterman method, and the result is converted into binary sequences. A weight vector w is then constructed from the sequence matching of the positive samples, and this vector is used to calculate the sequence matching score features of the positive and negative samples. The proposed miRNA-target site matched sequence features, combined with the traditional features, form a feature set containing 84 feature values. The specific method is as follows:

(1) Use the improved Smith-Waterman algorithm, that is, follow the A:U and G:C complementary base pairing principle while allowing G:U mismatches, to perform sequence matching between the miRNA sequence and the target site sequence in each sample.

(2) Based on the sequence matching from (1), compare each nucleotide of the miRNA sequence, starting from the first nucleotide at the 5' end, with the corresponding nucleotide of the target site sequence; a match is represented by "1" and a mismatch by "0". Because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 "0"s and "1"s. If the miRNA is shorter than 23 nt, the missing feature values are padded with 0; if it is longer than 23 nt, the extra feature values are ignored. Finally, these 23 feature values are added to the feature set.

(3) From the binary sequences of the positive samples, calculate the probability of a successful match at each nucleotide position of the miRNA in the positive samples, and use these probabilities to construct the weight vector w.

(4) Calculate the sequence matching scores as described and add them to the feature set.
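Steps (1) and (2) can be illustrated with the sketch below. It is one possible reading, not the patented implementation: the text does not specify the scoring scheme of the improved Smith-Waterman algorithm, so the match, mismatch, and gap scores are hypothetical, G:U is scored like a Watson-Crick pair, and the target site is assumed to be given 5' to 3' and is therefore reversed so that the antiparallel duplex can be compared position by position.

```python
# Pairs treated as matches: Watson-Crick A:U and G:C, plus the G:U pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MATCH, MISMATCH, GAP = 2, -1, -2  # hypothetical scores, not from the source
VECTOR_LENGTH = 23


def paired_positions(mirna, target):
    """Local alignment of the miRNA (5'->3') against the target read 3'->5'.

    Returns the set of 0-based miRNA positions that are base-paired in the
    best-scoring local alignment.
    """
    rev = target[::-1]  # antiparallel duplex: miRNA 5' end faces target 3' end
    rows, cols = len(mirna) + 1, len(rev) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = MATCH if (mirna[i - 1], rev[j - 1]) in PAIRS else MISMATCH
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + GAP, h[i][j - 1] + GAP)
            if h[i][j] > best:
                best, best_cell = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    paired = set()
    i, j = best_cell
    while i > 0 and j > 0 and h[i][j] > 0:
        is_pair = (mirna[i - 1], rev[j - 1]) in PAIRS
        s = MATCH if is_pair else MISMATCH
        if h[i][j] == h[i - 1][j - 1] + s:
            if is_pair:
                paired.add(i - 1)
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + GAP:
            i -= 1  # miRNA nucleotide opposite a gap: unpaired
        else:
            j -= 1  # target nucleotide opposite a gap
    return paired


def binary_vector(mirna, target, length=VECTOR_LENGTH):
    """Encode pairing per miRNA position as 1/0, padded or truncated to `length`."""
    paired = paired_positions(mirna, target)
    return [1 if k in paired else 0 for k in range(length)]
```

For a 22 nt miRNA and a perfectly complementary target site, `binary_vector` returns 22 ones followed by a single padding zero.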

4. Construct the miRNA target gene prediction model using the random forest method to identify miRNA target genes, and train the model parameters. The feature set and the random forest parameters are optimized, and the optimal model is built to identify miRNA target genes.

The parameter optimization scheme and results for the random forest model are as follows:

5. Model testing.

Embodiment 4: Comparison with other methods

1. To verify the effectiveness of the newly constructed features, a miRNA target gene prediction model was built on the traditional feature set alone and compared with the model used in this method.

In addition, to further verify the performance of the model, this method was compared with two other models, MirTarget and TarPmiR, built on the same data set.

2. The results are shown in Table 5.

Table 5. Comparison of different methods

The results show that model performance improves considerably after the newly constructed features are added: accuracy rises by 6%, specificity and precision by nearly 5%, and sensitivity, the most marked improvement, by nearly 9%, while the areas under the ROC and PRC curves rise by about 10%. This further demonstrates the effectiveness of the newly constructed features. Compared with the existing TarPmiR and MirTarget methods, the model used by this method performs better overall; in particular, its accuracy is 8% and 5% higher than that of TarPmiR and MirTarget, respectively, a clear improvement. The areas under the ROC and PRC curves of this model are both above 0.95, which also demonstrates the stability of the model's performance.

Claims

1. A sequence feature analysis method for predicting miRNA target genes, characterized by comprising the following steps:

S1: Collect the data set and construct positive and negative samples

Select the CLASH data set as the positive samples and construct the negative samples from this data set: randomly pair the miRNAs and target site sequences in the CLASH data set, delete the pairs that are positive samples, and then randomly select 18514 pairs from the remaining data as negative samples;

S2: According to the calculation methods of the traditional features, calculate the value of each traditional feature for every sample, and combine the feature values to construct the sample feature vectors, the traditional features including: the minimum free energy of the duplex formed by the miRNA and its target site, the AU content of the miRNA seed region, the conservation of the seed region, the conservation of the flanking strands, pairing in the region near the duplex, target site accessibility, the number of pairings in the seed region, the target site length, the longest continuous matched length, the position of the longest continuous matched sequence, the number of pairings at the miRNA 3' end, the difference in pairing between the miRNA seed region and the 3' end, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, the number of AC in the target site, the number of UG in the target site, the number of AG in the target site, the number of CG in the target site, the GC content of the target site, the GC content upstream of the target site, and the GC content at the 3' end of the target site;

S3: Perform sequence matching on the positive and negative samples using an improved Smith-Waterman method, and convert the result into binary sequences; then construct a weight vector w from the sequence matching of the positive samples, and use this vector to calculate the sequence matching score features of the positive and negative samples; the proposed miRNA-target site matched sequence features, combined with the traditional features, form a feature set containing 84 feature values;

S4: Build the model and identify miRNA target genes;

S5: Model testing.

2. The method according to claim 1, characterized in that step S1 specifically comprises:

S11. Select positive sample data from the CLASH data set, the positive sample data including the miRNA, the miRNA sequence, the name of the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence;

wherein the mRNA to which the target site belongs is taken from the ENSEMBL database.

3. The method according to claim 1, characterized in that step S2 specifically comprises: selecting traditional features of the binding between a miRNA and its target gene, and calculating their feature values according to the feature descriptions.

4. The method according to claim 1, characterized in that step S3 specifically comprises:

S31. Use the improved Smith-Waterman algorithm, that is, follow the A:U and G:C complementary base pairing principle while allowing G:U mismatches, to perform sequence matching between the miRNA sequence and the target site sequence in each sample;

S32. Based on the sequence matching from S31, compare each nucleotide of the miRNA sequence, starting from the first nucleotide at the 5' end, with the corresponding nucleotide of the target site sequence; a match is represented by "1" and a mismatch by "0"; because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 "0"s and "1"s; if the miRNA is shorter than 23 nt, the missing feature values are padded with 0; if it is longer than 23 nt, the extra feature values are ignored; finally, these 23 feature values are added to the feature set;

S33. From the binary sequences of the positive samples, calculate the probability of a successful match at each nucleotide position of the miRNA in the positive samples, and use these probabilities to construct the weight vector w;
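The weight vector of S33 and the four matching score features defined in the remainder of this claim can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch rests on assumptions drawn from the verbal definitions: the score of position i is w_i * x_i for the two "feature 1" scores, it is t_i = w_i when x_i = 1 and t_i = q_i = 1 - w_i when x_i = 0 for the two "feature 2" scores, and both seed-region scores are averaged over positions 2 to 8 (a plain sum would also fit the wording for seed region matching feature 1).

```python
SEED = range(1, 8)  # 0-based indices of miRNA positions 2..8


def weight_vector(positive_vectors):
    """w[i] = fraction of positive samples whose miRNA position i is matched."""
    if not positive_vectors:
        raise ValueError("need at least one positive sample")
    n = len(positive_vectors)
    length = len(positive_vectors[0])
    return [sum(v[i] for v in positive_vectors) / n for i in range(length)]


def matching_scores(x, w):
    """Return (S1, S2, S3, S4) for a binary match vector x and weight vector w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    n = len(x)
    # Feature 1 scores: only successful matches contribute (w_i * x_i).
    s = [wi * xi for xi, wi in zip(x, w)]
    # Feature 2 scores: a match scores w_i, a mismatch scores q_i = 1 - w_i.
    t = [wi if xi == 1 else 1.0 - wi for xi, wi in zip(x, w)]
    s1 = sum(s) / n
    s2 = sum(s[i] for i in SEED) / len(SEED)
    s3 = sum(t) / n
    s4 = sum(t[i] for i in SEED) / len(SEED)
    return s1, s2, s3, s4
```

Here `weight_vector` would be applied once to the 23-element binary sequences of all positive samples, and `matching_scores` then to every positive and negative sample with the resulting w.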

For the match status x_i at position i of the miRNA, there is a corresponding weight w_i; "full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions; the calculation formula is as follows, where N is the sequence length, N = 23, and S₁ is "full-sequence matching feature 1":

Considering the importance of the miRNA seed sequence, which refers to positions 2 to 8 counted from the 5' end, the matching score of the miRNA seed region is taken as a feature, "seed region matching feature 1"; the calculation formula is as follows, where S₂ is "seed region matching feature 1":

"Full-sequence matching feature 1" and "seed region matching feature 1" take into account the influence of successful pairing on miRNA target gene identification;

For the match status x_i at position i of the miRNA, if x_i = 1 the corresponding weight is w_i, and if x_i = 0 the corresponding weight is q_i = 1 - w_i; "full-sequence matching feature 2" is constructed as the average S₃ of the overall matching scores; the calculation formula is as follows, where N is the sequence length, N = 23, and q_i is the "weight of position i":

"Seed region matching feature 2" is obtained by calculating the average matching score of the seed region; the formula is as follows, where S₄ is "seed region matching feature 2" and t_i is the "matching score of position i":

5. The sequence feature analysis method for predicting miRNA target genes according to claim 1, characterized in that the parameter optimization scheme and results for building the model with the random forest method in step S4 are as follows:

The random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features selected each time a decision tree is generated; for n_estimators, all multiples of 100 from 100 to 1000 are tested in steps of 100; for max_feature, all values available in the scikit-learn toolkit are examined; finally, n_estimators=400 and max_feature=4 are used as the model parameters.