CN106599615B - A sequence feature analysis method for predicting miRNA target genes


Publication number
CN106599615B
Authority
CN
China
Prior art keywords
mirna
sequence
characteristic
target site
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611081932.9A
Other languages
Chinese (zh)
Other versions
CN106599615A (en)
Inventor
邹小勇
夏飞迪
王洋
戴宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Original Assignee
Guangdong University of Technology
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Application filed by Guangdong University of Technology, SYSU CMU Shunde International Joint Research Institute, and National Sun Yat Sen University
Priority to CN201611081932.9A
Publication of CN106599615A
Application granted
Publication of CN106599615B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention discloses a sequence feature analysis method for predicting miRNA target genes. Based on the CLASH experimental data set, the method constructs 27 miRNA-target site matching sequence features and combines them with traditional features to form a feature set of 84 feature values. A random forest model is then trained on this feature set to build a miRNA target prediction model, which is used to identify miRNA target genes. The resulting model achieves good accuracy, sensitivity, specificity, and precision, and can predict miRNA target genes relatively accurately.

Description

A sequence feature analysis method for predicting miRNA target genes
Technical field
The invention belongs to the technical fields of molecular biology and bioinformatics. More particularly, it relates to a sequence feature analysis method for predicting miRNA target genes.
Background art
MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs about 23 nucleotides (nt) long. They act mainly through complete or incomplete base-pair complementarity with the 3' UTR sequences of mRNAs, thereby cleaving the mRNA or inhibiting its translation into protein, and they play important gene-regulatory roles at the post-transcriptional and translational levels. To date, more than 2000 human miRNAs have been identified. These miRNAs may regulate 80% of human genes and play a critical role in many life processes and in disease regulation. Because the specific mechanism by which miRNAs recognize their target genes is still unclear, and the interaction between miRNAs and their target genes is highly complex, the effective identification of miRNA target genes has long been a central problem in miRNA research.
Identifying miRNA target genes solely with biological experimental methods such as Western blot and microarray is time-consuming and costly. Mining potential miRNA target genes with cheminformatics and bioinformatics methods, so that the mechanisms of miRNA action and miRNA gene regulatory networks can be explored further, therefore has important theoretical significance and practical value. Over the past decade, researchers have proposed a variety of computational methods for identifying miRNA target genes. MiRanda scores the pairing between a miRNA and its target gene, then calculates the minimum free energy of the duplex formed by the miRNA and the target gene, and introduces the conservation of the target site as a final condition; potential miRNA target genes are obtained by successive filtering. TargetScan proposed the concept of the "seed" region (nucleotides 2 to 8 counted from the 5' end of the miRNA) and found that pairing in the seed region has a significant influence on the identification of miRNA target genes. PITA takes the secondary structure of the target gene into account and proposed the concept of target site accessibility, holding that the ability of a miRNA to bind its target gene is affected by different secondary structures. These are first-generation computational methods. Although researchers found many useful features, studies have shown that these features do not apply to every case of miRNA-target gene binding, and using them as filtering conditions greatly increases the false-negative rate of prediction. Second-generation computational methods based on machine learning emerged as a result.
Machine-learning methods for predicting miRNA target genes rest on the following principle: using a reliable data set, the binding sequence features of the miRNA and the target gene are converted into numerical values according to the proposed features, the features are merged, and the resulting model is trained and then used to predict target genes. Huang extracted samples from expression profile data for model training, and other methods have used CLIP (crosslinking and immunoprecipitation) data for model training. More recently, the CLASH (crosslinking, ligation and sequencing of hybrids) method of Helwak directly provides sequence data for miRNAs and their corresponding target sites, giving researchers a good platform for further study of the interaction between miRNAs and their target site sequences.
In recent years, many studies have used common features such as the minimum free energy of the duplex formed by the miRNA and the target site, the number of pairs in the miRNA seed region, the conservation of the target site, and the accessibility of the target site, but these methods suffer from low specificity. Constructing features that describe the binding of miRNAs to their target genes is therefore of great importance for the identification of miRNA target genes.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defects and deficiencies of the prior art described above by providing a method for identifying miRNA target genes that is based on miRNA-target site pairing features, combines them with a series of traditional features, and builds a model with the random forest algorithm.
The object of the present invention is to provide a sequence feature analysis method for predicting miRNA target genes.
The above object of the present invention is achieved through the following technical solution:
A sequence feature analysis method for predicting miRNA target genes, comprising the following steps:
S1: Collect the data set and construct positive and negative samples
The CLASH data set is selected as the positive samples, and negative samples are constructed from the same data set: the miRNAs and target site sequences in the CLASH data set are randomly paired, the positive samples among the resulting pairs are deleted, and 18514 pairs are then randomly selected from the remaining data as negative samples;
S2: Calculate the feature values of the traditional features of each sample according to the calculation methods of the traditional features
For the traditional features used, the feature values of each sample are calculated, and the traditional feature values are combined to construct the sample feature vector;
S3: Calculate the miRNA-target site binding sequence features and construct the sample feature vectors
An improved Smith-Waterman method is used to perform sequence matching on the positive and negative samples, and the matches are converted to binary sequences; a weight vector w is then constructed from the sequence matching of the positive samples, and this vector is used to calculate the sequence matching score features of the positive and negative samples; the miRNA-target site matching sequence features are thus obtained and, combined with the traditional features, constitute a feature set of 84 feature values;
S4: Build the model and identify miRNA target genes
The miRNA target prediction model is built with the random forest method, and the parameters of the model are trained;
S5: Model testing.
The CLASH data set described in step S1 is the data set provided in the literature (Helwak A, Kudla G, Dudnakova T, et al. Mapping the Human miRNA Interactome by CLASH Reveals Frequent Noncanonical Binding [J]. Cell, 2013, 153(3): 654-65.); the public can download it from the supplemental information of that article.
Furthermore, preferably, step S1 is carried out as follows:
S11. Select the positive sample data from the CLASH data set; each positive sample record includes the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence;
The mRNA to which the target site belongs is taken from the ENSEMBL database;
S12. Randomly pair the miRNA and target site information contained in the positive samples, remove the positive samples from the resulting pairs, and then randomly select 18514 records as negative samples; the ratio of positive to negative samples is 1:1.
Preferably, step S1 collects high-confidence data on miRNAs and the target sites they bind.
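The negative-sample construction of steps S11 and S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' code: the function name `build_negatives`, the `(mirna_id, site_id)` tuple layout, the fixed random seed, and the toy identifiers are all hypothetical. It draws random pairs and rejects known positives rather than first materializing every possible pair, which gives the same kind of sample.

```python
import random

def build_negatives(positives, n_negatives, seed=0):
    """Randomly pair miRNAs with target sites, skipping known positive pairs.

    positives   -- iterable of (mirna_id, site_id) pairs observed in the data
    n_negatives -- number of negative pairs to draw (18514 in the text)
    """
    positive_set = set(positives)
    mirnas = sorted({m for m, _ in positive_set})
    sites = sorted({s for _, s in positive_set})
    # Every miRNA/site combination that is not a positive is a candidate.
    max_possible = len(mirnas) * len(sites) - len(positive_set)
    if n_negatives > max_possible:
        raise ValueError("not enough non-positive pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(mirnas), rng.choice(sites))
        if pair not in positive_set:   # delete positives from the random pairs
            negatives.add(pair)
    return sorted(negatives)

# Toy data with hypothetical identifiers (not taken from the CLASH data set).
toy_positives = [("miR-a", "site1"), ("miR-b", "site2"), ("miR-c", "site3")]
toy_negatives = build_negatives(toy_positives, n_negatives=3)
```

Drawing as many negatives as there are positives reproduces the 1:1 ratio stated in S12.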
Preferably, step S2 is carried out as follows:
Traditional features of miRNA binding to its target gene are selected on the basis of literature reports, and their feature values are calculated according to the feature descriptions. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed region pairing, target site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking sequence, number of pairs in the duplex, target site length, length of the longest consecutive pairing, position of the longest consecutive pairing, number of pairs at the miRNA 3' end, difference between seed region pairing and 3' end pairing of the miRNA, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site.
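Several of the traditional features listed above are simple sequence-composition statistics. The sketch below shows how the target site length, GC content, and the AC, UG, AG, and CG counts might be computed. The function names, the dictionary keys, and the example sequence are hypothetical; the text does not specify how each feature is normalized, so the fraction-based GC content is an assumption.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in an RNA sequence (0.0 if empty)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def dinucleotide_count(seq, dinuc):
    """Count occurrences of a dinucleotide such as 'AC' at every offset."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)

def composition_features(site_seq):
    """Composition features of one target site sequence."""
    feats = {"length": len(site_seq), "gc": gc_content(site_seq)}
    for dinuc in ("AC", "UG", "AG", "CG"):
        feats["n_" + dinuc] = dinucleotide_count(site_seq, dinuc)
    return feats

# Made-up 8-nt sequence used only to exercise the functions.
example = composition_features("ACGUGCAG")
```

Features such as minimum free energy, accessibility, and conservation need external tools or alignments and are outside the scope of this sketch.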
Preferably, step S3 is carried out as follows:
S31. An improved Smith-Waterman algorithm, which follows the A:U and G:C complementary base-pairing rules and allows G:U mismatches, is used to perform sequence matching between the miRNA sequence and the target site sequence of each sample;
S32. Based on the sequence matching of S31, starting from the first nucleotide at the 5' end of the miRNA sequence, each miRNA nucleotide is compared with the corresponding nucleotide of the target site sequence; a match is represented by "1" and a non-match by "0". Because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a binary sequence of 23 values, each "0" or "1". If the miRNA is shorter than 23 nt, the missing feature values are filled with 0; if the miRNA is longer than 23 nt, the extra values are ignored. Finally, these 23 feature values are added to the feature set;
S33. From the binary sequences of the positive samples, the probability of successful pairing at each miRNA nucleotide position in the positive samples is calculated, and the weight vector w is constructed from these probabilities;
S34. The sequence matching scores are calculated as described below and added to the feature set;
For the matching state x_i of the i-th position on the miRNA, there is a corresponding weight w_i. "Full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, calculated as follows, where N (N = 23) is the sequence length:
Considering the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is used as a feature, giving "seed region matching feature 1", calculated as follows:
"Full-sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on the identification of miRNA target genes;
For the matching state x_i of the i-th position on the miRNA, if x_i = 1 the corresponding weight is w_i; if x_i = 0 the corresponding weight is q_i = 1 - w_i. "Full-sequence matching feature 2" is constructed as the average s_3 of the matching scores over the whole sequence, calculated as follows, where N (N = 23) is the sequence length:
"Seed region matching feature 2" is the average matching score over the seed region, calculated as follows:
These features take into account both successful and unsuccessful matches.
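The formulas for the four scores are not reproduced in this text. One reading consistent with the verbal description is given below. It is an assumption, not the patent's confirmed formulas: the symbols s_1, s_2, and s_4 are named by analogy with s_3, the seed-region scores are assumed to be averaged over the 7 seed positions, and unmatched positions are assumed to contribute q_i additively.

```latex
% Assumed reconstruction; x_i \in \{0,1\}, w_i = pairing probability at position i, N = 23
s_1 = \frac{1}{N}\sum_{i=1}^{N} w_i x_i
\qquad
s_2 = \frac{1}{7}\sum_{i=2}^{8} w_i x_i

c_i = x_i w_i + (1 - x_i)\, q_i, \qquad q_i = 1 - w_i

s_3 = \frac{1}{N}\sum_{i=1}^{N} c_i
\qquad
s_4 = \frac{1}{7}\sum_{i=2}^{8} c_i
```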
Preferably, the parameter optimization scheme and result for the random forest model built in step S4 are as follows:
A random forest has two important parameters: n_estimators is the number of trees in the forest, and max_features is the number of features considered each time a decision tree is grown. For n_estimators, every multiple of 100 from 100 to 1000 was tested (100, 200, ..., 1000); for max_features, all values available in the scikit-learn toolkit were examined. n_estimators=400 and max_features=4 were finally chosen as the model parameters.
Step S4 optimizes the feature set and the random forest parameters and builds the optimal model for identifying miRNA target genes.
The method is based on the CLASH data set, proposes miRNA-target site matching sequence features, combines them with a series of traditional features, and builds a model with random forest to identify miRNA target genes. The model was compared with two other models reported in the literature that were built on the same data set. The experimental results show that the accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient of the model reach 90.05%, 89.47%, 90.56%, 90.43%, and 0.7998 respectively, and that the areas under the ROC and PRC curves are 0.954 and 0.958 respectively. Compared with existing methods, this method performs better, which shows that the newly introduced miRNA-target site matching sequence features have a substantial influence on the identification of miRNA target genes.
The invention has the following advantages:
The method is based on the CLASH data set, proposes miRNA-target site matching sequence features, and builds a model with random forest, so that miRNA target genes can be predicted relatively accurately. Compared with existing methods, it has the following clear advantages:
(1) It uses the CLASH data set, in which every sample provides a miRNA together with the sequence of the target site it actually binds. The data sets used by traditional methods often do not provide the exact target site sequence bound by the miRNA, so sequence matching must first be used to find potential target sites on the mRNA that could bind the miRNA, after which samples are formed and sample features are calculated. Because these potential target sites may be inaccurate, the CLASH data set used by this method is more reliable than the sample sets constructed by conventional methods.
(2) Based on the pairing between the miRNA and its target site, each sample is converted into a binary sequence, matching scores are calculated from that sequence, and the miRNA-target site matching sequence features are constructed; these features give a better measure of the likelihood that a miRNA binds its target site.
(3) The model is built with the random forest algorithm, which can handle very high-dimensional data and is fast in both training and prediction. When the classes are imbalanced, the random forest algorithm can effectively balance the resulting differences in the data set. It maintains good prediction accuracy even when a large proportion of the data is missing, can reveal the interactions between features and their degree of importance, and is not prone to overfitting.
Description of the drawings
Fig. 1: Experimental flow chart.
Fig. 2: Binary representation of sequence matching.
Fig. 3: Comparison of matching in positive and negative samples.
Fig. 4: Difference in matching between positive and negative samples.
Fig. 5: Prediction results based on different feature subsets.
Fig. 6: ROC and PRC curves of the experimental results.
Detailed description of the embodiments
The present invention is further illustrated below with reference to the drawings and specific embodiments, but the embodiments do not limit the invention in any way.
Unless otherwise stated, the reagents, methods, and equipment used in the present invention are conventional in the art.
Unless otherwise stated, the reagents and materials used in the following embodiments are commercially available.
Embodiment 1: Experimental method
1. Experimental environment
Instrument: ASUS N551JM computer
Programming software: Anaconda3 Spyder, Visual Studio 2013
Programming languages: Python 3.5, C++
2. Positive and negative samples and their format
The positive samples are taken from the CLASH experimental data set, 18514 records in total. Each record contains the following information: the miRNA, the miRNA sequence, the mRNA to which the target site belongs (taken from the ENSEMBL database), the start position of the target site on the mRNA, the end position of the target site on the mRNA, and the target site sequence.
Because the number of target sites that a miRNA can bind is necessarily far smaller than the number it cannot bind, the negative samples were obtained by randomly pairing the miRNA and target site information contained in the positive samples, removing the positive samples from the resulting pairs, and then randomly selecting 18514 records.
Taking a positive sample as an example, the sample format is shown in Table 1:
Table 1. Sample format
3. Feature set
A total of 26 kinds of features (84 feature values) were selected; the specific feature set is shown in Table 2. The first 21 kinds of features, 57 feature values in total, have been reported in the literature. The last 5 kinds of features (shaded in the table) were constructed by this method and comprise 27 feature values; these feature values take full account of how the miRNA interacts with its target gene.
Table 2. miRNA-target site binding feature set
4. Feature selection
Feature selection addresses the computational problems of high-dimensional data: by removing redundant and irrelevant features, it improves the generalization performance and running efficiency of machine-learning algorithms. This method uses the minimal-redundancy-maximal-relevance algorithm (minimal redundancy maximal relevance criterion, mRMR) to rank the 84 features and selects the optimal feature subset to build the model.
5. Random forest
Random forest is an ensemble method composed of many decision trees. Because these decision trees are generated by a random procedure, they are also called random decision trees. The trees in a random forest are independent of one another. When test data enter the random forest, every decision tree classifies them, and the class chosen by the most trees is taken as the final result. This method uses random forest as the machine-learning method for training the model; the algorithm comes from the scikit-learn toolkit (http://scikit-learn.org/stable/), and the entire program was developed in Python. Two parameters were optimized: the number of trees in the forest and the number of features used for each tree.
6. Performance indicators
The performance of a classifier can be assessed with a number of independent indicators. To assess the performance of the model, five indicators are used: accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), and the Matthews correlation coefficient (Mcc). These indicators are calculated as follows:
Here, TP is the number of samples predicted to be positive that are in fact positive; TN is the number of samples predicted to be negative that are in fact negative; FN is the number of samples predicted to be negative that are in fact positive; FP is the number of samples predicted to be positive that are in fact negative. In addition, the receiver operating characteristic curve (ROC curve) and the precision-recall curve (PRC curve) are used to assess the performance of the model. The ROC curve is a combined indicator of sensitivity and specificity over a continuous variable: a series of different thresholds is set on the continuous variable, a sensitivity and a specificity are calculated for each threshold, and the curve is drawn with sensitivity on the vertical axis and (1 - specificity) on the horizontal axis, which reveals the relationship between the two. The closer the area under the curve is to 1, the better the model performs. The PRC curve is the corresponding combined indicator of precision and recall (sensitivity): a series of thresholds is set on the continuous variable, a precision and a sensitivity are calculated for each threshold, and the curve is drawn with precision on the vertical axis and sensitivity on the horizontal axis. Again, the closer the area under the curve is to 1, the better the model performs.
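The five indicators can be computed from the four confusion-matrix counts as sketched below. The formulas in the code are the standard textbook definitions of accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient; it is an assumption that the omitted formulas of the source match them. The function name and the example counts are hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification indicators from confusion-matrix counts.

    Assumes each actual class and the predicted-positive set are non-empty.
    """
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / total,      # share of all samples classified correctly
        "Sen": tp / (tp + fn),         # share of actual positives recovered
        "Spe": tn / (tn + fp),         # share of actual negatives recovered
        "Pre": tp / (tp + fp),         # share of predicted positives that are right
        "Mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Made-up counts for illustration only.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts the sketch gives Acc = 0.85, Sen = 0.90, and Spe = 0.80.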
7. Experimental procedure
Predicting the target genes of a miRNA is a machine-learning problem; the overall experimental procedure is shown in Fig. 1.
Step 1: Select the CLASH data set as the positive samples and construct negative samples from it: randomly pair the miRNAs and target site sequences in the CLASH data set, delete the positive samples among the resulting pairs, and randomly select 18514 of the remaining pairs as negative samples, giving a positive-to-negative ratio of 1:1;
Step 2: Calculate the feature values of the traditional features of each sample according to the calculation methods of the traditional features;
Step 3: Use the improved Smith-Waterman method to perform sequence matching on the positive and negative samples and convert the matches to binary sequences. Construct the weight vector w from the sequence matching of the positive samples, and use this vector to calculate the sequence matching score features of the positive and negative samples;
Step 4: Build the model with the random forest method and train its parameters;
Step 5: Test the model;
Step 6: Compare with other models and analyze the results.
Embodiment 2: Sequence feature analysis for predicting miRNA target genes
1. miRNA-target site matching
A miRNA and its target site do not match exactly, and the matching pattern varies widely. According to the pairing between the miRNA and its target site in each sample, this method represents the duplex formed by each miRNA and its target site as a binary sequence of "0" and "1" and analyzes the resulting binary sequences. The procedure is shown in Fig. 2, where the shaded part is the "seed region".
In Fig. 2, the BEYLA sequence is the target site sequence corresponding to miR-149. Sequence matching is first performed with the improved Smith-Waterman method, following the A:U and G:C complementary base-pairing rules and allowing G:U mismatches. Starting from the first nucleotide at the 5' end of the miR-149 sequence, each nucleotide is compared with the corresponding nucleotide of the BEYLA sequence. If they match, the position is represented by "1" and the two nucleotides are connected by a vertical bar "|"; if they do not match, the position is represented by "0". Either sequence may contain short dashes "-", which indicate that no nucleotide is present at that position. The match between the miR-149 sequence and the BEYLA target site sequence can thus be converted into the binary sequence "11111111011110111110010", which contains 23 feature values, each "0" or "1". Because most miRNAs in the CLASH data set are 23 nt long, the method converts the duplex formed by each miRNA and its target site into a sequence of 23 feature values, each "0" or "1". If the miRNA is shorter than 23 nt, the missing feature values are filled with 0; if the miRNA is longer than 23 nt, the extra values are ignored. Finally, the method adds these 23 feature values to the feature set.
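The conversion of an aligned duplex into 23 binary feature values can be sketched as follows. The alignment itself (the improved Smith-Waterman step) is taken as given. The function name, the input convention (two equal-length aligned strings, with the target written so that each column faces its partner), the treatment of G:U as a match, and the skipping of gap columns in the miRNA are assumptions made for illustration; the example strings are made up and are not the miR-149/BEYLA sequences.

```python
# Base pairs counted as a match: Watson-Crick pairs plus G:U (assumed).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def encode_duplex(mirna_aln, target_aln, n=23):
    """Turn an aligned miRNA/target pair into n binary match values.

    mirna_aln  -- miRNA read 5'->3', with '-' marking alignment gaps
    target_aln -- target-site bases facing each miRNA column, same length
    """
    if len(mirna_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    bits = []
    for m, t in zip(mirna_aln.upper(), target_aln.upper()):
        if m == "-":              # gap in the miRNA: no miRNA position to score
            continue
        bits.append(1 if (m, t) in PAIRS else 0)
    bits = bits[:n]                         # ignore positions beyond n
    return bits + [0] * (n - len(bits))     # pad short miRNAs with 0

# Hypothetical 7-nt "miRNA" with one gap column and one mismatch (A facing C).
bits = encode_duplex("UGGAAU-C", "ACCUCGAG")
```

A gap on the target side leaves the miRNA nucleotide unpaired, so that position is scored 0.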
Using the numerical encoding described above, the CLASH data set and the randomly constructed negative samples were compared. Each sample was first subjected to sequence matching and then converted into a binary sequence of "0" and "1", and the probability of successful pairing at each position was counted. The result is shown in Fig. 3.
In Fig. 3, the horizontal axis is the position of each nucleotide of the miRNA, and the vertical axis is the probability of successful pairing at each position on the miRNA. The upper curve shows the probability of successful pairing at each miRNA position in the positive samples, and the lower curve shows the same probability in the negative samples. The figure shows that the positive samples as a whole are better matched than the negative samples, and that the advantage of the positive samples is especially clear before position 20. It also shows that, in both positive and negative samples, the probability of successful pairing at the two ends of the sequence is far lower than at the middle positions. To display the difference between positive and negative samples more directly, the method calculates the difference at each position; the result is shown in Fig. 4.
In Fig. 4, the horizontal axis is the miRNA nucleotide position and the vertical axis is the difference in matching between positive and negative samples at each position. The analysis shows that the difference between positive and negative samples is much larger at positions 2 to 8 than at other positions, which agrees with earlier findings that pairing in the miRNA seed region plays a very important role in miRNA target gene identification.
Based on these findings, the method constructs a weight vector w from the matching success rate at each position in the positive samples. Based on this vector, several ways of scoring the matching sequence of a miRNA are proposed, yielding 4 key features.
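The per-position pairing probabilities plotted in Fig. 3, their difference in Fig. 4, and the weight vector w can be sketched as column means over the binary sequences. The function name and the 4-position toy rows are hypothetical (real rows have 23 positions), and using the raw positive-sample probability as w without further normalization is an assumption.

```python
def match_probabilities(binary_rows):
    """Per-position fraction of samples whose miRNA position is paired."""
    rows = list(binary_rows)
    if not rows:
        raise ValueError("need at least one sample")
    n = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(n)]

# Toy binary match vectors; values are invented for illustration.
pos = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
neg = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

w = match_probabilities(pos)                 # weight vector from positives
p_neg = match_probabilities(neg)             # same statistic for negatives
diff = [a - b for a, b in zip(w, p_neg)]     # per-position difference
```

For the toy rows, w is [1.0, 0.75, 0.75, 0.25] and the difference is [0.75, 0.5, 0.5, 0.0].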
Feature 1. For the matching state x_i of the i-th position on the miRNA, there is a corresponding weight w_i. "Full-sequence matching feature 1" is therefore constructed as the average of the matching scores over all positions, calculated by formula (6), where N (N = 23) is the sequence length:
Feature 2. Considering the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is used as a feature, giving "seed region matching feature 1", calculated by formula (7):
"Full-sequence matching feature 1" and "seed region matching feature 1" capture the influence of successful pairing on the identification of miRNA target genes.
Feature 3. For the matching state x_i of the i-th position on the miRNA, if x_i = 1 the corresponding weight is w_i; if x_i = 0 the corresponding weight is q_i = 1 - w_i. "Full-sequence matching feature 2" is constructed as the average s_3 of the matching scores over the whole sequence, calculated by formulas (8)-(9), where N (N = 23) is the sequence length:
Feature 4. "Seed region matching feature 2" is the average matching score over the seed region, calculated by formula (10):
Feature 3 and feature 4 take into account both successful and unsuccessful matches.
Therefore, according to the miRNA-target site pairing, 23 sequence features and 4 sequence score features ("full-sequence matching feature 1", "seed region matching feature 1", "full-sequence matching feature 2", and "seed region matching feature 2") are constructed, 27 feature values in total.
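The four score features can be sketched in code as follows. Formulas (6) to (10) are not reproduced in this text, so the sketch follows the verbal description under assumptions: scores are plain averages, the seed-region scores are averaged over the 7 seed positions, and an unmatched position contributes q_i = 1 - w_i. The function name and the uniform placeholder weights are hypothetical; the binary sequence is the one quoted in the text for miR-149.

```python
SEED = range(1, 8)  # zero-based indices of miRNA positions 2..8

def match_scores(x, w):
    """Return (s1, s2, s3, s4) for one binary match vector x and weights w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    n = len(x)
    hit = [wi * xi for xi, wi in zip(x, w)]                  # matched positions only
    both = [wi if xi else 1 - wi for xi, wi in zip(x, w)]    # w_i if matched, q_i if not
    s1 = sum(hit) / n                                # full-sequence matching feature 1
    s2 = sum(hit[i] for i in SEED) / len(SEED)       # seed region matching feature 1
    s3 = sum(both) / n                               # full-sequence matching feature 2
    s4 = sum(both[i] for i in SEED) / len(SEED)      # seed region matching feature 2
    return s1, s2, s3, s4

x = [int(c) for c in "11111111011110111110010"]   # binary sequence from the text
w_placeholder = [0.5] * 23                        # placeholder weights, not real data
s1, s2, s3, s4 = match_scores(x, w_placeholder)
```

Together with the 23 binary values, these four scores make up the 27 feature values counted above.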
2 feature selectings
Comprising the feature set of 84 features according to constructed by table 2, in order to study the contribution of each feature, using mRMR method It is sorted to each feature, preceding 29 feature rankings are as shown in table 3.
Table 3. Ranking of the top 29 features
As the table shows, the constructed "seed region matching characteristic 1" ranks 4th, "complete sequence matching characteristic 1" ranks 5th, "seed region matching characteristic 2" ranks 8th and "complete sequence matching characteristic 2" ranks 9th, indicating that the newly constructed features contribute substantially to the identification of miRNA target genes. It can also be seen that traditional features such as minimum free energy, conservation and seed region pairing all play an important role in identifying miRNA target genes.
Following the feature ranking, feature subsets were formed in steps of 1 from the top 85, 84, ..., 3, 2, 1 ranked features. A model was built on each subset, and Acc, Sen, Spe, Pre and Mcc were calculated to assess its performance; the results are shown in Figure 5.
As Figure 5 shows, model performance is essentially unchanged once the feature subset contains more than 29 features, so the top 29 features were finally selected as the feature subset. Of these 29 features, 13 were proposed by this method (shaded in Table 2), which shows that the proposed features are effective.
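The five indicators are not defined in this part of the text. Assuming they carry their standard meanings (accuracy, sensitivity, specificity, precision and Matthews correlation coefficient), they can be computed from confusion-matrix counts as in the sketch below; the function name and the example counts are hypothetical.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)          # sensitivity (recall)
    spe = tn / (tn + fp)          # specificity
    pre = tp / (tp + fp)          # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Spe": spe, "Pre": pre, "Mcc": mcc}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```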
3. Parameter training
Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated. For n_estimators, all values from 100 to 1000 in steps of 100 (100, 200, ..., 1000) were tested. For max_feature, all values available in the scikit-learn package were examined. The results show that the model performed best with n_estimators=400 and max_feature=4.
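A search of this kind could be sketched with scikit-learn as below. Several points are assumptions: the text does not say how each parameter pair was scored, so cross-validated accuracy is used here; the helper `tune_random_forest` and the synthetic data are hypothetical; and the grid is shortened so the demo runs quickly. Note that scikit-learn spells the parameter `max_features`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X, y, n_estimators_grid, max_features_grid, cv=5):
    """Exhaustive search over the two random forest parameters."""
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={
            "n_estimators": list(n_estimators_grid),
            "max_features": list(max_features_grid),
        },
        scoring="accuracy",  # assumption: the source does not name the criterion
        cv=cv,
    )
    grid.fit(X, y)
    return grid.best_params_, grid.best_score_

# Synthetic stand-in for a 29-feature data set.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=29, n_informative=8, random_state=0
)
# The grid described in the text would be range(100, 1001, 100).
best_params, best_score = tune_random_forest(
    X_demo, y_demo, n_estimators_grid=[100, 200], max_features_grid=[4, "sqrt"], cv=3
)
```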
4. Robustness assessment
Following the steps above, a model based on the random forest algorithm was built and used to predict miRNA target genes. To study the robustness of the model, the negative samples were randomly sampled 10 times; for each resulting data set a model was built and every performance index was calculated. The results are shown in Table 4.
Table 4. Model robustness assessment results
Table 4 shows that the mean accuracy, sensitivity, specificity, precision and Matthews correlation coefficient are 90.05%, 89.47%, 90.56%, 90.43% and 0.7998, respectively, and that the relative standard deviation (RSD%) is below 1.6% in every case.
The results show that the model established by this method is highly robust. In addition, for the run with the highest accuracy, ROC and PRC curves were plotted (Fig. 6); the areas under the curves are 0.9537 and 0.9584, respectively, indicating that the model performs well for miRNA target gene prediction.
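As an illustration of how a mean and RSD% over repeated runs can be obtained, the sketch below uses the sample standard deviation (n - 1 denominator), which is an assumption since the text does not state which form was used. The run values are hypothetical and are not the data behind Table 4.

```python
from statistics import mean, stdev

def mean_and_rsd(values):
    """Return the mean and the relative standard deviation in percent."""
    m = mean(values)
    return m, 100.0 * stdev(values) / m

# Hypothetical accuracies (%) from repeated negative-sample draws.
run_accuracies = [90.0, 91.0, 89.0, 90.0]
avg, rsd = mean_and_rsd(run_accuracies)
```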
Embodiment 3 Model construction and method for predicting miRNA target genes
Based on the above analysis, a method and model for predicting miRNA target genes were constructed, as follows:
1. Collect the data set (high-confidence miRNAs together with the target sites they bind) and construct positive and negative samples
The CLASH data set is selected as the positive samples, and negative samples are constructed from it: the miRNAs and target site sequences in the CLASH data set are randomly paired, pairs that are positive samples are deleted, and 18514 pairs are randomly selected from the remainder as negative samples;
(1) Positive sample data are selected from the CLASH data set; each record includes the miRNA, the miRNA sequence, the mRNA to which the target site belongs, the start and end positions of the target site on the mRNA, and the target site sequence;
The mRNA to which the target site belongs is taken from the ENSEMBL database;
(2) The miRNA and target site information from the positive samples is randomly paired, pairs that coincide with positive samples are removed, and 18514 records are randomly selected from the rest as negative samples, giving a positive-to-negative ratio of 1:1.
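The negative-sample construction in step (2) can be sketched as follows. The function name, the tuple representation of a sample and the toy identifiers are hypothetical. Enumerating every miRNA-site combination is done here for clarity; for a data set of the size described, rejection sampling may be more practical.

```python
import random

def build_negative_samples(positive_pairs, n_negative, seed=0):
    """Randomly pair miRNAs with target sites, drop known positives, then sample."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    mirnas = sorted({m for m, _ in positives})
    sites = sorted({s for _, s in positives})
    # All re-pairings that are not themselves positive samples.
    candidates = [(m, s) for m in mirnas for s in sites if (m, s) not in positives]
    if n_negative > len(candidates):
        raise ValueError("not enough non-positive pairs to sample from")
    return rng.sample(candidates, n_negative)

# Toy positives; a 1:1 ratio means drawing as many negatives as positives.
toy_positives = [("miR-a", "site1"), ("miR-b", "site2"), ("miR-c", "site3")]
toy_negatives = build_negative_samples(toy_positives, n_negative=len(toy_positives))
```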
2. Select the traditional features of miRNA binding to its target gene, calculate the values of these features for each sample according to their calculation methods, and combine the values into a sample feature vector;
Based on published reports, traditional features of miRNA binding to its target gene are selected and their values are calculated according to the feature descriptions. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed region pairing, target site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking region, number of paired bases in the duplex, target site length, length of the longest continuous pairing, position of the longest continuous pairing, number of paired bases at the miRNA 3' end, difference between miRNA seed region pairing and 3' end pairing, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site.
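A few of the simpler composition features in this list can be sketched directly. The code below is illustrative only: it assumes that the AC, UG, AG and CG numbers are overlapping dinucleotide counts and that GC content is the fraction of G and C bases, neither of which is spelled out in the text. Features such as minimum free energy, accessibility and conservation need external tools and are not covered.

```python
def composition_features(site):
    """Dinucleotide counts and GC content for a target site sequence."""
    site = site.upper().replace("T", "U")  # accept DNA or RNA alphabet
    n = len(site)

    def count_dinucleotide(dinuc):
        # Overlapping count of a two-base pattern.
        return sum(1 for i in range(n - 1) if site[i:i + 2] == dinuc)

    features = {f"{d}_count": count_dinucleotide(d) for d in ("AC", "UG", "AG", "CG")}
    features["GC_content"] = (site.count("G") + site.count("C")) / n if n else 0.0
    features["length"] = n
    return features

# Hypothetical 8-nt sequence, for illustration.
example_features = composition_features("ACGUGAGC")
```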
3. Calculate the miRNA-target site binding sequence features and construct the sample feature vectors
The positive and negative samples are sequence-matched using an improved Smith-Waterman method and converted into binary sequences. A weight vector w is then constructed from the matching pattern of the positive samples and used to calculate the sequence matching score features of the positive and negative samples. The proposed miRNA-target site matching sequence features are combined with the traditional features to form a feature set of 84 feature values. The specific method is as follows:
(1) Using the improved Smith-Waterman algorithm, that is, following the A:U and G:C complementary base-pairing rules while allowing G:U mismatches, the miRNA sequence and target site sequence of each sample are sequence-matched;
(2) Based on the sequence match from (1), starting from the first nucleotide at the 5' end of the miRNA sequence, each nucleotide is compared with the corresponding nucleotide of the target site sequence; a match is denoted "1" and a mismatch "0". Because most miRNAs in the CLASH data set have a length of 23, the duplex formed by each miRNA and its target site is converted into a binary sequence of 23 "0" or "1" values: if the miRNA is shorter than 23, the missing feature values are padded with 0; if it is longer than 23, the extra feature values are ignored. Finally, these 23 feature values are added to the feature set;
(3) From the binary sequences of the positive samples, the probability of successful pairing at each nucleotide position of the miRNA in the positive samples is calculated and used to construct the weight vector w;
(4) The sequence matching scores are calculated as described below and added to the feature set;
For the match state x_i of the i-th position of the miRNA there is a corresponding weight w_i. "Complete sequence matching characteristic 1" is therefore constructed as the average of the matching scores over all positions, calculated as follows, where N (N = 23) is the sequence length:
Given the importance of the miRNA seed sequence (positions 2 to 8), the matching score of the miRNA seed region is taken as a feature, "seed region matching characteristic 1", calculated as follows:
"Complete sequence matching characteristic 1" and "seed region matching characteristic 1" capture the influence of successful pairing on miRNA target gene identification;
For the match state x_i of the i-th position of the miRNA: if x_i = 1, the corresponding weight is w_i; if x_i = 0, the corresponding weight is q_i = 1 - w_i. "Complete sequence matching characteristic 2" is constructed as the average matching score s_3 over the whole sequence, calculated as follows, where N (N = 23) is the sequence length:
"Seed region matching characteristic 2" is obtained by averaging the matching scores over the seed region, as shown below:
These features account for both successful and unsuccessful pairing at each position.
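Steps (2) to (4) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the alignment from step (1) is taken as given (the input `partner` is a hypothetical string holding the target-site base opposite each miRNA position), G:U is counted as a match, and the averages divide by 23 for the whole sequence and by 7 for seed positions 2 to 8.

```python
N = 23
SEED = range(1, 8)  # 0-based indices of miRNA positions 2..8
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def match_vector(mirna, partner, n=N):
    """Binary vector: 1 where the miRNA base pairs with the aligned target base."""
    bits = [1 if (m, t) in PAIRS else 0 for m, t in zip(mirna[:n], partner[:n])]
    return bits + [0] * (n - len(bits))  # pad short miRNAs with 0

def weight_vector(positive_vectors):
    """Per-position pairing probability over the positive samples."""
    count = len(positive_vectors)
    length = len(positive_vectors[0])
    return [sum(v[i] for v in positive_vectors) / count for i in range(length)]

def score_features(x, w):
    """Return (s1, s2, s3, s4) for one binary vector x and weight vector w."""
    hit = [xi * wi for xi, wi in zip(x, w)]              # matches only
    t = [wi if xi else 1 - wi for xi, wi in zip(x, w)]   # matches and mismatches
    s1 = sum(hit) / len(x)
    s2 = sum(hit[i] for i in SEED) / len(SEED)
    s3 = sum(t) / len(x)
    s4 = sum(t[i] for i in SEED) / len(SEED)
    return s1, s2, s3, s4

# Hypothetical aligned pairs, for illustration only.
v1 = match_vector("UGAGGUAG", "ACUCCAUC")
v2 = match_vector("UGAGGUAG", "AAAAAAAA")
w = weight_vector([v1, v2])
s1, s2, s3, s4 = score_features(v1, w)
```

One consequence of the padding rule is visible in the example: padded positions have x_i = 0 and w_i = 0, so each contributes a score of 1 to s3.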
4. Build the miRNA target gene prediction model with the random forest method, identify miRNA target genes, and train the model parameters; the feature set and the random forest parameters are optimized to build the optimal model for identifying miRNA target genes.
The parameter optimization scheme and results for the random forest model are as follows:
Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated. For n_estimators, all multiples of 100 from 100 to 1000 (100, 200, ..., 1000) were tested. For max_feature, all values available in the scikit-learn toolkit were examined. Finally, n_estimators=400 and max_feature=4 were used as the model parameters.
5. Model testing.
Embodiment 4 Comparison with other methods
1. To verify the effectiveness of the newly constructed features, a miRNA target gene prediction model was built on the traditional feature set alone and compared with the model used in this method.
In addition, to further verify model performance, this method was compared with two other models, MirTarget and TarPmiR, built on the same data set.
2. The results are shown in Table 5.
Table 5. Comparison of different methods
The results show that model performance improves considerably once the newly constructed features are added: accuracy improves by 6%, specificity and precision by nearly 5%, sensitivity by nearly 9%, and the areas under the ROC and PRC curves by about 10%, further demonstrating the effectiveness of the newly constructed features. Comparison with the existing TarPmiR and MirTarget methods shows that the model used in this method performs better overall: its accuracy is 8% and 5% higher than that of TarPmiR and MirTarget, respectively. The areas under the ROC and PRC curves of this model are both above 0.95, which also demonstrates the stability of its performance.

Claims (5)

1. A sequence feature analysis method for predicting miRNA target genes, characterized by comprising the following steps:
S1: Collect the data set and construct positive and negative samples
The CLASH data set is selected as the positive samples, and negative samples are constructed from it: the miRNAs and target site sequences in the CLASH data set are randomly paired, pairs that are positive samples are deleted, and 18514 pairs are randomly selected from the remainder as negative samples;
S2: Calculate the feature values of the traditional features of the samples according to the calculation methods of the traditional features
According to the calculation methods of the traditional features, the feature values of the traditional features of each sample are calculated and combined into a sample feature vector. The traditional features include: the minimum free energy of the duplex formed by the miRNA and its target site, miRNA seed region pairing, target site accessibility, AU content near the seed region, conservation of the seed region, conservation of the flanking region, number of paired bases in the duplex, target site length, length of the longest continuous pairing, position of the longest continuous pairing, number of paired bases at the miRNA 3' end, difference between miRNA seed region pairing and 3' end pairing, miRNA pseudo-dinucleotide features, target site sequence pseudo-dinucleotide features, number of AC in the target site, number of UG in the target site, number of AG in the target site, number of CG in the target site, GC content of the target site, GC content upstream of the target site, and GC content at the 3' end of the target site;
S3: Calculate the miRNA-target site binding sequence features and construct the sample feature vectors
The positive and negative samples are sequence-matched using an improved Smith-Waterman method and converted into binary sequences; a weight vector w is then constructed from the matching pattern of the positive samples and used to calculate the sequence matching score features of the positive and negative samples; the proposed miRNA-target site matching sequence features are combined with the traditional features to form a feature set of 84 feature values;
S4: Build the model and identify miRNA target genes
The miRNA target gene prediction model is built with the random forest method, and the model parameters are trained;
S5: Model testing.
2. The method according to claim 1, characterized in that step S1 specifically comprises:
S11. Positive sample data are selected from the CLASH data set; each record includes the miRNA, the miRNA sequence, the name of the mRNA to which the target site belongs, the start and end positions of the target site on the mRNA, and the target site sequence;
wherein the mRNA to which the target site belongs is taken from the ENSEMBL database;
S12. The miRNA and target site information from the positive samples is randomly paired, pairs that coincide with positive samples are removed, and 18514 records are randomly selected from the rest as negative samples, wherein the positive-to-negative ratio is 1:1.
3. The method according to claim 1, characterized in that step S2 specifically comprises: selecting the traditional features of miRNA binding to its target gene, and calculating their feature values according to the feature descriptions.
4. The method according to claim 1, characterized in that step S3 specifically comprises:
S31. Using the improved Smith-Waterman algorithm, that is, following the A:U and G:C complementary base-pairing rules while allowing G:U mismatches, the miRNA sequence and target site sequence of each sample are sequence-matched;
S32. Based on the sequence match from S31, starting from the first nucleotide at the 5' end of the miRNA sequence, each nucleotide is compared with the corresponding nucleotide of the target site sequence; a match is denoted "1" and a mismatch "0"; because most miRNAs in the CLASH data set have a length of 23, the duplex formed by each miRNA and its target site is converted into a binary sequence of 23 "0" or "1" values; if the miRNA is shorter than 23, the missing feature values are padded with 0; if it is longer than 23, the extra feature values are ignored; finally, these 23 feature values are added to the feature set;
S33. From the binary sequences of the positive samples, the probability of successful pairing at each nucleotide position of the miRNA in the positive samples is calculated and used to construct the weight vector w;
S34. The sequence matching scores are calculated as described below and added to the feature set;
For the match state x_i of the i-th position of the miRNA there is a corresponding weight w_i; "complete sequence matching characteristic 1" is therefore constructed as the average of the matching scores over all positions, calculated as follows, wherein N is the sequence length, N = 23, and S_1 is "complete sequence matching characteristic 1":
Given the importance of the miRNA seed sequence, which refers to positions 2 to 8 counted from the 5' end, the matching score of the miRNA seed region is taken as a feature, "seed region matching characteristic 1", calculated as follows, wherein S_2 is "seed region matching characteristic 1":
"Complete sequence matching characteristic 1" and "seed region matching characteristic 1" capture the influence of successful pairing on miRNA target gene identification;
For the match state x_i of the i-th position of the miRNA: if x_i = 1, the corresponding weight is w_i; if x_i = 0, the corresponding weight is q_i = 1 - w_i; "complete sequence matching characteristic 2" is constructed as the average S_3 of the matching scores over the whole sequence, calculated as follows, wherein N is the sequence length, N = 23, and q_i is the "weight of the i-th position":
"Seed region matching characteristic 2" is obtained by averaging the matching scores over the seed region, calculated as follows, wherein S_4 is "seed region matching characteristic 2" and t_i is the "matching score of the i-th position":
5. The sequence feature analysis method for predicting miRNA target genes according to claim 1, characterized in that the parameter optimization scheme and results for the random forest model of step S4 are as follows:
Random forest has two important parameters: n_estimators, the number of trees in the forest, and max_feature, the number of features considered each time a decision tree is generated; for n_estimators, all multiples of 100 from 100 to 1000 are tested; for max_feature, all values available in the scikit-learn toolkit are examined; finally, n_estimators=400 and max_feature=4 are used as the model parameters.
CN201611081932.9A 2016-11-30 2016-11-30 A kind of sequence signature analysis method for predicting miRNA target gene Active CN106599615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611081932.9A CN106599615B (en) 2016-11-30 2016-11-30 A kind of sequence signature analysis method for predicting miRNA target gene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611081932.9A CN106599615B (en) 2016-11-30 2016-11-30 A kind of sequence signature analysis method for predicting miRNA target gene

Publications (2)

Publication Number Publication Date
CN106599615A CN106599615A (en) 2017-04-26
CN106599615B true CN106599615B (en) 2019-04-05






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant