CN107609351A

CN107609351A - A method for predicting pseudouridine modification sites based on convolutional neural networks

Info

Publication number: CN107609351A
Application number: CN201710989588.1A
Authority: CN
Inventors: 樊永显; 李永贞; 杨辉华; 蔡国永; 张向文
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2017-10-23
Filing date: 2017-10-23
Publication date: 2018-01-19

Abstract

The invention discloses a method for predicting pseudouridine modification sites based on convolutional neural networks, characterized in that it comprises the following steps: 1) data set preparation and conversion; 2) construction and training of a convolutional neural network model; 3) interception and encoding of the sequence to be predicted; 4) feature extraction and prediction. The method can improve the accuracy of pseudouridine site prediction and allows pseudouridine site prediction to be applied more widely.

Description

A method for predicting pseudouridine modification sites based on convolutional neural networks

Technical field

The present invention relates to techniques for predicting pseudouridine modification sites in RNA sequences produced during gene transcription, and specifically to a method for predicting pseudouridine modification sites based on convolutional neural networks.

Background art

Many RNAs are modified during gene transcription. To date, more than 100 kinds of RNA modification have been discovered. These covalent RNA modifications have been studied by chemical methods for about 12 years. Some of them occur at many positions within an organism. They influence the secondary and tertiary structure of RNA and the speed and precision of gene expression, help maintain RNA stability, help RNA to be decoded correctly on the ribosome, and help prevent certain diseases, so they are important for normal biological function.

Among these more than one hundred modifications, pseudouridine was the first to be discovered and is so far the most abundant RNA modification. Pseudouridine modification was long known to occur in non-coding RNAs such as tRNA, rRNA and sRNA. Later, Thomas M. Carlile et al. found by high-throughput sequencing that pseudouridine sites are also present on mRNA in human and yeast cells. Pseudouridine is an isomer of uridine, formed under certain conditions by the transfer of a covalent bond. In eukaryotes, for example, pseudouridylation is catalyzed mainly by box H/ACA RNPs. Each hairpin of a box H/ACA RNA has two bulge loops; it recognizes a specific RNA sequence and base-pairs with it at the structure below the bulge loop. Under catalysis by a specific enzyme, the uracil to the right of the unpaired position at the top of the bulge loop is acted upon: the chemical structure of the uracil is rotated 180° about the axis through positions 3 and 6, and C5 is then rotated clockwise to the bottom, so that the original C-N bond to the ribose becomes a C-C bond, forming pseudouridine.

Pseudouridine can alter RNA structure, increase base stacking, improve base pairing, and rigidify the ribose-phosphate backbone. It is directly or indirectly associated with neurological disorders, the X-linked bone marrow failure syndrome dyskeratosis congenita, Parkinson's disease, and other conditions. Because of its special structure and chemical properties, and its biological and medical significance, research on pseudouridine sites is attracting growing attention. To identify pseudouridine sites, a high-throughput sequencing technique called Ψ-seq was proposed (Carlile, T.M. et al. Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature 515, 143 (2014)), which maps the Ψ sites of several species comprehensively and at high resolution. However, this technique relies on sequencing, which is costly and time-consuming and becomes harder as sequence length grows. There is therefore an urgent need for more convenient computational algorithms that extract information about pseudouridine sites and then predict them.

At present, Li, Y. et al. (Li, Y.H., Zhang, G. & Cui, Q. PPUS: a web server to predict PUS-specific pseudouridine sites. Bioinformatics 31, 3362 (2015)) and Chen, W. et al. (Wei, C., Hua, T., Jing, Y., Hao, L. & Chou, K.C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids 5, e332 (2016)) intercept gene sequences and then encode them; Chen, W. et al. additionally include the physicochemical properties of nucleotides in the encoding. Both then use the LIBSVM algorithm for feature extraction and classification to identify pseudouridine sites. However, the accuracy of LIBSVM-based feature extraction and classification leaves room for improvement. To predict pseudouridine sites more accurately, a more effective algorithm for sequence feature extraction is needed.

Summary of the invention

The object of the present invention is to address the shortcomings of the prior art by providing a method for predicting pseudouridine modification sites based on convolutional neural networks. The method can improve the accuracy of pseudouridine site prediction and allows pseudouridine site prediction to be applied more widely.

The technical solution that achieves the object of the invention is:

A method for predicting pseudouridine modification sites based on convolutional neural networks, comprising the following steps:

1) Data set preparation and conversion: take the data sets for three species, yeast, human and house mouse, from Wei, C., Hua, T., Jing, Y., Hao, L. & Chou, K.C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids 5, e332 (2016), each consisting of positive samples that contain a pseudouridine site and negative samples that do not, and encode them. Each sample in the human and house mouse data sets is converted into a 20 × 20 matrix, and each sample in the yeast data set is converted into a 20 × 30 matrix;

2) Model construction and training of the convolutional neural network: build a convolutional neural network (CNN) structure and use the positive and negative samples converted to matrices in step 1) as the CNN input, keeping the positive and negative samples balanced. Adjust the number of CNN layers and the number and size of the convolution kernels, then use the tuned CNN structure to extract features from the data set sequences and train a model containing the feature vectors;

3) Interception and encoding of the sequence to be predicted: arrange the full sequence to be examined in FASTA format, that is, the first character of the first line is '>', followed by a description of the sequence, and the next line holds the sequence to be predicted. Intercept the sequence with a sliding window of the same length as the data set samples of step 1), so that each intercepted sequence has the same form as a data set sample, and convert each intercepted sequence into the matrix form of step 1);

4) Feature extraction and prediction: use the conversion result of step 3) as the prediction set. After feature extraction by the convolutional neural network, the model trained in step 2) predicts the input sequence. The window then slides toward the end of the sequence to be predicted, and the interception and conversion of step 3) and the prediction of step 4) are repeated until the end of the full sequence is reached, finally yielding the predicted pseudouridine sites.
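The FASTA input described in step 3) can be read with a few lines of code. The following is a minimal sketch, not the applicant's implementation: the function name `read_fasta` and the example record are hypothetical, and converting a DNA-style T to U is an assumption added for convenience.

```python
from io import StringIO


def read_fasta(handle):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Assumption: normalise to upper-case RNA letters (T is read as U).
            chunks.append(line.upper().replace("T", "U"))
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical input: one header line starting with '>' and two sequence lines.
example = StringIO(">query_1 hypothetical transcript\nacgu\nuugca\n")
records = read_fasta(example)
```

Sequence lines are concatenated, so a sequence wrapped over several lines is handled the same way as one given on a single line.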

The encoding in step 1) is as follows. An RNA sequence contains four ribonucleotides, A, U, G and C. Taking any two in order as a group gives 16 combinations, which are encoded with a 16-dimensional shift encoding, so that each combination is encoded as a 16-dimensional column vector. For a sample sequence, the two adjacent nucleotides at the left end are encoded first; the window then moves right by one nucleotide and the next two adjacent nucleotides are encoded, and this is repeated until the last nucleotide. Each pair of adjacent nucleotides is thus converted into a 16-dimensional column vector. This encoding alone is not sufficient; to represent the features more accurately, the chemical properties of the nucleotides, listed in Table 1, are added. Dimension 17 represents the ring structure of the first nucleotide of the pair: purine is represented by '1' and pyrimidine by '0'. Dimension 18 represents the functional group of the first nucleotide of the pair: amino is represented by '1' and keto by '0'. Dimension 19 represents the strength of the hydrogen bonding when the first nucleotide of the pair forms its complementary base pair: strong is represented by '1' and weak by '0'. Dimension 20 is the proportion of nucleotides of the same type as the first nucleotide of the pair in the sample after its last nucleotide has been removed. A sample sequence of L+R+1 nucleotides is therefore converted by the encoding into a matrix of size 20 × (L+R).

Table 1: Chemical properties of the ribonucleotides
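The encoding can be sketched in code as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the 16-dimensional shift encoding is taken to be a one-hot vector with the row order AA, AC, AG, AU, CA, ..., UU, which the text does not specify; the body of Table 1 is not present in this text, so the property values are filled in from standard nucleotide chemistry (A and G are purines, A and C carry an amino group, G and C form the stronger three-hydrogen-bond pair); and the names `encode_sample`, `DINUCLEOTIDES` and `CHEMICAL_PROPERTIES` are hypothetical.

```python
NUCLEOTIDES = "ACGU"
# Assumed one-hot row order for the 16 dinucleotides: AA, AC, AG, AU, CA, ...
DINUCLEOTIDES = [a + b for a in NUCLEOTIDES for b in NUCLEOTIDES]
# (ring structure, functional group, hydrogen-bond strength) per nucleotide,
# using the 1/0 conventions stated in the text.
CHEMICAL_PROPERTIES = {
    "A": (1, 1, 0),  # purine, amino, weak
    "C": (0, 1, 1),  # pyrimidine, amino, strong
    "G": (1, 0, 1),  # purine, keto, strong
    "U": (0, 0, 0),  # pyrimidine, keto, weak
}


def encode_sample(seq):
    """Encode an RNA sample of n nucleotides as a 20 x (n - 1) matrix (list of rows)."""
    seq = seq.upper()
    if len(seq) < 2 or any(ch not in NUCLEOTIDES for ch in seq):
        raise ValueError("expected at least two nucleotides from A, C, G, U")
    body = seq[:-1]  # the sample with its last nucleotide removed
    columns = []
    for i in range(len(seq) - 1):
        pair, first = seq[i:i + 2], seq[i]
        one_hot = [1.0 if pair == d else 0.0 for d in DINUCLEOTIDES]  # dims 1-16
        props = [float(v) for v in CHEMICAL_PROPERTIES[first]]        # dims 17-19
        ratio = body.count(first) / len(body)                         # dim 20
        columns.append(one_hot + props + [ratio])
    # Transpose so that the rows are the 20 feature dimensions.
    return [list(row) for row in zip(*columns)]
```

With this sketch a 21-nucleotide human or mouse sample yields a 20 × 20 matrix and a 31-nucleotide yeast sample a 20 × 30 matrix, which matches the sizes given in step 1).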

Convolutional neural networks are applied to sequence feature extraction in pseudouridine site prediction.

The method uses a convolutional neural network, a deep learning algorithm, to extract sequence features and make predictions.

The beneficial effects of the method are as follows. Pseudouridine plays an important role in normal biological function, so pseudouridine sites need to be predicted accurately. Convolutional neural networks can automatically mine deep, hidden features of the data; compared with the support vector machine (SVM) algorithm used in the prior art, they extract sequence features better and thereby improve the accuracy of pseudouridine site prediction.

The method can improve the accuracy of pseudouridine site prediction and allows pseudouridine site prediction to be applied more widely.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the embodiment;

Fig. 2 is a schematic diagram of how a pseudouridine site is formed in the embodiment;

Fig. 3 is a schematic diagram of the shift encoding of a sequence in the embodiment;

Fig. 4 is a schematic diagram of the CNN structure in the embodiment, taking the human species as an example.

Embodiment

The present invention is further described below with reference to the drawings and an embodiment, which do not limit the invention.

Embodiment：

Pseudouridine is an isomer of uridine. During RNA transcription, as shown in Fig. 2, enzyme catalysis rotates the chemical structure of the uracil 180° about the axis through positions 3 and 6, and C5 is then rotated clockwise to the bottom, so that the original C-N bond to the ribose becomes a C-C bond, forming pseudouridine.

Referring to Fig. 1, a method for predicting pseudouridine modification sites based on convolutional neural networks comprises the following steps:

The specific encoding is as follows. Shift encoding is carried out first, as shown in Fig. 3. An RNA sequence contains four ribonucleotides, A, U, G and C. Taking any two in order as a group gives 16 combinations, which are encoded with a 16-dimensional shift encoding, so that each combination is encoded as a 16-dimensional column vector. For a sample sequence, the two adjacent nucleotides at the left end are encoded first; the window then moves right by one nucleotide and the next two adjacent nucleotides are encoded, and this is repeated until the last nucleotide. Each pair of adjacent nucleotides is thus converted into a 16-dimensional column vector. This encoding alone is not sufficient; to represent the features more accurately, the chemical properties of the nucleotides, listed in Table 1, are added. Dimension 17 represents the ring structure of the first nucleotide of the pair: purine is represented by '1' and pyrimidine by '0'. Dimension 18 represents the functional group of the first nucleotide of the pair: amino is represented by '1' and keto by '0'. Dimension 19 represents the strength of the hydrogen bonding when the first nucleotide of the pair forms its complementary base pair: strong is represented by '1' and weak by '0'. Dimension 20 is the proportion of nucleotides of the same type as the first nucleotide of the pair in the sample after its last nucleotide has been removed. For example, the encoding result R(AGAUCU) of the sequence 'AGAUCU' is given by formula (1):

Table 1: Chemical properties of the ribonucleotides
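Neither formula (1) nor the body of Table 1 is present in this text. As a hedged worked example of the rules stated above, the sequence 'AGAUCU' gives the five overlapping dinucleotides AG, GA, AU, UC and CU, and therefore five columns. Rows 17 to 20 are shown below under two assumptions: the property values come from standard nucleotide chemistry (A and G are purines, A and C carry an amino group, G and C form the strong pair), and row 20 is read as the frequency of the first nucleotide of each pair within 'AGAUC'. Rows 1 to 16 are omitted because the ordering of the 16 combinations is not stated, and the original formula (1) may differ.

```latex
\begin{array}{l|ccccc}
 & \mathrm{AG} & \mathrm{GA} & \mathrm{AU} & \mathrm{UC} & \mathrm{CU} \\ \hline
\text{row 17 (ring structure)}   & 1   & 1   & 1   & 0   & 0   \\
\text{row 18 (functional group)} & 1   & 0   & 1   & 0   & 1   \\
\text{row 19 (hydrogen bond)}    & 0   & 1   & 0   & 0   & 1   \\
\text{row 20 (proportion)}       & 0.4 & 0.2 & 0.4 & 0.2 & 0.2
\end{array}
```

The proportions follow from 'AGAUC' containing two A and one each of G, U and C, so A gives 2/5 and the others 1/5.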

2) Model construction and training of the convolutional neural network: build the structure of the convolutional neural network and use the positive and negative samples converted to matrices in step 1) as the CNN input, keeping the positive and negative samples balanced. Adjust the number of CNN layers and the number and size of the convolution kernels; Fig. 4 shows the tuned convolutional neural network structure for the human species. Then use the tuned CNN structure to extract features from the data set sequences and train a model containing the feature vectors;

3) Interception and encoding of the sequence to be predicted: a sliding window is used to intercept and encode the full sequence to be predicted. Arrange the full sequence to be examined in FASTA format, that is, the first character of the first line is '>', followed by a description of the sequence, and the next line holds the sequence to be predicted. Intercept the sequence with a sliding window of the same length as the data set samples of step 1), so that each intercepted sequence has the same form as a data set sample. The sequence to be predicted is therefore intercepted as shown in formula (2): taking the site U to be predicted as the reference, L nucleotides are taken upstream and R nucleotides downstream, so that the intercepted sequence is L+R+1 nucleotides long.

$$S(U) = N_{-L}\,N_{-(L-1)}\,N_{-(L-2)} \cdots N_{-2}\,N_{-1}\,U\,N_{+1}\,N_{+2} \cdots N_{+(R-2)}\,N_{+(R-1)}\,N_{+R} \tag{2}$$

In accordance with the length of the data set samples, L = R = 10 is used if the sequence to be predicted comes from human or house mouse, and L = R = 15 is used if it comes from yeast. Each intercepted sequence is then converted into the matrix form of step 1);

4) Feature extraction and prediction: use the conversion result of step 3) as the prediction set. After feature extraction by the convolutional neural network, the model trained in step 2) predicts the input sequence. The window then slides toward the end of the sequence to be predicted, and the interception and conversion of step 3) and the prediction of step 4) are repeated until the end of the full sequence is reached, finally yielding the pseudouridine sites predicted in the full sequence.
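The interception of formula (2) and the repeated prediction of step 4) can be sketched as below. This is a minimal illustration, not the applicant's code: the function names are hypothetical, `classify` is a placeholder callable standing in for "encode the window and run the trained CNN", and skipping any U that lacks a full set of flanking nucleotides is an assumption, since the text does not say how sites near the sequence ends are treated.

```python
def intercept_windows(seq, left, right):
    """Yield (index, window) for every U that has `left` upstream and
    `right` downstream nucleotides; index is 0-based within `seq`."""
    seq = seq.upper()
    for i in range(left, len(seq) - right):
        if seq[i] == "U":
            # Window of length left + right + 1 centred on the candidate U.
            yield i, seq[i - left:i + right + 1]


def scan_sequence(seq, left, right, classify):
    """Return the 0-based positions whose window `classify` labels as a site.

    `classify` is a hypothetical callable: it takes the intercepted window
    and returns True when the trained model predicts a pseudouridine site.
    """
    return [i for i, window in intercept_windows(seq, left, right)
            if classify(window)]
```

For human or house mouse sequences the call would use `left=10, right=10`, and for yeast `left=15, right=15`, following the values of L and R given in step 3).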

Fig. 1 shows the steps of pseudouridine site prediction based on convolutional neural networks. First, the data set is prepared and encoded, converting the sequence data set into matrix form. Second, the convolutional neural network model is built and then trained on the matrices obtained from the data set. Next, a sliding window intercepts the sequence to be predicted, and each intercepted sequence is encoded. Finally, after feature extraction by the convolutional neural network, the trained model predicts the input sequence.
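To make the feature-extraction step concrete, the sketch below runs a forward pass of a very small convolutional network over an encoded 20 × 20 sample. It is only an illustration under several assumptions: the architecture of Fig. 4 is not reproduced here, the kernel count and width are placeholders, the 20 feature rows are treated as channels with kernels sliding along the sequence axis (the patent may use 2-D kernels instead), and the weights are random and untrained. All names are hypothetical.

```python
import math
import random


def conv1d_relu(matrix, kernels, biases):
    """Slide each kernel (rows x width) along the columns, then apply ReLU."""
    rows, cols = len(matrix), len(matrix[0])
    maps = []
    for kernel, bias in zip(kernels, biases):
        width = len(kernel[0])
        fmap = []
        for start in range(cols - width + 1):
            total = bias
            for r in range(rows):
                for k in range(width):
                    total += kernel[r][k] * matrix[r][start + k]
            fmap.append(max(0.0, total))
        maps.append(fmap)
    return maps


def forward(matrix, params):
    """Convolution -> global max pooling -> dense layer -> sigmoid score."""
    maps = conv1d_relu(matrix, params["kernels"], params["conv_bias"])
    pooled = [max(fmap) for fmap in maps]
    logit = params["dense_bias"] + sum(
        w * p for w, p in zip(params["dense_w"], pooled))
    return 1.0 / (1.0 + math.exp(-logit))


def init_params(rows=20, n_kernels=4, width=3, seed=0):
    """Random, untrained parameters; the sizes are placeholders, not Fig. 4."""
    rng = random.Random(seed)
    return {
        "kernels": [[[rng.uniform(-0.1, 0.1) for _ in range(width)]
                     for _ in range(rows)] for _ in range(n_kernels)],
        "conv_bias": [0.0] * n_kernels,
        "dense_w": [rng.uniform(-0.1, 0.1) for _ in range(n_kernels)],
        "dense_bias": 0.0,
    }


params = init_params()
sample = [[0.0] * 20 for _ in range(20)]  # stand-in for an encoded 20 x 20 sample
score = forward(sample, params)           # untrained, so the value carries no meaning
```

In a real system the parameters would be fitted to the balanced positive and negative samples of step 2), and the score would be thresholded to decide whether the central U is a pseudouridine site.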

Experimental example：

Three independent test sets, S(4), S(5) and S(6), are used to make predictions for the three species. S(4), S(5) and S(6) come from human, yeast and house mouse respectively. S(4) and S(5) are taken from the paper (Wei, C., Hua, T., Jing, Y., Hao, L. & Chou, K.C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids 5, e332 (2016)), while S(6) has to be constructed separately according to the method of this embodiment. Each of S(4), S(5) and S(6) contains 100 positive samples with a pseudouridine site and 100 negative samples without one. The prediction results are shown in Table 2:

Table 2: Comparison of the prediction results of the method of this embodiment with those of the only two existing predictors

As Table 2 shows, the predictions made with the method of this embodiment indicate that the CNN clearly outperforms PPUS and iRNA-PseU, the only two predictors currently available, both of which are based on the support vector machine (SVM) algorithm.
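A comparison of this kind on a balanced test set is usually summarised with a few standard metrics. The text names only accuracy, and the body of Table 2 is not present here, so the additional metrics below (sensitivity, specificity and the Matthews correlation coefficient) are an assumption about what such a comparison commonly reports. The function name `evaluate` is hypothetical; labels are 1 for a pseudouridine site and 0 otherwise.

```python
import math


def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and MCC from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is defined as 0 here when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

On a test set with 100 positive and 100 negative samples, `y_true` would hold 100 ones and 100 zeros, and `y_pred` the corresponding model decisions.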

Sequence table

<110>Guilin Electronic Science and Technology Univ.

<120>A kind of method based on convolutional neural networks prediction pseudouridine decorating site

<141> 2017-10-20

<160> 3

<170> SIPOSequenceListing 1.0

<210> 2

<211> 6200

<212> RNA

<213> Saccharomyces cerevisiae

<400> 2

cuaucaucgc ugaucuccca cucccugauc ugaagagguc aucgguucga uuccgguugc 60

guguaagaug caagaguucg aaucucuuag caagcgaaag auuagaaauc uuuugggcuu 120

ugccgguuaa ggcgaaagau uagaaaucuu uuggguuuag gaccgagcuu uuaguggaug 180

ucaucaggac acuucugaug uuucaaaaga uauuccaggu acuggacgag aaucgcagaa 240

caauuugacg uagauguuug uuguucaccc acaacugaag aguugucgag uuuuuugagg 300

uuaagaauga aaggucgaaa aaguuucagg caguuucuca gcguugggcc cccgguucga 360

uuccgggcuu gcugguaaaa uccaacguug ccaucguugg gccuaagcgc aagugguuua 420

gugguaaaau ccaagguuaa ggcgaaagau uagaaaucuu uuggggcgaa agauuagaaa 480

ucuuuugggc uuugccggcu ucauuaacau guacuucaac uacggaagug gagaucaucg 540

guucaaaucc gauuggaauu ugguuuucaa guguaauagg cuacgugauc agugguucaa 600

gacgucgccu uuacacggcg uagugguuau cacuuucggu uuugauccgg acacuuucgg 660

uuuugauccg gacaaccccg guaauugauc uauguuguag cugcgcuggc ggcaacucca 720

guucuuuauc uucuuucucc gcuggcgucu gacuucuaau cagaagauua uggguucuuc 780

cgugauaguu uaauggucag aaugggcaga augggcgcuu gucgcgugcc agaucgggug 840

ccagaucggg guucaauucc ccgucgcgag aaaaagccaa ugaugagaua caagccauua 900

ucgacauaug cugguuacau ggcaguagaa gaauauacau ucuauuaucg aaccuggcca 960

ugaaacaaga uuucuguagc auacucgcuu cauacuuguu uucuuuuuug ugccuuuguu 1020

acguugcuuu guggaaguuc gaaacuccaa aguaugagug auggaagugu aguuauccgg 1080

agaucagggu caaaucuucg uugaccguca auuacaugca gcacaaauuu guagacaggc 1140

ugguuugagg auuacuugga cauuaacggu ucuccuauuc aagacaaaag uguucuuuca 1200

ucugcagugu uggcguacag auuguaguug uggcugcuac cuuuuuuaau guccguuucu 1260

augauugggc uauuguucga agguaaugcc uugaucagaa gacuguuggu ccuuaguucg 1320

auccugagug cgagcagcag auugcaaauc uguugguccu uaguuuaucc gauauagugu 1380

aacggcuauc acauccgugg agaccggggu ucgacucccc guaucgggua uguuauuuau 1440

guaacgggua ugcgaacauu cuuuuuuuga uguaauagga uaagcuugcu guucuuuuca 1500

guguaacaac ugaaaugacu guaguaucug uucuuuucag uguaacaacu guguaguauc 1560

uguucuuuuc aguguaacaa caaguguagu aucuguucuu uucaguguaa cacaagugua 1620

guaucuguuc uuuucagugu aacaucaagu guaguaucug uucuuuucag uguaucauug 1680

uucuuggauu ucaaugggug cugucuaaau uucgccacug uagaugaaga agacgaaaaa 1740

ugagaagagu guagauguau uauccuucca agauagacua uguaauggua aagaacauau 1800

ggcggcgggu gccuuuggag cagcaaucga uggugugguc acuguaagag auuggcccca 1860

ccauggacga gccuguagua uacaacggua aacaaagguc uuccuaugau uccggcguuc 1920

gucuuucuca uacccuguag accagaccuc ucuagaauac uuugaagguu uaaccgagga 1980

aaugcgugga gaccgggguu cgacuccccg uaucguuauc cgauauagug uaacggcuau 2040

cacaucggac acuucugaug uuucaaaaga uauuccauaa cugugggaau acucagguau 2100

cguaagaugu aagaugcaag aguucgaauc ucuuagcaaa caauuuucac aguuuaaggc 2160

caagaacaag gcccguuuac acauuuugau acaaccguag acgggagguc ccggguucga 2220

gucccggcuc gcgauucucg cuuagggugc gggagguccc ggggcgugcg acuguuaauc 2280

gcaagaucgu gagucgcaag aucgugaguu caacccucac uggggcguug ggcccccggu 2340

ucgauuccgg gcuugcuggu aaaauccaac guugccaucg uugggcccua gcgcaagugg 2400

uuuaguggua aaauccaagg cuguguucuu cuuucuaaau ucccuaucgg gaaaaacccg 2460

uugcuagaag cgcaacuggu gaaaaaaguu cagaauugca gaaaaguggu gagugguuuc 2520

cuaguguauc agccacuauc ggcauaaggu uagggguucg agcccccuac agggcaaucg 2580

guagcgcgua ugacucuuaa ucauaaacaa aagaagcugu uccagagagc ccaagccgga 2640

caaccccggu ucgaauccgg guaggacacu uucgguuuug auccggacaa ccccgguggu 2700

uaucacuuuc gguuuugauc cggacaacua gugguuauca cuuucgguuu ugauccggau 2760

guugccgcua aguguaagga agucgguauc cugguauauu cuauauacuc acuuauuacu 2820

uuucugguau auucuauaua cucacuuauu acaauggcuc uuuuuguuau ucgaaagcuu 2880

acauaaaaag uucggcuauc ucuugggcuc ugccucugcc cgcgcugguu caaauccugc 2940

uggugcaugg augauauuug uaguauggcg gaaaacgugg agaucaucgg uucaaauccg 3000

auuggaaaua cuauucaguu ucucagauau agguugcagc aauuggaaaa aucuauuaac 3060

ccagaugaac cagugcgucu acuauuacuc ggccaaauau ucguaauuug agaucucugc 3120

aaaacaaugc aaaacaaugc accuccuggc aaaaacauca auaaacauca augucaauug 3180

uuugaacguc aaugaacguc aauucuuguu cguuguccgc aagcaauuaa uauggcuugu 3240

aauggaaaca agcaaaaaca agcaagaucu ucccauaccg uuucccaccg uuuccccugc 3300

auguagaaug caacgauauc aauguuuaau cauaacagau caaagagcau caaagagcag 3360

ugguacuaca gaugcgucaa gguacgcaua agcgugaacc ccggucgacg ccggucgacg 3420

auacauacag agcuguuaca auauagcaaa ggacaguaga aaccugagua auccugagua 3480

auggaucuuu gaaugauauu aacugauauu aacgaaaaug aagagcucca aaaugcucca 3540

aaauuuccau agaaaaauca gcgaauuccc caaggaaaaa uagcgaaacc agaaagguua 3600

augcgggaag auuacauugc cuugaaaugc cuugaaacaa ccuccaagcu ugggagauga 3660

ggagaucucg ucguuuaaga accaagucaa accaagucau ucgguaacaa guuccaagac 3720

guuccaagac auuacugucg aaccucaauc ccguagguaa aguguauuua gugagggaac 3780

gccgcgacaa gugaucaucc auuuauugug acauauugug acacuguauc auuccuuuca 3840

aaccaaacau auuacugcau caaucugguc augucugguc augucaugcu uucugacuuu 3900

gauuuaagau acaaaaauuu guucagaugg auucagaugg auucagaacu aauuccuuug 3960

uugguacuug gcuguacucc auuuaaagga gauaauucac gucaaauuuc cacaugauaa 4020

ggaaguuucg ggaaguuucg aagaauugua aagaccugau acuucuucaa aaaaguucag 4080

uggucguucu uaccccccuc uaauaccugc auuaaaugau aacuccuuuu auauugucuu 4140

gcaauaaaca cccgaaacga ugaugaaauu gaugaggcug auccaggcug auccauucca 4200

ugauuuuaau ucuaugcuac ucugaaaauu auaccuacgg aaaaauuauc uuaugauaaa 4260

aauguaaaaa auuauuuaaa acgagaaagu gaaugaaaaa uauaauauca uauaauauca 4320

uuuauugucu gauaaugcug uacguaccau ccgcaucagu ggauauccaa ugauauccaa 4380

ugauaguaau uucgcgaguu uacgcgaguu uauccguugc uguuauauua ucauauauua 4440

ucacuuuuua auauucuuuu caaaggauuc cuuccgcaau ucuucugaaa uacugcucgc 4500

caguuuuuug uucuuccacg uaauccccuu auuaacggag auuugauuuc ucccagcacc 4560

gauucgagug aguacguuuu caaauauguu caaauaugcu uaaucugauc ucuucugcgg 4620

ccgaucugug ccauuauagu aagcagugcc aagcagugcc acuugucuaa uauaagauga 4680

ugauuuuacc guuuucuggg gacaucauga uacaucauga uaucauuugg uacauaauga 4740

acagauaauu ggauuucuug cauuuuuugc gauuuuuugc gauuauggcu uguugaccau 4800

ucacaaccau ucacaaaagu uggucuaaca uaauuuuaag uccuuguaau auucuagcuu 4860

uugagucucu gggaguggua aaucuacuga ccaucuucuu uuauccaauc aucuggcaag 4920

uccuuaauuu ucaucucuaa aauuuagaua uggacguuug uggacguuug agauauuuuc 4980

guauuucugc cuauuucugc caauucuucc uuuaacugug acuaacugug accguacuga 5040

uucgcuuucc cuugcuuucc cuuugaauuu uuauuauacc cucucauuac uugcuuaucu 5100

gaauuuuuuu ccauuuuguu ugccuauccu uccaucugau gacuugugau gacuugaaau 5160

guucugacag guaagauucu caacauucuu aauccaaacg auguccuucu ccugcuugug 5220

uauuaaagga caugaaauau uucgcuacau guaauggaag aucaucugua gauucguauc 5280

uguaguuucu caucagcaag aucuuucaaa aacgcuugau uugcuggcac cucuuaauag 5340

cgcuuguuuc ugcuugcucu acccucuagg ugaacguuua aucugacauc cgggaaguuu 5400

gaugugaagu auucugcuaa ccguucaggu ccuucaccug uuuguccaag ggaguaaggg 5460

gacuuucugg cuuuuuuuuu uacgaaacuc uuccucauca ucuucagccu caacauuuuc 5520

caaccgcaac uucuuguucu ugcuuaugcc cugcuuauug uggguugucc cgccauuauu 5580

cgccauuauu guuaauagau ucaacaaaau auuuaucauu gaaaauucac gugaucgcaa 5640

uagaucgcaa uauuccguca ggagugauaa auaucgucau ugcacaaauu aguuuauuau 5700

ucacuauuau ucacgacucu uaacaacgac aauuuuagac aggucguccg uagauauuua 5760

cauaaauacu acacagacua cuauuagaau uugcgaaaau uugcgaagga uuuaccgaag 5820

aaaagcacag aaaagcacag accuuauuga gcuuuugaau caauaaccag gaguuucaaa 5880

aacaaacagg cacuuuucau ugaucuauuu gauaaaucug ccacuagagu ccaaucuacg 5940

cgacuuauug caacuuauug cauuccuugg aaggugaaag ucuugcacga ugguccaguu 6000

auugaugaau uuuuauuugg ccauucaacu ucauaagugg ucgguaaggu accaggaaag 6060

uuucugaaac caucccaauc uuuacucuac uuauuaucca uugcaucccc cugaauaucu 6120

uauuuuagca uuagucaaca uuagucaaag aaaugaagcg guucguuuug guucguuuua 6180

uugauagaaa acaggacagu 6200

<210> 3

<211> 4200

<212> RNA

<213> Homo sapiens

<400> 3

gcuaaacagg uacugcuggg cuuauugagu gucuacugug uggauaaacu guuacgcaua 60

uauuugucgg uguuaacaaa auggucgggc cuaguucaaa ccuuuuuuuu aaguauacag 120

gggucuggcc ggucuguagc ggaucacuag cuaucgcuuc ucggccuuug aaaguaacuu 180

ugcccgagca cuauucuguu aaaaucagga gcagcugccu uuccaacagc ccaaaaugac 240

uuucguucuu cuuucagaua cuuacauagu uuuccgaauc aacuuugccg uguugacuca 300

aaguuacucu ccuuccuacc caccuuuccc agaaguggac aauauauuaa auggauugag 360

gacaauauau uaaauggagu guaguaucug uucuuaucaa aguguaguau cuguucuuau 420

ucaaguguag uaucuguucu uagaucaagu guaguaucug uucuucucgg ccuuuuggcu 480

aagaguaauc gcuucucggc cuuugaguaa ucgcuucucg gccuugaugu auuguuugca 540

cucuucauga uucuauuaua guauucuugu uuuuguauug uugcuccuuu cuuuuuuuug 600

gccuuucucg cuaaacaggu acugcugggc ccauuaucgc uucucggccu ucauuaucgc 660

uucucggccu uuuguaauau uuuaucccug gacuaguauc uguucuuauc aguuguagua 720

ucuguucuua ucagugugua guaucuguuc uuaucaaagu guaguaucug uucuuauuca 780

aguguaguau cuguucuuag aucaagugua guaucuguug uauugagugu cuacugugug 840

uuuucaucac uauggcuuag cgcaucaaaa cuucacuuuu ugauuggugg uauaguggug 900

agcgauaaaa ggcuaauauc cagagguccc ugguucgauc ccgggagacu gaagaucuaa 960

agguccggga gagcguuaga cugaagaugg gagagcguua gacugaagaa auccuuucua 1020

aauugcaugc auaaaaaguu uuuucuucag agaguaugga uuccgauaug aaagacauga 1080

auaagaacug augacuuuca auuaucugug ugagccuuuu cuuuguuugu aacuagccau 1140

cagguaagcc aagaucuucu cggccuuuug gcuaagaucc aucgcuucuc ggccuuuggg 1200

cccagggugc uguggagaau uguccuccuu cugaagcccc cuccuuuucu gaggaaggug 1260

auuggaacga uacagagaag aagacuauac uuucagggau cagcgcccca auuauuauga 1320

cuguaaguua uuuugcucuc acuggcaauu ugguuccacc acaucacuca auacuuaccu 1380

ggcaggcacu caauacuuac cuggcagcug gcugcuguag gucuuuucau uguugauauu 1440

ugcccagcag ggccucaguu agcucucaag ucccauggug uaaugguuag cguuagcacu 1500

cuggacuuug aaggacuuug aauccagcga uccgcgaucc gaguucaaau cucgcgaucc 1560

gaguucaaau cucggucauu uuauguauau uuaucaccuu uccaguuacu ccuuauauaa 1620

guuauuuugc ucucacuguc aaguguagua ucuguucuug guaggugagu uuaaagucuu 1680

cucuuaccug uuaaaaucag ggcaacagag uucaacuauc uccauuugcu guuacucugg 1740

agaucaagug uaguaucugu ucuuguaaaa ggguuacucu cauacuuuua uuauuuggau 1800

gaauaucuuc ucggccuuuu ggcuaagaac uaucgcuucu cggccuuuaa acuaucgcuu 1860

cucggccuuc ccuggagguu ccaauccugc uucuccauga uucgugcauc ucuaauuaug 1920

cuggacuguu uuauuggaac gauacagaga agaauauuuc ucauuucuuu uaguuauacu 1980

aaaauuggaa cgauaauugg aacgauacag agaagaacac gcaaauucgu gaagcguaag 2040

uguaguaucu guucuuauuc aaguguagua ucuguucuuu caaguguagu aucuguucuu 2100

gugauauaac ucaguggcag aggccuugga uuucaucccc agggagaggg agugggaaca 2160

ggauuugcaa gacuccuagu accuugugua gcaauggugu ccaggaguaa caaguucagg 2220

uucaccgcaa agucacucua uucugauccc aaagguuuac uuaauguuua gguuccuguu 2280

gcuugccauc uaagagguuu guuguccuau uggaagucuu uuccuuuaaa gucucuuagc 2340

aucagacacu uaagagagag aaugagaauc aucguggaau gaauagacuu aacugucagg 2400

aggcugucuu acguacacaa uugcaugugg aagcugcaau aacucauucc uacagcccca 2460

caaacgguuu aagcuugagu cacaauaauc aucauuucau uccuucaaau aaaaaaaaau 2520

cauuucugaa uucagaugua ucuaucauag uuggguuuaa gaaucagaac auuggguaua 2580

uuccaccaug gugucuggga gcacacauua ccccucccuu cccgcaccaa cgaucugcuu 2640

gugaacagag cuuuagucca gagcaagccc ccgccuuuuu uucuguugua aauuuuguua 2700

ugcaauuaau uuagaggaau agggaaagug gacgugucug uuguuucuca aggguccgga 2760

cuguuugaca cugaugaaug cuuucucaaa aguuuaaaca guuucauuug gaaguagggu 2820

cgccuuaagu caacaucaca gaugcuccag caggcaacca uauguuuaga aauaaaacca 2880

gccgcggugc cagcaaagaa cagacacauu acuugaacuu guucugaguu cuacugucuu 2940

acccaaaugc ucggaaacuc ucuuaugacu gugacuucag aaaaagaagg auuccaaaga 3000

caaacucaaa uucuuagaug accaaggcag acaguaggaa gaguaaugga aauccuuuug 3060

uuuuguuguu cuguuguugu caagugcaaa aauauaauuu guugaauaug ugugcuucug 3120

uccuacuaca uuucuuccau uuuuaauuaa aaaguagagc uaggacccac ucuuguuccu 3180

guacucacug uaggacccca ccuaaaagua uaauccugag aguucacgcu gagccuuuuc 3240

ucucucuucc ugaaaacuga aguguuccca aagcuaugug uaaagguuug guucucaucu 3300

cucucucucu cucucucuug uaggugggua guaggugagc agcugggagu uaaauacucu 3360

guggaaccuc ucuaguuaaa aguaaccagu cugugggaag uaaaagcaac auucccugcu 3420

ggaggcucca ggauccuaag ggacgucugu acucuaaggg gacauuuaaa uugcaucucc 3480

cucauuaaau gaugacugau gcuacuaugu uuaaacauug gauuuaacgu uuauuucauu 3540

guuuuuauuu cacugugggu cugggcuuua agacccucau uuuagcugcc uagccuucag 3600

augaaggggg ggucucugcu aauuauacau cuggaguuca gccuucagaa cuugucagcc 3660

acccuacccu acuuggacca ugucuugaaa agacaagugg uugacuuugg guuucuuaug 3720

uguuuguuug uuuguuuguu uguuuugcuc cugacaccac cacccucuuu uaaguagauu 3780

gugaccagaa uaguaacuaa aauguugaau uuauuugcuu aacaaaugug gcucuaaauu 3840

uuaaggauca uuaugaaaga ugaauagcuc cccuuucucu gcuugugaac acguaugcca 3900

auggacucug cucccguguu acagugugac cuaacuuugg auacuuuuuc cucuauaguu 3960

aaccacauua auuucaaaau ugcagagaaa uggaucacuu ugcaucagua gggcugguaa 4020

auugaaauac uggaccauca cauauuuccu ggugcuucuu uguuuauuca uuuggcuauu 4080

ccauuguucc uguaccauca aucuuucuca guuugugaac augagcucuu gagauucauu 4140

caggaggucu cagaacacua aggcuuuauu gucuccuaau cuuaacucuu ggggcuggua 4200

<210> 3

<211> 4200

<212> RNA

<213> Mus musculus

<400> 3

gacucugcug uuccaaagga caacccagaa uuauuauuuc uuauucuugg uuuuuuuuuu 60

cucauguacu uuguaguggu uuaucugccu uuguuugauc ugagcuauuc uuauauuugu 120

uuuuuagcuu cugggguuug ugauucuucu gcacugcugu cccgagaccu cgcugcuuuu 180

cucaagcaaa ugccccaccu cuggacaagu ggcccugcac uaugauauau guucucaggg 240

uuagaucccc auugccagug gcuucauugg uggcuguuca cuguauuggg ggaaaacaaa 300

uccuuauuca gccuccccag gagguuccaa uccugcggga cccgacuuau uccuuagcgg 360

ucagcccucc gugugcuuuu acagacaauu ucaaagucag uuggugguau uaaagaagac 420

guccucacug uacagugcca aaacaaagau guucuuuugu cucauuugga uuugcauucc 480

agcuacuaag acuuguuggu agcccaccuc uuccuuaagc cugcugcaug gaugcuaugc 540

accccagaag uuuuuguaca ggcagauaag aagcaaguau uaggaccacu gguggcagug 600

gaagcaccac cugcuacucu acccaccaaa agguaccgcu uuccucaagu ggucuacaag 660

cuacacgugg uuccuuuuug aauuuguaag gacguaacau cuguauauuu aaucgaaggc 720

acacuuucag ccagcgucuu ugaaauauua guuucaucuu aacagauuag ugccuuugga 780

ucccaaguuc cuggugaacg cugcugcuuu ucauggucca cccagugacu aacaucugcc 840

gcgcugucuu uuccgaucuc guacauggag guuccucugg gggggcugcg gcuacuucug 900

cacaucggcu cuguagacac uuucuugccc agacuauaua auggcuuguu aaugauuuuu 960

uuuuuucuuu uggcucuaug agcucuggac uccaaacuuc aucauggcgc ucagcuacua 1020

caaccagaga guaauggguu agaaaccauc aguaaugggu uagaaaccau cacacucugc 1080

uuacggucag acucuggacu uuuacaucca cgaccucuug ucaucccugg aagcccucuu 1140

gucaucccug gaagcccagg agcccuacac uucuguagac cgggguucaa uuccuagaga 1200

ccgggguuca auuccuagcc ucuuuccgua ccauucuaga cuaacucugu agaacagucu 1260

uuaucuugua uucauuguac cgaugcugag guacugccuc gcuccaccuu uuuaccaaaa 1320

ugucugugau gucuacaaag uucuacucca agaguaugcg gcagcagaau uauauuauau 1380

ggacaccuac uacuacuacu acuaacuuga agcuguuuau aguagacuga uggccgacua 1440

acuaaaaagc cacgaugugu aucgagccac augcucacuu uauuauccau gggacuccuc 1500

uuuccaucgg aaaaguugag cauauguucu cagaguuuga cauuucgaaa ucugugguca 1560

guauuuuaau aauuagcuuu aaaguuauaa aaagcaaaca gaugucuuug uacccagagc 1620

ugcuauuuuu agauacuuag ucaacuuuua aaauaccacc auaggcaguu acauucugca 1680

guucuuucuu ucuuuuuuuu ucggaccagc uuacagaguc agcugcuaca uuuacauaga 1740

gugcaguuuc uuuuuucaga uuuuuuugua ucacuuugua gaccaacuag gaaucuacag 1800

auuaagugaa gcucuuuaua uaguugaguc uguaacauuc cugaugaucu ucauaaugua 1860

ucccuuacag gguccuuccu acaagaaaga acuuuuaaua uuaguagcag aauuuuuacu 1920

aucuauccau uacagccagu ccuguggcuu gcuagccugg aguucuaauc uucagaucuu 1980

gauuuaacag cagaggaaaa aggcauauag aaaauuugug acaguguagc ugugauucag 2040

ggcccgguuc augacccggc agucuucguu ugucagucaa aaagaagccu uuagugugug 2100

ucaacccacc ugcucucugu agacaguuug cuauuggggu gaauuuagau auucaucuag 2160

caaggugggg caguaauauc uuacccauuu uucauagaug ucuuuucaua ggcaauguaa 2220

gcuuuuaccc agcaccugua gaacaggucu guuuugguga gaaucuguuu ggugagaauc 2280

uguuuuggug gaacgggaau acccugcaug ccucuauugc ucauucuggg ccagcucauc 2340

auguagaugu uggaucuucu ccuaaaggcu uugacagacc cacuggucau agucacuccc 2400

uggauuguau ugcucgcaaa aucacuuaug cuucccuaua auuuucuggc ccuucuacac 2460

uguuauguca uuuuguuucu ugacugaguc uaucugugug accauagguu ucauacugug 2520

ucuagugcac uagugacauu ucccaauuca guuugauuuu uuugagguau uauaaugucc 2580

aaugaagcaa ggcuaacaaa gccuagucau caugucaagu ggacauaugg gaaaguaaaa 2640

uucaccaaau ucaccauaau acuagccacc auauguguaa ggaauuauaa cagggaguuu 2700

guaaucgguu uucagagcau ccauugucac ucuacacaag uagcagucau cgccuuagua 2760

auacaugauc uuuagaguau uauacucuac ucauugauuu gacuacauau uucuuaauga 2820

aaacuaugcu caacuuuuug cacaauggga agacuuaacc uguacauagc uguuuaauuu 2880

cucugacuca cgauaucccu guuuccaaug ugaagacagg caguguaaau agagaugaug 2940

gcagaaugcc ugaauuuaug gauugacugg cugggagccu ucgcaaugag uauuaauuaa 3000

uacagagaga gaauaggaua uggaguuaga gacguggaaa caaugccaug gagacuacag 3060

aguugacagc cugaacuugu acagcacauu aaucuacugg aaaguauaac ugggaagauu 3120

uccaggaaca uaaaauguau uugacucuug cuccaaauaa uaaauuaagg gggccuuaca 3180

cucacuggac agucaccccc ucugaugagc uguagaguug gacuauucug gugagcuuag 3240

ucccugggcc augccgcuug guuagugugu uuugugcucu uuaaaguuga gugauauacc 3300

ucauauauac aacacaccga agagacacuc aggguccuau aucuuuugcu guugagggac 3360

caaugcaggu ucaaggugac acacacuagg uuuagaguca uguguucugu gaucagugga 3420

ccaucuguau ggcugagaug aaauuugucc uucaucucac cauuguguag cccgauccuc 3480

uccucuugau gccuacucau uuuucaguuc uuacuucugc cagagucucc ugcuaaguuu 3540

cccauggaac auguacaucu gaaucuuugc aaccaagcag ugacugaccu uuaauuuggc 3600

aucuuugagu uggaaauccc aguacagagu aaccauuagc ugauaaauga gugagacaua 3660

aagcucugag caggcauugc aaugauaaaa ugaauaaaca caggacuaac uuuacauacu 3720

uaauuacuuc auaacugcaa aauaggaauu auagaucucu cauaaugcuu ugcccagugc 3780

uacugggugc uacuguacaa uaagcagccg auauagguag uucccaccca aguaauucau 3840

ucuccaguuu accuugcaaa accagaugca gagauaggcc ccaauaaaga ggaaaugguc 3900

auaauuuuau uuaauauuuc ccauauacac cucauuaucu gcuguaccuc auuauauaug 3960

aucuauuuuu uagucuuccu uagcaggccc aacucucagc uuuaugacuc cuuacgccag 4020

cucucugagg ccgcagauaa uguauuugug uugaauaaau cccucacaca cagauagcag 4080

cacauagcag cucaucgggc uuucauauca caucaccagc cucucauuag auccuugauu 4140

accacugugc cuucuucaca cagauguuug acacucaauu uuuacccccu ucugauuuac 4200

Claims

1. A method for predicting pseudouridine modification sites based on a convolutional neural network, characterized in that it comprises the following steps:

1) Data set preparation and conversion: select the data sets of three species (yeast, human, and house mouse) from Wei, C., Hua, T., Jing, Y., Hao, L. & Chou, K.C. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids 5, e332 (2016), each consisting of positive samples that contain a pseudouridine site and negative samples that do not; encode these data sets, converting each sample in the human and house mouse data sets into a 20 × 20 matrix and each sample in the yeast data set into a 20 × 30 matrix;

2) Model construction and training of the convolutional neural network model: build the structure of the convolutional neural network and use the positive and negative samples converted into matrices in step 1) as its input, while keeping the positive and negative samples balanced; adjust the number of CNN layers and the number and size of the convolution kernels; then use the adjusted convolutional neural network structure to extract features from the data set sequences and train a model that includes the feature vectors;

3) Interception and encoding of the sequence to be predicted: arrange the whole sequence to be examined in FASTA format, i.e. the first character of the first line is '>', followed by a description of the sequence, and the next line is the sequence to be predicted; intercept the sequence to be predicted with a sliding window whose length equals that of the data set samples in step 1), so that each intercepted sequence has the same form as a data set sample; convert each intercepted sequence into the matrix form of step 1);

4) Feature extraction and prediction: input the conversion result of step 3) as the prediction set; after feature extraction by the convolutional neural network, the model trained in step 2) predicts the input sequence; then slide the window toward the end of the sequence to be predicted and repeat the interception and conversion of step 3) and the prediction of step 4) until the end of the whole sequence is reached, finally obtaining the predicted pseudouridine sites in the whole sequence to be predicted.
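The interception loop of steps 3) and 4) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names `read_fasta`, `sliding_windows`, and `predict_sites` are hypothetical, the trained model is stood in for by an arbitrary `classify` callable, and the restriction to windows whose centre nucleotide is U is an assumption that the claim does not state. A window width of 21 corresponds to the 20 × 20 matrices (human, house mouse) and a width of 31 to the 20 × 30 matrices (yeast), since a sample of L+R+1 nucleotides yields a 20 × (L+R) matrix.

```python
from typing import Callable, Iterator, List, Tuple


def read_fasta(text: str) -> Tuple[str, str]:
    """Return (description, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    # Join all sequence lines, drop internal spaces, normalise to upper case.
    sequence = "".join(lines[1:]).replace(" ", "").upper()
    return lines[0][1:].strip(), sequence


def sliding_windows(seq: str, width: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for each full-width window, one nucleotide per step."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


def predict_sites(seq: str, width: int,
                  classify: Callable[[str], bool]) -> List[int]:
    """Return 0-based positions of window centres that `classify` marks positive.

    Assumption: only windows centred on a uridine are candidate sites.
    `classify` is a placeholder for encoding the window and querying the
    trained convolutional neural network.
    """
    half = width // 2
    return [start + half
            for start, window in sliding_windows(seq, width)
            if window[half] == "U" and classify(window)]
```

For example, `predict_sites(seq, 21, classify)` walks a human or house mouse sequence from start to end and collects the centre positions of all windows that the supplied classifier accepts.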

2. The method for predicting pseudouridine modification sites based on a convolutional neural network according to claim 1, characterized in that the encoding in step 1) is as follows: an RNA sequence contains four kinds of ribonucleotides, A, U, G, and C; taking any two in order as a group gives 16 combinations in total, which are encoded with a 16-dimensional shift encoding, so that each pair is encoded as one 16-dimensional column vector; for a sample sequence, two adjacent nucleotides are taken and encoded from left to right, the position is then moved one nucleotide to the right and the next two adjacent nucleotides are shift-encoded, and this operation is repeated up to the last nucleotide; under this encoding, every two adjacent nucleotides are converted into one 16-dimensional column vector, to which the chemical properties of the nucleotides, shown in Table 1, are added: the 17th dimension represents the ring structure of the first nucleotide of the pair, with purine represented by '1' and pyrimidine by '0'; the 18th dimension represents the functional group of the first nucleotide of the pair, with amino represented by '1' and keto by '0'; the 19th dimension represents the strength of the hydrogen bond formed by the first nucleotide of the pair in complementary pairing, with strong represented by '1' and weak by '0'; the 20th dimension represents the proportion of nucleotides of the same type as the first nucleotide of the pair in the sample after its last nucleotide is removed; a sample sequence consisting of L+R+1 nucleotides is thus converted by the encoding into a matrix of size 20 × (L+R).

Table 1. Chemical properties of the ribonucleotides
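The encoding of claim 2 can be illustrated with a short sketch under several stated assumptions. The name `encode_sample` is hypothetical; the 16-dimensional shift encoding is read here as a one-hot vector over the 16 dinucleotides; the ordering of those 16 combinations (derived from the order A, U, G, C) is an assumption, since the claim does not fix it; and the values in `CHEMISTRY` are filled in from standard nucleotide chemistry using the '1'/'0' conventions stated in the claim, because the body of Table 1 is not reproduced in this text.

```python
from itertools import product
from typing import List

BASES = "AUGC"  # assumed ordering; the claim lists A, U, G, C
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]  # 16 pairs

# (ring: purine=1 / pyrimidine=0,
#  functional group: amino=1 / keto=0,
#  hydrogen bond: strong=1 / weak=0)
# Values assumed from standard chemistry, not copied from Table 1.
CHEMISTRY = {
    "A": (1, 1, 0),
    "C": (0, 1, 1),
    "G": (1, 0, 1),
    "U": (0, 0, 0),
}


def encode_sample(sample: str) -> List[List[float]]:
    """Encode a sample of n nucleotides as a 20 x (n - 1) matrix (list of rows)."""
    seq = sample.upper()
    body = seq[:-1]  # the sample with its last nucleotide removed
    columns = []
    for i in range(len(seq) - 1):
        first, pair = seq[i], seq[i:i + 2]
        # Dimensions 1-16: one-hot indicator of the adjacent pair.
        one_hot = [1.0 if pair == d else 0.0 for d in DINUCLEOTIDES]
        # Dimensions 17-19: chemical properties of the first nucleotide.
        chem = [float(v) for v in CHEMISTRY[first]]
        # Dimension 20: share of that nucleotide type within `body`.
        freq = body.count(first) / len(body)
        columns.append(one_hot + chem + [freq])
    # Transpose so that rows are the 20 dimensions and columns are positions.
    return [list(row) for row in zip(*columns)]
```

With this reading, a 21-nucleotide human or house mouse sample gives a 20 × 20 matrix and a 31-nucleotide yeast sample gives a 20 × 30 matrix, matching the sizes in step 1) of claim 1.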

3. The method for predicting pseudouridine modification sites based on a convolutional neural network according to claim 1, characterized in that a convolutional neural network is applied to extract sequence features for pseudouridine site prediction.