CN109754844B - Method for predicting plant endogenous siRNA on whole genome level - Google Patents

Info

Publication number
CN109754844B
Authority
CN
China
Prior art keywords
mites
sirna
sequence
test
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910020480.0A
Other languages
Chinese (zh)
Other versions
CN109754844A (en
Inventor
张德强
卜琛皞
宋跃朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Forestry University
Original Assignee
Beijing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Forestry University filed Critical Beijing Forestry University
Priority to CN201910020480.0A priority Critical patent/CN109754844B/en
Publication of CN109754844A publication Critical patent/CN109754844A/en
Application granted granted Critical
Publication of CN109754844B publication Critical patent/CN109754844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Breeding Of Plants And Reproduction By Means Of Culturing (AREA)

Abstract

The invention provides a method for predicting plant endogenous siRNAs at the whole genome level, which belongs to the technical field of bioinformatics. The method uses MITE-Hunter to detect MITEs elements in whole genome sequence data of plants, predicts 24-nt siRNAs based on the MITEs elements, and finally verifies the siRNAs using Pln24NT. By combining these two bioinformatics tools, the invention increases prediction throughput, so that siRNAs can be predicted rapidly from large volumes of data, and the method is generally applicable to plants.

Description

Method for predicting plant endogenous siRNA on whole genome level
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a method for predicting plant endogenous siRNA on the whole genome level.
Background
Transposons (transposable elements, TEs) are widely present in eukaryotic genomes and constitute their largest component. For example, at least 31% of the dog genome consists of transposons, the proportion in the human genome is up to 46%, and in maize transposons occupy as much as 85% of the entire genome. Transposons are classified into two major classes, DNA-mediated and RNA-mediated, depending on the manner of transposition. Research shows that retrotransposons account for 34% of the human genome and mainly consist of Alu sequences, SVA (SINE-VNTR-Alu) elements, and long interspersed elements (L1). DNA transposons can be further classified, according to structural features, into terminal inverted repeat (TIR) transposons, miniature inverted-repeat transposable elements (MITEs), Helitrons and the like.
The transposon is one of the sources of endogenous small-molecule RNA in animals and plants, and can generate siRNA by induction. Research shows that transposon-derived small-molecule RNA and non-coding RNA have close relation and participate in a plurality of important regulation functions. Transposons can produce a variety of RNAs by induction, for example: miRNA, siRNA, piRNA, and the like. In the past, siRNAs have been thought to be formed by endogenous double-stranded RNA, but in some species, such as butterflies, etc., transposons have been found to be important sources of siRNA.
Miniature inverted repeat transposable elements (MITEs) are a Class II non-autonomously transposable DNA element found in bacterial, plant and animal genomes in recent years. Since MITEs lack coding sequences and evolve themselves very rapidly, it is still difficult for existing technical means to identify MITEs accurately. However, the rapid expansion of genomic databases now provides many advantages for the study of MITEs.
Current research shows that transposons are involved in many important physiological activities, exhibit many genetic effects, and are a class of DNA elements with important functions and research prospects. For example, the insertion of a transposon into a gene can cause an insertion mutation, so that a new gene appears at the inserted position, resulting in the loss or change of the original gene function. As a special class of transposons, MITEs have important functions and effects, and identifying MITEs across the whole genome with MITE-Hunter can provide a new method and a new approach for studying MITEs.
Small interfering RNA (siRNA), sometimes called short interfering RNA or silencing RNA, is a double-stranded RNA molecule 20-25 bp in length that acts in the RNA interference pathway. Small interfering RNA interferes with the expression of a specific gene through post-transcriptional regulation: it degrades the complementary mRNA and thereby inhibits translation. siRNAs exert antiviral functions in mechanisms associated with RNA interference and play an important role in the formation of genomic chromatin structure. siRNA can be introduced into a cell by transfection. In principle, any gene can be inhibited by a synthetic siRNA with a complementary sequence, which makes siRNA an important tool for validating gene function and drug targets in the post-genome era.
The stem-loop structure of siRNA precursors is quite conserved across species, and the precursors share high sequence homology. Current prediction methods rely mainly on these two characteristics: bioinformatics software searches and aligns a sequenced genome, candidate siRNAs that meet the conditions are compared by sequence homology with experimentally verified siRNA molecules, and the number and distribution of siRNAs in the species are finally determined. These methods involve a heavy workload, since they require genome search and alignment, homology analysis and the like, and the search and alignment step is time-consuming. They are also usually limited by species: most are suitable only for animals, and among plants only for a few model plants such as Arabidopsis thaliana. A prediction method generally applicable to plants is lacking.
Disclosure of Invention
The invention aims to provide a method for predicting plant endogenous siRNA on the whole genome level; the method is rapid and high-throughput.
In order to achieve the above object, the present invention provides the following technical solutions:
the invention provides a method for predicting plant endogenous siRNA on the whole genome level, which comprises the following steps:
1) using MITE-Hunter to screen candidate MITEs element from the whole genome sequence data;
2) analyzing the candidate MITEs element in the step 1) by using a multi-sequence comparison method, and filtering false positive results again to obtain an MITEs element sample;
3) extracting sequences with the length of 40-60 bp at two ends of each MITEs element sample from the MITEs element samples in the step 2) to obtain sequences to be analyzed;
4) performing complementary blast analysis on the sequence to be analyzed in the step 3), and screening the sequence with mismatch of 0 and sequence length of 24-100 nt to obtain candidate siRNA;
5) comparing the candidate siRNA in the step 4) with the siRNA in the database by using Pln24NT software, wherein the siRNA with the consistent comparison result is the plant endogenous siRNA.
Preferably, the length of the sequence extracted at both ends of each MITEs element in step 3) is 50 bp.
Preferably, the whole genome sequence data in step 1) is gene fragment data, and the length of each gene fragment is 1.8-2.2 kb.
Preferably, the screening in step 1) is performed by identifying TIR and TSD structural features, obtaining transposons with flanking sequences, and filtering structurally similar false positive results to obtain candidate MITEs elements.
Preferably, after the MITEs element samples are obtained, the method further comprises identifying the MITEs element samples and classifying them into different families.
The invention has the following beneficial effects: it provides a method for predicting plant endogenous siRNA at the whole genome level, which takes plant whole genome sequence data, uses MITE-Hunter to detect MITEs elements at the whole genome level, predicts 24-nt siRNAs based on the MITEs elements, and finally verifies the siRNAs using Pln24NT. By combining these two bioinformatics tools, the invention increases prediction throughput, so that siRNAs can be predicted rapidly from large volumes of data, and the method is generally applicable to plants.
Description of the drawings:
FIG. 1 is a schematic diagram of the structure of MITEs;
FIG. 2 shows a flow chart of a method for predicting endogenous siRNA in plants at the whole genome level;
FIG. 3 shows the number of families into which the MITEs of Populus tomentosa are classified;
FIG. 4 shows a portion of Chinese white poplar MITEs;
FIG. 5 shows a schematic diagram of the arrangement structure of Chinese white poplar MITEs;
FIG. 6 shows the verification of the prediction results of Populus trichocarpa whole genome endogenous siRNA (part);
FIG. 7 shows the validation of the prediction results of Arabidopsis thaliana whole genome endogenous siRNA (part).
Detailed Description
The invention provides a method for predicting plant endogenous siRNA on the whole genome level, which comprises the following steps:
1) using MITE-Hunter to screen candidate MITEs element from the whole genome sequence data;
2) analyzing the candidate MITEs element in the step 1) by using a multi-sequence comparison method, and filtering false positive results again to obtain an MITEs element sample;
3) extracting sequences with the length of 40-60 bp at two ends of each MITEs element sample from the MITEs element samples in the step 2) to obtain sequences to be analyzed;
4) performing complementary blast analysis on the sequence to be analyzed in the step 3), and screening the sequence with mismatch of 0 and sequence length of 24-100 nt to obtain candidate siRNA;
5) comparing the candidate siRNA in the step 4) with the siRNA in the database by using Pln24NT software, wherein the siRNA with the consistent comparison result is the plant endogenous siRNA.
The method uses MITE-Hunter to screen candidate MITEs elements from the whole genome sequence data; the structure of MITEs is shown schematically in FIG. 1. Preferably, the screening process comprises identifying TIR and TSD structural features to obtain transposons with flanking sequences, and filtering structurally similar false positive results to obtain candidate MITEs elements. The filtering of structurally similar false positives preferably removes, according to the alignment of the paired sequences, results that have only a TIR sequence or only a TSD sequence. The whole genome sequence data is preferably gene fragment data; the length of each gene fragment is preferably 1.8-2.2 kb, more preferably 2 kb.
In the present invention, the operating parameters of MITE-Hunter are preferably:
-S 12345678 indicates that the program runs from the initial step 1 and ends at step 8;
-i specifies the genome sequencing data (input file);
-P is preferably 0.25 to 1; -P represents the proportion of the whole genome data that is used as the input file;
-g specifies the name of the whole run (by default, "genome"); the names of subsequent output files start with this name;
-n is preferably 5; -n represents the maximum number of groups into which the MITEs elements are grouped, and "-n 5" in the examples represents grouping into at most 5 groups.
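As a hedged illustration of how these parameters fit together, the following Python sketch assembles the MITE-Hunter command line as an argument list. Only the flags listed above (-i, -g, -n, -S, -P) and the script name used in the examples are taken from this description; the function name, its defaults and the range check on -P are assumptions.

```python
def build_mite_hunter_command(genome_fasta, run_name="genome", groups=5,
                              steps="12345678", genome_fraction=None,
                              script="MITE_Hunter_manager.pl"):
    """Assemble the argument list for a MITE-Hunter run (hypothetical helper)."""
    cmd = ["perl", script,
           "-i", genome_fasta,   # input genome sequence file
           "-g", run_name,       # run name, used as prefix of output files
           "-n", str(groups),    # maximum number of groups
           "-S", steps]          # steps to run, here step 1 through step 8
    if genome_fraction is not None:
        # The description prefers a value between 0.25 and 1 for -P.
        if not 0.25 <= genome_fraction <= 1:
            raise ValueError("genome_fraction should lie between 0.25 and 1")
        cmd += ["-P", format(genome_fraction, "g")]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Perl and MITE-Hunter are installed.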
After candidate MITEs elements are obtained, the candidate MITEs elements are analyzed by using a multi-sequence comparison method, and false positive results are filtered again to obtain an MITEs element sample.
According to the invention, after the MITEs element sample is obtained, the MITEs element sample is identified and classified into different families. In the present invention, the classification of samples of MITEs into distinct families serves to further analyze the structural and functional characteristics of MITEs.
After the MITEs element samples are classified, sequences 40-60 bp in length are extracted from both ends of each MITEs element sample to obtain the sequences to be analyzed. The length of the sequence extracted from each end is preferably 45-55 bp, and more preferably 50 bp. The sequences are extracted with a sequence extraction script written in Python or Perl; the script is not particularly limited, and any sequence extraction script conventional in the field can be used.
In the invention, the length of the sequences extracted from the two ends of each MITEs element is set to 40-60 bp for the following reasons. The siRNA precursor has a stem-loop structure, so a corresponding complementary sequence exists on the transposon. The two ends of a MITEs element usually carry terminal inverted repeats more than 10 nt long, so extracting sequences slightly longer than the terminal inverted repeats increases the accuracy of prediction: the longer the complementary sequences are, the tighter the pairing, and the more easily a stem-loop structure forms and siRNA is generated. If the extracted sequence is too long, however, the amount of calculation increases and the number of screening results drops significantly, so that the prediction loses its meaning.
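The extraction step can be sketched in Python as follows. This is one possible sequence extraction script, not the one used by the inventors; the function name and the decision to skip elements shorter than twice the end length (so that the two ends never overlap) are assumptions.

```python
def extract_terminal_sequences(sequence, end_length=50):
    """Return the (5' end, 3' end) of a MITEs element, each end_length bp long.

    Returns None when the element is too short for two non-overlapping ends
    (an assumption of this sketch, not a rule stated in the description).
    """
    if not 40 <= end_length <= 60:
        raise ValueError("end_length should be 40-60 bp")
    if len(sequence) < 2 * end_length:
        return None
    return sequence[:end_length], sequence[-end_length:]
```

With the preferred setting of 50 bp, a 120-bp element yields its first 50 and last 50 bases, while a 90-bp element is skipped.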
After obtaining a sequence to be analyzed, performing complementary blast analysis on the sequence to be analyzed, and screening a sequence with mismatch of 0 and sequence length of 24-100 nt to obtain candidate siRNA; the sequence with mismatch of 0 is a completely matched sequence, and the sequence can form a tighter stem-loop structure, so that siRNA can be generated more easily, and the prediction accuracy can be greatly improved.
In the invention, the complementary blast analysis checks whether the sequences extracted from the two ends of a single MITEs element can pair complementarily; if they can, siRNA derived from that MITEs element forms more easily, and the prediction is more reliable. This step can be run on a large amount of pooled data at once, which gives the method its high throughput.
The sequence length of 24-100 nt is selected because endogenous siRNAs are generally 21-24 nt long; the minimum length is set to 24 nt to improve accuracy, and the maximum length is set to 100 nt because multiple siRNAs can be arranged closely together to form an siRNA cluster. Selecting longer sequences would add unnecessary computation and increase the screening time.
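The screening criterion (perfect complementary pairing, 24-100 nt) can be illustrated without BLAST by a small Python sketch. It swaps in a plain longest-common-substring search between the 5' end and the reverse complement of the 3' end, which finds the longest stretch that pairs with zero mismatches. This stands in for the complementary blast analysis only for illustration; the function names are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_complementary_match(left_end, right_end):
    """Longest stretch of left_end that pairs perfectly with right_end.

    Implemented as the longest common substring of left_end and the
    reverse complement of right_end (dynamic programming, no mismatches).
    """
    a = left_end.upper()
    b = reverse_complement(right_end).upper()
    best_len, best_stop = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_stop = cur[j], i
        prev = cur
    return a[best_stop - best_len:best_stop]


def candidate_sirna(left_end, right_end, min_len=24, max_len=100):
    """Return the perfectly paired stretch if its length is 24-100 nt, else None."""
    match = longest_complementary_match(left_end, right_end)
    return match if min_len <= len(match) <= max_len else None
```

For a real genome-scale run, an alignment tool remains the practical choice; the sketch only makes the "mismatch of 0, length 24-100 nt" filter concrete.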
In the present invention, the flow chart of the method for predicting the plant endogenous siRNA at the genome-wide level is shown in FIG. 2.
The following examples are provided to illustrate the method of predicting plant endogenous siRNA at genome-wide level, but they should not be construed as limiting the scope of the present invention.
Example 1: Prediction of whole genome endogenous siRNA of hairy fruit poplar
The Populus tomentosa genome sequencing file (retrieved from a database, https://www.ncbi.nlm.nih.gov/genome/98#opennewwindow) was selected. MITE-Hunter was used to identify MITEs elements at the whole genome level, the identified MITEs elements were used as the input for the next step, 24-nt siRNAs were predicted, and the predictions were verified with Pln24NT.
The operation steps are as follows:
1. MITE-Hunter operation
MITE-Hunter was run with the following command, taking the Populus tomentosa genome as an example:
perl MITE_Hunter_manager.pl -i /iob_home/srwlab/vehell/scratch/Data_Room/PTC/Ptc_b5 -g Ptc -n 5 -S 12345678
2. MITE-Hunter output result processing
After the run finished, a total of 2306 MITEs elements were identified in the Populus tomentosa genome. The number of families into which the MITEs were classified was counted, and the identified candidate MITEs elements fell into 89 families in total. Sequences 50 bp in length were extracted from both ends of each MITEs element, complementary blast analysis was carried out, and perfectly matched sequences (without any mismatch) 24-100 nt in length were screened as candidate siRNAs. The results of the whole genome MITEs screen of Populus tomentosa are shown in Table 1 and FIGS. 3 and 4, wherein FIG. 3 shows the number of Populus tomentosa MITEs families and FIG. 4 shows part of the Populus tomentosa MITEs elements.
TABLE 1 Whole genome MITEs element identification results (partial)
Name Location Start position End position
test_2_74181 Chr19 6939894 6939876
test_2_234271 Chr15 48 66
test_1_229785 Chr19 6064277 6064261
test_2_70675 Chr15 3637466 3637448
test_2_250084 Chr08 6263945 6263961
test_2_39898 Chr09 6000054 6000037
test_2_111478 Chr04 20399739 20399712
test_2_168739 Chr04 8177420 8177402
test_1_297918 Chr01 27687553 27687576
test_2_289532 Chr13 2169324 2169344
test_1_294968 Chr13 4291530 4291550
test_1_191549 Chr16 12104841 12104823
test_1_197978 Chr11 11186907 11186927
test_1_7257 Chr03 9490164 9490181
test_2_214385 Chr11 9782531 9782550
test_1_112575 Chr19 2404025 2404041
test_2_82283 scaffold_63 71949 71967
test_1_101157 Chr01 11601184 11601152
test_2_148914 Chr05 15681037 15681019
test_1_91711 Chr10 9132317 9132298
test_2_159756 Chr06 8082520 8082501
test_2_179877 Chr08 3088670 3088650
test_2_290934 Chr05 2877251 2877272
test_2_888 Chr02 15159899 15159921
test_1_7055 Chr07 6343690 6343708
test_1_70997 Chr14 309007 308983
test_1_139882 Chr01 14979254 14979276
test_2_165622 Chr13 7756223 7756244
test_2_312527 Chr19 1395247 1395225
test_2_123882 Chr06 19103173 19103191
test_1_182023 Chr18 10810507 10810488
test_2_84564 Chr08 10241083 10241064
test_2_153321 Chr15 8656417 8656437
test_1_71146 Chr12 3163110 3163088
test_2_236737 Chr15 10615453 10615436
test_1_38647 Chr06 17382258 17382279
test_2_195729 Chr10 20189564 20189540
test_2_164798 Chr02 15939345 15939366
test_1_203741 Chr02 6400198 6400175
test_2_111298 Chr01 37593763 37593744
test_1_128909 Chr03 21214218 21214198
test_2_193989 Chr17 1317128 1317148
test_2_62765 Chr01 6327840 6327864
test_2_88754 Chr09 7248055 7248071
test_1_32349 Chr09 12427276 12427295
test_2_289774 Chr02 800344 800323
test_2_24478 Chr12 7796366 7796386
test_1_180941 Chr10 1417711 1417689
test_1_223222 Chr12 11557119 11557102
test_1_245821 Chr1 3645744 3646731
3. Organizing MITEs results
The screened MITEs results are preliminarily organized for the next prediction step; the main format is shown in FIG. 5. Each record begins with the name of the MITEs element (the name assigned during the MITE-Hunter run) on a line of its own, and the following line holds the sequence of that MITEs element.
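A reader for this two-line layout might look like the following Python sketch. FIG. 5 is not reproduced here, so the details are assumptions: blank lines are skipped, an optional leading ">" on the name line is tolerated, and sequences are upper-cased. The function name is hypothetical.

```python
def parse_mite_records(lines):
    """Read name/sequence line pairs into a dict mapping name -> sequence."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if name is None:
            name = line.lstrip(">")  # name line (FASTA-style '>' is optional)
        else:
            records[name] = line.upper()  # sequence line of the current record
            name = None
    if name is not None:
        raise ValueError(f"record {name!r} has no sequence line")
    return records
```

It accepts any iterable of lines, so an open file handle can be passed directly, for example `parse_mite_records(open(path))`, where `path` is whatever file holds the organized results.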
4. Extraction of 50bp sequences at both ends of the MITEs element
The 50 bp sequences at both ends of each MITEs element are extracted using a sequence extraction script.
5. Predicting 24-nt siRNA
The extracted sequences are analyzed by blast, and perfectly matched sequences (mismatch of 0) 24-100 nt in length are screened as the final prediction result. Part of the siRNAs predicted from the whole genome of Populus trichocarpa are shown in Table 2.
TABLE 2 Partial siRNA results predicted from the whole genome of Populus trichocarpa
Name Sequence number Sequence
test_1_191549 SEQ ID NO.1 CTCCCTCCATCCCAAAATATAAGGCATAACCACT
test_1_7257 SEQ ID NO.2 ATGAATGTGGGAAATGCTAGAATGA
test_2_214385 SEQ ID NO.3 AATATGATTTTAATGGAAAATCGCAAAACTA
test_1_101157 SEQ ID NO.4 CTCCCTCCTTCCCAAATTGATCATCATATA
test_2_148914 SEQ ID NO.5 CCCAATCCTGGGTTTGAATCTGGACAT
test_1_91711 SEQ ID NO.6 AGAGTAAATTTCACAAAACTACAT
test_2_159756 SEQ ID NO.7 AATATAAGGGATTTTGGGTGGATGTG
test_2_179877 SEQ ID NO.8 CTGGATTTTTCACATTTTGGTCCTTTT
test_2_290934 SEQ ID NO.9 CTCCCTCCGTCCCAATATATAGCAACCTAGGATGGG
test_2_888 SEQ ID NO.10 CCTAGGATGGGACCCATCCTAGGTT
test_1_7055 SEQ ID NO.11 CTACCTCCGTCCCAAAATAATTGTA
test_1_70997 SEQ ID NO.12 CTCCCTCCGTCCCAAAATATAAGCATTTTTAGCTAT
test_2_195729 SEQ ID NO.13 AATGATAATATGTACTAAAGGACT
test_2_164798 SEQ ID NO.14 TTAGCTCTATATAGGAGTCAAGGAGACG
test_1_203741 SEQ ID NO.15 ATTCCCCATCCCCATCCCACCAAAATTCCC
test_2_111298 SEQ ID NO.16 AATATGTGTAGAAAACTAGAAATTGA
test_2_193989 SEQ ID NO.17 CCTTCAATATACCTTTATAGATTTTAATAGTA
test_2_62765 SEQ ID NO.18 AATTATACCTCATTTTATATAAAATGAGCTAATTA
test_2_289774 SEQ ID NO.19 GTATCTATTATAAATTTCTTGTTATACTTATCATTCC
test_1_223222 SEQ ID NO.20 CATAAGAATTTAACGGTCAACTAACGGTCAACTA
6. Verification
Pln24NT is used to compare the position information of the predicted siRNAs with the position information of siRNAs in a published database (rice and corn siRNA database: http://sundarlab.ucdavis.edu/smrnas/), thereby verifying the prediction results. Of 50 randomly selected predicted siRNAs, 34 were found in the known database and the other 16 were not, giving a prediction accuracy of 68%. The results are shown in FIG. 6.
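The accuracy figure is simply the share of sampled predictions that are found in the reference database, 34 / 50 = 68%. The Python sketch below makes that arithmetic explicit. It does not reproduce Pln24NT or its interface; it assumes each siRNA is represented by some hashable key (for example a position tuple), and the function name is hypothetical.

```python
def validation_rate(predicted, known):
    """Return (number of hits, fraction of predictions found in the known set)."""
    predicted = list(predicted)
    if not predicted:
        raise ValueError("no predictions to validate")
    known = set(known)
    hits = sum(1 for item in predicted if item in known)
    return hits, hits / len(predicted)
```

With 50 sampled predictions of which 34 appear in the reference set, the function returns 34 hits and a rate of 0.68.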
Example 2: arabidopsis whole genome siRNA prediction
An existing Arabidopsis genome file was selected (TAIR, The Arabidopsis Information Resource website, https://www.arabidopsis.org/). Following the method of the invention, MITE-Hunter was used to screen MITEs elements, the results were processed, the extracted 50 bp sequences at both ends were subjected to complementary blast analysis, and endogenous siRNAs were predicted. Finally, the predictions were verified using Pln24NT.
The operation steps are as follows:
1. MITE-Hunter operation
The MITE-Hunter is run with the following commands:
perl MITE_Hunter_manager.pl -i /iob_home/srwlab/vehell/scratch/Data/ref/genome/ath.genome -g ATH -n 5 -S 12345678
2. MITE-Hunter output result processing
After the operation result is output, counting the number of MITEs classified families, extracting sequences with the length of 50bp at two ends of each MITEs element, carrying out complementary blast analysis, and screening completely matched (without any mismatch) sequences with the length ranging from 24nt to 100nt to serve as candidate siRNA.
3. Organizing MITEs results
The screened MITEs results are preliminarily organized for the next prediction step, in the same format as in the previous example.
4. Extraction of 50bp sequences at both ends of the MITEs element
The 50 bp sequences at both ends of each MITEs element are extracted using a sequence extraction script.
5. Predicting 24-nt siRNA
The extracted sequences are analyzed by blast, and perfectly matched sequences (mismatch of 0) 24-100 nt in length are screened as the final prediction result.
6. Verification
Pln24NT is used to compare the position information of the predicted siRNAs with the position information of siRNAs in a published database, thereby verifying the prediction results. The predicted siRNA results for the Arabidopsis genome are shown in Table 3 (50 random selections). Of 50 randomly selected predicted siRNAs, 38 matched existing siRNAs and the other 12 were not in the known database, giving a prediction accuracy of about 76%. The results are shown in FIG. 7.
TABLE 3 partial siRNA results predicted from Arabidopsis entire genome
Tag_name Sequence number Tag_sequence
ATH__18_621 SEQ ID NO.21 ATTCTGAACTAAAGCAAAGACTGA
ATH__22213_2 SEQ ID NO.22 TTATAGTCACGGCTCTGGGTGAAG
ATH__22392_2 SEQ ID NO.23 AGCTTTTCCACCATCTTTCAC
ATH__24_530 SEQ ID NO.24 AGCAGAGGGCAGAGAATCAATCAG
ATH__24801_2 SEQ ID NO.25 AAGATAACTAGCAAAAGCTAGCAT
ATH__25_525 SEQ ID NO.26 AACAGAAGACTTACAAACATGATA
ATH__33512_2 SEQ ID NO.27 TTCTCCAACGGAACCAGCTTGTGAGAGTCCAATCATCAC
ATH__36745_2 SEQ ID NO.28 AGAGAACAAGGCTAGCTAGAAAGA
ATH__382_86 SEQ ID NO.29 ACCCATCCCACCGGTTATTTCCTACGAAGAAGAA
ATH__388_85 SEQ ID NO.30 CAAGAAAGACTACGACGAAGAAAA
ATH__392_85 SEQ ID NO.31 AGAACGAACCAGAAGAAAATGAAG
ATH__393_85 SEQ ID NO.32 ATGGAAGACTCTCATGGAAGACGA
ATH__395_85 SEQ ID NO.33 CTATAAGAAGAAGTAACGGAGAAG
ATH__39806_2 SEQ ID NO.34 ACAGTTTTTCATATTTATATCAATCA
ATH__401_84 SEQ ID NO.35 GACGAACGGAAAAGACGGTAATTT
ATH__41788_2 SEQ ID NO.36 AGGCTAGACAGAAGATTACAAAAC
ATH__501_72 SEQ ID NO.37 ATCGACGAACACGGATGATAAAAA
ATH__504_72 SEQ ID NO.38 GAAGATCCTGTCTTGCTCTTCCTCCATAAG
ATH__523_69 SEQ ID NO.39 ACAAATATTGTTGTAGAAGATGGA
ATH__524_69 SEQ ID NO.40 AGCAGGACGTTCTTCAATCTTTAG
ATH__702_54 SEQ ID NO.41 GAAGGAAAGACTTATACAAAACAC
ATH__714_53 SEQ ID NO.42 AATCCGGGCTAGAAGCGACGCATG
ATH__715_53 SEQ ID NO.43 ACAGATCAACAGAAAACTCGGCAT
As can be seen from the above examples, the present invention provides a method for predicting endogenous siRNA in plants at the whole genome level, which enables rapid prediction of siRNA on a large data volume basis, and which is a method generally applicable to plants.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.
Sequence listing
<110> Beijing Forestry University
<120> a method for predicting plant endogenous siRNAs at the whole genome level
<160>43
<170>SIPOSequenceListing 1.0
<210>1
<211>34
<212>DNA
<213>Chinese white poplar
<400>1
ctccctccat cccaaaatat aaggcataac cact 34
<210>2
<211>25
<212>DNA
<213>Chinese white poplar
<400>2
atgaatgtgg gaaatgctag aatga 25
<210>3
<211>31
<212>DNA
<213>Chinese white poplar
<400>3
aatatgattt taatggaaaa tcgcaaaact a 31
<210>4
<211>30
<212>DNA
<213>Chinese white poplar
<400>4
ctccctcctt cccaaattga tcatcatata 30
<210>5
<211>27
<212>DNA
<213>Chinese white poplar
<400>5
cccaatcctg ggtttgaatc tggacat 27
<210>6
<211>24
<212>DNA
<213>Chinese white poplar
<400>6
agagtaaatt tcacaaaact acat 24
<210>7
<211>26
<212>DNA
<213>Chinese white poplar
<400>7
aatataaggg attttgggtg gatgtg 26
<210>8
<211>27
<212>DNA
<213>Chinese white poplar
<400>8
ctggattttt cacattttgg tcctttt 27
<210>9
<211>36
<212>DNA
<213>Chinese white poplar
<400>9
ctccctccgt cccaatatat agcaacctag gatggg 36
<210>10
<211>25
<212>DNA
<213>Chinese white poplar
<400>10
cctaggatgg gacccatcct aggtt 25
<210>11
<211>25
<212>DNA
<213>Chinese white poplar
<400>11
ctacctccgt cccaaaataa ttgta 25
<210>12
<211>36
<212>DNA
<213>Chinese white poplar
<400>12
ctccctccgt cccaaaatat aagcattttt agctat 36
<210>13
<211>24
<212>DNA
<213>Chinese white poplar
<400>13
aatgataata tgtactaaag gact 24
<210>14
<211>28
<212>DNA
<213>Chinese white poplar
<400>14
ttagctctat ataggagtca aggagacg 28
<210>15
<211>30
<212>DNA
<213>Chinese white poplar
<400>15
attccccatc cccatcccac caaaattccc 30
<210>16
<211>26
<212>DNA
<213>Chinese white poplar
<400>16
aatatgtgta gaaaactaga aattga 26
<210>17
<211>32
<212>DNA
<213>Chinese white poplar
<400>17
ccttcaatat acctttatag attttaatag ta 32
<210>18
<211>35
<212>DNA
<213>Chinese white poplar
<400>18
aattatacct cattttatat aaaatgagct aatta 35
<210>19
<211>37
<212>DNA
<213>Chinese white poplar
<400>19
gtatctatta taaatttctt gttatactta tcattcc 37
<210>20
<211>34
<212>DNA
<213>Chinese white poplar
<400>20
cataagaatt taacggtcaa ctaacggtca acta 34
<210>21
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>21
attctgaact aaagcaaaga ctga 24
<210>22
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>22
ttatagtcac ggctctgggt gaag 24
<210>23
<211>21
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>23
agcttttcca ccatctttca c 21
<210>24
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>24
agcagagggc agagaatcaa tcag 24
<210>25
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>25
aagataacta gcaaaagcta gcat 24
<210>26
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>26
aacagaagac ttacaaacat gata 24
<210>27
<211>39
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>27
ttctccaacg gaaccagctt gtgagagtcc aatcatcac 39
<210>28
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>28
agagaacaag gctagctaga aaga 24
<210>29
<211>34
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>29
acccatccca ccggttattt cctacgaaga agaa 34
<210>30
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>30
caagaaagac tacgacgaag aaaa 24
<210>31
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>31
agaacgaacc agaagaaaat gaag 24
<210>32
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>32
atggaagact ctcatggaag acga 24
<210>33
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>33
ctataagaag aagtaacgga gaag 24
<210>34
<211>26
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>34
acagtttttc atatttatat caatca 26
<210>35
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>35
gacgaacgga aaagacggta attt 24
<210>36
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>36
aggctagaca gaagattaca aaac 24
<210>37
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>37
atcgacgaac acggatgata aaaa 24
<210>38
<211>30
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>38
gaagatcctg tcttgctctt cctccataag 30
<210>39
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>39
acaaatattg ttgtagaaga tgga 24
<210>40
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>40
agcaggacgt tcttcaatct ttag 24
<210>41
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>41
gaaggaaaga cttatacaaa acac 24
<210>42
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>42
aatccgggct agaagcgacg catg 24
<210>43
<211>24
<212>DNA
<213> Arabidopsis thaliana (Arabidopsis thaliana)
<400>43
acagatcaac agaaaactcg gcat 24

Claims (4)

1. A method for predicting plant endogenous siRNA at the genome-wide level, comprising the following steps:
1) using MITE-Hunter to screen candidate MITEs elements from whole genome sequence data;
2) analyzing the candidate MITEs elements from step 1) by multiple sequence alignment, and filtering out false positive results to obtain MITEs element samples;
3) extracting, from each MITEs element sample of step 2), the sequences of 40-60 bp in length at both ends to obtain sequences to be analyzed;
4) performing complementary BLAST analysis on the sequences to be analyzed from step 3), and screening for sequences with 0 mismatches and a length of 24-100 nt to obtain candidate siRNAs;
5) comparing the candidate siRNAs from step 4) with the existing siRNAs in the database using the Pln24NT software, wherein the siRNAs with consistent comparison results are plant endogenous siRNAs;
wherein the screening in step 1) obtains transposons with flanking sequences by identifying terminal inverted repeat (TIR) and target site duplication (TSD) structural features, and filters out structurally similar false positive results to obtain the candidate MITEs elements.
2. The method of claim 1, wherein the length of the sequence extracted from each end of each MITEs element sample in step 3) is 50 bp.
3. The method according to claim 1, wherein the whole genome sequence data in step 1) are gene fragment data, and each gene fragment has a length of 1.8-2.2 kb.
4. The method of claim 1, further comprising, after obtaining the MITEs element samples, classifying the MITEs element samples into different families.
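
Steps 3) and 4) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: an exact-match dynamic-programming search stands in for the complementary BLAST analysis, and the function names `reverse_complement`, `terminal_flanks` and `perfect_complementary_stretches` are hypothetical, introduced here only for illustration. The sketch assumes that "complementary" means that a stretch in one terminal sequence pairs perfectly with the reverse complement of a stretch in the other terminal sequence.

```python
# Illustrative sketch only; an exact-match search replaces BLAST here.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper case)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def terminal_flanks(element: str, flank_len: int = 50) -> tuple[str, str]:
    """Step 3): take the sequences at both ends of one MITEs element sample.

    Claim 1 allows 40-60 bp; claim 2 uses 50 bp, the default here.
    """
    if not 40 <= flank_len <= 60:
        raise ValueError("flank_len must be within 40-60 bp")
    if len(element) < 2 * flank_len:
        raise ValueError("element is shorter than two terminal sequences")
    element = element.upper()
    return element[:flank_len], element[-flank_len:]


def perfect_complementary_stretches(left: str, right: str,
                                    min_len: int = 24,
                                    max_len: int = 100) -> list[str]:
    """Step 4), simplified: stretches of `left` that pair with `right`
    with 0 mismatches and a length of min_len..max_len nt.

    Only maximal exact matches between `left` and the reverse complement
    of `right` are reported (longest-common-suffix table, row by row).
    """
    left = left.upper()
    target = reverse_complement(right)
    n, m = len(left), len(target)
    hits = []
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if left[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The run ends here if either sequence ends or the next
                # pair of bases no longer matches.
                run_ends = i == n or j == m or left[i] != target[j]
                if run_ends and min_len <= cur[j] <= max_len:
                    hits.append(left[i - cur[j]:i])
        prev = cur
    return hits


if __name__ == "__main__":
    # Synthetic element with a perfect 30-nt terminal inverted repeat.
    tir = "GACGAACGGAAAAGACGGTAATTTCTATAA"
    element = (tir + "A" * 20 + "CCGGCCGG" * 10 + "A" * 20
               + reverse_complement(tir))
    left, right = terminal_flanks(element, flank_len=50)
    print(perfect_complementary_stretches(left, right))
```

On real data the alignment would be performed with BLAST as the claim specifies, and the resulting candidates would still require the database comparison of step 5), which this sketch does not cover.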

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910020480.0A CN109754844B (en) 2019-01-09 2019-01-09 Method for predicting plant endogenous siRNA on whole genome level

Publications (2)

Publication Number Publication Date
CN109754844A (en) 2019-05-14
CN109754844B (en) 2020-09-01

Family

ID=66405291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910020480.0A Active CN109754844B (en) 2019-01-09 2019-01-09 Method for predicting plant endogenous siRNA on whole genome level

Country Status (1)

Country Link
CN (1) CN109754844B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354420B (en) * 2020-03-08 2020-12-22 吉林大学 siRNA research and development method for COVID-19 virus drug therapy
CN111808935B (en) * 2020-07-22 2022-09-23 北京林业大学 Identification method of plant endogenous siRNA transcription regulation relationship

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1904900A (en) * 2005-07-28 2007-01-31 中国科学院生物物理研究所 Human autogenous siRNA sequence, its application and screening method
CN102757956A (en) * 2012-07-18 2012-10-31 云南省烟草农业科学研究院 Tobacco genome molecular marker probe and sequence collective group as well as acquiring method and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research progress on active MITEs transposons in plants; Hu Bingjie; Chinese Journal of Biotechnology; 2018-02-25; Vol. 34, No. 2; full text *

Also Published As

Publication number Publication date
CN109754844A (en) 2019-05-14

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant