WO2020228046A1 - Method for predicting gene coding frame from low-quality ribosome imprint data and system - Google Patents

Method for predicting gene coding frame from low-quality ribosome imprint data and system

Info

Publication number
WO2020228046A1
WO2020228046A1 PCT/CN2019/087412 CN2019087412W WO2020228046A1 WO 2020228046 A1 WO2020228046 A1 WO 2020228046A1 CN 2019087412 W CN2019087412 W CN 2019087412W WO 2020228046 A1 WO2020228046 A1 WO 2020228046A1
Authority
WO
WIPO (PCT)
Prior art keywords
rpf
ribosome
coding frame
quality
frame
Prior art date
Application number
PCT/CN2019/087412
Other languages
French (fr)
Chinese (zh)
Inventor
莫蓓莘
宋波
杨晓玉
高雷
陈雪梅
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学 (Shenzhen University)
Publication of WO2020228046A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation

Definitions

  • The invention belongs to the field of biotechnology. It specifically relates to a method for predicting protein coding frames using low-quality ribosome imprinting data, that is, a method for predicting gene coding frames from low-quality ribosome imprinting data, and also relates to a system for predicting gene coding frames.
  • Small coding frames in the genome play an important regulatory role in gene expression and translation, and are critical to the formation of plant traits, yeast development and animal embryo development. Research on small coding frames therefore has broad prospects in medical, industrial and agricultural applications, and is also essential for a comprehensive understanding of biological processes and mechanisms.
  • The accurate prediction of gene coding frames is the basic work of all genome research and related research and applications.
  • The prediction of gene coding frames is mainly based on the characteristics of the DNA sequence, which are used to determine the start and end positions of a protein-coding gene and then to infer the protein sequence it encodes.
  • Existing data show that this traditional prediction method is highly accurate for long coding frames but almost powerless for small coding frames (small ORFs).
  • The traditional approach confirms and verifies small coding frames one by one through experiments. This is time-consuming and labor-intensive, and is not practicable in most organisms. At present, experimental verification of about 300 small coding frames has been completed only in the yeast genome.
  • In ribosome imprinting sequencing (Ribo-seq), the basic principle is that a translated RNA sequence is protected by the ribosome. Extracting and sequencing these protected sequences yields the translated sequences, from which the positions of small coding frames can be predicted.
  • Many methods and software tools for predicting small coding frames from ribosome sequencing data have also been developed. However, because these methods were mainly developed in studies of model species, they rest on the ideal assumption that ribosome sequencing data are of high quality (showing a complete 3-base periodic distribution).
  • The purpose of the present invention is to overcome the above shortcomings of the prior art and to provide a method for predicting gene coding frames from low-quality ribosome imprinting data.
  • The present invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome imprinting data. It measures the quality of the ribosome imprinting data and assigns corresponding weights, calculates the probability that each codon is located at the ribosomal P site, extracts sequence features, and evaluates the predicted probability value of each coding frame through statistical analysis, thereby predicting new coding frames.
  • The method helps to reduce the data quality requirements of ribosome data analysis and to rapidly expand its range of application.
  • The present invention provides a method for predicting gene coding frames from low-quality ribosome imprinting data, which includes the following steps:
  • Step S4: perform feature training on the RPFs retained in step S2, and assign weights accordingly;
  • Step S6: extract the features of the gene coding frames at the same time;
  • The gene coding frame feature in S6 refers to the codon usage frequency of the known coding frames.
  • the 3-base periodicity of each length of RPF is evaluated by the multitaper algorithm, the frequency is displayed as 3.33 Hz ⁇ 0.34 Hz, and the RPF with P value ⁇ 0.01 is retained for subsequent analysis.
  • the 3-base periodicity of each length RPF is evaluated by the multitaper algorithm, the frequency is displayed as 3.33 Hz or 0.34 Hz, and the RPF with a P value ⁇ 0.01 is retained for subsequent analysis.
  • S4 includes:
  • Weight distribution: calculate the distribution concentration from the frequency with which each RPF occurs at phases 0, 1 and 2, as obtained in S41.
  • S41 is specifically: by analyzing the position information of the RPFs that contain the start or stop codon of a known coding frame, together with the position of the corresponding start or stop codon, calculate the distance between the 5' end of each RPF and the ribosomal P site (P-site) and/or the ribosomal A site (A-site), and count the frequency of each distance between the 5' end and the P-site for RPFs of each length.
  • S42 is specifically: calculate the distribution concentration from the frequency with which each RPF occurs at phases 0, 1 and 2, as obtained in S41. The distribution concentration is described by the complexity (Entropy), computed by formula 1, where i denotes the phase (0, 1 or 2) and P_i is the proportion of each RPF distributed on phase i. The Entropy value is calculated according to formula 1; the weight of the RPF is set to 1 - Entropy and, correspondingly, the weight of the sequence features is set to Entropy.
  • S5 is specifically: from the position information of each RPF obtained by Ribo-seq and the distance between the 5' end of each RPF and the P-site, calculate the probability that each base, or each three-base combination, on each transcript is located exactly at the P-site.
  • S6, extracting coding frame features according to the sequence information of each coding frame and the P-site probability calculated in S5, specifically includes the following steps:
  • S7 specifically includes:
  • S71: extract all candidate coding frame sequences from the sequence information of all transcripts in S3. The criteria are that a candidate has a start codon (NUG) and a stop codon (UAG, UAA or UGA) and that its length is an integer multiple of 3. Candidate coding frames starting with AUG are searched first, from long to short, and evaluated one by one. Only after all AUG-initiated candidates have been searched and none meets the output conditions are NUG-initiated coding frames searched and evaluated;
  • S72: extract the features of these candidate coding frames and perform four statistical tests: one-tailed test (a): the Z-score at phase 0 is highly significantly greater than the Z-score at phase 1; one-tailed test (b): the Z-score at phase 0 is highly significantly greater than the Z-score at phase 2; one-tailed test (c): the usage frequency of the codons at phase 0 is highly significantly greater than that of the codons at phase 1; one-tailed test (d): the usage frequency of the codons at phase 0 is highly significantly greater than that of the codons at phase 2;
  • S74, output of the prediction result: control the false discovery rate (FDR) of the P values obtained for the coding frames in S73, and output the candidate coding frames that meet the output standard.
  • The predicted unknown gene coding frames include small coding frames and/or normal gene coding frames.
  • The present invention also provides a system for predicting gene coding frames, including a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for predicting gene coding frames, and that when the computer program is executed by at least one processing component, the steps of the method for predicting gene coding frames from low-quality ribosome imprinting data are realized.
  • The present invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome imprinting data. It measures the quality of the ribosome imprinting data and assigns corresponding weights, calculates the probability that each codon is located at the ribosomal P site, extracts sequence features, and evaluates the predicted probability value of each coding frame through statistical analysis, thereby predicting new coding frames.
  • The method improves tolerance to noisy data, which reduces the data quality requirements of ribosome data analysis and rapidly expands its range of application.
  • The prediction method of the present invention is suitable in two situations: in model organisms, when high-quality ribosome imprinting data are difficult to obtain for certain organelles; and in non-model organisms, when high-quality ribosome imprinting data are difficult to obtain.
  • The present invention greatly increases the range of gene coding frames that can be predicted, which is of great significance for advancing research on small coding frames.
  • The method steps of the present invention are presented to the user in the form of a computer program.
  • The user supplies the necessary information, such as ribosome imprinting data, as input, and the computer program outputs the predicted gene coding frames. This improves the user's processing efficiency.
  • Implementing the method as a computer program improves the efficiency of predicting coding frames, so that the prediction method of the present invention can be popularized more quickly.
  • Figure 1 is a schematic diagram of the technical route of the present invention, that is, the working flowchart of the present invention;
  • Figure 2 is a schematic diagram of the search strategy for candidate coding frames of the present invention;
  • Figure 3 is an application example of the present invention, in which: Figure 3(A) is the distribution of RPF lengths in the example data; Figure 3(B) is the three-base periodicity evaluation result; Figure 3(C) is the calculation of RPF distribution concentration and the weight distribution; Figure 3(D) is the result of the prediction performance evaluation; Figure 3(E) is the prediction result for small coding frames; Figure 3(F) is the supporting evidence from protein mass spectrometry data; Figure 3(G) is the evolutionary analysis of the predicted ncsORFs, shown as a heat map in which the color depth of each square indicates the magnitude of the value;
  • Figure 4 is an enlarged view of view A in Figure 3;
  • Figure 5 is an enlarged view of view B in Figure 3;
  • Figure 6 is an enlarged view of view C in Figure 3;
  • Figure 7 is an enlarged view of view D in Figure 3;
  • Figure 8 is an enlarged view of view E in Figure 3;
  • Figure 9 is an enlarged view of view F in Figure 3.
  • Figure 10 is an enlarged view of view G in Figure 3;
  • Figure 11 is a schematic diagram of the method for predicting gene coding frames from low-quality ribosomal imprinting data of the present invention.
  • The present invention discloses a method for predicting gene coding frames from low-quality ribosome imprinting data.
  • The method can accurately measure the quality of ribosome imprinting data. On this basis, the data are preliminarily filtered and corresponding weights are assigned, and the codon usage frequency is then integrated to assist the prediction of protein coding frames.
  • The method is insensitive to the quality of ribosome imprinting data and is highly fault tolerant. It also performs well on high-quality ribosome imprinting data, and can comprehensively and accurately predict translated coding frames. The method is therefore applicable to all ribosome imprinting data.
  • The main points of the present invention are as follows:
  • The present invention addresses the excessively high data quality requirements of current methods for analyzing ribosome imprinting sequencing data. It proposes a new method for predicting gene coding frames that improves tolerance to noisy data and thereby reduces the requirements for data quality. Note that the present invention is only applicable to species with a reference genome sequence and annotation information.
  • the method of the present invention mainly includes the following steps:
  • Genomic reference sequences can be obtained from public sources.
  • Step (1): the purpose of genome alignment is to obtain the positions of the ribosome imprinting sequences on the genome.
  • The genome reference sequence is the known genome sequence; the ribosome imprinting data are aligned to it to obtain their positions on the genome. If the alignment is wrong, all subsequent predictions will be wrong. This is one of the reasons why the prediction method of the present invention requires a reference genome sequence.
  • Data that show no periodicity at all are filtered out.
  • the specific method is: the periodicity of 3 bases of each length is evaluated by the multitaper algorithm, the frequency is displayed as 3.33Hz ⁇ 0.34Hz, and the RPF with P value ⁇ 0.01 is retained for subsequent analysis.
  • Step (2) includes data filtering, specifically: completely unusable data are filtered out, and data that pass the evaluation are retained.
  • The multitaper algorithm is used for data quality evaluation.
  • The purpose of quality evaluation is to provide a clear standard for data filtering.
  • The significance of step (3) is that coding frames are predicted from the transcript sequences.
  • The sequence information of the known coding frames is used to train the codon usage frequency, and their position information is used to train the distance between the 5' end of an RPF and the corresponding P-site.
  • Feature training: by extracting the alignment information of RPFs aligned to the start or stop codon of a known coding frame, calculate the distance between the 5' end of each RPF and the ribosomal P site (P-site) and/or the ribosomal A site (A-site), and count the frequency of each distance between the 5' end and the P-site for RPFs of each length.
  • P-site ribosomal P site
  • A-site ribosome A site
  • The purpose of feature training in step (4) is to obtain the distance from the 5' end of each RPF to its corresponding P-site.
  • The function of the feature training in step (4) is to train the distance between the 5' end of an RPF and the corresponding P-site. This information is used to determine the P-site position corresponding to each RPF. Note that the P-site is not known for every RPF: only RPFs containing a known start or stop codon provide this information. The distance is therefore trained on this subset of RPFs and then applied to the other RPFs.
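The feature training described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes 0-based transcript coordinates, reads given as `(five_prime, length)` tuples, and that an RPF covering a known start codon has its P-site on the first base of that codon. The function name `train_psite_offsets` and the input layout are hypothetical; stop codons (where the A-site sits on the codon) are not handled here.

```python
from collections import Counter, defaultdict

def train_psite_offsets(reads, start_codon_positions):
    """Count 5'-end-to-P-site distances separately for each RPF length.

    reads: iterable of (five_prime, length), 0-based transcript coordinates.
    start_codon_positions: first-base positions of known start codons.
    Assumption: an RPF containing a start codon has its P-site on that codon.
    """
    counts = defaultdict(Counter)
    for five_prime, length in reads:
        for start in start_codon_positions:
            # Keep only RPFs that contain the whole start codon.
            if five_prime <= start and start + 3 <= five_prime + length:
                counts[length][start - five_prime] += 1
    # Turn raw counts into frequencies for each RPF length.
    freqs = {}
    for length, counter in counts.items():
        total = sum(counter.values())
        freqs[length] = {dist: n / total for dist, n in counter.items()}
    return freqs

# Toy example: one known start codon at position 100.
reads = [(88, 28), (88, 28), (87, 28), (88, 29), (10, 28)]
offsets = train_psite_offsets(reads, start_codon_positions=[100])
```

Here the read at position 10 does not cover the start codon and contributes nothing; the remaining reads give a per-length distribution of distances that can later be applied to RPFs whose P-site is unknown.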
  • Weight distribution: calculate the distribution concentration of each RPF from the frequency with which it occurs at phases 0, 1 and 2.
  • The distribution concentration here refers to the concentration of the phase distribution.
  • The distribution concentration is described by the complexity (Entropy), and the formula is as follows:
  • where i denotes the phase (0, 1 or 2), and
  • P_i is the proportion of the RPF distributed on phase i.
  • The weight assigned to the RPF is (1 - Entropy) and, correspondingly, the weight assigned to the sequence feature is Entropy.
  • In step (4), "assign a corresponding weight to the RPF" means that the weight is a coefficient that determines the contribution of this evidence to the subsequent prediction. Specifically, the higher the RPF quality, the higher the weight and the greater the contribution to subsequent predictions. Conversely, the lower the RPF quality (the higher the noise), the smaller its contribution, and the more the prediction relies on other supporting evidence, which reduces the adverse effect of RPF noise on the prediction results.
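The weighting rule can be illustrated with a short sketch. The text does not reproduce the entropy formula, so this example assumes a Shannon entropy over the three phases with logarithm base 3, which keeps Entropy between 0 (all reads on one phase) and 1 (reads spread evenly); the base and the function names are assumptions, not taken from the source.

```python
import math

def phase_entropy(phase_counts):
    """Shannon entropy of the phase distribution, log base 3 (assumed)."""
    total = sum(phase_counts)
    entropy = 0.0
    for count in phase_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, 3)
    return entropy

def evidence_weights(phase_counts):
    """Return (RPF weight, sequence-feature weight) = (1 - Entropy, Entropy)."""
    entropy = phase_entropy(phase_counts)
    return 1.0 - entropy, entropy

# Reads concentrated on phase 0 (clean data) versus spread evenly (noisy data).
clean = evidence_weights([90, 5, 5])
noisy = evidence_weights([34, 33, 33])
```

With these toy counts the clean data set gives the RPF evidence most of the weight, while the noisy data set shifts almost all of the weight to the codon usage frequency, which is the behavior the paragraph above describes.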
  • "Sequence feature" refers to a feature of the sequence itself, that is, a non-RPF feature as opposed to the RPF evidence. Here it specifically refers to the codon usage frequency.
  • The distance between the 5' end and the P-site is not a single fixed value but a series of values. Three values are used here, each with a corresponding probability.
  • The calculation method is described in the feature training part of step (4): by extracting the alignment information of RPFs aligned to the start or stop codon of a known coding frame, calculate the distance between the 5' end of each RPF and the ribosomal P site (P-site) or the ribosomal A site (A-site). Then calculate the probability that each base or three-base combination on each transcript is located exactly at the P-site, and convert it to a Z-score, that is, standardize the data. If the probability is calculated per base, each base receives a probability value that represents the probability that the three-base combination starting from this base is located at the P-site.
  • The position information in step (5) refers to the position of the 5' end of the RPF, which is obtained by alignment to the genome.
  • The three-base combination in step (5) is further defined as a combination of three consecutive bases.
  • The scheme should be understood as follows: if three consecutive bases correspond to a codon under the genetic code applicable to the species being analyzed, calculate the probability that this codon is located at the P-site. In this way, calculate the P-site probability for all possible codon combinations in the current transcript, and then repeat the calculation for all transcripts.
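A per-base version of this calculation might look like the sketch below. It assumes that the trained distances are available as a mapping from RPF length to `{distance: probability}`, that coordinates are 0-based, and that the accumulated weight at a base is used as an unnormalized stand-in for the P-site probability before standardization. The names `psite_signal` and `to_z_scores` are hypothetical.

```python
from statistics import mean, pstdev

def psite_signal(transcript_length, reads, offset_freqs):
    """For each base, accumulate the evidence that a P-site codon starts there."""
    signal = [0.0] * transcript_length
    for five_prime, length in reads:
        # Spread each read over its candidate P-site positions.
        for distance, prob in offset_freqs.get(length, {}).items():
            pos = five_prime + distance
            if 0 <= pos < transcript_length:
                signal[pos] += prob
    return signal

def to_z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

# Toy example: two 28-nt reads, with distance 12 three times as likely as 13.
offset_freqs = {28: {12: 0.75, 13: 0.25}}
signal = psite_signal(20, [(0, 28), (3, 28)], offset_freqs)
z = to_z_scores(signal)
```

Each read contributes a total weight of 1, split across its candidate P-site positions, so a read with an uncertain offset is not forced onto a single base.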
  • The features of the coding frames are extracted as follows:
  • (1) Z-score: calculate the probability that each codon is located exactly at the P-site, and convert it into a Z-score.
  • (2) Codon usage frequency: from the codon usage of all coding frames in the genome, calculate the frequency of each codon, and then calculate the average codon frequency in each known coding frame.
  • Step (4) trains the characteristics of the RPFs, which contain the actually measured coding frame information.
  • Step (6) trains the sequence characteristics of the known coding frames. The feature training result of step (4) and the feature extraction result of step (6) are used together to predict unknown coding frames.
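The codon usage feature can be sketched as below, assuming coding frames are supplied as RNA strings. The helper names and the `phase` parameter (used to read the same sequence in phase 0, 1 or 2, as the later one-tailed tests compare phases) are illustrative assumptions rather than part of the source.

```python
from collections import Counter

def codons(seq, phase=0):
    """Split a sequence into complete codons, starting at the given phase."""
    return [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]

def codon_frequencies(known_orfs):
    """Relative frequency of each codon across all known coding frames."""
    counts = Counter(c for orf in known_orfs for c in codons(orf))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def mean_codon_frequency(seq, freqs, phase=0):
    """Average usage frequency of the codons read from seq in the given phase."""
    cs = codons(seq, phase)
    if not cs:
        return 0.0
    return sum(freqs.get(c, 0.0) for c in cs) / len(cs)

# Toy example with two known coding frames.
known = ["AUGGCUGCUUAA", "AUGGCUUGA"]
freqs = codon_frequencies(known)
in_frame = mean_codon_frequency("AUGGCUUAA", freqs, phase=0)
off_frame = mean_codon_frequency("AUGGCUUAA", freqs, phase=1)
```

In this toy case the in-frame reading scores higher than the phase 1 reading, which is the kind of contrast that the codon frequency tests rely on.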
  • Extraction and search of candidate coding frame sequences (see Figure 2): from the sequence information of all transcripts in (3), extract all candidate coding frame sequences. The criteria are that a candidate has a start codon (NUG) and a stop codon (UAG, UAA or UGA) and that its length is a multiple of 3. Candidate coding frames starting with AUG are searched first, from long to short, and evaluated one by one. Only after all AUG-initiated candidates have been searched and none meets the output conditions are NUG-initiated coding frames searched and evaluated.
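The search order can be sketched as follows. This is an illustrative enumeration only: it assumes the transcript is an RNA string, reports 0-based `(start, end, start_codon)` tuples with an exclusive end that includes the stop codon, and simply returns the candidates in priority order rather than stopping at the first one that meets the output conditions. The function name is hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def candidate_orfs(transcript):
    """List candidate coding frames: NUG start, in-frame stop, length % 3 == 0."""
    found = []
    for frame in range(3):
        starts = []  # in-frame NUG positions seen since the previous stop codon
        for i in range(frame, len(transcript) - 2, 3):
            codon = transcript[i:i + 3]
            if codon in STOP_CODONS:
                # Every pending start forms a candidate ending at this stop.
                found.extend((s, i + 3, transcript[s:s + 3]) for s in starts)
                starts = []
            elif codon[1:] == "UG":
                starts.append(i)
    # AUG-initiated candidates first; within each group, longest first.
    return sorted(found, key=lambda orf: (orf[2] != "AUG", -(orf[1] - orf[0])))

orfs = candidate_orfs("CUGAUGGCCUAA")
```

For this toy transcript the AUG-initiated candidate is listed before the longer CUG-initiated one, reflecting the rule that AUG starts are examined before other NUG starts.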
  • NUG start codon
  • UGA stop codon
  • (3) P value combination: the four P values obtained from the above tests (the P value is the parameter used to decide the result of a hypothesis test) are combined into a final P value by the weighted chi-square method.
  • The calculation method is as follows.
  • Each P value is converted into a chi-square value, and the formula is as follows:
  • M represents the combined chi-square value,
  • i is the index of the test,
  • P_i is the P value of the i-th test, and
  • w_i is the weight of the i-th P value; the sum of the w_i must be 1.
  • Because the RPF evidence and the codon usage frequency are each tested twice, the weight of each corresponding P value is half of the RPF or codon frequency weight calculated in the previous step.
  • w_i and w_j are the weights, with the same meaning as in the formula above.
  • ρ_ij is the correlation between the i-th test and the j-th test.
  • It can be estimated indirectly from the calculated P values, as follows:
  • the corresponding P value is obtained according to the chi-square distribution 2 ⁇ 2 k /k.
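Since the formulas themselves are not reproduced in this text, the following is only a hedged sketch of one common way to carry out a weighted chi-square combination. It assumes each P value is converted to -2 ln(P), that the weighted sum M is approximated by a scaled chi-square distribution matched on mean and variance, and that the tests are independent, so the correlation term described above is ignored. All function names are hypothetical, and the chi-square tail is computed with a plain series that is adequate for illustration but not for extreme values.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (incomplete gamma series)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-16:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return max(0.0, 1.0 - lower)

def halved_weights(entropy):
    """Split the RPF and codon weights over the two tests made on each."""
    rpf, seq = (1.0 - entropy) / 2.0, entropy / 2.0
    return [rpf, rpf, seq, seq]

def combine_p_values(p_values, weights):
    """Weighted chi-square combination, ASSUMING independent tests."""
    m = sum(w * -2.0 * math.log(p) for p, w in zip(p_values, weights))
    mean = 2.0 * sum(weights)                # each -2 ln(P) has mean 2
    var = 4.0 * sum(w * w for w in weights)  # and variance 4 when independent
    scale = var / (2.0 * mean)               # M is treated as scale * chi2(df)
    df = 2.0 * mean * mean / var
    return chi2_sf(m / scale, df)

combined = combine_p_values([0.01, 0.01, 0.01, 0.01], halved_weights(0.5))
```

With equal weights this reduces to Fisher's method, which gives a convenient sanity check; with unequal weights the approximation lets strong RPF evidence dominate when Entropy is low and codon usage dominate when it is high.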
  • Example 1 mainly relates to a method for predicting protein coding frames using low-quality ribosome imprinting data.
  • The accurate prediction of protein coding frames (including small coding frames) is the basis of all gene-related research and applications.
  • The rise of ribosome imprinting sequencing technology makes it possible to predict protein coding frames more accurately, and in particular to predict small coding frames.
  • The use of these tools rests on an ideal condition, namely that the ribosome imprinting data are of high quality (showing a complete 3-base periodic distribution).
  • The present invention extracts the codon usage frequency and combines it with the 3-base periodicity of ribosome imprinting data to measure the quality of the ribosome imprinting data. Corresponding weights are assigned, the probability that each codon is located at the ribosomal P site is calculated, sequence features are extracted, and the predicted probability value of each coding frame is evaluated through statistical analysis, so that new coding frames are predicted.
  • The present invention greatly reduces the requirements that related work places on the quality of ribosome imprinting data, and greatly promotes wider application of ribosome imprinting technology, especially in crop research.
  • The weight assigned depends on the quality of the data: the higher the quality of the ribosome imprinting data, the higher the weight assigned.
  • The prediction method of the present invention is not limited to applications in crop research.
  • The prediction method can be used for animals, plants and microorganisms, and performs well in all of them. Data from animals, microorganisms and humans are usually of relatively high quality and can be handled well by existing methods.
  • Low-quality ribosome imprinting data usually occur in plant species, especially non-model species.
  • The gene coding frame prediction method of the present invention can also process low-quality ribosome imprinting data that existing prediction methods cannot process.
  • To further verify the accuracy of the predicted ncsORFs, we performed an evolutionary analysis of the predicted ncsORF sequences and confirmed the accuracy of the prediction through their sequence conservation.
  • Figure 3(G) and Figure 10 show that most of the predicted ncsORFs are strongly conserved. Specifically, 5 ncsORFs first appear in mosses, and their sequences are highly conserved in all plant lineages; another 4 ncsORFs first appear in cruciferous plants and are highly conserved within that lineage. From this we infer that these ncsORFs have important biological functions and that these predictions are correct.
  • Embodiment 2 is a specific example of Embodiment 1.
  • the present invention also discloses a system for predicting a gene encoding frame, including a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for predicting a gene encoding frame, and the computer program for predicting a gene encoding frame When executed by at least one processing component, the steps of the method for predicting a gene coding frame from low-quality ribosomal imprint data can be realized.
  • Embodiment 3 mainly solves the problem that existing systems for predicting gene coding frames can only process high-quality ribosome imprinting data and cannot handle low-quality ribosome imprinting data.
  • The storage medium may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions; a RAM or another type of dynamic storage device that can store information and instructions; an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.); disk storage media (including mechanical hard drives, solid-state drives, hybrid hard drives, etc.) or other magnetic storage devices (including tape); or any other medium (including SD cards, etc.) that can carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • The storage medium may be located locally or in the cloud.
  • the processing component is a processor, and the processor may be a CPU, a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof.
  • the processor may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.

Abstract

Provided is a method for predicting gene coding frames from low-quality ribosome imprint data. Ribosome imprints and codon usage frequency are used together to predict protein coding frames; a multitaper algorithm and complexity are used to describe the quality of the ribosome imprint data; and corresponding weights are assigned automatically according to the complexity of the ribosome imprint data, thereby balancing the influence of data quality. Specifically, the codon usage frequency is extracted and combined with the 3-base periodicity of the ribosome imprint data, the quality of the ribosome imprint data is measured and corresponding weights are assigned, the probability that each codon is at the P-site of the ribosome is calculated, sequence features are extracted, the predicted probability value of each coding frame is evaluated through statistical analysis, and new coding frames are then predicted. The requirement for the quality of ribosome imprint data is greatly reduced, and the application of ribosome imprint technology is greatly extended, particularly in crop research.

Description

Method and system for predicting gene coding frames from low-quality ribosome imprinting data
Technical field
The invention belongs to the field of biotechnology. It specifically relates to a method for predicting protein coding frames using low-quality ribosome imprinting data, that is, a method for predicting gene coding frames from low-quality ribosome imprinting data, and also relates to a system for predicting gene coding frames.
Background art
With the continuous development of second- and third-generation sequencing, genomic data have grown explosively in recent years, greatly promoting research and applications in the life sciences. Gene function is the basis of all life activities. Studying gene function improves our understanding of how diseases arise and how crop traits form, and in turn helps to prevent and treat diseases more effectively and to improve crop traits. In many existing genomics and biological studies, attention has focused on the larger coding genes in the genome (length >= 300 bp), while small coding frames have been ignored on the assumption that they are weakly expressed, have weak coding ability, and have no function or only very minor functions. As genome research has deepened, more and more evidence has shown that small coding frames in the genome play an important regulatory role in gene expression and translation, and are critical to the formation of plant traits, yeast development and animal embryo development. Research on small coding frames therefore has broad prospects in medical, industrial and agricultural applications, and is also essential for a comprehensive understanding of biological processes and mechanisms.
The accurate prediction of gene coding frames (open reading frames, ORFs) is the basic work of all genome research and related research and applications. At present, the prediction of gene coding frames is mainly based on the characteristics of the DNA sequence, which are used to determine the start and end positions of a protein-coding gene and then to infer the protein sequence it encodes. Existing data show that this traditional prediction method is highly accurate for long coding frames but almost powerless for small coding frames (small ORFs). The traditional approach confirms and verifies small coding frames one by one through experiments. This is time-consuming and labor-intensive, and is not practicable in most organisms. At present, experimental verification of about 300 small coding frames has been completed only in the yeast genome. In recent years, the rise of ribosome imprinting sequencing (Ribo-seq) technology has made it possible to predict small coding frames across the whole genome quickly and accurately. The basic principle is that a translated RNA sequence is protected by the ribosome. Extracting and sequencing these protected sequences yields the translated sequences, from which the positions of small coding frames can be predicted.
As the range of application of ribosome sequencing technology has expanded, many methods and software tools for predicting small coding frames from ribosome sequencing data have also been developed. However, because these methods were mainly developed in studies of model species, they rest on the ideal assumption that ribosome sequencing data are of high quality (showing a complete 3-base periodic distribution). This prerequisite is relatively easy to meet in model species, but not always in non-model species. Even in model species, sequencing the ribosome-protected sequences from different organelles does not always yield high-quality data that meet this condition. The requirement for high-quality ribosome imprinting data therefore greatly hinders the application of this technology in non-model species and limits its range of application. Developing new methods and software that can analyze low-quality ribosome sequencing data is of great significance for advancing the application of this technology and research on small coding frames.
Summary of the invention
The purpose of the present invention is to overcome the above shortcomings of the prior art by providing a method for predicting gene coding frames from low-quality ribosome footprint data. The invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome footprint data. It measures the quality of the footprint data and assigns weights accordingly, calculates the probability that each codon lies at the ribosomal P-site, extracts sequence features, and evaluates a combined prediction probability for each coding frame by statistical analysis, thereby predicting new coding frames. The method lowers the data quality required for ribosome footprint analysis and quickly broadens its range of application.
To achieve the above purpose, the present invention provides a method for predicting gene coding frames from low-quality ribosome footprint data, comprising the following steps:
S1: remove the adapters from the raw ribosome footprint sequencing data and align the reads to the reference genome sequence;
S2: use the multitaper algorithm to analyze the 3-base periodicity of ribosome-protected fragments (RPFs) of different lengths, and retain the RPFs that pass the evaluation for subsequent analysis;
S3: using the genome annotation file, extract the sequence and position information of transcripts and known coding frames, obtaining the sequences of all transcripts and known coding frames genome-wide;
S4: perform feature training on the RPFs retained in step S2 and assign weights accordingly;
S5: calculate the probability that each base, or each three-base combination, on each transcript lies exactly at the ribosomal P-site;
S6: extract coding frame features from the sequence information of the known coding frames and the P-site probabilities calculated in step S5;
S7: predict unknown coding frames from the P-site probabilities calculated in S5 and the coding frame features obtained in S6.
It should be noted that the coding frame feature in S6 refers to the codon usage frequency of the known coding frames.
Preferably, in S2, the 3-base periodicity of RPFs of each length is evaluated by the multitaper algorithm, and RPFs whose frequency is reported as 3.33 Hz to 0.34 Hz with a P value ≤ 0.01 are retained for subsequent analysis.
More preferably, in S2, the 3-base periodicity of RPFs of each length is evaluated by the multitaper algorithm, and RPFs whose frequency is reported as 3.33 Hz or 0.34 Hz with a P value ≤ 0.01 are retained for subsequent analysis.
Preferably, S4 includes:
S41: for RPFs of each length, count how often each distance between the RPF 5' end and the P-site occurs;
S42: weight assignment: from the frequencies with which the RPFs fall in phases 0, 1 and 2, as obtained in S41, calculate the distribution concentration.
More preferably, S41 is specifically: by analyzing the positions of RPFs that contain the start codon or stop codon of a known coding frame relative to that start or stop codon, calculate the distance from the 5' end of each RPF to the ribosomal P-site and/or A-site, and for RPFs of each length count how often each distance between the 5' end and the P-site occurs.
Preferably, S42 is specifically: from the frequencies with which the RPFs fall in phases 0, 1 and 2, as obtained in S41, calculate the distribution concentration. The distribution concentration is described by the complexity Entropy, given by Formula 1:
Figure PCTCN2019087412-appb-000001
Figure PCTCN2019087412-appb-000002
where i denotes the phase, taking the values 0, 1 and 2, and P_i is the proportion of RPFs falling in phase i. The value of Entropy is calculated from Formula 1; the RPF evidence is assigned the weight 1 – Entropy and, correspondingly, the sequence feature is assigned the weight Entropy.
Preferably, S5 is specifically: from the position of each RPF obtained by Ribo-seq and the distance between the 5' end of each RPF and the P-site, calculate the probability that each base, or each three-base combination, on each transcript lies exactly at the P-site.
Preferably, S6, extracting coding frame features from the sequence information of each coding frame and the P-site probabilities calculated in S5, specifically includes the following steps:
S61, Z-score: convert the P-site probabilities calculated in S5 into Z-scores;
S62, codon usage frequency: from the codon usage of all coding frames in the genome, calculate the frequency of each codon, then calculate the mean codon frequency within each known coding frame.
Preferably, S7 specifically includes:
S71: from the sequence information of all transcripts in S3, extract and search for candidate coding frame sequences;
S72: extract the features of the candidate coding frames obtained in S71 by the method of S6, and perform several groups of statistical tests to obtain several P values;
S73, P value combination: combine the P values from S72 into a final P value using the weighted chi-square method;
S74, output of predictions: apply thresholds to the final P value from S73 and to the false discovery rate (FDR) of the output coding frames, and output the candidate coding frames that meet the output criteria.
More preferably, S7 specifically includes:
S71: from the sequence information of all transcripts in S3, extract all candidate coding frame sequences. A candidate must have a start codon (NUG) and a stop codon (UAG, UAA or UGA), and its length must be an integer multiple of 3. Candidates starting with AUG are searched first, from longest to shortest, and evaluated one by one; only after all AUG-initiated candidates have been searched without meeting the output conditions are NUG-initiated candidates searched and evaluated;
S72: extract the features of these candidate coding frames by the method of S6 and perform four groups of one-tailed tests: (a) the Z-scores in phase 0 are highly significantly greater than those in phase 1; (b) the Z-scores in phase 0 are highly significantly greater than those in phase 2; (c) the codon usage frequencies in phase 0 are highly significantly greater than those in phase 1; (d) the codon usage frequencies in phase 0 are highly significantly greater than those in phase 2;
S73, P value combination: combine the P values from S72 into a final P value using the weighted chi-square method;
S74, output of the predicted coding frames: output candidate coding frames with P value ≤ 0.001, controlling the false discovery rate of the output at FDR ≤ 0.0001 by the Benjamini and Hochberg method; candidates meeting this criterion are written to the final output.
Preferably, in S7, the predicted unknown coding frames include small coding frames and/or coding frames of normal length.
To achieve another purpose of the present invention, the invention also provides a system for predicting gene coding frames, comprising a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for predicting gene coding frames which, when executed by at least one processing component, implements the steps of the above method for predicting gene coding frames from low-quality ribosome footprint data.
The beneficial effects of the present invention are:
1. The invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome footprint data. It measures the quality of the footprint data and assigns weights accordingly, calculates the probability that each codon lies at the ribosomal P-site, extracts sequence features, and evaluates a combined prediction probability for each coding frame by statistical analysis, thereby predicting new coding frames. The method raises tolerance to noisy data, lowers the data quality required for ribosome footprint analysis, and quickly broadens its range of application. The prediction method is suitable for model organisms in which high-quality ribosome footprint data are difficult to obtain for certain organelles, and for non-model organisms in which high-quality ribosome footprint data are difficult to obtain. The invention greatly widens the range of coding frames that can be predicted, which is of great significance for advancing research on small coding frames.
2. To make the prediction method convenient to apply, its steps are presented to the user as a computer program. The user supplies the necessary inputs, such as ribosome footprint data, and the program outputs the predicted coding frames. This improves processing efficiency for the user; when the method is extended to other species, a computer implementation makes coding frame prediction more efficient and allows the method to be adopted more quickly.
Description of the drawings
Figure 1 is a schematic diagram of the technical route of the present invention, i.e. its workflow;
Figure 2 is a schematic diagram of the candidate coding frame search strategy of the present invention;
Figure 3 is an application example of the present invention, in which: Figure 3(A) is the RPF length distribution of the example data; Figure 3(B) is the 3-base periodicity evaluation result; Figure 3(C) is the RPF distribution concentration calculation and weight assignment; Figure 3(D) is the evaluation of prediction performance; Figure 3(E) is the small coding frame prediction result; Figure 3(F) is the supporting evidence from protein mass spectrometry data; Figure 3(G) is the evolutionary analysis of the predicted ncsORFs, shown as a heat map in which the color depth of each square indicates the magnitude of the value;
The following enlarged views show the details of each panel of Figure 3 more clearly:
Figure 4 is an enlarged view of panel A in Figure 3;
Figure 5 is an enlarged view of panel B in Figure 3;
Figure 6 is an enlarged view of panel C in Figure 3;
Figure 7 is an enlarged view of panel D in Figure 3;
Figure 8 is an enlarged view of panel E in Figure 3;
Figure 9 is an enlarged view of panel F in Figure 3;
Figure 10 is an enlarged view of panel G in Figure 3;
Figure 11 is a schematic diagram of the method of the present invention for predicting gene coding frames from low-quality ribosome footprint data.
The realization of the purpose, the functional characteristics and the advantages of the present invention are further described below in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description of the embodiments
Example 1
The present invention discloses a method for predicting gene coding frames from low-quality ribosome footprint data. The method measures the quality of the footprint data, uses that measure to filter the data and assign weights, and then integrates codon usage frequency to assist the prediction of protein coding frames. The method is insensitive to the quality of the ribosome footprint data and is strongly fault-tolerant. It also performs very well on high-quality ribosome footprint data, predicting translated coding frames comprehensively and accurately. The method is therefore applicable to all ribosome footprint data. The main points of the invention are as follows:
1. Ribosome footprints and codon usage frequency are used together to predict protein coding frames.
2. The multitaper algorithm and complexity (entropy) are used to describe the quality of ribosome footprint data.
3. Weights are assigned automatically according to the complexity (entropy) of the ribosome footprint data, balancing the influence of data quality.
As described above, the present invention addresses the problem that current methods for analyzing ribosome footprint sequencing data demand excessively high data quality. It proposes a new method of predicting gene coding frames that raises tolerance to noisy data and lowers the data quality required. Note that the invention applies only to species for which a reference genome sequence and annotation are available.
Referring to Figure 1 and Figure 4, the method of the present invention mainly includes the following steps:
(1) Genome alignment
Remove the adapters from the raw ribosome footprint sequencing data and align the reads to the reference genome sequence. The reference genome sequence can be obtained from public sources.
The purpose of the alignment in step (1) is to obtain the genomic position of each ribosome footprint sequence. The reference genome is the known genome sequence, and the footprint data are aligned to it to obtain their positions on the genome. If the alignment is wrong, all subsequent predictions will be wrong. This is one reason why the prediction method requires a reference genome sequence.
(2) Quality assessment of ribosome footprint data
Analyze the 3-base periodicity of RPFs of different lengths in the footprint data and filter out data that show no periodicity at all. Specifically, the 3-base periodicity of each RPF length is evaluated by the multitaper algorithm, and RPFs whose frequency is reported as 3.33 Hz to 0.34 Hz with a P value ≤ 0.01 are retained for subsequent analysis.
Step (2) thus includes a data filtering operation: completely unusable data are filtered out and data that pass the evaluation are kept. The multitaper algorithm is used for the quality assessment, whose purpose is to give the filtering a clear criterion.
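As a rough illustration of the periodicity check in step (2), the sketch below measures how much of a footprint profile's spectral power sits at a period of 3 bases. It is a deliberately simplified stand-in: it uses a plain discrete Fourier transform rather than the multitaper estimator named in the text, and it does not compute a P value. The function name `period3_power_fraction` and the toy profiles are hypothetical.

```python
import cmath
import math

def period3_power_fraction(counts):
    """Fraction of non-DC spectral power at a period of 3 bases.

    counts: read counts per nucleotide along a coding region. The length is
    trimmed to a multiple of 3 so that the period-3 frequency falls exactly
    on a DFT bin.
    """
    n = len(counts) - len(counts) % 3
    if n < 6:
        raise ValueError("need at least 6 positions")
    x = counts[:n]
    mean = sum(x) / n
    centered = [v - mean for v in x]

    def power(k):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        return abs(coeff) ** 2

    # Bins 1..n//2 cover all distinct non-DC frequencies of a real signal.
    powers = {k: power(k) for k in range(1, n // 2 + 1)}
    total = sum(powers.values())
    return powers[n // 3] / total if total > 0 else 0.0

periodic = [5, 0, 0] * 10   # every read in phase 0
flat = [1] * 30             # no periodicity at all
print(period3_power_fraction(periodic), period3_power_fraction(flat))
```

A value close to 1 indicates a strongly periodic profile; a real pipeline would replace this with a multitaper estimate and its significance test.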
(3) Assembly of transcripts and known coding frames
Using the genome annotation file, extract the sequence and position information of transcripts and known coding frames, obtaining the sequences of all transcripts and known coding frames genome-wide.
The purpose of step (3) is as follows. Coding frames are predicted from the transcript sequences. The sequence information of the known coding frames is used to train the codon usage frequency, and their position information is used to train the distance between the RPF 5' end and the corresponding P-site.
(4) RPF feature training and weight assignment
① Feature training: from the alignments of RPFs that map to the start or stop codons of known coding frames, calculate the distance from the 5' end of each RPF to the ribosomal P-site and/or A-site, and for RPFs of each length count how often each distance between the 5' end and the P-site occurs.
Step (4)① can be simplified: it is sufficient to calculate only the distance from the 5' end of each RPF to the P-site, because the A-site and P-site are always 3 bases apart.
The purpose of the feature training in step (4) is to obtain, for each RPF, the distance from its 5' end to its corresponding P-site.
This distance information is used to locate the P-site corresponding to each RPF. Note that the P-site is not known directly for every RPF; it can be determined only for RPFs that contain a known start or stop codon. The distance information is trained on that subset of RPFs and then applied to the others.
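The training step can be sketched as a simple tally. The example below is a minimal illustration, assuming that an initiating ribosome holds the start codon in its P-site, so that the offset is the start codon position minus the read's 5' position. The input tuple format and the name `train_offsets` are hypothetical and not taken from the source.

```python
from collections import Counter, defaultdict

def train_offsets(reads_at_starts):
    """Tally 5'-end-to-P-site offsets for each RPF length.

    reads_at_starts: iterable of (read_5p_pos, read_length, start_codon_pos),
    all in 0-based transcript coordinates, for reads covering an annotated
    start codon. ASSUMPTION: the start codon sits in the P-site, so the
    offset is start_codon_pos - read_5p_pos.
    Returns {read_length: {offset: relative_frequency}}.
    """
    tallies = defaultdict(Counter)
    for read_5p, length, start_pos in reads_at_starts:
        offset = start_pos - read_5p
        if 0 <= offset <= length - 3:  # whole codon must lie within the read
            tallies[length][offset] += 1
    freqs = {}
    for length, counter in tallies.items():
        total = sum(counter.values())
        freqs[length] = {off: n / total for off, n in counter.items()}
    return freqs

reads = [(88, 28, 100), (88, 28, 100), (88, 28, 100), (87, 28, 100)]
print(train_offsets(reads))
```

The resulting per-length offset frequencies are what the later P-site step would consume.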
② Weight assignment: calculate the distribution concentration of the RPFs from the frequencies with which they fall in phases 0, 1 and 2. Distribution concentration here means the concentration of the phase distribution. It is described by the complexity (entropy), with the following formula:
Figure PCTCN2019087412-appb-000003
where i denotes the phase (0, 1 or 2) and P_i is the proportion of RPFs falling in phase i. Based on the calculated Entropy, the RPF evidence is assigned the weight (1 – Entropy) and, correspondingly, the sequence feature is assigned the weight Entropy.
In step (4), the weight assigned to the RPFs is a coefficient that determines how much this evidence contributes to the subsequent prediction. The higher the RPF quality, the higher the weight and the greater the contribution. Conversely, the lower the RPF quality (the noisier the data), the smaller the contribution, and the prediction then relies more on the other evidence, which reduces the adverse effect of RPF noise on the result. "Sequence feature" means a feature of the sequence itself, in contrast to the RPFs, which are a non-sequence feature. Here it refers specifically to codon usage frequency.
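The weighting can be illustrated with a short calculation. The entropy formula itself appears only as an image in the source, so the sketch below assumes Shannon entropy with logarithm base 3, which keeps the value between 0 and 1 as the weights 1 – Entropy and Entropy require. The function name `entropy_weights` is hypothetical.

```python
import math

def entropy_weights(phase_counts):
    """Return (entropy, rpf_weight, sequence_weight) for one set of RPFs.

    phase_counts: read counts in phases 0, 1 and 2.
    ASSUMPTION: entropy is Shannon entropy with logarithm base 3, so that it
    ranges from 0 (all reads in one phase) to 1 (reads spread evenly).
    """
    total = sum(phase_counts)
    if len(phase_counts) != 3 or total <= 0:
        raise ValueError("expected three counts with a positive sum")
    proportions = [c / total for c in phase_counts]
    entropy = -sum(p * math.log(p, 3) for p in proportions if p > 0)
    return entropy, 1.0 - entropy, entropy

print(entropy_weights([80, 15, 5]))   # fairly concentrated: RPF weight dominates
print(entropy_weights([34, 33, 33]))  # nearly uniform: codon usage dominates
```

Under this assumption, perfectly phased data give the footprint evidence a weight of 1, while data with no phase preference shift the weight entirely to codon usage.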
(5) Calculation of the P-site probability
From ribosome profiling (Ribo-seq), obtain the position of each RPF and the distance between its 5' end and the P-site. Note that this distance is not a single fixed value but a set of values; three values are used here, each with an associated probability. The distances are computed as in the feature training part of step (4): from the alignments of RPFs that map to the start or stop codons of known coding frames, calculate the distance from the 5' end of each RPF to the ribosomal P-site or A-site. Then calculate the probability that each base, or each three-base combination, on each transcript lies exactly at the P-site, and convert it to a Z-score, i.e. standardize the data. If the per-base scheme is used, every base receives a probability value, which represents the probability that the three-base combination starting at that base lies at the P-site.
Note that:
a) The position information in step (5) refers to the position of the RPF 5' end, obtained by alignment to the genome.
b) The three-base combination in step (5) is further defined as a combination of three consecutive bases.
c) If the scheme that calculates the probability for each three-base combination is used, it should be understood as follows: if three consecutive bases correspond to a codon under the genetic code applicable to the species being analyzed, the probability that this codon lies exactly at the P-site is calculated. In this way the P-site probability is calculated for every possible codon on the current transcript, and then, in the same way, for every transcript.
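A minimal sketch of step (5) under stated assumptions: each read is spread over its candidate P-site positions in proportion to the trained offset frequencies, and the resulting per-base signal is standardized. The source does not spell out the exact normalization, so the accumulated value is treated here as an unnormalized P-site weight; the names `psite_signal` and `to_zscores` are hypothetical.

```python
from statistics import mean, pstdev

def psite_signal(transcript_len, reads, offset_freqs):
    """Distribute each read over its possible P-site positions.

    reads: iterable of (read_5p_pos, read_length) in transcript coordinates.
    offset_freqs: {read_length: {offset: frequency}} from feature training.
    Returns a per-base list; element p accumulates the weight with which the
    three-base combination starting at p is placed in the P-site.
    """
    signal = [0.0] * transcript_len
    for read_5p, length in reads:
        for offset, freq in offset_freqs.get(length, {}).items():
            pos = read_5p + offset
            if 0 <= pos < transcript_len:
                signal[pos] += freq
    return signal

def to_zscores(signal):
    """Standardize the signal to zero mean and unit variance."""
    mu, sd = mean(signal), pstdev(signal)
    if sd == 0:
        return [0.0] * len(signal)
    return [(v - mu) / sd for v in signal]

offsets = {28: {12: 0.7, 13: 0.2, 11: 0.1}}  # three offsets, as in the text
signal = psite_signal(40, [(0, 28), (3, 28), (6, 28)], offsets)
print([round(z, 2) for z in to_zscores(signal)])
```

Reads whose length has no trained offsets are simply ignored in this sketch.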
(6) Feature extraction from known coding frames
From the sequence information of each coding frame and the P-site probabilities calculated in the previous step, extract the coding frame features as follows:
① Z-score: calculate the probability that each codon lies exactly at the P-site and convert it to a Z-score.
② Codon usage frequency: from the codon usage of all coding frames in the genome, calculate the frequency of each codon, then calculate the mean codon frequency within each known coding frame.
Note that step (4) trains the features of the RPFs, which carry the coding frame information actually measured, whereas step (6) trains the sequence features of the known coding frames. The feature training result of step (4) and the feature extraction result of step (6) are used together to predict unknown coding frames.
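The codon usage feature in step (6)② can be sketched as two small functions: one builds genome-wide codon frequencies from the known coding frames, the other averages those frequencies over the codons of a given sequence in a chosen phase. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

def codon_frequencies(orf_sequences):
    """Genome-wide relative frequency of each codon across known coding frames."""
    counts = Counter()
    for seq in orf_sequences:
        usable = len(seq) - len(seq) % 3
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def mean_codon_frequency(seq, freqs, phase=0):
    """Average genome-wide frequency of the codons read from seq in a phase."""
    codons = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(freqs.get(c, 0.0) for c in codons) / len(codons)

known = ["AUGGCUUAA", "AUGUAA"]
freqs = codon_frequencies(known)
print(freqs)
print(mean_codon_frequency("AUGGCUUAA", freqs, phase=0))
print(mean_codon_frequency("AUGGCUUAA", freqs, phase=1))
```

Reading the same sequence in phases 1 and 2 gives the values against which the in-frame (phase 0) codon frequencies are later compared.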
(7) Prediction of coding frames
① Extraction and search of candidate coding frame sequences (see Figure 2): from the sequence information of all transcripts in (3), extract all candidate coding frame sequences. A candidate must have a start codon (NUG) and a stop codon (UAG, UAA or UGA), and its length must be a multiple of 3. Candidates starting with AUG are searched first, from longest to shortest, and evaluated one by one; only after all AUG-initiated candidates have been searched without meeting the output conditions are NUG-initiated candidates searched and evaluated.
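The search order described above can be sketched as follows. This is an illustrative enumeration only: it lists every in-frame NUG-to-stop candidate on one transcript and sorts AUG-initiated candidates first, longest to shortest. It does not implement the rule of moving on to NUG starts only when no AUG candidate meets the output conditions, which depends on the later statistical tests. The function name `candidate_orfs` is hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def candidate_orfs(transcript):
    """List candidate coding frames as (start, end, start_codon), end exclusive.

    A candidate begins with an NUG codon and ends at the first in-frame stop
    codon, so its length is a multiple of 3. Candidates are ordered with
    AUG-initiated ones first, then other NUG starts, each group from longest
    to shortest.
    """
    seq = transcript.upper().replace("T", "U")
    found = []
    for frame in range(3):
        open_starts = []  # in-frame NUG positions since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                found.extend((s, i + 3, seq[s:s + 3]) for s in open_starts)
                open_starts = []
            elif codon[1:] == "UG":
                open_starts.append(i)
    found.sort(key=lambda orf: (orf[2] != "AUG", -(orf[1] - orf[0])))
    return found

print(candidate_orfs("CCAUGCUGAAAUAAGG"))
```

In this toy transcript the AUG-initiated candidate is listed before the shorter CUG-initiated one that shares its stop codon.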
② Statistical tests: extract the features of these candidate coding frames by the method in (6) and perform four groups of one-tailed tests: (a) the Z-scores in phase 0 are highly significantly greater than those in phase 1; (b) the Z-scores in phase 0 are highly significantly greater than those in phase 2; (c) the codon usage frequencies in phase 0 are highly significantly greater than those in phase 1; (d) the codon usage frequencies in phase 0 are highly significantly greater than those in phase 2.
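The source does not name the statistical test used for these four comparisons. Purely for illustration, the sketch below uses a paired one-tailed sign test, chosen because it needs only the standard library; a different test may well be intended. The helper `phase_values`, which splits per-base values of a candidate into the three phases, is also hypothetical.

```python
from math import comb

def one_tailed_sign_test(phase0_values, other_values):
    """P value for 'phase 0 values tend to exceed the other phase'.

    Pairs the two lists codon by codon, drops ties, and returns the binomial
    tail probability P(X >= wins) under the null hypothesis p = 0.5.
    ASSUMPTION: the sign test is a stand-in; the source names no test.
    """
    diffs = [a - b for a, b in zip(phase0_values, other_values) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    wins = sum(d > 0 for d in diffs)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def phase_values(per_base_values, start, end):
    """Split per-base values of a candidate [start, end) into three phases."""
    return [per_base_values[start + phase:end:3] for phase in range(3)]

zscores = [2.1, 0.1, -0.3, 1.8, 0.2, 0.0, 2.5, -0.1, 0.4, 1.9, 0.3, 0.1, 2.2, 0.0, -0.2]
p0, p1, p2 = phase_values(zscores, 0, 15)
print(one_tailed_sign_test(p0, p1), one_tailed_sign_test(p0, p2))
```

The same function would be applied to the codon-frequency values to obtain tests (c) and (d).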
③P值合并:以上统计所得的4个P值(P value,P值是用来判定假设检验结果的一个参数)经加权卡平方算法(Weighted chi-square method)合并成最终P值,计算方法如下,③P value combination: the 4 P values (P value, which is a parameter used to determine the hypothesis test result) obtained from the above statistics, are combined into the final P value by the weighted chi-square method. The calculation method is as follows ,
首先按照步骤(4)中分配的权重,将P值转化为卡平方值,公式如下:First, according to the weight assigned in step (4), the P value is converted into the card square value, the formula is as follows:
Figure PCTCN2019087412-appb-000004
Figure PCTCN2019087412-appb-000004
其中M表示合并后的卡方值,i为第i个检验,Pi为第i个检验的P值,wi为第i个P值的权重,由于wi之和须为1,且RPF和密码子使用频率各进行了两次检验,因此,相对应P值的权重为上一步中计算所得RPF/密码频率权重的一半。Where M represents the combined chi-square value, i is the i-th test, Pi is the p-value of the i-th test, wi is the weight of the i-th P value, because the sum of wi must be 1, and RPF and codon The frequency of use has been checked twice, so the weight of the corresponding P value is half of the weight of the RPF/password frequency calculated in the previous step.
计算自由度(k)Calculation degrees of freedom (k)
k=2{E(M)} 2/var(M) k=2{E(M)} 2 /var(M)
其中,among them,
Figure PCTCN2019087412-appb-000005
Figure PCTCN2019087412-appb-000005
s i为P i单独转化后的卡方值,s i=-2×w i×ln(P i) s i is the chi-square value P i of a separately transformed, s i = -2 × w i × ln (P i)
Figure PCTCN2019087412-appb-000006
Figure PCTCN2019087412-appb-000006
where w_i and w_j are the corresponding weights, as in the formula above, and ρ_ij is the correlation between the i-th test and the j-th test. ρ can in turn be estimated indirectly from the calculated P values, as follows:
Figure PCTCN2019087412-appb-000007
where
Figure PCTCN2019087412-appb-000008
is the mean of the s_i. Since the expected value of q_t is E(q_t) = 4 - (0.75ρ^2 + 3.25ρ), it follows that
0.75ρ^2 + 3.25ρ + E(q_t) - 4 = 0
Finally, solving for ρ gives the approximation ρ ≈ -2.167 + (10.028 - 4q_t/3)^0.5.
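The closed-form approximation above is the positive root of the quadratic 0.75ρ^2 + 3.25ρ + q_t - 4 = 0, with q_t used in place of its expected value. A minimal sketch, with a hypothetical function name, is:

```python
from math import sqrt


def estimate_rho(q_t):
    """Approximate correlation from q_t using the stated closed form.

    Root of 0.75*rho**2 + 3.25*rho + q_t - 4 = 0 taken with the
    positive square root; the constants are the rounded values
    quoted in the text.
    """
    discriminant = 10.028 - 4.0 * q_t / 3.0
    if discriminant < 0.0:
        raise ValueError("q_t is too large for a real solution")
    return -2.167 + sqrt(discriminant)


rho_independent = estimate_rho(4.0)  # close to 0: uncorrelated tests
rho_identical = estimate_rho(0.0)    # close to 1: fully correlated tests
```

The two limiting cases are a quick sanity check: q_t = 4 corresponds to ρ ≈ 0 and q_t = 0 to ρ ≈ 1, up to the rounding of the constants.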
From the calculated degrees of freedom k and the combined chi-square value M, the corresponding P value is obtained from the scaled chi-square distribution 2χ²_k/k.
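As a sketch of this last step: if M is distributed as 2χ²_k/k, then M × k/2 follows a chi-square distribution with k degrees of freedom, and the P value is its upper tail. Because k is generally not an integer, the sketch evaluates the regularized upper incomplete gamma function directly with the standard series and continued-fraction expansions. E(M) and var(M) are taken as inputs because their formulas are given only as figures; all function names are hypothetical.

```python
from math import exp, lgamma, log


def degrees_of_freedom(mean_m, var_m):
    """k = 2 * E(M)^2 / var(M), as stated in the text."""
    return 2.0 * mean_m ** 2 / var_m


def _gamma_q(a, x, eps=1e-12, max_iter=500):
    """Regularized upper incomplete gamma function Q(a, x)."""
    if a <= 0.0 or x < 0.0:
        raise ValueError("requires a > 0 and x >= 0")
    if x == 0.0:
        return 1.0
    log_prefactor = -x + a * log(x) - lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower function P(a, x); Q = 1 - P.
        denom = a
        term = 1.0 / a
        total = term
        for _ in range(max_iter):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated by the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * exp(log_prefactor)


def combined_p_value(m_value, k):
    """Upper-tail P value of M under the scaled distribution 2*chi2_k/k."""
    # M * k / 2 is chi-square with k degrees of freedom, whose survival
    # function at y is Q(k / 2, y / 2).
    return _gamma_q(k / 2.0, m_value * k / 4.0)
```

For k = 2 the result reduces to exp(-M/2), which gives a simple way to check the implementation.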
④ False discovery rate (FDR) control of the coding frame output
Candidate coding frames with a P value ≤ 0.001 are retained, and the FDR is controlled at ≤ 0.0001 by the Benjamini-Hochberg method; candidate coding frames that meet this standard are output as the final result.
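The output filter can be sketched as below. The Benjamini-Hochberg adjustment itself is standard; whether the adjustment is applied to all candidates or only to those that already pass the P value cutoff is not stated, so applying it to all candidates is an assumption, as are the function names and the example data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_coding_frames(candidates, p_cutoff=0.001, fdr_cutoff=0.0001):
    """Keep candidates passing both the P value and the FDR thresholds.

    candidates is a list of (name, final_p_value) pairs.
    """
    q_values = benjamini_hochberg([p for _, p in candidates])
    return [
        name
        for (name, p), q in zip(candidates, q_values)
        if p <= p_cutoff and q <= fdr_cutoff
    ]


# Hypothetical candidates with their final combined P values.
kept = select_coding_frames([("orf_a", 1e-6), ("orf_b", 5e-4), ("orf_c", 0.2)])
```

In this example `orf_b` passes the P value cutoff but not the FDR cutoff, so only `orf_a` is output.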
Example 1 mainly relates to a method for predicting protein coding frames from low-quality ribosome imprint data. Accurate prediction of protein coding frames (including small coding frames) is the basis of all gene-related research and applications. The rise of ribosome imprint sequencing has made it possible to predict protein coding frames more accurately and, in particular, to predict small coding frames. Although many software tools and pipelines can predict protein coding frames from ribosome imprint data, they all depend on an ideal condition: the ribosome imprint data must be of high quality (showing a complete 3-base periodic distribution). Meeting this condition requires highly skilled experimental technique and expensive reagents and equipment, which greatly restricts wider application of the technology. In addition, high-quality ribosome imprint reads are usually short (28 nt) and can align to multiple sites in the genome, which introduces a large number of errors and hinders subsequent research. Overall, existing pipelines and tools cannot handle low-quality ribosome imprint data at all. To solve the problem that low-quality ribosome imprint data cannot be used while high-quality ribosome imprint data easily introduces errors, the present invention extracts the codon usage frequency, combines it with the 3-base periodicity of the ribosome imprint data, quantitatively measures the quality of the ribosome imprint data and assigns corresponding weights, calculates the probability that each codon is located at the ribosomal P-site, extracts sequence features, comprehensively evaluates the prediction probability of each coding frame through statistical analysis, and thereby predicts new coding frames. The present invention greatly lowers the ribosome imprint data quality required for such work and will greatly promote wider application of ribosome imprint technology, especially in crop research.
Regarding the preceding paragraph, the following should also be noted:
a) The weight assigned depends on data quality: the higher the quality of the ribosome imprint data, the higher the weight assigned to it.
b) The prediction method of the present invention is not limited to crop research; it can be used in animals, plants and microorganisms, and performs well in all of them. Relatively speaking, data quality is usually high in animals, microorganisms and humans, and existing methods handle such data well. Low-quality ribosome imprint data usually arises in plant species, especially non-model species. In other words, the gene coding frame prediction method of the present invention can additionally process low-quality ribosome imprint data that existing prediction methods cannot.
Example 2: Analysis of Arabidopsis membrane-bound ribosome data
(1) The experimental data were downloaded from NCBI (GEO accession: GSE82041). The data were published in eLife in 2016 by Li Shengben et al. in the article "Biogenesis of phased siRNA on membrane-bound polysomes in Arabidopsis". In that experiment, ribosomes bound to membranes were isolated and the mRNA fragments they protect were sequenced, yielding MBP (membrane-bound polysome) Ribo-seq data. During the preparation of MBP-protected fragments, degradation of naked RNA is usually incomplete, so the ribosome imprint (Ribo-seq) data are of low quality and do not show good 3-base periodicity.
(2) Referring to Figures 3 to 10, the method of the present invention was first used to assess the quality of the data. The results show that the RPF length distribution in this data set is not concentrated (Figure 3(A) and Figure 4). In theory, the ribosome imprint length in eukaryotes is 28 nucleotides (nt), so RPF lengths should cluster at 28 nt. Figure 3(A) shows that RPF lengths in this data set range widely, from 18 nt to 35 nt. Although there is a peak at 32 nt, it accounts for only about 10% of the total and deviates far from the theoretical value (28 nt). This indicates that degradation of naked mRNA was incomplete when the data were generated, leaving ribosome-protected fragments (RPFs) of varying lengths, which reduces the resolution and precision of the RPFs.
The 3-base periodicity of the data is also weak (Figure 3(B) and Figure 5). In theory, because a codon is 3 bases long, the distances between ribosome imprints should be multiples of 3, with a minimum of 3 bases, so their distribution along the transcript sequence shows 3-base periodicity. In the multitaper test this appears as a frequency peak at 1/3 with a highly significant P value; the better the periodicity, the smaller the P value, and ideally -log10(P value) ≥ 10. Figure 3(B) shows that the frequency peaks of most RPFs do not fall at 1/3 and that their P values are large. The dark line in the figure represents RPFs of length 32 nt (the most abundant); its -log10(P value) is about 3, only just passing the multitaper test (cutoff = 2).
The distribution of P-sites corresponding to the RPFs is also not concentrated (Figure 3(C) and Figure 6). The figure shows the concentration of the P-site distribution for RPFs of length 32 nt, for which the calculated entropy is 0.862. Ideally, if an RPF corresponds to a single P-site, the entropy is 0, whereas if an RPF corresponds to 3 P-sites with an even distribution, the entropy is 1. An entropy of 0.862 is close to 1 and far from 0, so the distribution in this data set is not sufficiently concentrated. Weights were therefore assigned to the RPF and the codon usage frequency accordingly (RPF: 0.138; codon frequency: 0.862). Because the RPFs in this data set are not concentrated enough across phases 0, 1 and 2, the coding frame prediction relies more heavily on codon frequency (weight 0.862).
Using the method of the present invention, 76% of the known coding frames were successfully predicted, with a precision of 98% and a combined score of 86% [combined score = 2 × recall × precision / (recall + precision)] (Figure 3(D) and Figure 7). In addition, 1471 small coding frames were predicted, comprising 114 uORFs, 93 ouORFs, 245 dORFs, 232 odORFs, 653 teORFs, 121 pORFs and 13 ncsORFs (Figure 3(E) and Figure 8).
Analysis of published protein mass spectrometry data shows that these predicted coding frames are well supported (Figure 3(F) and Figure 9). In Figure 3(F) and Figure 9, the horizontal dashed line marks the rate at which all known coding frames in the genome are supported by protein mass spectrometry data, which serves as the reference for comparison. The mass spectrometry support rate of the annotated ORFs predicted from this data set by the present method is clearly higher than this overall level. The other classes (uORF, ouORF, dORF, odORF, teORF, pORF and ncsORF) are small coding frames; because they are short and yield few peptides, they are hard to detect, so their support rates are relatively low. The ncsORFs in particular, being few in number, were not detected in the mass spectrometry data. All of this is expected.
To further verify the ncsORF predictions, an evolutionary analysis was performed on the predicted ncsORF sequences, using sequence conservation to confirm the accuracy of the prediction. Figure 3(G) and Figure 10 show that most of the predicted ncsORFs are strongly conserved. Specifically, 5 ncsORFs first appear in mosses and are highly conserved across all plant lineages, and another 4 ncsORFs first appear in Brassicaceae and are highly conserved within that lineage. From this it can be inferred that these ncsORFs have important biological functions and that the predictions are correct.
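The weight assignment and the combined score reported in (2) can be reproduced with a short sketch. The entropy formula is given only as a figure; a base-3 logarithm is assumed here because the text states that the entropy is 0 for a single P-site and 1 for an even distribution over 3 P-sites. The function names and the example phase fractions are hypothetical.

```python
from math import log


def phase_entropy(phase_fractions):
    """Entropy of the RPF distribution over phases 0, 1 and 2.

    Assumes a base-3 logarithm so that the value ranges from 0 (all
    RPFs in one phase) to 1 (RPFs spread evenly over three phases).
    """
    if len(phase_fractions) != 3:
        raise ValueError("expected one fraction per phase (0, 1, 2)")
    if abs(sum(phase_fractions) - 1.0) > 1e-9:
        raise ValueError("phase fractions must sum to 1")
    return sum(p * log(1.0 / p, 3) for p in phase_fractions if p > 0.0)


def weights_from_entropy(entropy):
    """Return (RPF weight, codon-frequency weight) = (1 - Entropy, Entropy)."""
    return 1.0 - entropy, entropy


def combined_score(recall, precision):
    """Combined score = 2 * recall * precision / (recall + precision)."""
    return 2.0 * recall * precision / (recall + precision)


# Values reported for the 32 nt RPFs in Example 2.
rpf_weight, codon_weight = weights_from_entropy(0.862)
score = combined_score(0.76, 0.98)
# Hypothetical phase fractions, for illustration only.
example_entropy = phase_entropy([0.6, 0.25, 0.15])
```

With the reported entropy of 0.862 this gives weights of 0.138 and 0.862, and the reported recall and precision give a combined score that rounds to 86%.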
Example 2 is a specific instance of Example 1.
Example 3
The present invention also discloses a system for predicting gene coding frames, comprising a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for predicting gene coding frames, and the computer program, when executed by at least one processing component, implements the steps of the above method for predicting gene coding frames from low-quality ribosome imprint data.
The problem mainly addressed by Example 3 is that existing systems for predicting gene coding frames can only process high-quality ribosome imprint data and cannot handle low-quality ribosome imprint data.
The storage medium may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a disk storage medium (including mechanical hard drives, solid-state drives, hybrid hard drives, etc.) or another magnetic storage device (including magnetic tape), or any other medium (including SD cards, etc.) that can carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited to these. The storage medium may be located locally or in the cloud.
The processing component is a processor, which may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from its principles and purpose.

Claims (10)

  1. A method for predicting gene coding frames from low-quality ribosome imprint data, characterized in that it comprises the following steps:
    S1, removing the adapters from the raw sequenced ribosome imprint data and aligning the data to the genome reference sequence;
    S2, analyzing the 3-base periodicity of ribosome imprint sequences (RPFs) of different lengths using the multitaper algorithm, and retaining the RPFs that pass the evaluation for subsequent analysis;
    S3, extracting the sequence and position information of transcripts and known coding frames from the genome annotation file, thereby obtaining the sequences of all transcripts and known coding frames in the whole genome;
    S4, performing feature training on the RPFs retained in step S2 and assigning weights accordingly;
    S5, calculating the probability that each base or each three-base combination on each transcript is located exactly at the ribosomal P-site;
    S6, extracting gene coding frame features from the known sequence information of each coding frame and the P-site probabilities calculated in step S5;
    S7, predicting unknown gene coding frames from the probability, calculated in S5, that each base or three-base combination is located exactly at the ribosomal P-site, and from the gene coding frame features obtained in S6.
  2. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 1, characterized in that, in S2, the 3-base periodicity of the RPFs of each length is evaluated by the multitaper algorithm, and RPFs whose frequency is 3.33 Hz to 0.34 Hz with a P value ≤ 0.01 are retained for subsequent analysis.
  3. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 1, characterized in that S4 comprises:
    S41, counting the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length;
    S42, weight assignment: calculating the distribution concentration from the frequencies, obtained in S41, at which each RPF occurs at phases 0, 1 and 2.
  4. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 3, characterized in that S41 specifically comprises: analyzing the RPFs that contain the start codon or stop codon of a known coding frame together with the position of the corresponding start or stop codon, calculating the distance between the 5' end of each RPF and the ribosomal P-site and/or the ribosomal A-site, and counting the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length.
  5. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 3, characterized in that S42 specifically comprises: calculating the distribution concentration from the frequencies, obtained in S41, at which each RPF occurs at phases 0, 1 and 2; the distribution concentration is described by the complexity Entropy, given by the following formula (Formula 1):
    Figure PCTCN2019087412-appb-100001
    where i denotes the phase, taking the values 0, 1 and 2, and P_i is the proportion of each RPF distributed at phase i; the value of Entropy is calculated according to Formula 1, the weight assigned to the RPF is 1 - Entropy, and, correspondingly, the weight assigned to the sequence features is Entropy.
  6. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 1, characterized in that S5 specifically comprises: calculating the probability that each base or each three-base combination on each transcript is located exactly at the P-site, based on the position information of each RPF obtained by ribosome imprint sequencing (Ribo-seq) and on the distance between the 5' end of each RPF and the P-site.
  7. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 1, characterized in that S6, extracting coding frame features from the sequence information of each coding frame and the P-site probabilities calculated in S5, specifically comprises the following steps:
    S61, Z-score: converting the P-site probabilities calculated in S5 into Z-scores;
    S62, codon usage frequency: calculating the frequency of occurrence of each codon from the codon usage of all coding frames in the genome, and then calculating the mean codon frequency for each known coding frame.
  8. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 1, characterized in that S7 specifically comprises:
    S71, extracting and searching candidate gene coding frame sequences based on the sequence information of all transcripts in S3;
    S72, extracting the features of the candidate coding frames obtained in S71 according to the method in S6, and performing multiple statistical tests to obtain multiple P values;
    S73, P-value combination: combining the multiple P values from S72 into a final P value by the weighted chi-square method;
    S74, outputting the prediction result: controlling the final P value from S73 and the false discovery rate (FDR) of the coding frame output, and outputting the candidate coding frames that meet the output standard.
  9. The method for predicting gene coding frames from low-quality ribosome imprint data according to claim 8, characterized in that S7 specifically comprises the following steps:
    S71, extracting all candidate coding frame sequences based on the sequence information of all transcripts in S3, the criteria being that a candidate has a start codon (NUG) and a stop codon (UAG, UAA, UGA) and that its length is an integer multiple of 3; candidate coding frames starting with AUG are searched first, from longest to shortest, and evaluated one by one; only after all AUG-initiated candidate coding frames have been searched and none meets the output conditions are NUG-initiated coding frames searched and evaluated;
    S72, extracting the features of these candidate coding frames according to the method in S6, and performing four statistical tests, namely:
    one-tailed test (a): the Z-scores at phase 0 are highly significantly greater than the Z-scores at phase 1;
    one-tailed test (b): the Z-scores at phase 0 are highly significantly greater than the Z-scores at phase 2;
    one-tailed test (c): the codon usage frequencies at phase 0 are highly significantly greater than the codon usage frequencies at phase 1;
    one-tailed test (d): the codon usage frequencies at phase 0 are highly significantly greater than the codon usage frequencies at phase 2;
    S73, P-value combination: combining the multiple P values from S72 into a final P value by the weighted chi-square method;
    S74, outputting the predicted gene coding frame results: candidate coding frames with a P value ≤ 0.001 are retained, the false discovery rate (FDR) of the coding frame output is controlled at ≤ 0.0001 by the Benjamini-Hochberg method, and the candidate coding frames that meet this standard are output as the final result.
  10. A system for predicting gene coding frames, comprising a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for predicting gene coding frames, and the computer program, when executed by at least one processing component, implements the steps of the method for predicting gene coding frames from low-quality ribosome imprint data according to any one of claims 1 to 9.
PCT/CN2019/087412 2019-05-15 2019-05-17 Method for predicting gene coding frame from low-quality ribosome imprint data and system WO2020228046A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910407961.7A CN110136776B (en) 2019-05-15 2019-05-15 Method and system for predicting gene coding frame from low-quality ribosome blotting data
CN201910407961.7 2019-05-15

Publications (1)

Publication Number Publication Date
WO2020228046A1 true WO2020228046A1 (en) 2020-11-19

Family

ID=67574536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087412 WO2020228046A1 (en) 2019-05-15 2019-05-17 Method for predicting gene coding frame from low-quality ribosome imprint data and system

Country Status (2)

Country Link
CN (1) CN110136776B (en)
WO (1) WO2020228046A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243665A (en) * 2020-01-07 2020-06-05 广州基迪奥生物科技有限公司 Analysis method and system for ribosome imprinting sequencing data
CN111312331B (en) * 2020-03-27 2022-05-24 武汉古奥基因科技有限公司 Genome annotation method by using second-generation and third-generation transcriptome sequencing data
CN115713973B (en) * 2022-11-21 2023-08-08 深圳市儿童医院 Method for identifying gene coding frame formed by trans-cutting of SL sequence

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506614A (en) * 2016-06-14 2017-12-22 武汉生命之美科技有限公司 A kind of bacterium ncRNA Forecasting Methodologies of transcript profile sequencing data and PeakCalling methods based on Illumina
CN108624651A (en) * 2018-05-14 2018-10-09 深圳承启生物科技有限公司 A method of structure Ribo-seq sequencing libraries

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102277431A (en) * 2011-08-04 2011-12-14 中南大学 Application method of human encephalic germ cell tumour marker gene HESRG (Human Embryonic Stem Cellrelated Gene)
EP3262190B1 (en) * 2015-02-24 2021-09-01 Ruprecht-Karls-Universität Heidelberg Biomarker panel for the detection of cancer
CN109652580B (en) * 2018-12-21 2021-08-24 华南农业大学 Ribosomal RNA sequence of pathogenic bacteria Septoria sp of early-baking Oak and application thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAMIEN PAULET, DAVID ALEXANDRE, RIVALS ERIC: "Ribo-seq enlightens codon usage bias", DNA RESEARCH, vol. 24, no. 3, 7 February 2017 (2017-02-07), pages 303 - 310, XP055753490, ISSN: 1340-2838, DOI: 10.1093/dnares/dsw062 *
LORENZO CALVIELLO, MUKHERJEE NEELANJAN, WYLER EMANUEL, ZAUBER HENRIK, HIRSEKORN ANTJE, SELBACH MATTHIAS, LANDTHALER MARKUS, OBERMA: "Detecting actively translated open reading frames in ribosome profiling data", NATURE METHODS, vol. 13, no. 2, 14 December 2015 (2015-12-14), pages 165 - 170, XP055753489, ISSN: 1548-7091, DOI: 10.1038/nmeth.3688 *
NICHOLAS T INGOLIA; GLORIA A BRAR; SILVIA ROUSKIN; ANNA M MCGEACHY; JONATHAN S WEISSMAN: "The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments.", NATURE PROTOCOLS, vol. 7, no. 8, 26 July 2012 (2012-07-26), pages 1534 - 1550, XP055586719, ISSN: 1754-2189, DOI: 10.1038/nprot.2012.086 *
PENG ZHANG, DANDAN HE, YI XU, JIAKAI HOU, BIH-FANG PAN, YUNFEI WANG, TAO LIU, CHRISTEL M. DAVIS, ERIK A. EHLI, LIN TAN, FENG ZHOU,: "Genome-wide identification and differential analysis of translational initiation", NATURE COMMUNICATIONS, vol. 8, 1749, 23 November 2017 (2017-11-23), pages 731 - 745, XP055753495, ISSN: 2041-1723, DOI: 10.1038/s41467-017-01981-8 *

Also Published As

Publication number Publication date
CN110136776A (en) 2019-08-16
CN110136776B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
Bickhart et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities
WO2020228046A1 (en) Method for predicting gene coding frame from low-quality ribosome imprint data and system
Pang et al. Utility of the trnH–psbA intergenic spacer region and its combinations as plant DNA barcodes: a meta-analysis
Tsankov et al. The role of nucleosome positioning in the evolution of gene regulation
Dao et al. AptaTRACE elucidates RNA sequence-structure motifs from selection trends in HT-SELEX experiments
Smith et al. Structural and functional annotation of long noncoding RNAs
Lee et al. Principles and methods of in-silico prioritization of non-coding regulatory variants
Febrer et al. Advances in bacterial transcriptome and transposon insertion-site profiling using second-generation sequencing
Bhattacharyya et al. MicroRNA transcription start site prediction with multi-objective feature selection
Cohen et al. A code for transcription elongation speed
WO2024066461A1 (en) Method for identifying microoganisms having oil reservoir flooding function based on metagenomics and metatranscriptomics
Han et al. Lncident: a tool for rapid identification of long noncoding RNAs utilizing sequence intrinsic composition and open reading frame information
Jiang et al. Three-nucleotide periodicity of nucleotide diversity in a population enables the identification of open reading frames
Yuan et al. RNA-CODE: a noncoding RNA classification tool for short reads in NGS data lacking reference genomes
Thompson et al. Genetic algorithm learning as a robust approach to RNA editing site prediction
Goswami et al. RNA-Seq for revealing the function of the transcriptome
Feng et al. LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information
Wang et al. miProBERT: identification of microRNA promoters based on the pre-trained model BERT
Natalya et al. Mitochondrial genomes of Amoebozoa
Sheng et al. Motif identification method based on Gibbs sampling and genetic algorithm
WO2006109535A1 (en) Dna sequence analyzer and method and program for analyzing dna sequence
CN115240775B (en) Cas protein prediction method based on stacking integrated learning strategy
Yu et al. Prediction of protein-coding small ORFs in multi-species using integrated sequence-derived features and the random forest model
CN114639442A (en) Method and system for predicting open reading frame based on single nucleotide polymorphism
Kashyap et al. Pan-tissue transcriptome analysis of long noncoding RNAs in the American beaver Castor canadensis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929035

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.06.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19929035

Country of ref document: EP

Kind code of ref document: A1