WO2020228046A1

WO2020228046A1 - Method for predicting gene coding frame from low-quality ribosome imprint data and system

Info

Publication number: WO2020228046A1
Application number: PCT/CN2019/087412
Authority: WO
Inventors: 莫蓓莘; 宋波; 杨晓玉; 高雷; 陈雪梅
Original assignee: 深圳大学
Priority date: 2019-05-15
Filing date: 2019-05-17
Publication date: 2020-11-19
Also published as: CN110136776A; CN110136776B

Abstract

Provided is a method for predicting gene coding frames from low-quality ribosome imprint data. Ribosome imprints and codon usage frequency are used together to predict protein coding frames; a multitaper algorithm and complexity (entropy) are used to describe the quality of the ribosome imprint data; and corresponding weights are automatically assigned according to the complexity of the ribosome imprint data, thereby balancing the influence of data quality. Specifically, the codon usage frequency is extracted and combined with the 3-base periodicity of the ribosome imprint data, the quality of the ribosome imprint data is measured and corresponding weights are assigned, the probability that each codon is located at the ribosome P-site is calculated, sequence features are extracted, the predicted probability value of each coding frame is comprehensively evaluated through statistical analysis, and new coding frames are then predicted. The method greatly reduces the requirement for ribosome imprint data quality and promotes wider application of ribosome imprint technology, particularly in crop research.

Description

Method and system for predicting gene coding frame from low-quality ribosome imprinting data

Technical field

The invention belongs to the field of biotechnology. It relates to a method for predicting protein coding frames using low-quality ribosome imprinting data, that is, a method for predicting gene coding frames from low-quality ribosome imprinting data, and also to a system for predicting gene coding frames.

Background technique

With the continuous development of second- and third-generation gene sequencing, genomic data has grown exponentially in recent years, which has greatly promoted research and application in the life sciences. Gene function is the basis of all life activities. Studying gene function improves our understanding of how diseases arise and how crop traits are formed, and in turn helps to prevent and treat diseases more effectively or to improve crop traits. Many existing genomic and biological studies focus mainly on the larger coding genes (length >= 300 bp) in the genome and ignore the small coding frames, on the assumption that their expression is low, their coding ability is weak, and they have no function or only very minor functions. As research on genomes has deepened, more and more evidence shows that small coding frames play an important regulatory role in gene expression and translation, and a critical role in the formation of plant traits, yeast development and animal embryo development. Research on small gene coding frames therefore has very broad prospects in medical, industrial and agricultural applications. The study of small gene coding frames is also essential for a comprehensive understanding of biological processes and mechanisms.

The accurate prediction of gene coding frames (open reading frames, ORFs) is the basic work of all genome research and related applications. At present, coding frames are predicted mainly from the characteristics of the DNA sequence, which are used to determine the start and end positions of a protein-coding gene and then to infer the protein sequence it encodes. Existing data shows that this traditional prediction method is highly accurate for long coding frames but almost powerless for small ORFs. Traditionally, small coding frames are confirmed and verified experimentally one by one, which is time-consuming and labor-intensive and is not feasible in most organisms. At present, only about 300 small coding frames have been experimentally verified in the yeast genome. In recent years, the rise of ribosome imprinting sequencing (Ribo-seq) technology has made it possible to predict small coding frames across the whole genome quickly and accurately. The basic principle is that an RNA sequence being translated is protected by the ribosome. After these protected sequences are extracted and sequenced, the translated sequences are obtained and can be used to predict the positions of small coding frames. As the application range of ribosome sequencing technology has expanded, many methods and software tools for predicting small coding frames from ribosome sequencing data have been developed. However, since these methods were mainly developed in studies of model species, they rest on the ideal assumption that the ribosome sequencing data is of high quality (showing a complete 3-base periodic distribution). This prerequisite is relatively easy to meet in model species, but not always in non-model species. Even in model species, sequencing the ribosome-protected sequences of different organelles does not always yield high-quality data that meets this condition. The requirement for high-quality ribosome imprinting data therefore greatly hinders the application of this technology in non-model species and limits its application range. Developing new methods and software that can analyze low-quality ribosome sequencing data is of great significance for advancing the application of this technology and the research of small coding frames.

Summary of the invention

The purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and provide a method for predicting gene coding frames from low-quality ribosome imprinting data. The present invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome imprinting data to measure the quality of the data and assign corresponding weights, calculates the probability that each codon is located at the ribosome P site, extracts sequence features, comprehensively evaluates the predicted probability value of each coding frame through statistical analysis, and then predicts new coding frames. This method helps to reduce the data quality requirements of ribosome data analysis and to expand its application range.

In order to achieve the above objective, the present invention provides a method for predicting gene coding frame from low-quality ribosome imprinting data, which includes the following steps:

S1: After the linker is removed, the raw ribosome imprint sequencing data is aligned to the genome reference sequence;

S2: The multitaper algorithm is used to analyze the 3-base periodicity of ribosome imprinted sequences (RPF) of different lengths, and the qualified RPFs are retained for subsequent analysis;

S3: The sequence and position information of the transcripts and the known coding frames is extracted from the genome annotation file, yielding all transcript sequences and known coding frame sequences of the whole genome;

S4: Feature training is performed on the RPFs retained in step S2, and weights are distributed accordingly;

S5: The probability that each base or three-base combination on each transcript is located exactly at the ribosome P-site is calculated;

S6: The features of the gene coding frames are extracted according to the known sequence information of each coding frame and the P-site probability calculated in step S5;

S7: An unknown gene coding frame is predicted according to the P-site probabilities calculated in S5 and the gene coding frame features obtained in S6.

It should be pointed out that the gene coding frame feature in S6 refers to the codon usage frequency of the known coding frame.

Preferably, in S2, the 3-base periodicity of each length of RPF is evaluated by the multitaper algorithm, the frequency is displayed as 3.33 Hz ~ 0.34 Hz, and the RPF with P value ≤ 0.01 is retained for subsequent analysis.

More preferably, in S2, the 3-base periodicity of each length RPF is evaluated by the multitaper algorithm, the frequency is displayed as 3.33 Hz or 0.34 Hz, and the RPF with a P value ≤ 0.01 is retained for subsequent analysis.

Preferably, S4 includes:

S41: Count the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length;

S42: Weight distribution: calculate the distribution concentration according to the frequency of each RPF at the phase 0, 1 and 2 positions obtained in S41.

More preferably, S41 is specifically: by analyzing the position information of the RPFs that contain the start or stop codon of a known coding frame, together with the position of the corresponding start or stop codon, calculate the distance between the 5' end of each RPF and the ribosome P site (P-site) and/or the ribosome A site (A-site), and count the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length.

Preferably, S42 is specifically: calculating the distribution concentration according to the frequency of each RPF at the phase 0, 1 and 2 positions obtained in S41; the distribution concentration is described by the complexity Entropy, and the formula (formula 1) is as follows:

Wherein i represents the different phases, taking the values 0, 1 and 2, and P_i is the proportion of each RPF distributed on phase i. The value of the complexity Entropy is calculated according to formula 1; the weight of the RPF is assigned as 1 - Entropy and, correspondingly, the weight of the sequence features is assigned as Entropy.

Preferably, S5 is specifically: according to the position information of each RPF obtained by Ribo-seq and the distance information between the 5' end of each RPF and the P-site, calculate the probability that each base or each three-base combination on each transcript is located exactly at the P-site.

Preferably, S6, extracting coding frame features according to the sequence information of each coding frame and the P-site probability calculated in S5, specifically includes the following steps:

S61, Z-score: Convert the probability of P-site calculated by S5 into Z-score;

S62. Frequency of codon usage: Calculate the frequency of each codon according to the codon usage of all coding frames in the genome, and then calculate the average of the frequency of codons in each known coding frame.

Preferably, S7 specifically includes:

S71, according to the sequence information of all the transcripts in S3, extract and search the candidate sequence of the gene coding frame;

S72, according to the method in S6, extract the features of the candidate encoding frame obtained in S71, and perform multiple sets of statistical tests to obtain multiple P values;

S73, P value combination: combine the multiple P values in S72 into the final P value through the weighted chi-square algorithm;

S74: Output of the prediction result: control the final P value from S73 and the false discovery rate (FDR) of the output coding frames, and output the candidate coding frames that meet the output standard.

More preferably, S7 specifically includes:

S71: Extract all candidate coding frame sequences based on the sequence information of all transcripts in S3, according to the following criteria: the sequence has a start codon (NUG) and a stop codon (UAG, UAA, UGA), and its length is an integer multiple of 3. Candidate coding frames starting with AUG are searched first, from long to short, and calculated one by one. Only after all candidate coding frames starting with AUG have been searched and none meets the output conditions are the NUG coding frames searched and calculated;

S72: According to the method in S6, extract the features of these candidate coding frames and perform four sets of statistical tests, namely: one-tailed test (a): the Z-score at phase 0 is extremely significantly greater than the Z-score at phase 1; one-tailed test (b): the Z-score at phase 0 is extremely significantly greater than the Z-score at phase 2; one-tailed test (c): the usage frequency of the codons at phase 0 is extremely significantly greater than that of the codons at phase 1; one-tailed test (d): the usage frequency of the codons at phase 0 is extremely significantly greater than that of the codons at phase 2;

S73: P value combination: combine the multiple P values in S72 into the final P value through the weighted chi-square algorithm;

S74: Output of the predicted gene coding frames: output the coding frames with a combined P value ≤ 0.001, and control the false discovery rate of the output coding frames at FDR ≤ 0.0001 according to the Benjamini and Hochberg method; the candidate coding frames that meet this standard are output as the final result.

Preferably, in S7, the predicted unknown gene coding frames include small coding frames and/or normal gene coding frames.

In order to achieve another objective of the present invention, the present invention also provides a system for predicting gene coding frames, including a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for predicting gene coding frames, and when this computer program is executed by at least one processing component, the steps of the method for predicting gene coding frames from low-quality ribosome imprinting data are realized.

The beneficial effects of the present invention are:

1. The present invention introduces codon usage frequency and combines it with the 3-base periodicity of ribosome imprinting data, measures the quality of the data and assigns corresponding weights, calculates the probability that each codon is located at the ribosome P site, extracts sequence features, comprehensively evaluates the predicted probability value of each coding frame through statistical analysis, and then predicts new coding frames. The method improves the tolerance to noisy data, reduces the data quality requirements of ribosome data analysis and expands its application range. The prediction method of the present invention is suitable for the following situations: in model organisms, when it is difficult to obtain high-quality ribosome imprinting data for certain organelles; and in non-model organisms, when it is difficult to obtain high-quality ribosome imprinting data. The present invention greatly increases the range of gene coding frames that can be predicted, which is of great significance for advancing the research of small coding frames.

2. To make the prediction method convenient to apply, the method steps of the present invention are presented to the user in the form of a computer program. The user supplies the necessary information, such as ribosome imprint data, as input, and the computer program outputs the predicted gene coding frames, which improves the user's processing efficiency. When the method is extended to various species, the computer program implementation improves the efficiency of coding frame prediction, so that the prediction method of the present invention can be adopted more quickly.

Description of the drawings

Figure 1 is a schematic diagram of the technical route of the present invention, that is, the working flowchart of the present invention;

Fig. 2 is a schematic diagram of a search strategy for candidate coding frames of the present invention;

Fig. 3 is an application example of the present invention, in which: Fig. 3(A) is the distribution of the RPF length of the example data; Fig. 3(B) is the three-base periodicity evaluation result; Fig. 3(C) is the calculation of the RPF distribution concentration and the weight distribution; Fig. 3(D) is the result of the prediction effect evaluation; Fig. 3(E) is the prediction result of the small coding frames; Fig. 3(F) is the supporting evidence from protein mass spectrometry data; Fig. 3(G) is the evolution analysis of the predicted ncsORFs, shown as a heat map in which the color depth of each square indicates the magnitude of the value;

Further enlargement of Fig. 3 forms the following drawings to show the details of each view in Fig. 3 more clearly:

Figure 4 is an enlarged view of view A in Figure 3;

Figure 5 is an enlarged view of view B in Figure 3;

Figure 6 is an enlarged view of view C in Figure 3;

Figure 7 is an enlarged view of view D in Figure 3;

Figure 8 is an enlarged view of view E in Figure 3;

Figure 9 is an enlarged view of view F in Figure 3;

Figure 10 is an enlarged view of view G in Figure 3;

Figure 11 is a schematic diagram of the method for predicting gene coding frames from low-quality ribosomal imprinting data of the present invention.

The realization of the objectives, functional characteristics and advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

Example 1

The present invention discloses a method for predicting gene coding frames from low-quality ribosome imprinting data. The method measures the quality of the ribosome imprinting data, uses this measurement to perform preliminary filtering of the data and to assign corresponding weights, and then integrates the codon usage frequency to assist the prediction of protein coding frames. The method is insensitive to the quality of the ribosome imprinting data and has strong fault tolerance. It also performs well on high-quality ribosome imprinting data and can comprehensively and accurately predict translated coding frames. The method is therefore applicable to all ribosome imprinting data. The main points of the present invention are as follows:

1. Comprehensively utilize ribosome imprinting and codon usage frequency to predict protein coding frame.

2. Use multitaper algorithm and complexity (entropy) to describe the quality of ribosome imprinting data.

3. Automatically assign corresponding weights according to the complexity (entropy) of ribosome imprinting data, thereby balancing the influence of data quality.

As mentioned above, the present invention addresses the excessively high data quality requirements of current analysis methods for ribosome imprinting sequencing data, and proposes a new method of predicting gene coding frames that improves the tolerance to noisy data and effectively reduces the requirements for data quality. It should be noted that the present invention is only applicable to species with a reference genome sequence and annotation information.

Referring to Figure 1 and Figure 4, the method of the present invention mainly includes the following steps:

(1) Genome alignment

After the linker is removed, the raw ribosome imprinting sequencing data is aligned to the genome reference sequence. Genome reference sequences can be obtained from public sources.

The purpose of the genome alignment in step (1) is to obtain the position of each ribosome imprinted sequence on the genome. The genome reference sequence is the known genome sequence, and the ribosome imprinting data is aligned to it to obtain this position information. If the alignment result is wrong, all subsequent predictions are wrong. This is also one of the reasons why the prediction method of the present invention requires a reference genome sequence.

(2) Quality assessment of ribosome imprinting data

By analyzing the 3-base periodicity of RPFs of different lengths in the ribosome imprinting data, data with no periodicity at all is filtered out. The specific method is: the 3-base periodicity of RPFs of each length is evaluated by the multitaper algorithm, the frequency is displayed as 3.33Hz～0.34Hz, and the RPFs with P value ≤0.01 are retained for subsequent analysis.

Step (2) thus includes a data filtering operation: completely unusable data is filtered out, and data that passes the evaluation is retained. The multitaper algorithm performs the data quality evaluation, whose purpose is to provide a clear standard for data filtering.
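The idea of checking for a spectral peak at a period of 3 bases can be illustrated with a much simpler stand-in than the multitaper algorithm. The sketch below computes a plain discrete Fourier periodogram of a per-base P-site signal and reports the frequency with the highest power. It uses no tapers and produces no P value, so it is not the multitaper method itself; the function names and the toy signal are hypothetical.

```python
import cmath
import math


def periodogram(signal):
    """Return (frequency, power) pairs of a plain DFT periodogram.

    NOTE: simplified stand-in for the multitaper algorithm (assumption);
    it applies no tapers and yields no P value.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the zero-frequency offset
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum


def peak_frequency(signal):
    """Frequency (cycles per base) with the highest periodogram power."""
    return max(periodogram(signal), key=lambda fp: fp[1])[0]


# Toy P-site coverage: most signal falls on the first base of every codon.
periodic = [10, 1, 1] * 30
print(peak_frequency(periodic))
```

For a signal with clean 3-base periodicity the peak falls at one cycle per 3 bases; a real analysis would replace this with a multitaper estimate and its significance test.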

(3) Assembly of transcripts and known coding frames

Through the genome annotation file information, extract the sequence and position information of the transcript and the known coding frame, and obtain all the transcripts and the known coding frame sequence of the whole genome.

The purpose or meaning of the above step (3) is that the coding frame is predicted based on the sequence of the transcript. The sequence information of the known coding frame is used to train the frequency of codon usage, and its position information is used to train the distance information between the 5'end of the RPF and the corresponding P-site.
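As an illustration of extracting the positions of known coding frames from an annotation file, the following sketch collects CDS intervals per transcript. The text does not name the annotation format; GTF (nine tab-separated columns, with `transcript_id` in the attribute column) is assumed here, and `parse_gtf_cds` is a hypothetical helper name.

```python
def parse_gtf_cds(lines):
    """Collect CDS intervals per transcript from GTF-formatted lines.

    Assumption: the annotation is in GTF format. Coordinates are returned
    as given in the file (1-based, inclusive).
    """
    cds = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # keep only well-formed CDS records
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        attrs = {}
        for item in fields[8].strip().split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip('"')
        tid = attrs.get("transcript_id")
        if tid is not None:
            cds.setdefault(tid, []).append((chrom, start, end, strand))
    return cds
```

The resulting intervals give the start and stop codon positions that the later training steps rely on; transcript sequences would be cut from the reference genome using the corresponding exon records.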

(4) Ribosome imprinting data (RPF) feature training and weight distribution

① Feature training: Extract the alignment information of the RPFs that align to the start or stop codon of a known coding frame, calculate the distance between the 5' end of each RPF and the ribosome P site (P-site) and/or the ribosome A site (A-site), and count the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length.

Preferably, in step (4)①, the distance between the 5' end of each RPF and the ribosome P-site is calculated. This is sufficient because the A-site and the P-site are separated by exactly 3 bases, so one position determines the other.

The purpose of feature training in step (4) is to obtain the distance from the 5' end of each RPF to its corresponding P-site.

This distance information is then used to determine the P-site location corresponding to each RPF. Note that the P-site is not known in advance for every RPF: only RPFs containing a known start or stop codon provide this information. The distance information is therefore trained on this subset of RPFs and then applied to the other RPFs.
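The feature training can be sketched as a counting step. The example below assumes transcript coordinates (0-based), uses only RPFs that cover an annotated start codon, and takes the start codon as the P-site position of an initiating ribosome; stop codons, which occupy the A-site, are omitted for brevity. The function name `train_offsets` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def train_offsets(reads, start_codon_pos):
    """Estimate, per RPF length, the distribution of 5'-end-to-P-site distances.

    reads: iterable of (transcript_id, five_prime_pos, length), 0-based.
    start_codon_pos: dict transcript_id -> position of the first base of the
    annotated start codon (assumed to sit in the P-site at initiation).
    """
    counts = defaultdict(Counter)
    for tid, five_prime, length in reads:
        start = start_codon_pos.get(tid)
        if start is None:
            continue
        # Keep only RPFs that fully cover the start codon.
        if five_prime <= start and start + 3 <= five_prime + length:
            counts[length][start - five_prime] += 1
    freqs = {}
    for length, counter in counts.items():
        total = sum(counter.values())
        freqs[length] = {dist: n / total for dist, n in counter.items()}
    return freqs
```

The returned dictionary maps each RPF length to the observed distances and their relative frequencies, which can then be applied to RPFs that do not overlap a known start codon.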

② Weight distribution: Calculate the distribution concentration of each RPF based on the frequency with which it appears at phases 0, 1 and 2. The distribution concentration here refers to the concentration of the phase distribution. It is described by the complexity (entropy), and the formula is as follows:

Where i denotes the different phases (0, 1 and 2) and P_i is the proportion of the RPF distributed on phase i. According to the calculated complexity (Entropy) value, the weight (1 - Entropy) is assigned to the RPF and, correspondingly, the weight Entropy is assigned to the sequence features.

In step (4), the weight assigned to the RPF is a coefficient that determines the contribution of this evidence in the subsequent prediction. Specifically, the higher the RPF quality, the higher the weight and the greater the contribution to subsequent predictions; conversely, the lower the RPF quality (the higher the noise), the smaller its contribution, and the prediction relies more on the other evidence, thereby reducing the adverse effect of RPF noise on the prediction results. "Sequence feature" refers to a feature of the sequence itself, here specifically the codon usage frequency; relative to it, the RPF is a non-sequence feature.
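The entropy formula itself is not reproduced in this text. The sketch below assumes a Shannon entropy over the three phases with logarithm base 3, which keeps the value between 0 (all RPFs on one phase) and 1 (RPFs spread evenly), so that 1 - Entropy is a usable weight. The base and the function names are assumptions, not taken from the source.

```python
import math


def phase_entropy(phase_counts):
    """Shannon entropy of the RPF distribution over phases 0, 1 and 2.

    Assumption: logarithm base 3, so the result lies in [0, 1].
    """
    total = sum(phase_counts)
    entropy = 0.0
    for count in phase_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, 3)
    return entropy


def assign_weights(phase_counts):
    """Weight of the RPF evidence (1 - Entropy) and of the sequence features."""
    entropy = phase_entropy(phase_counts)
    return {"rpf": 1.0 - entropy, "sequence": entropy}


print(assign_weights([700, 200, 100]))
```

With perfectly phased data the RPF evidence receives the full weight; with evenly spread data the prediction falls back almost entirely on codon usage frequency.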

(5) Calculate P-site probability

The position information of each RPF is obtained from ribosome imprint sequencing (Ribo-seq), together with the distance information between its 5' end and the P-site. It should be pointed out that the distance between the 5' end and the P-site is not a single fixed value but a series of values; 3 values are used here, each corresponding to a probability. The calculation method is described in the feature training part of step (4): by extracting the alignment information of RPFs aligned to the start or stop codon of a known coding frame, the distance between the 5' end of each RPF and the ribosome P site (P-site) or the ribosome A site (A-site) is calculated. From this, the probability that each base or three-base combination on each transcript is located exactly at the P-site is calculated and converted to a Z-score, that is, the data is standardized. If the per-base scheme is used, each base receives a probability value, which represents the probability that the three-base combination starting from this base is located at the P-site.

What needs to be pointed out is:

a) The position information in step (5) refers to the position of the 5'end of the RPF, which is obtained by comparison with the genome.

b) The three-base combination in step (5) is further defined as: the combination of three consecutively arranged bases.

c) If the scheme of calculating the probability for each three-base combination on each transcript is used, it should be understood as follows: if three consecutive bases correspond to a codon under the genetic code rules applicable to the species being examined, the probability that this codon is located at the P-site is calculated. In this way, the P-site probability is calculated for all possible codons in the current transcript and, further, for all transcripts.
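A minimal sketch of the per-base scheme follows. Each RPF contributes to every candidate P-site position in proportion to the trained distance probabilities for its length, and the resulting per-base signal is standardized to Z-scores. How the source normalizes the accumulated signal into a probability is not stated, so the unnormalized sum used here is an assumption, as are the function names.

```python
import statistics


def psite_signal(transcript_length, reads, offset_probs):
    """Accumulate probability-weighted P-site evidence for every base.

    reads: iterable of (five_prime_pos, rpf_length), 0-based.
    offset_probs: dict rpf_length -> {distance: probability}.
    """
    signal = [0.0] * transcript_length
    for five_prime, length in reads:
        for dist, prob in offset_probs.get(length, {}).items():
            pos = five_prime + dist
            if 0 <= pos < transcript_length:
                signal[pos] += prob
    return signal


def to_zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]
```

For example, with distances 11, 12 and 13 carrying probabilities 0.25, 0.5 and 0.25 for 28-nt RPFs, a read whose 5' end is at position 0 adds 0.5 to base 12 and 0.25 to each neighbor.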

(6) Feature extraction of known coding frame

According to the sequence information of each encoding frame and the P-site probability calculated in the previous step, the features of the encoding frame are extracted as follows:

①Z-score: Calculate the probability that each codon is exactly in the P-site, and convert it into Z-score.

②Codon usage frequency: According to the codon usage of all coding frames in the genome, calculate the frequency of each codon, and then calculate the average value of the codon frequency in each known coding frame.

It should be pointed out that step (4) trains the features of the RPF, which contains the actually measured coding frame information, whereas step (6) trains the sequence features of the known coding frames. The feature training result of step (4) and the feature extraction result of step (6) are used together to predict unknown coding frames.
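The codon usage feature can be sketched as follows: genome-wide codon frequencies are counted over all known coding frames, and a sequence is then scored by the average frequency of the codons read in a given phase. The function names and the use of RNA letters are assumptions made for illustration.

```python
from collections import Counter


def codon_frequencies(coding_sequences):
    """Genome-wide relative frequency of each codon over known coding frames."""
    counts = Counter()
    for seq in coding_sequences:
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def mean_codon_frequency(seq, freqs, phase=0):
    """Average genome-wide frequency of the codons read from `seq` in `phase`."""
    codons = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(freqs.get(c, 0.0) for c in codons) / len(codons)


known = ["AUGGCUUAA", "AUGGCUGCUUGA"]
freqs = codon_frequencies(known)
print(mean_codon_frequency("AUGGCUUAA", freqs, phase=0))
print(mean_codon_frequency("AUGGCUUAA", freqs, phase=1))
```

Reading a true coding frame in phase 0 tends to yield frequently used codons, while phases 1 and 2 yield rarer triplets; this contrast is what the later statistical tests compare.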

(7) Prediction of coding frame

① Extraction and search of candidate coding frame sequences (please refer to Figure 2): According to the sequence information of all transcripts in (3), extract all candidate coding frame sequences, according to the standard: having a start codon (NUG) and a stop codon (UAG, UAA, UGA) and its length is a multiple of 3. The candidate coding frame starting with AUG is searched first, from long to short, calculating one by one. After all the candidate coding frames starting with AUG are searched completely and the output conditions are not met, the search and calculation of the NUG coding frame are performed.
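The search order described above can be sketched as a generator that lists every in-frame NUG-to-stop candidate on a transcript, yielding AUG-initiated candidates first and the remaining NUG candidates afterwards, each group from longest to shortest. The caller is expected to stop consuming candidates once the output conditions are met. The function names and the tuple layout are hypothetical.

```python
STOPS = {"UAG", "UAA", "UGA"}


def _neg_length(orf):
    """Sort key: negative length, so longer candidates come first."""
    return orf[0] - orf[1]


def candidate_orfs(transcript):
    """Yield (start, end, start_codon) candidates on one transcript.

    `end` is exclusive and includes the stop codon, so (end - start) is a
    multiple of 3. AUG starts are yielded before other NUG starts.
    """
    seq = transcript.upper().replace("T", "U")
    aug, other_nug = [], []
    for frame in range(3):
        starts = []  # in-frame NUG positions since the previous stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                for s in starts:
                    target = aug if seq[s:s + 3] == "AUG" else other_nug
                    target.append((s, i + 3, seq[s:s + 3]))
                starts = []
            elif codon[1:] == "UG":
                starts.append(i)
    yield from sorted(aug, key=_neg_length)
    yield from sorted(other_nug, key=_neg_length)


print(list(candidate_orfs("CCAUGAAACUGGGGUAAGG")))
```

On the toy transcript, the AUG-initiated candidate is listed before the shorter CUG-initiated candidate that shares its stop codon.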

② Statistical test: According to the method in (6), extract the features of these candidate coding frames and perform four sets of one-tailed statistical tests: (a) the Z-score at phase 0 is extremely significantly greater than the Z-score at phase 1; (b) the Z-score at phase 0 is extremely significantly greater than the Z-score at phase 2; (c) the usage frequency of the codons at phase 0 is extremely significantly greater than that of the codons at phase 1; (d) the usage frequency of the codons at phase 0 is extremely significantly greater than that of the codons at phase 2.
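The text does not name the one-tailed test that is used. As an illustration only, the sketch below splits per-base values of a candidate frame by phase and compares two phases with an exact one-tailed paired sign test; the choice of the sign test and the function names are assumptions, and another one-tailed test could be substituted.

```python
from math import comb


def split_by_phase(values):
    """Group per-base values of a candidate frame by codon phase 0, 1, 2."""
    usable = len(values) - len(values) % 3
    return [values[p:usable:3] for p in range(3)]


def sign_test_greater(x, y):
    """One-tailed paired sign test of H1: x tends to exceed y (ties dropped).

    Assumption: stand-in for the unspecified one-tailed test in the text.
    """
    wins = sum(a > b for a, b in zip(x, y))
    losses = sum(a < b for a, b in zip(x, y))
    n = wins + losses
    if n == 0:
        return 1.0
    # Exact upper binomial tail P(X >= wins) for X ~ Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


phase0, phase1, phase2 = split_by_phase([5, 1, 0] * 10)
print(sign_test_greater(phase0, phase1), sign_test_greater(phase0, phase2))
```

Tests (a) and (b) would be run on the P-site Z-scores and tests (c) and (d) on the codon usage frequencies, giving the four P values that are combined in the next step.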

③ P value combination: the 4 P values obtained from the above statistical tests (a P value is the parameter used to judge the result of a hypothesis test) are combined into the final P value by the weighted chi-square method. The calculation method is as follows.

First, according to the weights assigned in step (4), the P values are converted into a chi-square value; the formula is as follows:

M = Σ_i w_i × (-2 × ln(P_i))

Where M represents the combined chi-square value, i indexes the tests, P_i is the P value of the i-th test, and w_i is the weight of the i-th P value. Because the sum of the w_i must be 1, and the RPF and the codon usage frequency are each tested twice, the weight of each P value is half of the RPF or codon usage frequency weight calculated in step (4).

Next, the degrees of freedom (k) are calculated:

k = 2{E(M)}² / var(M)

where E(M) and var(M) are the expected value and the variance of M; s_i is the chi-square value separately transformed from P_i, s_i = -2 × w_i × ln(P_i); w_i and w_j are the weights, as in the formula above; and ρ_ij is the correlation between the i-th test and the j-th test. ρ can be estimated indirectly from the calculated P values through the quantity q_t, which is computed from the s_i and their average value. Since the expected value of q_t is E(q_t) = 4 - (0.75ρ² + 3.25ρ), we obtain

0.75ρ² + 3.25ρ + E(q_t) - 4 = 0

Finally, the approximate value of ρ can be solved as ρ ≈ -2.167 + (10.028 - 4q_t/3)^0.5.

According to the calculated degrees of freedom k and the combined chi-square value M, the corresponding P value is obtained from the chi-square distribution 2χ²_k/k.

④ False discovery rate (FDR) control of the output coding frames

Output the coding frames with a combined P value ≤ 0.001, with the FDR controlled at ≤ 0.0001 according to the Benjamini and Hochberg method. The candidate coding frames that meet this standard are output as the final result.
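The output filter can be sketched with the standard Benjamini and Hochberg step-up adjustment. The thresholds are those stated above; the function names and the choice to return the indices of the selected candidates are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_frames(p_values, p_cutoff=0.001, fdr_cutoff=0.0001):
    """Indices of candidates passing both the P value and the FDR threshold."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, (p, q) in enumerate(zip(p_values, adjusted))
            if p <= p_cutoff and q <= fdr_cutoff]


print(select_frames([1e-6, 0.5, 2e-5, 0.0005]))
```

A candidate can pass the raw P value cutoff and still be rejected by the stricter FDR threshold, as the last candidate in the toy example shows.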

Example 1 mainly relates to a method for predicting protein coding frames using low-quality ribosome imprinting data. The accurate prediction of protein coding frames (including small coding frames) is the basis of all gene-related research and applications. The rise of ribosome imprinting sequencing technology makes it possible to predict protein coding frames more accurately, and in particular to predict small coding frames. Although many software tools and pipelines can predict protein coding frames from ribosome imprinting data, their use rests on an ideal condition, namely that the ribosome imprinting data is of high quality (showing a complete 3-base periodic distribution). Satisfying this condition requires demanding experimental technique and expensive reagents and equipment, which greatly restricts the application and expansion of this technology. In addition, high-quality ribosome imprinting data usually consists of short sequences (28 nt) that align to multiple sites on the genome, which introduces a large number of errors and hampers subsequent research. In general, the existing pipelines and tools are powerless for low-quality ribosome imprinting data. To solve the problems that low-quality ribosome imprinting data cannot be used and that high-quality ribosome imprinting data easily introduces errors, the present invention extracts the codon usage frequency and combines it with the 3-base periodicity of the ribosome imprinting data to measure the quality of the data and assign corresponding weights, calculates the probability that each codon is located at the ribosome P site, extracts sequence features, comprehensively evaluates the predicted probability value of each coding frame through statistical analysis, and then predicts new coding frames. The present invention greatly reduces the requirements of related work on the quality of ribosome imprinting data, and will greatly promote the wider application of ribosome imprinting technology, especially in crop research.

For the discussion in the previous paragraph, it is necessary to further point out:

a) The weight allocated depends on the quality of the data: the higher the quality of the ribosome imprinting data, the higher the weight assigned to them.

b) The prediction method of the present invention is not limited to "application in crop research". It can be used for animals, plants, and microorganisms, and performs well in all of them. Relatively speaking, data from animals, microorganisms and humans are usually of relatively high quality and can be processed well by existing methods. Low-quality ribosome imprinting data usually occur in plant species, especially non-model species. In other words, the gene coding frame prediction method of the present invention can also process low-quality ribosome imprinting data that existing prediction methods cannot process.
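
Point a) can be illustrated with a short sketch. It assumes that the complexity (entropy) is computed over the three phases with a base-3 logarithm, which is the normalization consistent with the statement in Example 2 that the entropy is 0 when all RPFs fall on a single phase and 1 when they are spread evenly over three. The function names and the phase counts used in the usage note are hypothetical.

```python
import math


def phase_entropy(phase_counts):
    """Normalized entropy of RPF counts over phases 0, 1 and 2.

    Assumption: base-3 logarithm, so the result lies between 0 and 1.
    """
    total = sum(phase_counts)
    entropy = 0.0
    for count in phase_counts:
        if count > 0:  # a phase with no reads contributes nothing
            p = count / total
            entropy -= p * math.log(p, 3)
    return entropy


def allocate_weights(phase_counts):
    """Weight of RPF evidence is 1 - entropy; sequence features get entropy."""
    entropy = phase_entropy(phase_counts)
    return {"rpf": 1.0 - entropy, "sequence": entropy}
```

For example, `allocate_weights([100, 0, 0])` gives the RPF evidence the full weight, while `allocate_weights([50, 50, 50])` shifts essentially all of the weight to the sequence features.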

Example 2: Analysis of Arabidopsis membrane-bound ribosomal data

(1) The experimental data were downloaded from NCBI (GEO accession: GSE82041). The data were published in eLife in 2016 by Li Shengben in the article "Biogenesis of phased siRNA on membrane-bound polysomes in Arabidopsis". In that experiment, MBP (membrane-bound polysome) Ribo-seq data were obtained by isolating the ribosomes bound to the membrane and sequencing the protected mRNA fragments. During the preparation of MBP-protected fragments, the degradation of naked RNA is usually incomplete, resulting in low-quality ribosome imprinting (Ribo-seq) data that do not exhibit good 3-base periodicity.

(2) Referring to Figures 3 to 10, the method of the present invention first evaluates the quality of the data. The results show that the RPF length distribution in the data is not concentrated (Figure 3(A) and Figure 4). In theory, the ribosome imprint length in eukaryotes is 28 nucleotides (nt), so RPF lengths should be concentrated at 28 nt. Figure 3(A) shows that in this data set the RPF lengths range widely, from 18 nt to 35 nt. Although there is a peak at 32 nt, it accounts for only about 10% of the total, and it also deviates substantially from the theoretical value (28 nt). This indicates that the degradation of naked mRNA was incomplete when the data were generated, leaving ribosome-protected fragments (RPFs) of different lengths, which leads to insufficient RPF resolution and accuracy. This is also reflected in the weak 3-base periodicity of the data (Figure 3(B) and Figure 5). In theory, because a codon is 3 bases long, the distance between ribosome imprints should be a multiple of 3, with a minimum distance of 3 bases, so their distribution along the transcript sequence shows a 3-base periodicity. In the multitaper test result for ideal data, the frequency peak is at 1/3 and the P value is extremely significant; the better the periodicity, the smaller the P value. Under ideal conditions, -log10(P value) is usually ≥ 10. Figure 3(B) shows that for most RPF lengths the frequency peak does not appear at 1/3 and the P value is large. The dark line in the figure represents the RPFs with a length of 32 nt (the most abundant); their -log10(P value) is about 3, which passes the multitaper test (cutoff = 2). The distribution of RPFs over the corresponding P-sites is also not concentrated (Figure 3(C) and Figure 6).
This figure shows how concentrated the P-site distribution is for RPFs with a length of 32 nt; the calculated entropy value is 0.862. Ideally, if the RPFs correspond to a unique P-site, the entropy is 0, and if the RPFs correspond to 3 P-sites with an even distribution, the entropy is 1. Figure 3(C) shows that the entropy value of this data set is 0.862, which is close to 1 and far from 0, meaning that the distribution is not sufficiently concentrated. Based on this, we assign corresponding weights to RPF and codon usage frequency (RPF: 0.138, codon frequency: 0.862). Because the RPFs in this data set are not sufficiently concentrated on phases 0, 1, and 2, we rely more on codon frequency (weight 0.862) to predict coding frames. Using the method of the present invention, 76% of the known coding frames were successfully predicted with an accuracy rate as high as 98%, giving a comprehensive score of 86% [comprehensive score = 2 × recall rate × accuracy rate / (recall rate + accuracy rate)] (Figure 3(D) and Figure 7). The method also predicted 1471 small coding frames, comprising 114 uORFs, 93 ouORFs, 245 dORFs, 232 odORFs, 653 teORFs, 121 pORFs, and 13 ncsORFs (Figure 3(E) and Figure 8). Analysis of published protein mass spectrometry data shows that these predicted coding frames are well supported (Figure 3(F) and Figure 9). In Figure 3(F) and Figure 9, the horizontal dashed line indicates the proportion of all known coding frames in the genome that are supported by the protein mass spectrometry data, which we use as a reference for comparison. The figure shows that the mass spectrometry support rate of the annotated ORFs predicted from the data by this method is significantly higher than the overall level (indicated by the dashed line). The other types (uORF, ouORF, dORF, odORF, teORF, pORF and ncsORF) are small coding frames. Because they are short, they generate fewer peptides and are harder to detect, so their support rates are relatively low. This is especially true of the ncsORFs, which, because of their small number, were not detected in the mass spectrometry data. These are normal phenomena. To further verify the accuracy of the ncsORF predictions, we performed an evolutionary analysis of the predicted ncsORF sequences and confirmed the predictions through their sequence conservation. Figure 3(G) and Figure 10 show that most of the predicted ncsORFs are strongly conserved.
Specifically, 5 ncsORFs first appear in mosses, and their sequences are highly conserved across all plant lineages. Another 4 ncsORFs first appear in cruciferous plants and are highly conserved within that lineage. Based on this, we can infer that these ncsORFs have important biological functions and that the prediction results are correct.
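
The comprehensive score reported in this example is the harmonic mean of the recall rate and the accuracy rate. The arithmetic, together with the total of the small coding frame categories, can be checked with a few lines of Python. The function and variable names are hypothetical, and the inputs are the rounded figures quoted in this example.

```python
def comprehensive_score(recall, accuracy):
    """Harmonic mean of the recall rate and the accuracy rate."""
    return 2 * recall * accuracy / (recall + accuracy)


# Rounded values quoted in Example 2.
score = comprehensive_score(recall=0.76, accuracy=0.98)
print(f"{score:.2f}")  # prints 0.86

# Small coding frame categories quoted in Example 2.
small_orf_counts = {
    "uORF": 114, "ouORF": 93, "dORF": 245, "odORF": 232,
    "teORF": 653, "pORF": 121, "ncsORF": 13,
}
print(sum(small_orf_counts.values()))  # prints 1471
```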

Example 2 is a specific example of Example 1.

Example 3

The present invention also discloses a system for predicting a gene coding frame, comprising a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for predicting a gene coding frame, and the computer program, when executed by at least one processing component, implements the steps of the method for predicting a gene coding frame from low-quality ribosome imprinting data.

Example 3 mainly solves the following problem: existing systems for predicting gene coding frames can only process high-quality ribosome imprinting data and cannot handle low-quality ribosome imprinting data.

The storage medium may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions; a RAM or another type of dynamic storage device that can store information and instructions; an EEPROM; a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.); disk storage media (including mechanical hard drives, solid-state drives, hybrid hard drives, etc.) or other magnetic storage devices (including tape); or any other medium (including SD cards, etc.) that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The storage medium may be located locally or in the cloud.

The processing component is a processor, and the processor may be a CPU, a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The processor may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.

Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those of ordinary skill in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from its principle and purpose.

Claims

1. A method for predicting a gene coding frame from low-quality ribosome imprinting data, characterized in that it comprises the following steps:

S1: After adapter removal, align the raw ribosome imprinting sequencing data to the genome reference sequence;

S2: Use the multitaper algorithm to analyze the 3-base periodicity of ribosome imprinting sequences (RPFs) of different lengths, and retain the qualified RPFs for subsequent analysis;

S3: Extract the sequence and position information of the transcripts and the known coding frames from the genome annotation file, thereby obtaining the sequences of all transcripts and all known coding frames of the whole genome;

S4: Perform feature training on the RPFs retained in step S2, and allocate weights accordingly;

S5: Calculate the probability that each base or three-base combination on each transcript is located exactly at the P-site of the ribosome;

S6: Extract the features of gene coding frames according to the known sequence information of each coding frame and the P-site probability calculated in step S5;

S7: Predict unknown gene coding frames according to the probability, calculated in S5, that each base or three-base combination is located exactly at the P-site of the ribosome, and the features of gene coding frames obtained in S6.

2. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 1, wherein in S2 the 3-base periodicity of the RPFs of each length is evaluated by the multitaper algorithm, and RPFs whose frequency is 0.33 Hz to 0.34 Hz with a P value ≤ 0.01 are retained for subsequent analysis.
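
As a rough illustration of the 3-base periodicity evaluated in this claim, the sketch below locates the dominant frequency of a read-count profile with a plain discrete Fourier transform. It is a simplification and not the multitaper algorithm, and it yields no P value. The function name and the toy profile are hypothetical.

```python
import cmath


def dominant_frequency(profile):
    """Return the frequency (cycles per base) with the highest DFT power."""
    n = len(profile)
    mean = sum(profile) / n
    # Remove the mean so that the zero-frequency term does not dominate.
    centered = [x - mean for x in profile]
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = k / n, power
    return best_freq


# Toy profile: most reads fall on the first base of each codon.
toy_profile = [10, 1, 1] * 20
peak = dominant_frequency(toy_profile)
```

For this toy profile the peak lies at one cycle per three bases. A profile without codon periodicity would generally peak elsewhere.
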

3. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 1, wherein S4 comprises:

S41: Count the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length;

S42: Weight allocation: calculate the distribution concentration according to the frequency, obtained in S41, with which the RPFs fall on phases 0, 1 and 2.

4. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 3, wherein S41 specifically comprises: analyzing the RPFs that contain the start codon or stop codon of a known coding frame, together with the position information of the corresponding start codon or stop codon; calculating the distance between the 5' end of each RPF and the ribosomal P-site and/or the ribosomal A-site; and counting the frequency of occurrence of the different distances between the 5' end and the P-site for RPFs of each length.

5. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 3, wherein S42 specifically comprises: calculating the distribution concentration according to the frequency, obtained in S41, with which the RPFs fall on phases 0, 1, and 2; the distribution concentration is described by the complexity Entropy, and the formula (formula 1) is as follows:

$$\mathrm{Entropy} = -\sum_{i=0}^{2} P_i \log_3 P_i$$

where i denotes the phase and takes the values 0, 1, and 2, and P_i is the proportion of RPFs distributed on phase i; the value of the complexity Entropy is calculated according to formula 1, the weight of the RPF is assigned as 1 - Entropy, and correspondingly the weight of the sequence features is assigned as Entropy.

6. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 1, wherein S5 specifically comprises: calculating, from the position information of each RPF obtained by ribosome imprinting sequencing (Ribo-seq) and the distance between the 5' end of each RPF and the P-site, the probability that each base or three-base combination on each transcript is located exactly at the P-site.

7. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 1, wherein in S6 the coding frame features are extracted based on the sequence information of each coding frame and the P-site probability calculated in S5, specifically including the following steps:

S61: Z-score: convert the P-site probability calculated in S5 into a Z-score;

S62: Codon usage frequency: calculate the frequency of each codon according to the codon usage of all coding frames in the genome, and then calculate the average codon frequency within each known coding frame.
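
The two features named in S61 and S62 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the Z-score is taken over the P-site probabilities of one transcript using the population standard deviation, codons are counted in frame from the first base of each coding frame, and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def to_z_scores(probabilities):
    """Convert P-site probabilities into Z-scores (assumed normalization)."""
    mu = mean(probabilities)
    sigma = pstdev(probabilities)
    if sigma == 0:
        return [0.0 for _ in probabilities]
    return [(p - mu) / sigma for p in probabilities]


def codon_frequencies(coding_frames):
    """Relative frequency of each codon over all given coding frames."""
    counts = Counter()
    for seq in coding_frames:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def mean_codon_frequency(seq, frequencies):
    """Average usage frequency of the in-frame codons of one sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return mean(frequencies.get(codon, 0.0) for codon in codons)
```

For instance, with the two toy coding frames `"AUGGCCUAA"` and `"AUGGCC"`, the codons AUG and GCC each have a frequency of 0.4 and UAA has 0.2, so the sequence `"AUGUAA"` has an average codon frequency of about 0.3.
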

8. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 1, wherein S7 specifically includes:

S71: According to the sequence information of all the transcripts in S3, search for and extract candidate gene coding frame sequences;

S72: According to the method in S6, extract the features of the candidate coding frames obtained in S71, and perform multiple sets of statistical tests to obtain multiple P values;

S73: P value combination: combine the multiple P values from S72 into a final P value through the weighted chi-square algorithm;

S74: Output of the prediction result: control the final P value obtained in S73 and the false discovery rate (FDR) of the output coding frames, and output the candidate coding frames that meet the output standard.

9. The method for predicting a gene coding frame from low-quality ribosome imprinting data according to claim 8, wherein S7 specifically includes the following steps:

S71: Extract all candidate coding frame sequences based on the sequence information of all transcripts in S3; by the standard applied, a candidate has a start codon (NUG) and a stop codon (UAG, UAA or UGA), and its length is an integer multiple of 3; first search for candidate coding frames starting with AUG, from long to short, calculating them one by one; only after all candidate coding frames starting with AUG have been searched and none meets the output conditions are the coding frames starting with NUG searched and calculated;

S72: Extract the features of these candidate coding frames according to the method in S6, and perform four sets of statistical tests, which are:

One-tailed test (a): the Z-score on phase 0 is extremely significantly greater than the Z-score on phase 1;

One-tailed test (b): the Z-score on phase 0 is extremely significantly greater than the Z-score on phase 2;

One-tailed test (c): the usage frequency of codons on phase 0 is extremely significantly greater than that of codons on phase 1;

One-tailed test (d): the usage frequency of codons on phase 0 is extremely significantly greater than that of codons on phase 2;

S73: P value combination: combine the multiple P values from S72 into a final P value through the weighted chi-square algorithm;

S74: Output the prediction result for the gene coding frame: output the coding frames with a P value ≤ 0.001, with the false discovery rate (FDR) of the output coding frames controlled at ≤ 0.0001 according to the Benjamini and Hochberg method; the candidate coding frames that meet this standard are output as the final result.
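
The candidate search in S71 can be sketched as follows. This is an illustrative sketch only: it assumes that each start codon is paired with the first in-frame stop codon downstream, that coordinates are 0-based with an exclusive end, and that the NUG pass also covers AUG. The function name and the toy transcript are hypothetical, and the statistical tests of S72 to S74 are not included.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def find_candidate_frames(transcript, start_pattern="AUG"):
    """Return (start, end) pairs of candidate frames, longest first.

    start_pattern is "AUG" for the first pass or "NUG" for the second
    pass, in which any base is accepted at the first codon position.
    """
    candidates = []
    n = len(transcript)
    for start in range(n - 2):
        codon = transcript[start:start + 3]
        if start_pattern == "NUG":
            is_start = codon[1:] == "UG"
        else:
            is_start = codon == start_pattern
        if not is_start:
            continue
        # Scan downstream in frame until the first stop codon.
        for pos in range(start + 3, n - 2, 3):
            if transcript[pos:pos + 3] in STOP_CODONS:
                candidates.append((start, pos + 3))
                break
    # Longest candidates first, as in the search order described in S71.
    candidates.sort(key=lambda se: se[1] - se[0], reverse=True)
    return candidates


toy_transcript = "GGAUGAAACUGUUUUAAGG"
aug_frames = find_candidate_frames(toy_transcript, "AUG")
nug_frames = find_candidate_frames(toy_transcript, "NUG")
```

In the toy transcript the AUG pass finds one candidate, `(2, 17)`, and the NUG pass additionally finds the shorter CUG-initiated candidate `(8, 17)`. Every candidate length is a multiple of 3 by construction, because the stop codon is searched in frame.
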

10. A system for predicting a gene coding frame, comprising a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for predicting a gene coding frame, and the computer program, when executed by at least one processing component, implements the steps of the method for predicting a gene coding frame from low-quality ribosome imprinting data according to any one of claims 1 to 9.