CN114639442B

CN114639442B - Method and system for predicting open reading frame based on single nucleotide polymorphism

Info

Publication number: CN114639442B
Application number: CN202210325529.5A
Authority: CN
Inventors: 宋波; 姜梦云; 宁卫东; 程时锋
Original assignee: Agricultural Genomics Institute at Shenzhen of CAAS
Current assignee: Agricultural Genomics Institute at Shenzhen of CAAS
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2024-01-30
Anticipated expiration: 2042-03-30
Also published as: CN114639442A

Abstract

The invention discloses a method and a system for predicting open reading frames based on single nucleotide polymorphisms. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data to test and screen open reading frames in the gene sequence to be tested, counts the usage frequency of codons in those open reading frames, and combines the 3-base periodicity of nucleotide polymorphism with the codon usage statistics to evaluate, by statistical analysis, the prediction probability of small open reading frames, thereby achieving accurate prediction of small open reading frames in a genome.

Description

Method and system for predicting open reading frame based on single nucleotide polymorphism

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a method for predicting an open reading frame based on single nucleotide polymorphism and a system for predicting the open reading frame.

Background

Open reading frames (ORFs) are sequences within DNA that have the potential to encode proteins, and their annotation is one of the most important steps required for downstream analysis and use of a reference genome. Various algorithms have been developed to predict ORFs in a genome, but these sequence-based approaches fail to predict small open reading frames (sORFs). Recent studies have shown that polypeptides encoded by sORFs, which are shorter than 100 amino acids, play an important role in plant responses to abiotic and biotic stresses, in human carcinogenesis, and in some biological processes associated with cancer therapy. Prediction of sORFs has long been problematic because of their short length and their use of non-standard initiation codons (CUG, GUG, UUG).

In the prior art, ribosome profiling sequencing (Ribo-seq) analyzes ribosome-protected mRNA footprints (RPFs) and can be used to accurately predict translated ORFs in a number of species, including yeast, human, animals and plants. However, most of these species are simple model organisms, usually with diploid homozygous genomes, and the application of Ribo-seq to complex genomes is rarely reported. A typical eukaryotic ribosome footprint is 28 bases long, which is too short for accurate sequence mapping, and this problem is more pronounced in polyploid complex genomes. Many plant genomes are highly repetitive, highly heterozygous polyploid complex genomes, which greatly limits the application of Ribo-seq in these plants. Since many important crops, such as wheat (hexaploid) and cotton (tetraploid), are polyploid, new methods and tools are needed to identify small open reading frames in polyploid complex genomes.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method and a system for predicting open reading frames based on single nucleotide polymorphisms. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data, introduces codon usage frequency, evaluates the prediction probability value of an open reading frame by statistical analysis, and thereby predicts open reading frames in a complex genome.

The aim of the invention is realized by the following technical scheme: a method for predicting open reading frames based on single nucleotide polymorphisms, comprising the steps of:

S1. Acquiring transcript information to be predicted, and extracting candidate long open reading frames;

S2. Evaluating the variation pattern of single nucleotide polymorphisms in the candidate long open reading frames to be predicted, and screening true long open reading frames according to a preset first screening condition;

S3. Counting the usage frequency of each codon in the true long open reading frames;

S4. Extracting candidate open reading frames from the transcript information, evaluating the variation pattern of single nucleotide polymorphisms and the codon usage frequency in the candidate open reading frames to be predicted, and taking the candidate open reading frames that satisfy a preset second screening condition as the prediction result.

Further, the basis for extracting the candidate long open reading frames and the candidate open reading frames is as follows: beginning with the start codon AUG and ending with the stop codon UAG, UAA or UGA, and the sequence lengths of the candidate long open reading frame and the candidate open reading frame are integer multiples of 3.

Further, the length of the candidate long open reading frame is greater than 900bp, and the length of the candidate open reading frame is greater than 100bp.

Further, evaluating the change rule of the single nucleotide polymorphism in the candidate long open reading frame to be predicted includes:

S21. Acquiring population variation data of the sample to be predicted, and calculating the nucleotide diversity value of every site in the candidate long open reading frame to be predicted;

S22. Testing separately whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is greater than the nucleotide diversity values of the (3n-2)-th base and of the (3n-1)-th base, where 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame, to obtain P₁ and P₂, and calculating the combined P value.

Further, the first screening condition is that the P value is less than 0.0001.

Further, assessing the rules of variation and the frequency of codon usage of single nucleotide polymorphisms in the candidate open reading frames to be predicted includes:

S41. Acquiring population variation data of the sample to be predicted, and calculating the nucleotide diversity value of every site in the candidate open reading frame to be predicted;

S42. Testing separately whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is greater than the nucleotide diversity values of the (3n-2)-th base and of the (3n-1)-th base, where 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain P₁' and P₂'; testing separately whether the usage frequency, as a codon, of the triplet starting at the (3n-2)-th base in the candidate open reading frame is higher than that of the triplets starting at the (3n-1)-th base and at the 3n-th base, to obtain P₃' and P₄'; and calculating the P' value obtained by combining the four values P₁', P₂', P₃' and P₄'.

Further, the second screening condition is to control the false discovery rate (FDR) of the P' values that meet a preset requirement, with the FDR controlled to be less than or equal to 0.0001.

Further, the preset requirement is that the P' value is less than 0.05.

It is another object of the present invention to provide a system for predicting open reading frames based on single nucleotide polymorphisms, comprising a processor and a storage medium storing machine-readable instructions executable by the processor, which when executed, perform the method of predicting open reading frames described above.

The beneficial effects of the invention are as follows:

1) The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data. This periodicity arises because the third base of a codon in a gene's coding sequence is usually a degenerate base: mutations at this position are more readily tolerated and are less subject to natural selection, so the third base of a codon shows higher polymorphism in natural populations. Sequence segments with significant 3-base periodicity are sought in the population's genomic variation data to determine the translation phase of an open reading frame, from which the start and stop sites are determined and the prediction of the open reading frame is completed. By introducing codon usage frequency, the prediction probability value of an open reading frame is evaluated comprehensively by statistical analysis, so that open reading frames in a genome are predicted accurately. The method is also suitable for predicting and identifying small open reading frames in polyploid complex genomes, and helps advance research on such genomes.

2) The invention also provides a system for predicting open reading frames by applying the method, the processing process of the method steps is applied to a computer in the form of a computer program, and after necessary information such as group variation data, transcripts and the like of a sample to be predicted is input, the computer program outputs a prediction result, so that the method is beneficial to improving the use efficiency of the method and promoting the application of the method in polyploid complex genome research.

Drawings

Fig. 1 is a schematic diagram of the technical route of the present invention.

Fig. 2 is a flow chart of the method of the present invention.

FIG. 3 is an example of two open reading frames predicted in the first embodiment of the present invention.

FIG. 4 is a graph showing the results of evaluation of the predicted effect of the first embodiment of the present invention, showing the open reading frames identified from SNPs of cotton by the method of the present invention.

FIG. 5 shows the predicted results of a small open reading frame according to the first embodiment of the present invention.

FIG. 6 is supporting evidence of protein mass spectrometry data according to example one of the present invention.

FIG. 7 shows two predicted open reading frames according to the second embodiment of the present invention.

FIG. 8 is a result of evaluation of the predictive effect of the second embodiment of the present invention, showing the open reading frames identified from SNPs of wheat by the method of the present invention.

FIG. 9 shows the predicted results of a small open reading frame according to the second embodiment of the present invention.

FIG. 10 is supporting evidence of protein mass spectrometry data for example two of the present invention.

Detailed Description

The technical solutions of the present invention will be clearly and completely described below with reference to the embodiments, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention, based on the embodiments of the present invention.

As shown in fig. 1 and 2, the present invention provides a method for predicting an open reading frame based on a single nucleotide polymorphism, comprising the steps of:

S1. Transcript information to be predicted is obtained, and candidate long open reading frames are extracted.

Transcript sequence information is obtained from the genomic sequence to be predicted, and candidate long open reading frames greater than 900 bp in length are extracted from it. The basis for extracting candidate long open reading frames is as follows: the sequence begins with the initiation codon AUG and ends with the termination codon UAG, UAA or UGA, and its length is an integer multiple of 3 and greater than 900 bp.
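The extraction rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `extract_orfs`, the half-open coordinate convention, and the choice to report every AUG (including nested in-frame ones) are assumptions that the text does not specify.

```python
START_CODON = "AUG"
STOP_CODONS = {"UAG", "UAA", "UGA"}

def extract_orfs(transcript, min_length):
    """Return (start, end) half-open intervals of ORFs longer than min_length nt.

    Hypothetical helper: each ORF starts at an AUG and ends at the first
    in-frame stop codon, so its length is always a multiple of 3.
    """
    rna = transcript.upper().replace("T", "U")  # accept DNA or RNA input
    orfs = []
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != START_CODON:
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3  # the stop codon is part of the ORF
                if end - start > min_length:
                    orfs.append((start, end))
                break
    return orfs
```

Under these assumptions, calling `extract_orfs(seq, 900)` would give the candidate long open reading frames of step S1, and `extract_orfs(seq, 100)` the candidate open reading frames of step S4.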

S2, evaluating the change rule of single nucleotide polymorphism in the candidate long open reading frame to be predicted, and screening the true long open reading frame according to a preset first screening condition.

Wherein, the evaluation of the variation law of single nucleotide polymorphism in the candidate long open reading frame to be predicted mainly comprises the following steps:

S21. Population variation data of the genome to be predicted are obtained, the nucleotide diversity value of every site in the candidate long open reading frame to be predicted is calculated, and these per-site nucleotide diversity values are used as the basis for the screening condition.
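The text does not state how the per-site nucleotide diversity value is computed. A minimal sketch, assuming the common unbiased per-site estimator π = n/(n-1) · (1 - Σ pᵢ²) over the allele counts observed at one site, could look as follows; the function name `site_nucleotide_diversity` and the allele-count input format are hypothetical.

```python
def site_nucleotide_diversity(allele_counts):
    """Per-site nucleotide diversity from allele counts at one site.

    Assumed estimator (not given in the source): n/(n-1) * (1 - sum(p_i^2)),
    where n is the number of sampled chromosomes and p_i the allele frequencies.
    """
    n = sum(allele_counts)
    if n < 2:
        return 0.0  # diversity is undefined for fewer than two chromosomes
    homozygosity = sum((count / n) ** 2 for count in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)
```

A monomorphic site such as `[10, 0]` yields 0.0, while a site with two equally frequent alleles yields a value slightly above 0.5, so sites with more segregating variation receive larger values.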

S22. It is tested separately whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is greater than the nucleotide diversity values of the (3n-2)-th base and of the (3n-1)-th base; that is, whether the nucleotide diversity value of the third nucleotide of each codon in each candidate long open reading frame is greater than those of the first and second nucleotides of that codon. Here 1 ≤ n ≤ L/3, and L is the length of the candidate long open reading frame. The tests yield the results P₁ and P₂. P₁ and P₂ are combined using the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the combined P value. The specific calculation is as follows:

P = scipy.stats.combine_pvalues([P1, P2]).

The first screening condition is that the P value is less than 0.0001; a candidate long open reading frame whose P value satisfies the first screening condition is evaluated as a true long open reading frame.
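The periodicity test of S22 can be illustrated with a self-contained sketch. Two points are assumptions rather than statements from the source: the text does not name the statistical test that produces P₁ and P₂, so a one-sided sign test is swapped in here purely for illustration, and the combination step is written out by hand as Fisher's method, which is the default method of `scipy.stats.combine_pvalues` (that function returns both a test statistic and the combined p-value). All function names are hypothetical.

```python
import math

def sign_test_greater(x, y):
    """One-sided sign test p-value for 'x tends to exceed y' (ties dropped)."""
    wins = sum(a > b for a, b in zip(x, y))
    losses = sum(a < b for a, b in zip(x, y))
    m = wins + losses
    if m == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(m, 0.5)
    return sum(math.comb(m, k) for k in range(wins, m + 1)) / 2 ** m

def fisher_combine(pvalues):
    """Fisher's method, using the closed-form chi-square tail for 2k d.o.f."""
    k = len(pvalues)
    # half_stat is X/2, where X = -2 * sum(ln p); p is clamped to avoid log(0)
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

def periodicity_pvalue(pi):
    """pi: per-site diversity values of one ORF (length a multiple of 3)."""
    first, second, third = pi[0::3], pi[1::3], pi[2::3]
    p1 = sign_test_greater(third, first)   # third codon position vs first
    p2 = sign_test_greater(third, second)  # third codon position vs second
    return fisher_combine([p1, p2])
```

Under these assumptions, a candidate long open reading frame would pass the first screening condition when `periodicity_pvalue(pi) < 0.0001`.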

S3. The usage frequency of each codon in the true long open reading frames is counted: the number of occurrences of each codon in the true long open reading frames is counted, and its proportion of the total number of occurrences of all codons is calculated. This proportion is the usage frequency of the codon and is used to represent how frequently each codon is used across the genes of the genome to be predicted.
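The counting in S3 amounts to a frequency table over in-frame codons. A minimal sketch follows; the function name `codon_usage` and the input format (a list of ORF sequences already known to be in frame) are assumptions.

```python
from collections import Counter

def codon_usage(orf_sequences):
    """Usage frequency of each codon across a set of in-frame ORF sequences."""
    counts = Counter()
    for seq in orf_sequences:
        rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
        # Split the ORF into consecutive, non-overlapping codons.
        counts.update(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

The resulting dictionary maps each observed codon to its share of all codon occurrences, so the values sum to 1.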

S4. Candidate open reading frames are extracted from the transcript information. The basis for extracting candidate open reading frames is as follows: the sequence begins with the start codon AUG and ends with the stop codon UAG, UAA or UGA, and its length is an integer multiple of 3 and greater than 100 bp. Each extracted candidate open reading frame is then tested and screened to verify whether it has the characteristics of an open reading frame, giving the prediction result.

The detection and screening process is mainly used for evaluating the change rule and the codon usage frequency of single nucleotide polymorphism in a candidate open reading frame to be predicted, and mainly comprises the following steps of:

S41. Population variation data of the genome to be predicted are obtained, the nucleotide diversity value of every site in the candidate open reading frame to be predicted is calculated, and these per-site nucleotide diversity values are used as the basis for the screening condition.

S42. It is tested separately whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is greater than the nucleotide diversity values of the (3n-2)-th base and of the (3n-1)-th base; that is, whether the nucleotide diversity value of the third nucleotide of each codon in each candidate open reading frame is greater than those of the first and second nucleotides of that codon. Here 1 ≤ n ≤ L'/3, and L' is the length of the candidate open reading frame. The tests yield the results P₁' and P₂'.

It is also tested separately whether the usage frequency, as a codon, of the triplet starting at the (3n-2)-th base in the candidate open reading frame is higher than that of the triplets starting at the (3n-1)-th base and at the 3n-th base, yielding the test results P₃' and P₄'. P₃' and P₄' reflect how well the codon usage frequency of the triplets in the candidate open reading frame agrees with the statistics obtained in S3, and hence the reliability of the candidate as an open reading frame. P₁', P₂', P₃' and P₄' are combined using the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the P' value, the combination of the four values. The specific calculation is as follows:

P' = scipy.stats.combine_pvalues([P1', P2', P3', P4']).
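The codon-usage part of S42 can be sketched as a comparison of the annotated reading frame against the two shifted frames. As before, the source does not name the statistical test, so the one-sided sign test below is an assumed stand-in, Fisher's method is written out by hand as the default behaviour of `scipy.stats.combine_pvalues`, and every function name, along with the choice to score unseen triplets as frequency 0, is hypothetical.

```python
import math

def sign_test_greater(x, y):
    """One-sided sign test p-value for 'x tends to exceed y' (ties dropped)."""
    wins = sum(a > b for a, b in zip(x, y))
    losses = sum(a < b for a, b in zip(x, y))
    m = wins + losses
    if m == 0:
        return 1.0
    return sum(math.comb(m, k) for k in range(wins, m + 1)) / 2 ** m

def fisher_combine(pvalues):
    """Fisher's method with the closed-form chi-square tail for 2k d.o.f."""
    k = len(pvalues)
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

def frame_usage_pvalues(orf, usage):
    """P3', P4': in-frame triplets vs triplets shifted by +1 and +2 bases."""
    rna = orf.upper().replace("T", "U")

    def frame(offset):
        # Usage frequency (from S3) of every triplet read at this offset;
        # triplets never seen in S3 are scored as 0 (an assumption).
        return [usage.get(rna[i:i + 3], 0.0)
                for i in range(offset, len(rna) - 2, 3)]

    in_frame, shifted1, shifted2 = frame(0), frame(1), frame(2)
    return (sign_test_greater(in_frame, shifted1),
            sign_test_greater(in_frame, shifted2))

def combined_p_prime(p1, p2, orf, usage):
    """Combine the diversity tests (p1, p2) with the codon-usage tests."""
    p3, p4 = frame_usage_pvalues(orf, usage)
    return fisher_combine([p1, p2, p3, p4])
```

Pairing the n-th in-frame triplet with the triplets that start one and two bases later mirrors the wording of S42, where the triplet starting at base 3n-2 is compared with those starting at bases 3n-1 and 3n.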

The second screening condition is to control the false discovery rate (FDR) of the P' values that meet a preset requirement, where the preset requirement is that the P' value is less than 0.05 and the FDR is controlled to be less than or equal to 0.0001. The candidate open reading frames that satisfy the second screening condition constitute the prediction result.
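The source does not say which FDR procedure is used or whether the correction is applied before or after the P' < 0.05 filter. A minimal sketch, assuming the Benjamini-Hochberg procedure applied to the P' values that pass the 0.05 filter, is shown below; `benjamini_hochberg` and `second_screen` are hypothetical names.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def second_screen(p_primes, p_cutoff=0.05, fdr_cutoff=0.0001):
    """Indices of candidates with P' < p_cutoff and BH q-value <= fdr_cutoff."""
    kept = [i for i, p in enumerate(p_primes) if p < p_cutoff]
    q_values = benjamini_hochberg([p_primes[i] for i in kept])
    return [i for i, q in zip(kept, q_values) if q <= fdr_cutoff]
```

With the default cutoffs taken from the text, `second_screen` returns the positions of the candidate open reading frames that would be reported as predictions under these assumptions.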

It should be noted that the method of the present invention for predicting open reading frames based on single nucleotide polymorphisms is not applicable to population genomic data with too little polymorphism; for example, the method is not suited to open reading frame prediction when the population comprises fewer than 400 individuals.

The invention also provides a system for predicting open reading frames based on single nucleotide polymorphisms, which comprises a processor and a storage medium, wherein the storage medium can be in the form of a magnetic disk, a ROM or a RAM, and machine-readable instructions executable by the processor are stored on the storage medium, and the machine-readable instructions are mainly embodied as a computer program executable on a computer processor, and the method for predicting open reading frames is executed by the program so as to realize the prediction of open reading frames.

Embodiment one: analysis of cotton population data

The experimental data of this example were downloaded from figshare. The data were published by Li et al. in Genome Biology in 2021, in the article titled "Cotton pan-genome retrieves the lost sequences and genes during domestication and selection", and comprise whole-genome resequencing data for 1961 samples.

S1, extracting a candidate long open reading frame from transcript information, wherein the candidate long open reading frame starts with a start codon AUG and ends with a stop codon UAG, UAA or UGA, and the length of the sequence of the candidate long open reading frame is an integer multiple of 3 and is more than 900bp.

S2, checking according to the single nucleotide diversity value of each site in the candidate long open reading frames to be predicted, and screening according to the first screening condition to obtain 4065 real long open reading frames.

S3, counting the use frequency of each codon in the real long open reading frame obtained in S2.

S4. All candidate open reading frames were extracted from the transcript sequences and tested and screened according to the second screening condition. A total of 86889 candidate open reading frames were predicted to be real open reading frames. The recall rate was 76% (the proportion of known open reading frames in the genome that were recovered, i.e., number of true positives / total number of annotated ORFs × 100%), the accuracy was 94% (the proportion of predicted open reading frames consistent with known open reading frames, i.e., number of true positives / total number of predicted ORFs × 100%), and the comprehensive score was 84% [comprehensive score = 2 × recall rate × accuracy / (recall rate + accuracy)].
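The three evaluation figures are related by the formula quoted above, which is the harmonic mean of the recall rate and the accuracy. A short sketch makes the arithmetic explicit; the function names are hypothetical, and only the reported 76% and 94% are taken from the text.

```python
def comprehensive_score(recall, accuracy):
    """Comprehensive score = 2 * recall * accuracy / (recall + accuracy)."""
    if recall + accuracy == 0:
        return 0.0
    return 2 * recall * accuracy / (recall + accuracy)

def evaluation_metrics(true_positives, annotated_total, predicted_total):
    """Recall rate, accuracy and comprehensive score from raw counts."""
    recall = true_positives / annotated_total    # share of annotated ORFs found
    accuracy = true_positives / predicted_total  # share of predictions confirmed
    return recall, accuracy, comprehensive_score(recall, accuracy)

# Reported cotton figures: 2 * 0.76 * 0.94 / (0.76 + 0.94) = 1.4288 / 1.70
print(round(comprehensive_score(0.76, 0.94), 2))  # prints 0.84
```

The reported recall rate of 76% and accuracy of 94% therefore reproduce the stated comprehensive score of 84%.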

As shown in FIG. 5, the predictions also include 4704 small open reading frames, comprising 1182 uORFs, 316 ouORFs, 2110 dORFs, 557 odORFs, 477 internal ORFs and 62 truncated ORFs. In FIG. 6, the dashed lines indicate the degree to which the protein mass spectrometry data support the known ORFs in the genome; analysis of published protein mass spectrometry data shows that the predicted small open reading frames are well supported.

Embodiment two: analysis of wheat population data

The experimental data of this example were downloaded from NCBI (accession number PRJNA476679) and CNCB (accession number GVM000082). The first data set was published by Cheng Hong et al. in Genome Biology in 2019, in the article titled "Frequent intra- and inter-species introgression shape the landscape of genetic variation in bread wheat", in which 93 wheat accessions were subjected to whole-genome resequencing. The second data set was published by Zhou Yao et al. in Nature Genetics in 2020, in the article titled "Triticum population sequencing provides insights into wheat adaptation", in which a total of 414 wheat varieties were subjected to whole-genome resequencing. In this example, the two data sets were combined and then used for small open reading frame prediction.

S1, extracting a candidate long open reading frame from transcript information, starting with a start codon AUG, ending with a stop codon UAG, UAA or UGA, wherein the length of the candidate long open reading frame sequence is an integer multiple of 3, and the length is more than 900bp.

S2, checking according to nucleotide diversity values of all sites in the candidate long open reading frames to be predicted, and screening according to the first screening condition to obtain 13683 real long open reading frames.

S4, extracting all candidate open reading frames from the transcript sequences, and carrying out inspection and screening according to a second screening condition, wherein as shown in fig. 7 and 8, a total of 117140 candidate open reading frames are predicted to be real open reading frames, the accuracy rate is as high as 95% and the comprehensive score is 91%.

As shown in FIG. 9, 5025 small open reading frames were predicted successfully by the test screen, containing 232 uORFs, 21 ouORFs, 234 dORFs, 129 odORFs, 3532 internal ORFs, 675 extended ORFs, and 202 truncated ORFs. As shown in FIG. 10, the dashed lines indicate the degree of support of known ORFs in the genome by the protein mass spectrometry data, and analysis of published protein mass spectrometry data shows that these predicted small open reading frames are well supported.

The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the form disclosed herein, which should not be construed as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be modified within the scope of the inventive concept described herein, through the above teachings or the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims

1. A method for predicting open reading frames based on single nucleotide polymorphisms, characterized by: the method comprises the following steps:

S4, extracting candidate open reading frames from the transcript information, evaluating the change rule and the codon use frequency of single nucleotide polymorphism in the candidate open reading frames to be predicted, and taking the candidate open reading frames meeting the preset second screening conditions as prediction results;

the evaluation of the change rule of the single nucleotide polymorphism in the candidate long open reading frame to be predicted comprises:

S22, respectively checking whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein n is more than or equal to 1 and less than or equal to L/3, L is the length of the candidate long open reading frame, P1 and P2 are obtained, and calculating the combined P value;

the evaluating of the change rule and codon usage frequency of single nucleotide polymorphisms in the candidate open reading frames to be predicted comprises:

S42, respectively checking whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein n is more than or equal to 1 and less than or equal to L'/3, L' is the length of the candidate open reading frame, and P1' and P2' are obtained; and respectively checking whether the use frequency, as a codon, of the triplet taking the (3n-2)-th base as the starting point in the candidate open reading frame is higher than that of the triplets taking the (3n-1)-th base and the 3n-th base as the starting point, obtaining P3' and P4', and calculating the P' value after combining the four values of P1', P2', P3' and P4'.

2. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the basis for extracting the candidate long open reading frames and the candidate open reading frames is as follows: beginning with the start codon AUG and ending with the stop codon UAG, UAA or UGA, and the sequence lengths of the candidate long open reading frame and the candidate open reading frame are integer multiples of 3.

3. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the length of the candidate long open reading frame is greater than 900bp, and the length of the candidate open reading frame is greater than 100bp.

4. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the first screening condition is that the P value is less than 0.0001.

5. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the second screening condition is to control the false discovery rate FDR of the P' value meeting the preset requirement, and the FDR is controlled to be less than or equal to 0.0001.

6. The method for predicting open reading frames based on single nucleotide polymorphisms as recited in claim 5 wherein: the preset requirement is that the P' value is less than 0.05.

7. A system for predicting open reading frames based on single nucleotide polymorphisms, characterized by: comprising a processor and a storage medium storing machine-readable instructions executable by the processor, which when executed perform the method of predicting an open reading frame of any one of claims 1-6.