CN114639442B - Method and system for predicting open reading frame based on single nucleotide polymorphism - Google Patents

Method and system for predicting open reading frame based on single nucleotide polymorphism

Info

Publication number
CN114639442B
Authority
CN
China
Prior art keywords
open reading
candidate
reading frame
reading frames
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210325529.5A
Other languages
Chinese (zh)
Other versions
CN114639442A (en)
Inventor
宋波
姜梦云
宁卫东
程时锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Genomics Institute at Shenzhen of CAAS
Original Assignee
Agricultural Genomics Institute at Shenzhen of CAAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Genomics Institute at Shenzhen of CAAS
Priority to CN202210325529.5A
Publication of CN114639442A
Application granted
Publication of CN114639442B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a system for predicting open reading frames based on single nucleotide polymorphism. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data to test and screen open reading frames in the gene sequence under analysis, counts the usage frequency of codons in those open reading frames, and combines the 3-base periodicity of nucleotide polymorphism with the codon usage statistics to evaluate, by statistical analysis, the prediction probability of small open reading frames, thereby achieving accurate prediction of small open reading frames in a genome.

Description

Method and system for predicting open reading frame based on single nucleotide polymorphism
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a method for predicting an open reading frame based on single nucleotide polymorphism and a system for predicting the open reading frame.
Background
Open reading frames (ORFs) are sequences within DNA that have the potential to encode proteins, and annotating them in a genome is one of the most important steps required for downstream analysis and use of a reference genome. Various algorithms have been developed to predict ORFs in genomes, but these sequence-based approaches fail to predict small open reading frames (sORFs). Recent studies have shown that polypeptides shorter than 100 amino acids encoded by sORFs play an important role in plant responses to abiotic and biotic stresses, in human carcinogenesis, and in some biological processes associated with cancer therapy. Predicting sORFs has long been difficult because of their short length and their use of non-standard initiation codons (CUG, GUG, UUG).
In the prior art, ribosome profiling (Ribo-seq) analyzes ribosome-protected mRNA fragments (RPFs) and can be used to accurately predict translated ORFs in many species, including yeast, human, animals and plants. However, most of these species are simple model organisms, usually with diploid homozygous genomes, and the application of Ribo-seq to complex genomes is rarely reported. A typical eukaryotic ribosome footprint is 28 bases long, which is too short for accurate sequence alignment, a problem that is more pronounced in polyploid complex genomes. Many plant genomes are highly repetitive, highly heterozygous, polyploid complex genomes, which greatly limits the application of Ribo-seq in these plants. Since many important crops, such as wheat (hexaploid) and cotton (tetraploid), are polyploid, new methods and tools are needed for identifying small open reading frames in polyploid complex genomes.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and a system for predicting open reading frames based on single nucleotide polymorphism. The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data, introduces codon usage frequency, evaluates the prediction probability of an open reading frame comprehensively by statistical analysis, and thereby predicts open reading frames in complex genomes.
The aim of the invention is realized by the following technical scheme: a method for predicting open reading frames based on single nucleotide polymorphisms, comprising the steps of:
S1, acquiring transcript information to be predicted, and extracting candidate long open reading frames;
S2, evaluating the change rule of single nucleotide polymorphism in the candidate long open reading frame to be predicted, and screening the true long open reading frame according to a preset first screening condition;
S3, counting the use frequency of each codon in the true long open reading frame;
S4, extracting candidate open reading frames from the transcript information, evaluating the change rule and the codon use frequency of single nucleotide polymorphism in the candidate open reading frames to be predicted, and taking the candidate open reading frames meeting the preset second screening condition as prediction results.
Further, the basis for extracting the candidate long open reading frames and the candidate open reading frames is as follows: beginning with the start codon AUG and ending with the stop codon UAG, UAA or UGA, and the sequence lengths of the candidate long open reading frame and the candidate open reading frame are integer multiples of 3.
Further, the length of the candidate long open reading frame is greater than 900bp, and the length of the candidate open reading frame is greater than 100bp.
Further, evaluating the change rule of the single nucleotide polymorphism in the candidate long open reading frame to be predicted includes:
S21, acquiring population variation data of a sample to be predicted, and calculating nucleotide diversity values of all sites in a candidate long open reading frame to be predicted;
S22, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame, to obtain P1 and P2, and calculating the combined P value.
Further, the first screening condition is that the P value is less than 0.0001.
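The patent does not state how the per-site nucleotide diversity value used in S21 is computed. The following is a minimal sketch, assuming the standard unbiased estimator π = n/(n-1) · (1 - Σ p_i²) computed from the allele counts at each site, with sites absent from the variant data treated as monomorphic (π = 0). The function names and the `variants` input layout are hypothetical and are not taken from the patent.

```python
def site_diversity(allele_counts):
    """Unbiased nucleotide diversity at one site, from its allele counts."""
    n = sum(allele_counts)
    if n < 2:
        return 0.0  # not defined for fewer than two sampled alleles
    homozygosity = sum((count / n) ** 2 for count in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)


def orf_diversity(orf_length, variants):
    """Diversity value for every site of one candidate reading frame.

    `variants` maps a 0-based position inside the frame to the allele
    counts observed there (hypothetical layout); all other positions
    are treated as monomorphic and receive 0.0.
    """
    return [
        site_diversity(variants[pos]) if pos in variants else 0.0
        for pos in range(orf_length)
    ]


# Toy frame of two codons with a SNP at each third codon position.
pi = orf_diversity(6, {2: [5, 5], 5: [9, 1]})
print(pi)
```

The resulting list has one value per base, so the values for the first, second and third codon positions can be read off with a stride of 3 for the test in S22.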
Further, assessing the rules of variation and the frequency of codon usage of single nucleotide polymorphisms in the candidate open reading frames to be predicted includes:
S41, acquiring population variation data of a sample to be predicted, and calculating nucleotide diversity values of all sites in a candidate open reading frame to be predicted;
S42, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain P1' and P2'; respectively testing whether the use frequency, as a codon, of the triplet starting at the (3n-2)-th base in the candidate open reading frame is higher than the use frequencies of the triplets starting at the (3n-1)-th base and at the 3n-th base, to obtain P3' and P4'; and calculating the P' value obtained by combining the four values P1', P2', P3' and P4'.
Further, the second screening condition is to control the false discovery rate (FDR) of the P' values that meet the preset requirement, the FDR being controlled to be less than or equal to 0.0001.
Further, the preset requirement is that the P' value is less than 0.05.
It is another object of the present invention to provide a system for predicting open reading frames based on single nucleotide polymorphisms, comprising a processor and a storage medium storing machine-readable instructions executable by the processor, which when executed, perform the method of predicting open reading frames described above.
The beneficial effects of the invention are as follows:
1) The invention uses the 3-base periodicity of nucleotide polymorphism within coding sequences in population genomic variation data. Because the third base of a codon in a coding sequence is usually degenerate, mutations at this position are more readily tolerated and are less subject to natural selection, so the third codon position shows higher polymorphism in natural populations. Sequence segments with significant 3-base periodicity are identified in the population variation data to determine the translation phase of an open reading frame, from which the start and stop sites are determined and the open reading frame prediction is completed. By additionally introducing codon usage frequency, the prediction probability of an open reading frame is evaluated comprehensively by statistical analysis, achieving accurate prediction of open reading frames in a genome. The method is also suitable for predicting and identifying small open reading frames in polyploid complex genomes and helps advance research on such genomes.
2) The invention also provides a system for predicting open reading frames that applies the method. The method steps are implemented as a computer program; after the necessary information, such as the population variation data and the transcripts of the sample to be predicted, is input, the program outputs the prediction result. This improves the efficiency of using the method and promotes its application in research on polyploid complex genomes.
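The degeneracy argument in benefit 1) can be checked directly against the standard genetic code. The sketch below is an illustration rather than part of the patented method: for each codon position it counts what fraction of single-base substitutions in sense codons leave the encoded amino acid unchanged. It assumes the standard genetic code (NCBI translation table 1) written in the usual UCAG order; the function name is hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction_by_position():
    """Fraction of single-base changes that are synonymous, per codon position."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, amino_acid in CODE.items():
        if amino_acid == "*":
            continue  # only sense codons are mutated
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                if CODE[mutant] == amino_acid:
                    synonymous[pos] += 1
    return [s / t for s, t in zip(synonymous, total)]


fractions = synonymous_fraction_by_position()
print(fractions)
```

Most substitutions at the third position are synonymous, very few at the first, and none at the second, which is why relaxed selection at the third position produces a 3-base periodic polymorphism signal along a translated frame.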
Drawings
Fig. 1 is a schematic diagram of the technical route of the present invention.
Fig. 2 is a flow chart of the method of the present invention.
FIG. 3 is an example of two open reading frames predicted in the first embodiment of the present invention.
FIG. 4 is a graph showing the results of evaluation of the predicted effect of the first embodiment of the present invention, showing the open reading frames identified from SNPs of cotton by the method of the present invention.
FIG. 5 shows the predicted results of a small open reading frame according to the first embodiment of the present invention.
FIG. 6 is supporting evidence of protein mass spectrometry data according to example one of the present invention.
FIG. 7 shows two predicted open reading frames according to the second embodiment of the present invention.
FIG. 8 is a result of evaluation of the predictive effect of the second embodiment of the present invention, showing the open reading frames identified from SNPs of wheat by the method of the present invention.
FIG. 9 shows the predicted results of a small open reading frame according to the second embodiment of the present invention.
FIG. 10 is supporting evidence of protein mass spectrometry data for example two of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described below with reference to the embodiments, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention, based on the embodiments of the present invention.
As shown in fig. 1 and 2, the present invention provides a method for predicting an open reading frame based on a single nucleotide polymorphism, comprising the steps of:
S1, transcript information to be predicted is obtained, and candidate long open reading frames are extracted.
Transcript sequence information is obtained from the genomic sequence to be predicted, and candidate long open reading frames longer than 900 bp are extracted from it. The basis for extracting candidate long open reading frames is as follows: each begins with the initiation codon AUG and ends with the termination codon UAG, UAA or UGA, and the sequence length is an integer multiple of 3 and greater than 900 bp.
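The extraction rule in S1 can be sketched as a simple scan over a transcript. This is a minimal illustration under stated assumptions, not the patented implementation: the length is counted from the A of AUG through the last base of the stop codon, every AUG is treated as a possible start (so nested candidates sharing a stop codon are all reported), and only the forward strand of the transcript is scanned. The function name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def find_orfs(transcript, min_length):
    """Return (start, end) pairs, 0-based and end-exclusive, of candidate ORFs.

    A candidate begins with AUG, ends at the first in-frame stop codon,
    and must be longer than `min_length` bases (stop codon included).
    """
    rna = transcript.upper().replace("T", "U")  # accept DNA or RNA input
    orfs = []
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != "AUG":
            continue
        # Walk codon by codon, so the length is always a multiple of 3.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start > min_length:
                    orfs.append((start, end))
                break
    return orfs


toy = "CC" + "AUG" + "GCU" * 5 + "UAA" + "G"
print(find_orfs(toy, 9))
```

Under these assumptions, `min_length=900` would give the candidate long open reading frames of S1 and `min_length=100` the candidate open reading frames of S4.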
S2, evaluating the change rule of single nucleotide polymorphism in the candidate long open reading frame to be predicted, and screening the true long open reading frame according to a preset first screening condition.
Wherein, the evaluation of the variation law of single nucleotide polymorphism in the candidate long open reading frame to be predicted mainly comprises the following steps:
S21, obtaining population variation data of the genome to be predicted, and calculating the nucleotide diversity values of all sites in the candidate long open reading frame to be predicted, these values serving as the basis of the screening condition.
S22, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, that is, testing whether the third nucleotide of each codon in each candidate long open reading frame has a larger nucleotide diversity value than the first nucleotide and the second nucleotide of that codon, wherein 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame. This gives the test results P1 and P2, which are merged with the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the combined P value. The specific calculation is as follows:
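P = scipy.stats.combine_pvalues([P1, P2])[1]
Here combine_pvalues returns the test statistic followed by the combined P value, so the second element is taken.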
The first screening condition is that the P value is less than 0.0001; a candidate long open reading frame whose P value satisfies the first screening condition is evaluated as a true long open reading frame.
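Step S22 and the first screening condition can be sketched as follows. The patent names `scipy.stats.combine_pvalues` but does not say which statistical test produces P1 and P2, so the one-sided Mann-Whitney U test used here is an assumption, as are the function names. The input is a list of per-site nucleotide diversity values for one candidate frame.

```python
from scipy.stats import combine_pvalues, mannwhitneyu


def periodicity_pvalue(pi):
    """Combined P value for 'third codon position is more diverse'."""
    if len(pi) % 3 != 0:
        raise ValueError("frame length must be a multiple of 3")
    first, second, third = pi[0::3], pi[1::3], pi[2::3]
    # P1: 3n-th base versus (3n-2)-th base; P2: 3n-th versus (3n-1)-th base.
    p1 = mannwhitneyu(third, first, alternative="greater").pvalue
    p2 = mannwhitneyu(third, second, alternative="greater").pvalue
    # combine_pvalues defaults to Fisher's method and returns
    # (statistic, combined p-value).
    _, combined = combine_pvalues([p1, p2])
    return combined


def is_true_long_orf(pi, threshold=1e-4):
    """First screening condition: combined P value below 0.0001."""
    return periodicity_pvalue(pi) < threshold


periodic = [0.0, 0.0, 0.3] * 300   # diversity concentrated at third positions
shifted = [0.3, 0.3, 0.0] * 300    # same signal read in the wrong phase
print(is_true_long_orf(periodic), is_true_long_orf(shifted))
```

Because both tests are one-sided, a frame read in the wrong phase is not merely less significant but is rejected outright, which is what allows the periodicity signal to fix the translation phase.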
S3, counting the use frequency of each codon in the true long open reading frames: the number of occurrences of each codon in the true long open reading frames is counted, and its proportion of the total number of occurrences of all codons is calculated. This proportion is the use frequency of the codon and is used to represent the use frequency of that codon across the whole genome to be predicted.
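The counting in S3 amounts to a frequency table over in-frame triplets. A minimal sketch, assuming the true long open reading frames are available as sequence strings that start at the first base of the start codon; the function name is hypothetical.

```python
from collections import Counter


def codon_usage(orfs):
    """Relative usage frequency of each codon across a set of ORF sequences."""
    counts = Counter()
    for orf in orfs:
        rna = orf.upper().replace("T", "U")  # accept DNA or RNA input
        counts.update(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


usage = codon_usage(["AUGGCUGCUUAA"])
print(usage)
```

Codons that never occur in the true long open reading frames are simply absent from the table, so a later lookup has to supply a frequency of 0 for them.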
S4, extracting candidate open reading frames from the transcript information. The basis for extracting the candidate open reading frames is as follows: each begins with the start codon AUG and ends with the stop codon UAG, UAA or UGA, and the sequence length is an integer multiple of 3 and greater than 100 bp. Each extracted candidate open reading frame is then tested and screened to verify whether it has the characteristics of an open reading frame, giving the prediction result.
The testing and screening process evaluates the change rule of single nucleotide polymorphism and the codon usage frequency in the candidate open reading frames to be predicted, and mainly comprises the following steps:
S41, obtaining population variation data of the genome to be predicted, and calculating the nucleotide diversity values of all sites in the candidate open reading frames to be predicted, these values serving as the basis of the screening condition.
S42, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is greater than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, that is, testing whether the third nucleotide of each codon in each candidate open reading frame has a greater nucleotide diversity value than the first nucleotide and the second nucleotide of that codon, wherein 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain the test results P1' and P2'.
Separately testing whether the use frequency, as a codon, of the triplet starting at the (3n-2)-th base in the candidate open reading frame is higher than the use frequencies, as codons, of the triplets starting at the (3n-1)-th base and at the 3n-th base, to obtain the test results P3' and P4'. P3' and P4' reflect how well the use frequencies of the in-frame triplets of the candidate open reading frame agree with the statistical result of S3, and hence the reliability of the candidate as an open reading frame. P1', P2', P3' and P4' are merged with the "combine_pvalues" function of the "scipy.stats" module of the Python language to obtain the combined P' value. The specific calculation is as follows:
P' = scipy.stats.combine_pvalues([P1', P2', P3', P4'])[1]
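The codon usage part of S42 can be sketched by reading the candidate in its own frame and in the two shifted frames, looking each triplet up in the codon usage table from S3, and testing whether the in-frame frequencies are higher. The choice of a one-sided Mann-Whitney U test is an assumption (the patent does not name the test), triplets absent from the table are given a frequency of 0, and all function names are hypothetical.

```python
from scipy.stats import combine_pvalues, mannwhitneyu


def frame_frequencies(rna, usage, offset):
    """Usage frequency of every triplet read from `offset` (0, 1 or 2)."""
    return [usage.get(rna[i:i + 3], 0.0)
            for i in range(offset, len(rna) - 2, 3)]


def codon_frame_pvalues(rna, usage):
    """P3' and P4': in-frame triplets versus the two shifted frames."""
    in_frame = frame_frequencies(rna, usage, 0)   # starts at (3n-2)-th base
    shift_1 = frame_frequencies(rna, usage, 1)    # starts at (3n-1)-th base
    shift_2 = frame_frequencies(rna, usage, 2)    # starts at 3n-th base
    p3 = mannwhitneyu(in_frame, shift_1, alternative="greater").pvalue
    p4 = mannwhitneyu(in_frame, shift_2, alternative="greater").pvalue
    return p3, p4


def combined_pvalue(p1, p2, p3, p4):
    """Merge the four test results into P' (Fisher's method by default)."""
    _, combined = combine_pvalues([p1, p2, p3, p4])
    return combined


# Toy codon usage table and candidate frame, for illustration only.
toy_usage = {"AUG": 0.1, "GCU": 0.8, "UAA": 0.1}
toy_orf = "AUG" + "GCU" * 50 + "UAA"
p3, p4 = codon_frame_pvalues(toy_orf, toy_usage)
print(combined_pvalue(0.01, 0.01, p3, p4))
```

Here `0.01` stands in for the polymorphism test results P1' and P2', which in practice come from the same periodicity test that S22 applies to the long candidates.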
The second screening condition is to control the false discovery rate (FDR) of the P' values that meet the preset requirement, wherein the preset requirement is that the P' value is less than 0.05 and the FDR is controlled to be less than or equal to 0.0001. The candidate open reading frames that satisfy the second screening condition constitute the prediction result.
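The second screening condition can be sketched as a two-stage filter. The patent does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as is the reading that the adjustment is applied only to the candidates whose P' value is already below 0.05. The function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_orfs(p_values, p_cutoff=0.05, fdr_cutoff=1e-4):
    """Keep candidates with P' < p_cutoff whose adjusted value <= fdr_cutoff."""
    passing = {name: p for name, p in p_values.items() if p < p_cutoff}
    names = list(passing)
    adjusted = benjamini_hochberg([passing[name] for name in names])
    return [name for name, q in zip(names, adjusted) if q <= fdr_cutoff]


# Hypothetical P' values for four candidates.
candidates = {"orf_a": 1e-9, "orf_b": 2e-6, "orf_c": 0.03, "orf_d": 0.2}
print(select_orfs(candidates))
```

If the adjustment were instead applied over all candidates before the 0.05 filter, the adjusted values would be larger and the selection more conservative.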
It should be noted that the method of predicting open reading frames based on single nucleotide polymorphisms of the present invention is not suitable for population genomic data with too little polymorphism; for example, the method is not suited to open reading frame prediction when the population size is less than 400.
The invention also provides a system for predicting open reading frames based on single nucleotide polymorphisms, which comprises a processor and a storage medium. The storage medium may take the form of a magnetic disk, a ROM or a RAM, and stores machine-readable instructions executable by the processor. These instructions are mainly embodied as a computer program that runs on the processor and executes the above method for predicting open reading frames, thereby achieving the prediction of open reading frames.
Embodiment one: analysis of cotton population data
The experimental data of this example were downloaded from figshare. They were published by Li Jianying et al. in 2021 in Genome Biology, in the article "Cotton pan-genome retrieves the lost sequences and genes during domestication and selection", and comprise whole-genome resequencing data for 1961 samples.
S1, extracting a candidate long open reading frame from transcript information, wherein the candidate long open reading frame starts with a start codon AUG and ends with a stop codon UAG, UAA or UGA, and the length of the sequence of the candidate long open reading frame is an integer multiple of 3 and is more than 900bp.
S2, checking according to the single nucleotide diversity value of each site in the candidate long open reading frames to be predicted, and screening according to the first screening condition to obtain 4065 real long open reading frames.
S3, counting the use frequency of each codon in the real long open reading frame obtained in S2.
S4, extracting all candidate open reading frames from the transcript sequences, and testing and screening them according to the second screening condition. A total of 86889 candidate open reading frames were predicted to be real open reading frames. The recall rate was 76% (the proportion of known open reading frames in the genome that were predicted, i.e., number of true positives / total number of annotated ORFs × 100%), the accuracy was as high as 94% (the proportion of predicted open reading frames that agree with known open reading frames, i.e., number of true positives / total number of predicted ORFs × 100%), and the comprehensive score was 84% [comprehensive score = 2 × recall rate × accuracy / (recall rate + accuracy)].
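As a check on the arithmetic, the comprehensive score defined above is the harmonic mean of the recall rate and the accuracy, where "accuracy" as defined here (true positives over predicted ORFs) corresponds to what is usually called precision. A short sketch using only the percentages reported for cotton; the function name is hypothetical.

```python
def comprehensive_score(recall, accuracy):
    """Harmonic mean of recall and accuracy, as defined in the text."""
    return 2 * recall * accuracy / (recall + accuracy)


score = comprehensive_score(0.76, 0.94)
print(round(score, 2))  # 0.84, matching the reported comprehensive score
```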
As shown in FIG. 5, the predictions also include 4704 small open reading frames, comprising 1182 uORFs, 316 ouORFs, 2110 dORFs, 557 odORFs, 477 internal ORFs and 62 truncated ORFs. As shown in FIG. 6, the dashed lines indicate the degree to which known ORFs in the genome are supported by the protein mass spectrometry data, and analysis of published protein mass spectrometry data shows that these predicted small open reading frames are well supported.
Embodiment two: analysis of wheat population data
The experimental data of this example were downloaded from NCBI (accession number PRJNA476679) and CNCB (accession number GVM000082). The first data set was published by Cheng Hong et al. in 2019 in Genome Biology, in the article "Frequent intra- and inter-species introgression shape the landscape of genetic variation in bread wheat", in which 93 wheat samples underwent whole-genome resequencing. The second data set was published by Zhou Yao et al. in 2020 in Nature Genetics, in the article "Triticum population sequencing provides insights into wheat adaptation", in which a total of 414 wheat varieties underwent whole-genome resequencing. In this example the two data sets were combined and used for small open reading frame prediction.
S1, extracting a candidate long open reading frame from transcript information, starting with a start codon AUG, ending with a stop codon UAG, UAA or UGA, wherein the length of the candidate long open reading frame sequence is an integer multiple of 3, and the length is more than 900bp.
S2, checking according to nucleotide diversity values of all sites in the candidate long open reading frames to be predicted, and screening according to the first screening condition to obtain 13683 real long open reading frames.
S3, counting the use frequency of each codon in the real long open reading frame obtained in S2.
S4, extracting all candidate open reading frames from the transcript sequences, and testing and screening them according to the second screening condition. As shown in FIG. 7 and FIG. 8, a total of 117140 candidate open reading frames were predicted to be real open reading frames, with an accuracy as high as 95% and a comprehensive score of 91%.
As shown in FIG. 9, 5025 small open reading frames passed the testing and screening, comprising 232 uORFs, 21 ouORFs, 234 dORFs, 129 odORFs, 3532 internal ORFs, 675 extended ORFs and 202 truncated ORFs. As shown in FIG. 10, the dashed lines indicate the degree to which known ORFs in the genome are supported by the protein mass spectrometry data, and analysis of published protein mass spectrometry data shows that these predicted small open reading frames are well supported.
The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the form disclosed herein, which should not be construed as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be modified within the scope of the inventive concept described herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (7)

1. A method for predicting open reading frames based on single nucleotide polymorphisms, characterized by: the method comprises the following steps:
S1, acquiring transcript information to be predicted, and extracting candidate long open reading frames;
S2, evaluating the change rule of single nucleotide polymorphism in the candidate long open reading frame to be predicted, and screening the true long open reading frame according to a preset first screening condition;
S3, counting the use frequency of each codon in the true long open reading frame;
S4, extracting candidate open reading frames from the transcript information, evaluating the change rule and the codon use frequency of single nucleotide polymorphism in the candidate open reading frames to be predicted, and taking the candidate open reading frames meeting the preset second screening condition as prediction results;
the evaluation of the change rule of the single nucleotide polymorphism in the candidate long open reading frame to be predicted comprises:
S21, acquiring population variation data of a sample to be predicted, and calculating nucleotide diversity values of all sites in a candidate long open reading frame to be predicted;
S22, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate long open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein 1 ≤ n ≤ L/3 and L is the length of the candidate long open reading frame, to obtain P1 and P2, and calculating the combined P value;
the evaluating of the change rule and codon usage frequency of single nucleotide polymorphisms in the candidate open reading frames to be predicted comprises:
S41, acquiring population variation data of a sample to be predicted, and calculating nucleotide diversity values of all sites in a candidate open reading frame to be predicted;
S42, respectively testing whether the nucleotide diversity value of the 3n-th base in the candidate open reading frame is larger than the nucleotide diversity values of the (3n-2)-th base and the (3n-1)-th base, wherein 1 ≤ n ≤ L'/3 and L' is the length of the candidate open reading frame, to obtain P1' and P2'; and respectively testing whether the use frequency, as a codon, of the triplet starting at the (3n-2)-th base in the candidate open reading frame is higher than the use frequencies of the triplets starting at the (3n-1)-th base and at the 3n-th base, to obtain P3' and P4', and calculating the P' value obtained by combining the four values P1', P2', P3' and P4'.
2. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the basis for extracting the candidate long open reading frames and the candidate open reading frames is as follows: beginning with the start codon AUG and ending with the stop codon UAG, UAA or UGA, and the sequence lengths of the candidate long open reading frame and the candidate open reading frame are integer multiples of 3.
3. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the length of the candidate long open reading frame is greater than 900bp, and the length of the candidate open reading frame is greater than 100bp.
4. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the first screening condition is that the P value is less than 0.0001.
5. The method for predicting open reading frames based on single nucleotide polymorphisms according to claim 1, wherein: the second screening condition is to control the false discovery rate (FDR) of the P' values that meet the preset requirement, the FDR being controlled to be less than or equal to 0.0001.
6. The method for predicting open reading frames based on single nucleotide polymorphisms as recited in claim 5 wherein: the preset requirement is that the P' value is less than 0.05.
7. A system for predicting open reading frames based on single nucleotide polymorphisms, characterized by: comprising a processor and a storage medium storing machine-readable instructions executable by the processor, which when executed perform the method of predicting an open reading frame of any one of claims 1-6.
CN202210325529.5A 2022-03-30 2022-03-30 Method and system for predicting open reading frame based on single nucleotide polymorphism Active CN114639442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325529.5A CN114639442B (en) 2022-03-30 2022-03-30 Method and system for predicting open reading frame based on single nucleotide polymorphism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210325529.5A CN114639442B (en) 2022-03-30 2022-03-30 Method and system for predicting open reading frame based on single nucleotide polymorphism

Publications (2)

Publication Number Publication Date
CN114639442A CN114639442A (en) 2022-06-17
CN114639442B CN114639442B (en) 2024-01-30

Family

ID=81951506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210325529.5A Active CN114639442B (en) 2022-03-30 2022-03-30 Method and system for predicting open reading frame based on single nucleotide polymorphism

Country Status (1)

Country Link
CN (1) CN114639442B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106029891A (en) * 2013-12-20 2016-10-12 牛津生物医学(英国)有限公司 Viral vector production system
CN107027313A (en) * 2014-10-17 2017-08-08 宾州研究基金会 Methods and compositions for multiplex RNA-guided genome editing and other RNA technologies
CN108884473A (en) * 2016-03-21 2018-11-23 生物技术Rna制药有限公司 RNA replicon for versatile and efficient gene expression
CN110114461A (en) * 2016-08-17 2019-08-09 博德研究所 Novel CRISPR enzymes and systems
CN110556163A (en) * 2019-09-04 2019-12-10 广州基迪奥生物科技有限公司 Translatome-based method for analyzing small peptides translated from long non-coding RNAs
CN111527203A (en) * 2018-01-18 2020-08-11 弗门尼舍有限公司 Cytochrome P450 monooxygenase catalyzed oxidation of sesquiterpenes
CN113005139A (en) * 2021-03-19 2021-06-22 中国林业科学研究院林业研究所 Application of transcription factor PsMYB1 in regulation and control of synthesis of peony petal anthocyanin
CN113425857A (en) * 2013-06-17 2021-09-24 布罗德研究所有限公司 Delivery and use of CRISPR-CAS systems, vectors and compositions for liver targeting and therapy

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003091388A2 (en) * 2002-04-23 2003-11-06 Yeda Research And Development Co. Ltd. Polymorphic olfactory receptor genes and arrays, kits and methods utilizing them

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
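Performance comparison of optical modulation formats for 40 Gbit/s systems from the viewpoint of frequency utilization efficiency and tolerance for fiber nonlinearities; Norimatsu S et al.; 《Electronics & Communications in Japan》; Vol. 89 (No. 8); 50-64 *
Cloning, expression and single nucleotide polymorphism analysis of the BlSPL1 transcription factor gene in Betula luminifera; 李玉岭 et al.; 《林业科学》; Vol. 49 (No. 9); 52-61 *
Advances in the application of high-throughput sequencing to drug resistance in pathogenic microorganisms; 余甜 et al.; 《中国微生态学杂志》; (No. 6); 125-130 *
Study on factors influencing chicken growth and development and muscle fiber growth, and on the expression of related genes; 李娟; 《中国博士学位论文全文数据库 (农业科技辑)》; (No. 3); D050-23 *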

Also Published As

Publication number Publication date
CN114639442A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
Spealman et al. Conserved non-AUG uORFs revealed by a novel regression analysis of ribosome profiling data
Vincent et al. Next-generation sequencing (NGS) in the microbiological world: How to make the most of your money
Rastogi et al. Integrative analysis of large scale transcriptome data draws a comprehensive landscape of Phaeodactylum tricornutum genome and evolutionary origin of diatoms
Blevins et al. Uncovering de novo gene birth in yeast using deep transcriptomics
Nie et al. Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis
Bentolila et al. Comprehensive high-resolution analysis of the role of an Arabidopsis gene family in RNA editing
Man et al. Differential translation efficiency of orthologous genes is involved in phenotypic divergence of yeast species
CN107103205A (en) A bioinformatics method for annotating eukaryotic genomes based on proteomic mass spectrometry data
Roth et al. Measuring codon usage bias
Parts et al. Heritability and genetic basis of protein level variation in an outbred population
CN112397149A (en) Transcriptome analysis method and system without reference genome sequence
Liberman et al. Integrative systems biology: an attempt to describe a simple weed
Cusack et al. Predictive models of genetic redundancy in Arabidopsis thaliana
Wang et al. Recent advances in ribosome profiling for deciphering translational regulation
Li et al. Foster thy young: enhanced prediction of orphan genes in assembled genomes
CN110136776B (en) Method and system for predicting gene coding frame from low-quality ribosome footprinting data
Du et al. Prediction of C-to-U RNA editing sites in plant mitochondria using both biochemical and evolutionary information
Ahsan et al. Identification of epistasis loci underlying rice flowering time by controlling population stratification and polygenic effect
CN114639442B (en) Method and system for predicting open reading frame based on single nucleotide polymorphism
Mendoza-Revilla et al. A foundational large language model for edible plant genomes
Glick et al. The effect of methodological considerations on the construction of gene-based plant pan-genomes
Souilmi et al. Ancient human genomes reveal a hidden history of strong selection in Eurasia
Hu et al. Riboexp: an interpretable reinforcement learning framework for ribosome density modeling
Battlay et al. Large haploblocks underlie rapid adaptation in an invasive weed
CN111028885B (en) Method and device for detecting yak RNA editing site

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant