CN116665777A - Primer design method, system and storage medium based on primer template binding capacity - Google Patents


Info

Publication number
CN116665777A
Authority
CN
China
Prior art keywords
primer
candidate
target
sequence
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310544947.8A
Other languages
Chinese (zh)
Inventor
夏涵
杨军波
官远林
魏康飞
段美林
骆晨
胡龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuguo Microcode Biotechnology Co ltd Of Xixian New Area
Yuguo Zhizao Technology Beijing Co ltd
Yuguo Biotechnology Beijing Co ltd
Original Assignee
Yuguo Microcode Biotechnology Co ltd Of Xixian New Area
Yuguo Zhizao Technology Beijing Co ltd
Yuguo Biotechnology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuguo Microcode Biotechnology Co ltd Of Xixian New Area, Yuguo Zhizao Technology Beijing Co ltd, Yuguo Biotechnology Beijing Co ltd
Priority to CN202310544947.8A
Publication of CN116665777A
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20: Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Document Processing Apparatus (AREA)
  • General Factory Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a primer design method, system and storage medium based on primer template binding capacity. The method comprises the following steps: obtaining a target sequence; dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results; screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results; screening the candidate primers to obtain candidate primer pairs; and establishing a primer pool from the candidate primer pairs. By summarizing the key positions, number and types of mismatches, and digitizing this information to assist primer design, the application obtains better primer designs than existing primer design methods while sacrificing only a small amount of target-sequence coverage.

Description

Primer design method, system and storage medium based on primer template binding capacity
Technical Field
The application relates to the technical field of biological genes, in particular to a primer design method, a primer design system and a storage medium based on primer template binding capacity.
Background
Diseases caused by infection with pathogens (e.g., COVID-19) pose a fatal risk to human health, and timely and accurate detection of pathogens is critical for effective treatment and for preventing antibiotic abuse. However, infections with different pathogens can present similar clinical symptoms, so identifying a particular pathogen can be challenging. Traditional detection techniques, such as microscopy or biochemical testing, have limited ability to detect a wide range of pathogens. Moreover, these techniques may require culture or specific detection conditions, which can be time consuming and cumbersome. The advent of massively parallel genome sequencing has made it possible to explore the genetic makeup of clinical samples quickly and easily. Metagenomic next-generation sequencing (mNGS), metatranscriptomic next-generation sequencing (mtNGS) and targeted next-generation sequencing (tNGS) have significantly improved the efficiency of pathogen identification and have become increasingly popular in recent years. Although mNGS and mtNGS can provide comprehensive pathogen detection, billions of sequencing fragments (reads) are required to obtain positive pathogen reads because of the overwhelming effect of human genomic contamination and environmental microbial contamination. Some techniques can reduce human genome contamination through host DNA depletion during DNA extraction, but they are complex, expensive and time consuming, and are difficult to adopt widely in clinical practice in the short term. In contrast, tNGS combines multiplex PCR amplification with high-throughput sequencing to amplify multiple targets simultaneously in a single reaction. It can rapidly and economically detect hundreds of known pathogenic microorganisms as well as the virulence or drug resistance genes of pathogens.
In the prior art, CN116030882A provides a primer design method based on minimum degeneracy. It is a mismatch-tolerant primer design method based on the Viterbi algorithm and the nearest-neighbor model. Its core idea is not to find perfectly matching primers but to make the designed primers as similar as possible to the target sequence within a limited number of mismatches; by tolerating mismatches it achieves higher coverage of the target sequence, which can usually be quite high (> 96%). In practice, however, mismatches affect the binding capacity of the primer to the template. At non-critical positions, such as the middle of the primer, a mismatched primer typically retains 70-99% of the efficiency of a perfectly matched primer, whereas at critical positions, such as the 3' end, a mismatched primer retains only 10-60%. More strikingly, primer efficiency decreases linearly as the number of mismatches increases.
Disclosure of Invention
The embodiments of the application provide a primer design method, system and storage medium based on primer template binding capacity, which are intended to solve the prior-art problem that primer efficiency decreases linearly as the number of mismatches at key positions increases.
In one aspect, embodiments of the present application provide a primer design method based on primer template binding capacity, comprising:
obtaining a target sequence;
dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;
screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;
screening the candidate primers to obtain candidate primer pairs;
and establishing a primer pool by using the candidate primer pair.
In another aspect, the present application provides a primer design system based on primer template binding ability, including:
a sequence acquisition module for acquiring a target sequence;
the sequence analysis module is used for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;
the first primer screening module is used for screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;
the second primer screening module is used for screening the candidate primers to obtain candidate primer pairs;
and the primer pool establishment module is used for establishing a primer pool by utilizing the candidate primer pairs.
In another aspect, an embodiment of the present application further provides a computer storage medium, where a plurality of computer instructions are stored, where the plurality of computer instructions are configured to cause a computer to perform the method described above.
The primer design method, the system and the storage medium based on the primer template binding capacity have the following advantages:
Primer design is assisted by summarizing the key positions, number and types of mismatches and digitizing this information. Compared with existing primer design methods, the method provided by the application obtains better primer designs while sacrificing only a small amount of target-sequence coverage. Experiments show that, to add one new primer pair to a target primer pool, the existing design method requires 14.96 primers on average, whereas the present application requires only 5-10 primers. This greatly reduces research and development cost and facilitates the development of large clinical gene panels.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a primer design method based on primer template binding capacity provided by an embodiment of the application;
FIG. 2 is a schematic diagram of a mismatch information matrix according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing the influence of mismatch position and number on primer efficiency provided by an embodiment of the present application;
FIG. 4 is a schematic diagram showing the effect of mismatches on pathogen detection efficiency provided by an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
FIG. 1 is a flow chart of a primer design method based on the binding capacity of a primer template according to an embodiment of the present application. The embodiment of the application provides a primer design method based on primer template binding capacity, which comprises the following steps:
S100, acquiring a target sequence.
Illustratively, the acquired target sequence may be stored in a FASTA-format sequence file, and may be a CDS (coding sequence), a gene, a genome or another type of sequence.
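Illustratively, reading target sequences from FASTA-format text can be sketched as follows. This is a minimal sketch only: the function name parse_fasta and the example records are hypothetical and do not come from the application.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines of one record may be wrapped over several lines.
            records[current_id] += line.upper()
    return records


# Hypothetical example input (not data from the application).
example = [
    ">seq1 hypothetical CDS",
    "ATGGCC",
    "TTAA",
    ">seq2",
    "atgccc",
]
targets = parse_fasta(example)
```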
S110, dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results.
For example, because the number of target sequences is large, the target sequences can be clustered with the clustering software CD-HIT. Specifically, the similarity between target sequences is calculated and the sequences are clustered according to that similarity. The cluster analysis yields a plurality of target units, each representing one cluster, and the target sequences within each target unit are highly similar to one another.
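Illustratively, the idea of grouping target sequences into target units by similarity can be sketched with a simple greedy clustering. This is not CD-HIT itself and does not reproduce its algorithm or options; the similarity measure (the ratio from Python's difflib), the threshold and the function names are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def identity(a: str, b: str) -> float:
    """Approximate similarity in [0, 1]; a stand-in for a real identity score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative is similar enough.

    Sequences are visited from longest to shortest; the first member of each
    cluster acts as its representative.
    """
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no similar representative: start a new target unit
    return clusters


# Hypothetical toy sequences.
units = greedy_cluster(["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"])
```

Each inner list plays the role of one target unit. A production workflow would use the clustering tool named in the text rather than this quadratic-time sketch.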
After the target units are obtained, a number of target sequences, e.g., 200-1000, are extracted from each target unit and subjected to similarity analysis. In the embodiment of the application, the sequence alignment software MUSCLE and MAFFT can be used for this analysis: if the number of target sequences to be analyzed is lower than 50, MUSCLE is used; otherwise MAFFT is used.
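Illustratively, the sampling step and the choice between the two aligners can be sketched as follows. The function names, the random sampling strategy and the seed are assumptions; only the cutoff of 50 sequences and the upper bound of 1000 sampled sequences are taken from the text. The sketch returns the name of the tool and does not invoke it.

```python
import random
from typing import List


def sample_unit(unit: List[str], max_n: int = 1000, seed: int = 0) -> List[str]:
    """Return at most max_n sequences from a target unit (random sampling is an assumption)."""
    if len(unit) <= max_n:
        return list(unit)
    return random.Random(seed).sample(unit, max_n)


def choose_aligner(n_sequences: int, cutoff: int = 50) -> str:
    """Pick the aligner by sequence count: MUSCLE below the cutoff, MAFFT otherwise."""
    return "MUSCLE" if n_sequences < cutoff else "MAFFT"


# Hypothetical target unit with 1500 placeholder sequences.
unit = [f"SEQ{i}" for i in range(1500)]
subset = sample_unit(unit)
aligner = choose_aligner(len(subset))
```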
S120, screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results.
Illustratively, the method of the present application improves on the prior art by replacing the Viterbi algorithm used there with a hidden Markov model. The reasoning is that candidate primer selection can be summarized as follows: the current position of each primer is determined only by its previous moment, and the binding of the primer to the template is determined by the Gibbs free energy, which can be calculated with the nearest-neighbor model. Once the free energy has been calculated, the hidden parameters of the hidden Markov model, i.e. the best candidate primers, can be determined from it. After the hidden Markov model is established, the problem of screening candidate primers reduces to solving for the hidden parameters of the hidden Markov model.
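Illustratively, the free-energy term mentioned above can be sketched for a perfectly matched duplex with a nearest-neighbor sum. This sketch covers only the free-energy calculation, not the hidden Markov model, and is not the parameterization used in the application. The numerical values are the unified nearest-neighbor ΔG°37 parameters (kcal/mol) as commonly quoted from SantaLucia (1998) and should be verified against the original table before any real use; the symmetry correction for self-complementary duplexes is omitted as a simplification.

```python
# Nearest-neighbor stacking free energies at 37 °C in kcal/mol, keyed by the
# 5'->3' dinucleotide on one strand (assumed values; verify before use).
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}
INIT_GC = 0.98  # initiation penalty for a terminal G-C pair
INIT_AT = 1.03  # initiation penalty for a terminal A-T pair


def duplex_dg(seq: str) -> float:
    """Estimate the free energy of a perfectly matched primer-template duplex."""
    seq = seq.upper()
    if len(seq) < 2 or set(seq) - set("ACGT"):
        raise ValueError("sequence must contain at least two unambiguous bases")
    # Sum the stacking contribution of every adjacent base pair.
    dg = sum(NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1))
    # Add one initiation term per duplex end.
    for end in (seq[0], seq[-1]):
        dg += INIT_GC if end in "GC" else INIT_AT
    return dg
```

A more negative value indicates more stable binding, so under these parameters duplex_dg("GCGC") is lower than duplex_dg("ATAT").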
In the embodiment of the application, the primer template binding capacity matrix is shown in Table 1, and the mismatch information matrix comprises mismatch positions, mismatch number and mismatch types, as shown in FIG. 2. The existing mismatch information matrix, i.e. the Y-distance in FIG. 2, records only the number and positions of mismatches. The present application modifies this representation so that the new Y-distance also records the mismatch types. For example, the entry 17:0.9 in FIG. 2 uses a typical hash-style (key:value) record: 17 indicates that the 17th position of the primer is a mismatch, the colon denotes the correspondence, and 0.9 indicates that the mismatch at this position is between base R (A or G) and base C. The length of the entire list is 1, meaning that there is only one mismatch.
Table 1 Primer template binding capacity matrix
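Illustratively, the hash-style mismatch record described above (position mapped to a mismatch-type code) can be sketched as a dictionary. Only the pairing of R with C and its code 0.9 come from the text; the default code, the ordering of the pair as (primer base, site base), the comparison of the primer with its binding site written in the same orientation, and all names are assumptions.

```python
from typing import Dict

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Mismatch-type codes. ("R", "C") -> 0.9 follows the example in the text;
# every other code is unknown here, so a hypothetical default is used.
TYPE_CODE = {("R", "C"): 0.9}
DEFAULT_CODE = 0.5  # hypothetical placeholder


def mismatch_record(primer: str, site: str) -> Dict[int, float]:
    """Map each mismatched 1-based primer position to its mismatch-type code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    record: Dict[int, float] = {}
    for pos, (p, s) in enumerate(zip(primer.upper(), site.upper()), start=1):
        # Two symbols match if the base sets they stand for overlap.
        if not set(IUPAC[p]) & set(IUPAC[s]):
            record[pos] = TYPE_CODE.get((p, s), DEFAULT_CODE)
    return record


# Hypothetical 20-nt primer with R at position 17 facing C in the site.
primer = "ACGTACGTACGTACGT" + "R" + "CGT"
site = "ACGTACGTACGTACGT" + "C" + "CGT"
record = mismatch_record(primer, site)
```

Here record holds a single entry, position 17 with code 0.9, so its length of 1 corresponds to one mismatch.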
Further, after screening for candidate primers, the candidate primers are also filtered according to GC content, hairpin, melting temperature, GC clamp, dimer detection, error coverage.
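Illustratively, three of the listed filters (GC content, melting temperature and GC clamp) can be sketched as follows. All thresholds are hypothetical, the melting temperature uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C) rather than whatever model the application uses, and the hairpin, dimer and error-coverage checks are omitted.

```python
from typing import Tuple


def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature in °C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def has_gc_clamp(primer: str, window: int = 5, min_gc: int = 1, max_gc: int = 3) -> bool:
    """Check that the last `window` bases at the 3' end hold between min_gc and max_gc G/C."""
    tail = primer.upper()[-window:]
    n_gc = tail.count("G") + tail.count("C")
    return min_gc <= n_gc <= max_gc


def passes_filters(
    primer: str,
    gc_range: Tuple[float, float] = (0.4, 0.6),  # hypothetical bounds
    tm_range: Tuple[float, float] = (52.0, 65.0),  # hypothetical bounds
) -> bool:
    """Apply the three simplified filters in sequence."""
    return (
        gc_range[0] <= gc_content(primer) <= gc_range[1]
        and tm_range[0] <= wallace_tm(primer) <= tm_range[1]
        and has_gc_clamp(primer)
    )


# Hypothetical candidates.
kept = [p for p in ["ATGCATGCATGCATGCATGC", "ATATATATATATATATATAT"] if passes_filters(p)]
```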
S130, screening the candidate primers to obtain candidate primer pairs.
Illustratively, S120 yields single primers rather than primer pairs. To obtain primer pairs, the present application further selects suitable primer pairs based on information such as PCR product length, the melting temperature difference between the two primers of a pair, and the overall coverage of the primer pair.
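Illustratively, combining single primers into pairs by product length and melting temperature difference can be sketched as follows. The Primer record, the coordinate convention and all thresholds are assumptions, and the overall-coverage criterion is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Primer:
    name: str
    start: int     # leftmost alignment coordinate covered by the primer (0-based)
    length: int
    tm: float      # melting temperature in °C
    forward: bool  # True for forward primers, False for reverse primers


def pair_primers(
    primers: List[Primer],
    min_len: int = 100,        # hypothetical product-length bounds
    max_len: int = 300,
    max_tm_diff: float = 3.0,  # hypothetical melting-temperature tolerance
) -> List[Tuple[Primer, Primer, int]]:
    """Return (forward, reverse, product_length) triples that satisfy both criteria."""
    fwd = [p for p in primers if p.forward]
    rev = [p for p in primers if not p.forward]
    pairs = []
    for f in fwd:
        for r in rev:
            # The product spans from the start of the forward primer
            # to the end of the reverse primer.
            product = (r.start + r.length) - f.start
            if min_len <= product <= max_len and abs(f.tm - r.tm) <= max_tm_diff:
                pairs.append((f, r, product))
    # Prefer pairs whose melting temperatures are closest.
    return sorted(pairs, key=lambda t: abs(t[0].tm - t[1].tm))


# Hypothetical candidates.
f1 = Primer("F1", 0, 20, 60.0, True)
r1 = Primer("R1", 180, 20, 61.0, False)
r2 = Primer("R2", 500, 20, 60.0, False)  # product too long
r3 = Primer("R3", 130, 20, 70.0, False)  # melting temperature too different
pairs = pair_primers([f1, r1, r2, r3])
```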
S140, establishing a primer pool by using the candidate primer pairs.
Illustratively, when the primer pool is established, a loss function may be used to judge the compatibility between any two candidate primer pairs; a greedy algorithm is then used to screen the candidate primer pairs on the basis of this compatibility judgment, and the screened candidate primer pairs form the primer pool.
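Illustratively, a greedy pool construction driven by a pairwise loss can be sketched as follows. The application does not specify its loss function here, so the loss below (counting 3'-end complementarity between primers of two pairs) is a hypothetical stand-in, as are the names, the tail length k and the zero-loss threshold. Candidates are assumed to arrive already ranked from best to worst.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
PrimerPair = Tuple[str, str]  # (forward primer, reverse primer), both written 5'->3'


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_loss(a: str, b: str, k: int = 4) -> int:
    """Count 3'-end interactions: the last k bases of one primer can anneal inside the other."""
    return int(revcomp(a[-k:]) in b) + int(revcomp(b[-k:]) in a)


def pair_loss(p: PrimerPair, q: PrimerPair) -> int:
    """Hypothetical loss between two primer pairs: sum over all cross-pair primer combinations."""
    return sum(primer_loss(a, b) for a in p for b in q)


def greedy_pool(candidates: List[PrimerPair], max_loss: int = 0) -> List[PrimerPair]:
    """Add each candidate pair only if it is compatible with every pair already in the pool."""
    pool: List[PrimerPair] = []
    for cand in candidates:
        if all(pair_loss(cand, chosen) <= max_loss for chosen in pool):
            pool.append(cand)
    return pool


# Hypothetical toy pairs: the second pair's 3' end is complementary to the first pair.
pair1 = ("AAAACCCC", "GGTTGGTT")
pair2 = ("TTTTGGGG", "ACACACAC")
pair3 = ("CACACACA", "CTCTCTCT")
pool = greedy_pool([pair1, pair2, pair3])
```

The greedy choice never revisits earlier decisions, so the result depends on the input order; this keeps the sketch simple but does not guarantee the largest possible compatible pool.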
In one possible embodiment, the method may further include, after S140, performing specificity detection on the primers in the primer pool. Specifically, during specificity detection, if any two candidate primer pairs match the host at the same time, the candidate primer pairs are excluded from the primer pool when the number of mismatches in the match does not exceed the mismatch threshold and the product length is within the length threshold. In the present application the mismatch threshold is 3 and the length threshold is 2000.
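Illustratively, the exclusion rule can be sketched as follows, assuming that primer-to-host matches have already been computed by some alignment step not shown here. The HostHit record, the strand convention and the approximation of product length by the distance between hit positions are assumptions; the thresholds of 3 mismatches and a length of 2000 come from the text.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class HostHit:
    primer: str      # primer identifier
    position: int    # leftmost host coordinate of the match
    forward: bool    # True if the primer matches the host forward strand
    mismatches: int  # number of mismatches in the match


def off_target_primers(
    hits: List[HostHit],
    max_mismatches: int = 3,
    max_product: int = 2000,
) -> Set[str]:
    """Return primers that could jointly amplify a host product within the thresholds."""
    usable = [h for h in hits if h.mismatches <= max_mismatches]
    flagged: Set[str] = set()
    for f in usable:
        if not f.forward:
            continue
        for r in usable:
            if r.forward:
                continue
            # A product needs a reverse-strand hit downstream of the forward hit.
            product = r.position - f.position
            if 0 < product <= max_product:
                flagged.update((f.primer, r.primer))
    return flagged


# Hypothetical host hits.
hits = [
    HostHit("P1", 1000, True, 1),
    HostHit("P2", 1800, False, 3),
    HostHit("P3", 5000, False, 0),  # too far from P1
    HostHit("P4", 1500, False, 4),  # too many mismatches
]
flagged = off_target_primers(hits)
```

Primer pairs containing a flagged primer would then be removed from the pool.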
The key to annealing of primer and template is specific Watson-Crick hybridization between complementary bases. However, the favorable thermodynamics of the correctly paired bases may outweigh the destabilization caused by a few mismatches, so a primer can anneal to a template despite mismatches. Such mismatched annealing affects the efficiency of the PCR system, and severe mismatches can lead to non-specific amplification. The prior art has exploited the property that mismatched primers can still be extended and amplified, as shown in FIG. 4. In that approach, the number of mismatches is simply limited to at most two, each at least 4 bp away from the 3' end of the primer. According to the prior-art data, however, the positions most important for effective thermodynamic binding of primer and template are positions 2, 3 and 4 from the 5' end and the last position at the 3' end, as shown in FIG. 3. The present application therefore converts the primer template binding capacity into a numerical matrix and incorporates it into the primer design process, thereby achieving high-efficiency, high-coverage primer design.
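Illustratively, the simple positional rule mentioned above (at most two mismatches, each at least 4 bp from the 3' end) can be sketched as follows. Positions are taken as 1-based from the 5' end, and the distance from the 3' end is taken as the number of bases between the mismatch and the 3'-terminal base inclusive of that terminal base; both conventions and the function name are assumptions.

```python
from typing import Iterable


def passes_simple_rule(
    primer_len: int,
    mismatch_positions: Iterable[int],
    max_mismatches: int = 2,
    min_dist_3p: int = 4,
) -> bool:
    """Check the count limit and the minimum distance of every mismatch from the 3' end."""
    positions = list(mismatch_positions)
    if len(positions) > max_mismatches:
        return False
    # primer_len - pos is 0 for the 3'-terminal base, 1 for its neighbor, and so on.
    return all(primer_len - pos >= min_dist_3p for pos in positions)
```

For a 20-nt primer, passes_simple_rule(20, [5, 10]) is True, while passes_simple_rule(20, [17]) is False because position 17 lies only 3 bases from the 3' end. A rule of this kind treats all permitted positions alike, which is the limitation the numerical binding capacity matrix is meant to address.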
The embodiment of the application also provides a primer design system based on the primer template binding capacity, which comprises:
a sequence acquisition module for acquiring a target sequence;
the sequence analysis module is used for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;
the first primer screening module is used for screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;
the second primer screening module is used for screening the candidate primers to obtain candidate primer pairs;
and the primer pool establishment module is used for establishing a primer pool by utilizing the candidate primer pairs.
The embodiment of the application also provides a computer storage medium, wherein a plurality of computer instructions are stored in the computer storage medium, and the computer instructions are used for making a computer execute the method.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. The primer design method based on the primer template binding capacity is characterized by comprising the following steps:
obtaining a target sequence;
dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;
screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;
screening the candidate primers to obtain candidate primer pairs;
and establishing a primer pool by using the candidate primer pair.
2. The method for designing primers based on binding ability of primer template according to claim 1, wherein the primers in the primer pool are further subjected to a specificity test.
3. The method according to claim 2, wherein, in the specificity test, if any two candidate primer pairs match the host at the same time, the candidate primer pairs are excluded from the primer pool when the number of mismatches in the match does not exceed the mismatch threshold and the product length is within the length threshold.
4. The method according to claim 1, wherein the candidate primer is filtered based on GC content, hairpin, melting temperature, GC clamp, dimer detection, and error coverage after screening.
5. The method for designing a primer based on the binding capacity of a primer template according to claim 1, wherein the candidate primer pairs are selected based on the PCR product length, the melting temperature difference of the primer pair, and the overall coverage of the primer pair.
6. The method according to claim 1, wherein a loss function is used to determine compatibility between any two of the candidate primer pairs when the primer pool is established, a greedy algorithm is further used to screen candidate primer pairs judged by compatibility, and the screened candidate primer pairs form the primer pool.
7. A primer design system based on primer template binding capacity, comprising:
a sequence acquisition module for acquiring a target sequence;
the sequence analysis module is used for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;
the first primer screening module is used for screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;
the second primer screening module is used for screening the candidate primers to obtain candidate primer pairs;
and the primer pool establishment module is used for establishing a primer pool by utilizing the candidate primer pair.
8. A computer storage medium having stored therein a plurality of computer instructions for causing a computer to perform the method of any of claims 1-6.
CN202310544947.8A 2023-05-15 2023-05-15 Primer design method, system and storage medium based on primer template binding capacity Pending CN116665777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310544947.8A CN116665777A (en) 2023-05-15 2023-05-15 Primer design method, system and storage medium based on primer template binding capacity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310544947.8A CN116665777A (en) 2023-05-15 2023-05-15 Primer design method, system and storage medium based on primer template binding capacity

Publications (1)

Publication Number Publication Date
CN116665777A (en) 2023-08-29

Family

ID=87719816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310544947.8A Pending CN116665777A (en) 2023-05-15 2023-05-15 Primer design method, system and storage medium based on primer template binding capacity

Country Status (1)

Country Link
CN (1) CN116665777A (en)

Similar Documents

Publication Publication Date Title
US10127351B2 (en) Accurate and fast mapping of reads to genome
US9177099B2 (en) Method and systems for processing polymeric sequence data and related information
JP2022088566A (en) Method and system for generation and error-correction of unique molecular index sets with heterogeneous molecular lengths
JP2020513856A (en) Leveraging Sequence-Based Fecal Microbial Survey Data to Identify Multiple Biomarkers for Colorectal Cancer
JP2013531983A (en) Nucleic acids for multiplex biological detection and methods of use and production thereof
US20140288844A1 (en) Characterization of biological material in a sample or isolate using unassembled sequence information, probabilistic methods and trait-specific database catalogs
JP2023501538A (en) Identification of host RNA biomarkers of infection
CN115976235A (en) Identification method of Lactobacillus delbrueckii CICC6047 strain, primer, kit and application thereof
Ohta et al. Using nanopore sequencing to identify fungi from clinical samples with high phylogenetic resolution
TWI582631B (en) Dna sequence analyzing system for analyzing bacterial species and method thereof
CN116665777A (en) Primer design method, system and storage medium based on primer template binding capacity
US20200135300A1 (en) Applying low coverage whole genome sequencing for intelligent genomic routing
Sung Bioinformatics applications in genomics
CN112634983B (en) Pathogen species specific PCR primer optimization design method
Wan et al. Validation of mixed-genome microarrays as a method for genetic discrimination
Mahmod Novel methods to study intestinal microbiota
CN117737272A (en) Screening method for target microorganism markers and application of screening method
WO2024142013A1 (en) Method for error correction in nucleic acid sequencing
Davenport Short papers on current state of sequencing, metagenomics, and RNAseq for diagnostics
CN116030882A (en) Primer design method and system based on minimum degeneracy and computer storage medium
豊間根耕地 Studies on identification and evaluation of CRISPR diversity on human skin microbiome for development of a new personal identification method
Ogundolie et al. Microbiome characterization and identification: key emphasis on molecular approaches
Williams Application of Exact Alignments with an In-memory Core Gene Database for an Improved Metagenomic Taxonomic Classification
WO2023031485A1 (en) Method for the diagnosis and/or classification of a disease in a subject
WO2024118105A1 (en) Methods and compositions for mitigating index hopping in dna sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination