CN116665777A

CN116665777A - Primer design method, system and storage medium based on primer template binding capacity

Info

Publication number: CN116665777A
Application number: CN202310544947.8A
Authority: CN
Inventors: 夏涵; 杨军波; 官远林; 魏康飞; 段美林; 骆晨; 胡龙
Original assignee: Yuguo Microcode Biotechnology Co ltd Of Xixian New Area; Yuguo Zhizao Technology Beijing Co ltd; Yuguo Biotechnology Beijing Co ltd
Current assignee: Yuguo Microcode Biotechnology Co ltd Of Xixian New Area; Yuguo Zhizao Technology Beijing Co ltd; Yuguo Biotechnology Beijing Co ltd
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-08-29

Abstract

The application discloses a primer design method, system and storage medium based on primer template binding capacity. The method comprises the following steps: obtaining a target sequence; dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results; screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results; screening the candidate primers to obtain candidate primer pairs; and establishing a primer pool from the candidate primer pairs. By summarizing the key positions, number and types of mismatches and digitizing this information to aid primer design, the application obtains better primer designs than existing primer design methods at the cost of only a small loss in target sequence coverage.

Description

Primer design method, system and storage medium based on primer template binding capacity

Technical Field

The application relates to the technical field of biological genes, in particular to a primer design method, a primer design system and a storage medium based on primer template binding capacity.

Background

Diseases caused by infection with pathogens (e.g., COVID-19) pose a fatal risk to human health, and timely, accurate detection of pathogens is critical for effective treatment and for preventing antibiotic abuse. However, infections with different pathogens produce similar clinical symptoms, so identifying a particular pathogen can be challenging. Traditional detection techniques, such as microscopy or biochemical testing, have limited ability to detect a wide range of pathogens. Moreover, these techniques may require culture or specific conditions for detection, which can be time consuming and cumbersome. The advent of massively parallel genome sequencing has made it possible to explore the genetic complexity of clinical samples quickly and easily. Metagenomic next-generation sequencing (mNGS), metatranscriptomic next-generation sequencing (mtNGS), and targeted next-generation sequencing (tNGS) have significantly improved the efficiency of pathogen identification and have become increasingly popular in recent years. While mNGS and mtNGS can provide comprehensive pathogen detection, billions of sequencing fragments (reads) are required to obtain positive pathogen reads because of the overwhelming amount of human genomic and environmental microbial contamination. Although some techniques can reduce human genome contamination by depleting host DNA during DNA extraction, they are complex, expensive and time consuming, and are difficult to deploy widely in clinical practice in the short term. In contrast, tNGS combines multiplex PCR amplification with high-throughput sequencing to amplify multiple targets simultaneously in a single reaction. It can rapidly and economically detect hundreds of known pathogenic microorganisms as well as the virulence or drug resistance genes of pathogens.

In the prior art, CN116030882A provides a primer design method based on minimum degeneracy. It is a mismatch-tolerant primer design method based on the Viterbi algorithm and the nearest-neighbor model. Its core idea is not to find perfectly matching primers but to make the designed primers as similar as possible to the target sequences within a limited number of mismatches, achieving higher coverage of the target sequences by tolerating mismatches; the coverage reached is usually quite high (> 96%). However, in practice the binding capacity of the primer to the template is affected by the mismatches. At non-critical positions, such as the middle of the primer, the efficiency of a mismatched primer is usually 70-99% of that of a perfectly matched primer, but at critical positions, such as the 3' end, the efficiency of a mismatched primer is only 10-60%. More notably, primer efficiency decreases linearly as the number of mismatches increases.

Disclosure of Invention

The embodiment of the application provides a primer design method, a primer design system and a primer design storage medium based on primer template binding capacity, which are used for solving the problem that the primer efficiency is reduced linearly along with the increase of mismatch quantity at a key position in the prior art.

In one aspect, embodiments of the present application provide a primer design method based on primer template binding capacity, comprising:

obtaining a target sequence;

dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;

screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;

screening the candidate primers to obtain candidate primer pairs;

and establishing a primer pool by using the candidate primer pair.

In another aspect, the present application provides a primer design system based on primer template binding ability, including:

a sequence acquisition module for acquiring a target sequence;

a sequence analysis module for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;

a first primer screening module for screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;

the second primer screening module is used for screening the candidate primers to obtain candidate primer pairs;

and the primer pool establishment module is used for establishing a primer pool by utilizing the candidate primer pairs.

In another aspect, an embodiment of the present application further provides a computer storage medium, where a plurality of computer instructions are stored, where the plurality of computer instructions are configured to cause a computer to perform the method described above.

The primer design method, the system and the storage medium based on the primer template binding capacity have the following advantages:

Primer design is aided by summarizing the key positions, number and types of mismatches and digitizing this information. Compared with existing primer design methods, the method provided by the application obtains better primer designs while sacrificing only a little coverage of the target sequences. Experiments show that adding one new primer pair to a target primer pool requires 14.96 primers on average with the existing design method but only 5-10 primers with the present application, which greatly reduces research and development costs and facilitates the development of large clinical gene panels.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a primer design method based on primer template binding capacity provided by an embodiment of the application;

FIG. 2 is a schematic diagram of a mismatch information matrix according to an embodiment of the present application;

FIG. 3 is a schematic diagram showing the influence of mismatch position and number on primer efficiency provided by an embodiment of the present application;

FIG. 4 is a schematic diagram showing the effect of mismatches on pathogen detection efficiency provided by an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

FIG. 1 is a flow chart of a primer design method based on the binding capacity of a primer template according to an embodiment of the present application. The embodiment of the application provides a primer design method based on primer template binding capacity, which comprises the following steps:

S100, acquiring a target sequence.

Illustratively, the obtained target sequences may be stored in a FASTA-format sequence file and may be CDS (coding sequences), genes, genomes or other types of sequences.

S110, dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results.

For example, because the number of target sequences is large, the target sequences can be clustered with the clustering software CD-HIT. Specifically, the similarity between target sequences is calculated and the sequences are clustered according to that similarity. The cluster analysis yields a plurality of target units representing the clustering results, and the target sequences within each target unit are highly similar to one another.
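The clustering step can be illustrated with a minimal sketch of greedy incremental clustering, the general strategy used by tools such as CD-HIT. This is not the CD-HIT implementation: the `identity` function uses Python's `difflib` as a stand-in similarity measure, and the 0.8 threshold and the function names are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; real tools use alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(sequences, threshold=0.8):
    """Greedy incremental clustering: longer sequences are considered first,
    and each sequence joins the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first element is the representative
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative was similar enough: start a new target unit.
            clusters.append([seq])
    return clusters

# Hypothetical toy input: two near-identical sequences and one unrelated sequence.
units = greedy_cluster(["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"])
```

Each returned list plays the role of one target unit whose members are highly similar to its representative.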

After the target units are obtained, a number of target sequences (e.g., 200-1000) are extracted from each target unit and subjected to similarity analysis. In the embodiment of the application, the sequence alignment software MUSCLE and MAFFT can be used for the analysis: MUSCLE is used if the number of target sequences to be analyzed is below 50, and MAFFT is used otherwise.
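The sampling and aligner selection rule described above can be sketched as follows. The function names, the random sampling strategy and the default cap of 1000 are assumptions; the text states only the 200-1000 example range and the cutoff of 50 sequences. The sketch returns the name of the tool and does not invoke it.

```python
import random

def sample_sequences(sequences, max_count=1000, seed=0):
    """Take at most max_count sequences from a target unit (random sampling is an assumption)."""
    if len(sequences) <= max_count:
        return list(sequences)
    return random.Random(seed).sample(list(sequences), max_count)

def choose_aligner(n_sequences: int) -> str:
    """MUSCLE for fewer than 50 sequences, MAFFT otherwise, as stated in the text."""
    return "MUSCLE" if n_sequences < 50 else "MAFFT"

# Hypothetical target unit with 1500 placeholder sequences.
unit = [f"seq{i}" for i in range(1500)]
subset = sample_sequences(unit)
aligner = choose_aligner(len(subset))
```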

S120, screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results.

Illustratively, the method of the application improves on the prior art by recasting the prior-art Viterbi algorithm as a hidden Markov model. The rationale is that candidate primer selection can be summarized as follows: the current position of each primer is determined only by the previous moment (the Markov property), and the binding of a primer to the template is determined by the Gibbs free energy, which can be calculated with the nearest-neighbor model. Once the free energy has been calculated, the hidden parameters of the hidden Markov model, i.e., the best candidate primer, can be determined from it. After the hidden Markov model is established, screening candidate primers reduces to solving for the hidden parameters of the hidden Markov model.
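Solving for the hidden states of a hidden Markov model is commonly done with Viterbi decoding. The sketch below is a generic log-space Viterbi decoder on a toy model, not the model of the application: the state names, observation symbols and all probabilities are hypothetical, and the free-energy-based scoring described above is not reproduced.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observation sequence."""
    log = math.log
    # best[t][s] = (log-probability of the best path ending in state s at time t, predecessor)
    best = [{s: (log(start_p[s]) + log(emit_p[s][observations[0]]), None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Markov property: only the previous time step matters.
            score, back = max((prev[p][0] + log(trans_p[p][s]), p) for p in states)
            column[s] = (score + log(emit_p[s][obs]), back)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: alignment columns observed as "high" or "low" identity.
states = ["conserved", "variable"]
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {"conserved": {"conserved": 0.8, "variable": 0.2},
           "variable": {"conserved": 0.2, "variable": 0.8}}
emit_p = {"conserved": {"high": 0.9, "low": 0.1},
          "variable": {"high": 0.2, "low": 0.8}}
path = viterbi(["high", "high", "low", "low"], states, start_p, trans_p, emit_p)
```

In the application, the scores would presumably be derived from the calculated free energies rather than from fixed toy probabilities; that mapping is not specified in the text.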

In the embodiment of the application, the primer template binding capacity matrix is shown in Table 1, and the mismatch information matrix comprises the mismatch positions, the number of mismatches and the mismatch types, as shown in FIG. 2. The existing mismatch information matrix, i.e., Y-distance in FIG. 2, records only the number and positions of mismatches. The application modifies this representation so that the new Y-distance also records the mismatch types. For example, 17:0.9 in FIG. 2 is a typical hash-style (key:value) record: 17 means that position 17 of the primer is a mismatch, the colon denotes the correspondence, and 0.9 indicates that the mismatch at that position is between base R (A or G) and base C. The list has length 1, meaning there is only one mismatch.

Table 1. Primer template binding capacity matrix
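The hash-style mismatch record can be sketched as a dictionary from mismatch position to a value that encodes the mismatch type. In this sketch only the ("R", "C") entry with value 0.9 mirrors the 17:0.9 example; the lookup table, the default value, the 1-based position convention and the assumption that the binding site is written in the same orientation as the primer are all hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Hypothetical lookup: only ("R", "C") -> 0.9 comes from the example in the text.
MISMATCH_VALUES = {("R", "C"): 0.9}
DEFAULT_VALUE = 0.5  # placeholder for mismatch types not listed above

def mismatch_record(primer, site, values=MISMATCH_VALUES, default=DEFAULT_VALUE):
    """Return {position: value} for every primer position that does not match the site.

    Positions are 1-based; a degenerate primer base matches any base it stands for.
    """
    record = {}
    for pos, (p, s) in enumerate(zip(primer, site), start=1):
        if s not in IUPAC[p]:
            record[pos] = values.get((p, s), default)
    return record

# Primer with degenerate base R at position 17; the site carries C there.
record = mismatch_record("ACGTACGTACGTACGTRA", "ACGTACGTACGTACGTCA")
n_mismatches = len(record)
```

The length of the record gives the number of mismatches, its keys give their positions, and its values carry the mismatch types, so one structure holds all three kinds of information.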

Further, after screening for candidate primers, the candidate primers are also filtered according to GC content, hairpin, melting temperature, GC clamp, dimer detection, error coverage.
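Three of the listed filters can be sketched as simple checks. All thresholds below (GC range, clamp window, melting temperature range) are hypothetical, since the text gives none, and the melting temperature uses the rough Wallace rule, 2(A+T) + 4(G+C), as a stand-in for whatever model the application uses. Hairpin, dimer and error coverage checks are not shown.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    return sum(base in "GC" for base in primer) / len(primer)

def has_gc_clamp(primer: str, window: int = 5, min_gc: int = 1, max_gc: int = 3) -> bool:
    """True if the last `window` bases at the 3' end contain between min_gc and max_gc G/C bases."""
    n_gc = sum(base in "GC" for base in primer[-window:])
    return min_gc <= n_gc <= max_gc

def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees Celsius by the Wallace rule."""
    at = sum(base in "AT" for base in primer)
    gc = sum(base in "GC" for base in primer)
    return 2 * at + 4 * gc

def passes_basic_filters(primer, gc_range=(0.40, 0.60), tm_range=(50, 65)):
    """Apply the GC content, GC clamp and melting temperature checks (hypothetical thresholds)."""
    return (gc_range[0] <= gc_content(primer) <= gc_range[1]
            and has_gc_clamp(primer)
            and tm_range[0] <= wallace_tm(primer) <= tm_range[1])

# Hypothetical 20-nt candidates.
kept = passes_basic_filters("ACGTACGTACGTACGTACGC")
dropped = passes_basic_filters("AAAAAAAAAAAAAAAAAAAA")
```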

S130, screening the candidate primers to obtain candidate primer pairs.

Illustratively, S120 yields single primers rather than primer pairs. To obtain primer pairs, the application selects suitable pairs based on information such as PCR product length, the melting temperature difference within a primer pair, and the overall coverage of the primer pair.
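Pair selection along these lines might look like the sketch below. The `Primer` record, the coordinate convention, the thresholds (product length 100-300, melting temperature difference of at most 5) and the use of the lower of the two coverages as the pair's overall coverage are all assumptions made for illustration.

```python
from typing import NamedTuple

class Primer(NamedTuple):
    name: str
    pos: int         # amplicon boundary on the template: left edge (forward) or right edge (reverse)
    tm: float        # melting temperature
    coverage: float  # fraction of target sequences covered by this primer

def select_pairs(forwards, reverses, min_len=100, max_len=300, max_tm_diff=5.0):
    """Keep forward/reverse combinations with acceptable product length and Tm difference,
    ordered by a simple proxy for overall pair coverage (highest first)."""
    pairs = []
    for fwd in forwards:
        for rev in reverses:
            product_len = rev.pos - fwd.pos
            if not (min_len <= product_len <= max_len):
                continue
            if abs(fwd.tm - rev.tm) > max_tm_diff:
                continue
            pairs.append((fwd, rev))
    pairs.sort(key=lambda pair: min(pair[0].coverage, pair[1].coverage), reverse=True)
    return pairs

# Hypothetical candidates.
forwards = [Primer("F1", 100, 60.0, 0.98), Primer("F2", 150, 52.0, 0.99)]
reverses = [Primer("R1", 320, 61.0, 0.97), Primer("R2", 900, 60.0, 0.99)]
pairs = select_pairs(forwards, reverses)
pair_names = [(fwd.name, rev.name) for fwd, rev in pairs]
```

Here F1/R2 and F2/R2 fail the product length check and F2/R1 fails the melting temperature check, leaving F1/R1.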

S140, establishing a primer pool by using the candidate primer pairs.

Illustratively, when the primer pool is established, a loss function may be used to determine compatibility between any two candidate primer pairs, a greedy algorithm is further used to screen candidate primer pairs judged by compatibility, and the screened candidate primer pairs form the primer pool.
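A greedy pool construction of this kind can be sketched as follows. The text does not define the loss function, so this sketch substitutes a hypothetical one: the longest 3'-end overlap, up to 8 bases, at which a primer from one pair is the reverse complement of a primer from the other, used as a crude primer-dimer proxy. The threshold of 3 and the assumption that candidates arrive sorted by preference are likewise illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def three_prime_overlap(a: str, b: str, max_k: int = 8) -> int:
    """Largest k (up to max_k) such that the last k bases of a pair perfectly with the last k of b."""
    best = 0
    for k in range(1, min(max_k, len(a), len(b)) + 1):
        if a[-k:] == revcomp(b[-k:]):
            best = k
    return best

def pair_loss(pair_a, pair_b) -> int:
    """Hypothetical loss: worst 3'-end overlap between any primer of pair_a and any of pair_b."""
    return max(three_prime_overlap(p, q) for p in pair_a for q in pair_b)

def build_pool(candidate_pairs, max_loss: int = 3):
    """Greedy selection: keep a pair only if it is compatible with every pair already kept."""
    pool = []
    for pair in candidate_pairs:
        if all(pair_loss(pair, kept) <= max_loss for kept in pool):
            pool.append(pair)
    return pool

# Hypothetical candidates, assumed to be sorted by preference.
pair_a = ("ACGTACGTAAGGCC", "TTGACCATGACAAA")
pair_b = ("CATCATCATTGGCC", "GTGTGTGTGTAAAA")  # 3' end GGCC pairs with pair_a's GGCC
pair_c = ("ACACACACACACAC", "TCTCTCTCTCTCTC")
pool = build_pool([pair_a, pair_b, pair_c])
```

A greedy pass is fast but order dependent: a pair rejected early is never reconsidered, which is why the input ordering matters.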

In one possible embodiment, the method may further include, after S140, performing a specificity test on the primers in the primer pool. Specifically, during the specificity test, if any two candidate primer pairs match the host at the same time, the candidate primer pairs are excluded from the primer pool when the number of mismatches in the match does not exceed the mismatch threshold and the product length is within the length threshold. In the application, the mismatch threshold is 3 and the length threshold is 2000.
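The exclusion rule can be written as a small predicate. The thresholds 3 and 2000 come from the text; the function name, the per-primer interpretation of the mismatch count, and the treatment of the length threshold as an inclusive upper bound are assumptions.

```python
MISMATCH_THRESHOLD = 3    # from the text
LENGTH_THRESHOLD = 2000   # from the text

def amplifies_host(fwd_mismatches: int, rev_mismatches: int, product_length: int) -> bool:
    """True if both primers match the host within the mismatch threshold and the
    predicted product length is within the length threshold, so the pair should be excluded."""
    within_mismatch = max(fwd_mismatches, rev_mismatches) <= MISMATCH_THRESHOLD
    within_length = 0 < product_length <= LENGTH_THRESHOLD
    return within_mismatch and within_length

# Hypothetical host hits reported by an alignment step that is not shown here.
exclude = amplifies_host(fwd_mismatches=1, rev_mismatches=3, product_length=850)
keep_long = amplifies_host(fwd_mismatches=0, rev_mismatches=0, product_length=5000)
keep_mismatched = amplifies_host(fwd_mismatches=4, rev_mismatches=0, product_length=300)
```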

The key to primer and template annealing is specific Watson-Crick hybridization between complementary bases. However, the favorable thermodynamics of the correctly paired bases may outweigh the penalty caused by a few mismatches, which can lead to annealing between a primer and a mismatched template. Mismatched annealing of primer and template can reduce the efficiency of the PCR system, and severe mismatches can lead to non-specific amplification. The prior art exploits the fact that mismatched primers can still be extended and amplified, as shown in FIG. 4. The simple approach is to limit the number of mismatches to at most two and to keep them at least 4 bp away from the 3' end of the primer. According to prior-art data, however, the positions that matter most for the thermodynamic binding of primer and template are positions 2, 3 and 4 from the 5' end and the last position at the 3' end, as shown in FIG. 3. The application therefore converts the primer template binding capacity into a numerical matrix and incorporates it into the primer design process, thereby achieving efficient, high-coverage primer design.

The embodiment of the application also provides a primer design system based on the primer template binding capacity, which comprises:

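a sequence acquisition module for acquiring a target sequence;

a sequence analysis module for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;

a first primer screening module for screening candidate primers in each target unit by using a hidden Markov model, a primer template binding capacity matrix and a mismatch information matrix according to the multiple sequence alignment results;

a second primer screening module for screening the candidate primers to obtain candidate primer pairs;

and a primer pool establishment module for establishing a primer pool from the candidate primer pairs.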

The embodiment of the application also provides a computer storage medium, wherein a plurality of computer instructions are stored in the computer storage medium, and the computer instructions are used for making a computer execute the method.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The primer design method based on the primer template binding capacity is characterized by comprising the following steps:

obtaining a target sequence;

dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;

screening the candidate primers to obtain candidate primer pairs;

and establishing a primer pool by using the candidate primer pair.

2. The method for designing primers based on binding ability of primer template according to claim 1, wherein the primers in the primer pool are further subjected to a specificity test.

3. The method according to claim 2, wherein, in the specificity test, if any two candidate primer pairs are simultaneously matched to the host, the candidate primer pairs are excluded from the primer pool when the number of mismatches in the matching process does not exceed the mismatch threshold and the product length is within the length threshold.

4. The method according to claim 1, wherein the candidate primer is filtered based on GC content, hairpin, melting temperature, GC clamp, dimer detection, and error coverage after screening.

5. The method for designing a primer based on the binding capacity of a primer template according to claim 1, wherein the candidate primer pairs are selected based on the length of the PCR product, the difference in melting temperature of the primer pair, and the overall coverage of the primer pair.

6. The method according to claim 1, wherein a loss function is used to determine compatibility between any two of the candidate primer pairs when the primer pool is established, a greedy algorithm is further used to screen candidate primer pairs judged by compatibility, and the screened candidate primer pairs form the primer pool.

7. A primer design system based on primer template binding capacity, comprising:

a sequence acquisition module for acquiring a target sequence;

a sequence analysis module for dividing the target sequence into a plurality of target units, and performing similarity analysis on the sequences in each target unit to obtain corresponding multiple sequence alignment results;

and the primer pool establishment module is used for establishing a primer pool by utilizing the candidate primer pair.

8. A computer storage medium having stored therein a plurality of computer instructions for causing a computer to perform the method of any of claims 1-6.