WO2019242186A1 - Method, apparatus, computer device and storage medium for determining target to be detected - Google Patents

Method, apparatus, computer device and storage medium for determining target to be detected

Info

Publication number
WO2019242186A1
WO2019242186A1 PCT/CN2018/111924 CN2018111924W WO2019242186A1 WO 2019242186 A1 WO2019242186 A1 WO 2019242186A1 CN 2018111924 W CN2018111924 W CN 2018111924W WO 2019242186 A1 WO2019242186 A1 WO 2019242186A1
Authority
WO
WIPO (PCT)
Prior art keywords
specific
genome
overlapping
mer
preset
Prior art date
Application number
PCT/CN2018/111924
Other languages
French (fr)
Chinese (zh)
Inventor
孙亚洲
杜刘稳
陈斌
牛团结
肖贡
郭婷
曾柳眉
陈杰
Original Assignee
深圳市达仁基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市达仁基因科技有限公司
Publication of WO2019242186A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present application relates to a method, an apparatus, a computer device, and a storage medium for determining a detection target.
  • a target is a special nucleic acid fragment that can be used in polymerase chain reaction (PCR), antibody-antigen reactions, and hybridization probe reactions.
  • a method, an apparatus, a computer device, and a storage medium for determining a detection target are provided.
  • a method for determining a detection target includes:
  • the specific k-mer is a k-mer that satisfies preset specificity conditions, and k-mer refers to a genomic sequence of length k;
  • a device for determining a detection target includes:
  • a specific k-mer acquisition module configured to acquire the specific k-mers included in the target pathogen operation group from a target database, where a specific k-mer is a k-mer that satisfies a preset specific condition and a k-mer refers to a genomic sequence of length k, and to determine the specific k-mers included in each genome included in the target pathogen operating group;
  • a non-overlapping specific region acquisition module configured to process the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, where the non-overlapping specific region set includes non-overlapping specific regions;
  • a detection target selection module configured to obtain the number of occurrences, in all the non-overlapping specific region sets, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome, and to select a non-overlapping specific region whose number of occurrences exceeds a preset number of times threshold as a detection target of the target pathogen operation group.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps: determining a target pathogen operating group to be detected;
  • the specific k-mer is a k-mer that satisfies preset specificity conditions, and k-mer refers to a genomic sequence of length k;
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: determining the target pathogen operating group to be detected;
  • the specific k-mer is a k-mer that satisfies preset specificity conditions, and k-mer refers to a genomic sequence of length k;
  • FIG. 1 is a schematic flowchart of a method for determining a detection target according to one or more embodiments.
  • FIG. 2 is a schematic flow chart before step 102 according to one or more embodiments.
  • FIG. 3 is a schematic flowchart of step 108 according to one or more embodiments.
  • FIG. 4 is a schematic flowchart of step 306 according to one or more embodiments.
  • FIG. 5 is a schematic flow chart after step 306 according to one or more embodiments.
  • FIG. 6 is a schematic flowchart of step 110 according to one or more embodiments.
  • FIG. 7 is a schematic flow chart after step 112 according to one or more embodiments.
  • FIG. 8 is a schematic flowchart of step 704 according to one or more embodiments.
  • FIG. 9 is a schematic flow chart after step 112 according to another embodiment.
  • FIG. 10 is a schematic flowchart of step 906 according to one or more embodiments.
  • FIG. 11 is a schematic flowchart of step 912 according to one or more embodiments.
  • FIG. 12 is a schematic flowchart of step 1104 according to one or more embodiments.
  • FIG. 13 is a schematic flowchart of a method for determining a detection target according to one or more other embodiments.
  • FIG. 14 is a schematic flowchart of step 1302 according to one or more embodiments.
  • FIG. 15 is a schematic flowchart of step 1304 according to one or more embodiments.
  • FIG. 16 is a structural block diagram of an apparatus for determining a detection target according to one or more embodiments.
  • FIG. 17 is an internal structural diagram of a computer device according to one or more embodiments.
  • a method for determining a detection target including the following steps:
  • Step 102: Determine a target pathogen operation group to be detected.
  • a pathogen operating group can represent a genetic unit or a taxonomic unit of different classification levels such as a species, a subspecies, a subtype, a strain or virus strain, or a genus.
  • a pathogen operating group may include one or more related genomes.
  • the target pathogen operating group refers to a pathogen operating group to be detected.
  • for example, if the pathogen operating group to be detected is Staphylococcus aureus, the target pathogen operating group in step 102 refers to Staphylococcus aureus.
  • Step 104: Obtain the specific k-mer included in the target pathogen operation group from the target database.
  • the specific k-mer is a k-mer that meets the preset specific conditions.
  • the k-mer refers to a genomic sequence of length k.
  • the target database stores a characteristic target sequence set previously established for each pathogen operating group, and the characteristic target sequence set corresponding to each pathogen operating group includes the specific k-mers corresponding to that pathogen operating group. Therefore, the specific k-mers included in the target pathogen operation group can be obtained from the target database.
  • the specific k-mer refers to a k-mer selected from the k-mers included in the target pathogen operation group and meeting a preset specificity condition, that is, a specific k-mer corresponding to the target pathogen operation group.
  • the preset specific condition is a condition set by a technician in advance for selecting a matching k-mer. The preset specific condition may be determined according to a technician's consideration or an actual project requirement.
  • k-mer refers to a genomic sequence of length k, where k is a natural number. If the genomic data contain a distinct deterministic characters, then for a particular k there can be at most a^k (a to the power of k) distinct k-mers.
  • for nucleic acid sequences, the deterministic characters are the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, the deterministic characters are the defined amino acid characters.
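  • as a minimal sketch of the counting argument above, the following Python snippet computes the upper bound a^k and extracts the distinct k-mers of one sequence. The function names and the restriction to the four DNA bases (a = 4) are assumptions made for illustration and are not part of the application.

```python
from itertools import product

# Assumption for illustration: a DNA-only alphabet, so a = 4.
DNA_ALPHABET = "ACGT"


def max_distinct_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound on the number of distinct k-mers: a to the power of k."""
    return alphabet_size ** k


def enumerate_kmers(alphabet: str, k: int) -> list[str]:
    """List every possible k-mer over the given alphabet."""
    return ["".join(chars) for chars in product(alphabet, repeat=k)]


def kmers_in_sequence(sequence: str, k: int) -> set[str]:
    """Distinct k-mers obtained by sliding a window of length k over the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
```

    A sequence of length L contains at most L - k + 1 distinct k-mers, so a real genome normally covers only a small part of the a^k possibilities once k is large.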
  • Step 106: Determine a specific k-mer included in each genome included in the target pathogen operating group.
  • a pathogen operating group can include one or more related genomes. Therefore, the target pathogen operating group includes one or more related genomes, and each genome includes one or more k-mers. Specific k-mers are k-mers that satisfy preset specificity conditions, so each genome contains one or more specific k-mers. Therefore, the specific k-mer contained in each genome contained in the target pathogen operating group to be detected can be determined.
  • Step 108: The specific k-mer included in each genome is processed to obtain a set of non-overlapping specific regions corresponding to each genome, and the set of non-overlapping specific regions includes non-overlapping specific regions.
  • the specific k-mer contained in each genome can be processed. After processing the specific k-mer contained in each genome, a set of non-overlapping specific regions corresponding to each genome can be obtained.
  • the set of non-overlapping specific regions corresponding to each genome includes one or more non-overlapping specific regions corresponding to each genome.
  • the difference between a non-overlapping specific region and a specific k-mer is that the specific k-mer has a length limitation and is a sequence of a specific length k, while the non-overlapping specific region does not have any limitation on its length.
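  • this passage does not say how the specific k-mers are processed into non-overlapping specific regions. One plausible reading, assumed here purely for illustration, is that specific k-mer hits that overlap or directly adjoin one another on a genome are merged into a single longer region. The function names and the half-open (start, end) coordinate convention below are hypothetical.

```python
def merge_kmer_hits(start_positions, k):
    """Merge k-mer hits into non-overlapping regions.

    start_positions: 0-based start offsets of specific k-mers on one genome.
    Returns half-open (start, end) intervals; hits that overlap or directly
    adjoin the previous region are merged into it (an assumed rule).
    """
    regions = []
    for start in sorted(set(start_positions)):
        end = start + k
        if regions and start <= regions[-1][1]:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def extract_regions(genome, start_positions, k):
    """Return the region sequences, which may be longer than k."""
    return [genome[s:e] for s, e in merge_kmer_hits(start_positions, k)]
```

    Under this assumption a region is at least k long and has no upper bound on its length, which matches the stated difference between a non-overlapping specific region and a specific k-mer.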
  • Step 110: Obtain the number of occurrences of each non-overlapping specific region included in the non-overlapping specific region set corresponding to each genome in all the non-overlapping specific region sets.
  • the number of occurrences, in all the non-overlapping specific region sets, of each non-overlapping specific region in each non-overlapping specific region set can be obtained.
  • the set of non-overlapping specific regions corresponding to each genome can be regarded as a small set, and the non-overlapping specific region sets of all the genomes together form a complete set of non-overlapping specific regions; that is, the complete set is composed of multiple small sets.
  • the complete set of non-overlapping specific regions contains all the non-overlapping specific regions of all the genomes included in the pathogen operation group. Therefore, for each non-overlapping specific region in the non-overlapping specific region set of each genome, its number of occurrences in the complete set of non-overlapping specific regions can be obtained.
  • Step 112: A non-overlapping specific region whose number of occurrences exceeds a preset number of times threshold is selected as a detection target of the target pathogen operation group.
  • a non-overlapping specific region whose number of occurrences exceeds the preset number of times threshold can be selected as a detection target of the target pathogen operation group.
  • the preset number of times threshold can be set by technicians according to actual project requirements.
  • the selected non-overlapping specific regions may be multiple, or may correspond to multiple genomes.
  • the specific k-mers of the target pathogen operating group are obtained, and then, from the non-overlapping specific regions obtained according to the specific k-mers, a non-overlapping specific region that meets the preset number of times threshold is selected as the detection target of the target pathogen operating group. Because the specific k-mers are determined by probabilistic preset specific conditions, and the sets of non-overlapping specific regions from which the qualifying non-overlapping specific regions are finally selected as detection targets are obtained based on the specific k-mers, this technical solution greatly expands the search range of potential detection targets, increases the flexibility of limiting the search range of detection targets, and improves the efficiency of determining detection targets.
  • the preset number of times threshold is (1-Y) * N, where Y is a preset first condition threshold and N is the number of non-overlapping specific region sets.
  • a non-overlapping specific region is selected to be a non-overlapping specific region set. Overlap specific regions.
  • the preset number of times threshold is equal to (1-Y) * N.
  • Y is a preset first condition threshold, and the preset first condition threshold Y can be set to less than 5%, which can be specifically set by a technician according to the actual situation.
  • N is the number of non-overlapping specific region sets, and each genome corresponds to a non-overlapping specific region set, so N is actually the number of genomes contained in the target pathogen operating group.
  • a non-overlapping specific region does not necessarily appear in every genome, so the number of times any non-overlapping specific region appears across all the non-overlapping specific region sets is less than or equal to N.
  • the preset first condition threshold is less than 5%.
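  • steps 110 and 112 can be sketched as follows, assuming each genome's non-overlapping specific regions are available as a collection of sequence strings and that "exceeding" means a strict comparison. The function name and the data layout are assumptions for illustration only.

```python
from collections import Counter


def select_detection_targets(region_sets, y):
    """Pick regions whose occurrence count exceeds (1 - Y) * N.

    region_sets: one collection of region sequences per genome, so
                 N = len(region_sets).
    y:           the preset first condition threshold Y (e.g. below 0.05).
    """
    n = len(region_sets)
    threshold = (1 - y) * n
    counts = Counter()
    for regions in region_sets:
        # A region contributes at most one occurrence per genome's set.
        counts.update(set(regions))
    return sorted(region for region, count in counts.items() if count > threshold)
```

    Because a region can occur at most N times, a strict comparison selects nothing when Y is exactly 0; the text's suggestion of a small positive Y avoids that corner case.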
  • the specific k-mer refers to the k-mer in the pathogen operation group in which the number of occurrences in the genome occurrence number index table of the target pathogen operation group meets a preset error condition.
  • the characteristic target sequence set corresponding to each pathogen operation group includes a specific k-mer that satisfies a preset specific condition in each pathogen operation group.
  • the preset specific condition is that, for a k-mer included in a pathogen operation group, its number of occurrences in the genome occurrence number index table of the target pathogen operation group satisfies a preset error condition.
  • the preset error condition refers to the error condition preset by the technician according to the actual project requirements.
  • the error condition can be a range; that is, a k-mer selected as specific is allowed a certain error, rather than having to fully satisfy some strict objective condition.
  • a target pathogen operating group has a corresponding genome occurrence number index table.
  • from the genome occurrence number index table corresponding to the target pathogen operation group, it can be known in how many of the genomes contained in the target pathogen operation group each k-mer included in the group appears, so that the specific k-mers of the target pathogen operation group can be selected.
  • the specific sequence representing the target pathogen operation group can be found with a high probability within a certain error range. Because the specific k-mers are determined by probabilistic preset specificity conditions, and the non-overlapping specific region sets from which the qualifying non-overlapping specific regions are finally selected as detection targets are obtained based on the specific k-mers, this technical solution greatly expands the search range of potential detection targets, increases the flexibility of limiting the search range of detection targets, and improves the efficiency of determining detection targets.
  • before the above step 102, the method further includes the following steps: generating a genome occurrence number index table corresponding to the target pathogen operation group, where the table records, for each k-mer, the number of genomes included in the target pathogen operation group that contain that k-mer; and storing the genome occurrence number index table in the characteristic target sequence set corresponding to the target pathogen operating group.
  • the genome is all the genetic information in an organism. This genetic information is stored in the form of a nucleotide sequence. The sum of the genetic material in a complete monomer of an organism (such as an animal or plant individual, or animal or plant cell, or bacterial individual) is the genome.
  • each pathogen operating group can include multiple genomes, and each genome can include multiple k-mers.
  • the genome occurrence number index table corresponding to each pathogen operation group records in how many genomes of that pathogen operation group each k-mer contained in the group has appeared; that is, for each k-mer, the table records the number of genomes in the corresponding pathogen operating group that contain the k-mer.
  • the genome occurrence number index table corresponding to the target pathogen operation group records, for each k-mer included in the target pathogen operation group, in how many of the genomes included in the group the k-mer has appeared. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. After this count has been obtained for each k-mer included in the target pathogen operation group, the genome occurrence number index table corresponding to the target pathogen operation group can be established.
  • the genome occurrence number index table corresponding to the target pathogen operating group can be stored in the characteristic target sequence set corresponding to the target pathogen operating group, that is, stored in the target database. After storage, the table can be retrieved from the database whenever it is needed, thereby improving the detection efficiency.
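  • the counting rule described above (one count per genome, however often the k-mer repeats within it) can be sketched as follows. The dictionary-of-sequences input and the function names are assumptions for illustration; the application does not prescribe a data structure.

```python
from collections import Counter


def kmers_of(sequence, k):
    """Distinct k-mers of one genome sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it.

    genomes: dict of genome identifier -> sequence string (hypothetical layout).
    Using a set per genome ensures repeats inside one genome count once.
    """
    index = Counter()
    for sequence in genomes.values():
        index.update(kmers_of(sequence, k))
    return index
```

    The same function could be applied to the genomes of a single pathogen operating group or to any larger collection of genomes, since only the input set of genomes changes.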
  • the method further includes the following steps:
  • Step 100: Select a k-mer that satisfies a preset specific condition from the k-mers corresponding to the target pathogen operation group.
  • Step 101: Store the k-mer that satisfies the preset specific condition into the characteristic target sequence set corresponding to the target pathogen operation group.
  • a characteristic target sequence set corresponding to the target pathogen operating group is stored, and the characteristic target sequence set includes the specific k-mers corresponding to the target pathogen operating group.
  • specific k-mers are the k-mers that are selected from the k-mers included in the target pathogen operation group and that meet the preset specific conditions.
  • a specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence number index table corresponding to the target pathogen operation group meets a first preset error condition; and its number of occurrences in the genome occurrence number index table corresponding to the target pathogen operation group, together with its number of occurrences in the genome occurrence number index table of the complete set, meets a second preset error condition.
  • the genome occurrence number index table corresponding to the target pathogen operation group records, for each k-mer, the number of genomes included in the target pathogen operation group that contain the k-mer; the genome occurrence number index table of the complete set records, for each k-mer, the number of genomes included in the complete set that contain the k-mer.
  • the target pathogen operating group has a corresponding characteristic target sequence set, and the specific k-mer included in the characteristic target sequence set refers to a k-mer that satisfies a preset specific condition.
  • the preset specific condition includes a first preset error condition and a second preset error condition.
  • the complete set refers to the collection of all the high-confidence genomes collected.
  • the high-confidence genomes include both pathogen genomes and non-pathogen genomes, for example high-confidence genomes of symbiotic bacteria, probiotics, humans, animals, and plants.
  • a high-confidence genome refers to a selected genome that meets a preset reliability condition.
  • the count corresponding to each k-mer recorded in the genome occurrence number index table of the complete set represents in how many genomes of the complete set the k-mer has appeared. If the k-mer appears multiple times in the same genome, it is counted only once.
  • the genome occurrence number index table of the target pathogen operating group records, for each k-mer, the number of genomes contained in the target pathogen operation group that contain the k-mer, and the genome occurrence number index table of the complete set records, for each k-mer, the number of genomes included in the complete set that contain the k-mer.
  • the selection of the specific k-mer involves two parameters, a first preset error condition and a second preset error condition, and thus allows the specific k-mer to be non-specific within a certain range. Without these two parameters, non-specificity within a certain range cannot be allowed, and it is often difficult to find a specific k-mer for a pathogen operating group. Therefore, by selecting specific k-mers that allow a certain amount of error, and thereby establishing a set of characteristic target sequences, a specific target that can represent the pathogen operating group can be found with high probability.
  • the first preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table of the target pathogen operation group to the number of genomes included in the target pathogen operation group, and the first threshold, is greater than or equal to 1.
  • the first preset error condition means that the ratio of the number of occurrences recorded in the genome occurrence number index table corresponding to the target pathogen operation group to the number of genomes contained in the target pathogen operation group, plus the first threshold, is greater than or equal to 1.
  • if the target pathogen operating group contains N genomes, the number of occurrences of a k-mer in the genome occurrence number index table of the target pathogen operating group is C1, and the first threshold is P1, then the first preset error condition is C1 / N + P1 ≥ 1.
  • the first threshold value P1 represents an acceptable error probability, and can be any value between 0 and 1.
  • the first threshold value can be set by a technician according to the actual project.
  • the first threshold is less than 5%.
  • the first threshold is an acceptable error probability.
  • the first threshold may be any value between 0 and 1.
  • the first threshold may be set to a value less than 5%.
  • the second preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table of the target pathogen operation group to the number of occurrences in the genome occurrence number index table of the complete set, and the second threshold, is greater than or equal to 1.
  • the second preset error condition means that the ratio of the number of occurrences recorded in the genome occurrence number index table corresponding to the target pathogen operating group to the number of occurrences in the genome occurrence number index table of the complete set, plus the second threshold, is greater than or equal to 1.
  • if the number of occurrences of a k-mer in the genome occurrence number index table of the target pathogen operation group is C1, the number of occurrences of the k-mer in the genome occurrence number index table of the complete set is C2, and the second threshold is P2, then the second preset error condition is C1 / C2 + P2 ≥ 1.
  • like the above-mentioned first threshold, the second threshold represents an acceptable error probability and can be any value between 0 and 1.
  • the second threshold value P2 can also be set by a technician according to the actual project.
  • the second threshold is less than 5%.
  • like the first threshold, the second threshold represents an acceptable error probability.
  • the second threshold value can also be any value between 0 and 1, and the second threshold value can be set to a value less than 5%.
  • the first threshold and the second threshold may be equal or different.
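  • the two preset error conditions can be sketched as a simple filter over the two index tables. In words, the text requires that C1 divided by N plus P1 be at least 1, and that C1 divided by C2 plus P2 be at least 1. The function names, the dictionary inputs, and the fallback that treats C2 as no smaller than C1 (on the assumption that the complete set contains the group's own genomes) are illustrative assumptions, not the application's implementation.

```python
def is_specific(c1, n, c2, p1, p2):
    """Check both preset error conditions for one k-mer.

    c1: genomes of the target group containing the k-mer
    n:  number of genomes in the target group
    c2: genomes of the complete set containing the k-mer
    p1, p2: first and second thresholds (acceptable error probabilities)
    """
    first_ok = c1 / n + p1 >= 1    # present in nearly all genomes of the group
    second_ok = c1 / c2 + p2 >= 1  # nearly absent outside the group
    return first_ok and second_ok


def select_specific_kmers(group_index, full_index, n, p1=0.04, p2=0.04):
    """Return the k-mers of the group index that satisfy both conditions."""
    selected = []
    for kmer, c1 in group_index.items():
        # Assumption: the complete set includes the group, so C2 >= C1.
        c2 = max(full_index.get(kmer, 0), c1)
        if c1 > 0 and is_specific(c1, n, c2, p1, p2):
            selected.append(kmer)
    return sorted(selected)
```

    With P1 = P2 = 0 the filter accepts only k-mers that are present in every genome of the group and in no genome outside it, which is the strict case the thresholds are meant to relax.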
  • before obtaining the sequencing data of the sample, the method further includes: generating a genome occurrence number index table of the complete set, which records, for each k-mer, the number of genomes included in the complete set that contain the k-mer; and storing the genome occurrence number index table of the complete set in the target database.
  • a characteristic target sequence set corresponding to each pathogen operating group is stored.
  • the full set contains all the high-confidence genomes collected, that is, the full set contains both the high-reliability genomes of multiple pathogen operating groups and the high-reliability genomes of multiple non-pathogenic operating groups.
  • an index table of the number of occurrences of the genome of the complete set can be generated.
  • the genome occurrence number index table of the complete set records in how many genomes of the complete set the k-mers contained in each pathogen operation group have appeared; that is, for each k-mer, it records the number of genomes included in the complete set that contain the k-mer.
  • in other words, the genome occurrence number index table of the complete set records in how many genomes of the complete set each k-mer appears.
  • what is counted is the number of genomes, not the number of k-mer occurrences. If a k-mer occurs more than once in the same genome, it is still counted only once in the genome occurrence number index table of the complete set.
  • an index table of the number of occurrences of the genome for the complete set can be established.
  • the genome occurrence index table of the complete set is different from the genome occurrence index table corresponding to each pathogen operation group.
  • the genome occurrence number index table of a pathogen operation group corresponds to that pathogen operation group, and each pathogen operation group has its own genome occurrence number index table, whereas only one genome occurrence number index table is generated for the complete set, covering all the data. After the generated genome occurrence number index table of the complete set is stored, it can be retrieved from the database whenever it is needed in the process of detecting the sequencing data, thereby improving the detection efficiency.
  • the above step 106 includes: using each genome included in the target pathogen operating group as a reference genome in turn; mapping each specific k-mer included in the target pathogen operating group to the reference genome; and taking the specific k-mers mapped to the reference genome as the specific k-mers contained in the reference genome.
  • Each genome can be used as a reference genome in turn, and the specific k-mer included in the target pathogen operating group can be mapped to the reference genome. Because the specific k-mer is a k-mer that is pre-selected to meet the preset specificity conditions, there may be cases where some specific k-mers cannot be mapped to a certain genome.
  • the specific k-mer successfully mapped to the reference genome can be used as the specific k-mer included in the reference genome.
  • if some specific k-mers cannot be mapped to a certain genome, those specific k-mers can be considered not to be included in that genome. Mapping can therefore also be regarded as reconfirming the specific k-mers included in each genome. Through this mapping operation, the specific k-mers contained in each genome are double-checked, which improves fault tolerance.
  • taking the specific k-mer mapped to the reference genome as the specific k-mer included in the reference genome includes: sequentially selecting a region from the reference genome for comparison with the specific k-mer; and when it is detected that the similarity between the selected region and the specific k-mer reaches a preset similarity threshold, taking the specific k-mer as a specific k-mer included in the reference genome.
  • a region can be selected from the reference genome in turn and compared with the specific k-mer.
  • the selected region is a gene sequence. By comparing the selected gene sequence with a specific k-mer, the similarity of the two sequences can be detected. When it is detected that the similarity between the selected region and the specific k-mer reaches a preset similarity threshold, the specific k-mer can be considered to be included in the reference genome; that is, the specific k-mer can be taken as a specific k-mer contained in the reference genome.
  • the reference genome can be regarded as a string of many bases.
  • sequences of length k can be selected in turn from the string of the reference genome and compared with the specific k-mer. If the similarity between the selected sequence and the specific k-mer reaches a preset similarity threshold, the specific k-mer can be considered a specific k-mer of the reference genome.
  • the preset similarity threshold can be customized by a technician. For example, when the preset similarity threshold is set to 99%, if the similarity between a specific k-mer and a region on the reference genome reaches or exceeds 99%, the specific k-mer is considered to belong to the reference genome.
  • taking the specific k-mer mapped to the reference genome as the specific k-mer included in the reference genome includes: sequentially selecting from the reference genome a sequence having the same length as the specific k-mer, and comparing the selected sequence with the specific k-mer; and when it is detected that the selected sequence is identical to the specific k-mer, taking the specific k-mer as a specific k-mer included in the reference genome.
  • sequences having the same length as the specific k-mer can be selected in turn from the reference genome and compared with the specific k-mer. If it is detected that the selected sequence is identical to the specific k-mer, the specific k-mer is considered to belong to the reference genome and is a specific k-mer included in the reference genome. If the selected sequence differs from the specific k-mer, the specific k-mer is not considered to belong to the reference genome. That is, there is no notion of the selected sequence being merely similar to the specific k-mer; the k-mer either belongs or does not belong. This mapping method is faster because no similarity judgment is involved.
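  • the two mapping variants described above can be sketched as follows. The exact-match variant follows the text directly; for the similarity variant, the text does not define how similarity is measured, so the fraction of identical positions between two equal-length strings is assumed here for illustration. All function names are hypothetical.

```python
def maps_exactly(genome, kmer):
    """Exact variant: the k-mer belongs to the genome only if it occurs verbatim."""
    return kmer in genome


def identity(window, kmer):
    """Assumed similarity measure: share of positions with identical characters."""
    matches = sum(a == b for a, b in zip(window, kmer))
    return matches / len(kmer)


def maps_with_similarity(genome, kmer, threshold):
    """Similarity variant: slide a window of length k and test each one."""
    k = len(kmer)
    return any(
        identity(genome[i:i + k], kmer) >= threshold
        for i in range(len(genome) - k + 1)
    )


def specific_kmers_in_genome(genome, specific_kmers, threshold=None):
    """Keep the specific k-mers that map to the reference genome.

    threshold=None selects the exact variant; otherwise the similarity variant.
    """
    if threshold is None:
        return [kmer for kmer in specific_kmers if maps_exactly(genome, kmer)]
    return [kmer for kmer in specific_kmers
            if maps_with_similarity(genome, kmer, threshold)]
```

    The exact variant needs no per-position comparison logic, which is consistent with the text's remark that it is the faster of the two.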
  • before determining the target pathogen operating group to be detected, the method further includes: obtaining pre-selected genomes that meet a preset confidence condition as high-confidence genomes; and determining the high-confidence genomes included in each pathogen operating group as the genomes corresponding to each pathogen operating group.
  • a high-confidence genome refers to a selected genome that meets a preset confidence condition.
  • the preset confidence condition refers to a condition, set by a technician, for selecting genomes. After the high-confidence genomes are obtained, the high-confidence genomes contained in each pathogen operating group can be determined, which gives the genomes contained in each pathogen operating group.
  • the high-confidence genomes can include both pathogen genomes and non-pathogen genomes, such as high-confidence genomes of symbiotic organisms, probiotics, humans, animals, and plants.
  • the high-confidence genomes can be derived from the RefSeq dataset of the NCBI (the reference sequence database provided by the National Center for Biotechnology Information, containing biologically non-redundant gene and protein sequences), or from other public or private high-confidence genome collections.
  • satisfying the preset confidence condition includes any of the following: the proportion of non-deterministic characters contained in the genome sequence is lower than a preset proportion threshold; the number of sequence fragments belonging to the same chromosome in the genome sequence is below a preset fragment threshold; or, after the genome sequence is aligned against all other genome sequences whose genetic distance from it falls within a preset genetic distance threshold range, the average full-sequence coverage percentage of the genome sequence among these similar genome sequences is higher than a preset percentage value.
  • for example, for a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that genome is suspected of being a low-confidence genome. For DNA or RNA sequences, non-deterministic characters are characters other than A, C, G, T, and U. For protein sequences, non-deterministic characters are characters other than those denoting definite amino acids.
  • Genomes with a low average coverage percentage are suspected of low completeness, i.e., low confidence.
  • Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals).
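  • The first confidence condition (proportion of non-deterministic characters) can be sketched as below. The 1% default threshold and the function names are illustrative assumptions; the text only states that the threshold is preset.

```python
def nondeterministic_fraction(sequence: str, alphabet: str = "ACGT") -> float:
    """Proportion of characters that are not in the deterministic alphabet."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    allowed = set(alphabet)
    unknown = sum(1 for ch in sequence.upper() if ch not in allowed)
    return unknown / len(sequence)


def passes_character_check(sequence: str, max_fraction: float = 0.01,
                           alphabet: str = "ACGT") -> bool:
    """True when the non-deterministic proportion is below the preset threshold."""
    return nondeterministic_fraction(sequence, alphabet) < max_fraction


print(nondeterministic_fraction("ACGTNNAC"))  # two N characters out of eight
```

  • For RNA data the same check could be run with `alphabet="ACGU"`; the fragment-count and coverage conditions would need assembly metadata and alignments and are not shown.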
  • step 108 includes:
  • Step 302: map the specific k-mers included in each genome to that genome.
  • Step 304: select, in turn, the specific k-mers and/or non-overlapping specific regions contained in each genome for detection.
  • Step 306: when it is detected that the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than a preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region.
  • Step 308: obtain a set of non-overlapping specific regions corresponding to each genome according to the finally retained specific k-mers and the replaced non-overlapping specific regions.
  • the specific k-mers contained in each genome can be mapped to the genome. Two specific k-mers included in the genome can then be selected in order, and it is detected whether the distance between the two selected specific k-mers on the genome is less than the preset distance threshold. If so, the two selected specific k-mers are replaced.
  • the replacement may consist of replacing the two selected specific k-mers with the smallest region that covers both of them, to obtain the corresponding non-overlapping specific region. Alternatively, the sequence on the genome spanned by the two selected specific k-mers may be taken as the corresponding non-overlapping specific region.
  • the preset distance threshold can be a negative number or 0, and is generally set to an integer less than 5.
  • the specific k-mer in each genome can also be compared with the non-overlapping specific regions in the genome, or the two non-overlapping specific regions in the genome can be compared.
  • specific k-mer and / or non-overlapping specific regions contained in each genome can be selected for detection.
  • the comparison is performed in the same way as between two specific k-mers: it is detected whether the distance on the genome between the two selected non-overlapping specific regions, or between the selected specific k-mer and non-overlapping specific region, is less than the preset distance threshold. If so, the selected specific k-mer and non-overlapping specific region, or the two selected non-overlapping specific regions, are replaced to obtain the corresponding non-overlapping specific region.
  • a set of non-overlapping specific regions corresponding to each genome can be obtained according to the specific k-mers finally retained in each genome and the replaced non-overlapping specific regions. The set corresponding to each genome contains the non-overlapping specific regions in that genome.
  • for example, A and B are the two selected specific k-mers, where A is ACGGTCATC and B is TCATCCGA.
  • if A and B are separated on the genome and the sequence between them is CCC, A and B can be replaced by A + CCC + B, that is, the non-overlapping specific region obtained by replacing A and B is ACGGTCATCCCCTCATCCGA.
  • if there is no sequence between A and B, A and B can be spliced directly, that is, the sequence A + B is the non-overlapping specific region obtained by replacing A and B.
  • in this example, the ends of A and B may also overlap, that is, the tail of A and the head of B share several characters (TCATC). In that case, A and B are replaced by the smallest region that covers both selected specific k-mers, that is, ACGGTCATCCGA.
  • the specific replacement method can be customized by a technician, or selected based on the distance between A and B or the number of overlapping characters.
  • the length of the obtained non-overlapping specific region is not limited. If the distance between the selected two specific k-mers on the genome is not less than a preset distance threshold, no processing is required. After doing this for each genome, a set of non-overlapping specific regions contained in each genome can be obtained. The set of non-overlapping specific regions corresponding to each genome contains the non-overlapping specific regions in the genome.
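  • Steps 302 to 308 behave like an interval-merging pass over the positions of the specific k-mers on one genome. The sketch below assumes that each k-mer or region is stored as a half-open `(start, end)` coordinate pair and that the distance between two neighbours is the number of characters between them (negative when they overlap); the name `merge_regions` is hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open (start, end) coordinates on one genome


def merge_regions(regions: List[Region], distance_threshold: int) -> List[Region]:
    """Replace neighbours closer than the threshold by the smallest covering region."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < distance_threshold:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Two k-mers 3 characters apart are merged when the threshold is 5.
print(merge_regions([(0, 9), (12, 20), (40, 48)], distance_threshold=5))
```

  • Sorting first lets a single left-to-right pass stand in for the repeated pairwise selection described in the text, since a merged region is immediately compared with the next k-mer.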
  • the preset distance threshold is less than 5; for example, it can be an integer less than 5.
  • the above step 306 includes:
  • Step 402: detect whether the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than or equal to zero; if yes, perform step 404; if not, perform step 406.
  • Step 404: replace the two selected specific k-mers with the smallest region that covers both of them, to obtain a non-overlapping specific region.
  • two specific k-mers and/or non-overlapping specific regions are selected from the genome in order and located on the genome, so that the distance between them, that is, the number of characters separating them, can be obtained.
  • when the distance is less than or equal to 0, the two selected specific k-mers and/or non-overlapping specific regions are replaced to obtain the corresponding non-overlapping specific region.
  • when the distance on the genome between the two specific k-mers and/or non-overlapping specific regions is 0, the two are directly adjacent to each other.
  • the replacement may consist of replacing the two selected specific k-mers and/or non-overlapping specific regions with the smallest region that covers both. That is, a single region replaces the two, and this region is the non-overlapping specific region obtained from them.
  • Step 406: obtain the sequence lying between the two selected specific k-mers and/or non-overlapping specific regions on the mapped genome.
  • Step 408: splice the two selected specific k-mers and the intervening sequence in order to obtain a spliced sequence.
  • Step 410: replace the two selected specific k-mers with the spliced sequence to obtain a non-overlapping specific region.
  • the two selected sequences may be two specific k-mers, one specific k-mer and one non-overlapping specific region, or two non-overlapping specific regions.
  • the two selected sequences and the intervening sequence can be spliced in order to obtain a spliced sequence. After the two selected sequences are replaced with the spliced sequence, a non-overlapping specific region is obtained. This is repeated until no specific k-mers or specific regions remain whose distance is less than the preset distance threshold.
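  • The branch between step 404 and steps 406 to 410 can be sketched at the sequence level, using the k-mers A = ACGGTCATC and B = TCATCCGA from the example above. The half-open coordinates, the flanking bases added to the toy genomes, and the name `replace_pair` are assumptions made for illustration.

```python
from typing import Tuple

Region = Tuple[int, int]  # half-open (start, end) coordinates on the genome


def replace_pair(genome: str, first: Region, second: Region) -> str:
    """Sequence of the non-overlapping specific region that replaces two regions."""
    (a_start, a_end), (b_start, b_end) = sorted([first, second])
    distance = b_start - a_end  # characters between the regions; <= 0 if they touch or overlap
    if distance <= 0:
        # Step 404: smallest region covering both.
        return genome[a_start:max(a_end, b_end)]
    # Steps 406-410: splice first region + intervening sequence + second region.
    gap = genome[a_end:b_start]
    return genome[a_start:a_end] + gap + genome[b_start:b_end]


a, gap, b = "ACGGTCATC", "CCC", "TCATCCGA"
separated = "GG" + a + gap + b + "TT"      # A and B three characters apart
overlapping = "GG" + "ACGGTCATCCGA" + "TT"  # tail of A overlaps head of B
print(replace_pair(separated, (2, 11), (14, 22)))
print(replace_pair(overlapping, (2, 11), (6, 14)))
```

  • Both branches return a contiguous slice of the genome; they are kept separate here only to mirror the step numbering.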
  • the method further includes:
  • Step 502: select the specific k-mers and/or non-overlapping specific regions contained in each genome for detection.
  • Step 504: when it is detected that the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than a preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region.
  • Step 506: obtain a set of non-overlapping specific regions corresponding to each genome according to the finally retained specific k-mers and the replaced non-overlapping specific regions.
  • the corresponding non-overlapping specific region can be obtained.
  • the specific k-mer in each genome can be compared with the non-overlapping specific regions in the genome, or the two non-overlapping specific regions in the genome can be compared.
  • the comparison is performed in the same way as between two specific k-mers: it is detected whether the distance on the genome between the two selected non-overlapping specific regions, or between the selected specific k-mer and non-overlapping specific region, is less than the preset distance threshold. If so, the selected specific k-mer and non-overlapping specific region, or the two selected non-overlapping specific regions, are replaced to obtain the corresponding non-overlapping specific region.
  • in this way, a set of non-overlapping specific regions corresponding to each genome can be obtained; the set corresponding to each genome contains the non-overlapping specific regions in that genome.
  • the above step 110 includes:
  • Step 602: combine the sets of non-overlapping specific regions corresponding to the genomes included in the target pathogen operation group to obtain a union of non-overlapping specific regions.
  • Step 604: obtain, for each non-overlapping specific region in the union, the number of times it occurs across the sets of non-overlapping specific regions corresponding to the genomes.
  • if the target pathogen operating group contains N genomes, N sets of non-overlapping specific regions are obtained.
  • by combining these N sets, a union of non-overlapping specific regions is obtained.
  • this step in effect counts the number of occurrences of each non-overlapping specific region across the sets. If a given non-overlapping specific region appears in M genomes, its number of occurrences in the union is M.
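  • Steps 602 and 604 can be sketched with a counter over the per-genome sets. The sketch assumes that a non-overlapping specific region is identified by its sequence string and that the sets are keyed by a genome identifier; both choices, and the name `count_occurrences`, are illustrative.

```python
from collections import Counter
from typing import Dict, Set


def count_occurrences(region_sets: Dict[str, Set[str]]) -> Counter:
    """Number of genomes in which each non-overlapping specific region appears."""
    counts: Counter = Counter()
    for regions in region_sets.values():
        counts.update(set(regions))  # each genome contributes at most once per region
    return counts


region_sets = {
    "genome_1": {"ACGGTCATCCGA", "TTGACCA"},
    "genome_2": {"ACGGTCATCCGA"},
    "genome_3": {"ACGGTCATCCGA", "GGCATTA"},
}
counts = count_occurrences(region_sets)
print(counts["ACGGTCATCCGA"])  # region present in all three genomes
```

  • The keys of the counter form the union of step 602, and each value is the count M of step 604.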
  • the method further includes:
  • Step 702: take the selected non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as representative specific regions.
  • Step 704: form a total set of representative specific regions from the obtained representative specific regions.
  • Step 706: remove the representative specific regions that have no biological function from the total set of representative specific regions to obtain a total set of representative specific regions that have biological functions.
  • Step 708: use the non-overlapping specific regions in the total set of representative specific regions with biological functions as the detection targets of the target pathogen operating group.
  • the biological functions of genes include the storage, replication, and expression of genetic information. After the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold are selected, they can be used as representative specific regions and assembled into the total set of representative specific regions. The non-overlapping specific regions in this total set are then screened by removing the representative specific regions that have no biological function, which yields the total set of representative specific regions that have biological functions.
  • the representative specific regions in this total set with biological functions can be used as detection targets, that is, the non-overlapping specific regions in the total set of representative specific regions with biological functions serve as the detection targets of the target pathogen operating group.
  • step 704 includes:
  • Step 802: obtain the gene annotation information of each genome included in the target pathogen operation group from the target database.
  • the gene annotation information includes the position of each known functional region on each genome and the corresponding function information.
  • the gene annotation information is information indicating the position and function of each gene in a genome. Therefore, the gene annotation information of each genome includes the position and corresponding function information of each known functional region on that genome. The position of a region includes its start and end positions, its strand (plus or minus), and its sequence.
  • the corresponding function information includes genes encoding proteins, genes encoding microRNAs (a class of non-coding single-stranded RNA molecules about 22 nucleotides long, encoded by endogenous genes), promoter regions, regions recognized and bound by regulatory proteins, replication initiation regions, and the like.
  • the gene annotation information of each genome stored in the target database can be obtained from the NCBI GenBank database (NCBI's open-access annotated nucleic acid sequence database), or from the Ensembl database (a database of genome sequences and annotation information maintained by organizations such as the European Bioinformatics Institute).
  • the target database stores the gene annotation information of each genome contained in each pathogen operating group, so the gene annotation information of each genome contained in the target pathogen operating group can be obtained from the target database, and the regions with biological functions in each of these genomes can thereby be obtained.
  • one representative specific region at a time is selected from the total set of representative specific regions and compared with the known functional regions in all genomes.
  • Step 806: remove the representative specific regions whose overlap with the known functional regions is below a preset overlap threshold, to obtain a total set of representative specific regions with biological functions.
  • a representative specific region can be selected in turn and aligned with the known functional regions included in all genomes of the target pathogen operating group, to determine whether the selected representative specific region overlaps significantly with a known functional region, that is, whether the degree of overlap of the two sequences reaches the preset overlap threshold.
  • when the degree of overlap between the selected representative specific region and the known functional regions is lower than the preset overlap threshold, the representative specific region is considered to have no biological function, and such representative specific regions can be removed.
  • the preset overlap threshold for significant overlap may be any of the following: the length of the overlapping region exceeds a threshold T1, for example 12 bp; the length of the overlapping region as a percentage of the length of the specific region exceeds a threshold T2, for example 30%; the length of the overlapping region as a percentage of the length of the relevant functional region exceeds a threshold T3, for example 30%; or the total length of all functional regions contained in the specific region as a percentage of the length of the specific region exceeds a threshold T4, for example 30%.
  • Step 806 is optional, that is, it may be skipped, but performing it is generally recommended. Under selection pressure, sequences with biological functions generally do not mutate during evolution, so choosing sequences with biological functions as the final diagnostic targets effectively avoids mutations, i.e., sequence changes, arising in the selected specific regions as the pathogen evolves and reproduces. The long-term validity and accuracy of the selected detection targets can therefore be guaranteed to a certain extent.
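  • The overlap screening of step 806 can be sketched with the T1 and T2 criteria only. The sketch assumes that representative specific regions and annotated functional regions are given as half-open `(start, end)` coordinates on the same genome, and uses the example values 12 bp and 30% as defaults; the function names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open (start, end) coordinates on one genome


def overlap_length(a: Region, b: Region) -> int:
    """Number of positions shared by two regions (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlaps_significantly(region: Region, functional_regions: List[Region],
                           t1_bp: int = 12, t2_fraction: float = 0.30) -> bool:
    """True if the overlap with any functional region exceeds T1 or T2."""
    length = region[1] - region[0]
    if length <= 0:
        raise ValueError("region must have positive length")
    for functional in functional_regions:
        shared = overlap_length(region, functional)
        if shared > t1_bp or shared / length > t2_fraction:
            return True
    return False


def keep_functional(regions: List[Region], functional_regions: List[Region]) -> List[Region]:
    """Drop representative regions without significant functional overlap."""
    return [r for r in regions if overlaps_significantly(r, functional_regions)]


print(keep_functional([(100, 150), (100, 140), (0, 50)], [(130, 200)]))
```

  • The T3 and T4 criteria would follow the same pattern, dividing by the functional region length or summing the overlap over all functional regions.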
  • the method further includes:
  • Step 902: take the genome containing the largest number of non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as the representative genome.
  • Step 904: use the set of representative specific regions corresponding to the representative genome as the PCR-specific region set.
  • the representative genome can be selected based on the number of occurrences of the selected non-overlapping specific regions. For each genome, the number of non-overlapping specific regions whose occurrences exceed the preset count threshold can be counted, and the genome containing the largest number of such regions is selected as the representative genome.
  • the representative specific regions are the non-overlapping specific regions, selected from all sets of non-overlapping specific regions, whose number of occurrences exceeds the preset count threshold; that is, there are multiple representative specific regions.
  • the representative genome is a genome selected from the multiple genomes contained in the target pathogen operating group, so the representative genome also has its own corresponding set of representative specific regions. Therefore, once the representative genome is selected, the set of representative specific regions corresponding to it can be used as the PCR (polymerase chain reaction) representative specific region set.
  • Step 906: select non-overlapping specific regions from the PCR-specific region set to generate a set of eligible PCR-specific region pairs.
  • Step 908: select the two non-overlapping specific regions of one PCR-specific region pair from the set of eligible PCR-specific region pairs and locate them on the representative genome.
  • the preset distance range D can be (MD - SD, MD + SD), where MD can be set to about 1000 bp and SD to about 500 bp.
  • Step 910: use the sequence corresponding to the interval formed by the two non-overlapping specific regions on the representative genome as an interval to be detected.
  • Step 912: form a set of intervals to be detected from the obtained intervals to be detected.
  • each interval to be detected in the set of intervals to be detected is screened to obtain a final set of detection primer pairs.
  • for example, if A and B are two non-overlapping specific regions whose distance on the genome falls within the preset distance range, then A and B form one eligible PCR-specific region pair.
  • the sequence corresponding to the interval formed by these two PCR-specific regions can be obtained, and this sequence is taken as an interval to be detected on the representative genome.
  • the two PCR-specific regions of each pair in the set of eligible PCR-specific region pairs are selected and located on the genome, which gives the corresponding interval to be detected for each pair.
  • each interval to be detected in the set is then screened, and the final set of detection primer pairs is obtained after screening.
  • the final set of detection primer pairs obtained from the intervals to be detected may include multiple PCR primer pairs.
  • step 902 includes: obtaining a total set of representative specific regions from the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold; and selecting the genome that contains the largest number of non-overlapping specific regions in the total set of representative specific regions as the representative genome.
  • non-overlapping specific regions whose number of occurrences exceeds the preset count threshold can be selected.
  • the selected non-overlapping specific regions make up the total set of representative specific regions.
  • the total set of representative specific regions includes one or more non-overlapping specific regions whose number of occurrences across all sets of non-overlapping specific regions exceeds the preset count threshold. The genome containing the largest number of these regions (i.e., representative specific regions) can be selected as the representative genome. That is, the genome containing the largest number of non-overlapping specific regions in the total set of representative specific regions is selected as the representative genome.
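  • The selection of the representative genome can be sketched as below. As an assumption for illustration, regions are identified by sequence string and sets are keyed by genome identifier; ties are broken by the alphabetically first identifier, a choice the text does not specify. The name `pick_representative_genome` is hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def pick_representative_genome(region_sets: Dict[str, Set[str]],
                               count_threshold: int) -> Tuple[str, Set[str]]:
    """Genome holding the most representative regions, with those regions."""
    if not region_sets:
        raise ValueError("at least one genome is required")
    counts: Counter = Counter()
    for regions in region_sets.values():
        counts.update(set(regions))
    # Representative specific regions: occurrences exceed the preset count threshold.
    representative = {r for r, n in counts.items() if n > count_threshold}
    best = max(sorted(region_sets),
               key=lambda genome: len(representative & set(region_sets[genome])))
    return best, representative & set(region_sets[best])


region_sets = {
    "genome_1": {"r1", "r2", "r3"},
    "genome_2": {"r1", "r2", "r5"},
    "genome_3": {"r1", "r4", "r5"},
}
print(pick_representative_genome(region_sets, count_threshold=1)[0])
```

  • The second element of the returned tuple corresponds to the set of representative specific regions of the representative genome, which the text then uses as the PCR representative specific region set.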
  • step 906 includes:
  • Step 1002: obtain the position of each PCR representative specific region in the PCR representative specific region set.
  • Step 1004: take two non-overlapping specific regions whose separation distance falls within a preset distance range as an eligible PCR-specific region pair.
  • Step 1006: generate a set of eligible PCR-specific region pairs from the eligible PCR-specific region pairs.
  • non-overlapping specific regions that satisfy the preset distance range can be selected from the PCR representative specific region set to generate the set of eligible PCR-specific region pairs.
  • the position of each PCR representative specific region in the PCR representative specific region set can be obtained, that is, each non-overlapping specific region in the set is located on the representative genome.
  • once the location of each non-overlapping specific region on the representative genome is determined, the distance between any two non-overlapping specific regions, that is, the number of characters separating them, can be obtained.
  • two non-overlapping specific regions whose separation distance falls within the preset distance range are then picked out. For example, if the preset distance range is set to 700 to 1300, two non-overlapping specific regions separated by 700 to 1300 characters can be selected. The two selected non-overlapping specific regions are taken as an eligible PCR-specific region pair.
  • a set of eligible PCR-specific region pairs can be generated from the selected eligible pairs. That is, the set contains one or more eligible PCR-specific region pairs.
  • the preset distance range is greater than 500 bp and less than 1500 bp.
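  • Steps 1002 to 1006 can be sketched as a search over all pairs of regions located on the representative genome. The half-open `(start, end)` representation and the name `eligible_pairs` are assumptions; the default bounds follow the stated range of greater than 500 bp and less than 1500 bp.

```python
from itertools import combinations
from typing import List, Tuple

Region = Tuple[int, int]  # half-open (start, end) on the representative genome


def eligible_pairs(regions: List[Region], min_distance: int = 500,
                   max_distance: int = 1500) -> List[Tuple[Region, Region]]:
    """Pairs of regions whose separation lies strictly inside the preset range."""
    pairs = []
    for first, second in combinations(sorted(regions), 2):
        distance = second[0] - first[1]  # characters between the two regions
        if min_distance < distance < max_distance:
            pairs.append((first, second))
    return pairs


print(eligible_pairs([(0, 20), (800, 820), (3000, 3020)]))
```

  • Checking all pairs is quadratic in the number of regions; for large sets a sorted sweep would be preferable, but the logic of the distance test is the same.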
  • in step 910, the two outermost positions of the two selected non-overlapping specific regions on the representative genome are used as the boundaries of the interval to be detected; the sequence corresponding to the interval formed by these boundaries on the representative genome is used as the interval to be detected.
  • for example, if the two selected non-overlapping specific regions are "ACGGTCATC" and "TCATCCGAG" and the sequence between them is "AAAATTTTT", the characters "A" and "G" at the two outer ends can be used as the boundaries of the interval to be detected, so that the interval to be detected is "ACGGTCATC" + "AAAATTTTT" + "TCATCCGAG". That is, the interval to be detected finally obtained is "ACGGTCATCAAAATTTTTTCATCCGAG".
  • this processing is performed on the two non-overlapping specific regions of any pair selected from the set of eligible PCR-specific region pairs, so that multiple corresponding intervals to be detected can be obtained.
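  • The construction of an interval to be detected can be sketched with the sequences from the example above. The flanking bases added to the toy genome, the half-open coordinates, and the name `interval_to_detect` are assumptions for illustration.

```python
from typing import Tuple

Region = Tuple[int, int]  # half-open (start, end) on the representative genome


def interval_to_detect(genome: str, first: Region, second: Region) -> str:
    """Sequence bounded by the two outermost positions of a region pair."""
    start = min(first[0], second[0])
    end = max(first[1], second[1])
    return genome[start:end]


left, middle, right = "ACGGTCATC", "AAAATTTTT", "TCATCCGAG"
genome = "CC" + left + middle + right + "GG"
# left occupies [2, 11), middle [11, 20), right [20, 29)
print(interval_to_detect(genome, (2, 11), (20, 29)))
```

  • The result spans both specific regions and the non-specific sequence between them, which is the input later handed to the primer design tool.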
  • the positions of the two selected non-overlapping specific regions, and the positions of the non-specific regions between them, are marked on the representative genome.
  • in other words, the positions of the two non-overlapping specific regions can be marked on the representative genome, and the positions of the non-specific regions between them can also be recorded. A non-specific region is a region that belongs neither to a selected specific k-mer nor to a non-overlapping specific region formed from specific k-mers.
  • the positions of the non-overlapping specific regions can be used to identify PCR primer pairs. PCR primers need to be placed in non-overlapping specific regions, and the specificity of PCR depends on the primers, so it is useful to mark which positions belong to non-overlapping specific regions and which belong to non-specific regions.
  • step 912 includes:
  • a PCR primer tool is used to screen each interval to be detected in the set of intervals to be detected, to obtain a set of candidate PCR primer pairs.
  • after the sequences corresponding to the intervals formed by two non-overlapping specific regions on the representative genome are taken as intervals to be detected, multiple intervals to be detected are obtained, and each of them is then screened. For example, a PCR primer tool can be applied to screen each interval to be detected in the set of intervals to be detected.
  • the PCR primer tool may be Primer3. In this way, some of the intervals to be detected are selected, and a set of one or more candidate PCR primer pairs is obtained.
  • Step 1104: select the primer pairs specific to the target pathogen operation group from the set of candidate PCR primer pairs, and generate a set of specific primer pairs corresponding to the target pathogen operation group.
  • the PCR primer pairs generated automatically by most existing primer design tools are only guaranteed to be specific within the interval to be detected; their specificity elsewhere cannot be guaranteed.
  • in addition, some automatic PCR primer design tools cannot take into account the marked positions of the specific regions within the interval to be detected. It is therefore necessary to further verify the specificity of the primer pairs in the obtained set of candidate PCR primer pairs. Primer pairs specific to the target pathogen operating group can thus be selected from the set of candidate PCR primer pairs, and a set of specific primer pairs corresponding to the target pathogen operating group can be generated.
  • the genomes in the full set that are not in the target pathogen operating group can be used as alignment reference genomes, and the two primers of a candidate PCR primer pair can be aligned with the sequences of an alignment reference genome and mapped onto it.
  • the primer pair mapped to the alignment reference genome can be compared with the sequence at the position on the alignment reference genome to which it is mapped; when the sequence similarity reaches a preset similarity threshold, the primer pair is determined to be successfully located.
  • candidate PCR primer pairs that are successfully located are determined not to be specific primer pairs for the target pathogen operating group and are removed from the set of candidate PCR primer pairs.
  • a set of specific primer pairs corresponding to the target pathogen operating group can be generated from the remaining specific primer pairs.
  • Step 1106: select the primer pairs in the set of specific primer pairs that meet preset primer conditions as final detection primer pairs.
  • Step 1108: generate a set of final detection primer pairs from the final detection primer pairs.
  • primer pairs that meet the preset primer conditions are selected from the set of specific primer pairs and used as the final detection primer pairs, from which the corresponding set of final detection primer pairs is generated.
  • step 1104 includes:
  • Step 1202: obtain a full set from the target database, the full set containing a plurality of collected high-confidence genomes.
  • Step 1204: obtain from the full set the genomes that are not included in the target pathogen operating group, and use them as alignment reference genomes.
  • a full set is stored in the target database, and the full set contains multiple collected high-confidence genomes.
  • a high-confidence genome refers to a selected genome that meets a preset confidence condition. Since the genomes included in the target pathogen operating group are known, they can be removed from the full set, and the remaining genomes, which are not in the target pathogen operating group, can be used as alignment reference genomes.
  • the alignment reference genomes are not included in the target pathogen operating group.
  • primer pairs are selected in turn from the set of candidate PCR primer pairs and mapped to the alignment reference genomes.
  • Step 1208: compare the selected primer pair with the sequence at the position on the alignment reference genome to which it is mapped; when the sequence similarity reaches a preset similarity threshold, determine that the primer pair is successfully located.
  • a selected primer pair mapped to an alignment reference genome can be compared with the sequence at the position on the alignment reference genome to which the primer pair is mapped.
  • the preset similarity threshold can be set by a technician, for example to 95% or 99%. When the similarity between the primer pair and the sequence at the mapped position reaches 95% or 99%, respectively, the primer pair is determined to be successfully mapped to the alignment reference genome.
  • step 1210 a primer pair that satisfies a preset alignment condition is removed from the primer pair determined to be successfully located to obtain a specific primer pair corresponding to the target pathogen operation group.
  • Step 1212 Generate a set of specific primer pairs corresponding to the target pathogen operating group according to the specific primer pairs.
  • a set of specific primer pairs corresponding to the target pathogen operating group can be generated from the specific primer pairs. That is, the set of specific primer pairs corresponding to the target pathogen operating group includes one or more specific primer pairs corresponding to the target pathogen operating group.
  • the preset alignment conditions include at least one of the following: the two primers of the selected primer pair are located on the same chromosome of the same genome; the distance between the two primers of the selected primer pair is within the preset distance range; a preset number of bases at the 3' end of either primer of the selected primer pair are identical to the bases at the position where the primer maps on the alignment reference genome.
  • the preset distance range is an interval D that can float within (MD-SD, MD+SD), where MD is generally 1000bp and SD is generally 500bp. k is generally greater than 0.5.
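  • As an illustration only, the three alignment conditions can be checked as sketched below. The text says "at least one of" the conditions may apply; this sketch requires all three, which is one stricter reading. The `PrimerHit` record, the `n_tail` parameter (number of 3' bases compared), and the use of start-to-start distance are assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PrimerHit:
    """Hypothetical record of one primer mapped to an alignment reference genome."""
    genome: str
    chromosome: str
    position: int   # 0-based start of the matched window
    primer: str     # primer sequence, written 5' to 3'
    matched: str    # genome sequence at the matched window, same orientation


def three_prime_identical(hit: PrimerHit, n_tail: int = 5) -> bool:
    """True if the last n_tail bases of the primer equal those of the genome window."""
    return hit.primer[-n_tail:] == hit.matched[-n_tail:]


def meets_alignment_conditions(fwd: PrimerHit, rev: PrimerHit,
                               md: int = 1000, sd: int = 500,
                               n_tail: int = 5) -> bool:
    """Check same chromosome, distance within (MD-SD, MD+SD), and 3' identity."""
    same_chrom = (fwd.genome, fwd.chromosome) == (rev.genome, rev.chromosome)
    distance = abs(rev.position - fwd.position)   # assumption: start-to-start
    in_range = md - sd < distance < md + sd       # open interval, as written
    tails = three_prime_identical(fwd, n_tail) and three_prime_identical(rev, n_tail)
    return same_chrom and in_range and tails


fwd = PrimerHit("g1", "chr1", 100, "ACGTACGTACGTACGTAA", "ACGTACGTACGTACGTAA")
rev = PrimerHit("g1", "chr1", 1100, "TTGCATGCATGCATGCAT", "TTGCATGCATGCATGCAT")
```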
  • meeting the preset primer conditions includes at least one of the following: the primer length is between 17 and 28 bp; the primer annealing temperature is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG, or GC; there are no more than 3 G/C among the last 5 bases at the 3' end of the primer, and the last 5 bases at the 3' end do not contain more than 2 consecutive C or G; the primer does not contain repeats or single-nucleotide repeats; there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
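  • A few of these primer conditions are simple string checks, sketched below under stated assumptions. Only length, GC percentage, the 3'-terminal base, and the two rules on the last 5 bases are covered; annealing temperature, repeats, and complementarity are deliberately left out because the text does not say how they are computed. The function names are hypothetical.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def passes_basic_primer_checks(primer: str) -> bool:
    """Apply the length, GC%, 3'-end, and last-5-bases rules to one primer."""
    p = primer.upper()
    if not p:
        return False
    tail = p[-5:]  # last 5 bases at the 3' end
    return (
        17 <= len(p) <= 28                          # length between 17 and 28 bp
        and 0.40 <= gc_fraction(p) <= 0.60          # GC percentage 40% to 60%
        and p[-1] in "GC"                           # 3' end is C or G
        and sum(base in "GC" for base in tail) <= 3   # at most 3 G/C in last 5
        and re.search(r"[GC]{3,}", tail) is None    # no run of 3 or more C/G
    )
```

  • For example, `passes_basic_primer_checks("ATGCTAGCTAGGATCCATAC")` is true, while a primer ending in `CAGCC` fails the rule on G/C in the last 5 bases.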
  • a method for determining a detection target including the following steps:
  • step 1302 a characteristic target sequence set of the target pathogen operation group is established.
  • step 1302 includes:
  • Step 1302A Collection and sorting of high-confidence genomes.
  • in the target database, a characteristic target sequence set corresponding to each pathogen operation group can be stored.
  • the high-confidence genome can include both the pathogen genome and the non-pathogen genome, such as high-confidence genomes of symbiotic, probiotic, human, animal, and plant.
  • the high-confidence genomes can be derived from the RefSeq dataset of the NCBI (the reference sequence database provided by the National Center for Biotechnology Information, which contains genome, gene, and protein sequences that are non-redundant in a biological sense) or from other public or private high-confidence genome sources.
  • for example, for a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, the data is suspected to be a low-confidence genome. For DNA or RNA sequences, non-deterministic characters are characters other than A, C, G, T, and U. For protein sequences, non-deterministic characters are characters other than the defined amino acid characters.
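  • The proportion of non-deterministic characters is straightforward to compute. The sketch below is illustrative; the function name and the choice of alphabet argument are assumptions, and the preset proportion threshold is left to the caller because the text gives no value.

```python
def nondeterministic_fraction(sequence: str, alphabet: str = "ACGT") -> float:
    """Proportion of characters in the sequence that fall outside the alphabet."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    allowed = set(alphabet)
    return sum(ch not in allowed for ch in sequence.upper()) / len(sequence)


dna_fraction = nondeterministic_fraction("ACGTNNAC")           # two N in 8 bases
rna_fraction = nondeterministic_fraction("ACGU", alphabet="ACGU")
```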
  • Genomes with a low average coverage percentage are suspected of being incomplete, i.e., of low confidence.
  • Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals). All high-confidence genomes collected can be collectively referred to as the complete set.
  • step 1302B the genome included in the target pathogen operating group is determined.
  • a high-confidence genome refers to a selected genome that meets a preset reliability condition. After the high-confidence genomes are obtained, the high-confidence genomes contained in each pathogen operating group can be determined, and thus the genomes contained in the target pathogen operating group can be determined.
  • the target pathogen operating group refers to a pathogen operating group to be detected.
  • for example, if the pathogen operating group to be detected is Staphylococcus aureus, the target pathogen operating group in step 102 refers to Staphylococcus aureus.
  • step 1302C an index table of the number of occurrences of the genome of the complete set is generated.
  • a genome occurrence index table of the complete set can be generated.
  • k-mer refers to a genomic sequence of length k.
  • k can be user-defined and is generally set between 11 and 32. If a genomic data set has a different deterministic characters, then for a specific k there are at most a^k (a to the power k) different k-mers.
  • for example, DNA has four deterministic characters, A, C, G, and T, so for a particular k there are 4^k possible different k-mers.
  • for a genome of length n there can be at most n-k+1 different k-mers. Because a genome contains repeated regions, the number of distinct k-mers in a genome of length n is in general much smaller than n-k+1. Therefore, with an ordinary k-mer counting method, a given k-mer may appear multiple times in a given genome and be counted multiple times.
  • the genome occurrence index table of the complete set differs from the ordinary method: if a k-mer occurs more than once in a genome, the table still counts it only once for that genome. The count corresponding to a k-mer in the resulting table therefore represents the number of genomes in the complete set in which that k-mer appears.
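  • The difference from ordinary k-mer counting can be shown with a small sketch. It assumes genomes are held in memory as plain strings and ignores reverse complements; the function names and toy genomes are hypothetical.

```python
from collections import Counter


def distinct_kmers(sequence: str, k: int) -> set[str]:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def genome_occurrence_table(genomes: dict[str, str], k: int) -> Counter:
    """Map each k-mer to the number of genomes in which it appears at least once."""
    table: Counter = Counter()
    for sequence in genomes.values():
        # Updating with a set adds 1 per k-mer per genome, however often it repeats.
        table.update(distinct_kmers(sequence, k))
    return table


toy_genomes = {"g1": "ACGACG", "g2": "ACGTTT", "g3": "TTTTTT"}
table = genome_occurrence_table(toy_genomes, k=3)
```

  • Here `ACG` occurs twice in `g1` but contributes only one count for that genome, so its table entry is 2 (genomes `g1` and `g2`). Running the same function on only the genomes of the target pathogen operation group would give the group-specific table.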
  • step 1302D an index table of the number of occurrences of the genome corresponding to the target pathogen operation group is generated.
  • the index table of the number of occurrences of the genome of the target pathogen operating group is different from the index table of the number of occurrences of the genome of the complete set in step 1302C.
  • the genome occurrence index table of the complete set records, over the complete set, how many genomes a k-mer has appeared in.
  • the genome occurrence index table corresponding to the target pathogen operation group records the k-mers contained in the target pathogen operation group and the number of genomes in the target pathogen operation group in which each k-mer has appeared.
  • step 1302E a specific k-mer table corresponding to the target pathogen operation group is generated.
  • the specific k-mer table corresponding to the target pathogen operation group records the k-mers in the target pathogen operation group that satisfy the preset specific conditions, that is, the specific k-mer.
  • the specific k-mer is a k-mer selected from the k-mers that meets the preset specificity conditions. The selection of a specific k-mer must meet the following two conditions:
  • the target pathogen operation group contains N genomes, and the number of occurrences of a k-mer in the genome occurrence index table corresponding to the target pathogen operation group is C1; the condition C1 / N + P1 ≥ 1 must then be met. That is, the sum of the first threshold P1 and the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operation group to the number of genomes included in the target pathogen operation group is greater than or equal to 1, where the first threshold P1 is usually less than 5%.
  • the number of occurrences of the k-mer in the genome occurrence index table of the complete set is C2; the condition C1 / C2 + P2 ≥ 1 must then be met. That is, the sum of the second threshold P2 and the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operating group to the number of occurrences in the genome occurrence index table of the complete set is greater than or equal to 1.
  • the second threshold value P2 is usually less than 5%.
  • the first threshold value P1 and the second threshold value P2 may be equal to or different from each other.
  • the two parameters, the first threshold P1 and the second threshold P2, are added to allow an error rate within a certain range, that is, to allow non-specificity of the specific k-mer within a certain range. Without these two parameters, non-specificity within a certain range cannot be allowed, and it is often difficult to find a specific k-mer for a certain pathogen operating group.
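  • The two selection conditions, C1 / N + P1 ≥ 1 and C1 / C2 + P2 ≥ 1, translate directly into a filter over the two occurrence tables. The sketch below is a minimal illustration; the function name, the dictionary representation of the tables, and the default thresholds of 0.04 (chosen only to respect "usually less than 5%") are assumptions.

```python
def select_specific_kmers(target_counts: dict[str, int],
                          full_counts: dict[str, int],
                          n_target_genomes: int,
                          p1: float = 0.04,
                          p2: float = 0.04) -> set[str]:
    """Keep k-mers that satisfy C1/N + P1 >= 1 and C1/C2 + P2 >= 1.

    target_counts: k-mer -> C1, genomes of the target group containing it.
    full_counts:   k-mer -> C2, genomes of the complete set containing it.
    The target group is part of the complete set, so C2 >= C1 >= 1.
    """
    specific = set()
    for kmer, c1 in target_counts.items():
        c2 = full_counts[kmer]
        present_in_group = c1 / n_target_genomes + p1 >= 1   # first condition
        rare_elsewhere = c1 / c2 + p2 >= 1                   # second condition
        if present_in_group and rare_elsewhere:
            specific.add(kmer)
    return specific


# Toy counts (hypothetical) for a target group of N = 100 genomes.
target_counts = {"kmerA": 98, "kmerB": 98, "kmerC": 90}
full_counts = {"kmerA": 100, "kmerB": 120, "kmerC": 90}
specific = select_specific_kmers(target_counts, full_counts, n_target_genomes=100)
```

  • In this toy case `kmerA` is kept, `kmerB` is rejected because it appears in too many genomes outside the group, and `kmerC` is rejected because it is missing from too many genomes inside the group.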
  • the probability of false positives for the pathogen operation group is actually less than or equal to P 1 n ' (that is, the power of n' to P 2 ).
  • the false negative rate refers to the proportion of positives that produce a negative test result, that is, the conditional probability of a negative test result given that the condition being tested for is present.
  • Step 1304 Determine a detection target of the target pathogen operation group.
  • step 1304 includes:
  • each genome included in the target pathogen operating group is used as a reference genome in turn, and each specific k-mer included in the target pathogen operating group is mapped to the reference genome.
  • each genome included in the target pathogen operating group is used as a reference genome in turn, and each specific k-mer in the specific k-mer table corresponding to the target pathogen operating group is mapped to the reference genome. Because the specific k-mers are selected in advance under preset specificity conditions that tolerate a bounded error rate, some specific k-mers may not map to a given genome.
  • the specific k-mer successfully mapped to the reference genome can be used as the specific k-mer included in the reference genome.
  • Step 1304B sequentially select a region from the reference genome and compare it with the specific k-mer.
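  • when the similarity between the selected region and the specific k-mer reaches a preset similarity threshold, the specific k-mer is taken as a specific k-mer contained in the reference genome.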
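  • For the exact-match case (a similarity threshold of 100%), mapping the specific k-mers onto a reference genome amounts to sliding a window of length k over the genome. The sketch below assumes all specific k-mers share the same length and ignores the reverse strand; the function name is hypothetical.

```python
def map_specific_kmers(reference: str, specific_kmers: set[str]) -> dict[str, list[int]]:
    """Return, for each specific k-mer found in the reference genome, the
    0-based start positions of its exact occurrences."""
    positions: dict[str, list[int]] = {}
    if not specific_kmers:
        return positions
    k = len(next(iter(specific_kmers)))  # assumption: all k-mers have length k
    for start in range(len(reference) - k + 1):
        window = reference[start:start + k]
        if window in specific_kmers:
            positions.setdefault(window, []).append(start)
    return positions


hits = map_specific_kmers("AACGTACGTT", {"ACGT", "GGGG"})
```

  • k-mers that never match, such as `GGGG` here, are simply absent from the result, which corresponds to the case in which a specific k-mer cannot be mapped to a particular genome.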
  • step 1304C two specific k-mers included in the reference genome are sequentially selected for detection.
  • step 1304D when it is detected that the distance between the selected two specific k-mers on the reference genome is less than a preset distance threshold, the selected two specific k-mers are replaced to obtain a non-overlapping specific region.
  • Step 1304E Generate a set of non-overlapping specific regions corresponding to each genome from each of the non-overlapping-specific regions in each genome obtained.
  • two specific k-mers included in the genome can be selected in turn to detect whether their distance on the genome is less than a preset distance threshold; if so, the two selected specific k-mers are replaced.
  • the replacement may consist of substituting the smallest region that can cover the two selected specific k-mers for those two k-mers. That is, a single region replaces the two specific k-mers, and this region is a non-overlapping specific region obtained from the two specific k-mers.
  • the preset distance threshold can be a negative number or 0, and is generally set to an integer less than 5.
  • when the distance between the two selected specific k-mers on the genome is 0, the two specific k-mers are directly adjacent to each other.
  • when the distance is negative, the two selected specific k-mers overlap by a certain number of base pairs.
  • the length of the obtained non-overlapping specific region is not limited. If the distance between the selected two specific k-mers on the genome is not less than a preset distance threshold, no processing is required. After doing this for each genome, a set of non-overlapping specific regions contained in each genome can be obtained. The set of non-overlapping specific regions corresponding to each genome contains the non-overlapping specific regions in the genome.
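  • The replacement procedure behaves like interval merging with a gap tolerance. The sketch below is one possible reading: k-mers are given by their start positions on one genome, the distance between two neighbours is taken as the start of the second minus the end of the first (0 for adjacent, negative for overlapping), and any two neighbours closer than the threshold are replaced by the smallest region covering both. The function name and the half-open coordinate convention are assumptions.

```python
def merge_specific_kmers(starts: list[int], k: int,
                         distance_threshold: int = 5) -> list[tuple[int, int]]:
    """Merge specific k-mers into non-overlapping specific regions.

    Regions are half-open intervals (start, end). A k-mer whose distance to
    the previous region is below the threshold is absorbed into that region;
    otherwise it starts a new region of length k.
    """
    regions: list[list[int]] = []
    for start in sorted(starts):
        end = start + k
        if regions and start - regions[-1][1] < distance_threshold:
            regions[-1][1] = max(regions[-1][1], end)  # smallest covering region
        else:
            regions.append([start, end])
    return [(s, e) for s, e in regions]


merged_default = merge_specific_kmers([0, 1, 2, 10, 30], k=4)
merged_strict = merge_specific_kmers([0, 1, 2, 10, 30], k=4, distance_threshold=0)
```

  • With the threshold set to 0 only overlapping k-mers are merged; with a threshold of 5 the k-mer at position 10, which lies 4 bases after the first region, is absorbed as well, together with the intervening sequence.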
  • Step 1304F Obtain the number of occurrences of each non-overlapping specific region included in the non-overlapping specific region set corresponding to each genome in the entire non-overlapping specific region set.
  • the number of occurrences, in the full set of non-overlapping specific regions, of each non-overlapping specific region in the set of each genome can be obtained.
  • the set of non-overlapping specific regions corresponding to each genome can be regarded as a small set, and the small sets of all genomes together form the full set of non-overlapping specific regions; that is, the full set is composed of multiple small sets.
  • the full set of non-overlapping specific regions contains all non-overlapping specific regions of all genomes, so the number of occurrences in the full set can be obtained for each non-overlapping specific region in the set of each genome.
  • for example, if the target pathogen operating group contains N genomes, N sets of non-overlapping specific regions corresponding to the N genomes are obtained.
  • a non-overlapping specific region does not necessarily appear in every genome, so the number of occurrences of each non-overlapping specific region in the full set of non-overlapping specific regions is generally less than or equal to N.
  • step 1304G a non-overlapping specific region with a number of occurrences exceeding a preset number of times is selected as a detection target of the target pathogen operation group.
  • a non-overlapping specific region whose number of occurrences exceeds a preset number-of-times threshold can be selected as a detection target of the target pathogen operation group.
  • the preset number of times threshold can be set by technicians according to actual project requirements.
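  • Counting occurrences across the per-genome sets and applying the threshold can be sketched as follows. The example assumes each non-overlapping specific region is identified by its sequence (or any hashable key) and that "exceeding" means strictly greater than; the function name and toy data are hypothetical.

```python
from collections import Counter


def select_detection_targets(region_sets: list[set[str]], threshold: float) -> set[str]:
    """Return the regions that occur in more than `threshold` of the per-genome sets."""
    counts: Counter = Counter()
    for regions in region_sets:
        counts.update(set(regions))  # each genome contributes at most 1 per region
    return {region for region, count in counts.items() if count > threshold}


# Three genomes (hypothetical); region "r1" is shared by all of them.
region_sets = [{"r1", "r2"}, {"r1"}, {"r1", "r3"}]
targets = select_detection_targets(region_sets, threshold=2)
```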
  • the selected non-overlapping specific regions may be multiple, or may correspond to multiple genomes.
  • the selected non-overlapping specific regions have the following characteristics: (1) their length is generally much longer than that of the k-mers in the characteristic target sequence set of the target pathogen operating group; (2) they appear in essentially all genomes included in the target pathogen operating group; (3) they essentially do not appear in genomes outside the target pathogen operating group. These characteristics meet the technical needs of the detection targets used in most diagnostic detection technologies. The selected non-overlapping specific regions therefore only need simple screening according to the technical requirements of the detection target of a specific diagnostic detection technology (such as conditions on length, GC content percentage, and annealing temperature).
  • the non-overlapping specific regions that pass this screening constitute the final set of detection targets that can be used to detect the target pathogen operating group. Based on the sequences in the set of detection targets, users can synthesize and manufacture molecular probes suitable for the particular diagnostic detection technology.
  • step 1304H the selected non-overlapping specific region with a number of occurrences exceeding a preset number threshold is taken as the representative specific region, and the genome containing the most non-overlapping specific region with the number of occurrences exceeding a preset number threshold is taken as a representative genome.
  • Step 1304I The representative specific region set corresponding to the representative genome is used as the PCR specific region set.
  • the representative genome can be selected based on how many of the selected non-overlapping specific regions each genome contains. The number of non-overlapping specific regions whose number of occurrences exceeds the preset number-of-times threshold is counted for each genome, and the genome containing the largest number of such regions is selected as the representative genome. In fact, the non-overlapping specific regions contained in the representative genome are the detection targets identified for the target pathogen operating group.
  • a representative specific region is a non-overlapping specific region, selected from all sets of non-overlapping specific regions, whose number of occurrences exceeds the preset number-of-times threshold; there may be multiple representative specific regions.
  • the representative genome is a genome selected from the multiple genomes contained in the target pathogen operating group, so it has its own corresponding set of representative specific regions. After the representative genome is selected, the set of representative specific regions corresponding to it can be used as the PCR (polymerase chain reaction) specific region set.
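  • Selecting the representative genome reduces to counting, for each genome, how many representative specific regions it contains. The sketch below is illustrative; the function name, the alphabetical tie-break, and the toy data are assumptions that the text does not specify.

```python
def pick_representative_genome(region_sets: dict[str, set[str]],
                               representative_regions: set[str]) -> tuple[str, set[str]]:
    """Return the genome containing the most representative specific regions,
    together with the representative regions it contains.

    Ties are broken by genome name (assumption made for determinism).
    """
    best = max(sorted(region_sets),
               key=lambda name: len(region_sets[name] & representative_regions))
    return best, region_sets[best] & representative_regions


region_sets = {
    "genomeA": {"r1", "r2"},
    "genomeB": {"r1", "r2", "r3"},
    "genomeC": {"r3", "r9"},
}
genome, pcr_specific_regions = pick_representative_genome(region_sets, {"r1", "r2", "r3"})
```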
  • the gene annotation information of each genome can be obtained from the target database.
  • the gene annotation information is information indicating the position and function of each gene in a genome. Therefore, the gene annotation information of each genome includes the position and corresponding function information of each known functional region on each genome.
  • the GenBank gene annotation information of the genome is obtained through the NCBI's GenBank database, or the genome annotation information of the genome is obtained through the Ensembl database.
  • the type of gene annotation information includes the position of any known functional region on the genome and the functional information corresponding to the region.
  • Positions include start and stop positions, plus and minus strands, sequences, etc.
  • Functional information refers to, for example, genes encoding proteins, genes encoding microRNA, regions encoding promoters, regions encoding regulatory proteins for recognition and binding, and replication initiation regions.
  • each non-overlapping specific region included in each representative genome can be screened.
  • the representative specific region is a non-overlapping specific region whose number of occurrences exceeds the preset number-of-times threshold.
  • a representative specific region may be selected in turn and compared with the known functional regions in all genomes of the target pathogen operating group to judge whether the selected representative specific region significantly overlaps a known functional region, that is, whether the degree of overlap of the two sequences reaches a preset coincidence threshold.
  • when the degree of coincidence between the selected representative specific region and the known functional regions is lower than the preset coincidence threshold, the representative specific region is considered to have no biological function, and such representative specific regions can be removed.
  • the preset coincidence threshold for significant coincidence may be any of the following: the length of the coincident region exceeds a threshold T1, for example 12bp; the length of the coincident region as a percentage of the length of the specific region exceeds a threshold T2, for example 30%; the length of the coincident region as a percentage of the length of the relevant functional region exceeds a threshold T3, for example 30%; or the total length of all functional regions contained in the specific region as a percentage of the length of the specific region exceeds a threshold T4, for example 30%.
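  • The first three overlap criteria (T1, T2, T3) can be evaluated for a single pair of intervals as sketched below. Regions are assumed to be half-open (start, end) coordinates on the same genome and strand; T4, which aggregates over all functional regions, is omitted for brevity. The function names are hypothetical.

```python
Interval = tuple[int, int]  # half-open (start, end) on one genome


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def significantly_overlaps(region: Interval, functional: Interval,
                           t1: int = 12, t2: float = 0.30, t3: float = 0.30) -> bool:
    """True if any of the T1, T2, or T3 criteria is exceeded."""
    shared = overlap_length(region, functional)
    return (
        shared > t1                                       # absolute length, T1
        or shared / (region[1] - region[0]) > t2          # share of specific region, T2
        or shared / (functional[1] - functional[0]) > t3  # share of functional region, T3
    )
```

  • For instance, a specific region (100, 200) and a gene at (190, 400) share only 10 bases and fail all three criteria, so under the rule above that region would be treated as having no known biological function.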
  • Step 1304J Select non-overlapping specific regions from the set of PCR specific regions that meet the preset distance range to generate a set of eligible PCR-specific region pairs.
  • the preset distance range D can be (MD-SD, MD + SD). MD can be set to about 1000bp, SD can be set to about 500bp.
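  • Pair selection under the preset distance range can be sketched as a scan over all pairs of regions on the representative genome. The distance is assumed here to be the gap between the end of the upstream region and the start of the downstream region, which the text does not define explicitly; the function name is hypothetical.

```python
from itertools import combinations

Interval = tuple[int, int]  # half-open (start, end) on the representative genome


def pcr_region_pairs(regions: list[Interval],
                     md: int = 1000, sd: int = 500) -> list[tuple[Interval, Interval]]:
    """Return region pairs whose gap lies in the open interval (MD-SD, MD+SD)."""
    pairs = []
    for left, right in combinations(sorted(regions), 2):
        gap = right[0] - left[1]  # assumption: end of left to start of right
        if md - sd < gap < md + sd:
            pairs.append((left, right))
    return pairs


pairs = pcr_region_pairs([(0, 100), (900, 1000), (3000, 3100)])
```

  • The quadratic scan is adequate for the modest number of regions on one genome; the sequence spanned by each returned pair would then serve as an interval to be detected.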
  • Step 1304K Select the two non-overlapping specific regions of a PCR-specific region pair from the set of eligible PCR-specific region pairs, locate them on the representative genome, and use the sequence corresponding to the interval that the two regions delimit on the representative genome as the interval to be detected.
  • step 1304L screening is performed for each interval to be detected to obtain a final detection primer pair set.
  • the two PCR-specific regions of a PCR-specific region pair are selected from the set of eligible PCR-specific region pairs and located on the representative genome. The two selected PCR-specific regions delimit an interval on the representative genome; the sequence corresponding to this interval is obtained and taken as an interval to be detected of the representative genome.
  • the two PCR-specific regions of each PCR-specific region pair in the set of eligible PCR-specific region pairs are selected and located on the genome in this way, so that each corresponding interval to be detected is obtained.
  • these intervals form the set of intervals to be detected; that is, the set of intervals to be detected includes one or more intervals to be detected.
  • PCR primer tools can be used to screen each detection interval in the set of detection intervals.
  • the PCR primer tool may be Primer3.
  • a part of the interval to be detected can be selected, and one or more candidate PCR primer pair sets can be obtained.
  • a pair of PCR primers automatically generated by most existing automatic PCR primer generation tools can only satisfy the specificity of the primers in the region to be detected, and cannot guarantee the specificity in other regions.
  • some automatic PCR primer generation tools cannot take into account the annotation of specific regions within the interval to be detected. It is therefore necessary to further determine the specificity of the primer pairs in the obtained candidate PCR primer pair set. Specific primer pairs for the target pathogen operating group can be selected from the obtained candidate PCR primer pair set, and a set of specific primer pairs corresponding to the target pathogen operating group can be generated.
  • the genomes in the complete set that do not belong to the target pathogen operating group can each be used as an alignment reference genome, and the two primers of a candidate PCR primer pair can be compared with the sequences in the alignment reference genome and mapped to the alignment reference genome.
  • when judging whether the mapping is successful, the primer pair can be compared with the sequence at the position where it maps on the alignment reference genome. When the sequence similarity reaches a preset similarity threshold, the primer pair is determined to be successfully located.
  • candidate PCR primer pairs that have been successfully located are determined not to be specific primer pairs corresponding to the target pathogen operating group, and these primer pairs are removed from the set of candidate PCR primer pairs.
  • a set of specific primer pairs corresponding to the target pathogen operating group can be generated based on the selected specific primer pairs. After selecting specific primer pairs for the target pathogen operating group, further screening can be performed. A primer pair that meets the preset primer conditions is selected from the specific primer pair set, and these selected primer pairs are used as the final detection primer pair to generate a corresponding final detection primer pair set.
  • meeting the preset primer conditions includes at least one of the following: the length of the primer is between 17 and 28bp; the annealing temperature of the primer is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG, or GC; there are no more than 3 G/C among the last 5 bases at the 3' end of the primer, and no more than 2 consecutive C or G among the last 5 bases at the 3' end; the primer contains no repeats or single-nucleotide repeats; there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
  • for a selected target pathogen operation group, the process in step 1302 needs to be run before the process in step 1304. If the pathogen genome data or the background genome data is updated, steps 1302 and 1304 need to be rerun.
  • although the steps in the flowcharts of FIG. 1 to FIG. 15 are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated in this document, the execution order of these steps is not strictly limited, and they can be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times. They are also not necessarily performed sequentially, but may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
  • a device for determining a detection target including:
  • a determining module 1602 configured to determine a target pathogen operation group to be detected
  • the specific k-mer acquisition module 1604 is used to acquire the specific k-mer included in the target pathogen operation group from the target database.
  • the specific k-mer is a k-mer that satisfies preset specificity conditions, where a k-mer refers to a genomic sequence of length k; the module is also used to determine the specific k-mers contained in each genome contained in the target pathogen operating group;
  • a non-overlapping specific region acquisition module 1606 is configured to process specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome.
  • the non-overlapping specific region set includes non-overlapping specific regions;
  • a detection target selection module 1608, configured to obtain the number of occurrences, in the full set of non-overlapping specific regions, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome; and to take the non-overlapping specific regions whose number of occurrences exceeds a preset number-of-times threshold as the detection targets of the target pathogen operation group.
  • the specific k-mer acquisition module 1604 is further configured to use each genome included in the target pathogen operation group as a reference genome in turn; map each specific k-mer included in the target pathogen operation group to the reference genome; and take each specific k-mer successfully mapped to the reference genome as a specific k-mer included in the reference genome.
  • the specific k-mer acquisition module 1604 is further configured to sequentially select a region from the reference genome for comparison with the specific k-mer; and when the selected region is detected with the specific k-mer When the similarity reaches a preset similarity threshold, the specific k-mer is used as the specific k-mer included in the reference genome.
  • the specific k-mer acquisition module 1604 is further configured to sequentially select, from the reference genome, a sequence with the same length as the specific k-mer and compare the selected sequence with the specific k-mer; and when the selected sequence is detected to be identical to the specific k-mer, take the specific k-mer as a specific k-mer included in the reference genome.
  • the above-mentioned device further includes a data creation module (not shown in the figure) for obtaining pre-selected genomes that meet a preset confidence condition as high-confidence genomes; and determining the high-confidence genomes included in each pathogen operation group as the genomes corresponding to that pathogen operating group.
  • satisfying the preset confidence condition includes any of the following: the proportion of non-deterministic characters contained in the genome sequence is lower than a preset proportion threshold; the number of sequence fragments belonging to the same chromosome in the genome sequence is below a preset fragment threshold; and, when the genome sequence is compared with all other genome sequences whose genetic distance to it falls within a preset genetic distance threshold range to determine the average coverage percentage of the full genome sequence in the similar genome sequences, the average coverage percentage is higher than a preset percentage value.
  • the above-mentioned non-overlapping specific region acquisition module 1606 is further configured to locate the specific k-mers included in each genome on the genome; sequentially select specific k-mers and/or non-overlapping specific regions included in each genome for detection; when it is detected that the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than a preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region; and obtain the set of non-overlapping specific regions corresponding to each genome according to the specific k-mers and the replaced non-overlapping specific regions.
  • the above-mentioned non-overlapping specific region acquisition module 1606 is further configured to select specific k-mers and/or non-overlapping specific regions contained in each genome for detection; when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than a preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region; and obtain the set of non-overlapping specific regions corresponding to each genome according to the finally retained specific k-mers and the replaced non-overlapping specific regions.
  • the above-mentioned non-overlapping specific region acquisition module 1606 is further configured to: when it is detected that the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than or equal to zero, replace the two selected specific k-mers with the smallest region that can cover them to obtain a non-overlapping specific region; and when it is detected that the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is greater than zero, obtain the sequence lying between the two selected specific k-mers and/or non-overlapping specific regions on the genome, splice the two selected specific k-mers and the intervening sequence in order to obtain a spliced sequence, and replace the two selected specific k-mers with the spliced sequence to obtain a non-overlapping specific region.
  • the preset distance threshold is less than 5.
  • the detection target selection module 1608 is further configured to pool the sets of non-overlapping specific regions corresponding to each genome included in the target pathogen operation group to obtain a union of non-overlapping specific regions; and to obtain the number of times that each non-overlapping specific region included in the set of non-overlapping specific regions corresponding to each genome appears in the union of non-overlapping specific regions.
  • the preset number threshold is (1 - Y) * N, where Y is the preset first condition threshold and N is the number of non-overlapping specific region sets.
  • the preset first condition threshold is less than 5%.
  • the above-mentioned device further includes a primer screening module (not shown in the figure), configured to take each selected non-overlapping specific region whose number of occurrences exceeds the preset number threshold as a representative specific region; form the total set of representative specific regions from the representative specific regions obtained; remove the representative specific regions that have no biological function from the total set to obtain the total set of representative specific regions with biological functions; and use the non-overlapping specific regions in the total set of representative specific regions with biological functions as detection targets for the target pathogen operation group.
  • the primer screening module is further configured to obtain, from the target database, the gene annotation information of each genome included in the target pathogen operation group, the gene annotation information including the location of each known functional region and the corresponding functional information; select representative specific regions in turn from the total set of representative specific regions and compare them with the known functional regions in the entire genome; and remove the representative specific regions whose overlap with the known functional regions is shorter than a preset overlap threshold, to obtain the total set of representative specific regions with biological functions.
  • the primer screening module is further configured to take the genome containing the largest number of non-overlapping specific regions whose occurrences exceed the preset number threshold as the representative genome; take the set of representative specific regions corresponding to the representative genome as the set of PCR representative specific regions; form a set of intervals to be detected from the intervals to be detected; and screen each interval to be detected in the set to obtain a final set of detection primer pairs.
  • the primer screening module is further configured to obtain the total set of representative specific regions from the selected non-overlapping specific regions whose occurrences exceed the preset number threshold; and select the genome that contains the most non-overlapping specific regions from the total set of representative specific regions as the representative genome.
  • the above primer screening module is further configured to obtain the position on the representative genome of each PCR representative specific region in the set of PCR representative specific regions; take two PCR representative specific regions whose distance falls within a preset distance range as a qualified PCR specific region pair; and generate a set of qualified PCR specific region pairs from the qualified PCR specific region pairs.
  • the preset distance range is greater than 500 bp and less than 1500 bp.
  • the primer screening module is further configured to take, from the positions on the representative genome of the two non-overlapping specific regions in a selected qualified PCR specific region pair, the two ends that are furthest apart as the boundaries of the interval to be detected; and take the sequence of the representative genome within the interval delimited by those boundaries as the interval to be detected.
  • the above primer screening module is further configured to mark the positions on the representative genome of the two selected non-overlapping specific regions and the positions of the non-specific regions between them.
  • the above primer screening module is further configured to use a PCR primer tool to screen each interval to be detected in the set of intervals to be detected to obtain a set of candidate PCR primer pairs; select from the set of candidate PCR primer pairs the primer pairs specific to the target pathogen operation group and generate the set of specific primer pairs corresponding to the target pathogen operation group; select the primer pairs in the set of specific primer pairs that meet the preset primer conditions as final detection primer pairs; and generate the final set of detection primer pairs from the final detection primer pairs.
  • the above primer screening module is further configured to obtain a complete set from the target database, the complete set containing a plurality of collected high-confidence genomes; take the genomes in the complete set that are not included in the target pathogen operation group as alignment reference genomes; select primer pairs in turn from the set of candidate PCR primer pairs and locate them on the alignment reference genomes; compare the selected primer pair with the sequence at the located position on the alignment reference genome and, when the sequence similarity reaches a preset similarity threshold, determine that the primer pair has been successfully located; remove, from the primer pairs determined to be successfully located, those that satisfy the preset alignment conditions, to obtain the specific primer pairs corresponding to the target pathogen operation group; and generate the set of specific primer pairs corresponding to the target pathogen operation group from the specific primer pairs.
  • the primer screening module is further configured to compare the selected primer pair with the sequence at the located position on the alignment reference genome and, when the sequence similarity reaches a preset similarity threshold, determine that the primer pair has been successfully located.
  • the preset alignment conditions include at least one of the following: the two primers of the selected primer pair are located on the same chromosome of the same genome; the distance between the two primers of the selected primer pair is within a preset distance range; a preset number of bases at the 3' end of either primer of the selected primer pair are identical to the bases at the located position on the alignment reference genome.
  • meeting the preset primer conditions includes at least one of the following: the primer length is between 17 and 28 bp; the primer annealing temperature is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG, or GC; the last 5 bases at the 3' end of the primer contain no more than 3 G/C bases and no more than 2 consecutive C or G; the primer contains no repeats or single-nucleotide repeats; and there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
  • a specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence index table corresponding to the target pathogen operation group meets a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to the target pathogen operation group, together with its number of occurrences in the genome occurrence index table of the complete set, meets a second preset error condition; the genome occurrence index table corresponding to the target pathogen operation group records, for each k-mer, the number of genomes in the target pathogen operation group that contain it; the genome occurrence index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain it.
  • the first preset error condition is: the sum of the first threshold and the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operation group to the number of genomes included in the target pathogen operation group is greater than or equal to 1.
  • the first threshold is less than 5%.
  • the second preset error condition is: the sum of the second threshold and the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set is greater than or equal to 1.
  • the second threshold is less than 5%.
  • Each module in the above device for determining a detection target can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in hardware form in, or be independent of, the processor in the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 17.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the database of the computer equipment is used to store data for determining the detection target.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a method for determining a detection target.
  • FIG. 17 is only a block diagram of a part of the structure related to the scheme of the present application, and does not constitute a limitation on the computer equipment to which the scheme of the present application is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions which, when executed by the one or more processors, implement the steps of the method for determining a detection target provided in any embodiment of the present application.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
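Among the items listed above, the preset primer conditions are concrete enough to sketch as a simple check. The Python below is a minimal illustration only: the function name `check_primer`, the inclusive treatment of the stated ranges, and the reading of "no more than 2 consecutive C or G" as "no run of three or more C/G bases" are assumptions, and the annealing temperature, repeat, and complementarity conditions are deliberately left out because the text does not say how they are computed.

```python
import re


def check_primer(primer):
    """Evaluate a subset of the preset primer conditions for one primer.

    Returns a dict mapping a hypothetical condition name to True/False.
    Annealing temperature, repeats, and complementarity are not checked here.
    """
    seq = primer.upper()
    if not seq:
        raise ValueError("primer must not be empty")
    tail = seq[-5:]  # last 5 bases at the 3' end
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    return {
        "length_17_to_28": 17 <= len(seq) <= 28,
        "gc_40_to_60_percent": 0.40 <= gc_fraction <= 0.60,
        # Ending in C, G, CG, or GC all imply that the final base is C or G.
        "three_prime_end_is_c_or_g": seq[-1] in "GC",
        "at_most_3_gc_in_last_5": sum(base in "GC" for base in tail) <= 3,
        "no_gc_run_over_2_in_last_5": re.search(r"[GC]{3,}", tail) is None,
    }


good = check_primer("ATGCATGCATGCATATGC")
bad = check_primer("ATATATATATATATATAT")
```

Because the text says that meeting the preset primer conditions "includes at least one of the following", the function reports each condition separately instead of combining them into a single verdict.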

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed is a method for determining a target to be detected, comprising: determining a target pathogen operation group to be detected (102); obtaining specific k-mers contained in the target pathogen operation group from the target database (104); determining the specific k-mers contained in each genome contained in the target pathogen operation group (106); processing the specific k-mers contained in each genome to obtain sets of non-coincident specific regions corresponding to each genome (108); obtaining the number of occurrences of each non-coincident specific region contained in the sets of non-coincident specific regions corresponding to each genome in all the sets of non-coincident specific regions (110); selecting a non-coincident specific region that occurs more than a preset number threshold as the target to be detected of the target pathogen operation group (112).

Description

Method, apparatus, computer device and storage medium for determining a detection target
This application claims priority to Chinese patent application No. 2018106516939, filed with the Chinese Patent Office on June 22, 2018 and entitled "Method, apparatus, computer device and storage medium for determining a detection target", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to a method, an apparatus, a computer device, and a storage medium for determining a detection target.
Background
A target is a particular nucleic acid fragment that can be used in PCR (Polymerase Chain Reaction), antibody-antigen reactions, hybridization probe reactions, and the like.
In conventional techniques, determining a specific detection target (i.e., a characteristic target) for a given pathogen or pathogen operation group requires careful, detailed, long-term study of many indicators of that pathogen or pathogen operation group, such as its physiology, metabolism, and genetics, which is time-consuming and inefficient. Some newer technical solutions can already find potential detection targets for a given pathogen or pathogen operation group by analyzing genome data. However, these solutions tend to impose strict requirements, for example that a characteristic target exist in only one species operation group and in the genome of every individual in that group. Such rigid requirements, which leave no flexibility, make it very difficult to find a characteristic target in a pathogen or pathogen operation group.
Summary of the Invention
According to various embodiments disclosed in the present application, a method, an apparatus, a computer device, and a storage medium for determining a detection target are provided.
A method for determining a detection target, including:
determining a target pathogen operation group to be detected;
obtaining, from a target database, the specific k-mers included in the target pathogen operation group, where a specific k-mer is a k-mer that satisfies a preset specificity condition and a k-mer is a genome sequence of length k;
determining the specific k-mers contained in each genome included in the target pathogen operation group;
processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions;
obtaining the number of occurrences, in all of the sets of non-overlapping specific regions, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome; and
selecting the non-overlapping specific regions whose number of occurrences exceeds a preset number threshold as the detection targets of the target pathogen operation group.
An apparatus for determining a detection target, including:
a determination module, configured to determine a target pathogen operation group to be detected;
a specific k-mer acquisition module, configured to obtain, from a target database, the specific k-mers included in the target pathogen operation group, where a specific k-mer is a k-mer that satisfies a preset specificity condition and a k-mer is a genome sequence of length k, and to determine the specific k-mers contained in each genome included in the target pathogen operation group;
a non-overlapping specific region acquisition module, configured to process the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions; and
a detection target selection module, configured to obtain the number of occurrences, in all of the sets of non-overlapping specific regions, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome, and to select the non-overlapping specific regions whose number of occurrences exceeds a preset number threshold as the detection targets of the target pathogen operation group.
A computer device, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to perform the following steps: determining a target pathogen operation group to be detected;
obtaining, from a target database, the specific k-mers included in the target pathogen operation group, where a specific k-mer is a k-mer that satisfies a preset specificity condition and a k-mer is a genome sequence of length k;
determining the specific k-mers contained in each genome included in the target pathogen operation group;
processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions;
obtaining the number of occurrences, in all of the sets of non-overlapping specific regions, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome; and
selecting the non-overlapping specific regions whose number of occurrences exceeds a preset number threshold as the detection targets of the target pathogen operation group.
One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: determining a target pathogen operation group to be detected;
obtaining, from a target database, the specific k-mers included in the target pathogen operation group, where a specific k-mer is a k-mer that satisfies a preset specificity condition and a k-mer is a genome sequence of length k;
determining the specific k-mers contained in each genome included in the target pathogen operation group;
processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions;
obtaining the number of occurrences, in all of the sets of non-overlapping specific regions, of each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome; and
selecting the non-overlapping specific regions whose number of occurrences exceeds a preset number threshold as the detection targets of the target pathogen operation group.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for determining a detection target according to one or more embodiments.
FIG. 2 is a schematic flowchart of the steps before step 102 according to one or more embodiments.
FIG. 3 is a schematic flowchart of step 108 according to one or more embodiments.
FIG. 4 is a schematic flowchart of step 306 according to one or more embodiments.
FIG. 5 is a schematic flowchart of the steps after step 306 according to one or more embodiments.
FIG. 6 is a schematic flowchart of step 110 according to one or more embodiments.
FIG. 7 is a schematic flowchart of the steps after step 112 according to one or more embodiments.
FIG. 8 is a schematic flowchart of step 704 according to one or more embodiments.
FIG. 9 is a schematic flowchart of the steps after step 112 according to one or more other embodiments.
FIG. 10 is a schematic flowchart of step 906 according to one or more embodiments.
FIG. 11 is a schematic flowchart of step 912 according to one or more embodiments.
FIG. 12 is a schematic flowchart of step 1104 according to one or more embodiments.
FIG. 13 is a schematic flowchart of a method for determining a detection target according to one or more other embodiments.
FIG. 14 is a schematic flowchart of step 1302 according to one or more embodiments.
FIG. 15 is a schematic flowchart of step 1304 according to one or more embodiments.
FIG. 16 is a structural block diagram of an apparatus for determining a detection target according to one or more embodiments.
FIG. 17 is an internal structural diagram of a computer device according to one or more embodiments.
Detailed Description
To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it.
In one embodiment, as shown in FIG. 1, a method for determining a detection target is provided, including the following steps:
Step 102: determine the target pathogen operation group to be detected.
A pathogen operation group may represent a genetic unit or taxonomic unit at any of several classification levels, such as a species, a subspecies, a subtype, a bacterial or viral strain, or a genus, and may include one or more related genomes. The target pathogen operation group is the pathogen operation group to be detected. For example, if the pathogen operation group to be detected is Staphylococcus aureus, the target pathogen operation group in step 102 is Staphylococcus aureus.
Step 104: obtain, from the target database, the specific k-mers included in the target pathogen operation group, where a specific k-mer is a k-mer that satisfies a preset specificity condition and a k-mer is a genome sequence of length k.
The target database stores a characteristic target sequence set established in advance for each pathogen operation group, and the characteristic target sequence set of each pathogen operation group contains the specific k-mers of that group. The specific k-mers included in the target pathogen operation group can therefore be obtained from the target database. A specific k-mer is a k-mer selected from the k-mers of the target pathogen operation group because it satisfies the preset specificity condition. The preset specificity condition is set in advance by a technician for selecting qualifying k-mers and may be chosen according to the technician's judgment or the needs of the actual project.
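The preset specificity condition can be made concrete with the two preset error conditions listed in the summary of the apparatus above. The following Python sketch reads each condition as "ratio plus threshold is at least 1". That reading, the function name `is_specific_kmer`, its parameters, and the 4% default thresholds (chosen only because the summary states that both thresholds are less than 5%) are assumptions made for illustration, not the applicant's implementation.

```python
def is_specific_kmer(count_in_group, group_genome_total, count_in_complete_set,
                     first_threshold=0.04, second_threshold=0.04):
    """Check one k-mer against the two preset error conditions (assumed reading).

    count_in_group:        genomes of the target group that contain the k-mer
    group_genome_total:    number of genomes in the target group
    count_in_complete_set: genomes of the complete set that contain the k-mer
    """
    if group_genome_total <= 0 or count_in_complete_set <= 0:
        return False
    # Condition 1: the k-mer is present in nearly every genome of the group.
    coverage = count_in_group / group_genome_total
    # Condition 2: nearly all genomes that contain the k-mer belong to the group.
    exclusivity = count_in_group / count_in_complete_set
    return (coverage + first_threshold >= 1
            and exclusivity + second_threshold >= 1)


widespread_and_exclusive = is_specific_kmer(99, 100, 100)
missing_from_many_genomes = is_specific_kmer(90, 100, 90)
common_outside_the_group = is_specific_kmer(100, 100, 200)
```

Allowing a small threshold in both conditions is what gives the selection its tolerance: a k-mer may be absent from a few genomes of the group, or present in a few genomes outside it, and still qualify.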
A k-mer is a genome sequence of length k, where k is a natural number. If a type of genome data has a distinct deterministic characters in total (a being the size of the alphabet), then for a given k there are a^k possible distinct k-mers. For DNA or RNA (ribonucleic acid) sequences, the deterministic characters are the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, they are the defined amino acid characters.
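A short Python sketch of this definition follows. It counts the possible k-mers for a given alphabet and lists the k-mers actually present in a sequence; the function names and the four-letter DNA alphabet used in the example are illustrative assumptions.

```python
from itertools import product


def count_possible_kmers(alphabet_size, k):
    """Number of possible distinct k-mers: alphabet_size to the power k."""
    return alphabet_size ** k


def iter_kmers(sequence, k):
    """Yield (start, kmer) for every window of length k in the sequence."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


dna_alphabet = "ACGT"  # assumed alphabet for a DNA-only example
k = 3
all_3mers = ["".join(chars) for chars in product(dna_alphabet, repeat=k)]
observed = list(iter_kmers("ACGTAC", k))
```

A sequence of length n contains n - k + 1 windows, so the six-base example yields four 3-mers, while the alphabet allows 4^3 = 64 possible ones.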
Step 106: determine the specific k-mers contained in each genome included in the target pathogen operation group.
A pathogen operation group may include one or more related genomes, so the target pathogen operation group contains one or more related genomes, and each genome contains one or more k-mers. Because a specific k-mer is a k-mer that satisfies the preset specificity condition, each genome contains one or more specific k-mers. The specific k-mers contained in each genome of the target pathogen operation group to be detected can therefore be determined.
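One way to picture step 106 is to scan each genome and keep the windows that belong to the group's set of specific k-mers. The sketch below assumes that the specific k-mers are available as a Python set of strings and that genomes are plain strings; the function name and the toy sequences are hypothetical.

```python
def specific_kmers_in_genome(genome, specific_kmers, k):
    """Return (start, kmer) for every specific k-mer found in one genome."""
    hits = []
    for start in range(len(genome) - k + 1):
        kmer = genome[start:start + k]
        if kmer in specific_kmers:
            hits.append((start, kmer))
    return hits


# Toy target pathogen operation group with two genomes (made-up sequences).
group = {"genome_1": "AACGTTGCA", "genome_2": "TTACGTAGG"}
specific = {"ACGT", "TGCA"}
per_genome = {
    name: specific_kmers_in_genome(sequence, specific, 4)
    for name, sequence in group.items()
}
```

Recording the start position of each hit matters for the next step, because merging k-mers into regions depends on how far apart they lie on the genome.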
Step 108: process the specific k-mers contained in each genome to obtain the set of non-overlapping specific regions corresponding to each genome, the set containing non-overlapping specific regions.
Once the specific k-mers contained in each genome of the target pathogen operation group have been determined, the specific k-mers of each genome can be processed to obtain the set of non-overlapping specific regions corresponding to that genome. Each such set contains the one or more non-overlapping specific regions of its genome. A non-overlapping specific region differs from a specific k-mer in that a specific k-mer is length-limited, being a sequence of the specific length k, whereas a non-overlapping specific region has no restriction on its length.
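The processing in step 108 can be sketched as interval merging, following the rule given in the summary of the apparatus: two specific k-mers (or regions) closer than a preset distance threshold are replaced by one region, which is the smallest covering region when they overlap or touch and the two k-mers spliced with the intervening sequence when a gap separates them. In the Python below, the half-open interval representation, the function name, and the default threshold of 4 (the summary only states that the threshold is less than 5) are assumptions.

```python
def merge_specific_kmers(genome, kmer_starts, k, distance_threshold=4):
    """Merge specific k-mers of one genome into non-overlapping specific regions.

    kmer_starts holds 0-based start positions; each k-mer covers [start, start + k).
    Returns a list of (start, end, sequence) tuples with half-open coordinates.
    """
    regions = []
    for start in sorted(kmer_starts):
        end = start + k
        # Distance between this k-mer and the previous region:
        # <= 0 means overlapping or adjacent, > 0 means a gap of that many bases.
        if regions and start - regions[-1][1] < distance_threshold:
            # Both cases extend the region; taking the slice genome[start:end]
            # below automatically includes any intervening sequence.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [(s, e, genome[s:e]) for s, e in regions]


genome = "ACGTACGT" + "A" * 10 + "CCGG"
merged = merge_specific_kmers(genome, [0, 2, 6, 18], k=4)
spliced = merge_specific_kmers(genome, [0, 6], k=4)
```

In the first call, the k-mers at 0, 2, and 6 overlap or touch and collapse into one region, while the k-mer at 18 is too far away and stays separate. In the second call, a two-base gap is bridged by splicing.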
Step 110: obtain the number of occurrences, in all of the sets of non-overlapping specific regions, of each non-overlapping specific region contained in the set corresponding to each genome.
Once the set of non-overlapping specific regions corresponding to each genome has been obtained, the number of occurrences of each non-overlapping specific region of each set in all of the sets can be obtained. The set of each genome can be regarded as a small set, and the sets of all genomes together form the complete set of non-overlapping specific regions; that is, the complete set is made up of multiple small sets. The complete set contains the non-overlapping specific regions of all genomes included in the pathogen operation group, so the number of occurrences in the complete set of each non-overlapping specific region in each genome's own set can be obtained.
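The counting in step 110 can be sketched with a counter over the per-genome sets. The sketch assumes that a region is identified by its exact sequence and that each genome contributes at most one occurrence per region; the text does not state how regions from different genomes are matched, so both points are assumptions, as are the names used.

```python
from collections import Counter


def count_region_occurrences(region_sets):
    """Count, for each region, how many per-genome sets contain it.

    region_sets maps a genome name to an iterable of region sequences.
    """
    counts = Counter()
    for regions in region_sets.values():
        counts.update(set(regions))  # a genome counts once per region
    return counts


region_sets = {
    "genome_1": ["ACGTACGTAA", "CCGG"],
    "genome_2": ["ACGTACGTAA"],
    "genome_3": ["ACGTACGTAA", "CCGG", "TTTTGG"],
}
occurrences = count_region_occurrences(region_sets)
```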
Step 112: select the non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as detection targets of the target pathogen operation group.
After the number of occurrences of each non-overlapping specific region across all of the non-overlapping specific region sets has been obtained, the regions whose number of occurrences exceeds the preset count threshold can be selected as detection targets of the target pathogen operation group. The preset count threshold can be set by a technician according to actual project requirements. More than one non-overlapping specific region may be selected, and the selected regions may correspond to multiple genomes.
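Steps 110 and 112 can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `select_detection_targets`, the dictionary layout, and the choice to identify a region by its sequence string are all assumptions made for illustration.

```python
from collections import Counter

def select_detection_targets(region_sets, count_threshold):
    """Count each region across all per-genome sets and keep the frequent ones.

    region_sets: dict mapping a genome id to the set of non-overlapping
    specific regions of that genome (regions are plain sequence strings,
    which is an assumption of this sketch).
    """
    counts = Counter()
    for regions in region_sets.values():
        # A set contributes at most one occurrence per region.
        counts.update(set(regions))
    # Step 112 selects regions whose count exceeds the threshold.
    return {region: n for region, n in counts.items() if n > count_threshold}

# Hypothetical toy data: three genomes, three candidate regions.
region_sets = {
    "genome_1": {"ACGTAC", "TTGACA"},
    "genome_2": {"ACGTAC"},
    "genome_3": {"ACGTAC", "TTGACA", "GGGCCC"},
}
targets = select_detection_targets(region_sets, count_threshold=2)
```

With a threshold of 2, only the region present in all three sets is kept.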
In the above method for determining detection targets, the specific k-mers of the target pathogen operation group are obtained, and the non-overlapping specific regions that meet the preset count threshold are then selected, from the non-overlapping specific regions derived from those specific k-mers, as detection targets of the target pathogen operation group. Because the specific k-mers are determined by a probabilistic preset specificity condition, the set of non-overlapping specific regions is derived from the specific k-mers, and the qualifying non-overlapping specific regions are finally selected as detection targets, this technical solution greatly expands the search range for potential detection targets, makes the delimitation of that search range more flexible, and improves the efficiency of determining detection targets.
In one embodiment, the preset count threshold = (1-Y)*N, where Y is a preset first condition threshold and N is the number of non-overlapping specific region sets.
When some of the non-overlapping specific regions are selected from multiple non-overlapping specific regions as detection targets of the target pathogen operation group, the regions selected are those whose number of occurrences across all non-overlapping specific region sets reaches the preset count threshold. The preset count threshold equals (1-Y)*N. Y is the preset first condition threshold; it can be set to less than 5%, and the specific value can be set by a technician according to the actual situation. N is the number of non-overlapping specific region sets; since each genome corresponds to one such set, N is in fact the number of genomes contained in the target pathogen operation group. In general, a non-overlapping specific region does not appear in every genome, so the number of occurrences of any non-overlapping specific region across all sets is generally less than or equal to N.
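As a small worked example of the formula (1-Y)*N, the sketch below computes the threshold for hypothetical values. The function name and the numbers (50 genomes, Y = 4%) are illustrative assumptions, not values from the application.

```python
def preset_count_threshold(y, n):
    """Return (1 - Y) * N, where Y is the first condition threshold and
    N is the number of non-overlapping specific region sets (genomes)."""
    if not 0 <= y < 1:
        raise ValueError("y must lie in [0, 1)")
    return (1 - y) * n

# Hypothetical numbers: 50 genomes (so 50 region sets) and Y = 4%.
threshold = preset_count_threshold(0.04, 50)

# A region found in 49 of the 50 sets reaches the threshold;
# a region found in only 40 sets does not.
reaches_49 = 49 >= threshold
reaches_40 = 40 >= threshold
```

Because Y is small, the threshold stays close to N, so a selected region must be present in nearly every genome of the group.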
In one embodiment, the preset first condition threshold is less than 5%.
In one embodiment, a specific k-mer is a k-mer of the pathogen operation group whose number of occurrences in the genome occurrence count index table of the target pathogen operation group satisfies a preset error condition.
The characteristic target sequence set corresponding to each pathogen operation group contains the specific k-mers of that group that satisfy the preset specificity condition. Further, the preset specificity condition means that the k-mer is one contained in the pathogen operation group whose number of occurrences in the genome occurrence count index table of the target pathogen operation group satisfies a preset error condition. The preset error condition is an error condition set in advance by a technician according to actual project requirements. It can be a range, which means that a k-mer selected as specific is allowed a certain amount of error rather than having to meet a strict objective criterion.
Each target pathogen operation group has its own genome occurrence count index table. From this table it can be determined in how many of the genomes of the target pathogen operation group each k-mer of the group appears. The k-mers whose number of occurrences in the table satisfies the preset error condition can then be selected and taken as the specific k-mers of the target pathogen operation group.
Because a certain amount of error is allowed when selecting specific k-mers, specific sequences representing the target pathogen operation group can be found with high probability within that error range. Since the specific k-mers are determined by a probabilistic preset specificity condition, the set of non-overlapping specific regions is derived from the specific k-mers, and the qualifying non-overlapping specific regions are finally selected as detection targets, this technical solution greatly expands the search range for potential detection targets, makes the delimitation of that search range more flexible, and improves the efficiency of determining detection targets.
In one embodiment, before step 102 above, the method further includes the following steps: generating a genome occurrence count index table corresponding to the target pathogen operation group, where the table records, for each k-mer, the number of genomes in the target pathogen operation group that contain that k-mer; and storing the genome occurrence count index table in the characteristic target sequence set corresponding to the target pathogen operation group.
A genome is all of the genetic information in an organism, stored in the form of nucleotide sequences. The totality of the genetic material within a single complete unit of an organism (for example an individual animal or plant, an animal or plant cell, or an individual bacterium) is its genome. Each pathogen operation group can contain multiple genomes, and each genome can contain multiple k-mers. The genome occurrence count index table corresponding to each pathogen operation group records in how many genomes of that group each k-mer of the group appears; that is, for each k-mer it records the number of genomes in the corresponding pathogen operation group that contain the k-mer.
What the genome occurrence count index table of the target pathogen operation group actually records is in how many genomes of the group each k-mer of the group appears. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. Once the data on how many genomes of the target pathogen operation group contain each k-mer have been obtained, the genome occurrence count index table for the group can be built. After the table has been built, it can be stored in the characteristic target sequence set corresponding to the target pathogen operation group, that is, in the target database. The table can then be retrieved from the database whenever it is needed, which improves detection efficiency.
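The counting rule described above (one count per genome, regardless of how many times the k-mer repeats inside that genome) can be sketched as follows. The function names and the in-memory dictionary of sequences are assumptions for illustration; the application does not prescribe a data structure.

```python
from collections import Counter

def iter_kmers(sequence, k):
    """Yield every substring of length k from the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def build_genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it.

    genomes: dict mapping a genome id to its sequence string.
    """
    index = Counter()
    for sequence in genomes.values():
        # Deduplicate within a genome so repeats are counted only once.
        index.update(set(iter_kmers(sequence, k)))
    return index

# Hypothetical toy genomes; "ACG" occurs twice in genome_1.
genomes = {"genome_1": "ACGACG", "genome_2": "ACGT"}
index = build_genome_occurrence_index(genomes, k=3)
```

Here `index["ACG"]` is 2 (two genomes), not 3 (three occurrences), which is the distinction the paragraph draws.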
In one embodiment, as shown in FIG. 2, before step 102 above, the method further includes the following steps:
Step 100: select, from the k-mers corresponding to the target pathogen operation group, the k-mers that satisfy the preset specificity condition.
Step 101: store the k-mers that satisfy the preset specificity condition in the characteristic target sequence set corresponding to the target pathogen operation group.
The target database stores the characteristic target sequence set corresponding to the target pathogen operation group, and this set contains the specific k-mers corresponding to the group. Specific k-mers are the k-mers, among those contained in the target pathogen operation group, that satisfy the preset specificity condition; once selected, they are stored in the characteristic target sequence set corresponding to the target pathogen operation group.
In one embodiment, a specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence count index table of the target pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence count index table of the target pathogen operation group, together with its number of occurrences in the genome occurrence count index table of the complete set, satisfies a second preset error condition. The genome occurrence count index table of the target pathogen operation group records, for each k-mer, the number of genomes in the group that contain it; the genome occurrence count index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain it.
In the target database, the target pathogen operation group has a corresponding characteristic target sequence set, and the specific k-mers in that set are k-mers that satisfy the preset specificity condition. The preset specificity condition comprises a first preset error condition and a second preset error condition. When a k-mer satisfies both, it is considered to satisfy the preset specificity condition and can be taken as a specific k-mer. Specifically, the number of occurrences of the k-mer in the genome occurrence count index table of the target pathogen operation group must satisfy the first preset error condition, and that number together with the number of occurrences of the k-mer in the genome occurrence count index table of the complete set must satisfy the second preset error condition. The complete set is the collection of all high-confidence genomes that have been gathered. It contains the genomes of the various pathogens as well as non-pathogen genomes, for example high-confidence genomes of commensal bacteria, probiotics, humans, animals, and plants. A high-confidence genome is a selected genome that satisfies a preset confidence condition.
The count recorded for each k-mer in the genome occurrence count index table of the complete set is the total number of genomes in the complete set in which the k-mer appears. If the k-mer appears several times in the same genome, it is still counted only once. The genome occurrence count index table of the target pathogen operation group records, for each k-mer, the number of genomes in the group that contain it, whereas the genome occurrence count index table of the complete set records the number of genomes in the complete set that contain it.
The selection of specific k-mers incorporates two parameters, the first preset error condition and the second preset error condition, and therefore tolerates non-specificity of the specific k-mers within a certain range. Without these two parameters, no non-specificity could be tolerated, and it would often be difficult to find any specific k-mer for a pathogen operation group. By selecting specific k-mers in a way that permits a certain error and building the characteristic target sequence set from them, specific targets that can represent the pathogen operation group can be found with high probability.
In one embodiment, the first preset error condition is: the ratio of the number of occurrences in the genome occurrence count index table of the target pathogen operation group to the number of genomes contained in the group, plus a first threshold, is greater than or equal to 1.
The first preset error condition means that the ratio of the number of occurrences recorded in the genome occurrence count index table of the target pathogen operation group to the number of genomes contained in the group, plus the first threshold, is greater than or equal to 1. Suppose the target pathogen operation group contains N genomes, a given k-mer has C1 occurrences in the group's genome occurrence count index table, and the first threshold is P1. The first preset error condition is then C1/N + P1 ≥ 1. The first threshold P1 represents an acceptable error probability and can be any value between 0 and 1; it can be set by a technician according to the actual project.
In one embodiment, the first threshold is less than 5%.
The first threshold is an acceptable error probability. It can be any value between 0 and 1, and can be set to a value less than 5%.
In one embodiment, the second preset error condition is: the ratio of the number of occurrences in the genome occurrence count index table of the target pathogen operation group to the number of occurrences in the genome occurrence count index table of the complete set, plus a second threshold, is greater than or equal to 1.
The second preset error condition means that the ratio of the number of occurrences recorded in the genome occurrence count index table of the target pathogen operation group to the number of occurrences in the genome occurrence count index table of the complete set, plus the second threshold, is greater than or equal to 1. Suppose a given k-mer has C1 occurrences in the genome occurrence count index table of the target pathogen operation group and C2 occurrences in the genome occurrence count index table of the complete set, and the second threshold is P2. The second preset error condition is then C1/C2 + P2 ≥ 1. Like the first threshold, the second threshold represents an acceptable error probability and can be any value between 0 and 1; P2 can likewise be set by a technician according to the actual project.
In one embodiment, the second threshold is less than 5%.
Like the first threshold, the second threshold is an acceptable error probability. It can be any value between 0 and 1, and can be set to a value less than 5%. The first threshold and the second threshold may be equal or unequal.
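The two error conditions, C1/N + P1 ≥ 1 and C1/C2 + P2 ≥ 1, can be combined into a single specificity check. The following is a sketch under stated assumptions: the function names are hypothetical, and the handling of C2 = 0 (treated as not specific) is a choice made here, not something the application specifies.

```python
def meets_first_condition(c1, n, p1):
    """C1/N + P1 >= 1: the k-mer is present in nearly all genomes of the group."""
    return c1 / n + p1 >= 1

def meets_second_condition(c1, c2, p2):
    """C1/C2 + P2 >= 1: nearly all genomes containing the k-mer belong to the group."""
    # Assumption of this sketch: a k-mer absent from the complete set fails.
    return c2 > 0 and c1 / c2 + p2 >= 1

def is_specific_kmer(c1, c2, n, p1, p2):
    """A k-mer is specific when both preset error conditions hold.

    Note: values lying exactly on the boundary may be affected by
    floating-point rounding; the examples below avoid that case.
    """
    return meets_first_condition(c1, n, p1) and meets_second_condition(c1, c2, p2)

# Hypothetical numbers: a group of 100 genomes, P1 = P2 = 0.04.
sensitive_and_exclusive = is_specific_kmer(c1=98, c2=100, n=100, p1=0.04, p2=0.04)
too_rare_in_group = is_specific_kmer(c1=90, c2=90, n=100, p1=0.04, p2=0.04)
too_common_elsewhere = is_specific_kmer(c1=98, c2=120, n=100, p1=0.04, p2=0.04)
```

Read this way, the first condition bounds how many genomes of the group may lack the k-mer, and the second bounds how many genomes outside the group may contain it.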
In one embodiment, before the sequencing data of the sample are obtained, the method further includes: generating a genome occurrence count index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain it; and storing the genome occurrence count index table of the complete set in the target database.
The target database stores the characteristic target sequence set corresponding to each pathogen operation group. The complete set contains all of the high-confidence genomes that have been gathered; that is, it contains the high-confidence genomes of multiple pathogen operation groups as well as those of multiple non-pathogen operation groups. Once the data on how many genomes of the complete set contain each k-mer of each pathogen operation group have been obtained, the genome occurrence count index table of the complete set can be generated. This table records in how many genomes of the complete set each k-mer of each pathogen operation group appears; that is, for each k-mer it records the number of genomes in the complete set that contain it.
What the genome occurrence count index table of the complete set actually records is therefore in how many genomes of the complete set each k-mer appears. The quantity counted is the number of genomes, not the number of occurrences of the k-mer. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. Once the data on how many genomes of the complete set contain each k-mer have been obtained, the genome occurrence count index table of the complete set can be built. This table differs from the genome occurrence count index tables of the individual pathogen operation groups: each pathogen operation group has its own table, whereas only one table is generated for the complete set, covering all of the data. After the generated table has been stored, it can be retrieved from the database whenever it is needed during detection of the sequencing data, which improves detection efficiency.
In one embodiment, step 106 above includes: taking each genome contained in the target pathogen operation group in turn as a reference genome; locating each specific k-mer of the target pathogen operation group on the reference genome; and taking the specific k-mers that are located on the reference genome as the specific k-mers contained in that reference genome.
The target pathogen operation group contains multiple genomes. Each genome can be taken in turn as the reference genome, and the specific k-mers of the target pathogen operation group can be located on it. Because the specific k-mers are k-mers pre-selected as satisfying the preset specificity condition, some specific k-mers may fail to be located on a particular genome. The specific k-mers that are successfully located on the reference genome are taken as the specific k-mers contained in that reference genome. If a specific k-mer cannot be located on a genome, it is considered not to be contained in that genome. Locating can thus be regarded as a reconfirmation of which specific k-mers each genome contains, and this second check on the specific k-mers of each genome improves fault tolerance.
In one embodiment, taking the specific k-mers located on the reference genome as the specific k-mers contained in the reference genome includes: selecting regions from the reference genome in turn and comparing each with the specific k-mer; and, when the similarity between a selected region and the specific k-mer is detected to reach a preset similarity threshold, taking the specific k-mer as a specific k-mer contained in the reference genome.
When a specific k-mer of the target pathogen operation group is located on the reference genome to determine whether it belongs to that genome, regions can be selected from the reference genome in turn and compared with the specific k-mer. Each selected region is a gene sequence, and comparing it with the specific k-mer gives the similarity between the two sequences. When the similarity between a selected region and the specific k-mer reaches the preset similarity threshold, the specific k-mer is considered to be contained in the reference genome and can be taken as one of its specific k-mers. The reference genome can be treated as a string composed of many bases; for the comparison, sequences of length k are taken from this string in turn and compared with the specific k-mer. If the similarity between a selected sequence and the specific k-mer reaches the preset similarity threshold, the specific k-mer is considered to be a specific k-mer of the reference genome. The preset similarity threshold can be customized by a technician. For example, if the preset similarity threshold is set to 99% and the similarity between a specific k-mer and a region of the reference genome reaches or exceeds 99%, the specific k-mer is considered to belong to the reference genome.
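The similarity-based locating described above can be sketched with a sliding window over the reference string. The application does not define how similarity is computed; this sketch assumes it is the fraction of identical positions between two equal-length sequences, and the function names are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length sequences agree
    (an assumed similarity measure; the source does not define one)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def locate_by_similarity(kmer, reference, threshold):
    """Return the first start position whose window is similar enough, else None."""
    k = len(kmer)
    for start in range(len(reference) - k + 1):
        window = reference[start:start + k]
        if similarity(kmer, window) >= threshold:
            return start
    return None

# Hypothetical toy data: the window at position 2 differs from the
# k-mer in exactly one of ten bases.
kmer = "ACGTACGTAC"
reference = "TTACGTACCTACGG"
position = locate_by_similarity(kmer, reference, threshold=0.9)
```

A production pipeline would more likely use an aligner or an index than a linear scan; the loop is kept only to mirror the description.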
In one embodiment, taking the specific k-mers located on the reference genome as the specific k-mers contained in the reference genome includes: selecting from the reference genome, in turn, sequences of the same length as the specific k-mer and comparing each with the specific k-mer; and, when a selected sequence is detected to be identical to the specific k-mer, taking the specific k-mer as a specific k-mer contained in the reference genome.
This differs from the previous embodiment, in which locating depends on whether the similarity between the selected region and the specific k-mer reaches a preset similarity threshold. Here, sequences of the same length as the specific k-mer are selected from the reference genome in turn and compared with the specific k-mer. If a selected sequence is identical to the specific k-mer, the specific k-mer is considered to belong to the reference genome and to be one of its specific k-mers. If no selected sequence is identical to the specific k-mer, the specific k-mer is considered not to belong to the reference genome. In other words, there is no notion of a selected sequence being merely similar to the specific k-mer; a k-mer either belongs or does not. Because no similarity judgment is required, this locating approach is faster.
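Exact-match locating reduces to a membership test, which is one reason it can be faster. A minimal sketch follows, assuming all specific k-mers share the same length k; the function name and the set-based lookup are illustrative choices rather than the application's implementation.

```python
def kmers_contained_in_reference(specific_kmers, reference, k):
    """Return the specific k-mers that occur verbatim in the reference genome."""
    # Collect every length-k substring of the reference once.
    reference_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    # A specific k-mer either belongs to the reference or it does not.
    return {kmer for kmer in specific_kmers if kmer in reference_kmers}

# Hypothetical toy data.
reference = "ACGTTGCA"
found = kmers_contained_in_reference({"ACGT", "TGCA", "GGGG"}, reference, k=4)
```

Building the set costs one pass over the reference, after which each k-mer lookup takes constant time on average.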
In one embodiment, before the target pathogen operation group to be detected is determined, the method further includes: obtaining pre-selected genomes that satisfy a preset confidence condition as high-confidence genomes; and determining the high-confidence genomes included in each pathogen operation group as the genomes corresponding to that group.
A high-confidence genome is a selected genome that satisfies a preset confidence condition. The preset confidence condition is a genome selection condition set by a technician. After the high-confidence genomes have been obtained, the high-confidence genomes contained in each pathogen operation group can be determined, which determines the genomes belonging to each group. High-confidence genomes can include pathogen genomes as well as non-pathogen genomes, for example those of commensal bacteria, probiotics, humans, animals, and plants. High-confidence genomes may come from the RefSeq dataset of NCBI (National Center for Biotechnology Information), a reference sequence database of biologically meaningful, non-redundant gene and protein sequences, or from other public or private sources of high-confidence genomes.
In one embodiment, the preset confidence condition is satisfied in any of the following cases: the proportion of non-deterministic characters in the genome sequence is below a preset proportion threshold; the number of sequence fragments in the genome sequence that belong to the same chromosome is below a preset fragment threshold; or, when the genome sequence is aligned against all other genome sequences whose genetic relationship to it falls within a preset genetic distance threshold range, so as to determine its average whole-sequence coverage percentage among those closely related genome sequences, that average coverage percentage is above a preset percentage value.
High-confidence genomes can be identified and screened in the following three ways:
1. Screening by the proportion of non-deterministic characters in a genome record. For a DNA genome, for example, this is the proportion of non-ACGT characters it contains; if that proportion is too high, the record is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are all characters other than the deterministic characters A, C, G, T, and U; for protein sequences, they are all characters other than the defined amino acid characters.
2、根据一条完整的染色体所包括的基因组数据片段的数目进行筛选,如果有过多的片段同属于一条染色体,那么该基因组即为疑似低可信度的基因组。2. Screen based on the number of genomic data fragments included in a complete chromosome. If there are too many fragments that belong to the same chromosome, then the genome is a suspected low-confidence genome.
3、通过与该基因组遗传关系相近的(例如遗传距离小于某一阈值)多个基因组进行全基因组序列比对,确定该基因组在其相近基因组中的全基因组平均覆盖百分比,然后根据这个全基因组平均覆盖百分比进行筛选:平均覆盖百分比过低的基因组即为疑似低完成度、即低可信度的基因组。遗传距离是指衡量物种间(或个体间)综合遗传差异大小的指标。3. Perform genome-wide sequence alignment of multiple genomes with similar genetic relationships (eg, genetic distance is less than a certain threshold) to determine the average genome-wide coverage percentage of the genome in its similar genomes, and then average based on this whole genome Screening by percentage of coverage: Genomes with a low average percentage of coverage are those that are suspected of having low completion, ie, low confidence. Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals).
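The first two screening criteria can be sketched in a few lines of Python. This is only an illustration: the function names, the input layout (a mapping from chromosome name to its list of fragment sequences), and the threshold values 0.01 and 50 are assumptions, since the text leaves the preset thresholds to the technician.

```python
def nondeterministic_fraction(seq, alphabet="ACGTU"):
    """Fraction of characters in seq that fall outside the deterministic alphabet."""
    if not seq:
        return 0.0
    allowed = set(alphabet)
    return sum(ch not in allowed for ch in seq.upper()) / len(seq)


def is_suspected_low_confidence(fragments_by_chromosome,
                                max_bad_fraction=0.01,  # hypothetical threshold
                                max_fragments=50):      # hypothetical threshold
    """Flag a genome by criterion 1 (ambiguous characters) or criterion 2 (fragment count).

    fragments_by_chromosome: dict mapping a chromosome name to a list of
    fragment sequences (an assumed input layout).
    """
    whole = "".join(f for frags in fragments_by_chromosome.values() for f in frags)
    if nondeterministic_fraction(whole) > max_bad_fraction:
        return True
    return any(len(frags) > max_fragments
               for frags in fragments_by_chromosome.values())
```

The third criterion needs a whole-genome aligner and is not sketched here.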
In one embodiment, as shown in FIG. 3, step 108 includes:
Step 302: map the specific k-mers contained in each genome onto that genome.
Step 304: select, in turn, the specific k-mers and/or non-overlapping specific regions contained in each genome for detection.
Step 306: when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is detected to be less than a preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region.
Step 308: obtain the set of non-overlapping specific regions for each genome from the specific k-mers finally retained and the replaced non-overlapping specific regions.
After the specific k-mers contained in each genome of the target pathogen operation group have been determined, the specific k-mers of each genome can be mapped onto that genome. Two specific k-mers in the genome can then be selected in turn and checked to see whether their distance on the genome is less than the preset distance threshold; if so, the two selected specific k-mers are replaced. One way to replace them is to substitute the smallest region that covers both selected specific k-mers, which yields the corresponding non-overlapping specific region. Another is to take the stretch of genome sequence on which the two selected specific k-mers are located as the corresponding non-overlapping specific region. The preset distance threshold may be negative or zero, and is generally set to an integer less than 5.
A specific k-mer in a genome can also be compared with a non-overlapping specific region already obtained in that genome, and two non-overlapping specific regions in the genome can be compared with each other. During this positional comparison, the specific k-mers and/or non-overlapping specific regions contained in each genome are selected for detection, in the same way as two specific k-mers are compared. The check is whether the two selected non-overlapping specific regions, or the selected specific k-mer and non-overlapping specific region, lie less than the preset distance threshold apart on the genome. If so, they are replaced to obtain the corresponding non-overlapping specific region. The set of non-overlapping specific regions for each genome is obtained from the specific k-mers finally retained in that genome and the replaced non-overlapping specific regions; each such set contains the non-overlapping specific regions of that genome.
Suppose A and B are the two selected specific k-mers, where A is ACGGTCATC and B is TCATCCGA. After A and B are mapped onto the genome, if the sequence between them is CCC, they can be replaced by A+CCC+B, so the resulting non-overlapping specific region is ACGGTCATCCCCTCATCCGA. If there is no sequence between A and B, they can be spliced directly, and A+B is the resulting non-overlapping specific region. In the present example, however, the ends of A and B overlap: the last characters of A (TCATC) coincide with the first characters of B. In that case A and B are replaced by the smallest region that covers both, namely ACGGTCATCCGA. The specific replacement method can be customized by a technician, or chosen according to the distance between A and B or the number of overlapping characters.
When specific k-mers are processed into non-overlapping specific regions, one non-overlapping specific region may result from replacing two specific k-mers, from replacing three, or from replacing many. The length of the resulting non-overlapping specific regions is therefore not limited. If the distance on the genome between the two selected specific k-mers is not less than the preset distance threshold, no processing is needed. Once every genome has been processed in this way, the set of non-overlapping specific regions of each genome is obtained; each such set contains the non-overlapping specific regions of that genome.
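On genome coordinates, the repeated replacement described above behaves like a merge of nearby intervals. The following is a minimal sketch under stated assumptions: each mapped k-mer or region is a half-open `(start, end)` pair, the distance between two neighbours is `next_start - previous_end` (negative when they overlap, 0 when they touch), and a single pass over the sorted list stands in for the pairwise checks. The function name and the default threshold of 1 are hypothetical.

```python
def merge_regions(intervals, distance_threshold=1):
    """Merge mapped k-mers/regions whose distance is below distance_threshold.

    intervals: iterable of (start, end) half-open coordinates on one genome.
    Returns the merged, non-overlapping regions sorted by start position.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < distance_threshold:
            # Replace the two items by the smallest region covering both.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Three overlapping 5-mers collapse into one region; the distant one is kept as is.
regions = merge_regions([(0, 5), (1, 6), (2, 7), (20, 25)], distance_threshold=1)
```

With a threshold of 0 only overlapping items are merged; with a threshold of 1 directly adjacent items are merged as well.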
In one embodiment, the preset distance threshold is less than 5.
The preset distance threshold can be set to an integer less than 5.
In one embodiment, as shown in FIG. 4, step 306 above includes:
Step 402: detect whether the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than or equal to zero; if so, perform step 404; if not, perform step 406.
Step 404: replace the two selected specific k-mers with the smallest region that covers both, to obtain a non-overlapping specific region.
For each genome, two specific k-mers and/or non-overlapping specific regions are selected in turn and mapped onto the genome, which gives the distance between them, that is, the number of characters separating them. When that distance is detected to be less than the preset distance threshold, the two can be replaced to obtain the corresponding non-overlapping specific region. A distance of 0 means that the two are directly adjacent and contiguous. A negative distance means that the two overlap by a certain number of base pairs. When the distance is less than or equal to 0, the two can be replaced by the smallest region that covers both. In other words, a single region replaces the two specific k-mers and/or non-overlapping specific regions, and that region is the non-overlapping specific region obtained from them.
Step 406: obtain the sequence that lies between the two selected specific k-mers and/or non-overlapping specific regions on the genome they are mapped to.
Step 408: splice the two selected specific k-mers and the intervening sequence together in order, to obtain a spliced sequence.
Step 410: replace the two selected specific k-mers with the spliced sequence, to obtain a non-overlapping specific region.
When the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is detected to be greater than zero, other sequence lies between the two. The sequence that separates them on the genome can then be obtained. The two selected sequences may be two specific k-mers, one specific k-mer and one non-overlapping specific region, or two non-overlapping specific regions. The two selected sequences and the intervening sequence are spliced together in order to give a spliced sequence, and replacing the two selected sequences with the spliced sequence yields a non-overlapping specific region. The process is repeated until no specific k-mers or specific regions remain whose distance is less than the preset distance threshold.
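The branch between step 404 and steps 406 to 410 can be illustrated at the sequence level. The sketch below assumes that both items have already been mapped, that the first item starts no later than the second, and that positions are 0-based; the function name and argument layout are hypothetical. The test sequences reuse the k-mers A = ACGGTCATC and B = TCATCCGA from the earlier example.

```python
def replace_pair(genome, a_start, a_len, b_start, b_len):
    """Return the region that replaces two mapped items (a before b) on genome."""
    a_end = a_start + a_len
    b_end = b_start + b_len
    distance = b_start - a_end  # characters between the two items
    if distance <= 0:
        # Step 404: overlapping or adjacent, take the smallest covering region.
        return genome[a_start:max(a_end, b_end)]
    # Steps 406-410: splice a, the intervening sequence, and b in order.
    spacer = genome[a_end:b_start]
    return genome[a_start:a_end] + spacer + genome[b_start:b_end]


A, B = "ACGGTCATC", "TCATCCGA"

# Case 1: A and B overlap on the genome by five characters (TCATC).
overlapping_genome = "GG" + "ACGGTCATCCGA" + "TT"
merged_overlap = replace_pair(overlapping_genome, 2, len(A), 6, len(B))

# Case 2: A and B are separated by the spacer CCC.
spaced_genome = A + "CCC" + B
merged_spaced = replace_pair(spaced_genome, 0, len(A), 12, len(B))
```

In both cases the result equals the genome slice from the start of the first item to the end of the second, which is why the two replacement methods in the text can be treated uniformly.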
In one embodiment, as shown in FIG. 5, after step 306 the method further includes:
Step 502: select the specific k-mers and/or non-overlapping specific regions contained in each genome for detection.
Step 504: when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is detected to be less than the preset distance threshold, replace the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region.
Step 506: obtain the set of non-overlapping specific regions for each genome from the specific k-mers finally retained and the replaced non-overlapping specific regions.
After the specific k-mers have been processed accordingly, the corresponding non-overlapping specific regions are obtained. A specific k-mer in a genome can be compared with a non-overlapping specific region obtained in that genome, and two non-overlapping specific regions in the genome can be compared with each other, in the same way as two specific k-mers are compared. The check is whether the two selected non-overlapping specific regions, or the selected specific k-mer and non-overlapping specific region, lie less than the preset distance threshold apart on the genome. If so, they are replaced to obtain the corresponding non-overlapping specific region. The set of non-overlapping specific regions for each genome is obtained from the specific k-mers finally retained in that genome and the replaced non-overlapping specific regions; each such set contains the non-overlapping specific regions of that genome.
In one embodiment, as shown in FIG. 6, step 110 above includes:
Step 602: pool the sets of non-overlapping specific regions of all genomes in the target pathogen operation group to obtain a union of non-overlapping specific regions.
Step 604: obtain the number of times each non-overlapping specific region in each genome's set of non-overlapping specific regions occurs in the union.
Once the set of non-overlapping specific regions of each genome has been obtained, a target pathogen operation group containing N genomes yields N sets of non-overlapping specific regions. Pooling these N sets gives one union of non-overlapping specific regions. Counting how often each non-overlapping specific region occurs across all the sets is therefore the same as counting how often it occurs in the union. If a non-overlapping specific region appears in M genomes, its number of occurrences in the union is M.
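The occurrence count lends itself to a short sketch with `collections.Counter`. It assumes that a non-overlapping specific region is identified by its sequence string and that each genome contributes at most one occurrence per region; the function names and the strict "greater than" reading of "exceeds the preset count threshold" are assumptions.

```python
from collections import Counter


def count_region_occurrences(region_sets):
    """Count, for each region, the number of genomes whose set contains it.

    region_sets: one collection of region sequences per genome (N in total).
    """
    counts = Counter()
    for regions in region_sets:
        counts.update(set(regions))  # a genome counts once per region
    return counts


def select_frequent(counts, count_threshold):
    """Regions whose occurrence count exceeds the preset count threshold."""
    return {region for region, n in counts.items() if n > count_threshold}


sets_per_genome = [{"ACGT", "TTGA"}, {"ACGT"}, {"ACGT", "GGCC"}]
counts = count_region_occurrences(sets_per_genome)
frequent = select_frequent(counts, count_threshold=2)
```

Here the region ACGT appears in M = 3 of the N = 3 genomes, so it is the only one that passes a threshold of 2.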
In one embodiment, as shown in FIG. 7, after step 112 the method further includes:
Step 702: take the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as representative specific regions.
Step 704: form the total set of representative specific regions from the representative specific regions obtained.
Step 706: remove from the total set the representative specific regions that have no biological function, to obtain the total set of representative specific regions with biological function.
Step 708: take the non-overlapping specific regions in the total set of representative specific regions with biological function as the detection targets of the target pathogen operation group.
The biological functions of genes include the storage, replication, and expression of genetic information. After the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold have been selected, they can be taken as representative specific regions, and together they form the total set of representative specific regions. The regions in this total set are then screened by removing the representative specific regions that have no biological function, which leaves the total set of representative specific regions with biological function. These biologically functional representative specific regions, that is, the non-overlapping specific regions in that total set, can then be used as the detection targets of the target pathogen operation group.
In one embodiment, as shown in FIG. 8, step 704 includes:
Step 802: obtain from the target database the gene annotation information of each genome in the target pathogen operation group, the gene annotation information containing the position of each known functional region on each genome and the corresponding functional information.
Gene annotation information records the position and function of each gene in a genome, so the gene annotation information of each genome contains the position of every known functional region on that genome and the corresponding functional information. The position of a region includes its start and end positions, its strand (plus or minus), and its sequence. The functional information covers protein-coding genes, genes encoding microRNA (non-coding single-stranded RNA molecules about 22 nucleotides long, encoded by endogenous genes), promoter regions, regions that regulatory proteins recognize and bind, replication origins, and so on. The gene annotation information stored in the target database can be obtained for each genome from NCBI's GenBank database (an open, annotated nucleic acid sequence database of NCBI) or from the Ensembl database (a database of genome sequences and annotation information maintained by organizations including the European Bioinformatics Institute).
The target database stores the gene annotation information of every genome in every pathogen operation group. The gene annotation information of each genome in the target pathogen operation group can therefore be retrieved from the target database, which gives the biologically functional regions of each of those genomes.
Step 804: select the representative specific regions from the total set one at a time and compare each with the known functional regions of all the genomes.
Step 806: remove the representative specific regions whose overlap with known functional regions is shorter than a preset overlap threshold, to obtain the total set of representative specific regions with biological function.
The total set of representative specific regions is the union of the non-overlapping specific regions, selected across the genomes, whose number of occurrences exceeds the preset count threshold. To determine whether each representative specific region in the total set has biological function, the representative specific regions are selected one at a time and compared with the known functional regions of all genomes in the target pathogen operation group, and it is judged whether the selected region overlaps significantly with a known functional region, that is, whether the degree of overlap between the two sequences reaches the preset overlap threshold. A representative specific region whose overlap with known functional regions is below the preset overlap threshold is considered to have no biological function and can be removed. The remaining representative specific regions, whose overlap with known functional regions is above the preset overlap threshold, are the representative specific regions with biological function.
When the comparison judges whether a selected representative specific region overlaps significantly with a known functional region, it judges whether the degree of overlap between the two sequences reaches the preset overlap threshold. The preset overlap threshold for significant overlap may be any of the following: the length of the overlapping region exceeds a threshold T1, for example 12 bp; the length of the overlapping region as a percentage of the length of the specific region exceeds a threshold T2, for example 30%; the length of the overlapping region as a percentage of the length of the functional region concerned exceeds a threshold T3, for example 30%; or the total length of all functional regions contained in the specific region as a percentage of the length of the specific region exceeds a threshold T4, for example 30%.
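The four overlap criteria can be sketched on genome coordinates as follows. Several points are assumptions rather than part of the described method: regions and annotated features are half-open `(start, end)` pairs on the same genome, satisfying any one of the four criteria counts as significant overlap, and the annotated features do not overlap one another (otherwise the T4 total would double-count). The function names are hypothetical; the defaults are the example values from the text.

```python
def overlap_length(region, feature):
    """Number of positions shared by two half-open (start, end) intervals."""
    return max(0, min(region[1], feature[1]) - max(region[0], feature[0]))


def has_biological_function(region, features, t1=12, t2=0.30, t3=0.30, t4=0.30):
    """True if region overlaps the known functional features significantly."""
    region_len = region[1] - region[0]
    if region_len <= 0:
        return False
    total_overlap = 0
    for feature in features:
        ov = overlap_length(region, feature)
        if ov == 0:
            continue
        total_overlap += ov
        feature_len = feature[1] - feature[0]
        if (ov > t1                         # T1: absolute overlap length in bp
                or ov / region_len > t2     # T2: share of the specific region
                or ov / feature_len > t3):  # T3: share of the functional region
            return True
    # T4: all functional sequence inside the region, as a share of the region.
    return total_overlap / region_len > t4
```

For instance, a 100 bp region that shares 50 bp with an annotated gene passes on T1 alone, while a region that only touches a gene by 5 bp fails all four tests.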
Step 806 is optional: it may be skipped, but performing it is generally recommended. Sequences with biological function are generally the ones that, under selection pressure, remain unmutated during evolution. Choosing biologically functional sequences as diagnostic targets therefore effectively avoids mutations, that is, sequence changes, arising in the selected specific regions as the pathogen evolves and reproduces, and so helps to keep the finally selected detection targets valid and accurate over a long period.
In one embodiment, as shown in FIG. 9, after step 112 the method further includes:
Step 902: take as the representative genome the genome that contains the largest number of non-overlapping specific regions whose number of occurrences exceeds the preset count threshold.
Step 904: take the set of representative specific regions of the representative genome as the PCR representative specific region set.
After the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold have been taken as the detection targets of the target pathogen operation group, a representative genome can be chosen according to how these regions are distributed over the genomes. For each genome, the number of non-overlapping specific regions it contains whose number of occurrences exceeds the preset count threshold is counted, and the genome containing the most such regions is chosen as the representative genome.
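Choosing the representative genome reduces to a maximum over per-genome counts. The sketch below assumes a mapping from genome identifier to that genome's set of non-overlapping specific regions (identified by sequence) and a set of representative specific regions; the names are hypothetical, and ties are resolved by whichever genome comes first in the mapping, which the text does not specify.

```python
def pick_representative_genome(region_sets_by_genome, representative_regions):
    """Genome containing the most representative specific regions."""
    representative = set(representative_regions)
    return max(
        region_sets_by_genome,
        key=lambda genome: len(set(region_sets_by_genome[genome]) & representative),
    )


genome_regions = {
    "g1": {"ACGT", "TTGA"},
    "g2": {"ACGT", "TTGA", "GGCC"},
    "g3": {"GGCC"},
}
best = pick_representative_genome(genome_regions, {"ACGT", "GGCC"})
```

In this toy input, g2 contains both representative regions and is therefore selected.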
Representative specific regions are the non-overlapping specific regions whose number of occurrences across all the sets of non-overlapping specific regions exceeds the preset count threshold, so there are several of them. Several representative specific regions may be selected from each genome, and each genome has its own set of representative specific regions, which contains exactly those regions. The representative genome is one genome chosen from the genomes of the target pathogen operation group, so it too has its own set of representative specific regions. Once the representative genome has been chosen, its set of representative specific regions can therefore be used as the PCR (polymerase chain reaction) representative specific region set.
Step 906: select from the PCR representative specific region set the non-overlapping specific regions that fall within a preset separation distance range, to generate the set of eligible PCR-specific region pairs.
Step 908: select one PCR-specific region pair from the set of eligible PCR-specific region pairs and map its two non-overlapping specific regions onto the representative genome.
The PCR representative specific region set contains one or more non-overlapping specific regions, and the position on the representative genome of each of them can be obtained. Pairs of non-overlapping specific regions whose distance on the genome falls within the preset separation distance range can then be found, generating the set of eligible PCR-specific region pairs. The preset separation distance range D may be (MD-SD, MD+SD), where MD can be set to about 1000 bp and SD to about 500 bp.
Step 910: take the sequence corresponding to the interval formed on the representative genome by the two selected non-overlapping specific regions as an interval to be detected.
Step 912: form the set of intervals to be detected from the intervals to be detected obtained.
Step 914: screen each interval in the set of intervals to be detected to obtain the final set of detection primer pairs.
Suppose A and B are two non-overlapping specific regions whose distance on the genome falls within the preset separation distance range. A and B then form an eligible PCR-specific region pair, and each of A and B is a PCR-specific region. When the two PCR-specific regions of one pair are selected from the set of eligible PCR-specific region pairs and mapped onto the representative genome, they delimit an interval on the representative genome; the sequence corresponding to that interval is taken as an interval to be detected of the representative genome. Doing this for every pair in the set of eligible PCR-specific region pairs gives the corresponding intervals to be detected, which together form the set of intervals to be detected. The set of intervals to be detected thus contains one or more intervals to be detected.
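Extracting an interval to be detected is a single slice of the representative genome. The sketch assumes half-open `(start, end)` coordinates and assumes that the interval includes the two flanking PCR-specific regions themselves; the text does not state whether the regions are part of the interval, so this boundary choice and the function name are hypothetical.

```python
def interval_to_detect(genome, region_a, region_b):
    """Sequence spanning two mapped regions, flanking regions included (assumed)."""
    start = min(region_a[0], region_b[0])
    end = max(region_a[1], region_b[1])
    return genome[start:end]


toy_genome = "AAAACCCCGGGGTTTT"
spanned = interval_to_detect(toy_genome, (2, 4), (8, 10))
```

The order in which the two regions are passed does not matter, because the slice always runs from the leftmost start to the rightmost end.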
Each interval in the set of intervals to be detected is then screened, and the screening yields the final set of detection primer pairs. An automatic PCR primer generation tool such as Primer3 (a primer design tool) can be used during screening to pick suitable candidate PCR primer pairs. The final set of detection primer pairs obtained from the intervals to be detected may contain multiple PCR primer pairs.
In one embodiment, step 902 includes: obtaining the total set of representative specific regions from the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold; and selecting, as the representative genome, the genome that contains the largest number of non-overlapping specific regions from the total set of representative specific regions.
Once the number of occurrences of each non-overlapping specific region across all the sets of non-overlapping specific regions has been obtained, the regions whose number of occurrences exceeds the preset count threshold can be selected. These selected regions form the total set of representative specific regions, which therefore contains one or more non-overlapping specific regions whose number of occurrences across all the sets exceeds the preset count threshold. The genome that contains the largest number of these regions (that is, of representative specific regions) is then chosen as the representative genome. In other words, the genome containing the most non-overlapping specific regions from the total set of representative specific regions is selected as the representative genome.
In one embodiment, as shown in FIG. 10, step 906 includes:
Step 1002: obtain the position in the representative genome of each PCR representative specific region in the set of PCR representative specific regions.
Step 1004: take two non-overlapping specific regions whose separation falls within a preset distance range as an eligible PCR-specific region pair.
Step 1006: generate a set of eligible PCR-specific region pairs from the eligible PCR-specific region pairs.
Once the set of PCR representative specific regions has been obtained, non-overlapping specific regions that satisfy the preset distance range can be selected from it to generate the set of eligible PCR-specific region pairs. The position in the representative genome of each PCR representative specific region in the set is obtained; that is, every non-overlapping specific region in the set is located on the representative genome, which fixes its position there. The distance between any two non-overlapping specific regions, namely the number of characters separating them, can then be computed, and two regions whose distance falls within the preset distance range can be selected.
For example, if the preset distance range is set to 700 to 1300, two non-overlapping specific regions separated by 700 to 1300 characters can be selected. Two regions selected in this way form an eligible PCR-specific region pair. The set of eligible PCR-specific region pairs is then generated from the selected pairs, and it contains one or more eligible pairs.
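The pairing step can be sketched as follows. This is a hedged example: the half-open `(start, end)` coordinate convention, the function name `eligible_region_pairs`, and the inclusive bounds are assumptions, with the 700 to 1300 range taken from the example above.

```python
from itertools import combinations

def eligible_region_pairs(regions, min_gap=700, max_gap=1300):
    """Return pairs of regions whose separation lies in [min_gap, max_gap].

    regions: list of (start, end) half-open positions on the representative
    genome (a hypothetical layout chosen for this sketch).
    """
    pairs = []
    # Sorting guarantees that region a starts no later than region b.
    for a, b in combinations(sorted(regions), 2):
        gap = b[0] - a[1]  # number of characters between the two regions
        if min_gap <= gap <= max_gap:
            pairs.append((a, b))
    return pairs
```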
In one embodiment, the preset distance range is greater than 500 bp and less than 1500 bp.
That is, the preset distance range can be set to greater than 500 bp and less than 1500 bp.
In one embodiment, step 910 includes: taking the two outermost end positions of the two selected non-overlapping specific regions on the representative genome as the boundaries of the interval to be detected; and taking the sequence of the interval that these boundaries form on the representative genome as the interval to be detected.
After the set of eligible PCR-specific region pairs has been generated, any two non-overlapping specific regions can be selected from the set and located on the representative genome to obtain their positions there. The positions of the two outermost ends of the two selected regions are then obtained, that is, the span from the start of one region to the end of the other. Suppose A and B are the two selected non-overlapping specific regions, where A is ACGGTCATC and B is TCATCCGAG, and that when A and B are located on the representative genome the sequence between them is AAAATTTTT. The two outermost positions are then the character A at the start of "ACGGTCATC" and the character G at the end of "TCATCCGAG". These two characters serve as the boundaries of the interval to be detected, so the interval is "ACGGTCATC" + "AAAATTTTT" + "TCATCCGAG", that is, "ACGGTCATCAAAATTTTTTCATCCGAG". Applying this processing to every two non-overlapping specific regions selected from the set yields multiple corresponding intervals to be detected.
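The worked example with regions A and B can be reproduced with a short sketch. The function name `interval_to_detect` and the use of exact string search to locate each region are assumptions made for illustration; the flanking bases in the usage example are invented padding.

```python
def interval_to_detect(genome, region_a, region_b):
    """Return the genome span from the outermost start to the outermost end
    of two regions, located here by exact string search (an assumption)."""
    pos_a = genome.find(region_a)
    pos_b = genome.find(region_b)
    if pos_a < 0 or pos_b < 0:
        raise ValueError("both regions must occur on the representative genome")
    start = min(pos_a, pos_b)
    end = max(pos_a + len(region_a), pos_b + len(region_b))
    return genome[start:end]

# Regions A and B from the text, separated by AAAATTTTT on a toy genome.
genome = "GG" + "ACGGTCATC" + "AAAATTTTT" + "TCATCCGAG" + "TT"
interval = interval_to_detect(genome, "ACGGTCATC", "TCATCCGAG")
```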
In one embodiment, the positions of the two selected non-overlapping specific regions, and the position of the non-specific region between them, are marked on the representative genome.
After the two selected non-overlapping specific regions have been located on the representative genome, their positions can be marked on it, and the position of the non-specific region between them can also be recorded. A non-specific region is one that belongs neither to a selected specific k-mer nor to a non-overlapping specific region formed from specific k-mers. The marked positions can then be used to determine PCR primer pairs: PCR primers must be placed in non-overlapping specific regions, and the specificity of PCR depends on the primers, so it is useful to mark which parts are non-overlapping specific regions and which are non-specific regions.
In one embodiment, as shown in FIG. 11, step 912 includes:
Step 1102: use a PCR primer tool to screen each interval to be detected in the set of intervals to be detected, obtaining a set of candidate PCR primer pairs.
The sequence of the interval formed on the representative genome by the two selected non-overlapping specific regions is taken as an interval to be detected. After multiple intervals to be detected have been obtained in this way, each is screened, for example with a PCR primer tool such as Primer3. Some of the intervals are thereby screened out, and one or more sets of candidate PCR primer pairs are obtained.
Step 1104: select, from the set of candidate PCR primer pairs, the primer pairs that are specific to the target pathogen operating group, and generate the set of specific primer pairs corresponding to the target pathogen operating group.
Most existing automatic PCR primer design tools only ensure that a generated primer pair is specific within the interval supplied to the tool; they cannot guarantee specificity in other regions. In addition, some tools cannot make use of the specific-region annotations within the interval to be detected. The specificity of the primers in the candidate set therefore needs to be checked further, so that primer pairs specific to the target pathogen operating group can be selected from the candidate set and the corresponding set of specific primer pairs can be generated. The genomes in the full set that do not belong to the target pathogen operating group are each used as an alignment reference genome, and the two primers of a candidate PCR primer pair are each aligned against, and mapped to, the alignment reference genome. To decide whether mapping succeeded, the primer pair is compared with the sequence at the position on the alignment reference genome to which it was mapped; when the sequence similarity reaches a preset similarity threshold, the primer pair is judged to have been mapped successfully. A candidate PCR primer pair that maps successfully is judged not to be a specific primer pair for the target pathogen operating group and is removed from the set of candidate PCR primer pairs. The set of specific primer pairs corresponding to the target pathogen operating group is then generated from the remaining specific primer pairs.
Step 1106: select the primer pairs in the set of specific primer pairs that satisfy preset primer conditions as the final detection primer pairs.
Step 1108: generate the set of final detection primer pairs from the final detection primer pairs.
After the primer pairs specific to the target pathogen operating group have been selected, further screening can be performed. The primer pairs that satisfy the preset primer conditions are picked from the set of specific primer pairs and used as the final detection primer pairs, from which the corresponding set of final detection primer pairs is generated.
In one embodiment, as shown in FIG. 12, step 1104 includes:
Step 1202: obtain the full set from the target database; the full set contains multiple collected high-confidence genomes.
Step 1204: obtain from the full set the genomes that are not included in the target pathogen operating group, and use them as alignment reference genomes.
The target database stores the full set, which contains multiple collected high-confidence genomes. A high-confidence genome is a selected genome that satisfies a preset confidence condition. Because the genomes included in the target pathogen operating group are known, they can be removed from the full set, and the remaining genomes, which do not belong to the target pathogen operating group, are used as alignment reference genomes. The alignment reference genomes are therefore not included in the target pathogen operating group.
Step 1206: select primer pairs in turn from the set of candidate PCR primer pairs and map them to the alignment reference genome.
Step 1208: compare the selected primer pair with the sequence at the position on the alignment reference genome to which it was mapped; when the sequence similarity reaches a preset similarity threshold, judge the primer pair to have been mapped successfully.
To decide whether a primer pair has been mapped to the alignment reference genome, the selected primer pair is compared with the sequence at the position on the alignment reference genome to which it was mapped. When the similarity between the primer pair and that sequence reaches the preset similarity threshold, the primer pair is judged to have been mapped successfully. The threshold can be set by a technician, for example to 95% or 99%; the primer pair is then judged to be successfully mapped to the alignment reference genome only when the similarity reaches 95% or 99%, respectively.
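The similarity test can be illustrated with a minimal sketch. The source does not define how similarity is computed; per-position identity over an ungapped window is an assumption here, as are the names `similarity` and `maps_successfully`.

```python
def similarity(primer, target):
    """Fraction of identical positions between two equal-length sequences
    (an assumed similarity measure; the source does not specify one)."""
    if len(primer) != len(target):
        raise ValueError("sequences must have equal length")
    matches = sum(p == t for p, t in zip(primer, target))
    return matches / len(primer)

def maps_successfully(primer, genome, position, threshold=0.95):
    """Judge a primer as mapped when its similarity to the genome window at
    the given position reaches the preset similarity threshold."""
    window = genome[position:position + len(primer)]
    if len(window) < len(primer):
        return False  # the window runs off the end of the genome
    return similarity(primer, window) >= threshold
```

A production pipeline would more likely rely on an aligner that reports identity and handles gaps; this sketch only shows the threshold decision.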
Step 1210: remove, from the primer pairs judged to have been mapped successfully, those that satisfy the preset alignment conditions, obtaining the specific primer pairs corresponding to the target pathogen operating group.
Step 1212: generate the set of specific primer pairs corresponding to the target pathogen operating group from the specific primer pairs.
A primer pair is selected in turn from the set of candidate PCR primer pairs and mapped to the alignment reference genome, and it is determined whether the selected pair has been mapped successfully. From the primer pairs judged to have been mapped successfully, those that satisfy the preset alignment conditions are removed; what remains are the specific primer pairs corresponding to the target pathogen operating group. The set of specific primer pairs corresponding to the target pathogen operating group is generated from these pairs and contains one or more of them.
In one embodiment, the preset alignment conditions include at least one of the following: both primers of the selected primer pair map to the same chromosome of the same genome; the distance between the two primers of the selected primer pair is within a preset distance range; a preset number of bases at the 3' end of either primer of the selected primer pair are identical to the bases at the position on the alignment reference genome to which the primer pair is mapped.
Primer pairs that satisfy the above conditions are marked as non-specific primer pairs. From the primer pairs judged to have been mapped successfully to the alignment reference genome, those that satisfy the preset alignment conditions, that is, those marked as non-specific, are removed, because only specific primer pairs are wanted. The preset distance range is an interval D that can vary within (MD-SD, MD+SD), where MD is typically 1000 bp and SD is typically 500 bp. k is generally greater than 0.5.
In one embodiment, satisfying the preset primer conditions includes at least one of the following: the primer length is between 17 and 28 bp; the primer annealing temperature is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG, or GC; the last 5 bases at the 3' end of the primer contain no more than 3 G/C bases and no run of more than 2 consecutive C or G bases; the primer contains no repeat sequences or single-nucleotide repeats; there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
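A few of these conditions can be checked with plain string operations, as in the sketch below. It is deliberately partial: annealing temperature, repeat detection, and complementarity are left out because the source does not say how they are computed, and requiring all of the implemented checks at once (rather than "at least one") is an assumption. The function name `passes_basic_primer_checks` is hypothetical.

```python
import re

def passes_basic_primer_checks(primer):
    """Check length, GC percentage, and 3'-end rules for a single primer."""
    primer = primer.upper()
    if not 17 <= len(primer) <= 28:            # length of 17 to 28 bp
        return False
    gc = sum(base in "GC" for base in primer) / len(primer)
    if not 0.40 <= gc <= 0.60:                 # GC percentage of 40% to 60%
        return False
    if primer[-1] not in "GC":                 # 3' end is C or G
        return False
    tail = primer[-5:]                         # last 5 bases at the 3' end
    if sum(base in "GC" for base in tail) > 3:
        return False                           # more than 3 G/C in the tail
    if re.search(r"[GC]{3,}", tail):
        return False                           # run of more than 2 C or G
    return True
```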
In one embodiment, as shown in FIG. 13, a method for determining a detection target is provided, including the following steps:
Step 1302: establish the set of characteristic target sequences of the target pathogen operating group.
As shown in FIG. 14, step 1302 includes:
Step 1302A: collect and curate high-confidence genomes.
The target database can store the set of characteristic target sequences corresponding to each pathogen operating group. Before this set is established for each pathogen operating group, high-confidence genome data must first be collected and curated. High-confidence genomes can include both pathogen genomes and non-pathogen genomes, for example high-confidence genomes of commensal bacteria, probiotics, humans, animals, and plants. High-confidence genomes can come from the RefSeq dataset of NCBI (the National Center for Biotechnology Information, USA), a reference sequence database of biologically meaningful, non-redundant gene and protein sequences, or from other public or private sources of high-confidence genomes.
High-confidence genomes can be identified and screened in the following three ways:
1. Screen by the proportion of non-deterministic characters in a piece of genome data. For a DNA genome, for example, this is the proportion of non-ACGT characters; if that proportion is too high, the data is treated as a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are all characters other than the deterministic characters A, C, G, T, and U; for protein sequences, they are all characters other than the defined amino acid characters.
2. Screen by the number of genome data fragments that make up one complete chromosome. If too many fragments belong to the same chromosome, the genome is treated as a suspected low-confidence genome.
3. Perform whole-genome sequence alignment against multiple genomes that are genetically close to the genome in question (for example, whose genetic distance is below a threshold), determine the genome's average whole-genome coverage percentage among these close genomes, and screen by that percentage: a genome whose average coverage percentage is too low is treated as a suspected low-completeness, and therefore low-confidence, genome. Genetic distance is a measure of the overall genetic difference between species (or individuals).
All of the collected and curated high-confidence genomes are collectively referred to as the full set.
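The first screening criterion is simple enough to sketch directly. The cutoff of 1% below is purely a placeholder, since the source only says the proportion must not be "too high"; the function names are likewise hypothetical.

```python
def nondeterministic_fraction(sequence, alphabet="ACGT"):
    """Fraction of characters in the sequence that fall outside the
    deterministic alphabet (ACGT for DNA by default)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    allowed = set(alphabet)
    bad = sum(ch not in allowed for ch in sequence.upper())
    return bad / len(sequence)

def is_suspect_low_confidence(sequence, max_fraction=0.01):
    """Flag a genome whose non-deterministic fraction exceeds the cutoff.
    max_fraction is an assumed placeholder, not a value from the source."""
    return nondeterministic_fraction(sequence) > max_fraction
```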
Step 1302B: determine the genomes included in the target pathogen operating group.
A high-confidence genome is a selected genome that satisfies a preset confidence condition. Once the high-confidence genomes have been obtained, the high-confidence genomes belonging to a given pathogen operating group, that is, the genomes the group contains, can be determined. The genomes included in the target pathogen operating group can therefore be determined. The target pathogen operating group is the pathogen operating group to be detected. For example, if the pathogen operating group to be detected is Staphylococcus aureus, the target pathogen operating group in step 102 is Staphylococcus aureus.
Step 1302C: generate the genome-occurrence index table of the full set.
Using the full set, a genome-occurrence index table of the full set can be generated. It records, for each k-mer contained in the full set, the number of genomes in the full set in which that k-mer occurs. A k-mer is a genomic sequence of length k; k can be chosen freely and is typically set between 11 and 32. If a type of genome data has a distinct deterministic characters, then for a given k there are a^k possible distinct k-mers.
For example, DNA has four deterministic characters, A, C, G, and T, so for a given k there are 4^k possible distinct k-mers. A genome of length n contains at most n-k+1 distinct k-mers. Because genomes contain repeated regions, however, the number of distinct k-mers in a genome of n characters is usually far smaller than n-k+1. With ordinary k-mer counting, a particular k-mer may occur several times in a given genome and be counted several times. The genome-occurrence index table built here differs: if a k-mer occurs more than once in a genome, it is still counted only once for that genome. The count for a k-mer in the resulting table therefore gives the number of genomes in the full set in which that k-mer occurs.
If DNA or RNA genome sequences are used, reverse complementarity must be taken into account: once a k-mer A has occurred, its reverse complement A' should also be regarded as having occurred, so both A and A' are recorded in the table. In the subsequent steps, whenever an operation is described for a k-mer A of a DNA or RNA sequence, its reverse complement A' is by default treated as subject to the same operation.
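The once-per-genome counting rule and the reverse-complement rule can be sketched together. The function names and the in-memory `Counter` are assumptions made for illustration; real genome collections would need a more compact k-mer index.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Reverse complement of a DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]

def genome_kmers(genome, k):
    """Distinct k-mers of one genome, each paired with its reverse complement."""
    kmers = set()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if set(kmer) <= set("ACGT"):      # skip non-deterministic characters
            kmers.add(kmer)
            kmers.add(reverse_complement(kmer))
    return kmers

def genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes in which it occurs.
    A k-mer repeated within one genome still counts once for that genome."""
    index = Counter()
    for genome in genomes:
        index.update(genome_kmers(genome, k))
    return index
```

Using a set per genome is what enforces the "count once per genome" rule before the counts are accumulated.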
Step 1302D: generate the genome-occurrence index table corresponding to the target pathogen operating group.
The genome-occurrence index table of the target pathogen operating group differs from the genome-occurrence index table of the full set in step 1302C. The table for the full set records how many genomes across all pathogen operating groups, that is, how many genomes in the full set, contain a given k-mer. The table for the target pathogen operating group records, for each k-mer contained in that group, how many genomes of the group contain it.
Step 1302E: generate the specific k-mer table corresponding to the target pathogen operating group.
The specific k-mer table of the target pathogen operating group records the k-mers in the group that satisfy the preset specificity conditions, that is, the specific k-mers. To be selected as a specific k-mer, a k-mer must satisfy the following two conditions:
1. If the target pathogen operating group contains N genomes and a k-mer has a count of C_1 in the group's genome-occurrence index table, the condition C_1/N + P_1 ≥ 1 must hold. That is, the ratio of the k-mer's count in the group's table to the number of genomes in the group, plus the first threshold, must be at least 1. The first threshold P_1 is usually less than 5%.
2. If a k-mer has a count of C_1 in the group's genome-occurrence index table and a count of C_2 in the genome-occurrence index table of the full set, the condition C_1/C_2 + P_2 ≥ 1 must hold. That is, the ratio of the k-mer's count in the group's table to its count in the full-set table, plus the second threshold, must be at least 1. The second threshold P_2 is usually less than 5%.
The first threshold P_1 and the second threshold P_2 may or may not be equal. Introducing these two parameters when selecting specific k-mers permits a bounded error rate, that is, a bounded degree of non-specificity in the specific k-mers. Without them no non-specificity could be tolerated, and it would often be hard to find any specific k-mer for a given pathogen operating group.
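The two conditions translate directly into a filter over the two index tables. The sketch below assumes both tables are plain dictionaries of counts; the function name and the default thresholds of 0.02 are illustrative choices, not values from the source.

```python
def select_specific_kmers(group_index, full_index, n_genomes, p1=0.02, p2=0.02):
    """Return k-mers that satisfy both specificity conditions.

    group_index: k-mer -> count C_1 in the target group's table.
    full_index:  k-mer -> count C_2 in the full-set table.
    n_genomes:   N, the number of genomes in the target group.
    p1, p2:      first and second thresholds (assumed defaults).
    """
    specific = []
    for kmer, c1 in group_index.items():
        c2 = full_index[kmer]  # C_2 >= C_1 because the group is in the full set
        present_in_group = c1 / n_genomes + p1 >= 1   # condition 1
        rare_elsewhere = c1 / c2 + p2 >= 1            # condition 2
        if present_in_group and rare_elsewhere:
            specific.append(kmer)
    return specific
```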
Suppose n specific k-mers are found for a pathogen operating group, and assume that the exceptions permitted by P_1 in condition 1 of this step are randomly distributed among the genomes of the group. The probability of a false negative for the group is then at most P_1^n (P_1 raised to the power n), which is extremely small for sufficiently large n. Likewise, if n' specific k-mers of the group are actually detected, and the exceptions permitted by P_2 in condition 2 of this step are randomly distributed among the genomes outside the group, the probability of a false positive for the group is at most P_2^n' (P_2 raised to the power n'), which is extremely small for sufficiently large n'. The false negative rate is the proportion of positives that yield a negative test result, that is, the conditional probability of a negative result given that the condition being tested for is present.
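As a worked illustration of how quickly these bounds shrink, take the upper limit of 5% for both thresholds and assume, purely for the example, that n = n' = 10 specific k-mers are available:

```latex
P(\text{false negative}) \le P_1^{\,n} = 0.05^{10} \approx 9.8 \times 10^{-14},
\qquad
P(\text{false positive}) \le P_2^{\,n'} = 0.05^{10} \approx 9.8 \times 10^{-14}
```

Both figures rest on the random-distribution assumption stated above; correlated exceptions would weaken the bound.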
Step 1304: determine the detection targets of the target pathogen operating group.
As shown in FIG. 15, step 1304 includes:
Step 1304A: take each genome included in the target pathogen operating group in turn as the reference genome, and map each specific k-mer of the target pathogen operating group to the reference genome.
Each genome included in the target pathogen operating group is taken in turn as the reference genome, and each specific k-mer in the group's specific k-mer table is mapped to it. Because specific k-mers are k-mers selected in advance for satisfying the preset specificity conditions, some of them may fail to map to a given genome. A specific k-mer that maps successfully to the reference genome is treated as a specific k-mer contained in that reference genome; one that cannot be mapped to a genome is regarded as not contained in it. The mapping here can therefore be seen as a second confirmation of which specific k-mers each genome contains. Because positional shifts may occur, this second confirmation improves fault tolerance.
步骤1304B,依次从参考基因组中选取一个区域与特异性k-mer进行比较,当检测到选取的区域与特异性k-mer的相似度达到预设相似阈值时,则将特异性k-mer作为参考基因组包含的特异性k-mer。 Step 1304B, sequentially select a region from the reference genome and compare it with the specific k-mer. When it is detected that the similarity between the selected region and the specific k-mer reaches a preset similarity threshold, the specific k-mer is taken as The reference genome contains a specific k-mer.
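The comparison in step 1304B can be sketched as a sliding-window scan. This is a minimal illustration, not the method's actual aligner: it assumes that similarity means the fraction of identical positions, scans only the forward strand, and uses the hypothetical names `map_kmer` and `min_identity`.

```python
def map_kmer(kmer: str, genome: str, min_identity: float = 1.0) -> list[int]:
    """Return 0-based start positions where the k-mer maps to the genome."""
    k = len(kmer)
    if k == 0:
        raise ValueError("k-mer must not be empty")
    hits = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        # Identity = fraction of positions where window and k-mer agree.
        matches = sum(a == b for a, b in zip(kmer, window))
        if matches / k >= min_identity:
            hits.append(start)
    return hits
```

A k-mer with at least one hit would be recorded as contained in the reference genome; with `min_identity=1.0` the scan reduces to exact matching.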
Step 1304C: select two specific k-mers contained in the reference genome in turn for checking.
Step 1304D: when the distance between the two selected specific k-mers on the reference genome is less than the preset distance threshold, replace the two selected specific k-mers to obtain a non-overlapping specific region.
Step 1304E: generate, from the non-overlapping specific regions obtained for each genome, the set of non-overlapping specific regions corresponding to that genome.
Two specific k-mers contained in a genome are selected in turn, and their distance on the genome is compared with the preset distance threshold; if the distance is less than the threshold, the two k-mers are replaced. When the distance is less than or equal to 0, they can be replaced by the smallest region that covers both; this single region is the non-overlapping specific region obtained from the two k-mers. Alternatively, the stretch of genome sequence to which the two k-mers map can be extracted as the corresponding non-overlapping specific region. The preset distance threshold may be negative or 0, and is generally set to an integer less than 5. A distance of 0 means that the two selected k-mers are directly adjacent and abut each other; a negative distance means that they overlap by a certain number of base pairs.
When specific k-mers are processed into non-overlapping specific regions, one region may result from replacing two specific k-mers, three specific k-mers, or more, so the length of the resulting non-overlapping specific region is not limited. If the distance between the two selected specific k-mers on the genome is not less than the preset distance threshold, no processing is needed. After every genome has been processed in this way, the set of non-overlapping specific regions contained in each genome is obtained; the set corresponding to a genome contains the non-overlapping specific regions of that genome.
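Steps 1304C to 1304E amount to merging mapped k-mer intervals. The sketch below assumes that k-mer hits are given as 0-based start positions, that the distance between two k-mers is the next start minus the current end (negative for an overlap, 0 for adjacency), and that all names, including `max_gap`, are hypothetical.

```python
def merge_kmers(positions: list[int], k: int, max_gap: int = 1) -> list[tuple[int, int]]:
    """Merge k-mer hits into half-open regions [start, end) on one genome.

    Two neighbours are merged when their distance is below max_gap, which
    plays the role of the preset distance threshold.
    """
    regions: list[tuple[int, int]] = []
    for start in sorted(positions):
        end = start + k
        if regions and start - regions[-1][1] < max_gap:
            # Replace both by the smallest region that covers them.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

Because the scan runs left to right over sorted positions, a chain of three or more nearby k-mers collapses into a single region, which matches the observation that the region length is not limited.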
Step 1304F: obtain, for each non-overlapping specific region in the set corresponding to each genome, its number of occurrences across all sets of non-overlapping specific regions.
Once the set of non-overlapping specific regions corresponding to each genome has been obtained, the number of occurrences of each region across all such sets can be obtained. The set corresponding to one genome can be viewed as a small set, and the sets of all genomes together form the full set of non-overlapping specific regions. The full set contains the non-overlapping specific regions of all genomes, so for every region in a genome's own set, its number of occurrences in the full set of non-overlapping specific regions can be obtained.
Suppose N sets of non-overlapping specific regions have been obtained for N genomes. In general a non-overlapping specific region does not appear in every genome, so its number of occurrences across all sets is at most N.
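The counting in step 1304F can be sketched with a counter over per-genome sets. The sketch assumes that each region is identified by its sequence string and that two regions count as the same only when their sequences are identical; the names are hypothetical.

```python
from collections import Counter


def count_occurrences(region_sets: dict[str, set[str]]) -> Counter:
    """Count in how many genomes each non-overlapping specific region occurs.

    region_sets maps a genome identifier to that genome's set of regions.
    """
    counts: Counter = Counter()
    for regions in region_sets.values():
        # A set holds each region once, so a genome adds at most 1 per region.
        counts.update(regions)
    return counts
```

Since every genome contributes at most one occurrence per region, no count can exceed the number of genomes N.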
Step 1304G: select the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as the detection targets of the target pathogen operating group.
Once the number of occurrences of each non-overlapping specific region across all sets has been obtained, the regions whose number of occurrences exceeds the preset count threshold are selected as the detection targets of the target pathogen operating group. The preset count threshold can be set by a technician according to the needs of the project. Several non-overlapping specific regions may be selected, and they may correspond to several genomes.
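The selection in step 1304G is a simple filter over the occurrence counts. In this sketch, "exceeds" is read as strictly greater than, and `select_targets` and `count_threshold` are hypothetical names.

```python
def select_targets(counts: dict[str, int], count_threshold: float) -> list[str]:
    """Return regions whose occurrence count exceeds the preset threshold."""
    # Sorted only to make the result deterministic.
    return sorted(region for region, n in counts.items() if n > count_threshold)
```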
The selected non-overlapping specific regions have the following characteristics: (1) they are generally much longer than the k-mers obtained from the characteristic target sequences of the target pathogen operating group; (2) they occur in essentially all genomes of the target pathogen operating group; (3) they essentially do not occur in genomes outside the target pathogen operating group. These characteristics satisfy the technical requirements that most diagnostic detection technologies place on detection targets. The selected regions can therefore be given a simple further screening according to the requirements of a particular diagnostic detection technology (for example length, GC content percentage and annealing temperature). The regions that pass form the final set of detection targets for detecting the target pathogen operating group, and from the sequences in this set a user can synthesize and manufacture molecular probes suited to that diagnostic detection technology.
Step 1304H: take the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as representative specific regions, and take the genome that contains the largest number of such regions as the representative genome.
Step 1304I: take the set of representative specific regions corresponding to the representative genome as the PCR representative specific region set.
After the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold have been selected as detection targets of the target pathogen operating group, a representative genome can be chosen according to how these regions are distributed over the genomes. For each genome, the number of such regions it contains is counted, and the genome containing the most of them is chosen as the representative genome. The non-overlapping specific regions contained in the representative genome are, in effect, the determined detection targets of the target pathogen operating group.
The representative specific regions are the non-overlapping specific regions whose number of occurrences across all sets exceeds the preset count threshold, so there are several of them. Several representative specific regions may be selected from each genome, and each genome has a corresponding set of representative specific regions. The representative genome is one genome chosen from the genomes of the target pathogen operating group, so it too has a corresponding set of representative specific regions. After the representative genome has been chosen, its set of representative specific regions can therefore be used as the PCR (polymerase chain reaction) representative specific region set.
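Steps 1304H and 1304I can be sketched as picking the genome whose region set shares the most members with the representative specific regions. The names are hypothetical, and ties are resolved by dictionary order, which is an assumption the text does not address.

```python
def pick_representative(
    region_sets: dict[str, set[str]], representative_regions: set[str]
) -> tuple[str, set[str]]:
    """Return the representative genome and its PCR representative region set."""
    best = max(
        region_sets,
        key=lambda genome: len(region_sets[genome] & representative_regions),
    )
    return best, region_sets[best] & representative_regions
```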
Before step 1304I, the non-overlapping specific regions that have no biological function need to be removed so that only representative specific regions with a biological function remain. For each genome in the target pathogen operating group, gene annotation information can be obtained from the target database. Gene annotation information marks the position and function of each gene in a genome, so the annotation of each genome contains the position and functional information of every known functional region on that genome. For example, the GenBank gene annotation of a genome can be obtained from NCBI's GenBank database, or its gene annotation can be obtained from the Ensembl database. Position includes start and end coordinates, strand (plus or minus), sequence and so on. Functional information identifies, for example, protein-coding genes, microRNA genes, promoter regions, regions recognized and bound by regulatory proteins, and origins of replication.
Using the gene annotation information obtained for each genome in the target pathogen operating group, the non-overlapping specific regions contained in the representative genome can be screened. The representative specific regions are the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold. To decide whether a representative specific region has a biological function, each region is compared in turn with the known functional regions of all genomes in the target pathogen operating group, and it is determined whether the two overlap significantly, that is, whether the degree of overlap of the two sequences reaches the preset overlap threshold. A representative specific region whose overlap with the known functional regions is below the preset overlap threshold is considered to have no biological function and can be removed. The remaining representative specific regions, whose overlap with known functional regions is above the preset overlap threshold, are the representative specific regions that have a biological function.
Significant overlap means that the degree of overlap of the two sequences reaches the preset overlap threshold. The preset overlap threshold may be any of the following: the length of the overlapping segment exceeds a threshold T1, for example 12 bp; the length of the overlapping segment as a percentage of the length of the specific region exceeds a threshold T2, for example 30%; the length of the overlapping segment as a percentage of the length of the functional region concerned exceeds a threshold T3, for example 30%; or the total length of all functional regions contained in the specific region as a percentage of the length of the specific region exceeds a threshold T4, for example 30%.
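The four overlap criteria can be sketched as interval arithmetic. This is one possible reading, not the patent's definitive rule: it assumes half-open coordinates on a single genome, treats the criteria as alternatives (any one suffices), and approximates the T4 total by summing overlaps under the assumption that annotated features do not overlap one another. All names are hypothetical.

```python
Interval = tuple[int, int]  # half-open [start, end) on one genome


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def has_biological_function(
    region: Interval,
    features: list[Interval],
    t1: int = 12,
    t2: float = 0.30,
    t3: float = 0.30,
    t4: float = 0.30,
) -> bool:
    """True if the region overlaps annotated features significantly."""
    region_len = region[1] - region[0]
    if region_len <= 0:
        raise ValueError("region must have positive length")
    total = 0
    for feature in features:
        shared = overlap_length(region, feature)
        feature_len = feature[1] - feature[0]
        total += shared
        # T1: absolute overlap; T2: overlap relative to the region.
        if shared > t1 or shared / region_len > t2:
            return True
        # T3: overlap relative to the functional region.
        if feature_len > 0 and shared / feature_len > t3:
            return True
    # T4: all functional sequence inside the region, relative to the region.
    return total / region_len > t4
```

Regions for which the function returns `False` would be dropped from the total set of representative specific regions.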
Step 1304J: select from the PCR representative specific region set the non-overlapping specific regions whose separation falls within the preset distance range, and generate the set of qualifying PCR-specific region pairs.
The PCR representative specific region set contains one or more non-overlapping specific regions, and the position of each of them on the representative genome can be obtained. Pairs of non-overlapping specific regions whose distance on the genome falls within the preset distance range can then be found and collected into the set of qualifying PCR-specific region pairs. The preset distance range D may be (MD-SD, MD+SD), where MD can be set to about 1000 bp and SD to about 500 bp.
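The pairing in step 1304J can be sketched as follows. The text does not say how the distance between two regions is measured; this sketch assumes it is the gap between the end of the upstream region and the start of the downstream one, with regions given as half-open intervals. The function and parameter names are hypothetical.

```python
from itertools import combinations


def pair_regions(
    regions: list[tuple[int, int]], md: int = 1000, sd: int = 500
) -> list[tuple[tuple[int, int], tuple[int, int]]]:
    """Return region pairs whose separation lies in the open range (md-sd, md+sd)."""
    pairs = []
    for upstream, downstream in combinations(sorted(regions), 2):
        distance = downstream[0] - upstream[1]
        if md - sd < distance < md + sd:
            pairs.append((upstream, downstream))
    return pairs
```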
Step 1304K: select a PCR-specific region pair from the set of qualifying PCR-specific region pairs, map its two non-overlapping specific regions to the representative genome, and take the sequence of the interval that the two regions form on the representative genome as an interval to be detected.
Step 1304L: screen each interval to be detected to obtain the final set of detection primer pairs.
When the two PCR-specific regions of a pair from the set of qualifying PCR-specific region pairs are mapped to the representative genome, they delimit an interval on it. The sequence of that interval is obtained and used as an interval to be detected on the representative genome. Doing this for every pair in the set yields the corresponding intervals, which together form the set of intervals to be detected; this set contains one or more intervals.
After the intervals to be detected have been obtained in this way, each of them is screened, for example with a PCR primer design tool such as Primer3. Some of the intervals are screened out, and one or more sets of candidate PCR primer pairs are obtained. A primer pair generated automatically by most existing PCR primer design tools is only guaranteed to be specific within the interval supplied to the tool, not in other regions. In addition, some automatic primer design tools cannot take into account the specific-region annotation within the interval to be detected. The specificity of the primers in the candidate sets therefore needs to be verified further: the primer pairs that are specific to the target pathogen operating group are selected from the candidate sets, and the set of specific primer pairs corresponding to the target pathogen operating group is generated.
For this purpose, each genome in the full set that does not belong to the target pathogen operating group can be used as an alignment reference genome, and the two primers of a candidate PCR primer pair are each aligned against, and mapped to, that genome. To decide whether mapping has succeeded, each primer is compared with the sequence at the position to which it maps on the alignment reference genome; when the sequence similarity reaches the preset similarity threshold, the primer pair is deemed to have mapped successfully. A candidate PCR primer pair that maps successfully is judged not to be specific to the target pathogen operating group and is removed from the candidate set. The remaining specific primer pairs form the set of specific primer pairs corresponding to the target pathogen operating group. These can then be screened further: the primer pairs that satisfy the preset primer conditions are selected from the set of specific primer pairs and used as the final detection primer pairs, which form the final set of detection primer pairs.
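The background check described above can be sketched with a brute-force scan. This is an illustration under several assumptions rather than the actual alignment procedure: similarity is taken as positional identity, both strands are checked through the reverse complement, a pair is rejected only when both of its primers map to the same background genome, and genomes are uppercase A/C/G/T strings. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def maps_to(primer: str, genome: str, min_identity: float) -> bool:
    """True if the primer matches some window of the genome on either strand."""
    k = len(primer)
    if k == 0:
        raise ValueError("primer must not be empty")
    for probe in (primer.upper(), reverse_complement(primer)):
        for start in range(len(genome) - k + 1):
            window = genome[start:start + k]
            matches = sum(a == b for a, b in zip(probe, window))
            if matches / k >= min_identity:
                return True
    return False


def specific_pairs(pairs, background_genomes, min_identity=0.9):
    """Keep primer pairs that do not map, as a pair, to any background genome."""
    kept = []
    for forward, reverse in pairs:
        hit = any(
            maps_to(forward, g, min_identity) and maps_to(reverse, g, min_identity)
            for g in background_genomes
        )
        if not hit:
            kept.append((forward, reverse))
    return kept
```

For genomes of realistic size an indexed aligner would replace the quadratic scan; the sketch only shows the filtering logic.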
The preset primer conditions include at least one of the following: the primer length is between 17 and 28 bp; the primer annealing temperature is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG or GC; the last 5 bases at the 3' end contain no more than 3 G/C bases and no run of more than 2 consecutive C or G bases; the primer contains no repeat sequences or mononucleotide repeats; and there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
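Several of these conditions can be checked directly on the primer string. The sketch below covers only length, GC percentage and the 3'-end rules, and returns one flag per condition so that the caller can decide how to combine them. Annealing temperature, repeats and complementarity are deliberately left out because they depend on models the text does not specify. The function name and dictionary keys are hypothetical.

```python
import re


def check_primer(primer: str) -> dict[str, bool]:
    """Evaluate simple sequence-based primer conditions one by one."""
    seq = primer.upper()
    if not seq:
        raise ValueError("primer must not be empty")
    tail = seq[-5:]  # last 5 bases at the 3' end
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    return {
        "length": 17 <= len(seq) <= 28,
        "gc_percent": 0.40 <= gc_fraction <= 0.60,
        "three_prime_gc": seq[-1] in "GC",
        "tail_gc_count": sum(base in "GC" for base in tail) <= 3,
        # No run of more than 2 consecutive C/G bases in the tail.
        "tail_gc_run": re.search(r"[CG]{3,}", tail) is None,
    }
```

A caller could keep a primer when `all(check_primer(p).values())` holds, or apply any subset of the flags.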
For a selected target pathogen operating group, the procedure of step 1302 must be run before the procedure of step 1304. If the genome data of the pathogen or the background genome data are updated, steps 1302 and 1304 must be run again.
It should be understood that although the steps in the flowcharts of FIG. 1 to FIG. 15 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the figures may include several sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times, and they are not necessarily performed one after another; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 16, an apparatus for determining detection targets is provided, including:
a determining module 1602, configured to determine the target pathogen operating group to be detected;
a specific k-mer acquisition module 1604, configured to obtain from the target database the specific k-mers included in the target pathogen operating group, where a specific k-mer is a k-mer that satisfies the preset specificity conditions and a k-mer is a genome sequence of length k, and to determine the specific k-mers contained in each genome included in the target pathogen operating group;
a non-overlapping specific region acquisition module 1606, configured to process the specific k-mers contained in each genome to obtain the set of non-overlapping specific regions corresponding to each genome, the set containing non-overlapping specific regions; and
a detection target selection module 1608, configured to obtain, for each non-overlapping specific region in the set corresponding to each genome, its number of occurrences across all sets of non-overlapping specific regions, and to select the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as the detection targets of the target pathogen operating group.
In one embodiment, the specific k-mer acquisition module 1604 is further configured to take each genome included in the target pathogen operating group in turn as the reference genome; map each specific k-mer included in the target pathogen operating group to the reference genome; and take the specific k-mers that map to the reference genome as the specific k-mers contained in the reference genome.
In one embodiment, the specific k-mer acquisition module 1604 is further configured to select regions of the reference genome in turn and compare each with the specific k-mer; and, when the similarity between a selected region and the specific k-mer reaches the preset similarity threshold, take the specific k-mer as a specific k-mer contained in the reference genome.
In one embodiment, the specific k-mer acquisition module 1604 is further configured to select from the reference genome, in turn, sequences of the same length as the specific k-mer and compare each with the specific k-mer; and, when a selected sequence is identical to the specific k-mer, take the specific k-mer as a specific k-mer contained in the reference genome.
In one embodiment, the apparatus further includes a data creation module (not shown in the figure), configured to obtain pre-selected genomes that satisfy the preset confidence condition as high-confidence genomes; and to determine the high-confidence genomes included in each pathogen operating group as the genomes corresponding to that pathogen operating group.
In one embodiment, the preset confidence condition is satisfied in any of the following cases: the proportion of non-deterministic characters in the genome sequence is below a preset proportion threshold; the number of sequence fragments in the genome sequence that belong to the same chromosome is below a preset fragment threshold; or, when the genome sequence is aligned against all other genome sequences whose genetic relationship to it falls within a preset genetic distance threshold range, its average whole-sequence coverage percentage among those closely related genome sequences is above a preset percentage value.
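The first confidence condition can be sketched as a character count. The sketch assumes that a non-deterministic character is any symbol other than A, C, G or T (for example N), and the default threshold of 1% is a hypothetical value; the other two conditions need assembly and alignment data and are not shown.

```python
def ambiguous_fraction(sequence: str) -> float:
    """Fraction of characters that are not A, C, G or T."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base not in "ACGT" for base in seq) / len(seq)


def passes_ambiguity_check(sequence: str, max_fraction: float = 0.01) -> bool:
    """True if the share of non-deterministic characters is below the threshold."""
    return ambiguous_fraction(sequence) < max_fraction
```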
In one embodiment, the non-overlapping specific region acquisition module 1606 is further configured to map the specific k-mers contained in each genome to that genome; select, in turn, the specific k-mers and/or non-overlapping specific regions contained in each genome for checking; when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than the preset distance threshold, replace them to obtain a replaced non-overlapping specific region; and obtain the set of non-overlapping specific regions corresponding to each genome from the specific k-mers finally retained and the replaced non-overlapping specific regions.
In one embodiment, the non-overlapping specific region acquisition module 1606 is further configured to select the specific k-mers and/or non-overlapping specific regions contained in each genome for checking; when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than the preset distance threshold, replace them to obtain a replaced non-overlapping specific region; and obtain the set of non-overlapping specific regions corresponding to each genome from the specific k-mers finally retained and the replaced non-overlapping specific regions.
In one embodiment, the non-overlapping specific region acquisition module 1606 is further configured to, when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than or equal to zero, replace the two selected specific k-mers with the smallest region that covers both, obtaining a non-overlapping specific region; and, when the distance is greater than zero, obtain the sequence that lies between the two selected specific k-mers and/or non-overlapping specific regions on the genome to which they are mapped, splice the two selected specific k-mers and the intervening sequence together in order to obtain a spliced sequence, and replace the two selected specific k-mers with the spliced sequence, obtaining a non-overlapping specific region.
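The two replacement cases can be sketched on sequence strings. The sketch assumes that both k-mers map exactly, that positions are 0-based starts on the same genome, and that the distance is the second start minus the first end; the name `merge_pair` is hypothetical.

```python
def merge_pair(genome: str, start_a: int, start_b: int, k: int) -> str:
    """Return the non-overlapping specific region built from two mapped k-mers."""
    first, second = sorted((start_a, start_b))
    distance = second - (first + k)  # < 0 overlap, 0 adjacent, > 0 gap
    if distance <= 0:
        # Smallest region that covers both k-mers.
        return genome[first:second + k]
    # k-mer, intervening sequence, k-mer, spliced in genome order.
    gap = genome[first + k:second]
    return genome[first:first + k] + gap + genome[second:second + k]
```

Under exact mapping both branches yield the same slice `genome[first:second + k]`; they are kept separate here only to mirror the two cases in the text.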
在其中一个实施例中,预设距离阈值小于5。In one embodiment, the preset distance threshold is less than 5.
在其中一个实施例中,上述检测靶点选取模块1608还用于将所述目标病原体操作组中包含的每个基因组对应的非重合特异性区域集合进行汇总,得到非重合特异性区域并集;及获取每个基因组对应的非重合特异性区域集合中包含的每个非重合特异性区域在非重合特异性区域并集中的出现次数。In one embodiment, the detection target selection module 1608 is further configured to summarize a set of non-overlapping specific regions corresponding to each genome included in the target pathogen operation group, to obtain a union of non-overlapping specific regions; And obtaining the number of times that each non-overlap specific region included in the set of non-overlap specific regions corresponding to each genome appears in the non-overlap specific regions.
In one embodiment, the preset count threshold = (1-Y)*N, where Y is a preset first condition threshold and N is the number of sets of non-overlapping specific regions.
In one embodiment, the preset first condition threshold is less than 5%.
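By way of a hedged sketch, the occurrence count and the threshold (1-Y)*N could be computed as follows. The sketch assumes that each non-overlapping specific region is identified by a hashable key (here its sequence string) so that the same region can be recognised in different genomes; how regions are matched across genomes is not specified in this passage, and the name `select_frequent_regions` and the value Y = 0.04 are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def select_frequent_regions(region_sets: Dict[str, Set[str]], y: float = 0.04) -> List[str]:
    """Return regions whose occurrence count exceeds (1 - y) * N.

    region_sets maps a genome identifier to its set of non-overlapping
    specific regions; N is the number of such sets.
    """
    n = len(region_sets)
    threshold = (1 - y) * n
    # Each set contributes at most one occurrence of a given region.
    counts = Counter(region for regions in region_sets.values() for region in regions)
    return sorted(region for region, count in counts.items() if count > threshold)
```

With three genomes and Y = 0.04 the threshold is 2.88, so only a region present in all three sets is selected.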
In one embodiment, the apparatus further includes a primer screening module (not shown in the figures), configured to: take each selected non-overlapping specific region whose number of occurrences exceeds the preset count threshold as a representative specific region; form a total set of representative specific regions from the representative specific regions obtained; remove from that total set the representative specific regions that have no biological function, to obtain a total set of representative specific regions having biological function; and use the non-overlapping specific regions in the total set of representative specific regions having biological function as the detection targets of the target pathogen operation group.
In one embodiment, the primer screening module is further configured to: obtain from the target database the gene annotation information of each genome included in the target pathogen operation group, the gene annotation information containing the position of each known functional region on each genome and the corresponding functional information; select representative specific regions one at a time from the total set of representative specific regions and compare each with the known functional regions in all of the genomes; and remove the representative specific regions whose overlap with the known functional regions is shorter than a preset overlap threshold, to obtain the total set of representative specific regions having biological function.
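The overlap filter can be illustrated with a short sketch. It assumes that representative specific regions and annotated functional regions are half-open intervals on the same genome, and that a region is kept when its overlap with at least one annotated region reaches the threshold; the names `overlap_length` and `keep_functional_regions` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one genome (assumed convention)


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def keep_functional_regions(
    regions: List[Interval], annotated: List[Interval], min_overlap: int
) -> List[Interval]:
    """Drop regions whose overlap with every annotated region is below min_overlap."""
    return [
        region
        for region in regions
        if any(overlap_length(region, feature) >= min_overlap for feature in annotated)
    ]
```

For instance, with one annotated region [20, 120) and a threshold of 25, the region [0, 50) is kept (overlap 30) and the region [100, 150) is removed (overlap 20).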
In one embodiment, the primer screening module is further configured to: take the genome containing the largest number of non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as a representative genome; take the set of representative specific regions corresponding to the representative genome as a set of PCR representative specific regions; select from the set of PCR representative specific regions the non-overlapping specific regions that meet a preset separation distance range, and generate a set of eligible PCR-specific region pairs; select one PCR-specific region pair from the set of eligible PCR-specific region pairs and map its two PCR-specific regions onto the representative genome; take the sequence corresponding to the interval formed on the representative genome by the two selected PCR-specific regions as an interval to be detected; form a set of intervals to be detected from the intervals obtained; and screen each interval in the set of intervals to be detected to obtain a final set of detection primer pairs.
In one embodiment, the primer screening module is further configured to: obtain the total set of representative specific regions from the selected non-overlapping specific regions whose number of occurrences exceeds the preset count threshold; and select, as the representative genome, the genome that contains the largest number of the non-overlapping specific regions in the total set of representative specific regions.
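A minimal sketch of the representative genome selection follows. It assumes the same region keys as the per-genome sets and breaks ties by genome identifier order; the tie rule and the name `pick_representative_genome` are assumptions, since this passage does not state how ties are resolved.

```python
from typing import Dict, Set


def pick_representative_genome(region_sets: Dict[str, Set[str]], representative: Set[str]) -> str:
    """Return the genome that contains the most representative specific regions.

    Ties are resolved by taking the first genome identifier in sorted order.
    """
    if not region_sets:
        raise ValueError("at least one genome is required")
    return max(sorted(region_sets), key=lambda genome: len(region_sets[genome] & representative))
```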
In one embodiment, the primer screening module is further configured to: obtain the position in the representative genome of each PCR representative specific region in the set of PCR representative specific regions; take two PCR representative specific regions whose separation distance falls within the preset separation distance range as an eligible PCR-specific region pair; and generate the set of eligible PCR-specific region pairs from the eligible PCR-specific region pairs.
In one embodiment, the preset separation distance range is greater than 500 bp and less than 1500 bp.
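The pairing step can be sketched as follows, assuming that the separation distance of two regions is the gap between the end of the upstream region and the start of the downstream region, measured in bp on the representative genome. This distance definition and the name `eligible_pairs` are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the representative genome


def eligible_pairs(
    regions: List[Interval], min_dist: int = 500, max_dist: int = 1500
) -> List[Tuple[Interval, Interval]]:
    """Return region pairs whose gap is greater than min_dist and less than max_dist."""
    pairs = []
    for upstream, downstream in combinations(sorted(regions), 2):
        distance = downstream[0] - upstream[1]
        if min_dist < distance < max_dist:
            pairs.append((upstream, downstream))
    return pairs
```

For regions at [0, 30), [600, 640) and [2500, 2530), only the first two form an eligible pair, since their gap is 570 bp.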
In one embodiment, the primer screening module is further configured to: take the two outermost ends, on the representative genome, of the two non-overlapping specific regions in a selected PCR-specific region pair as the boundaries of the interval to be detected; and take the sequence corresponding to the interval formed on the representative genome by those boundaries as the interval to be detected.
In one embodiment, the primer screening module is further configured to mark, on the representative genome, the positions of the two selected non-overlapping specific regions and the position of the non-specific region between them.
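As a sketch under stated assumptions, the interval to be detected and its annotations can be derived from one region pair as shown here. The genome is assumed to be a plain string, regions are half-open intervals, and the dictionary layout and the name `detection_interval` are illustrative only.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the representative genome


def detection_interval(genome: str, a: Interval, b: Interval) -> Dict[str, object]:
    """Build the interval spanned by the outermost ends of two specific regions."""
    first, second = sorted((a, b))
    start = min(a[0], b[0])
    end = max(a[1], b[1])
    # The non-specific part is whatever lies between the two regions, if anything.
    nonspecific = (first[1], second[0]) if first[1] < second[0] else None
    return {
        "start": start,
        "end": end,
        "sequence": genome[start:end],
        "specific": [first, second],
        "nonspecific": nonspecific,
    }
```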
In one embodiment, the primer screening module is further configured to: screen each interval in the set of intervals to be detected with a PCR primer tool to obtain a set of candidate PCR primer pairs; select from the set of candidate PCR primer pairs the primer pairs that are specific to the target pathogen operation group, and generate a set of specific primer pairs corresponding to the target pathogen operation group; select the primer pairs in the set of specific primer pairs that meet preset primer conditions as final detection primer pairs; and generate a final set of detection primer pairs from the final detection primer pairs.
In one embodiment, the primer screening module is further configured to: obtain a full set from the target database, the full set containing a plurality of collected high-confidence genomes; obtain from the full set the genomes that are not included in the target pathogen operation group, as comparison reference genomes; select primer pairs one at a time from the set of candidate PCR primer pairs and map them to the comparison reference genomes; compare each selected primer pair with the sequence at the position to which it maps on the comparison reference genome, and determine that the primer pair is successfully mapped when the sequence similarity reaches a preset similarity threshold; remove, from the primer pairs determined to be successfully mapped, the primer pairs that satisfy preset alignment conditions, to obtain the specific primer pairs corresponding to the target pathogen operation group; and generate the set of specific primer pairs corresponding to the target pathogen operation group from the specific primer pairs.
In one embodiment, the primer screening module is further configured to compare the selected primer pair with the sequence at the position to which it maps on the comparison reference genome, and to determine that the primer pair is successfully mapped when the sequence similarity reaches the preset similarity threshold.
In one embodiment, the preset alignment conditions include at least one of the following: the two primers of the selected primer pair map to the same chromosome of the same genome; the distance between the two primers of the selected primer pair is within a preset distance range; at the 3' end of either primer of the selected primer pair, a preset number of bases are identical to the bases at the position on the comparison reference genome to which the primer pair maps.
In one embodiment, meeting the preset primer conditions includes at least one of the following: the primer length is between 17 and 28 bp; the primer annealing temperature is between 52 and 58 degrees Celsius; the GC percentage is between 40% and 60%; the 3' end of the primer is C, G, CG, or GC; the last 5 bases at the 3' end of the primer contain no more than 3 G/C bases, and contain no run of more than 2 consecutive C or G bases; the primer contains no repeat sequences or single-nucleotide repeats; there is no 3'-end complementarity between the two primers and no self-complementarity within a single primer.
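Several of the sequence-only primer conditions can be checked directly, as in the following hedged sketch. It covers length, GC percentage, the 3' terminal base and the composition of the last 5 bases; annealing temperature, repeats and complementarity are omitted because they need a thermodynamic model or a dedicated primer tool. The function name `primer_checks`, the result keys, and the reading of "consecutive C or G" as any mixed run of G/C bases are assumptions.

```python
from typing import Dict


def primer_checks(primer: str) -> Dict[str, bool]:
    """Evaluate a subset of the preset primer conditions for one primer."""
    seq = primer.upper()
    if not seq:
        raise ValueError("primer must not be empty")
    tail = seq[-5:]  # last 5 bases at the 3' end
    gc_total = sum(base in "GC" for base in seq)
    longest_run = run = 0
    for base in tail:
        run = run + 1 if base in "GC" else 0
        longest_run = max(longest_run, run)
    return {
        "length_17_28": 17 <= len(seq) <= 28,
        "gc_40_60": 0.40 <= gc_total / len(seq) <= 0.60,
        # C, G, CG and GC endings all finish with a C or a G.
        "three_prime_gc": seq[-1] in "GC",
        "tail_gc_at_most_3": sum(base in "GC" for base in tail) <= 3,
        "tail_gc_run_at_most_2": longest_run <= 2,
    }
```

Returning one flag per condition leaves it to the caller to decide which conditions must hold, in line with the "at least one of" wording above.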
In one embodiment, a k-mer is a specific k-mer when it satisfies the following two conditions: its number of occurrences in the genome occurrence count index table corresponding to the target pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence count index table corresponding to the target pathogen operation group, together with its number of occurrences in the genome occurrence count index table of the full set, satisfies a second preset error condition. The genome occurrence count index table corresponding to the target pathogen operation group records, for each k-mer, the number of genomes in the target pathogen operation group that contain it; the genome occurrence count index table of the full set records, for each k-mer, the number of genomes in the full set that contain it.
In one embodiment, the first preset error condition is: the sum of a first threshold and the ratio of the number of occurrences in the genome occurrence count index table of the target pathogen operation group to the number of genomes included in the target pathogen operation group is greater than or equal to 1.
In one embodiment, the first threshold is less than 5%.
In one embodiment, the second preset error condition is: the sum of a second threshold and the ratio of the number of occurrences in the genome occurrence count index table of the target pathogen operation group to the number of occurrences in the genome occurrence count index table of the full set is greater than or equal to 1.
In one embodiment, the second threshold is less than 5%.
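The two error conditions can be written as a short check. In this sketch `group_count` and `full_count` stand for the entries of one k-mer in the two index tables and `group_size` for the number of genomes in the target pathogen operation group; the function name `is_specific_kmer` and the threshold values of 0.04 are assumptions chosen to satisfy "less than 5%".

```python
def is_specific_kmer(
    group_count: int, group_size: int, full_count: int, t1: float = 0.04, t2: float = 0.04
) -> bool:
    """Apply the first and second preset error conditions to one k-mer."""
    if group_size <= 0 or full_count <= 0:
        return False
    # First condition: the k-mer occurs in nearly every genome of the group.
    in_most_of_group = group_count / group_size + t1 >= 1
    # Second condition: nearly all genomes that contain it belong to the group.
    rare_outside_group = group_count / full_count + t2 >= 1
    return in_most_of_group and rare_outside_group
```

A k-mer found in 98 of 100 group genomes and in 100 genomes of the full set passes both conditions, whereas the same k-mer found in 200 genomes of the full set fails the second.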
For the specific limitations of the apparatus for determining a detection target, refer to the limitations of the method for determining a detection target above; they are not repeated here. Each module in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 17. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores the data used for determining detection targets. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a method for determining a detection target.
Those skilled in the art will understand that the structure shown in FIG. 17 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processor, implement the steps of the method for determining a detection target provided in any embodiment of the present application.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for determining a detection target provided in any embodiment of the present application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, any combination that involves no contradiction should be regarded as falling within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (27)

  1. A method for determining a detection target, comprising:
    determining a target pathogen operation group to be detected;
    obtaining, from a target database, the specific k-mers included in the target pathogen operation group, wherein a specific k-mer is a k-mer that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
    determining the specific k-mers contained in each genome included in the target pathogen operation group;
    processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set containing non-overlapping specific regions;
    obtaining the number of times each non-overlapping specific region contained in the set corresponding to each genome occurs across all of the sets of non-overlapping specific regions; and
    selecting the non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as detection targets of the target pathogen operation group.
  2. The method according to claim 1, wherein determining the specific k-mers contained in each genome included in the target pathogen operation group comprises:
    taking each genome included in the target pathogen operation group in turn as a reference genome;
    mapping each specific k-mer included in the target pathogen operation group to the reference genome; and
    taking the specific k-mers that map to the reference genome as the specific k-mers contained in the reference genome.
  3. The method according to claim 1, further comprising, before determining the target pathogen operation group to be detected:
    obtaining pre-selected genomes that satisfy a preset confidence condition as high-confidence genomes; and
    determining the high-confidence genomes included in each pathogen operation group as the genomes corresponding to that pathogen operation group.
  4. The method according to claim 3, wherein satisfying the preset confidence condition comprises any one of the following:
    the proportion of non-deterministic characters contained in the genomic sequence is lower than a preset proportion threshold;
    the number of sequence fragments belonging to the same chromosome contained in the genomic sequence is lower than a preset fragment threshold;
    a genomic sequence is aligned against all other genomic sequences whose genetic relationship to it falls within a preset genetic distance threshold range, to determine the average whole-sequence coverage percentage of that genomic sequence among its closely related genomic sequences, and the average coverage percentage is higher than a preset percentage value.
  5. The method according to claim 1, wherein processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome comprises:
    mapping the specific k-mers contained in each genome onto that genome;
    selecting in turn the specific k-mers and/or the non-overlapping specific regions contained in each genome for checking;
    when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than a preset distance threshold, replacing the selected specific k-mers and/or non-overlapping specific regions to obtain a replaced non-overlapping specific region; and
    obtaining the set of non-overlapping specific regions corresponding to each genome from the finally retained specific k-mers and the replaced non-overlapping specific regions.
  6. The method according to claim 5, wherein replacing the two selected specific k-mers to obtain a non-overlapping specific region comprises:
    when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is less than or equal to zero, replacing the two selected specific k-mers with the smallest region that covers both of them, to obtain a non-overlapping specific region;
    when the distance on the genome between the selected specific k-mers and/or non-overlapping specific regions is greater than zero, obtaining the sequence lying between the two selected specific k-mers and/or non-overlapping specific regions on the genome to which they are mapped;
    splicing the two selected specific k-mers and the intervening sequence in order to obtain a spliced sequence; and
    replacing the two selected specific k-mers with the spliced sequence to obtain a non-overlapping specific region.
  7. The method according to claim 5, wherein the preset distance threshold is less than 5.
  8. The method according to claim 1, wherein obtaining the number of times each non-overlapping specific region contained in the set corresponding to each genome occurs across all of the sets of non-overlapping specific regions comprises:
    pooling the sets of non-overlapping specific regions corresponding to each genome included in the target pathogen operation group to obtain a union of non-overlapping specific regions; and
    obtaining the number of times each non-overlapping specific region contained in the set corresponding to each genome appears in the union of non-overlapping specific regions.
  9. The method according to claim 1, wherein the preset count threshold = (1-Y)*N, where Y is a preset first condition threshold and N is the number of sets of non-overlapping specific regions.
  10. The method according to claim 9, wherein the preset first condition threshold is less than 5%.
  11. The method according to claim 1, further comprising, after selecting the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as detection targets of the target pathogen operation group:
    taking each selected non-overlapping specific region whose number of occurrences exceeds the preset count threshold as a representative specific region;
    forming a total set of representative specific regions from the representative specific regions obtained;
    removing from the total set of representative specific regions the representative specific regions that have no biological function, to obtain a total set of representative specific regions having biological function; and
    using the non-overlapping specific regions in the total set of representative specific regions having biological function as the detection targets of the target pathogen operation group.
  12. The method according to claim 11, wherein removing from the total set of representative specific regions the representative specific regions that have no biological function, to obtain a total set of representative specific regions having biological function, comprises:
    obtaining from the target database the gene annotation information of each genome included in the target pathogen operation group, the gene annotation information containing the position of each known functional region on each genome and the corresponding functional information;
    selecting representative specific regions one at a time from the total set of representative specific regions and comparing each with the known functional regions in all of the genomes; and
    removing the representative specific regions whose overlap with the known functional regions is shorter than a preset overlap threshold, to obtain the total set of representative specific regions having biological function.
  13. The method according to claim 1, further comprising, after selecting the non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as detection targets of the target pathogen operation group:
    taking the genome containing the largest number of non-overlapping specific regions whose number of occurrences exceeds the preset count threshold as a representative genome;
    taking the set of representative specific regions corresponding to the representative genome as a set of PCR representative specific regions;
    selecting from the set of PCR representative specific regions the non-overlapping specific regions that meet a preset separation distance range, and generating a set of eligible PCR-specific region pairs;
    selecting one PCR-specific region pair from the set of eligible PCR-specific region pairs and mapping its two non-overlapping specific regions onto the representative genome;
    taking the sequence corresponding to the interval formed on the representative genome by the two selected non-overlapping specific regions as an interval to be detected;
    forming a set of intervals to be detected from the intervals obtained; and
    screening each interval in the set of intervals to be detected to obtain a final set of detection primer pairs.
  14. The method according to claim 13, wherein selecting from the set of PCR representative specific regions the non-overlapping specific regions that meet a preset separation distance range, and generating a set of eligible PCR-specific region pairs, comprises:
    obtaining the position in the representative genome of each PCR representative specific region in the set of PCR representative specific regions;
    taking two PCR representative specific regions whose separation distance falls within the preset separation distance range as an eligible PCR-specific region pair; and
    generating the set of eligible PCR-specific region pairs from the eligible PCR-specific region pairs.
  15. The method according to claim 14, wherein the preset separation distance range is greater than 500 bp and less than 1500 bp.
  16. The method according to claim 13, wherein screening each interval in the set of intervals to be detected to obtain a final set of detection primer pairs comprises:
    screening each interval in the set of intervals to be detected with a PCR primer tool to obtain a set of candidate PCR primer pairs;
    selecting from the set of candidate PCR primer pairs the primer pairs that are specific to the target pathogen operation group, and generating a set of specific primer pairs corresponding to the target pathogen operation group;
    selecting the primer pairs in the set of specific primer pairs that meet preset primer conditions as final detection primer pairs; and
    generating the final set of detection primer pairs from the final detection primer pairs.
  17. The method according to claim 16, wherein selecting, from the set of candidate PCR primer pairs, the primer pairs specific to the target pathogen operation group, and generating the set of specific primer pairs corresponding to the target pathogen operation group, comprises:
    obtaining a complete set from the target database, the complete set containing a plurality of collected high-confidence genomes;
    obtaining, from the complete set, the genomes not included in the target pathogen operation group, as alignment reference genomes;
    selecting primer pairs in turn from the set of candidate PCR primer pairs and mapping them to the alignment reference genomes;
    comparing the selected primer pair with the sequence at the position to which it maps in the alignment reference genome, and determining that the primer pair is successfully mapped when the sequence similarity reaches a preset similarity threshold;
    removing, from the primer pairs determined to be successfully mapped, the primer pairs that satisfy a preset alignment condition, to obtain the specific primer pairs corresponding to the target pathogen operation group; and
    generating the set of specific primer pairs corresponding to the target pathogen operation group from the specific primer pairs.
  18. The method according to claim 17, wherein the preset alignment condition comprises at least one of the following:
    the two primers of the selected primer pair both map to the same chromosome of the same genome;
    the distance between the two primers of the selected primer pair is within a preset distance range;
    a preset number of bases at the 3' end of either primer of the selected primer pair are identical to the bases at the position to which the primer pair maps in the alignment reference genome.
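The filtering of claims 17 and 18 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the `Hit` record, the function names, the 2000 bp distance limit, and the 5-base 3' window are all hypothetical, and the mapping step itself is assumed to have been done by an external aligner. The claim does not say how the three conditions are combined, so the sketch exposes that choice as the `combine` parameter (`all` or `any`).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    """One primer mapped onto a non-target (alignment reference) genome."""
    genome: str
    chromosome: str
    position: int
    primer: str      # primer sequence, 5'->3'
    reference: str   # reference bases at the hit, same orientation and length


def alignment_flags(
    fwd: Hit, rev: Hit, max_distance: int = 2000, n_three_prime: int = 5
) -> Tuple[bool, bool, bool]:
    """Evaluate the three conditions of claim 18 for one pair of hits."""
    n = n_three_prime
    same_chrom = fwd.genome == rev.genome and fwd.chromosome == rev.chromosome
    # Distance is only meaningful when both primers sit on the same chromosome.
    in_range = same_chrom and abs(fwd.position - rev.position) <= max_distance
    # True when either primer's last n bases match the reference exactly.
    three_prime = any(h.primer[-n:] == h.reference[-n:] for h in (fwd, rev))
    return same_chrom, in_range, three_prime


def keep_specific(
    pair_hits: Dict[str, List[Tuple[Hit, Hit]]],
    combine: Callable[[Iterable[bool]], bool] = all,
) -> List[str]:
    """Drop primer pairs with at least one off-target hit meeting the condition."""
    return [
        name
        for name, hits in pair_hits.items()
        if not any(combine(alignment_flags(f, r)) for f, r in hits)
    ]
```

With `combine=all` a pair is removed only when an off-target hit looks like a plausible amplicon (same chromosome, close together, matching 3' end); with `combine=any` a single satisfied condition is enough, which is the stricter reading.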
  19. The method according to claim 16, wherein the preset primer conditions comprise at least one of the following:
    the primer length is between 17 and 28 bp;
    the primer annealing temperature is between 52 and 58 degrees Celsius;
    the GC percentage is between 40% and 60%;
    the 3' end of the primer is C, G, CG, or GC;
    the last 5 bases at the 3' end of the primer contain no more than 3 G/C bases and no run of more than 2 consecutive C or G bases; the primer contains no repeat sequences or single-nucleotide repeats;
    there is no 3'-end complementarity between the two primers, and no self-complementarity within a single primer.
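A rough Python sketch of the primer conditions in claim 19 follows. Several details are assumptions rather than anything stated in the claim: the melting-temperature estimate uses the basic GC-content formula as a stand-in for annealing temperature, a "run of C or G" is read as three or more consecutive bases drawn from {C, G}, a single-nucleotide repeat is read as a run of four or more identical bases, 3' complementarity is checked over the last four bases only, and all conditions are required together even though the claim requires only at least one. All function names are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(seq: str) -> float:
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def estimate_tm(seq: str) -> float:
    # Basic GC-content estimate for oligos longer than 13 nt (assumed here;
    # the claim does not say how the temperature is computed).
    gc = sum(base in "GC" for base in seq)
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def three_prime_complementary(a: str, b: str, n: int = 4) -> bool:
    """True when the last n bases of a can pair with the last n bases of b."""
    tail_b_revcomp = b[-n:].translate(COMPLEMENT)[::-1]
    return a[-n:] == tail_b_revcomp


def primer_ok(seq: str, max_homopolymer: int = 3) -> bool:
    seq = seq.upper()
    tail = seq[-5:]
    return (
        17 <= len(seq) <= 28
        and 52.0 <= estimate_tm(seq) <= 58.0
        and 40.0 <= gc_percent(seq) <= 60.0
        and seq[-1] in "CG"                       # ends in C, G, CG or GC
        and sum(base in "GC" for base in tail) <= 3
        and re.search(r"[CG]{3,}", tail) is None  # no C/G run longer than 2
        and re.search(r"(.)\1{%d,}" % max_homopolymer, seq) is None
        and not three_prime_complementary(seq, seq)  # crude self-dimer proxy
    )


def pair_ok(fwd: str, rev: str) -> bool:
    return (
        primer_ok(fwd)
        and primer_ok(rev)
        and not three_prime_complementary(fwd.upper(), rev.upper())
    )
```

In practice a dedicated primer-design tool would apply thermodynamic models for these checks; the sketch only shows how the listed thresholds translate into simple predicates.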
  20. The method according to claim 1, wherein each k-mer among the specific k-mers satisfies the following two conditions:
    its number of occurrences in the genome occurrence index table corresponding to the target pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to the target pathogen operation group, together with its number of occurrences in the genome occurrence index table of the complete set, satisfies a second preset error condition;
    the genome occurrence index table corresponding to the target pathogen operation group records, for each k-mer, the number of genomes in the target pathogen operation group that contain that k-mer; the genome occurrence index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain that k-mer.
  21. The method according to claim 20, wherein the first preset error condition is: the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operation group to the number of genomes included in the target pathogen operation group, plus a first threshold, is greater than or equal to 1.
  22. The method according to claim 21, wherein the first threshold is less than 5%.
  23. The method according to claim 20, wherein the second preset error condition is: the ratio of the number of occurrences in the genome occurrence index table of the target pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set, plus a second threshold, is greater than or equal to 1.
  24. The method according to claim 23, wherein the second threshold is less than 5%.
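The two error conditions of claims 20 to 24 amount to a sensitivity check (the k-mer is present in nearly every genome of the group) and a specificity check (nearly every genome containing the k-mer belongs to the group). A minimal Python sketch, assuming genomes are plain strings, reverse complements are ignored, the complete set includes the group's own genomes, and both thresholds are set to an illustrative 0.04; the function names are hypothetical:

```python
from collections import Counter
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def occurrence_index(genomes: List[str], k: int) -> Counter:
    """Map each k-mer to the number of genomes that contain it."""
    index = Counter()
    for seq in genomes:
        index.update(kmers(seq, k))  # a set, so each genome counts once
    return index


def specific_kmers(
    group: List[str], complete_set: List[str], k: int,
    t1: float = 0.04, t2: float = 0.04,
) -> Set[str]:
    """k-mers meeting both preset error conditions (claims 21 and 23)."""
    group_index = occurrence_index(group, k)
    all_index = occurrence_index(complete_set, k)  # must include the group
    n_group = len(group)
    return {
        kmer
        for kmer, hits in group_index.items()
        if hits / n_group + t1 >= 1            # first condition
        and hits / all_index[kmer] + t2 >= 1   # second condition
    }
```

For example, with a group of 100 genomes and `t1 = 0.04`, a k-mer must occur in at least 96 of them to pass the first condition.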
  25. An apparatus for determining a detection target, comprising:
    a determination module, configured to determine a target pathogen operation group to be detected;
    a specific k-mer acquisition module, configured to obtain, from a target database, the specific k-mers contained in the target pathogen operation group, a specific k-mer being a k-mer that satisfies a preset specificity condition, and a k-mer being a genomic sequence of length k; and to determine the specific k-mers contained in each genome included in the target pathogen operation group;
    a non-overlapping specific region acquisition module, configured to process the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions; and
    a detection target selection module, configured to obtain, for each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome, its number of occurrences across all the sets of non-overlapping specific regions; and to select the non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as detection targets of the target pathogen operation group.
  26. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    determining a target pathogen operation group to be detected;
    obtaining, from a target database, the specific k-mers contained in the target pathogen operation group, a specific k-mer being a k-mer that satisfies a preset specificity condition, and a k-mer being a genomic sequence of length k;
    determining the specific k-mers contained in each genome included in the target pathogen operation group;
    processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions;
    obtaining, for each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome, its number of occurrences across all the sets of non-overlapping specific regions; and
    selecting the non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as detection targets of the target pathogen operation group.
  27. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    determining a target pathogen operation group to be detected;
    obtaining, from a target database, the specific k-mers contained in the target pathogen operation group, a specific k-mer being a k-mer that satisfies a preset specificity condition, and a k-mer being a genomic sequence of length k;
    determining the specific k-mers contained in each genome included in the target pathogen operation group;
    processing the specific k-mers contained in each genome to obtain a set of non-overlapping specific regions corresponding to each genome, the set of non-overlapping specific regions containing non-overlapping specific regions;
    obtaining, for each non-overlapping specific region contained in the set of non-overlapping specific regions corresponding to each genome, its number of occurrences across all the sets of non-overlapping specific regions; and
    selecting the non-overlapping specific regions whose number of occurrences exceeds a preset count threshold as detection targets of the target pathogen operation group.
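The last three steps shared by claims 25 to 27 can be sketched in Python as follows. The claims in this section do not spell out how specific k-mers are processed into non-overlapping specific regions, so the sketch assumes that overlapping or adjacent specific k-mer hits are merged into maximal intervals and that a region is identified across genomes by its sequence. The function names and the string representation of genomes are hypothetical.

```python
from collections import Counter
from typing import List, Set


def specific_regions(genome: str, specific: Set[str], k: int) -> Set[str]:
    """Merge overlapping or adjacent specific k-mer hits into maximal regions."""
    regions: Set[str] = set()
    start = end = None
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] not in specific:
            continue
        if start is None:
            start, end = i, i + k
        elif i <= end:
            end = i + k  # hit touches the open region: extend it
        else:
            regions.add(genome[start:end])  # gap found: close the region
            start, end = i, i + k
    if start is not None:
        regions.add(genome[start:end])
    return regions


def detection_targets(
    genomes: List[str], specific: Set[str], k: int, min_count: int
) -> Set[str]:
    """Regions found in more than min_count of the per-genome region sets."""
    counts = Counter()
    for genome in genomes:
        counts.update(specific_regions(genome, specific, k))
    return {region for region, n in counts.items() if n > min_count}
```

The comparison `n > min_count` is strict because the claims require the number of occurrences to exceed the preset count threshold.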
PCT/CN2018/111924 2018-06-22 2018-10-25 Method, apparatus, computer device and storage medium for determining target to be detected WO2019242186A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810651693.9A CN110021365B (en) 2018-06-22 2018-06-22 Method, device, computer equipment and storage medium for determining detection target point
CN201810651693.9 2018-06-22

Publications (1)

Publication Number Publication Date
WO2019242186A1 (en) 2019-12-26

Family

ID=67188391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111924 WO2019242186A1 (en) 2018-06-22 2018-10-25 Method, apparatus, computer device and storage medium for determining target to be detected

Country Status (2)

Country Link
CN (1) CN110021365B (en)
WO (1) WO2019242186A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326210B (en) * 2020-03-11 2023-07-14 中国科学院生态环境研究中心 Primer design method and system based on k-mer algorithm
CN112634983B (en) * 2021-01-08 2021-07-09 江苏先声医疗器械有限公司 Pathogen species specific PCR primer optimization design method
CN116597893B (en) * 2023-06-14 2023-12-15 北京金匙医学检验实验室有限公司 Method for predicting drug resistance gene-pathogenic microorganism attribution

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030143554A1 (en) * 2001-03-31 2003-07-31 Berres Mark E. Method of genotyping by determination of allele copy number
CN102222175A (en) * 2011-05-06 2011-10-19 西南大学 Method for developing functional molecular marker related to miRNA
CN102270282A (en) * 2010-06-01 2011-12-07 上海聚类生物科技有限公司 MicroRNA coding region target gene forecasting method
CN103571833A (en) * 2013-11-18 2014-02-12 四川农业大学 Design method of SSR label primer and wheat SSR label primers
WO2015058095A1 (en) * 2013-10-18 2015-04-23 Seven Bridges Genomics Inc. Methods and systems for quantifying sequence alignment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060286566A1 (en) * 2005-02-03 2006-12-21 Helicos Biosciences Corporation Detecting apparent mutations in nucleic acid sequences
US20140288844A1 (en) * 2013-03-15 2014-09-25 Cosmosid Inc. Characterization of biological material in a sample or isolate using unassembled sequence information, probabilistic methods and trait-specific database catalogs
US10230390B2 (en) * 2014-08-29 2019-03-12 Bonnie Berger Leighton Compressively-accelerated read mapping framework for next-generation sequencing
CA2977548A1 (en) * 2015-04-24 2016-10-27 University Of Utah Research Foundation Methods and systems for multiple taxonomic classification
CA2911002C (en) * 2015-11-04 2016-11-29 Travis Wilfred BANKS High throughput method of screening a population for members comprising mutations(s) in a target sequence using alignment-free sequence analysis
CN108090327B (en) * 2017-12-20 2022-03-29 吉林大学 Prediction method for exogenous miRNA (micro ribonucleic acid) regulation and control target gene containing three-dimensional free energy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HENG LI: "Mapping short DNA sequencing reads and calling variants using mapping quality scores", GENOME RESEARCH, vol. 18, no. 11, 30 November 2008 (2008-11-30), pages 1851 - 1858, XP001503357, DOI: 10.1101/GR.078212.108 *
PETER J. CAMPBELL: "The patterns and dynamics of genomic instability in metastatic pancreatic cancer", NATURE, vol. 467, 28 October 2010 (2010-10-28), pages 1109 - 1113, XP055553003, DOI: 10.1038/nature09460 *

Also Published As

Publication number Publication date
CN110021365B (en) 2021-01-22
CN110021365A (en) 2019-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18923525

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/05/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18923525

Country of ref document: EP

Kind code of ref document: A1