WO2019242445A1 - Method, device, computer device, and storage medium for detecting a pathogen operation group - Google Patents


Publication number
WO2019242445A1
Authority
WO
WIPO (PCT)
Prior art keywords
pathogen
mer
occurrences
operation group
specific
Prior art date
Application number
PCT/CN2019/087580
Other languages
English (en)
Chinese (zh)
Inventor
孙亚洲
郭雨舜
杜晓骏
陈斌
杜刘稳
Original Assignee
深圳市达仁基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市达仁基因科技有限公司
Publication of WO2019242445A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • The present application relates to the technical fields of gene detection and gene sequence analysis, and in particular to a method, a device, a computer device, and a storage medium for detecting a pathogen operating group.
  • Gene sequencing is a gene detection technology that can read the sequences and fragments of the genomes of the species contained in a variety of biological samples, such as blood, saliva, and tissue samples, and then analyze the sequencing data.
  • Such analysis can support diagnostic functions such as discovering the source of infection of infectious diseases, determining the pathogenic genes of hereditary diseases, and predicting the incidence of chronic diseases.
  • Researchers can currently use high-throughput sequencing and subsequent analysis of the sequencing data to detect the various species contained in a sample, including pathogenic microorganisms.
  • A method, a device, a computer device, and a storage medium for detecting a pathogen operating group are provided.
  • A method for detecting a pathogen operating group includes:
  • The characteristic target sequence set includes the specific k-mers in the pathogen operation group that satisfy a preset specificity condition.
  • A k-mer refers to a genomic sequence of length k.
  • A method for detecting the relative concentration of a pathogen operating group includes:
  • The relative concentration in the sequencing data of each pathogen operating group contained in the complete set is calculated from the number of occurrences GCT and the total number of occurrences CT.
  • A detection device for a pathogen operating group includes:
  • A sequencing data acquisition module, used to obtain sequencing data of the sample;
  • A target sequence set acquisition module, used to obtain the characteristic target sequence set corresponding to each pathogen operation group stored in the target database, where the characteristic target sequence set includes the specific k-mers in the pathogen operation group that satisfy a preset specificity condition, and a k-mer refers to a genomic sequence of length k;
  • A specific k-mer occurrence count acquisition module, used to obtain the number of occurrences in the sequencing data of the specific k-mers included in the characteristic target sequence set corresponding to each pathogen operation group;
  • A pathogen operation group selection module, configured to select each pathogen operation group whose number of occurrences exceeds a preset number as a pathogen operation group included in the sequencing data.
  • A device for detecting the relative concentration of a pathogen operating group includes:
  • A sequencing data acquisition module, used to obtain the sequencing data of the sample and calculate the total number of occurrences CT of the k-mers in the sequencing data;
  • An occurrence count GCT acquisition module, used to obtain the number of occurrences GCT in the sequencing data of the k-mers included in each pathogen operation group in the complete set;
  • A relative concentration calculation module, used to calculate the relative concentration in the sequencing data of each pathogen operation group contained in the complete set according to the number of occurrences GCT and the total number of occurrences CT.
  • A computer device includes a memory and a processor.
  • The memory stores computer-readable instructions.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • The characteristic target sequence set includes the specific k-mers in the pathogen operation group that satisfy a preset specificity condition.
  • A k-mer refers to a genomic sequence of length k.
  • A computer-readable storage medium stores computer-readable instructions.
  • When executed by a processor, the computer-readable instructions implement the following steps:
  • The characteristic target sequence set includes the specific k-mers in the pathogen operation group that satisfy a preset specificity condition.
  • A k-mer refers to a genomic sequence of length k.
  • A computer device includes a memory and a processor.
  • The memory stores computer-readable instructions.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • The relative concentration in the sequencing data of each pathogen operating group contained in the complete set is calculated from the number of occurrences GCT and the total number of occurrences CT.
  • A computer-readable storage medium stores computer-readable instructions.
  • When executed by a processor, the computer-readable instructions implement the following steps:
  • The relative concentration in the sequencing data of each pathogen operating group contained in the complete set is calculated from the number of occurrences GCT and the total number of occurrences CT.
  • FIG. 1 is a schematic flowchart of a detection method for a pathogen operation group according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of the steps before step 102 according to one or more embodiments.
  • FIG. 2A is a schematic flowchart of the steps before step 110 according to one or more embodiments.
  • FIG. 2B is a schematic flowchart of the steps after step 12 according to one or more embodiments.
  • FIG. 2C is a schematic flowchart of the steps after step 108 according to one or more embodiments.
  • FIG. 2D is a schematic flowchart of step 16 according to one or more embodiments.
  • FIG. 2E is a schematic flowchart of step 16 according to one or more embodiments.
  • FIG. 3 is a schematic flowchart of step 108 according to one or more embodiments.
  • FIG. 4 is a schematic diagram of a record table of the actual numbers of occurrences of specific k-mers according to one or more embodiments.
  • FIG. 5 is a schematic flowchart of step 306 according to one or more embodiments.
  • FIG. 6 is a schematic flowchart of step 506 according to one or more embodiments.
  • FIG. 7 is a schematic flowchart of step 504 according to one or more embodiments.
  • FIG. 8 is a schematic flowchart of the steps before obtaining sequencing data of a sample in another embodiment.
  • FIG. 9 is a schematic diagram of a specific k-mer occurrence ratio table according to one or more embodiments.
  • FIG. 3A is a schematic flowchart of step 108 in another embodiment.
  • FIG. 3B is a schematic flowchart of step 108 in still another embodiment.
  • FIG. 10 is a schematic flowchart of the steps after step 108 according to one or more embodiments.
  • FIG. 11 is a schematic flowchart of the steps after step 108 in another embodiment.
  • FIG. 12 is a schematic flowchart of step 1104 according to one or more embodiments.
  • FIG. 13 is a schematic diagram of a table of estimated actual numbers of k-mer occurrences according to one or more embodiments.
  • FIG. 14 is a schematic flowchart of step 1204 according to one or more embodiments.
  • FIG. 15 is a schematic flowchart of the steps before step 1206 according to one or more embodiments.
  • FIG. 16 is a schematic flowchart of the steps before obtaining sequencing data of a sample in another embodiment.
  • FIG. 17 is a schematic diagram of a table of the proportions of k-mer occurrences in the complete set according to one or more embodiments.
  • FIG. 18 is a schematic flowchart of the steps before obtaining sequencing data of a sample in another embodiment.
  • FIG. 19 is a schematic flowchart of the steps before obtaining sequencing data of a sample in another embodiment.
  • FIG. 20 is a schematic flowchart of a method for detecting a pathogen operation group in another embodiment.
  • FIG. 21 is a schematic flowchart of step 2002 according to one or more embodiments.
  • FIG. 22 is a schematic flowchart of step 2004 according to one or more embodiments.
  • FIG. 23 is a schematic flowchart of step 2006 according to one or more embodiments.
  • FIG. 24 is a schematic flowchart of a method for detecting the relative concentration of a pathogen operating group according to one or more embodiments.
  • FIG. 25 is a schematic flowchart of step 2404 according to one or more embodiments.
  • FIG. 26 is a schematic flowchart of the steps after step 2406 according to one or more embodiments.
  • FIG. 27 is a block diagram of a detection device for a pathogen operating group according to one or more embodiments.
  • FIG. 28 is a block diagram of a detection device for a pathogen operation group in another embodiment.
  • FIG. 29 is a block diagram of a detection device for a pathogen operation group in another embodiment.
  • FIG. 30 is a structural block diagram of a device for detecting the relative concentration of a pathogen operating group according to one or more embodiments.
  • FIG. 31 is a block diagram of a computer device according to one or more embodiments.
  • A method for detecting a pathogen operation group includes the following steps:
  • Step 102: Obtain sequencing data of the sample.
  • Sample sequencing data refers to the data output after a DNA sequencer, an RNA sequencer, or a protein sequencing device reads the sequence of a biomolecule contained in the sample.
  • A set of sequencing data includes multiple (possibly several million or more) pieces of data to be tested, and each piece of data to be tested can be abstracted as a string.
  • DNA sequencing is the process of determining the exact sequence of nucleotides within a DNA molecule. It includes any method or technique for determining the order of the four bases adenine, guanine, cytosine, and thymine in a DNA strand.
  • A sequencer is an instrument capable of measuring the sequence of an input sample. The sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA. A sample can be a drop of blood, a sputum sample, a handful of soil, and so on.
  • Step 104: Obtain the characteristic target sequence set corresponding to each pathogen operation group stored in the target database. The characteristic target sequence set includes the specific k-mers of the pathogen operation group, where a k-mer refers to a genomic sequence of length k.
  • A pathogen operating group can represent a genetic unit or a taxonomic unit at different classification levels, such as a species, a subspecies, a subtype, a bacterial or viral strain, or a genus.
  • A pathogen operating group may include one or more related genomes.
  • The target database stores a characteristic target sequence set established in advance for each pathogen operating group, and the characteristic target sequence set corresponding to each pathogen operating group includes the specific k-mers of that pathogen operating group.
  • A specific k-mer is a k-mer selected from the k-mers included in a pathogen operating group because it meets a preset specificity condition.
  • The preset specificity condition is a condition set by a technician in advance for selecting matching k-mers. It may be determined according to the technician's judgment or the actual project requirements.
  • A k-mer refers to a genomic sequence of length k, where k is a natural number. If genomic data contains a distinct deterministic characters, then for a given k there can be at most a^k (a to the power of k) distinct k-mers.
  • For nucleic acid sequences, the deterministic characters are the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, the deterministic characters are the defined amino acid characters.
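  • The k-mer definition above can be illustrated with a minimal Python sketch. The function names `iter_kmers`, `count_kmers`, and `max_distinct_kmers` are hypothetical and chosen only for this illustration; they do not come from the application.

```python
from collections import Counter


def iter_kmers(sequence: str, k: int):
    """Yield every substring of length k, scanning left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def count_kmers(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in one sequence."""
    return Counter(iter_kmers(sequence, k))


def max_distinct_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound a**k on the number of distinct k-mers."""
    return alphabet_size ** k


# Illustrative input only: a 7-character string holds 7 - 3 + 1 = 5 3-mers.
example_counts = count_kmers("ACGTACG", 3)
example_bound = max_distinct_kmers(4, 3)  # alphabet A, C, G, T
```

  • A sequence shorter than k simply yields no k-mers in this sketch.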
  • Step 106: Obtain the number of occurrences in the sequencing data of the specific k-mers included in the characteristic target sequence set corresponding to each pathogen operation group.
  • Step 108: Select each pathogen operation group whose number of occurrences exceeds a preset number as a pathogen operation group included in the sequencing data.
  • The sequencing data of the sample can be compared with each pathogen operation group stored in the target database; that is, the number of occurrences in the sequencing data of the specific k-mers contained in the characteristic target sequence set of each pathogen operation group is obtained. The pathogen operation groups contained in the sequencing data can then be determined from these occurrence counts. Specifically, a threshold number can be set in advance; when the specific k-mers of a pathogen operation group occur in the sequencing data more than the preset number of times, that pathogen operation group is taken as a pathogen operation group included in the sequencing data. Each set of sequencing data can therefore contain one or more pathogen operation groups.
  • The preset threshold number can be set by technicians according to actual project requirements.
  • The target database stores the characteristic target sequence set corresponding to each pathogen operation group, and the specific k-mers contained in each characteristic target sequence set are k-mers that satisfy the preset specificity condition. Therefore, when the sample sequencing data is analyzed, the occurrences in the sequencing data of the specific k-mers contained in the characteristic target sequence set of each pathogen operation group can be used to determine the pathogen operation groups contained in the sequencing data.
  • This detection method compares the sequencing data with the characteristic target sequences corresponding to each pathogen operating group to obtain the pathogen operating groups contained in the sequencing data. This reduces the comparison space, thereby reducing analysis time and improving detection efficiency.
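  • Steps 102 to 108 can be sketched as follows. This is only an illustration under stated assumptions: the target database is modeled as a plain dictionary, each specific k-mer is assumed to belong to exactly one pathogen operation group, and the names `detect_groups`, `target_sets`, and `min_occurrences` are hypothetical.

```python
from collections import Counter


def detect_groups(reads, target_sets, k, min_occurrences):
    """Count specific k-mer hits per group and keep groups above a threshold.

    reads: list of strings (the pieces of data to be tested).
    target_sets: dict mapping group name -> set of specific k-mers
                 (assumed disjoint between groups).
    """
    # Invert the sets so each specific k-mer points to its group.
    kmer_to_group = {}
    for group, kmers in target_sets.items():
        for kmer in kmers:
            kmer_to_group[kmer] = group

    occurrences = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            group = kmer_to_group.get(read[start:start + k])
            if group is not None:
                occurrences[group] += 1

    # Step 108: keep groups whose count exceeds the preset number.
    detected = {g for g, n in occurrences.items() if n > min_occurrences}
    return detected, occurrences


# Illustrative data only.
target_sets = {"group_A": {"ACG", "CGT"}, "group_B": {"TTT"}}
reads = ["ACGTAC", "GGACGT", "TTTAAA"]
detected, occurrences = detect_groups(reads, target_sets, k=3, min_occurrences=2)
```

  • Only the specific k-mers are looked up, which reflects the reduced comparison space described above.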
  • A specific k-mer is a k-mer in the pathogen operation group whose number of occurrences in the genome occurrence number index table of the pathogen operation group meets a preset error condition.
  • The characteristic target sequence set corresponding to each pathogen operation group includes the specific k-mers of that pathogen operation group that satisfy the preset specificity condition.
  • The preset specificity condition is that the number of occurrences of the k-mer in the genome occurrence number index table of its pathogen operation group meets a preset error condition.
  • The preset error condition is set in advance by the technician according to the actual project requirements.
  • The error condition can be a range; that is, a k-mer selected as specific is allowed a certain error rather than having to satisfy some strict objective condition exactly.
  • For each pathogen operation group there is a corresponding genome occurrence number index table.
  • From the genome occurrence number index table corresponding to each pathogen operation group, it can be known how many genomes in the pathogen operation group contain each k-mer included in that group. The k-mers whose number of occurrences in the genome occurrence number index table meets the preset error condition can therefore be selected, and the selected k-mers are taken as the specific k-mers.
  • Before step 102, the method further includes the following steps: generating a genome occurrence number index table corresponding to each pathogen operation group, where the table records, for each k-mer, the number of genomes in the corresponding pathogen operation group that contain the k-mer; and storing the genome occurrence number index table in the characteristic target sequence set corresponding to the pathogen operation group.
  • The genome is all the genetic information in an organism. This genetic information is stored in the form of a nucleotide sequence. The sum of the genetic material in a complete individual unit of an organism (such as an individual animal or plant, an animal or plant cell, or an individual bacterium) is the genome.
  • Each pathogen operating group can include multiple genomes, and each genome can include multiple k-mers.
  • The genome occurrence number index table corresponding to each pathogen operation group records how many genomes of the pathogen operation group each of its k-mers has appeared in; that is, for each k-mer, the table records the number of genomes in the corresponding pathogen operation group that contain the k-mer.
  • The genome occurrence number index table corresponding to each pathogen operating group can be stored in the characteristic target sequence set corresponding to that pathogen operating group, that is, in the target database. When the table is needed, it can be retrieved from the database, which improves detection efficiency.
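  • A genome occurrence number index table for one pathogen operation group could be built as in the sketch below. The text does not specify the error condition, so the selection rule here (a k-mer may be missing from at most `max_missing` genomes of the group) is purely an assumption for illustration, as are the function names. The sketch also does not check whether a k-mer occurs in other groups.

```python
from collections import Counter


def genome_occurrence_index(genomes, k):
    """For each k-mer, count how many genomes of the group contain it."""
    index = Counter()
    for genome in genomes:
        # A set ensures a k-mer is counted once per genome.
        distinct = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        index.update(distinct)
    return index


def select_by_error_condition(index, genome_count, max_missing):
    """Assumed error condition: present in all but max_missing genomes."""
    return {kmer for kmer, n in index.items() if n >= genome_count - max_missing}


# Illustrative genomes only.
group_genomes = ["ACGTAC", "ACGTTT", "GGACGA"]
index = genome_occurrence_index(group_genomes, k=3)
strict = select_by_error_condition(index, len(group_genomes), max_missing=0)
relaxed = select_by_error_condition(index, len(group_genomes), max_missing=1)
```

  • With `max_missing=0` only k-mers shared by every genome of the group are kept; a larger tolerance admits k-mers that are absent from a few genomes.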
  • The method further includes the following steps:
  • Step 110: Select the k-mers that satisfy the preset specificity condition from the k-mers corresponding to each pathogen operation group.
  • Step 112: Store the k-mers that meet the preset specificity condition in the characteristic target sequence set corresponding to each pathogen operation group.
  • Each characteristic target sequence set includes the specific k-mers corresponding to its pathogen operating group. A specific k-mer is a k-mer selected from the k-mers included in a pathogen operation group because it meets the preset specificity condition. Once the k-mers that meet the preset specificity condition have been selected, these specific k-mers are stored in the characteristic target sequence set corresponding to each pathogen operation group.
  • The characteristic target database is established in advance, so the required data can be called directly when the pathogen operation groups in the sequencing data are detected and determined, which improves detection efficiency.
  • Before step 110, the method further includes the following steps:
  • Step 10: Obtain high-confidence genome data to obtain a complete set.
  • A high-confidence genome is a genome that meets a preset condition.
  • The source of high-confidence genomic data can be RefSeq (the reference sequence database of the National Center for Biotechnology Information, NCBI, which provides non-redundant, biologically meaningful gene and protein sequences) or other public or private high-confidence genome data sets.
  • The set of all collected high-confidence genomes is called the complete set.
  • The high-confidence genomes include both pathogen and non-pathogen genomes, for example, high-confidence genomes of symbiotic bacteria, probiotics, humans, animals, and plants.
  • The process of collecting high-confidence genomic data includes confirming the credibility of each genome and screening accordingly. That is, a genome is selected as a high-confidence genome according to the following conditions:
  • (1) The proportion of non-deterministic characters contained in a piece of genomic data. For example, for a DNA genome, the proportion of non-deterministic characters is the proportion of non-ACGT characters it contains. When this proportion is too high, the piece of data is a suspected low-confidence genome;
  • (2) The average genome-wide coverage. Whole-genome sequence alignment is performed against multiple closely related genomes (for example, genomes whose genetic distance is less than a certain threshold) to determine the average percentage of the genome covered by its similar genomes, and screening is then based on this percentage: genomes with too low an average coverage percentage are suspected of low completeness, i.e., low confidence. After the suspected low-confidence or low-confidence genomes are removed, the remaining genomes are the high-confidence genomes.
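  • The first screening condition (the proportion of non-deterministic characters) can be sketched as below. The cutoff `max_fraction` is a hypothetical parameter; the text gives no numeric threshold. The coverage-based condition requires whole-genome alignment and is not covered by this sketch.

```python
def non_deterministic_fraction(genome: str, alphabet: str = "ACGT") -> float:
    """Fraction of characters that are not deterministic bases."""
    if not genome:
        return 1.0  # treat an empty record as fully unreliable
    allowed = set(alphabet)
    bad = sum(1 for ch in genome.upper() if ch not in allowed)
    return bad / len(genome)


def screen_genomes(genomes: dict, max_fraction: float) -> dict:
    """Keep genomes whose non-deterministic fraction is within the cutoff."""
    return {
        name: seq
        for name, seq in genomes.items()
        if non_deterministic_fraction(seq) <= max_fraction
    }


# Illustrative records only; "N" marks an undetermined base.
candidates = {"g1": "ACGTACGTAC", "g2": "ACGTNNACGT"}
kept = screen_genomes(candidates, max_fraction=0.1)
```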
  • Step 12: Determine the pathogen operation groups contained in the complete set.
  • The set of pathogen operation groups corresponding to the high-confidence genomes in the complete set is determined.
  • High-confidence genomic data can be collected in advance to obtain the complete set and determine the pathogen operating groups included in it, which is convenient for subsequent direct use and improves efficiency.
  • The k-mers here include not only the specific k-mers of the pathogen operation group but all k-mers that have appeared in the genomes of the pathogen operation group. That is, the k-mer occurrence number record table records the k-mers contained in the pathogen operation group and the number of times each k-mer has occurred in the pathogen operation group. If a k-mer occurs x times in a genome, x is added to the counter for that k-mer in the k-mer occurrence number record table.
  • If the complete set contains M pathogen operation groups, M k-mer occurrence number record tables are established, one for each pathogen operation group.
  • The k-mer occurrence number record table corresponding to each pathogen operation group is stored in the target database.
  • A k-mer occurrence number record table for the complete set is also established: the k-mer occurrence number record tables corresponding to the individual pathogen operation groups are combined to obtain the k-mer occurrence number record table of the complete set, which is stored in the target database.
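  • The k-mer occurrence number record tables can be sketched with counters. The sketch assumes that every genome in the complete set is assigned to some group and that the complete-set table is the element-wise sum of the per-group tables; both points are assumptions, and `build_tables` is a hypothetical name.

```python
from collections import Counter


def kmer_occurrence_table(genomes, k):
    """Count every occurrence of every k-mer across a group's genomes."""
    table = Counter()
    for genome in genomes:
        for i in range(len(genome) - k + 1):
            # A k-mer occurring x times in a genome adds x to its counter.
            table[genome[i:i + k]] += 1
    return table


def build_tables(groups, k):
    """Return per-group tables and the summed complete-set table."""
    group_tables = {
        name: kmer_occurrence_table(genomes, k) for name, genomes in groups.items()
    }
    complete_table = Counter()
    for table in group_tables.values():
        complete_table.update(table)  # Counter.update adds counts
    return group_tables, complete_table


# Illustrative genomes only.
groups = {"group_A": ["ACGAC"], "group_B": ["ACGTT", "CGTTT"]}
group_tables, complete_table = build_tables(groups, k=3)
```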
  • After step 12, the method further includes the following steps:
  • Step 14: Obtain sequencing data of the sample, and calculate the total number CT of occurrences of the k-mers that appear in the sequencing data.
  • The server can obtain the sequencing data of the sample in advance.
  • Accurately determining the absolute concentration of the pathogen operation groups contained in a sample requires multiple steps such as sample preparation, sequencing, and signal processing, and the calculation is difficult, so the relative concentration of the pathogens contained in the sample can be estimated instead.
  • The pathogen operation groups included in the sample sequencing data can be determined at the same time.
  • CT denotes the total count of k-mer occurrences in the sequencing data.
  • The data used for this calculation are k-mers in general, not only specific k-mers, so the k-mers included in each pathogen operation group can be obtained.
  • The total number CT of occurrences of the k-mers that appear in the sequencing data can be obtained.
  • For example, if each piece of data to be tested contains n characters,
  • then each piece of data to be tested contains n - k + 1 k-mers.
  • The numbers of k-mers contained in the M pieces of data to be tested can be added together, giving M × (n - k + 1), which is the total number of k-mer occurrences CT in the sequencing data.
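  • The CT calculation can be sketched as follows. Unlike the equal-length example above, the hypothetical function `total_kmer_occurrences` sums n - k + 1 per piece of data, so it also handles pieces of different lengths; pieces shorter than k contribute zero.

```python
def total_kmer_occurrences(reads, k):
    """CT: total number of k-mer positions over all pieces of data."""
    return sum(max(len(read) - k + 1, 0) for read in reads)


# Illustrative data: M = 3 pieces, each with n = 6 characters, k = 3.
sample_reads = ["ACGTAC", "GGACGT", "TTTAAA"]
ct = total_kmer_occurrences(sample_reads, k=3)
```

  • For equal-length pieces this reduces to M × (n - k + 1), here 3 × 4.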
  • Step 16: Obtain the number of occurrences GCT in the sequencing data of the k-mers included in each pathogen operating group in the complete set.
  • That is, for each pathogen operation group in the complete set, the number of occurrences in the sequencing data of the k-mers it includes is obtained.
  • The GCT is an estimate.
  • The GCT can include two parts: the estimated total number of actual occurrences (ECT) of the k-mers of a pathogen operation group, and the confirmed total number of actual occurrences (CCT, confirmed counts of total occurrence) of the k-mers of that pathogen operation group. That is, the GCT can be obtained from the ECT and the CCT of each pathogen operation group in the complete set.
  • The number of occurrences GCT is the sum of the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT, in the sequencing data, of the k-mers included in a pathogen operation group in the complete set.
  • The GCT can also include only one part: when the k of the k-mers included in the pathogen operation groups in the complete set is greater than a target value, the GCT can be estimated entirely by the confirmed total CCT of the pathogen operation group. That is, when k is large enough, the GCT equals the CCT, and the ECT need not be calculated.
  • Step 18: The relative concentration in the sequencing data of each pathogen operating group contained in the complete set is calculated from the number of occurrences GCT and the total number of occurrences CT.
  • The ratio of the number of occurrences GCT in the sequencing data of the k-mers in each pathogen operation group to the total number of k-mer occurrences CT in the sequencing data is calculated; this ratio is the relative concentration in the sequencing data of each pathogen operating group included in the complete set.
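  • The quantities in steps 16 and 18 can be summarized in formula form. The subscript g (one pathogen operating group) and the symbol RC for relative concentration are notation introduced here for readability only; they do not appear in the application.

```latex
\[
\mathrm{GCT}_g = \mathrm{ECT}_g + \mathrm{CCT}_g ,
\qquad
\mathrm{RC}_g = \frac{\mathrm{GCT}_g}{\mathrm{CT}}
\]
```

  • When k is large enough, the text allows the ECT term to be dropped, so that GCT for the group equals its CCT.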
  • After step 108, as shown in FIG. 2C, the method further includes the following steps:
  • Step 108a: Obtain the relative concentration of each pathogen operating group included in the sequencing data from the relative concentrations in the sequencing data of the pathogen operating groups contained in the complete set.
  • At the same time as the server determines the pathogen operation groups included in the sample sequencing data, it looks up the corresponding pathogen operation groups in the complete set and, from the relative concentration in the sequencing data of each pathogen operation group in the complete set, obtains the relative concentration of each pathogen operation group included in the sequencing data.
  • Step 108b: Select each pathogen operation group contained in the sequencing data whose relative concentration is higher than a preset threshold as a confirmed pathogen operation group of the sequencing data.
  • The server selects the pathogen operation groups contained in the sequencing data whose relative concentration is higher than the preset threshold as the confirmed pathogen operation groups of the sequencing data. A confirmed pathogen operation group is a pathogen operation group whose presence in the sequencing data can be confirmed.
  • The preset threshold is a relative concentration value set manually in advance. When the relative concentration of a pathogen operating group in the sequencing data is higher than the preset threshold, the pathogen operating group is determined to be included in the sequencing data. When the relative concentration is not higher than the preset threshold, the pathogen operating group may not be included in the sequencing data.
  • The relative concentrations in the sequencing data of all pathogen operation groups included in the complete set are calculated, so that once the pathogen operation groups included in the sequencing data are determined, the relative concentration of each of them can be obtained. The pathogen operation groups whose relative concentration is higher than the preset threshold are then selected as the pathogen operation groups included in the sequencing data, which further improves the accuracy of the detected pathogen operation groups.
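  • Steps 108a and 108b can be sketched as below. The inputs (a dictionary of GCT values per group, the total CT, and the set of groups detected in step 108) are assumed to have been computed already, and the function names and the numeric threshold are hypothetical.

```python
def relative_concentrations(gct_by_group, ct):
    """Relative concentration of each group: GCT divided by CT."""
    if ct <= 0:
        raise ValueError("CT must be positive")
    return {group: gct / ct for group, gct in gct_by_group.items()}


def confirm_groups(detected, concentrations, threshold):
    """Keep detected groups whose relative concentration exceeds the threshold."""
    return {g for g in detected if concentrations.get(g, 0.0) > threshold}


# Illustrative values only.
gct_by_group = {"group_A": 30, "group_B": 2, "group_C": 50}
concentrations = relative_concentrations(gct_by_group, ct=100)
confirmed = confirm_groups({"group_A", "group_B"}, concentrations, threshold=0.05)
```

  • In this example group_C has the highest relative concentration but is not confirmed, because it was not among the groups detected in step 108.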
  • Step 16 includes the following steps:
  • Step 16a: Obtain each piece of first marker data in the sequencing data, calculate the number of k-mers contained in each piece of first marker data, and calculate the confirmed total number of actual occurrences CCT from these numbers.
  • First marker data refers to a piece of data to be tested that contains specific k-mers of only one pathogen operation group and is therefore marked as coming from that pathogen operation group. After every piece of data to be tested that contains specific k-mers of only one pathogen operation group has been marked as coming from that group, the pieces of first marker data are obtained.
  • The server obtains each piece of first marker data in the sequencing data and calculates the number of k-mers in each piece of data to be tested that is marked as coming from a single pathogen operation group. The numbers of k-mers in the pieces marked as coming from the same pathogen operating group are then summed to obtain the confirmed total number of actual k-mer occurrences CCT of each pathogen operating group.
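  • Step 16a can be sketched as follows. The mapping `kmer_to_group` from specific k-mers to groups is assumed to be available from the target database, and the function names are hypothetical. Pieces that match no group or more than one group are returned separately; the text treats these as second marker data in step 16c.

```python
def mark_reads(reads, kmer_to_group, k):
    """Split reads into first marker data (one group) and the rest."""
    first = {}   # group -> reads marked as coming from that group
    others = []  # reads with no specific k-mer or with several groups
    for read in reads:
        groups = set()
        for i in range(len(read) - k + 1):
            group = kmer_to_group.get(read[i:i + k])
            if group is not None:
                groups.add(group)
        if len(groups) == 1:
            first.setdefault(groups.pop(), []).append(read)
        else:
            others.append(read)
    return first, others


def confirmed_totals(first, k):
    """CCT per group: all k-mers in the reads marked with that group."""
    return {
        group: sum(len(read) - k + 1 for read in group_reads)
        for group, group_reads in first.items()
    }


# Illustrative data only.
kmer_to_group = {"ACG": "group_A", "TTT": "group_B"}
reads = ["ACGTAC", "GGACGA", "TTTACG", "GGGGGG"]
first, others = mark_reads(reads, kmer_to_group, k=3)
cct = confirmed_totals(first, k=3)
```

  • Note that the CCT counts every k-mer of a marked read, not only its specific k-mers.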
  • Step 16b: Obtain the complete-set proportion of each k-mer included in each pathogen operation group in the complete set.
  • The complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set.
  • The characteristic target sequence set corresponding to each pathogen operation group also includes a table of the complete-set proportions of the k-mer occurrences of that pathogen operation group.
  • The complete-set proportion table includes the number of occurrences CG of each k-mer in the corresponding pathogen operation group, the number of occurrences CB of each k-mer in the complete set in the target database, and the complete-set proportion F of each k-mer. The complete-set proportion of each k-mer can therefore be obtained from this table.
  • The number of occurrences CG of each k-mer in the corresponding pathogen operation group can be obtained from the k-mer occurrence number record table of the pathogen operation group stored in the target database in advance.
  • The number of occurrences CB of each k-mer in the complete set can be obtained from the k-mer occurrence number record table of the complete set stored in the target database in advance.
  • Step 16c: Obtain each second marker data in the sequencing data, and obtain the actual number of occurrences, in the second marker data, of each k-mer included in the pathogen operation groups contained in the complete set.
  • The second marker data refers to test data that either contains no specific k-mer of any pathogen operation group or contains specific k-mers belonging to multiple pathogen operation groups; such test data is marked as not belonging to any pathogen operation group.
  • The server obtains each second marker data in the sequencing data. For each second marker data, it looks up all the k-mers contained in that data in the k-mer occurrence record tables of the pathogen operation groups, stored in the target database in advance, to find the pathogen operation groups to which those k-mers correspond, and obtains the actual number of occurrences in the second marker data of each k-mer included in those pathogen operation groups.
  • Step 16d: Calculate the estimated actual number of occurrences of each k-mer from its complete-set proportion and its actual number of occurrences.
  • Once the server has obtained the complete-set proportions from the target database and the actual numbers of occurrences, it multiplies the complete-set proportion of each k-mer included in a pathogen operation group by the actual number of occurrences of that k-mer in the second marker data. The product is the estimated actual number of occurrences, in the second marker data, of that k-mer for that pathogen operation group.
  • Step 16e: Calculate the estimated total number of actual occurrences ECT from the estimated actual number of occurrences of each k-mer.
  • The estimated actual numbers of occurrences of the k-mers included in each pathogen operation group are summed to obtain the estimated total number of actual occurrences ECT of the k-mers of that pathogen operation group.
  • Step 16f: Calculate, from the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT, the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set.
  • The sum of ECT and CCT is the number of occurrences GCT in the sequencing data of the k-mers included in a pathogen operation group, that is, GCT = CCT + ECT.
  • Obtaining the estimated total number of actual occurrences ECT makes it possible to estimate the number of occurrences GCT in the sequencing data of the k-mers included in a pathogen operation group, from which the relative concentration can be calculated.
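  • Steps 16b to 16f amount to a weighted count followed by a sum. The sketch below illustrates them for a single pathogen operation group, assuming that the complete-set proportion is F = CG / CB and that the second marker data is a list of read strings; the function names are hypothetical.

```python
from collections import Counter


def complete_set_proportion(cg, cb):
    """F = CG / CB for every k-mer of one pathogen operation group.

    cg: k-mer -> occurrences in the group; cb: k-mer -> occurrences in the
    complete set (assumed to contain every key of cg with a non-zero count).
    """
    return {km: cg[km] / cb[km] for km in cg}


def estimated_total(second_reads, proportion, k):
    """ECT: sum over the group's k-mers of F times the actual count
    of that k-mer in the second marker data."""
    counts = Counter(
        read[i:i + k]
        for read in second_reads
        for i in range(len(read) - k + 1)
    )
    # Counter returns 0 for k-mers that never occur in the second marker data.
    return sum(f * counts[km] for km, f in proportion.items())


def group_occurrences(cct, ect):
    """GCT = CCT + ECT."""
    return cct + ect
```

  • For example, with CG = {ACG: 2, CGT: 1}, CB = {ACG: 4, CGT: 1}, second marker reads ACGT and ACGA, and k = 3, ECT is 0.5 × 2 + 1.0 × 1 = 2.0; a CCT of 3 then gives a GCT of 5.0.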
  • Step 16 includes the following steps:
  • Step 16A: Obtain each first marker data in the sequencing data, count the k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT from these counts.
  • The first marker data refers to those items of test data that contain specific k-mers of only one pathogen operation group; each such item is marked as originating from that pathogen operation group. Marking every item of test data that contains specific k-mers of only one pathogen operation group in this way yields the first marker data.
  • The server obtains each first marker data in the sequencing data and counts the k-mers contained in each item of test data marked as originating from a single pathogen operation group. Summing these counts over the items marked as originating from the same pathogen operation group gives the confirmed total number of actual occurrences CCT of the k-mers of each pathogen operation group.
  • Step 16B: When the length of the k-mers included in the pathogen operation groups contained in the complete set is greater than a target value, take the confirmed total number of actual occurrences CCT as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation group.
  • The target value is determined experimentally; for example, k may be greater than 23 or greater than 27.
  • In that case, the calculated confirmed total number of actual occurrences CCT can be used directly as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation group. The estimated total number of actual occurrences ECT therefore does not need to be calculated, which makes calculating GCT more efficient.
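  • The shortcut of step 16B can be pictured as a guard in front of the ECT estimate. In this minimal sketch the estimate is passed in as a callable so that it is only evaluated when needed; the name occurrences_gct and its parameters are assumptions made for illustration.

```python
def occurrences_gct(cct, k, target_k, estimate_ect):
    """Return GCT, skipping the ECT estimate when k exceeds the target value.

    estimate_ect is a zero-argument callable that computes ECT on demand.
    """
    if k > target_k:
        # Long k-mers: CCT is used directly as GCT.
        return cct
    return cct + estimate_ect()
```

  • For instance, with a target value of 27, a call with k = 31 returns CCT unchanged and never invokes the estimator, whereas a call with k = 21 adds the estimated ECT.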
  • A k-mer among the specific k-mers satisfies the following two conditions: (1) its number of occurrences in the genome occurrence index table of the pathogen operation group satisfies a first preset error condition; and (2) its number of occurrences in the genome occurrence index table of the pathogen operation group and its number of occurrences in the genome occurrence index table of the complete set together satisfy a second preset error condition. The genome occurrence index table of each pathogen operation group records, for each k-mer, the number of genomes in that pathogen operation group that contain the k-mer; the genome occurrence index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain the k-mer.
  • Each pathogen operation group has its own characteristic target sequence set, and the specific k-mers included in that set are k-mers that meet preset specific conditions.
  • The preset specific conditions comprise the first preset error condition and the second preset error condition. A k-mer that satisfies both conditions at the same time is considered to meet the preset specific conditions and is taken as a specific k-mer.
  • That is, the number of occurrences of the k-mer in the genome occurrence index table of the pathogen operation group must satisfy the first preset error condition, and that number together with the number of occurrences of the k-mer in the genome occurrence index table of the complete set must satisfy the second preset error condition.
  • The count recorded for each k-mer in the genome occurrence index table of the complete set represents the number of genomes in the complete set in which the k-mer appears. A k-mer that appears multiple times in the same genome is counted only once.
  • Likewise, the genome occurrence index table of a pathogen operation group records, for each k-mer, the number of genomes in that pathogen operation group that contain the k-mer, and the genome occurrence index table of the complete set records the number of genomes in the complete set that contain the k-mer.
  • The selection of specific k-mers in this embodiment introduces two parameters, the first preset error condition and the second preset error condition, and thereby allows a specific k-mer to be non-specific within a certain range. Without these two parameters, no non-specificity could be tolerated, and it would often be difficult to find any specific k-mer for a pathogen operation group. By selecting specific k-mers that tolerate a certain error and building the characteristic target sequence set from them, specific targets that represent the pathogen operation group with high probability can be found. When determining the pathogen operation groups contained in the sequencing data, it is then only necessary to compare against the predetermined characteristic target sequence set of each pathogen operation group, which reduces the comparison space, shortens the analysis time, and improves detection efficiency.
  • The first preset error condition is: the sum of the first threshold and the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of genomes contained in the pathogen operation group is greater than or equal to 1.
  • That is, the ratio of the number of occurrences recorded in the genome occurrence index table of the pathogen operation group to the number of genomes contained in that group, plus the first threshold, must be greater than or equal to 1.
  • Assume that the pathogen operation group contains N genomes, the number of occurrences of a k-mer in the genome occurrence index table of the pathogen operation group is C1, and the first threshold is P1; the first preset error condition is then C1 / N + P1 ≥ 1. The first threshold P1 represents an acceptable error probability and can be any value between 0 and 1.
  • the first threshold value can be set by a technician according to the actual project.
  • The first threshold is less than 5%. In one of these embodiments, the first threshold may instead be less than or equal to 90%: the first threshold can be set manually after test verification, and testing has shown that in some cases it can take a value less than or equal to 90%.
  • the first threshold value refers to an acceptable error probability.
  • the first threshold value can be any value between 0 and 1. In this embodiment, the first threshold value can be set to a value less than 5%.
  • The second preset error condition is: the sum of the second threshold and the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set is greater than or equal to 1.
  • That is, the ratio of the number of occurrences recorded in the genome occurrence index table of the pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set, plus the second threshold, must be greater than or equal to 1. Assume that the number of occurrences of a k-mer in the genome occurrence index table of the pathogen operation group is C1, the number of occurrences of the k-mer in the genome occurrence index table of the complete set is C2, and the second threshold is P2; the second preset error condition is then C1 / C2 + P2 ≥ 1.
  • the second threshold value is the same as the above-mentioned first threshold value, which represents an acceptable error probability, and can be any value between 0 and 1.
  • the second threshold value P2 can also be set by a technician according to the actual project.
  • the second threshold is less than 5%.
  • the second threshold value is the same as the first threshold value, which refers to an acceptable error probability.
  • The second threshold can also be any value between 0 and 1. In this embodiment, the second threshold can be set to a value less than 5%.
  • the first threshold and the second threshold may be equal or different.
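  • The two preset error conditions can be checked together in a few lines. The sketch below assumes the conditions C1 / N + P1 ≥ 1 and C1 / C2 + P2 ≥ 1; the function name is hypothetical and the default thresholds of 0.04 are example values only, chosen to be below 5%.

```python
def is_specific_kmer(c1, n, c2, p1=0.04, p2=0.04):
    """Return True when a k-mer meets both preset error conditions.

    c1: number of genomes of the pathogen operation group containing the k-mer
    n:  number of genomes in the pathogen operation group
    c2: number of genomes of the complete set containing the k-mer
    p1, p2: first and second thresholds (example values, not from the source)
    """
    first_condition = c1 / n + p1 >= 1    # present in almost every group genome
    second_condition = c1 / c2 + p2 >= 1  # almost absent outside the group
    return first_condition and second_condition
```

  • A k-mer found in 98 of 100 genomes of the group and in 99 genomes of the complete set passes both conditions; one found in only 90 of the 100 group genomes fails the first, and one found in 150 genomes of the complete set fails the second.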
  • Before obtaining the sequencing data of the sample, the method further includes: generating the genome occurrence index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer.
  • The genome occurrence index table of the complete set is stored in the target database.
  • The target database also stores the characteristic target sequence set corresponding to each pathogen operation group.
  • The complete set contains all the high-confidence genomes collected; that is, it contains both the high-confidence genomes of multiple pathogen operation groups and the high-confidence genomes of multiple non-pathogen operation groups.
  • A genome occurrence index table of the complete set can be generated.
  • The genome occurrence index table of the complete set records, for each k-mer contained in each pathogen operation group, the number of genomes in the complete set in which that k-mer appears.
  • In other words, the table records how many genomes of the complete set contain each k-mer.
  • What is counted is the number of genomes, not the number of k-mer occurrences: a k-mer that occurs more than once in the same genome is still counted only once in the genome occurrence index table of the complete set.
  • an index table of the number of occurrences of the genome for the complete set can be established.
  • The genome occurrence index table of the complete set differs from the genome occurrence index table of each pathogen operation group.
  • A genome occurrence index table of a pathogen operation group corresponds to that group, and each pathogen operation group has its own such table, whereas only one genome occurrence index table of the complete set is generated, covering all the data. Once the generated genome occurrence index table of the complete set has been stored, it can be retrieved from the database whenever it is needed while detecting sequencing data, which improves detection efficiency.
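  • The once-per-genome counting rule can be illustrated as follows. This is a minimal sketch that assumes each genome is available as a single sequence string; the function name is hypothetical, and the same routine could be run on the genomes of one pathogen operation group or on the whole complete set.

```python
from collections import Counter


def genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it.

    A k-mer is counted at most once per genome, however often it
    occurs inside that genome.
    """
    index = Counter()
    for seq in genomes:
        # The set removes repeated occurrences within one genome.
        distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        index.update(distinct)
    return index
```

  • For the genomes ACGACG and ACGT with k = 3, the k-mer ACG receives a count of 2 even though it occurs twice in the first genome alone.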
  • the above step 108 includes:
  • Step 302: Obtain the actual number of occurrences in the sequencing data of each specific k-mer included in each pathogen operation group.
  • Step 304: Select the pathogen operation groups corresponding to the specific k-mers whose actual number of occurrences in the sequencing data exceeds a first preset count threshold, to obtain the suspected pathogen operation groups.
  • The first preset count threshold serves as the selection criterion: the specific k-mers whose actual number of occurrences exceeds it are selected.
  • The pathogen operation groups corresponding to the selected specific k-mers are taken as the suspected pathogen operation groups.
  • If the actual number of occurrences of a specific k-mer does not exceed the first preset count threshold, the pathogen operation group corresponding to that specific k-mer has a low probability of being present. Such low-probability pathogen operation groups can be excluded, and the remaining groups are the suspected pathogen operation groups.
  • Step 306: Select, from the suspected pathogen operation groups, the pathogen operation groups that meet a preset condition as the pathogen operation groups included in the sequencing data.
  • The first preset count threshold filters out the pathogen operation groups with a low probability of being present, leaving the suspected pathogen operation groups. From the suspected pathogen operation groups, those that meet the preset condition are then selected as the pathogen operation groups included in the sequencing data.
  • The preset condition is a selection condition set by a technician and can be adjusted according to the requirements of the actual project.
  • In this way, the pathogen operation groups included in the sequencing data can be selected, and the pathogen operation group to which the sequencing data belongs can be determined with high probability, which improves the accuracy of detection.
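  • The first filtering stage (steps 302 and 304) might look like the sketch below. It rests on one interpretation that the text leaves open: a pathogen operation group is treated as suspected as soon as at least one of its specific k-mers exceeds the first preset count threshold. The function name and input layout are hypothetical.

```python
def suspected_groups(actual_counts, first_threshold):
    """Return the groups having at least one specific k-mer whose actual
    number of occurrences exceeds first_threshold.

    actual_counts: group name -> {specific k-mer: actual occurrences}
    """
    return [
        group
        for group, counts in actual_counts.items()
        if any(c > first_threshold for c in counts.values())
    ]
```

  • The groups returned here would then be passed to the preset condition of step 306 for final selection.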
  • Before step 302, the method further includes: obtaining the actual number of occurrences in the sequencing data of the specific k-mers included in the pathogen operation group; and generating, from these actual numbers of occurrences, the specific k-mer actual occurrence record table corresponding to the pathogen operation group.
  • The target database stores the specific k-mers included in each pathogen operation group. After the sequencing data of the sample has been obtained, it can be compared against each specific k-mer of each pathogen operation group to obtain the actual number of occurrences of each specific k-mer in the sequencing data. From these numbers, a specific k-mer actual occurrence record table can be generated for each pathogen operation group.
  • If there are M pathogen operation groups, M corresponding specific k-mer actual occurrence record tables are generated, each recording the actual number of occurrences in the sequencing data of the specific k-mers included in one pathogen operation group.
  • In the specific k-mer actual occurrence record table, the leftmost column records the specific k-mers included in pathogen operation group X, and the second column records the actual number of times each corresponding specific k-mer appears in the sequencing data, denoted C1, C2, ....
  • A further column recording the occurrence ratio of each specific k-mer may be added, with the ratios recorded as F1, F2, ....
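  • Building one record table per pathogen operation group can be sketched as below, assuming the sequencing data is a list of read strings and that every k-mer of every read is counted. Only the two columns described first (specific k-mer and actual number of occurrences) are produced; the function name is hypothetical.

```python
from collections import Counter


def actual_occurrence_tables(reads, specific_kmers, k):
    """Return, for each group, a table {specific k-mer: actual occurrences}.

    specific_kmers maps a group name to the set of its specific k-mers.
    """
    # Count every k-mer of every read once, then look up the specific ones.
    seen = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {
        group: {km: seen[km] for km in sorted(kms)}
        for group, kms in specific_kmers.items()
    }
```

  • With M pathogen operation groups in specific_kmers, the returned dictionary holds M tables; specific k-mers that never occur in the reads are kept with a count of 0.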
  • the above step 306 includes:
  • Step 502: Take each specific k-mer whose actual number of occurrences is greater than a preset actual-occurrence threshold as a confirmed specific k-mer, that is, a specific k-mer confirmed to appear.
  • Each pathogen operation group has a corresponding specific k-mer actual occurrence record table, from which the actual number of occurrences in the sequencing data of each specific k-mer in the group can be obtained.
  • A specific k-mer whose actual number of occurrences is greater than the preset actual-occurrence threshold can then be selected as a confirmed specific k-mer.
  • The preset actual-occurrence threshold is a condition value set in advance by a technician.
  • It is generally a natural number greater than 5.
  • Step 504: Calculate the false positive probability of each pathogen operation group from the number of confirmed specific k-mers corresponding to that group.
  • The number of confirmed specific k-mers in each pathogen operation group can be obtained by grouping the confirmed specific k-mers by pathogen operation group.
  • For example, suppose pathogen operation group A1 has 1000 specific k-mers, of which only 400 appear in the sequencing data more often than the preset actual-occurrence threshold. The number of confirmed specific k-mers for pathogen operation group A1 is then 400.
  • From the number of confirmed specific k-mers corresponding to each pathogen operation group, the false positive probability of each pathogen operation group can be calculated.
  • The false positive rate, also known as the misdiagnosis rate or Type I error, is the percentage of subjects who are actually disease-free but are judged to be sick on the basis of screening. It can also be understood as the probability of obtaining a positive result when a positive result should not have been detected. The false positive probability can be calculated as follows: if n confirmed specific k-mers are found for a pathogen operation group, the false positive probability for that group is less than or equal to p^n.
  • p is a preset error threshold, which can be 1 minus the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set.
  • p can be any value between 0 and 1 and is usually less than 5%. When n is large enough, the calculated false positive probability is small.
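  • Because p^n shrinks very quickly, it is convenient to work with its logarithm. The sketch below is an illustration only: it assumes p is one minus the ratio of the two genome counts, requires 0 < p < 1, and uses hypothetical function names.

```python
import math


def error_threshold(c1, c2):
    """p = 1 - C1 / C2, where C1 and C2 are the genome counts of a k-mer in
    the pathogen operation group and in the complete set (assumed reading)."""
    return 1 - c1 / c2


def log10_false_positive_bound(p, n):
    """log10 of the upper bound p**n on the false positive probability.

    Working in log space avoids floating-point underflow for large n.
    """
    return n * math.log10(p)
```

  • With p = 0.05, three confirmed specific k-mers already bound the false positive probability by 0.05^3 = 1.25 × 10^-4, and 400 of them push the bound below 10^-520, a value that a plain floating-point power would round to zero.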
  • Step 506: Select the pathogen operation groups whose false positive probability is lower than a preset standard probability as the pathogen operation groups included in the sequencing data.
  • A pathogen operation group with a false positive probability lower than the preset standard probability can thus be selected as a pathogen operation group included in the sequencing data.
  • The preset standard probability is a condition value set in advance by a technician as the criterion for selecting pathogen operation groups by false positive probability. A high false positive probability for a pathogen operation group means that the conclusion that the sequencing data actually contains that group has a high probability of being wrong. Such a pathogen operation group can be excluded, and the pathogen operation groups whose false positive probability is lower than the preset standard probability are selected as the pathogen operation groups included in the sequencing data.
  • the above step 506 includes:
  • Step 602: Obtain the actual number of occurrences in the sequencing data of each confirmed specific k-mer.
  • Each pathogen operation group has a corresponding specific k-mer actual occurrence record table, from which the actual number of occurrences of each specific k-mer in the sequencing data can be obtained. A confirmed specific k-mer is a specific k-mer whose actual number of occurrences is greater than the preset actual-occurrence threshold, so the confirmed specific k-mers are a subset of the specific k-mers, and their actual numbers of occurrences can likewise be obtained from the specific k-mer actual occurrence record table.
  • Step 604: Calculate the actual occurrence proportion of each confirmed specific k-mer, namely the ratio of its actual number of occurrences to the sum of the actual numbers of occurrences of all confirmed specific k-mers.
  • Step 606: From the pathogen operation groups whose false positive probability is lower than the preset standard probability, select the pathogen operation groups whose confirmed specific k-mers have actual occurrence proportions that match their expected occurrence proportions, as the pathogen operation groups included in the sequencing data.
  • If the actual number of occurrences of a confirmed specific k-mer is A and the sum of the actual numbers of occurrences of all confirmed specific k-mers is S, the actual occurrence proportion of that confirmed specific k-mer is A / S. The actual occurrence proportion of every confirmed specific k-mer can be calculated in this way.
  • Each pathogen operation group can have a corresponding specific k-mer occurrence ratio table, which records the number of occurrences of each specific k-mer of the group across all genomes of that pathogen operation group. If a specific k-mer appears C1 times in a genome, it is counted C1 times; that is, the table records the number of occurrences of each specific k-mer in the genomes. Each specific k-mer occurrence ratio table also records the occurrence ratio of each specific k-mer, which can be calculated once the number of occurrences of each specific k-mer in the genomes is known.
  • If a specific k-mer appears C1 times and the specific k-mers of the pathogen operation group appear C times in total, the occurrence ratio of that specific k-mer is C1 / C.
  • The occurrence ratio of each specific k-mer can be read from the specific k-mer occurrence ratio table and is taken as its expected occurrence proportion; the expected occurrence proportion of each confirmed specific k-mer can thus be obtained from the table. It follows that the expected occurrence proportions of all specific k-mers included in a pathogen operation group sum to 1.
  • From the specific k-mer actual occurrence record table, the actual number of occurrences of each confirmed specific k-mer can be obtained, and from it the actual occurrence proportion of each confirmed specific k-mer can be calculated.
  • From the specific k-mer occurrence ratio table, the expected occurrence proportion of each confirmed specific k-mer can be obtained.
  • The false positive probability of each pathogen operation group can be calculated from the number of confirmed specific k-mers it contains. Therefore, from the pathogen operation groups whose false positive probability is lower than the preset standard probability, those whose confirmed specific k-mers have actual occurrence proportions matching the expected occurrence proportions can be selected.
  • The finally selected pathogen operation group must meet two conditions: (1) its false positive probability is lower than the preset standard probability; and (2) the actual occurrence proportions of its confirmed specific k-mers are consistent with the expected occurrence proportions. A pathogen operation group that satisfies both conditions is selected as a pathogen operation group included in the sequencing data, which further ensures the accuracy of detecting the pathogen operation groups contained in the sequencing data.
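  • The text does not state how closely the actual and expected proportions must agree, so the following sketch makes two explicit assumptions: the expected proportions are renormalized over the confirmed specific k-mers only, and a match means that every absolute difference stays within a tolerance. Both the tolerance value and the function names are hypothetical.

```python
def actual_proportions(confirmed_counts):
    """A / S for each confirmed specific k-mer, where S is the sum of the
    actual occurrences of all confirmed specific k-mers."""
    total = sum(confirmed_counts.values())
    return {km: a / total for km, a in confirmed_counts.items()}


def proportions_match(confirmed_counts, expected, tolerance=0.1):
    """Compare actual proportions with expected ones (assumed criterion).

    expected maps every specific k-mer of the group to its expected
    occurrence proportion; it is rescaled to the confirmed subset so that
    both sides sum to 1 before they are compared.
    """
    actual = actual_proportions(confirmed_counts)
    scale = sum(expected[km] for km in actual)
    return all(
        abs(actual[km] - expected[km] / scale) <= tolerance
        for km in actual
    )
```

  • A pathogen operation group would then be reported only if its false positive bound is below the preset standard probability and proportions_match returns True for its confirmed specific k-mers.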
  • the above step 504 includes:
  • Step 702: When the confirmed specific k-mers include specific k-mers that belong to the same non-overlapping specific region, regard the specific k-mers belonging to the same non-overlapping specific region as the same specific k-mer. The specific k-mers within a non-overlapping specific region are specific k-mers in which the number of overlapping characters between any two meets a preset coincidence threshold.
  • Step 704: After the number of confirmed specific k-mers has been determined according to the non-overlapping specific regions, calculate the false positive probability of each pathogen operation group from that number.
  • Each pathogen operation group may have a corresponding specific k-mer actual occurrence record table, which records the actual number of occurrences of each specific k-mer included in the group. The specific k-mers whose actual number of occurrences is greater than the preset occurrence threshold can therefore be selected from the table.
  • The preset occurrence threshold is set in advance by a technician to select the required specific k-mers. Each pathogen operation group has corresponding non-overlapping specific regions, and a pathogen operation group may have several of them.
  • Each non-overlapping specific region includes specific k-mers, and at least one specific k-mer in the region satisfies the terminal overlap condition.
  • The terminal overlap condition means that the number of characters by which two specific k-mers overlap meets a preset coincidence threshold. That is, in each non-overlapping specific region, at least one specific k-mer must overlap another specific k-mer in the region by a number of characters that meets the preset coincidence threshold: if two specific k-mers in a non-overlapping specific region overlap by no fewer than j characters, they are considered to be terminally overlapping.
  • Terminal overlap means that the last j characters of one specific k-mer are exactly the same as the first j characters of another specific k-mer.
  • j is the acceptable overlap length.
  • The size of j can be set according to the requirements of the actual project. Generally, j can be set to a value greater than 5 and less than or equal to k-1; that is, the value range of j can be 5 < j ≤ k-1.
  • Each pathogen operation group has its own specific k-mer actual occurrence record table, so an independence correction can be applied to the specific k-mers in each of these tables.
  • After the independence correction, the corresponding non-overlapping specific regions are obtained, and a non-overlapping specific region table can be generated for each pathogen operation group from its non-overlapping specific regions.
  • Applying the independence correction to the specific k-mers of a pathogen operation group is equivalent to grouping the terminally overlapping specific k-mers into the same non-overlapping specific region.
  • The specific k-mers in a non-overlapping specific region can also be spliced; that is, multiple specific k-mers in the same non-overlapping specific region can be joined into a longer sequence. Because the sequence is spliced from multiple specific k-mers, its length is greater than k, and its length is not fixed but depends on how the k-mers are spliced. Splicing saves memory.
  • Alternatively, no splicing is performed, and the specific k-mers that meet the terminal overlap condition are simply placed in the same non-overlapping specific region.
  • The actual number of occurrences of each specific k-mer in the sequencing data is recorded in the specific k-mer actual occurrence record table, so the actual number of occurrences of each specific k-mer contained in each non-overlapping specific region can be obtained from that table.
  • When the false positive probability is calculated, it is therefore calculated for each pathogen operation group from the number of specific k-mers in the non-overlapping specific regions obtained after the independence correction, rather than directly from the specific k-mers whose actual number of occurrences is greater than the preset actual-occurrence threshold.
  • This better guarantees the accuracy of the false positive calculation for each pathogen operation group, and hence the accuracy with which the selected pathogen operation groups are those actually included in the sequencing data.
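  • Counting confirmed specific k-mers per non-overlapping specific region, rather than individually, can be sketched with a small union-find structure. The sketch assumes that two k-mers belong to the same region whenever a suffix of one of at least j characters equals a prefix of the other, and that membership is transitive; the function names are hypothetical.

```python
def terminal_overlap(a, b, j):
    """Length of the longest suffix of a that equals a prefix of b,
    provided it is at least j characters long; otherwise 0."""
    for m in range(min(len(a), len(b)) - 1, j - 1, -1):
        if a[-m:] == b[:m]:
            return m
    return 0


def count_regions(kmers, j):
    """Number of non-overlapping specific regions among the given k-mers.

    k-mers linked by terminal overlap are merged into one region, so the
    result is the n to use in the false positive bound p**n.
    """
    parent = list(range(len(kmers)))

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(kmers)):
        for b in range(len(kmers)):
            if a != b and terminal_overlap(kmers[a], kmers[b], j):
                parent[find(a)] = find(b)
    return len({find(i) for i in range(len(kmers))})
```

  • The pairwise loop is quadratic in the number of k-mers, which is acceptable for an illustration but would likely need an index on prefixes for large tables.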
  • Two specific k-mers in a non-overlapping specific region that meet the preset terminal overlap condition are spliced to obtain a new non-overlapping specific sequence.
  • The independence correction refers to processing two specific k-mers that exhibit terminal overlap as follows.
  • the set of characteristic target sequences corresponding to each pathogen operating group stored in the target data includes a specific k-mer table corresponding to each pathogen operating group, and the specific k-mer table contains a table that satisfies Set the specific k-mer for specific conditions.
  • the specific k-mer table can be copied to obtain a replication specific k-mer table.
  • Select two specific k-mers in sequence from the replication specific k-mer table and perform sequence detection on the two specific k-mers. When it is detected that two specific k-mers meet the preset terminal overlap conditions At this time, the two specific k-mers are considered to be terminally overlapping.
  • the two specific k-mers that are terminally overlapping can be spliced, and the resulting splicing sequence is a new non-overlapping specific sequence, that is, the two specific k-mers are independently modified.
  • The specific k-mers after independence correction constitute the non-overlapping specific regions. A non-overlapping specific region table corresponding to each pathogen operation group can be generated from the non-overlapping specific regions contained in that pathogen operation group, and the table can be stored in the target database, in the characteristic target sequence set corresponding to the pathogen operation group, as a data backup. Once stored, the data can be retrieved from the database when needed, which improves detection efficiency.
  • When two specific k-mers are detected to be terminally overlapping, they can be spliced to obtain a spliced sequence; that is, the smallest region that covers both specific k-mers is taken in place of the two specific k-mers.
  • For example, A and B are two specific k-mers with overlapping ends:
  • A is ACGGTCATC
  • B is TCATCCGA
  • The overlapping portion of A and B is TCATC. The two can be spliced to obtain the smallest region covering both A and B, namely ACGG-TCATC-CGA, so the two specific k-mers A and B can be replaced with the sequence ACGGTCATCCGA.
  • After two terminally overlapping specific k-mers have been spliced, sequence detection can also be performed between the remaining specific k-mers in the non-overlapping specific region and the new non-overlapping specific sequences, and between pairs of new non-overlapping specific sequences, until no two sequences in the non-overlapping specific region meet the terminal overlap condition. Correcting the specific k-mers by splicing saves memory space. Correcting the independence of the specific k-mers also better guarantees the accuracy of the false positive calculation for each pathogen operation group, which further improves the accuracy with which the selected pathogen operation groups are the ones actually contained in the sequencing data.
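  • The splicing procedure described above can be sketched as follows. This is only an illustrative sketch: the text does not define the preset terminal overlap condition, so the sketch assumes it means "a suffix of one sequence equals a prefix of the other and is at least `min_overlap` characters long". The function names and the `min_overlap` parameter are hypothetical.

```python
def terminal_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when that overlap is shorter than `min_overlap`
    (assumed form of the preset terminal overlap condition).
    """
    for size in range(min(len(a), len(b)), max(min_overlap, 1) - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def merge_overlapping(sequences, min_overlap=3):
    """Splice terminally overlapping sequences until no pair overlaps."""
    regions = list(dict.fromkeys(sequences))  # drop duplicates, keep order
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(len(regions)):
                if i == j:
                    continue
                size = terminal_overlap(regions[i], regions[j], min_overlap)
                if size:
                    # Smallest region covering both sequences.
                    spliced = regions[i] + regions[j][size:]
                    regions = [r for n, r in enumerate(regions) if n not in (i, j)]
                    regions.append(spliced)
                    merged = True
                    break
            if merged:
                break
    return regions


print(merge_overlapping(["ACGGTCATC", "TCATCCGA"]))
```

  • With the sequences A and B from the example, the sketch returns the single spliced sequence ACGGTCATCCGA. Each splice reduces the number of regions by one, so the loop always terminates.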
  • the method before obtaining the sequencing data of the sample, the method further includes the following steps:
  • Step 802 Obtain the number of occurrences of each specific k-mer included in the pathogen operation group in the genome included in the pathogen operation group.
  • For each pathogen operation group, count the number of occurrences of each specific k-mer included in the pathogen operation group in the genomes included in that pathogen operation group. If a specific k-mer appears N times in the same genome, it is counted N times; the result is the number of occurrences of each specific k-mer in the pathogen operation group.
  • Step 804 Calculate the occurrence ratio of each specific k-mer as the ratio of its number of occurrences to the total number of occurrences of the specific k-mers included in the pathogen operation group.
  • The total number of occurrences of the specific k-mers included in the pathogen operation group can then be obtained.
  • For example, if the pathogen operation group includes 500 specific k-mers whose numbers of occurrences in the genomes are $C_1, C_2, \ldots, C_{500}$,
  • the total number of occurrences is $S = C_1 + C_2 + \cdots + C_{500}$,
  • and the occurrence ratio of each specific k-mer can be calculated as $C_1/S, C_2/S, \ldots, C_{500}/S$.
  • Step 806 Generate a specific k-mer appearance ratio table corresponding to the pathogen operation group according to the number of occurrences and the occurrence ratio corresponding to each specific k-mer included in the pathogen operation group.
  • Step 808 Store the specific k-mer appearance ratio table in the characteristic target sequence set corresponding to the pathogen operation group.
  • A specific k-mer appearance ratio table corresponding to each pathogen operation group can be generated from the data obtained above. If there are M pathogen operation groups, M corresponding specific k-mer appearance ratio tables are generated.
  • The specific k-mer appearance ratio tables can be stored in the target database, in the characteristic target sequence set corresponding to each pathogen operation group.
  • As shown in Figure 9, the leftmost column records the specific k-mers contained in pathogen operation group X.
  • The second column records the number of occurrences of each specific k-mer in the genomes contained in pathogen operation group X, which can be recorded as $C_1, C_2, \ldots$ respectively.
  • The third column records the occurrence ratio of each specific k-mer, that is, the ratio of the number of occurrences of that specific k-mer to the total number of occurrences; the ratios can be recorded as $F_1, F_2, \ldots$ respectively.
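  • Steps 802 to 806 can be sketched as follows. This is a minimal sketch under stated assumptions: genomes are plain strings, occurrences are counted with a sliding window of length k on one strand only, and the function name `occurrence_ratio_table` is hypothetical. The text does not prescribe an implementation.

```python
def occurrence_ratio_table(specific_kmers, genomes, k):
    """Build rows (k-mer, occurrences C_i, occurrence ratio C_i / S).

    `specific_kmers` are the specific k-mers of one pathogen operation
    group and `genomes` are the genomes included in that group.
    """
    counts = {kmer: 0 for kmer in specific_kmers}
    for genome in genomes:
        # A k-mer that appears N times in the same genome is counted N times.
        for start in range(len(genome) - k + 1):
            window = genome[start:start + k]
            if window in counts:
                counts[window] += 1
    total = sum(counts.values())  # S = C_1 + C_2 + ...
    return [
        (kmer, count, count / total if total else 0.0)
        for kmer, count in counts.items()
    ]


table = occurrence_ratio_table(["ACG", "CGT"], ["ACGACGT", "TTACG"], k=3)
for row in table:
    print(row)
```

  • In this toy input, ACG occurs three times and CGT once, so the ratios are 0.75 and 0.25 and sum to 1.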
  • step 306 includes steps:
  • Step 306A Obtain a specific k-mer whose actual number of occurrences is greater than a preset threshold of the actual number of occurrences, as a specific k-mer for confirming occurrence.
  • Each pathogen operation group has a corresponding record table of the actual number of occurrences of its specific k-mers, so the actual number of occurrences of each specific k-mer in the sequencing data can be obtained from that table.
  • A confirmed specific k-mer is a specific k-mer whose actual number of occurrences is greater than the preset threshold of the actual number of occurrences. The confirmed specific k-mers are therefore a subset of the specific k-mers, and they can be identified from the actual numbers of occurrences recorded in the specific k-mer actual occurrence record table.
  • Step 306B Obtain a false positive distribution of the specific k-mer included in each pathogen operation group in the corresponding simulated test data.
  • For a given pathogen operation group, simulated test data is obtained by using all genomes in the complete set that do not belong to that pathogen operation group as the data source, and randomly sampling from the data source to obtain a set of simulated test data.
  • the simulated test data has the same or similar data volume, error distribution, data format and other characteristics as the real data. For each pathogen operation group, N sets of simulated test data are generated.
  • the server obtains a false positive distribution of the specific k-mer included in each pathogen operation group in the simulated test data corresponding to each pathogen operation group.
  • The false positive distribution is generated in advance from the number of specific k-mer species of each pathogen operation group that appear in the simulated test data corresponding to that pathogen operation group.
  • The number of specific k-mer species is the number of distinct specific k-mers that appear in the simulated test data; a specific k-mer that appears several times in the same set of simulated test data is counted only once.
  • Step 306C comparing the number of confirmed specific k-mers corresponding to each pathogen operating group with the false positive distribution of the specific k-mer included in each pathogen operating group in the simulated test data to obtain each Probability of false positive detection in the pathogen manipulation group.
  • the server compares the number of confirmed specific k-mers corresponding to each pathogen operation group with the false positive distribution of the specific k-mer included in each pathogen operation group in the simulated test data to obtain each pathogen. Probability of false positive detection in the operation group.
  • Step 306D Select a pathogen operation group whose false positive detection probability is lower than a preset threshold as the pathogen operation group included in the sequencing data.
  • the server selects the pathogen operation group whose false positive detection probability of the pathogen operation group is lower than a preset threshold as the pathogen operation group included in the sequencing data.
  • A threshold is set in advance. When the false positive detection probability of a pathogen operation group is greater than the preset threshold, the detection signal of that pathogen operation group in the sequencing data is a suspected false positive and the group should not be selected. When the false positive detection probability of a pathogen operation group is lower than the preset threshold, the detection signal of that pathogen operation group in the sequencing data is considered not to be a false positive, and the group should be selected as a pathogen operation group contained in the sequencing data.
  • The preset threshold may be a value less than 0.05.
  • The false positive distribution of each pathogen operation group is obtained from the generated simulated test data, and the pathogen operation groups contained in the sequencing data are determined from the false positive distribution and the confirmed specific k-mers. This improves the accuracy with which the pathogen operation groups contained in the sequencing data are detected.
  • step 306B includes the steps:
  • 306B1 Obtain simulated test data corresponding to each pathogen operation group; the simulated test data is data obtained by randomly sampling from the genome of the entire set that does not belong to the pathogen operation group corresponding to the simulated test data.
  • The server obtains the simulated test data corresponding to each pathogen operation group contained in the complete set; the simulated test data is obtained by randomly sampling from the genomes in the complete set that do not belong to that pathogen operation group.
  • For each pathogen operation group, N sets of false positive simulated test data are generated. The value of N depends on the false positive resolution required for the final detection: if a false positive distribution with a resolution of one percent is required, N must be not less than 100; if a resolution of one thousandth is required, N must be not less than 1000.
  • The server counts the number of specific k-mer species of each pathogen operation group that appear in each corresponding set of simulated test data, and from these counts obtains the false positive distribution of the specific k-mers of each pathogen operation group in the corresponding simulated test data. That is, the N sets of false positive simulated test data of a pathogen operation group yield a distribution with N data points, each data point being the number of specific k-mer species that appear in one set of false positive simulated test data. This distribution is the specific k-mer false positive distribution of that pathogen operation group.
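  • One possible way to compare the observed number of confirmed specific k-mers with the false positive distribution is sketched below. The text does not state how the comparison is performed, so this sketch assumes an empirical tail fraction: the share of the N simulated data points that are at least as large as the observed value. The function name and this choice of estimator are assumptions, not the method of the application.

```python
def false_positive_probability(observed_species, simulated_species_counts):
    """Empirical probability that simulated (pathogen-free) data yields at
    least `observed_species` specific k-mer species.

    `simulated_species_counts` holds one value per set of simulated test
    data, so the resolution of the result is 1 / N.
    """
    n = len(simulated_species_counts)
    if n == 0:
        raise ValueError("at least one set of simulated test data is required")
    at_least = sum(1 for count in simulated_species_counts if count >= observed_species)
    return at_least / n


# Hypothetical distribution from N = 5 sets of simulated test data.
simulated = [0, 1, 1, 2, 5]
print(false_positive_probability(2, simulated))
print(false_positive_probability(6, simulated))
```

  • Under this assumption, a pathogen operation group would be selected when the returned value is below the preset threshold, for example 0.05. With only N = 5 data points the smallest non-zero value is 0.2, which illustrates why N must be at least 100 for a resolution of one percent.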
  • step 306 includes steps:
  • step 306a a specific k-mer whose actual number of occurrences is greater than a preset threshold of the actual number of occurrences is obtained as the specific k-mer for confirming occurrence, and the number of specific k-mers for confirming occurrence is obtained.
  • the specific k-mer whose actual number of occurrences is greater than a preset occurrence number threshold is obtained from the specific k-mer actual number of occurrences recording table, as the specific k-mer for confirming occurrence, and according to the specific k-mer for confirming occurrence Get the corresponding number of species.
  • Step 306b Obtain the number of specific k-mers included in each pathogen operation group.
  • step 306c the concentration of the confirmed specific k-mer of each pathogen operation group is calculated according to the number of specific k-mers confirmed to be present and the number of specific k-mers included in each pathogen operation group.
  • The number of specific k-mer species included in each pathogen operation group is obtained from the record table of the actual number of occurrences of the specific k-mers corresponding to that pathogen operation group. The ratio of the number of confirmed specific k-mer species to the number of specific k-mer species included in the pathogen operation group is then calculated; this ratio is the DUKP (Detected Unique K-mer Percentage), that is, the confirmed specific k-mer concentration of the pathogen operation group.
  • step 306d a pathogen operation group whose specific k-mer concentration is confirmed to be higher than a preset threshold is selected as the pathogen operation group included in the sequencing data.
  • The confirmed specific k-mer concentration is compared with a preset threshold, which is generally a value greater than 0.05. If the confirmed specific k-mer concentration is lower than the preset threshold, the detection signal in the sequencing data of the pathogen operation group corresponding to those specific k-mers is a suspected false positive and the group should not be selected. If the confirmed specific k-mer concentration is higher than the preset threshold, the detection signal in the sequencing data of the pathogen operation group corresponding to those specific k-mers is considered not to be a false positive, and the pathogen operation group is selected as a pathogen operation group contained in the sequencing data.
  • By calculating the proportion of confirmed specific k-mer species among all specific k-mer species of each pathogen operation group, the pathogen operation groups contained in the sequencing data can be determined more quickly, which improves efficiency.
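  • The DUKP calculation of steps 306a to 306d can be sketched as follows. The input layout (a mapping from each specific k-mer to its actual number of occurrences) and the names `dukp` and `select_groups` are hypothetical; the threshold values are placeholders rather than values taken from the text.

```python
def dukp(actual_counts, occurrence_threshold):
    """Detected Unique K-mer Percentage of one pathogen operation group.

    `actual_counts` maps each specific k-mer of the group to its actual
    number of occurrences in the sequencing data.
    """
    if not actual_counts:
        return 0.0
    confirmed = sum(1 for count in actual_counts.values() if count > occurrence_threshold)
    return confirmed / len(actual_counts)


def select_groups(groups, occurrence_threshold, dukp_threshold):
    """Keep the groups whose DUKP is higher than the preset threshold."""
    return [
        name for name, counts in groups.items()
        if dukp(counts, occurrence_threshold) > dukp_threshold
    ]


groups = {
    "X": {"AAA": 5, "AAC": 0, "AAG": 2, "AAT": 1},
    "Y": {"CCC": 0, "CCG": 1},
}
print(select_groups(groups, occurrence_threshold=1, dukp_threshold=0.05))
```

  • In this toy input, group X has two confirmed specific k-mers out of four (DUKP 0.5) and group Y has none, so only X is selected.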
  • the detection method of the pathogen operation group may include:
  • the ratio of the number of species of -mer to the number of specific k-mer species included in each suspected pathogen operating group. A pathogen operation group whose ratio is higher than a preset threshold is selected from the suspected pathogen operation groups as a pathogen operation group contained in the sequencing data.
  • the pathogen operation group included in the sequencing data is determined by the false positive detection probability, specific k-mer concentration, and relative concentration of the pathogen operation group, which can further improve the detection accuracy of the pathogen operation group.
  • the method further includes:
  • step 1002 each piece of medical information related to the pathogen operation group contained in the sequencing data is obtained from the target database.
  • Step 1004 Generate a final pathogen operation group list according to the medical information of each pathogen operation group included in the sequencing data.
  • Step 1006 Output the final pathogen operation group list to the detection output result of the sequencing data.
  • the target database also stores medical information for each pathogen operating group.
  • Medical information includes medical, clinical application, biology, pathology and other related information for each pathogen operating group.
  • the corresponding medical information of a certain bacterium includes its bacterial morphology, metabolic characteristics, aerobic properties, Gram staining, common diseases caused by it, conventional detection methods, and conventional treatment drugs.
  • medical information of each pathogen operating group contained in the sequencing data can be obtained from the target database.
  • a final pathogen operating group list can be generated.
  • the final pathogen operation group list includes the pathogen operation group included in the sequencing data, and medical information corresponding to each pathogen operation group included in the table. This final pathogen operating group list is output to the detection results of the sequencing data, which can be used as part of the data in the diagnostic results of the sequencing data for reference by technicians.
  • the method further includes:
  • Step 1102 Obtain the total number of occurrences CT of the k-mers that appear in the sequencing data.
  • In addition, the relative concentration of each pathogen operation group contained in the sequencing data can be calculated, so as to estimate the relative concentrations of the contained pathogens.
  • The data used here relates to k-mers rather than to specific k-mers, so the k-mers contained in each pathogen operation group are obtained.
  • The total number of occurrences CT of the k-mers that appear in the sequencing data can be obtained as follows. If the sequencing data contains M pieces of data and each piece contains n characters, each piece contains $n - k + 1$ k-mers, so summing the number of k-mers over the M pieces of data gives the total number of occurrences of k-mers in the sequencing data, $CT = M \times (n - k + 1)$.
  • Step 1104 Obtain the estimated total number of actual occurrences CF, in the sequencing data, of the k-mers included in a pathogen operation group contained in the sequencing data.
  • step 1106 the relative concentration of each pathogen operating group is calculated according to the total number of occurrences CT and the estimated total number of actual occurrences CF.
  • the ratio of CF to CT is the relative concentration of the pathogen operating group.
  • The estimated total number of actual occurrences CF is the sum, over the k-mers included in the pathogen operation group, of their estimated numbers of actual occurrences in the sequencing data.
  • the estimated total number of actual occurrences CF is an estimated value, not an actual measured value.
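  • Steps 1102 and 1106 can be sketched as follows. The sketch generalises the count of k-mers per piece of data to pieces of unequal length; the function names are hypothetical, and the value of CF used in the example is made up for illustration.

```python
def total_kmer_occurrences(reads, k):
    """CT: total number of k-mer occurrences in the sequencing data.

    A piece of data with n characters contains n - k + 1 k-mers; pieces
    shorter than k contribute nothing.
    """
    return sum(max(len(read) - k + 1, 0) for read in reads)


def relative_concentration(cf, ct):
    """Relative concentration of a pathogen operation group: CF / CT."""
    if ct == 0:
        raise ValueError("the sequencing data contains no k-mers")
    return cf / ct


# M = 3 pieces of data with n = 10 characters each, k = 4.
reads = ["ACGTACGTAC", "TTTTACGTAA", "GGGACGTACC"]
ct = total_kmer_occurrences(reads, k=4)
print(ct)  # 3 * (10 - 4 + 1) = 21
print(relative_concentration(4.2, ct))  # hypothetical CF = 4.2
```

  • When every piece of data has the same length n, `total_kmer_occurrences` reduces to M × (n − k + 1), which matches the formula in the text.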
  • step 1104 includes:
  • step 1202 the pathogen operation group included in the sequencing data is obtained as the final pathogen operation group.
  • The selected pathogen operation groups are taken as the pathogen operation groups contained in the sequencing data.
  • The pathogen operation groups contained in the sequencing data are then used as the final pathogen operation groups.
  • Step 1204 Obtain the proportion of the complete set of k-mer included in each final pathogen operation group.
  • The complete-set proportion is the ratio of the number of occurrences of a k-mer in the corresponding pathogen operation group to its number of occurrences in the complete set.
  • The characteristic target sequence set corresponding to each pathogen operation group stored in the target database also includes a complete-set proportion table of k-mer occurrences for that pathogen operation group. This table records the complete-set proportion of each k-mer included in the pathogen operation group, and therefore the complete-set proportion of each k-mer included in each final pathogen operation group.
  • The complete-set proportion of a k-mer is the ratio of the number of occurrences of the k-mer in the pathogen operation group to which it belongs to the number of occurrences of the k-mer in the complete set.
  • step 1206 the estimated actual number of occurrences of each k-mer is calculated according to the proportion of the complete set corresponding to each k-mer and the actual number of occurrences.
  • Step 1208 Calculate a total CF of the estimated actual occurrence times of k-mers included in the pathogen operation group included in the sequencing data according to the estimated actual occurrence times of each k-mer.
  • a table of actual occurrences of k-mer corresponding to each final pathogen operation group can be generated.
  • The actual number of occurrences of a k-mer is the number of times the k-mer appears in the sequencing data. If a k-mer appears N times in the sequencing data, its actual number of occurrences is N.
  • the actual occurrence times corresponding to each k-mer can be obtained according to this table. Based on the proportion of the complete set of each k-mer obtained, the estimated actual number of occurrences of each k-mer can be calculated. After obtaining the estimated actual occurrence times of each k-mer included in each final pathogen operation group, a table of estimated actual occurrence times of k-mer corresponding to each final pathogen operation group can be generated.
  • In the k-mer estimated actual occurrences table shown in Fig. 13, the leftmost column records the k-mers included in pathogen operation group X. The second column records the complete-set proportion of each k-mer, which can be recorded as $F_1, F_2, \ldots$ respectively. The complete-set proportion of each k-mer comes from the complete-set proportion table of k-mer occurrences of the final pathogen operation group to which the k-mer belongs.
  • The third column records the actual number of occurrences of each k-mer in the sequencing data, which can be recorded as $C_1, C_2, \ldots$ respectively.
  • The fourth column records the estimated actual number of occurrences of each k-mer, which can be recorded as $F_1 \times C_1, F_2 \times C_2, \ldots$ respectively; the estimated actual number of occurrences of each k-mer is calculated from the complete-set proportion of the k-mer and its actual number of occurrences.
  • The estimated actual numbers of occurrences of all k-mers included in each final pathogen operation group can be obtained from the k-mer estimated actual occurrences table of that group, and summing them gives the estimated total number of actual occurrences CF of the k-mers included in that pathogen operation group.
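  • The table of Fig. 13 and the resulting CF can be sketched as follows. The dictionaries standing in for the complete-set proportion table and the actual occurrences table, and the function name, are hypothetical; the numbers are made up for illustration.

```python
def estimated_occurrence_table(proportions, actual_counts):
    """Rows (k-mer, F_i, C_i, F_i * C_i) for one final pathogen operation group.

    `proportions` maps each k-mer of the group to its complete-set
    proportion F_i; `actual_counts` maps k-mers to their actual number of
    occurrences C_i in the sequencing data (0 if the k-mer is absent).
    """
    rows = []
    for kmer, proportion in proportions.items():
        count = actual_counts.get(kmer, 0)
        rows.append((kmer, proportion, count, proportion * count))
    return rows


def estimated_total(rows):
    """CF: sum of the estimated actual numbers of occurrences."""
    return sum(row[3] for row in rows)


rows = estimated_occurrence_table({"ACG": 0.5, "CGT": 1.0}, {"ACG": 4, "CGT": 3})
print(rows)
print(estimated_total(rows))
```

  • Weighting each actual count by the complete-set proportion attributes only part of a shared k-mer's occurrences to the group: here ACG is shared with other groups (F = 0.5) and contributes 2 of its 4 occurrences, while CGT (F = 1.0) contributes all 3, giving CF = 5.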
  • step 1204 includes:
  • Step 1402 Obtain a record table of the number of k-mer occurrences corresponding to each final pathogen operation group.
  • Step 1404 Obtain the k-mer included in the k-mer occurrence number record table.
  • the final pathogen operating group refers to the pathogen operating group included in the sequencing data.
  • A characteristic target sequence set corresponding to each pathogen operation group is stored, and the characteristic target sequence set of each pathogen operation group includes a k-mer occurrence number record table corresponding to that pathogen operation group.
  • The k-mer occurrence number record table corresponding to each pathogen operation group records all k-mers that occur in the genomes of the pathogen operation group, together with the number of times each k-mer appears in those genomes. If a k-mer appears N times in the same genome, it is counted N times.
  • After the k-mer occurrence number record table corresponding to each final pathogen operation group, that is, each pathogen operation group contained in the sequencing data, has been obtained, the k-mers contained in each table can be obtained.
  • The k-mer occurrence number record table records both the k-mers included in each pathogen operation group and their numbers of occurrences. Here only the list of k-mers included in each pathogen operation group is required, not the number of occurrences of each k-mer, so it is sufficient to obtain the k-mer list from the k-mer occurrence number record table.
  • Step 1406 Obtain the complete-set proportion table of k-mer occurrences corresponding to each final pathogen operation group; this table includes, for each k-mer, the proportion of its number of occurrences in the pathogen operation group to its number of occurrences in the complete set.
  • Step 1408 Obtain, from the complete-set proportion table of k-mer occurrences, the complete-set proportion of each k-mer included in the k-mer occurrence number record table.
  • the feature target sequence set corresponding to each pathogen operation group also includes a proportion table of the number of k-mer occurrences corresponding to the pathogen operation group in the complete set.
  • The complete-set proportion table includes, for each k-mer, its number of occurrences CG in the corresponding pathogen operation group, its number of occurrences CB in the complete set of pathogen operation groups in the target database, and its complete-set proportion F. The complete-set proportion of each k-mer can therefore be obtained from the complete-set proportion table of k-mer occurrences.
  • Once the k-mers included in each k-mer occurrence number record table have been obtained, they can be looked up in the complete-set proportion table of k-mer occurrences to obtain the complete-set proportion of each k-mer recorded in the k-mer occurrence number record table corresponding to each final pathogen operation group.
  • the method further includes:
  • Step 1502 Obtain the k-mers included in each final pathogen operation group according to the k-mer occurrence number record table corresponding to each final pathogen operation group.
  • step 1504 a k-mer occurrence union list is generated according to the k-mer occurrence times record table corresponding to each pathogen operation group.
  • Step 1506 Obtain the actual number of occurrences of each k-mer included in the k-mer appearance union list in the sequencing data.
  • Step 1508 Acquire the actual number of occurrences corresponding to k-mer included in each final pathogen operation group.
  • Step 1510 Generate a table of actual occurrence times of k-mer corresponding to the final pathogen operation group according to the actual occurrence times corresponding to each k-mer.
  • The target database stores the characteristic target sequence set of each pathogen operation group, and the characteristic target sequence set of each pathogen operation group includes a corresponding k-mer occurrence number record table. The k-mer occurrence number record table records the k-mers included in the pathogen operation group and the number of occurrences of each k-mer in the genomes included in the pathogen operation group.
  • The final pathogen operation groups are the pathogen operation groups contained in the sequencing data; only the name differs, and nothing has changed in substance. Each final pathogen operation group has a corresponding k-mer occurrence number record table. Therefore, obtaining from the target database the k-mer occurrence number record table corresponding to each pathogen operation group contained in the sequencing data is equivalent to obtaining the k-mer occurrence number record table corresponding to each final pathogen operation group.
  • The k-mer occurrence number record table records both the k-mers included in each pathogen operation group and their numbers of occurrences. Here only the list of k-mers included in each pathogen operation group is required, not the number of occurrences of each k-mer, so it is sufficient to obtain the k-mer list from the k-mer occurrence number record table.
  • the k-mer occurrence union list can be generated according to the k-mer included in the k-mer occurrence number record table of each pathogen operation group, that is, the k-mer list of each pathogen operation group is merged into one The total list.
  • The actual number of occurrences in the sequencing data of each k-mer included in the k-mer occurrence union list can then be obtained. From this, the actual number of occurrences corresponding to each k-mer included in each final pathogen operation group can be obtained, and a table of actual numbers of occurrences of k-mers corresponding to each final pathogen operation group can be generated.
  • For example, from the complete-set proportions of the k-mers contained in each pathogen operation group and the actual numbers of occurrences recorded for those k-mers in the k-mer actual occurrences table, the estimated actual number of occurrences of each k-mer contained in each pathogen operation group can be calculated, thereby generating the k-mer estimated actual occurrences table of the corresponding pathogen operation group.
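  • Steps 1502 to 1510 can be sketched as follows. This is a minimal sketch, assuming the sequencing data is a list of strings and that all k-mers share the same length k; the function name and data layout are hypothetical. Building one union list means the sequencing data only has to be scanned once, however many final pathogen operation groups there are.

```python
def actual_occurrence_tables(group_kmers, reads, k):
    """Table of actual k-mer occurrences for each final pathogen operation group.

    `group_kmers` maps a group name to the k-mer list taken from its
    k-mer occurrence number record table.
    """
    # Union list: merge the k-mer lists of all groups into one collection.
    union = set().union(*group_kmers.values())
    counts = dict.fromkeys(union, 0)

    # Single pass over the sequencing data with a sliding window.
    for read in reads:
        for start in range(len(read) - k + 1):
            window = read[start:start + k]
            if window in counts:
                counts[window] += 1

    # Split the shared counts back into one table per group.
    return {
        name: {kmer: counts[kmer] for kmer in kmers}
        for name, kmers in group_kmers.items()
    }


tables = actual_occurrence_tables(
    {"X": ["ACG", "CGT"], "Y": ["CGT", "GTT"]},
    reads=["ACGT", "CGTT"],
    k=3,
)
print(tables)
```

  • The k-mer CGT belongs to both groups and appears in both tables with the same actual count; the complete-set proportion is what later divides those occurrences between the groups.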
  • the method before obtaining the sequencing data of the sample, the method further includes the following steps:
  • step 1602 the number of occurrences of k-mer included in the pathogen operation group in the pathogen operation group CG is obtained.
  • The number of occurrences CG of a k-mer in its pathogen operation group is the number of times the k-mer appears in the genomes contained in the pathogen operation group to which it belongs. If the k-mer occurs N times in the same genome, it is counted N times.
  • The number of occurrences CG can be obtained from the k-mer occurrence number record table corresponding to each pathogen operation group, which records the number of occurrences in the pathogen operation group of each k-mer included in that group.
  • Step 1604 Obtain the occurrence number CB of the k-mer included in the pathogen operation group in the ensemble.
  • The number of occurrences CB of a k-mer in the complete set of pathogen operation groups in the target database is the number of times the k-mer appears in the genomes included in the complete set.
  • The number of occurrences CB of each k-mer can be obtained from the k-mer occurrence number record table of the complete set, which records the number of occurrences of each k-mer in the complete set.
  • step 1606 the proportion F of the complete set of each k-mer is calculated as the ratio of the number of occurrences CG to the number of occurrences of the complete set CB.
  • Step 1608 Generate a proportion table of the number of k-mer occurrences corresponding to each pathogen operation group according to the proportion of the complete set.
  • step 1610 the proportion table of the number of k-mer occurrences in the complete set is stored in the feature target sequence set corresponding to the pathogen operation group.
  • The complete-set proportion table of k-mer occurrences is laid out as follows.
  • The leftmost column records the k-mers contained in pathogen operation group X.
  • The second column records the number of occurrences CG of each k-mer in the pathogen operation group, namely $CG_1, CG_2, \ldots$
  • The third column records the number of occurrences CB of each k-mer in the complete set, namely $CB_1, CB_2, \ldots$
  • The fourth column records the complete-set proportion F of each k-mer.
  • The complete-set proportion table of k-mer occurrences can be saved in the target database, in the characteristic target sequence set corresponding to the pathogen operation group. Once stored, the data can be retrieved from the target database when needed, which improves detection efficiency.
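  • Steps 1602 to 1608 can be sketched as follows. The dictionaries standing in for the two occurrence number record tables and the function name are hypothetical; the counts are made up for illustration.

```python
def complete_set_proportion_table(group_counts, complete_counts):
    """Rows (k-mer, CG, CB, F) with F = CG / CB for one pathogen operation group.

    `group_counts` maps each k-mer of the group to its occurrences CG in
    the group's genomes; `complete_counts` maps k-mers to their
    occurrences CB in the genomes of the complete set.
    """
    rows = []
    for kmer, cg in group_counts.items():
        cb = complete_counts[kmer]  # CB >= CG > 0 because the group is part of the complete set
        rows.append((kmer, cg, cb, cg / cb))
    return rows


rows = complete_set_proportion_table(
    {"ACG": 2, "CGT": 3},
    {"ACG": 8, "CGT": 3, "TTT": 5},
)
for row in rows:
    print(row)
```

  • A proportion of 1.0, as for CGT here, means the k-mer occurs only in this pathogen operation group; a smaller value, as for ACG, means the k-mer is shared with other groups in the complete set.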
  • The method further includes: obtaining the actual number of occurrences, in the sequencing data of the sample, of the k-mers included in each pathogen operation group; and generating a table of actual numbers of occurrences of k-mers corresponding to each pathogen operation group according to those actual numbers of occurrences.
  • the actual number of occurrences of each k-mer included in each pathogen operation group in the sequencing data can be obtained, and according to the actual occurrence of k-mer included in each pathogen operation group The number of times generates a table of actual occurrence times of k-mer corresponding to each pathogen operation group.
  • the data can be obtained by obtaining the table of actual occurrences of k-mer, which improves the efficiency of estimating the relative concentration of the pathogen operating group.
  • Before obtaining the sequencing data of the sample, the method further includes:
  • Step 1802: the number of occurrences, in the pathogen operation group, of each k-mer included in each pathogen operation group is obtained.
  • Step 1804: a k-mer occurrence record table corresponding to each pathogen operation group is generated according to the number of occurrences of each k-mer in the pathogen operation group.
  • Step 1806: the k-mer occurrence record table is stored in the feature target sequence set corresponding to the pathogen operation group.
  • A k-mer occurrence record table can be generated for each pathogen operation group. If a total of M pathogen operation groups exist in the target database, there will be M k-mer occurrence record tables, one per group. Each table is then stored in the feature target sequence set of its pathogen operation group in the target database for subsequent use.
  • Obtaining the number of occurrences CG, in the pathogen operation group, of the k-mers included in the pathogen operation group includes: reading CG from the k-mer occurrence record table corresponding to the pathogen operation group.
  • Before acquiring the sequencing data of the sample, the method further includes:
  • Step 1902: the number of occurrences in the complete set of each k-mer included in each pathogen operation group of the complete set is obtained.
  • Step 1904: a k-mer occurrence record table of the complete set is generated according to the complete-set occurrence count of each k-mer.
  • Step 1906: the k-mer occurrence record table of the complete set is stored in the target database.
  • The complete-set occurrence count of a k-mer is the number of times the k-mer occurs in all pathogen operation groups and non-pathogen operation groups included in the complete set; equivalently, it is the number of times the k-mer occurs in all genomes included in the complete set. Therefore, after the complete-set occurrence counts of the k-mers included in each pathogen operation group are obtained, a k-mer occurrence record table of the complete set can be generated, which records all occurrences of each k-mer in the complete set. Once generated, the table can be stored in the target database for subsequent use.
  • Because the target database already stores the k-mer occurrence record table of the complete set, the complete-set occurrence counts of the k-mers of any pathogen operation group in the target database can be read from that table when needed.
  • a method for detecting a pathogen operation group including the following steps:
  • Step 2002: a feature target sequence set corresponding to each pathogen operation group is established.
  • Establishing the feature target sequence set corresponding to each pathogen operation group includes the following steps:
  • Step 2002A: high-confidence genomes are collected and curated.
  • The high-confidence genomes can include both pathogen genomes and non-pathogen genomes, such as high-confidence genomes of symbiotic bacteria, probiotics, humans, animals, and plants.
  • High-confidence genomes can be taken from the NCBI RefSeq dataset or from other public or private collections of high-confidence genomes.
  • Genomes can be screened by their proportion of non-deterministic characters. For a DNA genome, for example, the proportion of non-deterministic characters is the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that piece of data is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are characters other than A, C, G, T, and U. For protein sequences, non-deterministic characters are characters other than those denoting definite amino acids.
  • Genomes with a low average coverage percentage are suspected of low completeness, i.e., low confidence.
  • Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals).
  • Step 2002B: the pathogen operation groups included in the complete set are determined.
  • A pathogen operating group can include one or more related genomes. Depending on clinical needs and/or taxonomic support, a pathogen operating group can therefore represent a genetic unit or taxonomic unit at different classification levels, such as a species, a subspecies, a subtype, a bacterial or viral strain, or a genus. When each pathogen operating group is established, its relevant medical, clinical, biological, and pathological information can be clarified at the same time. For example, for a given bacterium: its morphology, metabolic characteristics, oxygen requirements, Gram staining, the common diseases it causes, conventional detection methods, conventional treatment drugs, and so on.
  • Step 2002C: a genome occurrence index table of the complete set is generated.
  • The genome occurrence index table of the complete set can then be generated.
  • k-mer refers to a genomic sequence of length k.
  • k can be user-defined and is generally set between 11 and 32. If genomic data uses a distinct deterministic characters, then for a specific k there are at most a^k (a to the power k) distinct k-mers.
  • For example, DNA has four deterministic characters, A, C, G, and T, so for a particular k there are 4^k possible distinct k-mers.
  • A genome of length n contains at most n-k+1 distinct k-mers. Because a genome contains repeated regions, the number of distinct k-mers in a genome of n characters is generally much smaller than n-k+1. With an ordinary k-mer counting method, a given k-mer that appears multiple times in a genome is therefore counted multiple times for that genome.
  • The genome occurrence index table of the complete set established in this embodiment differs from that method: if a k-mer appears more than once in a genome, the table still counts it only once for that genome. The count for a k-mer in the resulting table therefore represents the number of genomes in the complete set in which the k-mer appears.
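  • The once-per-genome counting rule can be sketched in Python as follows. This is a minimal illustration on toy strings; the function names `distinct_kmers` and `genome_occurrence_index` are assumptions made for this example, not names used by the described method.

```python
from collections import Counter


def distinct_kmers(genome, k):
    """Return the set of distinct k-mers in one genome (at most n-k+1 of them)."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def genome_occurrence_index(genomes, k):
    """Count, for each k-mer, the number of genomes in which it appears at least once."""
    index = Counter()
    for genome in genomes:
        # A set is used so that each genome contributes at most 1 per k-mer.
        index.update(distinct_kmers(genome, k))
    return index


# Toy "complete set" of two short genomes with k = 2 (real k is typically 11 to 32).
index = genome_occurrence_index(["AAAA", "AAAC"], 2)
```

  • Here "AA" occurs three times in the first genome and twice in the second, yet its index count is 2, the number of genomes containing it.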
  • Step 2002D: a genome occurrence index table corresponding to each pathogen operation group is generated.
  • The genome occurrence index table of a pathogen operating group differs from the genome occurrence index table of the complete set generated in step 2002C.
  • The genome occurrence index table of the complete set records, for the complete set as a whole, how many genomes each k-mer has appeared in.
  • The genome occurrence index table of a pathogen operation group is built per group: it records the k-mers contained in that group and, for each of them, how many genomes of the group it has appeared in.
  • Step 2002E: a specific k-mer table corresponding to each pathogen operation group is generated.
  • The specific k-mer table of a pathogen operating group records the k-mers of that group that meet the preset specificity conditions, that is, the specific k-mers.
  • A specific k-mer is a k-mer selected because it meets the preset specificity conditions. The selection of a specific k-mer must meet the following two conditions:
  • the pathogen operating group contains N genomes
  • the number of occurrences of a k-mer in the genome occurrence index table corresponding to the pathogen operating group is C 1
  • The condition to be met is C 1 / C 2 + P 2 ≥ 1, where C 2 is the number of occurrences of the k-mer in the genome occurrence index table of the complete set; that is, the sum of the ratio C 1 / C 2 and the second threshold P 2 is greater than or equal to 1.
  • The second threshold P 2 is usually less than 5%.
  • The first threshold P 1 and the second threshold P 2 may be equal or different.
  • The two parameters P 1 and P 2 are introduced when specific k-mers are selected in order to allow an error rate within a certain range, that is, to allow a specific k-mer to be non-specific within a certain range. Without these two parameters, no non-specificity can be tolerated, and it is often difficult to find specific k-mers for a given pathogen operating group.
  • The probability of a false positive for the pathogen operation group is in fact less than or equal to P 2 ^ n' (that is, P 2 raised to the power n').
  • The false negative rate is the proportion of positives that yield a negative test result, that is, the conditional probability of a negative test result given that the condition being tested for is present.
  • If a total of M pathogen operating groups are defined in step 2002B, M corresponding specific k-mer tables are established in this step.
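  • The condition C 1 / C 2 + P 2 ≥ 1 stated above can be checked as in the following Python sketch. Only this condition is shown, because the formula of the condition involving N and P 1 is not given in this passage. The function names and the dictionary inputs are assumptions made for illustration.

```python
def passes_second_condition(c1, c2, p2):
    """Check C1 / C2 + P2 >= 1.

    c1: number of genomes in the pathogen operating group that contain the k-mer.
    c2: number of genomes in the complete set that contain the k-mer.
    p2: second threshold (the text says it is usually less than 5%).
    """
    return c1 / c2 + p2 >= 1


def candidate_specific_kmers(group_index, complete_index, p2):
    """Keep the k-mers of a group that pass the condition above.

    group_index / complete_index: hypothetical dicts mapping k-mer -> genome count.
    """
    return [kmer for kmer, c1 in group_index.items()
            if passes_second_condition(c1, complete_index[kmer], p2)]


# Made-up genome counts: "ACG" is almost exclusive to the group, "CGT" is not.
candidates = candidate_specific_kmers({"ACG": 99, "CGT": 1}, {"ACG": 100, "CGT": 2}, 0.05)
```

  • With P 2 = 0, the condition reduces to C 1 = C 2, meaning the k-mer appears in no genome outside the group.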
  • Step 2002F: a specific k-mer appearance ratio table corresponding to each pathogen operation group is generated.
  • A specific k-mer appearance ratio table corresponding to the pathogen operation group can be established.
  • If M pathogen operating groups are defined in step 2002B, M corresponding specific k-mer appearance ratio tables are established in this step.
  • Step 2002G: a k-mer occurrence record table corresponding to each pathogen operation group is generated.
  • Step 2002H: a k-mer occurrence record table of the complete set is generated.
  • Step 2002I: a table of complete-set proportions of k-mer occurrence counts is generated for each pathogen operation group.
  • The k-mer occurrence record table of a pathogen operation group records the k-mers contained in the group and the number of occurrences of each k-mer in the group.
  • It differs from the genome occurrence index table of the pathogen operating group, which records how many genomes of the group each k-mer appears in: if a k-mer occurs X (X greater than 1) times in the same genome, the index table records it only once.
  • The k-mer occurrence record table instead records the number of occurrences of each k-mer in the genomes contained in the pathogen operating group: if a k-mer occurs X times in the same genome, it is recorded X times.
  • The k-mer occurrence record table of the complete set generated in step 2002H counts in the same way as the table of step 2002G: if a k-mer occurs X times in the same genome, it is recorded X times. The difference is that it records the occurrences of each k-mer in the genomes of all pathogen operating groups; it is created once for the whole rather than once per group, and it gives the total number of occurrences of every k-mer across all genomes. That is, if M pathogen operating groups have been defined, step 2002G establishes M k-mer occurrence record tables, whereas step 2002H establishes only one.
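  • A minimal Python sketch of the occurrence record tables, in which every occurrence is counted. The names `occurrence_record` and `build_record_tables` are assumptions made for this example, and the input dictionary is assumed to cover every genome of the complete set.

```python
from collections import Counter


def occurrence_record(genomes, k):
    """Count every occurrence of each k-mer: X occurrences in one genome count X times."""
    counts = Counter()
    for genome in genomes:
        for i in range(len(genome) - k + 1):
            counts[genome[i:i + k]] += 1
    return counts


def build_record_tables(groups, k):
    """Return one table per group (M tables) plus a single table for the complete set.

    groups: hypothetical dict mapping group name -> list of genome strings.
    """
    per_group = {name: occurrence_record(genomes, k) for name, genomes in groups.items()}
    complete = Counter()
    for table in per_group.values():
        complete.update(table)  # adds the per-group counts together
    return per_group, complete


per_group, complete = build_record_tables({"X": ["AAAA"], "Y": ["AAAC"]}, 2)
```

  • In this toy input, "AA" is recorded 3 times for group X and 5 times in the complete set, whereas a once-per-genome index would record 1 and 2.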
  • The table of complete-set proportions generated in step 2002I records, for each pathogen operation group, the number of occurrences CG of each of its k-mers in the group, the number of occurrences CB of each of those k-mers in the complete set (that is, across the pathogen operation groups contained in the target database), and the complete-set proportion F calculated as the ratio of CG to CB.
  • If there are M pathogen operation groups, M corresponding tables of complete-set proportions are established.
  • The three tables generated in steps 2002G, 2002H, and 2002I are established for estimating the relative concentration of the pathogen operating groups contained in the sequencing data.
  • If the relative concentration does not need to be estimated, these three steps may be omitted.
  • Module A can be run from time to time to keep the feature target sequence set of each pathogen operating group, that is, the target database, up to date. However, module A does not need to be run or updated during the analysis of each sample.
  • Step 2004: the pathogen operation groups included in the sequencing data of the sample are detected.
  • Samples can be mixed metagenome samples or complex genome sequencing samples.
  • A mixed metagenomic sample is a sample that has been sequenced directly without any separation. It may contain viruses, bacteria, fungi, and various other prokaryotes and eukaryotes.
  • A complex genome sequencing sample is a sample whose sequenced genomes may come from more than one organism, such as a sample that has not been completely separated or a sample that has been contaminated.
  • Sequencing data refers to the data output by the device after the sequence of the biomolecules contained in a sample is read by a DNA sequencer, an RNA sequencer, or a protein sequencing device.
  • Step 2004 includes:
  • Step 2004A: the sequencing data of the sample is obtained.
  • Step 2004B: the specific k-mer appearance ratio tables are called.
  • Step 2004C: a record table of actual occurrences of specific k-mers is generated for each pathogen operation group.
  • The sequencing data of the sample to be tested is obtained, and the specific k-mer appearance ratio tables generated in step 2002F are called for the detection. If M specific k-mer appearance ratio tables, one per pathogen operation group, were generated in step 2002F, all M tables need to be called.
  • For the sequencing data of a given sample, the actual number of occurrences in the sequencing data of each specific k-mer listed in each specific k-mer appearance ratio table is obtained, and a record table of actual occurrences of specific k-mers is generated for each pathogen operation group.
  • Step 2004D: pathogen operation groups with a low probability of occurrence are excluded, and suspected pathogen operation groups are selected.
  • From the record tables of actual occurrences, the actual number of occurrences of each specific k-mer in the sequencing data can be obtained, and from these the sum of actual occurrences in the sequencing data of all specific k-mers of each pathogen operation group: for each pathogen operation group, the actual numbers of occurrences in the sequencing data of its specific k-mers are added together, and the result is the sum of actual occurrences.
  • A threshold T is set. If the sum of actual occurrences for a pathogen operation group is below the threshold T, the group is determined to have a low probability of occurrence and can be removed.
  • The threshold T can be set by a technician and is generally set to a number greater than 5.
  • Suspected pathogen operation groups are then selected from the remaining pathogen operation groups, as follows:
  • Let the actual number of occurrences of a specific k-mer be C i . If C i is greater than a threshold T ′, the specific k-mer is considered to have actually appeared in the sequencing data and is treated as a confirmed specific k-mer. T ′ can be set by a technician and is generally set to a number greater than 5.
  • If n confirmed specific k-mers are found for a pathogen operation group, the probability of a false positive for that group is P 2 ^ n (P 2 raised to the power n), where P 2 is the second threshold.
  • If n is large enough, the probability of a false positive for the pathogen operation group is small.
  • The n confirmed specific k-mers of a pathogen operation group must be non-overlapping specific k-mers after independence correction. That is, the specific k-mers whose actual number of occurrences is greater than the threshold T ′ are first taken as confirmed specific k-mers; among them, those that are non-overlapping after independence correction are identified; their number is n, and this n is used to calculate the false positive probability of the pathogen operation group.
  • A pathogen operating group whose false positive probability is lower than a preset standard probability can be selected as a suspected pathogen operating group.
  • The preset standard probability is set in advance by a technician, generally to a value less than 0.05.
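  • The screening in step 2004D can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the independence correction for overlapping k-mers is not specified here and is omitted, so n is simply the number of specific k-mers whose count exceeds T ′; a group is excluded when its sum of actual occurrences is below T; the function name and input layout are hypothetical.

```python
def select_suspected_groups(actual_counts_by_group, p2, t, t_prime, standard_probability):
    """Return {group: false positive probability} for the suspected groups.

    actual_counts_by_group: hypothetical dict, group -> {specific k-mer: actual count}.
    p2: second threshold; t, t_prime: thresholds T and T'.
    """
    suspected = {}
    for group, counts in actual_counts_by_group.items():
        if sum(counts.values()) < t:
            continue  # low probability of occurrence: exclude the group
        # Confirmed specific k-mers (independence correction omitted in this sketch).
        n = sum(1 for c in counts.values() if c > t_prime)
        false_positive = p2 ** n
        if false_positive < standard_probability:
            suspected[group] = false_positive
    return suspected


# Made-up counts for three groups.
counts = {
    "A": {"k1": 10, "k2": 8, "k3": 7},
    "B": {"k4": 2, "k5": 1},
    "C": {"k6": 6},
}
suspected = select_suspected_groups(counts, p2=0.05, t=6, t_prime=5, standard_probability=0.05)
```

  • Group B is dropped by the threshold T; group C has a single confirmed k-mer, so its false positive probability equals P 2 and is not below the standard probability.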
  • Step 2004E: pathogen operation groups that meet a preset condition are selected from the suspected pathogen operation groups as the pathogen operation groups included in the sequencing data.
  • A goodness-of-fit test is used to check whether the actual appearance ratio of the confirmed specific k-mers matches the expected specific k-mer appearance ratio, where the expected ratio of each specific k-mer is the one recorded in the specific k-mer appearance ratio table. The result of this statistical test is a p-value.
  • Pathogen operation groups that do not reach the statistical test standard P ′ T are removed, that is, groups whose confirmed specific k-mers show an actual appearance ratio that does not match the expected appearance ratio. In other words, from the suspected pathogen operation groups obtained in step 2004D, the groups whose confirmed specific k-mers show an actual appearance ratio matching the expected one are selected as the pathogen operation groups included in the sequencing data.
  • P ′ T is set in advance by a technician, generally to a value less than 0.05.
  • The goodness-of-fit test can be, for example, a chi-squared test or a likelihood-ratio test.
  • The actual numbers of occurrences of the specific k-mers can be scaled down proportionally. The scale factor is determined by the smallest number of occurrences among all specific k-mers actually detected, so that this smallest value is not less than 1 after scaling.
  • Proportional scaling here means that the ratios between the numbers of occurrences of the specific k-mers actually detected are preserved.
  • The final pathogen operation group list contains the name of each pathogen operation group, its calculated p-value, and the related medical, clinical, biological, pathological, and other information obtained for the group.
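  • As one possible reading of the goodness-of-fit step, the Pearson chi-squared statistic can be computed as in the following Python sketch. The function names are assumptions made for this example, dividing by the smallest count is just one scale factor that keeps the minimum at 1, and the p-value itself is not computed here: it would come from the chi-squared distribution with one fewer degree of freedom than the number of k-mers, for instance via `scipy.stats.chisquare`.

```python
def scale_down(observed):
    """Scale counts proportionally so that the smallest count becomes exactly 1."""
    smallest = min(observed.values())
    return {kmer: count / smallest for kmer, count in observed.items()}


def chi_squared_statistic(observed, expected_ratio):
    """Pearson chi-squared statistic of observed counts against expected ratios.

    observed: hypothetical dict, specific k-mer -> actual (possibly scaled) count.
    expected_ratio: hypothetical dict, specific k-mer -> expected appearance ratio.
    """
    total = sum(observed.values())
    ratio_sum = sum(expected_ratio[kmer] for kmer in observed)
    statistic = 0.0
    for kmer, obs in observed.items():
        expected = total * expected_ratio[kmer] / ratio_sum
        statistic += (obs - expected) ** 2 / expected
    return statistic


observed = {"a": 40, "b": 20}
ratios = {"a": 1, "b": 1}
raw_statistic = chi_squared_statistic(observed, ratios)
scaled_statistic = chi_squared_statistic(scale_down(observed), ratios)
```

  • Scaling the counts down leaves their ratios unchanged but lowers the statistic, so very deep sequencing does not by itself lead to rejection.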
  • Step 2006: the relative concentration of the pathogen operating groups contained in the sequencing data is estimated.
  • After the pathogen operating groups contained in the sequencing data are detected, their relative concentration can be further estimated.
  • Because the sample has to go through preparation, sequencing, and signal processing, it is difficult to calculate the absolute concentration of the pathogens contained in the sample accurately. The relative concentration of the pathogens contained in the sample is therefore estimated instead.
  • Step 2006 includes:
  • Step 2006A: the final pathogen operation group list is obtained.
  • Step 2006B: the k-mer occurrence record table corresponding to each final pathogen operation group is obtained.
  • The generated pathogen operation group list is called to obtain the pathogen operation groups included in the sequencing data.
  • These pathogen operation groups are referred to as the final pathogen operation groups.
  • The k-mer occurrence record table of each final pathogen operation group is obtained, from which the list of k-mers recorded in the table can be read.
  • The k-mer occurrence record table records both the k-mers included in each pathogen operation group and their numbers of occurrences. Here only the list of k-mers is required, not the occurrence counts, so it is sufficient to read the k-mer list from the table.
  • Step 2006C: the table of complete-set proportions of k-mer occurrence counts corresponding to each final pathogen operation group is obtained.
  • Step 2006C: a k-mer union list is generated, that is, the k-mer lists of the pathogen operation groups are merged into one overall list.
  • Step 2006D: the total CT of the numbers of occurrences of the k-mers that have appeared in the sequencing data is obtained.
  • The data used here concerns k-mers in general rather than specific k-mers only, so the k-mers contained in each pathogen operating group, and the number of times each of them has appeared in the sequencing data, can be obtained.
  • The sum of the numbers of occurrences of the k-mers that appear in the sequencing data is CT.
  • Step 2006E: a table of actual k-mer occurrence counts corresponding to each final pathogen operation group is generated.
  • Step 2006F: the estimated total number CF of actual occurrences of the k-mers included in each final pathogen operation group is generated.
  • The target database also holds the table of complete-set proportions of k-mer occurrence counts for each pathogen operating group.
  • This table records the complete-set proportion of each k-mer contained in each pathogen operating group, and therefore of each k-mer included in the final pathogen operation groups.
  • The complete-set proportion of a k-mer is the ratio of the number of occurrences of the k-mer in the pathogen operation group to which it belongs to the number of occurrences of the k-mer in the complete set.
  • A table of actual k-mer occurrence counts can be generated for each final pathogen operation group.
  • The actual number of occurrences of a k-mer is the number of times it appears in the sequencing data: if a k-mer appears N times in the sequencing data, its actual number of occurrences is N.
  • The actual number of occurrences of each k-mer can be read from this table. Together with the complete-set proportion of each k-mer, it is used to calculate the estimated actual number of occurrences of each k-mer. Once the estimated actual occurrences of every k-mer in each final pathogen operation group are available, a table of estimated actual k-mer occurrences can be generated for each final pathogen operation group. From each of these tables, the estimated total number CF of actual occurrences in the sequencing data of all k-mers included in the corresponding pathogen operation group can be obtained.
  • A k-mer union list can be generated from the k-mers in the k-mer occurrence record tables, that is, the k-mer lists of the pathogen operation groups are merged into one overall list.
  • The k-mer union list contains the k-mers of the pathogen operation groups known to be included in the sequencing data.
  • Because the union list covers the k-mers of all pathogen operation groups known to have appeared in the sequencing data, the actual number of occurrences in the sequencing data of the k-mers of each final pathogen operation group can be obtained from it.
  • A table of actual k-mer occurrence counts can thus be generated for each final pathogen operation group for subsequent use. For example, from the complete-set proportion of each k-mer and the actual number of occurrences recorded in the table of actual k-mer occurrences, the estimated actual occurrences of each k-mer can be calculated, which yields the corresponding table of estimated actual k-mer occurrences.
  • Step 2006G: the relative concentration of each final pathogen operating group is calculated from the total number of occurrences CT and the estimated total number of actual occurrences CF.
  • The ratio of CF to CT is the relative concentration of the pathogen operating group.
  • CF is the estimated total number of occurrences in the sequencing data of the k-mers included in the pathogen operation group.
  • CF is an estimate, not an actually measured value.
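  • The calculation of CF and of the ratio CF / CT can be sketched in Python as follows. The text does not spell out how the estimated actual occurrences are derived from the actual count and the complete-set proportion; this sketch assumes the product (actual count × F), and the function names and inputs are hypothetical.

```python
def estimated_total_cf(actual_counts, proportions):
    """Sum the estimated actual occurrences over the k-mers of one group.

    actual_counts: hypothetical dict, k-mer -> actual occurrences in the sequencing data.
    proportions: hypothetical dict, k-mer -> complete-set proportion F for this group.
    Assumption: estimated occurrences of a k-mer = actual count * F.
    """
    total = 0.0
    for kmer, f in proportions.items():
        total += actual_counts.get(kmer, 0) * f  # k-mers absent from the data add 0
    return total


def relative_concentration(cf, ct):
    """Relative concentration of a group: the ratio of CF to CT."""
    if ct <= 0:
        raise ValueError("CT must be positive")
    return cf / ct


cf = estimated_total_cf({"ACG": 10, "CGT": 4}, {"ACG": 0.5, "CGT": 1.0, "GTA": 1.0})
concentration = relative_concentration(cf, 90)
```

  • Weighting by F attributes to the group only its share of the occurrences of a k-mer that is shared with other genomes of the complete set.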
  • In one mode, steps 2004 and 2006 run serially, that is, step 2006 is performed after step 2004 is completed.
  • Steps 2004 and 2006 may also run in a parallel or mixed mode.
  • In that case, step 2006A needs to be modified to call the list of pathogen operation groups contained in the complete set; after both step 2004 and step 2006 are completed, the corresponding relative concentration values calculated in step 2006 are retrieved according to the final pathogen operation group list obtained in step 2004.
  • This pathogen operating group detection method compares the sequencing data with the feature target sequences of each pathogen operating group to obtain the pathogen operating groups contained in the sequencing data. This reduces the comparison space, which shortens the analysis time and improves detection efficiency.
  • A corresponding feature target sequence set is established for each pathogen operating group.
  • The feature target sequence set of a pathogen operating group guarantees, within a certain probability range, that its feature target sequences exist only in the species within that pathogen operating group and not in other species. It therefore characterizes the pathogen operating group with high probability, which improves the accuracy of the detection result.
  • a method for detecting the relative concentration of a pathogen operating group including the following steps:
  • Step 2402: sequencing data of the sample is obtained, and the total CT of the numbers of occurrences of the k-mers that have appeared in the sequencing data is calculated.
  • Sample sequencing data refers to the data output by a device after the sequence of a biomolecule contained in a sample is read by a DNA sequencer, an RNA sequencer, or a protein sequencing device.
  • A set of sequencing data includes many (possibly more than several million) pieces of data to be tested, each of which can be abstracted as a string. A sample can be, for example, a drop of blood, a sputum specimen, or a handful of soil.
  • A sequencer is an instrument that can determine the sequence of an input sample. The sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA.
  • To calculate the concentration of the pathogens contained in the sample, the sample has to go through multiple steps such as sample preparation, sequencing, and signal processing, which makes it difficult to calculate the absolute concentration of the pathogen operating groups contained in the sample accurately. The relative concentration of the pathogens contained in the sample is therefore estimated instead.
  • The server obtains the sequencing data of the sample and calculates the total CT of the numbers of k-mer occurrences in the sequencing data. Specifically, if the sequencing data includes M pieces of data to be tested and each piece contains n characters, each piece contains n-k+1 k-mers. Adding up the k-mers contained in the M pieces, that is, M × (n-k+1), gives the total CT of the numbers of k-mer occurrences in the sequencing data.
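  • The total CT can be computed as in the following short Python sketch. The function name is an assumption made for this example; it also accepts reads of unequal length, treating reads shorter than k as contributing no k-mers, which is an assumption beyond the equal-length case described above.

```python
def total_kmer_occurrences(reads, k):
    """Total number of k-mer occurrences CT over all reads.

    Each read of n characters contributes n-k+1 k-mers; reads shorter than k
    contribute 0 (an assumption of this sketch).
    """
    return sum(max(len(read) - k + 1, 0) for read in reads)


# M = 3 reads of n = 10 characters each, with k = 4.
reads = ["ACGTACGTAC"] * 3
ct = total_kmer_occurrences(reads, 4)
```

  • For equal-length reads the result equals M × (n-k+1), here 3 × 7.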
  • Step 2404: the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups of the complete set is obtained.
  • The complete set is a set assembled in advance by collecting high-confidence genomes.
  • A high-confidence genome is a genome that meets a preset condition.
  • The high-confidence genomes include both pathogen genomes and non-pathogen genomes, for example high-confidence genomes of symbiotic bacteria, probiotics, humans, animals, and plants.
  • The number of occurrences GCT obtained by the server for the k-mers included in the pathogen operation groups of the complete set is an estimated value.
  • GCT can consist of two parts: the estimated total number of actual k-mer occurrences ECT of a pathogen operation group, and the confirmed number of actual k-mer occurrences CCT of that pathogen operation group. That is, GCT can be obtained by calculating ECT and CCT for each pathogen operation group in the complete set.
  • In this case, GCT is the sum of ECT and CCT for the k-mers included in the pathogen operation group of the complete set in the sequencing data.
  • GCT can also consist of one part only: when k in the k-mers included in the pathogen operation groups of the complete set is greater than a target value, GCT can be estimated entirely from the confirmed number of actual k-mer occurrences CCT of the pathogen operation group. That is, when k is large enough, GCT equals CCT, and ECT need not be calculated.
  • In step 2406, the relative concentration in the sequencing data of each pathogen operation group contained in the complete set is calculated according to the number of occurrences GCT and the total number of occurrences CT.
  • After the number of k-mer occurrences GCT of each pathogen operation group in the sequencing data is obtained, the ratio of GCT to the total number of k-mer occurrences in the sequencing data, CT, is calculated. This ratio is the relative concentration in the sequencing data of each pathogen operation group contained in the complete set.
  • In this embodiment, the total number of occurrences CT of the k-mers that have appeared in the sequencing data and the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set are obtained, and the relative concentration of each pathogen operation group is calculated from them. This can improve the accuracy of calculating the relative concentration in the sequencing data of the pathogen operation groups contained in the complete set.
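  • The ratio in step 2406 can be sketched as follows. The function name `relative_concentration` and the dictionary layout (pathogen operation group name mapped to its GCT) are assumptions made for illustration only.

```python
def relative_concentration(gct_by_group, ct):
    """Return GCT / CT for every pathogen operation group.

    gct_by_group maps a group name to its GCT; ct is the total number
    of k-mer occurrences in the sequencing data.
    """
    if ct <= 0:
        raise ValueError("CT must be positive")
    return {group: gct / ct for group, gct in gct_by_group.items()}


concentrations = relative_concentration({"group_a": 30, "group_b": 10}, 100)
```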
  • Before step 2402, the method further includes the steps of:
  • Collecting high-confidence genomic data to obtain the complete set, where a high-confidence genome refers to a genome that meets the preset conditions.
  • Identifying the pathogen operation groups included in the complete set.
  • The source of high-confidence genomic data can be the RefSeq dataset or other public or private high-confidence genomes.
  • The process of collecting high-confidence genomic data includes confirming the credibility of each genome and screening accordingly. That is, high-confidence genomes are selected according to the following criteria:
  • (1) The proportion of non-deterministic characters contained in a piece of genomic data. For example, for a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If this proportion is too high, the piece of data is a suspected low-confidence genome;
  • (2) Whole-genome sequence alignment against multiple genomes with a close genetic relationship (such as a genetic distance less than a certain threshold). The average whole-genome coverage percentage of the genome among its similar genomes is determined, and screening is then based on this percentage: genomes with too low an average coverage percentage are suspected of low completeness, that is, low confidence. After the suspected low-confidence and low-confidence genomes are removed, the remaining genomes are high-confidence genomes.
  • The pathogen operation groups corresponding to the high-confidence genomes in the complete set are then determined.
  • In this embodiment, high-confidence genomic data can be collected in advance to obtain the complete set and to determine the pathogen operation groups it contains, which facilitates subsequent direct use and improves efficiency.
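  • The screening by proportion of non-deterministic characters can be sketched as below for DNA genomes. The text does not give a cut-off, so the default `max_ambiguous=0.01` is purely a hypothetical value, as are the function names. The coverage-based screening is not shown because it depends on an external whole-genome aligner.

```python
def ambiguous_fraction(genome):
    """Return the proportion of non-ACGT characters in a DNA genome."""
    if not genome:
        raise ValueError("empty genome")
    seq = genome.upper()
    determinate = sum(seq.count(base) for base in "ACGT")
    return (len(seq) - determinate) / len(seq)


def is_suspected_low_confidence(genome, max_ambiguous=0.01):
    """Flag a genome whose non-ACGT proportion exceeds the cut-off.

    max_ambiguous is an assumed, illustrative threshold.
    """
    return ambiguous_fraction(genome) > max_ambiguous
```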
  • the method further includes:
  • The server can compute the k-mers that occur within each pathogen operation group contained in the complete set, and establish, for each pathogen operation group, a pathogen k-mer list of the k-mers that have appeared in that group.
  • From the pathogen k-mer lists of the individual pathogen operation groups, a total table of all k-mers that have appeared is then established, that is, a pathogen k-mer total table covering all pathogen operation groups in the complete set. The pathogen k-mer list of each pathogen operation group and the pathogen k-mer total table are stored in the database.
  • The k-mers here include not only the specific k-mers of the pathogen operation group, but all k-mers that have appeared in the genomes of the pathogen operation group. That is, the k-mer occurrence number record table records the k-mers contained in the pathogen operation group and the number of times each k-mer has occurred in the pathogen operation group. If a k-mer occurs x times in a genome, x counts are added to the counting unit corresponding to that k-mer in the k-mer occurrence number record table.
  • If the complete set contains M pathogen operation groups, then M k-mer occurrence number record tables are established, one for each pathogen operation group.
  • The k-mer occurrence number record table corresponding to each pathogen operation group is stored in the database.
  • A k-mer occurrence number record table for the complete set is established: the k-mer occurrence number record tables corresponding to the individual pathogen operation groups are used to calculate the number of occurrences in the complete set of each k-mer in the pathogen k-mer total table. The k-mer occurrence number record table corresponding to the complete set is stored in the database.
  • The number of occurrences CG of each k-mer contained in each pathogen operation group is obtained from the k-mer occurrence number record table corresponding to that pathogen operation group.
  • The number of occurrences CB in the complete set of each k-mer contained in each pathogen operation group is obtained from the k-mer occurrence number record table corresponding to the complete set. The ratio of CG to CB is calculated, giving the proportion F of the k-mer occurrences in each pathogen operation group relative to the k-mer occurrences in the complete set.
  • A proportion record table of the number of k-mer occurrences in each pathogen operation group relative to the number of k-mer occurrences in the complete set is established, and the proportion record table is stored in the database.
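  • The record tables described above can be sketched with in-memory counters. This is only an illustration under stated assumptions: the function names are hypothetical, every group of the complete set (pathogenic or not) is passed in one dictionary, and a real system would store the tables in a database rather than in memory.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, including repeated ones."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def group_tables(genomes_by_group, k):
    """Per-group k-mer occurrence number record tables (CG per k-mer)."""
    return {
        group: Counter(km for genome in genomes for km in kmers(genome, k))
        for group, genomes in genomes_by_group.items()
    }


def complete_set_table(tables):
    """Occurrence number record table of the complete set (CB per k-mer)."""
    total = Counter()
    for table in tables.values():
        total.update(table)
    return total


def proportion_tables(tables, total):
    """Proportion F = CG / CB for every k-mer of every group."""
    return {
        group: {km: cg / total[km] for km, cg in table.items()}
        for group, table in tables.items()
    }


tables = group_tables({"A": ["ACGT"], "B": ["ACGA", "CGAA"]}, 3)
total = complete_set_table(tables)
f_tables = proportion_tables(tables, total)
```

  • In this toy input the k-mer ACG occurs once in each group, so its proportion F is 0.5 in both, while CGT occurs only in group A and has F equal to 1.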
  • Step 2404 includes the following steps:
  • Step 2502: Obtain each first marker data in the sequencing data, calculate the number of k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data.
  • The first marker data refers to data to be tested in the sequencing data that contains specific k-mers of only one pathogen operation group; such data is marked as coming from that pathogen operation group. After each piece of data to be tested that contains specific k-mers of only one pathogen operation group is marked as coming from that group, the first marker data are obtained.
  • The server obtains each first marker data in the sequencing data and calculates the number of k-mers in each piece of data to be tested that has been marked as coming from a single pathogen operation group. The numbers of k-mers in the data marked as coming from the same pathogen operation group are summed, giving the confirmed total number of actual k-mer occurrences CCT of each pathogen operation group.
  • Step 2504: Obtain the complete-set proportion of each k-mer included in the pathogen operation groups contained in the complete set.
  • The complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set.
  • The database also stores, for each pathogen operation group in the complete set, a table of k-mer occurrence proportions in the complete set. This table covers each k-mer in the corresponding pathogen operation group.
  • Step 2506: Obtain each second marker data in the sequencing data, and obtain the actual number of occurrences, in each second marker data, of each k-mer included in the pathogen operation groups contained in the complete set.
  • The second marker data refers to data to be tested in the sequencing data that contains no specific k-mer of any pathogen operation group, or that contains specific k-mers belonging to multiple pathogen operation groups. Such data is marked as not belonging to any pathogen operation group.
  • The server obtains each second marker data in the sequencing data and, for each second marker data, identifies all k-mers it contains. According to the k-mer occurrence number record table corresponding to each pathogen operation group stored in the database in advance, the server finds the pathogen operation groups corresponding to all k-mers included in the second marker data, and obtains the actual number of occurrences in the second marker data of each k-mer included in those pathogen operation groups.
  • Step 2508: Calculate the estimated actual number of occurrences of each k-mer according to the complete-set proportion and the actual number of occurrences.
  • After the server obtains the complete-set proportions from the database and the actual numbers of occurrences, it calculates, for each k-mer included in a pathogen operation group contained in the complete set, the product of the k-mer's complete-set proportion and its actual number of occurrences in the second marker data. This gives the estimated actual number of occurrences in the second marker data of each k-mer included in each pathogen operation group.
  • Step 2510: Calculate the estimated total number of actual occurrences ECT according to the estimated actual number of occurrences of each k-mer.
  • After the estimated actual number of occurrences in the second marker data of each k-mer included in each pathogen operation group is calculated, the estimates for all k-mers of each pathogen operation group are summed to give the estimated total number of actual occurrences ECT of that group.
  • Step 2512: According to the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT, calculate the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set.
  • The sum of ECT and CCT is calculated to obtain the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set.
  • In this embodiment, the confirmed total number of actual occurrences CCT and the estimated total number of actual occurrences ECT are obtained, from which the number of occurrences GCT in the sequencing data of the k-mers included in each pathogen operation group can be estimated, so that the relative concentration can be calculated.
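  • Steps 2504 to 2512 can be sketched as follows. This is one possible reading rather than a definitive implementation: the function names are hypothetical, the complete-set proportions F are assumed to be available as a dictionary per pathogen operation group, and CCT is taken as an input.

```python
from collections import Counter


def estimate_ect(second_marker_reads, proportion_by_group, k):
    """Estimate ECT per group from reads not assigned to a single group.

    Each k-mer's actual count in the second marker data is multiplied by
    its complete-set proportion F for the group, and the products are
    summed over all k-mers of that group.
    """
    observed = Counter(
        read[i:i + k]
        for read in second_marker_reads
        for i in range(len(read) - k + 1)
    )
    return {
        group: sum(f * observed[km] for km, f in proportions.items())
        for group, proportions in proportion_by_group.items()
    }


def combine_gct(ect_by_group, cct_by_group):
    """GCT = ECT + CCT for every group present in either input."""
    groups = set(ect_by_group) | set(cct_by_group)
    return {
        g: ect_by_group.get(g, 0.0) + cct_by_group.get(g, 0) for g in groups
    }


proportions = {
    "A": {"ACG": 0.5, "CGT": 1.0},
    "B": {"ACG": 0.5, "CGA": 1.0},
}
ect = estimate_ect(["ACGA"], proportions, 3)
gct = combine_gct(ect, {"A": 4})
```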
  • In another embodiment, step 2404 includes the following steps:
  • Obtain each first marker data in the sequencing data, calculate the number of k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data.
  • The confirmed total number of actual occurrences CCT is taken as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set.
  • The first marker data refers to data to be tested in the sequencing data that contains specific k-mers of only one pathogen operation group; such data is marked as coming from that pathogen operation group. After each piece of data to be tested that contains specific k-mers of only one pathogen operation group is marked as coming from that group, the first marker data are obtained.
  • The server obtains each first marker data in the sequencing data and calculates the number of k-mers in each piece of data to be tested that has been marked as coming from a single pathogen operation group. The numbers of k-mers in the data marked as coming from the same pathogen operation group are summed, giving the confirmed total number of actual k-mer occurrences CCT of each pathogen operation group.
  • This applies when k of the k-mers included in the pathogen operation groups contained in the complete set is larger than the target value. The target value is determined through experiments; for example, k can be greater than 23 or 27.
  • In that case, the calculated confirmed total number of actual occurrences CCT can be used directly as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set, that is, GCT is equal to CCT. There is therefore no need to calculate the estimated total number of actual occurrences ECT, which improves the efficiency of calculating GCT.
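  • The marking of first marker data and the CCT calculation can be sketched as follows. The function name `confirmed_counts` and the representation of each pathogen operation group's specific k-mers as a Python set are assumptions for illustration.

```python
def confirmed_counts(reads, specific_kmers_by_group, k):
    """Return CCT per group.

    A read is first marker data when it contains specific k-mers of
    exactly one group; all n - k + 1 k-mers of such a read are then
    credited to that group.
    """
    cct = {}
    for read in reads:
        read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        hits = [
            group
            for group, specific in specific_kmers_by_group.items()
            if read_kmers & specific
        ]
        if len(hits) == 1:
            cct[hits[0]] = cct.get(hits[0], 0) + len(read) - k + 1
    return cct


specific = {"A": {"AAA"}, "B": {"CCC"}}
reads = ["AAAT", "TAAA", "CCCG", "AAACCC", "GGGG"]
cct = confirmed_counts(reads, specific, 3)
```

  • In this toy input the read AAACCC carries specific k-mers of both groups and GGGG carries none, so both fall into the second marker data and contribute nothing to CCT.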
  • the method further includes:
  • Step 2602: Obtain the pathogen operation groups contained in the sequencing data, and obtain the relative concentration of each of them.
  • The server may obtain the pathogen operation groups contained in the sequencing data, which may have been determined by detecting the sequencing data in advance. Each such pathogen operation group is looked up in the complete set; when it is found, its relative concentration in the sequencing data is retrieved, giving the relative concentration of each pathogen operation group contained in the sequencing data.
  • Step 2604: Select the pathogen operation groups whose relative concentration is higher than a preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data.
  • The server selects the pathogen operation groups whose relative concentration is higher than the preset threshold as the confirmed pathogen operation groups contained in the sequencing data, and sends the confirmed pathogen operation groups to the terminal for display.
  • In this embodiment, the relative concentration of the pathogen operation groups contained in the sequencing data is used to further determine which pathogen operation groups the sequencing data contains, which further improves the accuracy of the result.
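  • Steps 2602 and 2604 amount to a lookup followed by a threshold filter, sketched below. The function name `confirm_groups` and the example threshold are hypothetical; groups that are not found in the concentration table are treated here as having concentration 0.

```python
def confirm_groups(detected_groups, concentration_by_group, threshold):
    """Keep detected groups whose relative concentration exceeds threshold."""
    return [
        group
        for group in detected_groups
        if concentration_by_group.get(group, 0.0) > threshold
    ]


confirmed = confirm_groups(
    ["A", "B", "C"], {"A": 0.30, "B": 0.001, "D": 0.5}, threshold=0.01
)
```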
  • It should be understood that although the steps in the flowcharts of FIG. 1 to FIG. 26 are displayed sequentially according to the directions of the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated in this document, the execution order of these steps is not strictly limited, and the steps can be performed in other orders. Moreover, at least some of the steps in these figures may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time, but may be performed at different times. Their execution order is not necessarily sequential either; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
  • a detection device for a pathogen operating group is provided. As shown in FIG. 27, the device includes:
  • the sequencing data acquisition module 2702 is configured to acquire sequencing data of a sample.
  • The target sequence set acquisition module 2704 is configured to obtain the characteristic target sequence set corresponding to each pathogen operation group stored in the target database. The characteristic target sequence set includes the specific k-mers in the pathogen operation group that satisfy preset specificity conditions, where a k-mer refers to a genomic sequence of length k.
  • The specific k-mer occurrence number acquisition module 2706 is configured to obtain the number of occurrences in the sequencing data of the specific k-mers included in the characteristic target sequence set corresponding to each pathogen operation group.
  • The pathogen operation group selection module 2708 is configured to select a pathogen operation group whose number of occurrences exceeds a preset number of times as a pathogen operation group included in the sequencing data.
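  • The cooperation of modules 2702 to 2708 can be sketched in a few lines. This is a simplified illustration only: the function name `detect_groups` is hypothetical, the characteristic target sequence sets are plain Python sets instead of database records, and the occurrences of all specific k-mers of a group are summed before comparison with the preset number.

```python
from collections import Counter


def detect_groups(reads, target_sets, k, min_occurrences):
    """Return groups whose specific k-mers occur more than min_occurrences times."""
    observed = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    hits = {
        group: sum(observed[km] for km in specific)
        for group, specific in target_sets.items()
    }
    return [group for group, n in hits.items() if n > min_occurrences]


targets = {"A": {"AAA"}, "B": {"CCC"}, "C": {"GGG"}}
detected = detect_groups(["AAAT", "TAAA", "CCCG"], targets, 3, min_occurrences=1)
```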
  • The detection device of the pathogen operation group further includes: a specific k-mer selection module for selecting the k-mers that satisfy a preset specificity condition from the k-mers corresponding to each pathogen operation group; and a specific k-mer storage module, configured to store the k-mers that meet the preset specificity condition into the characteristic target sequence set corresponding to each pathogen operation group.
  • The detection device of the pathogen operation group further includes: a complete set acquisition module for collecting high-confidence genomic data to obtain the complete set, where a high-confidence genome refers to a genome that satisfies a preset condition; and a pathogen operation group determination module for determining the pathogen operation groups contained in the complete set.
  • The detection device of the pathogen operation group further includes: a CT calculation module for obtaining the sequencing data of the sample and calculating the total number of k-mer occurrences CT in the sequencing data; a GCT acquisition module for obtaining the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set; and a concentration calculation module for calculating the relative concentration in the sequencing data of the pathogen operation groups contained in the complete set based on the number of occurrences GCT and the total number of occurrences CT.
  • The detection device of the pathogen operation group further includes: a concentration obtaining module for obtaining the relative concentration of the pathogen operation groups contained in the sequencing data according to the relative concentration in the sequencing data of the pathogen operation groups contained in the complete set; and the pathogen operation group selection module, used to select a pathogen operation group whose relative concentration is higher than a preset threshold as a pathogen operation group confirmed to be contained in the sequencing data.
  • the GCT acquisition module includes:
  • a CCT calculation module, used to obtain each first marker data in the sequencing data, calculate the number of k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data;
  • a proportion acquisition module, used to obtain the complete-set proportion of each k-mer included in the pathogen operation groups contained in the complete set, where the complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set;
  • an actual occurrence number acquisition module, used to obtain each second marker data in the sequencing data, and to obtain the actual number of occurrences, in each second marker data, of each k-mer included in the pathogen operation groups contained in the complete set;
  • an occurrence number estimation module, used to calculate the estimated actual number of occurrences of each k-mer according to the complete-set proportion and the actual number of occurrences.
  • In another embodiment, the GCT acquisition module includes: a CCT calculation module for obtaining each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data; and a length judgment module, used to take the confirmed total number of actual occurrences CCT as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the complete set when the length k of those k-mers is greater than the target value.
  • The specific k-mer refers to a k-mer in the pathogen operation group whose number of occurrences in the genome occurrence number index table of the pathogen operation group meets a preset error condition.
  • A specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence number index table of the pathogen operation group satisfies the first preset error condition; and its number of occurrences in the genome occurrence number index table of the pathogen operation group together with its number of occurrences in the genome occurrence number index table of the complete set satisfies the second preset error condition. The genome occurrence number index table of the pathogen operation group records, for each k-mer, the number of genomes in the pathogen operation group that contain the k-mer; the genome occurrence number index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain the k-mer.
  • The first preset error condition is: the ratio of the number of occurrences in the genome occurrence number index table of the pathogen operation group to the number of genomes contained in the pathogen operation group, plus the first threshold, is greater than or equal to 1.
  • The first threshold is less than 5%.
  • The second preset error condition is: the ratio of the number of occurrences in the genome occurrence number index table of the pathogen operation group to the number of occurrences in the genome occurrence number index table of the complete set, plus the second threshold, is greater than or equal to 1.
  • The second threshold is less than 5%.
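  • The two preset error conditions can be checked as in the sketch below. The function name `is_specific` and its parameter names are hypothetical, and the default thresholds of 0.02 are merely example values chosen below the 5% bound stated above.

```python
def is_specific(in_group, group_size, in_complete_set,
                first_threshold=0.02, second_threshold=0.02):
    """Check both preset error conditions for one k-mer.

    in_group:        genomes of the group that contain the k-mer
    group_size:      genomes contained in the group
    in_complete_set: genomes of the complete set that contain the k-mer
    """
    # Condition 1: the k-mer is present in (almost) every genome of the group.
    coverage_ok = in_group / group_size + first_threshold >= 1
    # Condition 2: (almost) every genome carrying the k-mer is in the group.
    exclusivity_ok = in_group / in_complete_set + second_threshold >= 1
    return coverage_ok and exclusivity_ok
```

  • For example, a k-mer found in all 10 genomes of a group and in no other genome of the complete set passes both conditions, while a k-mer found in only 5 of the 10 genomes fails the first one.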
  • The detection device of the pathogen operation group further includes a characteristic target sequence set generation module 2710, which is used to generate a genome occurrence number index table corresponding to each pathogen operation group. This table records, for each k-mer, the number of genomes in the pathogen operation group that contain the k-mer. The genome occurrence number index table is stored in the characteristic target sequence set corresponding to the pathogen operation group.
  • The characteristic target sequence set generation module 2710 is further configured to select the k-mers that satisfy a preset specificity condition from the k-mers corresponding to each pathogen operation group, and to store the k-mers that satisfy the preset specificity condition in the characteristic target sequence set corresponding to each pathogen operation group.
  • The characteristic target sequence set generation module 2710 is further configured to generate a genome occurrence number index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer. The genome occurrence number index table of the complete set is stored in the target database.
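  • A genome occurrence number index table counts genomes, not individual occurrences: a k-mer that is repeated inside one genome still counts once for that genome. A minimal sketch, with the hypothetical function name `genome_occurrence_index`, is shown below; the same function can be applied to the genomes of one pathogen operation group or to all genomes of the complete set.

```python
from collections import Counter


def genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it."""
    index = Counter()
    for genome in genomes:
        # A set removes repeats, so each genome adds at most 1 per k-mer.
        index.update({genome[i:i + k] for i in range(len(genome) - k + 1)})
    return index


index = genome_occurrence_index(["AAAA", "AAAT"], 3)
```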
  • The pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences in the sequencing data of each specific k-mer included in each pathogen operation group; select the pathogen operation groups corresponding to the specific k-mers whose actual number of occurrences in the sequencing data exceeds a first preset number of times, obtaining the suspected pathogen operation groups; and select, from the suspected pathogen operation groups, a pathogen operation group that meets a preset condition as a pathogen operation group included in the sequencing data.
  • The pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences in the sequencing data of the specific k-mers included in the pathogen operation group, and to generate, according to the actual numbers of occurrences, a record table of actual occurrences of the specific k-mers corresponding to the pathogen operation group.
  • The pathogen operation group selection module 2708 is further configured to obtain the specific k-mers whose actual number of occurrences is greater than a preset threshold of actual occurrences, as the specific k-mers confirmed to appear; calculate the false positive probability of each pathogen operation group according to the number of confirmed specific k-mers corresponding to each pathogen operation group; and select a pathogen operation group with a false positive probability lower than a preset standard probability as a pathogen operation group included in the sequencing data.
  • The pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences of each specific k-mer confirmed to appear; calculate the actual occurrence proportion of each confirmed specific k-mer, which is the ratio of its actual number of occurrences to the sum of the actual numbers of occurrences of all confirmed specific k-mers; and, from the pathogen operation groups whose false positive probability is lower than the preset standard probability, select the pathogen operation groups corresponding to the confirmed specific k-mers whose actual occurrence proportion matches the expected occurrence proportion, as the pathogen operation groups included in the sequencing data.
  • The pathogen operation group selection module 2708 is further configured to, when several of the specific k-mers confirmed to appear belong to the same non-overlapping specific region, treat the specific k-mers belonging to that region as one and the same specific k-mer.
  • A non-overlapping specific region contains specific k-mers in which the number of overlapping characters between any two of them meets a preset overlap threshold. After the number of confirmed specific k-mers has been determined per specific region, the false positive probability of each pathogen operation group is calculated based on that number.
  • The characteristic target sequence set generation module 2710 is further configured to obtain the number of occurrences of each specific k-mer included in the pathogen operation group in the genomes contained in the pathogen operation group; calculate the occurrence proportion of each specific k-mer, which is the ratio of the number of occurrences of that specific k-mer to the sum of the numbers of occurrences of the specific k-mers included in the pathogen operation group; generate a specific k-mer occurrence proportion table corresponding to the pathogen operation group according to the number of occurrences and the proportion corresponding to each specific k-mer included in the pathogen operation group; and store the specific k-mer occurrence proportion table in the characteristic target sequence set corresponding to the pathogen operation group.
  • The characteristic target sequence set generation module 2710 is further configured to obtain the number of occurrences CG in the pathogen operation group of each k-mer included in the pathogen operation group; obtain the number of occurrences CB in the complete set of each k-mer included in the pathogen operation group; calculate the complete-set proportion F of each k-mer as the ratio of CG to CB; generate, according to F, a table of k-mer occurrence proportions in the complete set corresponding to each pathogen operation group; and store that table in the characteristic target sequence set corresponding to the pathogen operation group.
  • The pathogen operation group selection module 2708 is further configured to obtain the specific k-mers whose actual number of occurrences is greater than a preset threshold of actual occurrences, as the specific k-mers confirmed to appear; obtain the false positive distribution, in the corresponding simulated test data, of the specific k-mers included in each pathogen operation group; compare the number of kinds of confirmed specific k-mers corresponding to each pathogen operation group with that false positive distribution to obtain the false positive detection probability of each pathogen operation group; and select a pathogen operation group with a false positive detection probability lower than a preset threshold as a pathogen operation group included in the sequencing data.
  • The pathogen operation group selection module 2708 is further configured to: obtain the simulated test data corresponding to each pathogen operation group, where the simulated test data is obtained by random sampling from the genomes in the complete set that do not belong to the pathogen operation group corresponding to the simulated test data; calculate the number of kinds of specific k-mers included in each pathogen operation group that appear in the corresponding simulated test data; and obtain, from that number of kinds, the false positive distribution in the corresponding simulated test data of the specific k-mers included in each pathogen operation group.
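  • The text does not say how the confirmed count is compared with the false positive distribution. The sketch below assumes an empirical tail probability: the share of simulated data sets that contain at least as many kinds of specific k-mers as were confirmed in the real data. All function names, the sampling scheme, and the parameters `read_len`, `n_reads`, `n_trials`, and `seed` are hypothetical, and every background genome is assumed to be at least `read_len` characters long.

```python
import random


def count_specific_kinds(reads, specific_kmers, k):
    """Number of distinct specific k-mers found in the reads."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in specific_kmers:
                seen.add(kmer)
    return len(seen)


def simulate_counts(background_genomes, specific_kmers, k,
                    read_len, n_reads, n_trials, seed=0):
    """Sample reads from genomes outside the group and count kinds per trial."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        reads = []
        for _ in range(n_reads):
            genome = rng.choice(background_genomes)
            start = rng.randrange(len(genome) - read_len + 1)
            reads.append(genome[start:start + read_len])
        counts.append(count_specific_kinds(reads, specific_kmers, k))
    return counts


def false_positive_probability(observed_kinds, simulated_counts):
    """Share of simulated trials with at least observed_kinds kinds."""
    at_least = sum(1 for c in simulated_counts if c >= observed_kinds)
    return at_least / len(simulated_counts)
```

  • A pathogen operation group would then be kept when `false_positive_probability` falls below the preset threshold.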
  • The pathogen operation group selection module 2708 is further configured to: obtain the specific k-mers whose actual number of occurrences is greater than a preset threshold of actual occurrences, as the specific k-mers confirmed to appear, and obtain the number of kinds of confirmed specific k-mers; obtain the number of kinds of specific k-mers included in each pathogen operation group; calculate, from the number of kinds of confirmed specific k-mers and the number of kinds of specific k-mers included in each pathogen operation group, the concentration of confirmed specific k-mers of each pathogen operation group; and select a pathogen operation group whose concentration of confirmed specific k-mers is higher than a preset threshold as a pathogen operation group included in the sequencing data.
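  • The text names the two inputs of the concentration but not the formula. The sketch below assumes the simplest reading, the confirmed kinds divided by all kinds of specific k-mers of the group; the function name `specific_kmer_concentration` is hypothetical.

```python
def specific_kmer_concentration(observed_counts, specific_kmers, min_occurrences):
    """Share of a group's specific k-mers that are confirmed to appear.

    observed_counts maps a k-mer to its actual number of occurrences in
    the sequencing data; a specific k-mer is confirmed when that number
    is greater than min_occurrences.
    """
    if not specific_kmers:
        raise ValueError("the group has no specific k-mers")
    confirmed = sum(
        1 for kmer in specific_kmers
        if observed_counts.get(kmer, 0) > min_occurrences
    )
    return confirmed / len(specific_kmers)


concentration = specific_kmer_concentration(
    {"AAA": 5, "AAT": 1}, {"AAA", "AAT", "ATT", "TTT"}, min_occurrences=1
)
```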
  • The detection device of the pathogen operation group further includes a pathogen operation group list generation module (not shown in the figure), which is used to obtain the medical information of each pathogen operation group included in the sequencing data from the target database; generate the final pathogen operation group list according to the medical information of each pathogen operation group included in the sequencing data; and output the final pathogen operation group list to the detection results of the sequencing data.
  • The detection device of the pathogen operation group further includes a relative concentration calculation module 2712, configured to obtain the total number of occurrences CT of the k-mers that have appeared in the sequencing data; obtain the estimated total number of actual occurrences CF of the k-mers included in the pathogen operation groups contained in the sequencing data, where CF is the sum of the estimated actual numbers of occurrences in the sequencing data of each k-mer included in the pathogen operation group; and calculate the relative concentration of each pathogen operation group from the total number of occurrences CT and the estimated total number of actual occurrences CF.
  • the relative concentration calculation module 2712 is further configured to: obtain the pathogen operation groups contained in the sequencing data as the final pathogen operation groups; obtain the complete-set proportion of each k-mer included in each final pathogen operation group, the complete-set proportion being the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set; calculate the estimated actual number of occurrences of each k-mer according to its complete-set proportion and its actual number of occurrences; and sum the estimated actual numbers of occurrences of the k-mers included in the pathogen operation groups contained in the sequencing data to obtain the estimated total number of actual occurrences CF.
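A small Python sketch of how CF and the relative concentration might be computed is given here. The text only says that the estimate is calculated "according to" the proportion and the actual count, and that the relative concentration is calculated "from" CT and CF. The product `actual * CG / CB` and the ratio `CF / CT` used below are therefore assumptions, and all function names are hypothetical.

```python
from typing import Dict


def estimated_total_occurrences(
    actual: Dict[str, int],  # actual occurrences of each k-mer in the sequencing data
    cg: Dict[str, int],      # occurrences of each k-mer in the pathogen operation group (CG)
    cb: Dict[str, int],      # occurrences of each k-mer in the complete set (CB)
) -> float:
    """Sum the estimated actual occurrences of a group's k-mers (CF).

    Assumption: estimated occurrences = actual occurrences * (CG / CB).
    """
    cf = 0.0
    for kmer, observed in actual.items():
        total_in_complete_set = cb.get(kmer, 0)
        if total_in_complete_set == 0:
            continue  # k-mer unknown to the complete set: no proportion available
        proportion = cg.get(kmer, 0) / total_in_complete_set
        cf += observed * proportion
    return cf


def relative_concentration(cf: float, ct: int) -> float:
    """Assumed form of the relative concentration: CF divided by CT."""
    return cf / ct if ct > 0 else 0.0
```

Under this assumption, a k-mer shared between two groups contributes to each group only in proportion to how often it occurs in that group within the complete set.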
  • the relative concentration calculation module 2712 is further configured to: obtain the actual number of occurrences in the sequencing data of each k-mer included in the pathogen operation group; and generate, based on these actual numbers of occurrences, a table of actual k-mer occurrences corresponding to each pathogen operation group.
  • the relative concentration calculation module 2712 is further configured to: obtain the k-mer occurrence number record table corresponding to each final pathogen operation group; obtain the k-mers included in the k-mer occurrence number record table; obtain the complete-set proportion table of k-mer occurrences corresponding to each final pathogen operation group, which contains, for each k-mer sequence, its number of occurrences in the pathogen operation group and its number of occurrences in the complete set; and obtain, from the complete-set proportion table, the complete-set proportion of each k-mer in the k-mer occurrence number record table.
  • the generating module 2710 of the feature target sequence set is further configured to: obtain the number of occurrences, within the pathogen operation group, of each k-mer included in that pathogen operation group; generate a k-mer occurrence number record table corresponding to each pathogen operation group according to these numbers of occurrences; and store the k-mer occurrence number record table in the feature target sequence set corresponding to the pathogen operation group.
  • the relative concentration calculation module 2712 is further configured to obtain, from the k-mer occurrence number record table corresponding to the pathogen operation group, the number of occurrences CG within the pathogen operation group of each k-mer included in that pathogen operation group.
  • the generating module 2710 of the feature target sequence set is further configured to: obtain the number of occurrences in the complete set of each k-mer included in each pathogen operation group; generate a complete-set k-mer occurrence number record table according to the number of occurrences of each k-mer in the complete set; and store the complete-set k-mer occurrence number record table in the target database.
  • the relative concentration calculation module 2712 is further configured to obtain, from the complete-set k-mer occurrence number record table stored in the target database, the number of occurrences CB in the complete set of each k-mer included in the pathogen operation group.
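The two record tables described above (per-group occurrences CG and complete-set occurrences CB) can be illustrated with a short Python sketch. It assumes that each pathogen operation group is given as a list of genome strings and that the complete set is simply the union of all groups. The storage format and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_occurrence_tables(
    group_genomes: Dict[str, List[str]], k: int
) -> Tuple[Dict[str, Counter], Counter]:
    """Build one k-mer occurrence table per group (CG) and one for the complete set (CB)."""
    group_tables: Dict[str, Counter] = {}
    complete_set_table: Counter = Counter()
    for group, genomes in group_genomes.items():
        table: Counter = Counter()
        for genome in genomes:
            # Slide a window of length k over the genome and count every k-mer.
            for i in range(len(genome) - k + 1):
                table[genome[i:i + k]] += 1
        group_tables[group] = table
        # The complete-set count of a k-mer is the sum of its counts over all groups.
        complete_set_table.update(table)
    return group_tables, complete_set_table
```

With these tables, the complete-set proportion of a k-mer for a given group is `group_tables[group][kmer] / complete_set_table[kmer]`.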
  • the relative concentration calculation module 2712 is further configured to: obtain the k-mers included in each final pathogen operation group according to the k-mer occurrence number record table corresponding to that final pathogen operation group; generate a union list of k-mers according to the k-mer occurrence number record tables corresponding to the pathogen operation groups; obtain the actual number of occurrences in the sequencing data of each k-mer included in the union list; obtain the actual numbers of occurrences corresponding to the k-mers included in each final pathogen operation group; and generate, according to the actual number of occurrences corresponding to each k-mer, a table of actual k-mer occurrences corresponding to the final pathogen operation group.
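The union-list procedure can be sketched as follows. This is a simplified illustration under stated assumptions: reads are plain strings, k-mers are counted on the forward strand only (no reverse-complement handling), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer of length k across all sequencing reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def actual_occurrence_tables(
    group_tables: Dict[str, Dict[str, int]], sequencing_counts: Dict[str, int]
) -> Dict[str, Dict[str, int]]:
    """Look up actual counts once for the union of k-mers, then split them per group."""
    # Union list of the k-mers that occur in any group's record table.
    union: Set[str] = set()
    for table in group_tables.values():
        union.update(table)
    # One lookup per distinct k-mer, even when several groups share it.
    union_counts = {kmer: sequencing_counts.get(kmer, 0) for kmer in union}
    return {
        group: {kmer: union_counts[kmer] for kmer in table}
        for group, table in group_tables.items()
    }
```

Building the union first means a k-mer shared by several groups is queried against the sequencing data only once.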
  • a device for detecting the relative concentration of a pathogen operating group includes:
  • a sequencing data acquisition module, which is used to obtain the sequencing data of the sample and calculate the total number of k-mer occurrences CT in the sequencing data;
  • an occurrence number GCT acquisition module, which is used to obtain the number of occurrences GCT, in the sequencing data, of the k-mers included in the pathogen operation groups contained in the full set; and
  • a relative concentration calculation module, which is used to calculate the relative concentration, in the sequencing data, of the pathogen operation groups contained in the full set according to the number of occurrences GCT and the total number of occurrences CT.
  • the detection device for the relative concentration of the pathogen operating group is further configured to: obtain high-confidence genome data to form the complete set, where a high-confidence genome is a genome that meets a preset condition; and determine the pathogen operation groups contained in the complete set.
  • the above-mentioned occurrence number GCT acquisition module is further configured to: obtain each first marker data in the sequencing data, calculate the number of k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data; obtain the full-set proportion of each k-mer included in the pathogen operation groups contained in the full set, the full-set proportion being the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the full set; obtain each second marker data in the sequencing data, and obtain the actual number of occurrences, in each second marker data, of each k-mer included in the pathogen operation groups contained in the full set; calculate the estimated actual number of occurrences of each k-mer according to the full-set proportion and the actual number of occurrences; and calculate the estimated total number of actual occurrences according to the estimated actual number of occurrences of each k-mer.
  • the GCT acquisition module includes: a CCT calculation module, which is used to obtain each first marker data in the sequencing data, calculate the number of k-mers contained in each first marker data, and calculate the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data; and a length judgment module, which is used to take, when the length of the k-mers included in the pathogen operation groups contained in the full set is greater than the target value, the confirmed total number of actual occurrences CCT as the number of occurrences GCT in the sequencing data of the k-mers included in the pathogen operation groups contained in the full set.
  • the device for detecting the relative concentration of the pathogen operation group is further configured to: obtain the pathogen operation groups included in the sequencing data, and obtain the relative concentration of each of these pathogen operation groups; and select each pathogen operation group whose relative concentration is higher than a preset threshold as a pathogen operation group confirmed to be included in the sequencing data.
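The final confirmation step might look like the Python sketch below. It assumes the relative concentration of a group is GCT divided by CT, a ratio that the text implies but does not spell out. The function name and the dictionary layout are hypothetical.

```python
from typing import Dict


def confirm_groups_by_relative_concentration(
    gct_by_group: Dict[str, float],  # occurrences GCT of each group's k-mers in the data
    ct: int,                         # total number of k-mer occurrences CT in the data
    threshold: float,                # preset relative concentration threshold
) -> Dict[str, float]:
    """Keep only the groups whose relative concentration exceeds the threshold.

    Assumption: relative concentration = GCT / CT.
    """
    if ct <= 0:
        return {}  # no k-mers were observed, so nothing can be confirmed
    concentrations = {group: gct / ct for group, gct in gct_by_group.items()}
    return {
        group: value for group, value in concentrations.items() if value > threshold
    }
```

For example, with `ct = 1000` and a threshold of `0.01`, a group with `GCT = 50` is kept and a group with `GCT = 1` is dropped.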
  • each module in the detection device of the pathogen operation group and the detection device of the relative concentration of the pathogen operation group may be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 31.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the database of the computer equipment is used to store the data of the target database, such as the characteristic target sequence set corresponding to each pathogen operating group.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • when the computer-readable instructions are executed by the processor, a method for detecting a pathogen operating group and a method for detecting a relative concentration of the pathogen operating group are implemented.
  • FIG. 31 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • the specific computer equipment may include more or fewer parts than shown in the figure, or combine certain parts, or have a different arrangement of parts.
  • a computer device includes a memory and one or more processors.
  • Computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the steps of the method for detecting the pathogen operating group and the method for detecting the relative concentration of the pathogen operating group provided in any one of the embodiments of the present application.
  • one or more non-volatile storage media storing computer-readable instructions are provided.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the method for detecting the pathogen operating group and the method for detecting the relative concentration of the pathogen operating group provided in any one of the embodiments of the present application.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention relates to a method for detecting a pathogen operation group, comprising the steps of: acquiring the sequencing data of a sample; acquiring the characteristic target sequence set corresponding to each pathogen operation group stored in a target database, the characteristic target sequence set comprising specific k-mers that satisfy preset specificity conditions in the pathogen operation group, a k-mer being a genomic sequence of length k; acquiring the number of occurrences in the sequencing data of the specific k-mers included in the characteristic target sequence set corresponding to each pathogen operation group; and selecting each pathogen operation group whose number of occurrences exceeds a preset threshold as a pathogen operation group included in the sequencing data.
PCT/CN2019/087580 2018-06-22 2019-05-20 Method, device, computer equipment, and storage medium for detecting a pathogen operation group WO2019242445A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201810649271 2018-06-22
CN201810649271.8 2018-06-22
CN201910316653.3A CN109949866B (zh) 2018-06-22 2019-04-19 Method, device, computer equipment, and storage medium for detecting a pathogen operation group
CN201910316653.3 2019-04-19

Publications (1)

Publication Number Publication Date
WO2019242445A1 (fr) 2019-12-26

Family

ID=67015792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087580 WO2019242445A1 (fr) 2018-06-22 2019-05-20 Method, device, computer equipment, and storage medium for detecting a pathogen operation group

Country Status (2)

Country Link
CN (1) CN109949866B (fr)
WO (1) WO2019242445A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021226522A1 (fr) * 2020-05-08 2021-11-11 Illumina, Inc. Genome sequencing and detection techniques

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
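CN110473594B (zh) * 2019-08-22 2020-05-05 广州微远基因科技有限公司 Pathogenic microorganism genome database and method for establishing same
CN115578554B (zh) * 2021-06-21 2024-02-02 数坤(上海)医疗科技有限公司 Blood vessel lesion identification method, device, electronic equipment, and readable storage medium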

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
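CN101233240A (zh) * 2004-03-26 2008-07-30 斯昆诺有限公司 Base-specific cleavage of methylation-specific amplification products in combination with mass analysis
CN102439591A (zh) * 2009-02-25 2012-05-02 特拉华大学 Systems and methods for identifying structurally or functionally significant amino acid sequences
CN107076729A (zh) * 2014-10-16 2017-08-18 康希尔公司 Variant caller
CN107532332A (zh) * 2015-04-24 2018-01-02 犹他大学研究基金会 Methods and systems for multiple taxonomic classification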

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060095241A1 (en) * 2004-10-29 2006-05-04 Microsoft Corporation Systems and methods that utilize machine learning algorithms to facilitate assembly of aids vaccine cocktails
US20140188397A1 (en) * 2011-05-17 2014-07-03 Bgi Tech Solutions Co., Ltd. Methods of acquiring genome size and error
CN102332064B (zh) * 2011-10-07 2013-11-06 吉林大学 Biological species identification method based on gene barcodes
US20140106974A1 (en) * 2012-10-15 2014-04-17 Synblex, Llc Pathogen identification process and transport container
CN105441432B (zh) * 2014-09-05 2019-05-28 天津华大基因科技有限公司 Composition and its use in sequence determination and variation detection
US20160273049A1 (en) * 2015-03-16 2016-09-22 Personal Genome Diagnostics, Inc. Systems and methods for analyzing nucleic acid
PE20181137A1 (es) * 2015-09-25 2018-07-17 Contextual Genomics Inc Molecular quality assurance methods for use in sequencing
CN107133493B (zh) * 2016-02-26 2020-01-14 中国科学院数学与系统科学研究院 Genome sequence assembly method, structural variation detection method, and corresponding systems
CN106484865A (zh) * 2016-10-10 2017-03-08 哈尔滨工程大学 Retrieval algorithm based on a four-character linked-list trie for the DNA k-mer index problem
CN106971088A (zh) * 2017-03-28 2017-07-21 泽塔生物科技(上海)有限公司 Molecular identification method and system for components of eukaryotic origin

Also Published As

Publication number Publication date
CN109949866A (zh) 2019-06-28
CN109949866B (zh) 2021-02-02

Similar Documents

Publication Publication Date Title
US11702708B2 (en) Systems and methods for analyzing viral nucleic acids
Lowe et al. Transcriptomics technologies
Harvey et al. QuASAR: quantitative allele-specific analysis of reads
JP7051900B2 (ja) 不均一分子長を有するユニーク分子インデックスセットの生成およびエラー補正のための方法およびシステム
Gupta et al. Hierarchical clustering can identify B cell clones with high confidence in Ig repertoire sequencing data
US20230114581A1 (en) Systems and methods for predicting homologous recombination deficiency status of a specimen
WO2020168008A1 (fr) Structure intégrée d'apprentissage automatique pour estimer une déficience de recombinaison homologue
US20200098448A1 (en) Methods of normalizing and correcting rna expression data
WO2019023517A2 (fr) Classificateur de séquençage génomique
KR101828052B1 (ko) 유전자의 복제수 변이(cnv)를 분석하는 방법 및 장치
CN110770838A (zh) 用于确定体细胞突变克隆性的方法和系统
EP3405573A1 (fr) Procédés et systèmes de séquençage haute fidélité
WO2019242445A1 (fr) Procédé de détection, dispositif, équipement d'ordinateur et support d'informations de groupe d'opérations pathogènes
EP3590058A1 (fr) Systèmes et procédés d'analyse métagénomique
JP2016518822A (ja) アセンブルされていない配列情報、確率論的方法、及び形質固有(trait−specific)のデータベースカタログを用いた生物材料の特性解析
WO2019133937A1 (fr) Détection d'instabilité de microsatellites
CN112259167A (zh) 基于高通量测序的病原体分析方法、装置和计算机设备
Rajaby et al. SurVIndel: improving CNV calling from high-throughput sequencing data through statistical testing
WO2019242187A1 (fr) Procédé et appareil de détection de variations du nombre de copies de chromosome, et milieu de stockage
JP2018504669A (ja) 非コード−コード遺伝子共発現ネットワークを生成する方法及びシステム
Brothers II et al. Integrity, standards, and QC-related issues with big data in pre-clinical drug discovery
Gouda et al. Computational Tools for Whole Genome and Metagenome Analysis of NGS Data for Microbial Diversity Studies
Marić et al. Approaches to metagenomic classification and assembly
AU2018375008B2 (en) Methods and systems for determining somatic mutation clonality
Gollwitzer et al. MetaFast: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19823271

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/05/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19823271

Country of ref document: EP

Kind code of ref document: A1