CN109949866B - Method and device for detecting pathogen operation group, computer equipment and storage medium - Google Patents

Method and device for detecting pathogen operation group, computer equipment and storage medium

Publication number
CN109949866B
Authority
CN
China
Prior art keywords
pathogen
operation group
mers
contained
mer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910316653.3A
Other languages
Chinese (zh)
Other versions
CN109949866A (en)
Inventor
孙亚洲
郭雨舜
杜晓骏
陈斌
杜刘稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Diagnoa Genomics Technology Co ltd
Original Assignee
Shenzhen Diagnoa Genomics Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Diagnoa Genomics Technology Co ltd
Priority to PCT/CN2019/087580 (WO2019242445A1)
Publication of CN109949866A
Application granted
Publication of CN109949866B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The application relates to a method, a device, computer equipment and a storage medium for detecting a pathogen operation group. The method comprises the following steps: obtaining sequencing data of a sample; acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k; acquiring the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group; and selecting each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data. Because the sequencing data are compared only with the characteristic target sequences corresponding to the pathogen operation groups, rather than with whole genome sequences, the comparison space is reduced, the analysis time is shortened, and the detection efficiency is improved.

Description

Method and device for detecting pathogen operation group, computer equipment and storage medium
Technical Field
The present application relates to the fields of gene detection and gene sequence analysis, and in particular to a method, a device, computer equipment and a storage medium for detecting a pathogen operation group.
Background
Gene sequencing is a novel gene detection technique. It can detect the genome sequences and sequence fragments of the various species contained in biological samples such as blood, saliva and tissue, and, by analyzing the sequencing data, supports diagnostic functions such as finding the source of infection of an infectious disease, determining the pathogenic genes of a genetic disease, and predicting the incidence probability of a chronic disease. Researchers can currently detect the various species contained in a sample, including pathogenic microorganisms, by high-throughput sequencing followed by analysis of the sequencing data.
However, the current technology obtains information about the species contained in a sample, including pathogens, by comparing the sequencing data with the known whole-genome sequences of all genomes. Because every comparison is made against complete genome sequences, detection of pathogen operation groups is inefficient.
Disclosure of Invention
In view of the above, and in response to the technical problem of low detection efficiency, there is a need to provide a method, a device, computer equipment and a storage medium for detecting a pathogen operation group that can improve detection efficiency.
A method for detecting a pathogen operation group, the method comprising:
obtaining sequencing data of a sample;
acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k;
acquiring the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group;
and selecting each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data.
A method for detecting the relative concentration of a pathogen operation group, the method comprising:
obtaining sequencing data of a sample, and calculating the sum CT of the numbers of occurrences of the k-mers appearing in the sequencing data;
obtaining the number of occurrences GCT, in the sequencing data, of the k-mers of a pathogen operation group contained in a complete set;
and calculating, from the number of occurrences GCT and the sum CT, the relative concentration in the sequencing data of the pathogen operation group contained in the complete set.
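The text here states only that the relative concentration is calculated from GCT and CT, without giving the formula. As a minimal sketch, assuming the simplest reading (the ratio GCT / CT), the calculation could look as follows. The function name `relative_concentration` and the ratio itself are assumptions made for illustration, not the formula of the application.

```python
def relative_concentration(gct: int, ct: int) -> float:
    """Return GCT / CT, an ASSUMED definition of relative concentration.

    gct: occurrences in the sequencing data of the k-mers of one
         pathogen operation group.
    ct:  sum of the occurrences of all k-mers in the sequencing data.
    """
    if ct <= 0:
        raise ValueError("CT must be positive")
    if gct < 0 or gct > ct:
        raise ValueError("GCT must lie between 0 and CT")
    return gct / ct


# Hypothetical counts, chosen only to illustrate the call.
example_ratio = relative_concentration(gct=250, ct=10_000)
```

Under this assumption, a group whose k-mers account for 250 of 10,000 counted k-mer occurrences has a relative concentration of 0.025.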
A device for detecting a pathogen operation group, the device comprising:
a sequencing data acquisition module, configured to acquire sequencing data of a sample;
a target sequence set acquisition module, configured to acquire a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k;
a specific k-mer occurrence count acquisition module, configured to acquire the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group;
and a pathogen operation group selection module, configured to select each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data.
A device for detecting the relative concentration of a pathogen operation group, the device comprising:
a sequencing data acquisition module, configured to acquire sequencing data of a sample and calculate the sum CT of the numbers of occurrences of the k-mers appearing in the sequencing data;
an occurrence count GCT acquisition module, configured to acquire the number of occurrences GCT, in the sequencing data, of the k-mers of a pathogen operation group contained in a complete set;
and a relative concentration calculation module, configured to calculate, from the number of occurrences GCT and the sum CT, the relative concentration in the sequencing data of the pathogen operation group contained in the complete set.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program:
obtaining sequencing data of a sample;
acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k;
acquiring the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group;
and selecting each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining sequencing data of a sample;
acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k;
acquiring the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group;
and selecting each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program:
obtaining sequencing data of a sample, and calculating the sum CT of the numbers of occurrences of the k-mers appearing in the sequencing data;
obtaining the number of occurrences GCT, in the sequencing data, of the k-mers of a pathogen operation group contained in a complete set;
and calculating, from the number of occurrences GCT and the sum CT, the relative concentration in the sequencing data of the pathogen operation group contained in the complete set.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining sequencing data of a sample, and calculating the sum CT of the numbers of occurrences of the k-mers appearing in the sequencing data;
obtaining the number of occurrences GCT, in the sequencing data, of the k-mers of a pathogen operation group contained in a complete set;
and calculating, from the number of occurrences GCT and the sum CT, the relative concentration in the sequencing data of the pathogen operation group contained in the complete set.
In the above method, device, computer equipment and storage medium for detecting a pathogen operation group, a target database is established in advance. It stores the characteristic target sequence set corresponding to each pathogen operation group, and the specific k-mers contained in each set are the k-mers of that pathogen operation group that meet the preset specificity condition. Once the sequencing data of a sample are obtained, the pathogen operation groups contained in the sequencing data can therefore be determined from the number of occurrences, in the sequencing data, of the specific k-mers contained in each characteristic target sequence set. Because the sequencing data are compared only with the characteristic target sequences corresponding to the pathogen operation groups, the comparison space is reduced, the analysis time is shortened, and the detection efficiency is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting a pathogen operation group according to one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a process prior to step 102 in one embodiment;
FIG. 2A is a schematic flow chart diagram illustrating a process prior to step 110 in one embodiment;
FIG. 2B is a schematic flow chart illustrating a process after step 12 in one embodiment;
FIG. 2C is a schematic flow chart diagram illustrating a process after step 108 in one embodiment;
FIG. 2D is a schematic flow chart illustrating step 16 in one embodiment;
FIG. 2E is a schematic flow chart illustrating step 16 in one embodiment;
FIG. 3 is a schematic flow chart of step 108 in one embodiment;
FIG. 4 is a schematic representation of a table of records of actual occurrences of specific k-mers in one embodiment;
FIG. 5 is a flow chart illustrating step 306 in one embodiment;
FIG. 6 is a flow chart illustrating step 506 in one embodiment;
FIG. 7 is a flowchart illustrating step 504, according to an embodiment;
FIG. 8 is a schematic flow chart of a step preceding the step of obtaining sequencing data of a sample according to another embodiment;
FIG. 9 is a diagram showing a table of occurrence ratios of specific k-mers in one embodiment;
FIG. 3A is a schematic flow chart of step 108 in another embodiment;
FIG. 3a is a schematic flow chart of step 108 in yet another embodiment;
FIG. 10 is a schematic flow chart diagram illustrating the process after step 108 in one embodiment;
FIG. 11 is a schematic flow chart showing a process after step 108 in another embodiment;
FIG. 12 is a flowchart illustrating step 1104 in one embodiment;
FIG. 13 is a diagram of a table of estimated actual occurrences of k-mers in one embodiment;
FIG. 14 is a flowchart illustrating step 1204, according to an embodiment;
FIG. 15 is a schematic flow chart diagram illustrating a process before step 1206, in one embodiment;
FIG. 16 is a schematic flow chart of a further embodiment of the method before the step of obtaining sequencing data of the sample;
FIG. 17 is a diagram illustrating a table of k-mer occurrence count to corpus ratio in one embodiment;
FIG. 18 is a schematic flow chart of a further embodiment of the method before the step of obtaining sequencing data of the sample;
FIG. 19 is a schematic flow chart of a further embodiment of the method before the step of obtaining sequencing data of the sample;
FIG. 20 is a schematic flow chart showing a method for detecting a pathogen operation group in still another embodiment;
FIG. 21 is a flowchart of step 2002 in one embodiment;
FIG. 22 is a schematic flow chart of step 2004 in one embodiment;
FIG. 23 is a schematic flow chart illustrating step 2006 in one embodiment;
FIG. 24 is a schematic flow chart of a method for detecting the relative concentration of a pathogen operation group in one embodiment;
FIG. 25 is a schematic flow chart diagram illustrating step 2404 in one embodiment;
FIG. 26 is a schematic flow chart diagram illustrating a step 2406 according to an embodiment;
FIG. 27 is a block diagram showing the structure of a device for detecting a pathogen operation group in one embodiment;
FIG. 28 is a block diagram showing the structure of a device for detecting a pathogen operation group in another embodiment;
FIG. 29 is a block diagram showing the structure of a device for detecting a pathogen operation group in still another embodiment;
FIG. 30 is a block diagram of the structure of a device for detecting the relative concentration of a pathogen operation group in one embodiment;
FIG. 31 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a method for detecting a pathogen operation group is provided, comprising the following steps:
Step 102, obtaining sequencing data of a sample.
Sequencing data of a sample refers to the data output by a device such as a DNA sequencer, an RNA sequencer or a protein sequencing device after it reads the sequences of the biomolecules contained in the sample. A set of sequencing data includes many pieces of data to be tested (possibly more than several million), each of which can be abstracted as a character string. DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule, and includes any method or technique for determining the order of the four bases adenine, guanine, cytosine and thymine in a DNA strand. A sequencer is an instrument that determines the sequences in an input sample; the sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA. The sample may be, for example, a drop of blood, a sputum specimen or a lump of soil.
Step 104, acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in the target database, wherein the characteristic target sequence set comprises the specific k-mers of the pathogen operation group that meet a preset specificity condition, and a k-mer refers to a genome sequence of length k.
A pathogen operation group may represent a genetic unit or taxonomic unit at any of various classification levels, such as a species, a subspecies, a subtype, a bacterial or viral strain, or a genus, and may include one or more related genomes. The target database stores a characteristic target sequence set established in advance for each pathogen operation group, and each set comprises the specific k-mers of the corresponding pathogen operation group. A specific k-mer is a k-mer, selected from the k-mers contained in a pathogen operation group, that satisfies a preset specificity condition. The preset specificity condition is set in advance by a technician to select qualifying k-mers, and can be determined according to the technician's judgment or actual project requirements.
A k-mer refers to a genome sequence of length k, where k is a natural number. If a piece of genome data contains a different deterministic characters in total, then for a given k there are at most a^k (a to the power k) distinct k-mers. For DNA sequences the deterministic characters are the four bases A (adenine), T (thymine), C (cytosine) and G (guanine); for RNA (ribonucleic acid) sequences, U (uracil) takes the place of T; for protein sequences, a deterministic character is a defined amino acid character.
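The definition above can be illustrated with a short sketch that slides a window of length k over a sequence and evaluates the a^k upper bound. The names `iter_kmers` and `max_distinct_kmers` are hypothetical, and skipping windows that contain a non-deterministic character (such as N) is an assumption of this sketch rather than something the text specifies.

```python
from collections import Counter


def iter_kmers(sequence: str, k: int, alphabet: str = "ACGT"):
    """Yield every length-k substring made only of deterministic characters.

    Windows containing any character outside `alphabet` are skipped;
    that policy is an assumption of this sketch.
    """
    allowed = set(alphabet)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= allowed:
            yield kmer


def max_distinct_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound a**k on the number of distinct k-mers."""
    return alphabet_size ** k


# Toy DNA string with one ambiguous base (N).
kmer_counts = Counter(iter_kmers("ACGTNACGT", 3))
dna_3mer_bound = max_distinct_kmers(4, 3)
```

For the toy string, only the windows ACG and CGT survive (twice each), while the bound for DNA 3-mers is 4^3 = 64.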
Step 106, acquiring the number of occurrences, in the sequencing data, of the specific k-mers contained in the characteristic target sequence set corresponding to each pathogen operation group.
Step 108, selecting each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data.
After the sequencing data of the sample are obtained, they can be compared with each pathogen operation group stored in the target database; that is, the number of occurrences in the sequencing data of the specific k-mers contained in the characteristic target sequence set of each pathogen operation group is obtained, and the pathogen operation groups contained in the sequencing data are determined from these numbers. Specifically, a count threshold may be preset, and when the number of occurrences of the specific k-mers in the sequencing data exceeds the preset count threshold, the pathogen operation group corresponding to those specific k-mers is taken as a pathogen operation group contained in the sequencing data. A set of sequencing data may therefore contain one pathogen operation group or several. The preset count threshold can be set by technicians according to actual project requirements.
In this embodiment, a target database is established in advance and stores the characteristic target sequence set corresponding to each pathogen operation group, where the specific k-mers contained in each set are the k-mers of that pathogen operation group that meet the preset specificity condition. Once the sequencing data of a sample are obtained, the pathogen operation groups contained in the sequencing data can therefore be determined from the number of occurrences, in the sequencing data, of the specific k-mers contained in each characteristic target sequence set. Because the sequencing data are compared only with the characteristic target sequences corresponding to the pathogen operation groups, the comparison space is reduced, the analysis time is shortened, and the detection efficiency is improved.
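Steps 102 to 108 can be sketched as follows. This is a minimal illustration under stated assumptions: the target database is modelled as a plain dictionary from a group name to its set of specific k-mers, the occurrences of a group's specific k-mers are summed before comparison with the threshold, and reverse complements are ignored. The names `count_kmers`, `detect_operation_groups`, `target_db` and `count_threshold` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def detect_operation_groups(reads, target_db, k, count_threshold):
    """Return {group: total} for groups whose specific k-mers occur
    more than `count_threshold` times in the reads.

    Summing over a group's specific k-mers is an assumption; the text
    only requires that the occurrence count exceed a preset threshold.
    """
    read_counts = count_kmers(reads, k)
    detected = {}
    for group, specific_kmers in target_db.items():
        total = sum(read_counts[kmer] for kmer in specific_kmers)
        if total > count_threshold:
            detected[group] = total
    return detected


# Toy data: two hypothetical groups and two short reads.
target_db = {"group_A": {"ACG", "CGT"}, "group_B": {"TTT"}}
reads = ["ACGTACGT", "GGACGT"]
detected = detect_operation_groups(reads, target_db, k=3, count_threshold=2)
```

Only the specific k-mers are looked up, so the work per group scales with the size of its characteristic target sequence set rather than with whole genome length, which is the efficiency argument made above.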
In one embodiment, a specific k-mer is a k-mer of a pathogen operation group whose count in the genome occurrence index table of that pathogen operation group satisfies a preset error condition.
The characteristic target sequence set corresponding to each pathogen operation group contains the specific k-mers of that group that meet the preset specificity condition. Here, the preset specificity condition selects the k-mers of a pathogen operation group whose counts in the group's genome occurrence index table satisfy a preset error condition. The preset error condition is set in advance by a technician according to actual project requirements and can be a range; that is, a k-mer selected as specific is allowed a certain error and need not fully satisfy a strict objective condition.
Each pathogen operation group has a corresponding genome occurrence index table, from which the number of genomes in the group that contain each of the group's k-mers can be read. The k-mers whose counts in this table satisfy the preset error condition are selected and used as the specific k-mers.
Allowing a certain error when selecting specific k-mers means that sequences representative of a pathogen operation group can be found with high probability within that error range, so that only these specific sequences, rather than whole genome sequences, are used when determining the pathogen operation groups contained in the sequencing data. This reduces the sequence comparison space when a real sample is processed, thereby shortening the analysis time and improving the detection efficiency.
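The text leaves the form of the preset error condition open. As one hedged illustration, assume the condition is that a k-mer may be missing from at most a fixed number of the group's genomes. The function `select_specific_kmers` and its parameter `max_missing_genomes` are hypothetical, and the sketch deliberately covers only this tolerance, not any comparison against other groups.

```python
def select_specific_kmers(genome_index, genome_total, max_missing_genomes=0):
    """Select k-mers present in nearly all genomes of one group.

    genome_index: {kmer: number of genomes in the group containing it}
    genome_total: number of genomes in the group
    max_missing_genomes: ASSUMED error tolerance; a k-mer may be absent
        from at most this many genomes and still be selected.
    """
    minimum = genome_total - max_missing_genomes
    return {kmer for kmer, n in genome_index.items() if n >= minimum}


# Toy genome occurrence index table for a group of 10 genomes.
index_table = {"ACG": 10, "CGT": 9, "GTA": 4}
strict = select_specific_kmers(index_table, genome_total=10)
tolerant = select_specific_kmers(index_table, 10, max_missing_genomes=1)
```

With no tolerance only ACG qualifies; allowing one missing genome also admits CGT, which shows how a range-style condition widens the candidate set.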
In one embodiment, before step 102, the method further includes: generating a genome occurrence index table corresponding to each pathogen operation group, wherein the table records, for each k-mer, the number of genomes in the corresponding pathogen operation group that contain that k-mer; and storing the genome occurrence index table in the characteristic target sequence set corresponding to the pathogen operation group.
A genome is all the genetic information of an organism, stored in the form of nucleotide sequences. The sum of the genetic material of an entire individual organism (for example, an individual animal or plant, an individual animal or plant cell, or an individual bacterium) is its genome. Each pathogen operation group may include multiple genomes, and each genome may contain multiple k-mers. The genome occurrence index table corresponding to a pathogen operation group records, for each k-mer contained in the group, how many of the group's genomes contain that k-mer.
If a k-mer occurs more than once in the same genome, that genome is still counted only once in the genome occurrence index table. Once the number of genomes in which each k-mer appears has been obtained for a pathogen operation group, the genome occurrence index table for that group can be established. If there are M pathogen operation groups, M corresponding genome occurrence index tables are generated.
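The counting rule described above (each genome contributes at most one count per k-mer) can be sketched as follows. The name `build_genome_occurrence_index` and the representation of a genome as a single string are assumptions made for illustration.

```python
from collections import Counter


def build_genome_occurrence_index(genomes, k):
    """For one pathogen operation group, map each k-mer to the number
    of genomes that contain it at least once."""
    index = Counter()
    for genome in genomes:
        # A set removes repeats, so each genome adds at most 1 per k-mer.
        distinct = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        index.update(distinct)
    return index


# Toy group of two genomes; ACG occurs twice in the first genome.
group_genomes = ["ACGACG", "ACGTT"]
genome_index = build_genome_occurrence_index(group_genomes, 3)
```

Here ACG receives a count of 2 (one per genome) even though it occurs three times in total, matching the once-per-genome rule.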
After the genome occurrence index table corresponding to each pathogen operation group is established, it can be stored in the characteristic target sequence set corresponding to that group, and thus in the target database. Whenever the table is needed later, it can be retrieved directly from the database, which improves detection efficiency.
In one embodiment, as shown in FIG. 2, before step 102, the method further includes the following steps:
Step 110, selecting the k-mers meeting the preset specificity condition from the k-mers corresponding to each pathogen operation group.
Step 112, storing the k-mers meeting the preset specificity condition in the characteristic target sequence set corresponding to each pathogen operation group.
The target database stores a characteristic target sequence set for each pathogen operation group. Each set comprises the specific k-mers of the corresponding group, that is, the k-mers selected from the group's k-mers because they meet the preset specificity condition.
Because the characteristic target library is established in advance, the required data can be retrieved directly when the pathogen operation groups of the sequencing data are detected and determined, which improves detection efficiency.
In one embodiment, as shown in FIG. 2A, before step 110, the method further includes the following steps:
Step 10, acquiring high-confidence genome data to obtain a complete set, wherein a high-confidence genome is a genome meeting preset conditions.
The source of the high-confidence genome data may be the RefSeq data set (the reference sequence database of biologically non-redundant gene and protein sequences) of the NCBI (National Center for Biotechnology Information), or other public or private high-confidence genomes. The collection of all collected high-confidence genomes is called the complete set. The high-confidence genomes include both pathogen genomes and non-pathogen genomes, such as high-confidence genomes of symbionts, probiotics, humans, animals and plants.
Collecting high-confidence genome data includes confirming the credibility of each genome and screening accordingly. A genome is selected as a high-confidence genome according to the following criteria:
(1) screening by the proportion of non-deterministic characters contained in a piece of genome data: for a DNA genome, for example, this is the proportion of non-ACGT characters. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, the data is a suspected low-confidence genome;
(2) screening by the number of genome data fragments that make up a complete chromosome: if too many fragments belong to one chromosome, the genome is a suspected low-confidence genome;
(3) screening by average whole-genome coverage percentage: whole-genome sequence alignment is performed between the genome and several closely related genomes (for example, genomes whose genetic distance from it is below a certain threshold) to determine the genome's average whole-genome coverage percentage among those similar genomes. A genome whose average coverage percentage is too low is suspected of low completeness, that is, low confidence.
After the suspected low-confidence and low-confidence genomes are removed, the remaining genomes are the high-confidence genomes.
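The three screening criteria can be sketched as a simple filter. The text gives no numeric cutoffs, so the thresholds below (`max_non_acgt`, `max_fragments`, `min_coverage`) are placeholders chosen purely for illustration, and the `GenomeRecord` structure is hypothetical. The coverage percentage is taken as a precomputed input, since the alignment step itself is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class GenomeRecord:
    name: str
    sequence: str
    fragments_per_chromosome: int   # criterion (2) input
    mean_coverage_percent: float    # criterion (3) input, precomputed


def non_acgt_fraction(sequence: str) -> float:
    """Criterion (1): proportion of non-ACGT characters."""
    if not sequence:
        return 1.0  # treat empty data as fully unreliable
    ambiguous = sum(1 for base in sequence.upper() if base not in "ACGT")
    return ambiguous / len(sequence)


def is_high_confidence(record, max_non_acgt=0.01,
                       max_fragments=100, min_coverage=80.0):
    """Apply all three criteria; the default cutoffs are placeholders."""
    return (
        non_acgt_fraction(record.sequence) <= max_non_acgt
        and record.fragments_per_chromosome <= max_fragments
        and record.mean_coverage_percent >= min_coverage
    )


good = GenomeRecord("g1", "ACGT" * 50, 3, 95.0)
bad = GenomeRecord("g2", "ACGTNNNNNN", 3, 95.0)
complete_set = [g for g in (good, bad) if is_high_confidence(g)]
```

In practice the cutoffs would be tuned to the data source; the point of the sketch is only that each criterion is an independent pass/fail check and a genome must pass all of them.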
Step 12, determining the pathogen operation groups contained in the complete set.
The pathogen operation group to which each high-confidence genome in the complete set belongs is determined according to the genomes contained in each genetic unit or taxonomic unit at the relevant classification level, such as a species, a subspecies, a subtype, a bacterial or viral strain, or a genus. In this embodiment, the high-confidence genome data are collected in advance to obtain the complete set, and the pathogen operation groups contained in the complete set are determined, which allows direct use in subsequent steps and improves efficiency.
In this embodiment, after determining the pathogen operation group contained in the complete set, the method further includes:
The k-mers occurring in each pathogen operation group contained in the complete set are computed, and for each pathogen operation group a list of the k-mers occurring within that group (a pathogen k-mer list) is created. From these per-group lists, a summary table of all k-mers occurring in any pathogen operation group of the complete set, namely the pathogen k-mer summary table, is established. The pathogen k-mer lists and the pathogen k-mer summary table are stored in the target database.
A k-mer occurrence count record table is computed for each pathogen operation group contained in the complete set. The k-mers recorded are not limited to the specific k-mers of the pathogen operation group; they include all k-mers occurring in the genomes of that group. That is, the table records each k-mer contained in the pathogen operation group and the number of times that k-mer occurs in the group. If a k-mer occurs x times in a genome, x is added to the count for that k-mer. If the complete set contains M pathogen operation groups, M corresponding k-mer occurrence count record tables are established. The table obtained for each pathogen operation group is stored in the target database.
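A minimal sketch of building one such record table is given below, assuming each genome is available as a plain string. The function names are illustrative only and do not come from the source.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, one per start position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def group_kmer_counts(genomes, k):
    """k-mer occurrence count record table for one pathogen operation group.

    A k-mer occurring x times in a genome adds x to its count.
    """
    counts = Counter()
    for seq in genomes:
        counts.update(kmers(seq, k))
    return counts
```

For M pathogen operation groups, calling `group_kmer_counts` once per group would give the M tables described above.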
A k-mer occurrence count record table for the complete set is established from the pathogen k-mer summary table in the target database: for each k-mer in the summary table, its count is calculated using the k-mer occurrence count record tables of the individual pathogen operation groups. The resulting table for the complete set is stored in the target database.
In one embodiment, as shown in fig. 2B, after step 12, the method further comprises the steps of:
Step 14, obtaining sequencing data of the sample and calculating the total number of occurrences CT of the k-mers appearing in the sequencing data.
The server can acquire the sequencing data of the sample in advance. Because sequencing a sample involves multiple steps such as sample preparation, sequencing, and signal processing, the absolute concentration of a pathogen operation group in the sample is difficult to calculate accurately; the relative concentration of the pathogens contained in the sample can be estimated instead. Determining the pathogen operation groups contained in the sample sequencing data and estimating the relative concentrations can be performed simultaneously.
In the relative concentration estimation, the total number of occurrences of k-mers in the sequencing data, CT (count total), is calculated first. The calculation here uses all k-mers rather than only data related to specific k-mers, so the k-mers contained in each pathogen operation group can be obtained, and the total CT of the k-mers occurring in the sequencing data can also be obtained. Specifically, if the sequencing data comprises M pieces of data to be tested and each piece contains n characters, each piece contains n - k + 1 k-mers; adding up the numbers of k-mers in the M pieces, that is, summing M terms of n - k + 1, gives the total number of occurrences CT. To speed up the calculation, it is not necessary to record which k-mers occur, that is, the sequence of each k-mer need not be known or recorded; the total CT is calculated directly.
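Because only the read lengths matter, CT can be computed without enumerating any k-mer. The sketch below assumes the data to be tested are given as strings; `count_total` is a hypothetical name, and clamping at zero for reads shorter than k is an added assumption.

```python
def count_total(reads, k):
    """Total number of k-mer occurrences CT in the sequencing data.

    A read of n characters contributes n - k + 1 k-mers; reads shorter
    than k contribute none (an assumption, the text does not cover them).
    """
    return sum(max(len(read) - k + 1, 0) for read in reads)
```

With M reads of equal length n this reduces to M * (n - k + 1), matching the description above.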
Step 16, acquiring the number of occurrences GCT, in the sequencing data, of the k-mers contained in each pathogen operation group of the complete set.
The number of occurrences GCT (group count total), in the sequencing data, of the k-mers contained in each pathogen operation group of the complete set is obtained. The GCT is an estimate. The GCT may comprise two parts: the estimated total number of actual occurrences ECT (estimated count total) of the k-mers of a pathogen operation group, and the confirmed total number of actual occurrences CCT (confirmed count total) of the k-mers of that pathogen operation group. In this case the GCT is obtained by calculating the ECT and the CCT for each pathogen operation group contained in the complete set; that is, the GCT is the sum of the ECT and the CCT. The GCT may also comprise only one part: when k of the k-mers contained in the pathogen operation groups is greater than a target value, that is, when k is sufficiently large, the GCT can be estimated entirely from the CCT. In that case the GCT equals the CCT, and the ECT need not be calculated.
Step 18, calculating the relative concentration, in the sequencing data, of each pathogen operation group contained in the complete set according to the number of occurrences GCT and the total number of occurrences CT.
Once the GCT of each pathogen operation group contained in the complete set and the total CT of the k-mers occurring in the sequencing data have been obtained, the ratio of each group's GCT to CT is calculated; this ratio is the relative concentration of that pathogen operation group in the sequencing data.
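The ratio in step 18 can be written out directly. In this sketch the per-group GCT values are assumed to be held in a dictionary keyed by group name; the function name and the input layout are illustrative.

```python
def relative_concentrations(gct_by_group, ct):
    """Relative concentration of each pathogen operation group: GCT / CT."""
    if ct <= 0:
        raise ValueError("CT must be positive")
    return {group: gct / ct for group, gct in gct_by_group.items()}
```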
Then, after step 108, as shown in FIG. 2C, the method further comprises the following steps:
Step 108a, obtaining the relative concentration of each pathogen operation group contained in the sequencing data according to the relative concentrations, in the sequencing data, of the pathogen operation groups contained in the complete set.
When the server has also determined the pathogen operation groups contained in the sample sequencing data, it looks up the corresponding pathogen operation groups in the complete set and, from the relative concentration of each pathogen operation group of the complete set in the sequencing data, obtains the relative concentration of each pathogen operation group contained in the sequencing data.
Step 108b, selecting, from the pathogen operation groups contained in the sequencing data, those whose relative concentration is higher than a preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data.
The server selects the pathogen operation groups whose relative concentration in the sequencing data is higher than a preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data, where a confirmed pathogen operation group is one whose presence in the sequencing data can be confirmed. The preset threshold is a relative concentration value set manually in advance: when the relative concentration of a pathogen operation group in the sequencing data is higher than the preset threshold, the group is determined to be contained in the sequencing data; when it is not higher than the preset threshold, the group is determined not to be contained in the sequencing data.
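Steps 108a and 108b amount to a lookup followed by a strict threshold comparison. The sketch below assumes the relative concentrations are held in a dictionary and the detected groups in a list; the names and the example threshold are hypothetical.

```python
def confirmed_groups(rel_conc, detected_groups, threshold):
    """Groups detected in the sequencing data whose relative concentration
    is strictly higher than the preset threshold."""
    return sorted(g for g in detected_groups
                  if rel_conc.get(g, 0.0) > threshold)
```

A group whose relative concentration equals the threshold is excluded, in line with the "not higher than" wording above.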
In the above embodiment, the relative concentrations in the sequencing data of all pathogen operation groups contained in the complete set are calculated. Once the pathogen operation groups contained in the sequencing data have been determined, the relative concentration of each of them can be obtained, and those with a relative concentration higher than the preset threshold are selected as the pathogen operation groups confirmed to be contained in the sequencing data, which further improves the accuracy of determining the pathogen operation groups contained in the sequencing data.
In one embodiment, as shown in FIG. 2D, step 16 comprises the following steps:
Step 16a, acquiring each piece of first marking data in the sequencing data, calculating the number of k-mers contained in each piece of first marking data, and calculating the confirmed total number of actual occurrences CCT from these numbers.
First marking data are the pieces of data to be tested that contain specific k-mers of only one pathogen operation group; each such piece is marked as coming from that pathogen operation group. Marking every piece of data to be tested that contains specific k-mers of only one pathogen operation group as coming from that group yields the first marking data.
The server acquires each piece of first marking data in the sequencing data and calculates the number of k-mers in each piece marked as coming from a single pathogen operation group. The numbers of k-mers of the pieces marked as coming from the same pathogen operation group are then summed to obtain the confirmed total number of actual occurrences CCT of the k-mers of each pathogen operation group.
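The marking and the CCT calculation can be sketched as follows, assuming the specific k-mers of each pathogen operation group are available as sets. All function and variable names are illustrative and not taken from the source.

```python
def read_kmers(read, k):
    """All k-mers of one piece of data to be tested, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def label_read(read, k, specific):
    """Return the single group whose specific k-mers occur in the read.

    Returns None when the read matches no group or more than one group,
    i.e. when it is not first marking data.
    """
    hits = {group for group, kmer_set in specific.items()
            if any(km in kmer_set for km in read_kmers(read, k))}
    return next(iter(hits)) if len(hits) == 1 else None


def confirmed_count_total(reads, k, specific):
    """CCT per group: total k-mers in the reads marked as from that group."""
    cct = {group: 0 for group in specific}
    for read in reads:
        group = label_read(read, k, specific)
        if group is not None:
            cct[group] += len(read_kmers(read, k))
    return cct
```

Note that every k-mer of a marked read is counted toward the CCT, not only the specific ones, consistent with "the number of k-mers contained in each piece of first marking data".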
Step 16b, acquiring the complete-set proportion of each k-mer contained in the pathogen operation groups of the complete set, where the complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set.
In the target database, the characteristic target sequence set stored for each pathogen operation group further includes a k-mer occurrence count and complete-set proportion table for that group. Each such table contains the number of occurrences CG of each k-mer in the corresponding pathogen operation group, the number of occurrences CB of each k-mer across the pathogen operation groups contained in the target database, and the complete-set proportion F of each k-mer. The complete-set proportion of each k-mer can therefore be read from this table.
The number of occurrences CG of each k-mer in the corresponding pathogen operation group can be obtained from the k-mer occurrence count record table of that pathogen operation group pre-stored in the target database. The number of occurrences CB of the k-mer in the complete set can be obtained from the k-mer occurrence count record table of the complete set pre-stored in the target database.
Step 16c, acquiring each piece of second marking data in the sequencing data, and acquiring the actual number of occurrences, in each piece of second marking data, of each k-mer contained in the pathogen operation groups of the complete set.
Second marking data are the pieces of data to be tested that either contain no specific k-mer of any pathogen operation group or contain specific k-mers belonging to a plurality of pathogen operation groups; such pieces are marked as not belonging to any pathogen operation group.
The server acquires each piece of second marking data in the sequencing data and, for each piece, identifies all the k-mers it contains. Using the k-mer occurrence count record tables of the pathogen operation groups pre-stored in the target database, it finds the pathogen operation groups to which these k-mers correspond and obtains the actual number of occurrences, in the second marking data, of each k-mer contained in those pathogen operation groups.
Step 16d, calculating the estimated actual number of occurrences of each k-mer according to the complete-set proportion and the actual number of occurrences.
After obtaining the complete-set proportions from the target database and the actual numbers of occurrences, the server calculates, for each k-mer contained in each pathogen operation group of the complete set, the product of its complete-set proportion and its actual number of occurrences in the second marking data. This product is the estimated actual number of occurrences, in the second marking data, of that k-mer for that pathogen operation group.
Step 16e, calculating the estimated total number of actual occurrences ECT from the estimated actual number of occurrences of each k-mer.
The estimated actual numbers of occurrences of the k-mers contained in each pathogen operation group are summed to obtain the estimated total number of actual occurrences ECT of each pathogen operation group.
Step 16f, calculating the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation groups of the complete set according to the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT.
Once the estimated total number of actual occurrences ECT of the k-mers of each pathogen operation group in the second marking data and the confirmed total number of actual occurrences CCT of the corresponding pathogen operation group in the first marking data have been calculated, the sum of the ECT and the CCT is calculated to obtain the number of occurrences GCT, in the sequencing data, of the k-mers contained in each pathogen operation group of the complete set.
By calculating the confirmed total number of actual occurrences CCT for the pathogen operation groups corresponding to the first marking data and the estimated total number of actual occurrences ECT of the k-mers of each pathogen operation group in the second marking data, the number of occurrences GCT of the k-mers of each pathogen operation group in the sequencing data can be estimated, so that the relative concentration can be calculated.
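Steps 16b to 16f can be sketched as below. The sketch assumes the CG tables (one per group) and the CB table are already available as dictionaries, and it computes the complete-set proportion F as CG / CB on the fly instead of reading it from a stored table. All names are hypothetical.

```python
from collections import Counter


def estimated_count_total(second_reads, k, cg_by_group, cb):
    """ECT per group from the second marking data.

    Each observed k-mer contributes (actual occurrences) * (CG / CB)
    to every group whose CG table contains it.
    """
    observed = Counter()
    for read in second_reads:
        for i in range(len(read) - k + 1):
            observed[read[i:i + k]] += 1

    ect = {}
    for group, cg in cg_by_group.items():
        ect[group] = sum(n * cg[km] / cb[km]
                         for km, n in observed.items()
                         if km in cg and cb.get(km, 0) > 0)
    return ect


def group_count_total(ect, cct):
    """GCT per group as the sum of ECT and CCT."""
    return {g: ect.get(g, 0.0) + cct.get(g, 0) for g in set(ect) | set(cct)}
```

In the embodiment where k exceeds the target value, `group_count_total` would simply be skipped and the CCT used as the GCT.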
In one embodiment, as shown in FIG. 2E, step 16 comprises the following steps:
Step 16A, acquiring each piece of first marking data in the sequencing data, calculating the number of k-mers contained in each piece of first marking data, and calculating the confirmed total number of actual occurrences CCT from these numbers.
First marking data are the pieces of data to be tested that contain specific k-mers of only one pathogen operation group; each such piece is marked as coming from that pathogen operation group. Marking every piece of data to be tested that contains specific k-mers of only one pathogen operation group as coming from that group yields the first marking data.
The server acquires each piece of first marking data in the sequencing data and calculates the number of k-mers in each piece marked as coming from a single pathogen operation group. The numbers of k-mers of the pieces marked as coming from the same pathogen operation group are then summed to obtain the confirmed total number of actual occurrences CCT of the k-mers of each pathogen operation group.
Step 16B, when the length of the k-mers contained in the pathogen operation groups of the complete set is greater than a target value, determining the confirmed total number of actual occurrences CCT as the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation groups of the complete set.
This step applies when k of the k-mers contained in the pathogen operation groups of the complete set is greater than a target value, where the target value is determined through experiments. For example, k may be greater than 23 or 27. In this case, the calculated confirmed total number of actual occurrences CCT can be used directly as the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation groups of the complete set. The estimated total number of actual occurrences ECT therefore does not need to be calculated, which improves the efficiency of calculating the GCT.
In one embodiment, each of the specific k-mers satisfies the following two conditions: its number of occurrences in the genome occurrence index table of the pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence index table of the pathogen operation group and its number of occurrences in the genome occurrence index table of the complete set satisfy a second preset error condition. The genome occurrence index table of each pathogen operation group records, for each k-mer, the number of genomes in that group that contain the k-mer; the genome occurrence index table of the complete set records the number of genomes in the complete set that contain the k-mer.
In the target database, each pathogen operation group has a corresponding characteristic target sequence set, and the specific k-mers contained in the characteristic target sequence set are the k-mers that meet a preset specificity condition. The preset specificity condition comprises the first preset error condition and the second preset error condition; a k-mer that satisfies both conditions is considered to satisfy the preset specificity condition and can be used as a specific k-mer.
That is, the number of occurrences of the k-mer in the genome occurrence index table of the pathogen operation group must satisfy the first preset error condition, and the number of occurrences of the k-mer in the genome occurrence index table of the pathogen operation group together with its number of occurrences in the genome occurrence index table of the complete set must satisfy the second preset error condition.
The count recorded for each k-mer in the genome occurrence index table of the complete set represents the number of genomes in the complete set in which the k-mer occurs. If the k-mer occurs multiple times in the same genome, it is counted only once. Likewise, the genome occurrence index table of a pathogen operation group records, for each k-mer, the number of genomes in that group that contain the k-mer.
Unlike the prior art, the selection of specific k-mers in this embodiment introduces two parameters, the first preset error condition and the second preset error condition, thereby allowing a certain range of non-specificity in the specific k-mers. Without these two parameters, no non-specificity can be tolerated, and it is often difficult to find specific k-mers for a single pathogen operation group. By allowing a certain error when selecting specific k-mers, the established characteristic target sequence set will, with high probability, contain specific targets that can represent the pathogen operation group. Consequently, when determining the pathogen operation groups contained in the sequencing data, the sequencing data only needs to be compared with the characteristic target sequence sets of the predetermined pathogen operation groups, which reduces the comparison space, shortens the analysis time, and improves detection efficiency.
In one embodiment, the first preset error condition is: the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of genomes contained in the pathogen operation group, plus a first threshold, is greater than or equal to 1.
In this embodiment, the first preset error condition is that the ratio of the number of occurrences recorded in the genome occurrence index table of the pathogen operation group to the number of genomes contained in the pathogen operation group, plus the first threshold, is greater than or equal to 1. Assuming that the pathogen operation group comprises N genomes, that the number of occurrences of a given k-mer in the genome occurrence index table of the pathogen operation group is C1, and that the first threshold is P1, the first preset error condition is C1/N + P1 ≥ 1. The first threshold P1 represents an acceptable error probability; it can be any value between 0 and 1 and can be set by a technician according to the actual project.
In one embodiment, the first threshold is less than 5%. In another embodiment, the first threshold may be less than or equal to 90%. That is, the first threshold may be set manually on the basis of testing; testing found that in some cases the first threshold can take a value less than or equal to 90%.
The first threshold is an acceptable error probability and may be any value between 0 and 1; in this embodiment, it may be set to a value less than 5%.
In one embodiment, the second preset error condition is: the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set, plus a second threshold, is greater than or equal to 1.
In this embodiment, the second preset error condition is that the ratio of the number of occurrences recorded in the genome occurrence index table of the pathogen operation group to the number of occurrences in the genome occurrence index table of the complete set, plus the second threshold, is greater than or equal to 1. Assuming that the number of occurrences of a k-mer in the genome occurrence index table of the pathogen operation group is C1, that the number of occurrences of the k-mer in the genome occurrence index table of the complete set is C2, and that the second threshold is P2, the second preset error condition is C1/C2 + P2 ≥ 1. Like the first threshold, the second threshold P2 represents an acceptable error probability; it may be any value between 0 and 1 and may be set by a technician according to the actual project.
In one embodiment, the second threshold is less than 5%.
Like the first threshold, the second threshold is an acceptable error probability and may be any value between 0 and 1; in this embodiment, it may be set to a value less than 5%. The first threshold and the second threshold may be equal or unequal.
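The two preset error conditions, C1/N + P1 ≥ 1 and C1/C2 + P2 ≥ 1, can be checked together as in the sketch below. The function name is illustrative, and the threshold values in the usage note are arbitrary examples below 5%.

```python
def is_specific_kmer(c1, n_genomes, c2, p1, p2):
    """Check the preset specificity condition for one k-mer.

    c1: genomes in the pathogen operation group containing the k-mer
    n_genomes: number of genomes N in the pathogen operation group
    c2: genomes in the complete set containing the k-mer
    p1, p2: first and second thresholds (acceptable error probabilities)
    """
    if n_genomes <= 0 or c2 <= 0:
        return False  # the ratios are undefined; treat as not specific
    first = c1 / n_genomes + p1 >= 1   # k-mer is present in nearly all group genomes
    second = c1 / c2 + p2 >= 1         # k-mer is rare outside the group
    return first and second
```

For example, with P1 = P2 = 0.04, a k-mer found in 98 of the group's 100 genomes and in 99 genomes of the complete set passes both conditions, whereas one found in only 90 of the 100 group genomes fails the first.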
In one embodiment, before obtaining the sequencing data of the sample, the method further comprises: generating a genome occurrence index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer; and storing the genome occurrence index table of the complete set in the target database.
The target database stores the characteristic target sequence set corresponding to each pathogen operation group. The complete set contains all the high-credibility genomes collected, namely the high-credibility genomes of a plurality of pathogen operation groups and the high-credibility genomes of a plurality of non-pathogen operation groups. For each k-mer contained in each pathogen operation group, the number of genomes in the complete set in which the k-mer occurs is obtained, and the genome occurrence index table of the complete set is generated from these data. That is, the genome occurrence index table of the complete set records, for each k-mer, the number of genomes in the complete set that contain the k-mer.
Thus, the genome occurrence index table of the complete set records in how many genomes of the complete set each k-mer has occurred; the count is a number of genomes, not a number of k-mer occurrences. If a k-mer occurs more than once in the same genome, it is still counted only once in the genome occurrence index table of the complete set. Once the number of genomes in which each k-mer occurs in the complete set has been obtained, the genome occurrence index table of the complete set can be established. This table differs from the genome occurrence index tables of the pathogen operation groups: each pathogen operation group has its own genome occurrence index table, whereas only one genome occurrence index table is generated for the complete set, covering all the data. After the generated table is stored, it can be called from the database whenever it is needed during detection of sequencing data, which further improves detection efficiency.
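The difference between counting genomes and counting occurrences can be illustrated with a short sketch, assuming each genome is a plain string. The function name is hypothetical.

```python
from collections import Counter


def genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it.

    A k-mer repeated within one genome is counted once for that genome.
    """
    index = Counter()
    for seq in genomes:
        distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        index.update(distinct)
    return index
```

Passing the genomes of one pathogen operation group would give that group's table; passing all genomes of the complete set would give the single complete-set table.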
In one embodiment, as shown in FIG. 3, step 108 includes:
Step 302, obtaining the actual number of occurrences, in the sequencing data, of each specific k-mer contained in each pathogen operation group.
Each pathogen operation group has a corresponding specific k-mer actual occurrence count record table, which records the actual number of occurrences in the sequencing data of each specific k-mer contained in that group.
Step 304, selecting the pathogen operation groups corresponding to the specific k-mers whose actual number of occurrences in the sequencing data exceeds a first preset count threshold, to obtain the suspected pathogen operation groups.
Once the actual number of occurrences of each specific k-mer in the sequencing data has been obtained, the first preset count threshold is used as the selection criterion: the specific k-mers whose actual number of occurrences exceeds the threshold are selected, and the pathogen operation groups corresponding to the selected specific k-mers are taken as the suspected pathogen operation groups. Setting the first preset count threshold filters out specific k-mers with a low probability of occurrence, and thereby the corresponding pathogen operation groups. If the actual number of occurrences in the sequencing data of the specific k-mers of a pathogen operation group is lower than the first preset count threshold, that pathogen operation group has a low probability of occurrence and can be excluded; the remaining groups are the suspected pathogen operation groups.
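One possible reading of step 304 is sketched below: a group is kept as suspected if at least one of its specific k-mers occurs more often than the threshold. The text does not state whether one k-mer suffices, so this rule, like the function name and the input layout, is an assumption.

```python
def suspected_groups(actual_counts, first_count_threshold):
    """Select suspected pathogen operation groups.

    actual_counts maps each group to {specific k-mer: occurrences in the
    sequencing data}. Assumption: a single k-mer above the threshold is
    enough for the group to be kept.
    """
    return sorted(group for group, counts in actual_counts.items()
                  if any(c > first_count_threshold for c in counts.values()))
```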
Step 306, selecting the pathogen operation groups meeting a preset condition from the suspected pathogen operation groups as the pathogen operation groups contained in the sequencing data.
In step 304, pathogen operation groups with a low probability of occurrence are filtered out using the first preset count threshold, leaving the suspected pathogen operation groups. The pathogen operation groups meeting the preset condition can then be selected from the suspected pathogen operation groups as the pathogen operation groups contained in the sequencing data. The preset condition is a selection condition set by a technician and can be adjusted according to the requirements of the actual project.
Selecting the pathogen operation groups contained in the sequencing data according to the actual number of occurrences of the specific k-mers in the sequencing data allows the pathogen operation groups to which the sequencing data belongs to be determined with high probability, which improves detection accuracy.
In one embodiment, before step 302, the method further includes: acquiring the actual number of occurrences, in the sequencing data, of the specific k-mers contained in each pathogen operation group; and generating, from these actual numbers of occurrences, a specific k-mer actual occurrence count record table corresponding to each pathogen operation group.
The specific k-mers contained in each pathogen operation group are stored in the target database. After the sequencing data of the sample is obtained, it is compared with each specific k-mer of each pathogen operation group to obtain the actual number of occurrences of each specific k-mer in the sequencing data. From these data, a specific k-mer actual occurrence count record table can be generated for each pathogen operation group. If the target database contains M pathogen operation groups, M corresponding record tables are generated, each recording the actual number of occurrences in the sequencing data of the specific k-mers contained in the respective pathogen operation group.
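A minimal sketch of generating these record tables follows. It assumes exact string matching of k-mers against the reads, which the text does not specify, and the names are illustrative.

```python
from collections import Counter


def specific_kmer_occurrences(reads, k, specific):
    """Build one actual occurrence count record table per group.

    specific maps each group to its set of specific k-mers; the result
    maps each group to {specific k-mer: occurrences in the reads}.
    """
    observed = Counter(read[i:i + k]
                       for read in reads
                       for i in range(len(read) - k + 1))
    return {group: {km: observed[km] for km in sorted(kmer_set)}
            for group, kmer_set in specific.items()}
```

Specific k-mers that never occur in the sequencing data are kept in the table with a count of zero.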
FIG. 4 shows a specific k-mer actual occurrence count record table, in which the specific k-mers contained in pathogen operation group X are recorded in the leftmost column, and the actual numbers of occurrences of the corresponding specific k-mers in the sequencing data are recorded in the second column as C_1, C_2, …. Further, as shown in FIG. 4, an additional column can be added to record the occurrence ratio of each specific k-mer, denoted F_1, F_2, …, respectively.
Generating the specific k-mer actual occurrence count record tables from the actual numbers of occurrences of the specific k-mers in the sequencing data, and storing them for subsequent calls, can improve detection efficiency.
In one embodiment, as shown in FIG. 5, step 306 includes:
Step 502, obtaining the specific k-mers whose actual number of occurrences is greater than a preset actual occurrence count threshold as the confirmed-occurring specific k-mers.
Each pathogen operation group has a corresponding specific k-mer actual occurrence count record table, from which the actual number of occurrences in the sequencing data of each specific k-mer contained in that group can be obtained. The specific k-mers whose actual number of occurrences is greater than the preset actual occurrence count threshold are then selected as the confirmed-occurring specific k-mers. The preset threshold is a condition value set in advance by a technician, generally a natural number greater than 5. When the actual number of occurrences of a specific k-mer exceeds this threshold, the specific k-mer is considered to be genuinely present in the sequencing data and is treated as a confirmed-occurring specific k-mer.
Step 504, calculating the false positive probability of each pathogen operation group according to the number of confirmed-occurring specific k-mers corresponding to that group.
Once the confirmed-occurring specific k-mers are obtained, they can be assigned to their pathogen operation groups to obtain the number of confirmed-occurring specific k-mers contained in each group. For example, if pathogen operation group A1 contains 1000 specific k-mers but only 400 of them occur in the sequencing data more often than the preset actual occurrence count threshold, then the number of confirmed-occurring specific k-mers for group A1 is 400. The false positive probability of each pathogen operation group is then calculated from this number.
The false positive rate (FPR), also known as the misdiagnosis rate or type I error rate, is the percentage of cases that are actually disease-free but are judged to be diseased by screening; it can also be understood as the probability of obtaining a positive result when the true result is not positive. The false positive probability can be calculated as follows: assuming that n confirmed-occurring specific k-mers are found for a pathogen operation group, the false positive probability of that group is less than or equal to $p^n$. Here p represents a preset error threshold, which may be 1 minus the ratio of the number of occurrences in the occurrence count index table of the genomes of the pathogen operation group to the number of occurrences in the occurrence count index table of the genomes of the complete set. Specifically, $p = 1 - C_1/C_2$, where $C_1$ is the number of occurrences of a given k-mer in the genome occurrence count index table of the corresponding pathogen operation group, and $C_2$ is the number of occurrences of that k-mer in the genome occurrence count index table of the complete set. p may take any value between 0 and 1 and is typically less than 5%. When n is sufficiently large, the calculated false positive probability is very small.
Step 506, selecting the pathogen operation groups whose false positive probability is lower than a preset standard probability as the pathogen operation groups contained in the sequencing data.
After the false positive probability of each pathogen operation group is calculated, the groups whose false positive probability is lower than the preset standard probability are selected as the pathogen operation groups contained in the sequencing data. The preset standard probability is a condition value set in advance by a technician and serves as the criterion for selecting pathogen operation groups on the basis of false positives. A higher false positive probability means a higher probability that concluding the group is present in the sequencing data is an error, so such a group can be excluded; that is, only groups whose false positive probability is lower than the preset standard probability are selected.
Calculating the false positive probability and selecting the qualifying pathogen operation groups accordingly improves the accuracy of detecting the pathogen operation groups contained in the sequencing data.
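Steps 502 to 506 can be illustrated with a short sketch. It assumes the per-group record tables are available as dictionaries mapping k-mer to actual count, and that a single error threshold p is used for every group; the names `confirmed_kmers`, `false_positive_bound` and `select_groups` are hypothetical.

```python
def confirmed_kmers(counts, min_count):
    """Step 502: keep k-mers whose actual count exceeds the preset threshold."""
    return [kmer for kmer, c in counts.items() if c > min_count]


def false_positive_bound(n_confirmed, p):
    """Step 504: upper bound p**n on the false positive probability."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return p ** n_confirmed


def select_groups(group_counts, min_count, p, max_fp):
    """Step 506: keep groups whose bound is below the preset standard probability."""
    selected = {}
    for group, counts in group_counts.items():
        n = len(confirmed_kmers(counts, min_count))
        bound = false_positive_bound(n, p)
        if bound < max_fp:
            selected[group] = bound
    return selected


# Illustrative values only: threshold 5, p = 0.05, standard probability 0.01.
group_counts = {"A1": {"k1": 10, "k2": 7, "k3": 2}, "B1": {"k4": 1}}
selected = select_groups(group_counts, min_count=5, p=0.05, max_fp=0.01)
```

A group with no confirmed-occurring k-mers gets the bound $p^0 = 1$ and is therefore never selected.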
In one embodiment, as shown in FIG. 6, step 506 includes:
Step 602, obtaining the actual number of occurrences of each confirmed-occurring specific k-mer in the sequencing data.
Each pathogen operation group has a corresponding specific k-mer actual occurrence count record table, from which the actual number of occurrences of each specific k-mer in the sequencing data can be obtained. The confirmed-occurring specific k-mers are the specific k-mers whose actual number of occurrences is greater than the preset actual occurrence count threshold, so they are a subset of the specific k-mers, and their actual numbers of occurrences can likewise be read from the record table.
Step 604, calculating, for each confirmed-occurring specific k-mer, its actual occurrence ratio as the ratio of its actual number of occurrences to the sum of the actual numbers of occurrences of all confirmed-occurring specific k-mers.
Step 606, selecting, from the pathogen operation groups whose false positive probability is lower than the preset standard probability, the groups whose confirmed-occurring specific k-mers have actual occurrence ratios that match the expected occurrence ratios, and taking them as the pathogen operation groups contained in the sequencing data.
After the actual number of occurrences of each confirmed-occurring specific k-mer is obtained, the sum of the actual numbers of occurrences of all confirmed-occurring specific k-mers can be calculated. For example, when there are 500 confirmed-occurring specific k-mers with actual numbers of occurrences $S_1, S_2, \ldots, S_{500}$, the sum is $S = S_1 + S_2 + \cdots + S_{500}$. With this sum, the actual occurrence ratio of each confirmed-occurring specific k-mer can be calculated from its own actual number of occurrences: when the actual number of occurrences of a confirmed-occurring specific k-mer is A, its actual occurrence ratio is A/S. In this way, the actual occurrence ratio of every confirmed-occurring specific k-mer can be calculated.
Each pathogen operation group can have a corresponding specific k-mer occurrence ratio table, which records the number of times each specific k-mer contained in the group appears across all genomes of that group. If a specific k-mer appears $C_1$ times in a genome, it is counted $C_1$ times; the table thus records the number of occurrences of each specific k-mer in the genomes. The table also records the occurrence ratio of each specific k-mer, which can be calculated once the occurrence counts are known. Assuming that a specific k-mer occurs $C_1$ times in the corresponding pathogen operation group and that the total number of occurrences of all specific k-mers contained in the group is $C$, the occurrence ratio of that specific k-mer is $C_1/C$.
Therefore, the occurrence ratio of each specific k-mer can be read from the specific k-mer occurrence ratio table and used as its expected occurrence ratio; in particular, the expected occurrence ratio of each confirmed-occurring specific k-mer can be obtained from this table. It follows that the expected occurrence ratios of all specific k-mers contained in a pathogen operation group sum to 1.
The actual number of occurrences of each confirmed-occurring specific k-mer is obtained from the specific k-mer actual occurrence count record table, from which its actual occurrence ratio is calculated. Its expected occurrence ratio is obtained from the specific k-mer occurrence ratio table. Once the confirmed-occurring specific k-mers are known, the false positive probability of each pathogen operation group can be calculated from the number of confirmed-occurring specific k-mers contained in each group. Accordingly, from the pathogen operation groups whose false positive probability is lower than the preset standard probability, the groups whose confirmed-occurring specific k-mers have actual occurrence ratios matching the expected occurrence ratios can be selected.
In other words, a finally selected pathogen operation group must satisfy two conditions: 1. its false positive probability is lower than the preset standard probability; 2. the actual occurrence ratios of the confirmed-occurring specific k-mers contained in the group match their expected occurrence ratios. A pathogen operation group satisfying both conditions can be selected as a pathogen operation group contained in the sequencing data. This further ensures the accuracy of detecting the pathogen operation groups contained in the sequencing data.
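The second condition can be sketched as below. The text does not define what it means for an actual ratio to "match" an expected ratio, so this sketch assumes an absolute tolerance per k-mer; the `tolerance` parameter and the function names are hypothetical.

```python
def actual_ratios(confirmed_counts):
    """Actual ratio A / S for each confirmed-occurring k-mer (step 604)."""
    total = sum(confirmed_counts.values())
    return {kmer: c / total for kmer, c in confirmed_counts.items()}


def ratios_match(confirmed_counts, expected_ratios, tolerance):
    """Assumed matching rule: every actual ratio lies within `tolerance`
    of the expected ratio read from the occurrence ratio table."""
    if not confirmed_counts:
        return False  # nothing confirmed, so nothing can match
    actual = actual_ratios(confirmed_counts)
    return all(abs(actual[k] - expected_ratios[k]) <= tolerance for k in actual)


confirmed = {"k1": 30, "k2": 10}          # actual counts A for two k-mers
expected = {"k1": 0.7, "k2": 0.3}         # ratios from the occurrence ratio table
ok = ratios_match(confirmed, expected, tolerance=0.1)
```

Other matching rules, such as a statistical goodness-of-fit test, would fit the same interface; the choice is left open by the description.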
In one embodiment, as shown in FIG. 7, step 504 includes:
Step 702, when confirmed-occurring specific k-mers belong to the same non-coincident specific region, treating the specific k-mers belonging to that region as a single specific k-mer, a non-coincident specific region being a region that contains specific k-mers whose number of coinciding characters meets a preset coincidence threshold.
Step 704, after the number of confirmed-occurring specific k-mers is determined according to the non-coincident specific regions, calculating the false positive probability of each pathogen operation group according to that number.
Each pathogen operation group may have its own specific k-mer actual occurrence count record table, which records the actual number of occurrences of each specific k-mer contained in the group. The specific k-mers whose actual number of occurrences is greater than the preset occurrence count threshold can therefore be selected according to this table. The preset occurrence count threshold is set in advance by a technician for selecting the required specific k-mers. Each pathogen operation group has corresponding non-coincident specific regions, and a group may have more than one such region.
Each specific k-mer contained in a non-coincident specific region satisfies the end coincidence condition with at least one other specific k-mer in that region. The end coincidence condition means that the number of characters by which two specific k-mers coincide meets a preset coincidence threshold. In other words, two specific k-mers are considered end-coincident if at least j characters coincide at their ends, that is, if a suffix of at least j characters of one specific k-mer is identical to a prefix of the other. j is the acceptable overlap length and can be set according to the needs of the project; it is generally set to a value greater than 5 and no greater than k-1, that is, $5 < j \le k-1$.
It should be noted, however, that for DNA or RNA sequences, because nucleic acid sequences are reverse complementary, checking whether two specific k-mers are end-coincident requires checking not only the two specific k-mers themselves but also their reverse complements. For example, if the two specific k-mers are A and B, with reverse complements A' and B' respectively, then four pairs must be checked: A and B, A' and B, A and B', and A' and B'.
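The four-pair check can be sketched as follows, assuming DNA sequences over the alphabet A, C, G, T and testing both orientations (suffix of one against prefix of the other) for each pair. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def ends_coincide(a, b, j):
    """True if any of the pairs (A,B), (A',B), (A,B'), (A',B') overlaps by >= j."""
    for x in (a, reverse_complement(a)):
        for y in (b, reverse_complement(b)):
            if suffix_prefix_overlap(x, y) >= j or suffix_prefix_overlap(y, x) >= j:
                return True
    return False


# The forward strands share no end, but A overlaps the reverse complement of B.
found = ends_coincide("AAACCG", "AAACGG", 3)
```

For RNA the translation table would need U in place of T; that variant is not shown here.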
When two specific k-mers are found to be end-coincident, they need to be corrected for independence. Since each pathogen operation group has a corresponding specific k-mer actual occurrence count record table, the specific k-mers in each record table can be corrected for independence to obtain the corresponding non-coincident specific regions, and a non-coincident specific region table can be generated for each pathogen operation group from its non-coincident specific regions.
During the independence correction of the specific k-mers contained in each pathogen operation group, end-coincident specific k-mers are assigned to the same non-coincident specific region. The specific k-mers in the same non-coincident specific region may then be spliced into a longer sequence. A sequence obtained by splicing several specific k-mers is longer than k, and its length is not fixed but depends on how the splicing is done. Splicing saves memory space. Alternatively, the specific k-mers satisfying the end coincidence condition may simply be placed in the same non-coincident specific region without splicing. Because the record table holds the actual number of occurrences of each specific k-mer in the sequencing data, the actual number of occurrences of each specific k-mer contained in each non-coincident specific region can be obtained from it.
Therefore, once the specific k-mers whose actual number of occurrences exceeds the preset actual occurrence count threshold have been selected as confirmed-occurring specific k-mers, it can be checked whether any of them belong to the same non-coincident specific region. If so, the confirmed-occurring specific k-mers belonging to the same non-coincident specific region are treated as a single specific k-mer.
For example, suppose there are 10 confirmed-occurring specific k-mers, of which specific k-mers A, B and C belong to the same non-coincident specific region. These three are treated as the same specific k-mer and count as one, so the number of confirmed-occurring specific k-mers is 10 - 3 + 1 = 8. That is, the false positive probability of each pathogen operation group is calculated from the number of specific k-mers obtained after independence correction over the non-coincident specific regions, rather than directly from the number of specific k-mers whose actual number of occurrences exceeds the preset threshold. Independence correction better ensures the accuracy of the false positive calculation for each pathogen operation group and thus further improves the accuracy of identifying the pathogen operation groups actually contained in the sequencing data.
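The counting rule in this example can be written out in a few lines. The sketch assumes a precomputed mapping from k-mer to region identifier (hypothetical name `region_of`); k-mers absent from the mapping are taken to belong to no non-coincident specific region.

```python
def count_independent(confirmed, region_of):
    """Count confirmed-occurring k-mers, counting each region only once."""
    seen_regions = set()
    count = 0
    for kmer in confirmed:
        region = region_of.get(kmer)
        if region is None:
            count += 1                 # stand-alone k-mer
        elif region not in seen_regions:
            seen_regions.add(region)   # first k-mer seen from this region
            count += 1
    return count


# Ten confirmed k-mers, three of which (A, B, C) share region "R1".
confirmed = ["A", "B", "C"] + [f"K{i}" for i in range(7)]
n = count_independent(confirmed, {"A": "R1", "B": "R1", "C": "R1"})
```

The resulting n is the value that would be used as the exponent in the $p^n$ bound.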
In one embodiment, two specific k-mers in a non-coincident specific region that satisfy the preset end coincidence condition are spliced to obtain a new non-coincident specific sequence.
The false positive probability of each pathogen operation group is calculated from the number of specific k-mers in the non-coincident specific regions obtained after independence correction. The independence correction of two end-coincident specific k-mers proceeds as follows.
The characteristic target point sequence set stored in the target database for each pathogen operation group includes a specific k-mer table for that group, which contains the specific k-mers satisfying the preset specificity condition. After the specific k-mer table of a pathogen operation group is obtained, it can be copied to avoid damaging the original data, yielding a copied specific k-mer table. Pairs of specific k-mers are selected in turn from the copied table and subjected to sequence detection; when two specific k-mers are found to satisfy the preset end coincidence condition, they are considered end-coincident. Two end-coincident specific k-mers can be spliced, and the spliced sequence is a new non-coincident specific sequence; this completes the independence correction of the two specific k-mers.
Sequence detection is performed not only between specific k-mers in the copied table but also between a not-yet-spliced specific k-mer and a new non-coincident specific sequence produced by splicing, with the same handling applied. When no two specific k-mers or new non-coincident specific sequences in the copied table satisfy the preset end coincidence condition any longer, all specific k-mers in the copied table are considered to have completed independence correction. The corrected specific k-mers form the non-coincident specific regions. A non-coincident specific region table can be generated for each pathogen operation group from the non-coincident specific regions it contains, and stored in the characteristic target point sequence set of that group in the target database as a data backup. Once stored, the data can be retrieved from the database when needed, which improves detection efficiency.
When two specific k-mers are found to satisfy the preset end coincidence condition, they are considered end-coincident and can be spliced into a spliced sequence. That is, the smallest region covering both specific k-mers replaces the two specific k-mers. Suppose A and B are two end-coincident specific k-mers, where A is ACGGTCATC and B is TCATCCGA. The coinciding part at the end of A and the start of B is TCATC, so the two can be spliced into the smallest region covering both, namely ACGG-TCATC-CGA, and the sequence ACGGTCATCCGA replaces the two specific k-mers A and B.
Sequence detection is then repeated within the non-coincident specific region, between any two specific k-mers, between a specific k-mer and a new non-coincident specific sequence, or between any two new non-coincident specific sequences, until no two sequences in the region satisfy the end coincidence condition. Correcting the specific k-mers by splicing saves memory space. Independence correction of the specific k-mers better ensures the accuracy of the false positive calculation for each pathogen operation group and thus further improves the accuracy of identifying the pathogen operation groups actually contained in the sequencing data.
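The splicing step and its repetition can be sketched as follows. This is a simplified illustration with hypothetical function names: it works on a copy of the input list, considers forward strands only (reverse complements are omitted for brevity), and merges greedily in the order pairs are found.

```python
def splice(a, b, j):
    """Merge a and b if a suffix of a of at least j characters equals a prefix of b."""
    if j < 1:
        raise ValueError("j must be at least 1")
    for length in range(min(len(a), len(b)), j - 1, -1):
        if a.endswith(b[:length]):
            return a + b[length:]
    return None


def merge_all(kmers, j):
    """Repeat splicing until no two sequences satisfy the end coincidence condition."""
    seqs = list(kmers)  # working copy, so the original table is not modified
    merged = True
    while merged:
        merged = False
        for i in range(len(seqs)):
            for m in range(len(seqs)):
                if i == m:
                    continue
                joined = splice(seqs[i], seqs[m], j)
                if joined is not None:
                    # Replace the pair with the smallest sequence covering both.
                    seqs = [s for idx, s in enumerate(seqs) if idx not in (i, m)]
                    seqs.append(joined)
                    merged = True
                    break
            if merged:
                break
    return seqs


regions = merge_all(["ACGGTCATC", "TCATCCGA", "GGGGGGGG"], j=5)
```

Each successful merge shortens the list by one, so the loop always terminates; the sequences that remain play the role of the non-coincident specific regions.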
In one embodiment, as shown in FIG. 8, before the sequencing data of the sample is obtained, the method further includes the following steps:
Step 802, obtaining the number of occurrences of each specific k-mer contained in the pathogen operation group in the genomes contained in the pathogen operation group.
For each pathogen operation group, the number of occurrences of each specific k-mer contained in the group is counted in the genomes contained in the group. If a specific k-mer appears N times in the same genome, it is counted N times. This gives the number of occurrences of each specific k-mer in the pathogen operation group.
Step 804, calculating the occurrence ratio of each specific k-mer as the ratio of its number of occurrences to the sum of the numbers of occurrences of all specific k-mers contained in the pathogen operation group.
Once the number of occurrences of each specific k-mer in the genomes contained in the corresponding pathogen operation group has been calculated, the sum of the numbers of occurrences of the specific k-mers contained in the group is obtained. Assuming that a pathogen operation group contains 500 specific k-mers whose numbers of occurrences in the genomes are $C_1, C_2, \ldots, C_{500}$, the sum of the numbers of occurrences is $S = C_1 + C_2 + \cdots + C_{500}$. The occurrence ratios of the specific k-mers are then $C_1/S, C_2/S, \ldots, C_{500}/S$.
Step 806, generating a specific k-mer occurrence ratio table corresponding to the pathogen operation group according to the number of occurrences and the occurrence ratio of each specific k-mer in the pathogen operation group.
Step 808, storing the specific k-mer occurrence ratio table in the characteristic target point sequence set corresponding to the pathogen operation group.
Once the number of occurrences of each specific k-mer in the genomes contained in the pathogen operation group and the occurrence ratio of each specific k-mer are obtained, a specific k-mer occurrence ratio table is generated for each pathogen operation group from these data. If there are M pathogen operation groups, M specific k-mer occurrence ratio tables are generated, and each is stored in the characteristic target point sequence set of the corresponding pathogen operation group in the target database.
The specific k-mer occurrence ratio table is shown in FIG. 9. The leftmost column records the specific k-mers contained in pathogen operation group X. The second column records the number of occurrences of each specific k-mer in the genomes contained in pathogen operation group X, denoted $C_1, C_2, \ldots$. The third column records the occurrence ratio of each specific k-mer, that is, the ratio of its number of occurrences to the total number of occurrences C, denoted $F_1, F_2, \ldots$.
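Steps 802 to 806 can be sketched as below, assuming genomes are available as plain strings and that overlapping occurrences are counted. The function names are hypothetical, and reverse complements are not considered.

```python
def count_occurrences(genome, kmer):
    """Count occurrences of kmer in genome, including overlapping ones."""
    count = 0
    start = genome.find(kmer)
    while start != -1:
        count += 1
        start = genome.find(kmer, start + 1)
    return count


def occurrence_ratio_table(genomes, specific_kmers):
    """Map each specific k-mer to (occurrence count C_i, occurrence ratio F_i = C_i / S)."""
    counts = {
        kmer: sum(count_occurrences(g, kmer) for g in genomes)
        for kmer in specific_kmers
    }
    total = sum(counts.values())
    return {kmer: (c, c / total if total else 0.0) for kmer, c in counts.items()}


table = occurrence_ratio_table(["ACGTACGT", "ACGTTT"], ["ACGT", "GTTT"])
```

The ratios in such a table sum to 1 whenever at least one specific k-mer occurs, which is what allows them to serve as expected occurrence ratios.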
In one embodiment, as shown in FIG. 3A, step 306 includes the following steps:
Step 306A, obtaining the specific k-mers whose actual number of occurrences is greater than the preset actual occurrence count threshold as the confirmed-occurring specific k-mers.
Each pathogen operation group has a corresponding specific k-mer actual occurrence count record table, from which the actual number of occurrences of each specific k-mer in the sequencing data can be obtained. The confirmed-occurring specific k-mers are the specific k-mers whose actual number of occurrences is greater than the preset actual occurrence count threshold, so they are a subset of the specific k-mers, and their actual numbers of occurrences can likewise be read from the record table.
Step 306B, obtaining the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data.
Simulated test data refers, for a given pathogen operation group, to a set of test data obtained by random sampling, with all genomes in the complete set that do not belong to that group as the data source. The simulated test data has the same or similar characteristics as real data in terms of data volume, error distribution, data format and the like. N sets of simulated test data are generated for each pathogen operation group.
The server obtains the false positive distribution of the specific k-mers contained in each pathogen operation group in the simulated test data corresponding to that group. The false positive distribution is generated in advance from the number of kinds of specific k-mers of each pathogen operation group found in the corresponding simulated test data. The number of kinds of specific k-mers refers to the number of distinct specific k-mers that appear in the simulated test data; a specific k-mer that appears several times in one set of simulated test data is counted only once.
Step 306C, comparing the number of kinds of confirmed-occurring specific k-mers corresponding to each pathogen operation group with the false positive distribution of the specific k-mers contained in that group in the simulated test data, to obtain the false positive detection probability of each pathogen operation group.
The server compares the number of kinds of confirmed-occurring specific k-mers of each pathogen operation group with the false positive distribution of the specific k-mers contained in that group in the simulated test data, and obtains the false positive detection probability of each pathogen operation group.
Step 306D, selecting the pathogen operation groups whose false positive detection probability is lower than a preset threshold as the pathogen operation groups contained in the sequencing data.
The server selects the pathogen operation groups whose false positive detection probability is lower than the preset threshold as the pathogen operation groups contained in the sequencing data. If the false positive detection probability of a pathogen operation group is lower than the preset threshold, the detection signal of that group in the sequencing data is considered not to be a false positive, and the group is selected as a pathogen operation group contained in the sequencing data. The preset threshold may be a value less than 0.05.
In the above embodiment, the false positive distribution of each pathogen operation group is obtained from the generated simulated test data, and the pathogen operation groups contained in the sequencing data are obtained from this distribution and the confirmed-occurring specific k-mers. Calculating the false positive probability of each pathogen operation group from simulated test data improves the accuracy of detecting the pathogen operation groups contained in the sequencing data.
In one embodiment, step 306B includes the following steps:
Step 306B1, obtaining the simulated test data corresponding to each pathogen operation group, the simulated test data being obtained by random sampling from the genomes in the complete set that do not belong to the pathogen operation group to which the simulated test data corresponds.
The server obtains the simulated test data corresponding to each pathogen operation group according to the pathogen operation groups contained in the complete set; for each group, the simulated test data is obtained by random sampling from the genomes in the complete set that do not belong to that group. N sets of false positive simulated test data are generated for each pathogen operation group, where the value of N depends on the false positive resolution required for the final detection: if a false positive distribution with a resolution of 1% is desired, N must be no less than 100; if a resolution of one thousandth is desired, N must be no less than 1000.
Step 306B2, calculating the number of kinds of specific k-mers contained in each pathogen operation group that appear in the corresponding simulated test data.
Step 306B3, obtaining the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data according to that number of kinds.
The server calculates, for each pathogen operation group, the number of kinds of its specific k-mers that appear in each corresponding set of simulated test data, and from these numbers obtains the false positive distribution of the specific k-mers of that group in the corresponding simulated test data. Using the N sets of false positive simulated test data of a pathogen operation group, a distribution containing N data points is generated, each data point being the number of kinds of specific k-mers found in one set of simulated test data. This distribution is the specific k-mer false positive distribution of the pathogen operation group.
In this embodiment, simulated test data is generated for each pathogen operation group, the number of kinds of specific k-mers of each group found in the corresponding simulated test data is calculated, and the false positive distribution of the specific k-mers of each group in the corresponding simulated test data is thereby obtained. Generating simulated test data makes the false positive distribution easy to obtain, so that the false positive detection probability of a pathogen operation group in the sequencing data can be calculated, which improves the accuracy of that probability.
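The simulation-based procedure can be sketched as follows. The description does not state how the observed number is compared with the distribution, so the sketch assumes an empirical tail probability: the fraction of simulated data sets in which at least as many kinds of specific k-mers were found as in the real sample. The simulated read sets are taken as given inputs, and all function names are hypothetical.

```python
def distinct_kmers_found(reads, specific_kmers, k):
    """Number of kinds of specific k-mers seen in the reads (each counted once)."""
    wanted = set(specific_kmers)
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            if window in wanted:
                found.add(window)
    return len(found)


def false_positive_distribution(simulated_sets, specific_kmers, k):
    """One data point per simulated set: N sets give N data points."""
    return sorted(distinct_kmers_found(reads, specific_kmers, k) for reads in simulated_sets)


def false_positive_probability(observed, distribution):
    """Assumed comparison: fraction of simulated points >= the observed number."""
    at_least = sum(1 for value in distribution if value >= observed)
    return at_least / len(distribution)


simulated = [["ACGTACGT"], ["GGGGGG"], ["TTTTT", "ACGTT"]]
dist = false_positive_distribution(simulated, ["ACGT", "CGTA", "TTTT"], 4)
prob = false_positive_probability(3, dist)
```

With N simulated sets the smallest non-zero probability this estimate can report is 1/N, which is consistent with the requirement that N be at least 100 for a resolution of 1%.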
In one embodiment, as shown in FIG. 3a, step 306 comprises the following steps:
step 306a, acquiring the specific k-mers whose actual occurrence count is greater than a preset actual occurrence count threshold as the confirmed specific k-mers, and acquiring the number of confirmed specific k-mers.
The specific k-mers whose actual occurrence count is greater than the preset threshold are obtained from the record table of actual occurrence counts of specific k-mers and taken as the confirmed specific k-mers, and the number of distinct confirmed specific k-mers is then counted.
Step 306b, acquiring the number of specific k-mers contained in each pathogen operation group.
Step 306c, calculating the concentration of confirmed specific k-mers for each pathogen operation group from the number of confirmed specific k-mers and the number of specific k-mers contained in that pathogen operation group.
For each pathogen operation group to which confirmed specific k-mers belong, the number of specific k-mers contained in the group is obtained from the group's record table of actual occurrence counts of specific k-mers. The ratio of the number of confirmed specific k-mers to the number of specific k-mers contained in the pathogen operation group, i.e., the DUKP (Detected Unit K-mer percent), is then calculated, which gives the concentration of confirmed specific k-mers for each pathogen operation group.
Step 306d, selecting the pathogen operation groups whose concentration of confirmed specific k-mers is higher than a preset threshold as the pathogen operation groups contained in the sequencing data.
The concentration of confirmed specific k-mers is compared with a preset threshold, which is generally a value greater than 0.05. If the concentration is lower than the preset threshold, the detection signal of the corresponding pathogen operation group in the sequencing data is considered a suspected false positive, and the group is not selected. If the concentration is higher than the preset threshold, the detection signal is considered not to be a false positive, and the group is selected as a pathogen operation group contained in the sequencing data.
In the above embodiment, the pathogen operation groups contained in the sequencing data can, in certain cases, be determined from the ratio of the number of specific k-mers confirmed to appear in the sequencing data to the number of all specific k-mers of the pathogen operation group, so that the pathogen operation groups contained in the sequencing data are determined more quickly and efficiency is improved.
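By way of example only, the DUKP calculation and threshold comparison of steps 306a to 306d may be sketched as below. The names `dukp` and `select_groups`, the dictionary layout and the numeric thresholds are hypothetical and chosen for illustration.

```python
def dukp(actual_counts, group_specific_kmers, min_count):
    """Ratio of confirmed specific k-mers to all specific k-mers of one group."""
    confirmed = [kmer for kmer in group_specific_kmers
                 if actual_counts.get(kmer, 0) > min_count]
    return len(confirmed) / len(group_specific_kmers)


def select_groups(actual_counts, specific_by_group, min_count, dukp_threshold):
    """Keep the groups whose DUKP is higher than the preset threshold."""
    return [group for group, kmers in specific_by_group.items()
            if kmers and dukp(actual_counts, kmers, min_count) > dukp_threshold]


# Toy data (hypothetical): observed counts of specific k-mers in the sequencing data.
counts = {"ACG": 5, "CGT": 1, "GGA": 3}
groups = {"X": {"ACG", "CGT", "GTA", "TAC"}, "Y": {"GGA", "GAT"}}
selected = select_groups(counts, groups, min_count=1, dukp_threshold=0.3)
```

In this toy case group X has one confirmed k-mer out of four (DUKP 0.25) and is rejected, while group Y has one out of two (DUKP 0.5) and is selected.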
In a specific embodiment, the method for detecting a pathogen operation group may comprise:
The server obtains sequencing data of the sample, obtains the characteristic target sequence set corresponding to each pathogen operation group stored in the target database, and obtains the number of occurrences in the sequencing data of the specific k-mers contained in each characteristic target sequence set. The specific k-mers whose actual occurrence count is greater than a preset actual occurrence count threshold are taken as the confirmed specific k-mers. The false positive distribution of the specific k-mers of each pathogen operation group in the corresponding simulated test data is obtained, and the number of confirmed specific k-mers of each pathogen operation group is compared with that distribution to obtain the false positive detection probability of each pathogen operation group. The pathogen operation groups whose false positive detection probability is lower than a preset threshold are selected as the pathogen operation groups suspected to be contained in the sequencing data. The number of specific k-mers contained in each suspected pathogen operation group is then obtained, together with the number of occurrences in the sequencing data of the specific k-mers contained in the characteristic target sequence set of each suspected pathogen operation group.
The specific k-mers whose actual occurrence count is greater than a preset actual occurrence count threshold are taken as the confirmed specific k-mers, and their number is counted. The ratio of the number of confirmed specific k-mers to the number of specific k-mers contained in each suspected pathogen operation group is calculated, and the suspected groups whose ratio is higher than a preset threshold are selected as the pathogen operation groups possibly contained in the sequencing data. The relative concentration of each possibly contained pathogen operation group is then obtained, and the groups whose relative concentration is greater than a preset threshold are selected as the pathogen operation groups confirmed to be contained in the sequencing data. In this embodiment, determining the pathogen operation groups contained in the sequencing data from the false positive detection probability, the concentration of confirmed specific k-mers, and the relative concentration further improves the detection accuracy.
In one embodiment, as shown in FIG. 10, after step 108, the method further includes:
step 1002, obtaining, from the target database, medical information for each pathogen operation group contained in the sequencing data.
Step 1004, generating a final pathogen operation group list based on the medical information of each pathogen operation group contained in the sequencing data.
Step 1006, outputting the final pathogen operation group list to the detection result of the sequencing data.
The target database also stores medical information for each pathogen operation group. The medical information covers medical, clinical, biological, pathological and other information related to the pathogen operation group. For a bacterium, for example, the medical information includes its morphology, metabolic characteristics, aerobic or anaerobic character, Gram stain, the common diseases it causes, conventional detection methods, conventional therapeutic drugs and the like. Once the pathogen operation groups contained in the sequencing data have been obtained, the medical information of each of them can be retrieved from the target database and a final pathogen operation group list can be generated. The list contains the pathogen operation groups contained in the sequencing data together with the medical information corresponding to each of them. The final pathogen operation group list is output to the detection result of the sequencing data and can serve as part of the diagnostic result for the sequencing data, for reference by technical personnel.
In one embodiment, as shown in FIG. 11, after step 108, the method further includes:
step 1102, obtaining the total occurrence count CT of the k-mers present in the sequencing data.
After the pathogen operation groups contained in the sequencing data are obtained, their relative concentrations can also be calculated. Sequencing a sample involves multiple steps such as sample preparation, sequencing and signal processing, so the absolute concentration of a pathogen operation group in the sample is difficult to calculate accurately; the relative concentration is estimated instead.
The relative concentration is estimated from k-mers in general rather than from specific k-mers only, so the k-mers contained in each pathogen operation group are obtained, along with the total occurrence count CT of k-mers in the sequencing data. Specifically, if the sequencing data comprises M pieces of data and each piece comprises n characters, each piece contains n - k + 1 k-mers, and summing over the M pieces gives CT = M × (n - k + 1). To speed up the calculation, CT is computed directly, without recording which k-mers appear or what their sequences are.
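A minimal sketch of this direct calculation is given below, assuming the sequencing data is available as a list of read strings. The function name and the handling of reads shorter than k (counted as contributing zero k-mers) are assumptions of the sketch; reads of unequal length are handled by summing per read, which reduces to M × (n - k + 1) when all reads have length n.

```python
def total_kmer_occurrences(reads, k):
    """Sum n - k + 1 over all reads without enumerating any k-mer sequence."""
    # Assumption: a read shorter than k contributes no k-mers.
    return sum(max(len(read) - k + 1, 0) for read in reads)


# Toy data (hypothetical): two reads of length 10 and one read shorter than k.
ct = total_kmer_occurrences(["A" * 10, "C" * 10, "ACGT"], k=5)
```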
Step 1104, obtaining the estimated total actual occurrence count CF of the k-mers contained in each pathogen operation group contained in the sequencing data, where CF is the sum of the estimated occurrence counts in the sequencing data of the k-mers contained in the pathogen operation group.
Step 1106, calculating the relative concentration of each pathogen operation group from the total occurrence count CT and the estimated total actual occurrence count CF.
To calculate the relative concentration of each pathogen operation group, the total occurrence count CT of the k-mers appearing in the sequencing data and the estimated total actual occurrence count CF of the k-mers contained in the pathogen operation group are obtained; the ratio of CF to CT is the relative concentration of the pathogen operation group. CF is the sum of the estimated occurrence counts in the sequencing data of all k-mers contained in the pathogen operation group; as the name suggests, it is an estimate rather than a measured value.
In this way, the relative concentration of each pathogen operation group is estimated from the estimated occurrence counts of the group's k-mers in the sequencing data and the total occurrence count of k-mers in the sequencing data.
In one embodiment, as shown in FIG. 12, step 1104 comprises:
step 1202, acquiring the pathogen operation groups contained in the sequencing data as the final pathogen operation groups.
When pathogen operation groups are selected because the occurrence counts in the sequencing data of the specific k-mers in their characteristic target sequence sets exceed a preset threshold, the selected groups are taken as the pathogen operation groups contained in the sequencing data. For ease of reference, these are called the final pathogen operation groups.
Step 1204, acquiring the corpus proportion of each k-mer contained in each final pathogen operation group, where the corpus proportion is the ratio of the occurrence count CG of the k-mer in the corresponding pathogen operation group to its occurrence count CB in the corpus.
The characteristic target sequence set of each pathogen operation group stored in the target database also includes a k-mer occurrence count-to-corpus proportion table for that group, which records the corpus proportion of every k-mer contained in the group, and hence of every k-mer contained in a final pathogen operation group. The corpus proportion of a k-mer is the ratio of its occurrence count in the pathogen operation group in which it is located to its occurrence count in all pathogen operation groups.
Step 1206, calculating the estimated actual occurrence count of each k-mer from its corpus proportion and its actual occurrence count.
Step 1208, calculating the estimated total actual occurrence count CF of the k-mers contained in each pathogen operation group contained in the sequencing data from the estimated actual occurrence count of each k-mer.
After the actual occurrence counts in the sequencing data of the k-mers contained in each final pathogen operation group are obtained, a k-mer actual occurrence table can be generated for each final pathogen operation group. The actual occurrence count of a k-mer is the number of times it actually appears in the sequencing data: if the k-mer appears N times, its actual occurrence count is N.
Once the k-mer actual occurrence table of each final pathogen operation group has been generated, the actual occurrence count of each k-mer is read from it. The estimated actual occurrence count of each k-mer is then calculated from that count and the k-mer's corpus proportion. After the estimated actual occurrence counts of all k-mers contained in each final pathogen operation group are obtained, a k-mer estimated actual occurrence table is generated for each final pathogen operation group.
In the k-mer estimated actual occurrence table shown in FIG. 13, the leftmost column records the k-mers contained in pathogen operation group X. The second column records the corpus proportion of each k-mer, denoted F1, F2, …; these values come from the k-mer occurrence count-to-corpus proportion table of the final pathogen operation group in which the k-mer is located. The third column records the actual occurrence count of each k-mer in the sequencing data, denoted C1, C2, …. The fourth column records the estimated actual occurrence count of each k-mer, F1×C1, F2×C2, …, i.e., the product of the k-mer's corpus proportion and its actual occurrence count.
After the k-mer estimated actual occurrence table of each final pathogen operation group has been established, the estimated total actual occurrence count CF of all k-mers contained in that final pathogen operation group, i.e., in a pathogen operation group contained in the sequencing data, can be obtained from the table by summing the estimated actual occurrence counts it records.
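For illustration, the table of FIG. 13 and the resulting relative concentration may be sketched as follows, under the assumption that the corpus proportions F and the actual occurrence counts C are held in dictionaries keyed by k-mer. The function names and the numbers are hypothetical.

```python
def estimated_total_occurrences(corpus_proportion, actual_counts):
    """CF: sum of F_i x C_i over the k-mers of one final pathogen operation group."""
    return sum(f * actual_counts.get(kmer, 0)
               for kmer, f in corpus_proportion.items())


def relative_concentration(cf, ct):
    """Relative concentration of the group: ratio of CF to CT."""
    return cf / ct if ct else 0.0


# Toy data (hypothetical): F and C columns for pathogen operation group X.
F = {"ACG": 1.0, "CGT": 0.5, "GTA": 0.25}
C = {"ACG": 4, "CGT": 6, "GTA": 8}
cf = estimated_total_occurrences(F, C)        # 4.0 + 3.0 + 2.0
concentration = relative_concentration(cf, ct=90)
```

Weighting each count by F discounts k-mers that are shared with other genomes in the corpus, so a k-mer found mostly outside group X contributes little to that group's CF.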
In one embodiment, as shown in FIG. 14, step 1204 comprises:
step 1402, acquiring the k-mer occurrence count record table corresponding to each final pathogen operation group.
Step 1404, acquiring the k-mers contained in the k-mer occurrence count record table.
The final pathogen operation groups are the pathogen operation groups contained in the sequencing data. The target database stores a characteristic target sequence set for each pathogen operation group, and each set includes the k-mer occurrence count record table of that group. The table records all k-mers occurring in the genomes of the pathogen operation group together with their occurrence counts in those genomes; a k-mer that appears N times in the same genome is counted N times.
After the k-mer occurrence count record table of each final pathogen operation group, i.e., of each pathogen operation group contained in the sequencing data, is acquired, the k-mers contained in each table are obtained. Although the table records both the k-mers and their occurrence counts, only the list of k-mers is needed here, so only that list is read from the table.
Step 1406, acquiring the k-mer occurrence count-to-corpus proportion table corresponding to each final pathogen operation group, which includes, for each k-mer, the ratio of its occurrence count in the pathogen operation group to its occurrence count in the corpus.
Step 1408, acquiring, from the k-mer occurrence count-to-corpus proportion table, the corpus proportion of each k-mer contained in the k-mer occurrence count record table.
In the target database, the characteristic target sequence set of each pathogen operation group further includes the group's k-mer occurrence count-to-corpus proportion table. Each such table contains the occurrence count CG of each k-mer in the corresponding pathogen operation group, the corpus occurrence count CB of each k-mer, and the corpus proportion F of each k-mer. The corpus proportion of each k-mer can therefore be read from this table.
After the k-mer occurrence count record table of each final pathogen operation group is acquired and its k-mers are obtained, the corpus proportion of each of those k-mers is looked up in the corresponding k-mer occurrence count-to-corpus proportion table.
In one embodiment, as shown in FIG. 15, before step 1206, the method further includes:
step 1502, acquiring the k-mers contained in each final pathogen operation group from the k-mer occurrence count record table corresponding to that group.
Step 1504, generating a k-mer occurrence union list from the k-mer occurrence count record tables of the pathogen operation groups.
Step 1506, acquiring the actual occurrence count in the sequencing data of each k-mer contained in the k-mer occurrence union list.
Step 1508, acquiring the actual occurrence count of each k-mer contained in each final pathogen operation group.
Step 1510, generating the k-mer actual occurrence table of each final pathogen operation group from the actual occurrence count of each k-mer.
The target database stores the characteristic target sequence set of each pathogen operation group; each set includes the group's k-mer occurrence count record table, which records the k-mers contained in the group and the occurrence count of each k-mer in the genomes of the group. A final pathogen operation group is simply a pathogen operation group contained in the sequencing data under another name, so each final pathogen operation group has its own k-mer occurrence count record table. Once the k-mer occurrence count record tables of the pathogen operation groups in the target database are available, those of the final pathogen operation groups are available as well.
Although the k-mer occurrence count record table records both the k-mers of each pathogen operation group and their occurrence counts, only the list of k-mers is needed here. To speed up subsequent operations, a k-mer occurrence union list can be generated from the k-mers in the k-mer occurrence count record tables of the pathogen operation groups, i.e., the k-mer lists of the individual groups are merged into one overall list.
The actual occurrence count in the sequencing data of each k-mer in the k-mer occurrence union list is then obtained. From these counts, the actual occurrence count of each k-mer in each final pathogen operation group is obtained, and the k-mer actual occurrence table of each final pathogen operation group is generated. The estimated actual occurrence count of each k-mer in a pathogen operation group can then be calculated from the k-mer's corpus proportion and its actual occurrence count recorded in the k-mer actual occurrence table, yielding the k-mer estimated actual occurrence table of that group.
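One possible way to realize steps 1504 to 1510 is sketched below: the k-mer lists of all groups are merged into a single set, the sequencing data is scanned once against that set, and the counts are then split back into one table per group. The function name and data layout are assumptions made for this sketch.

```python
from collections import Counter


def actual_occurrence_tables(reads, kmers_by_group, k):
    """Scan the reads once against the union list, then build per-group tables."""
    union = set().union(*kmers_by_group.values())   # k-mer occurrence union list
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in union:
                counts[kmer] += 1
    # A k-mer never seen in the reads gets an actual occurrence count of 0.
    return {group: {kmer: counts[kmer] for kmer in kmers}
            for group, kmers in kmers_by_group.items()}


# Toy data (hypothetical): two reads and two groups sharing the k-mer "CGT".
tables = actual_occurrence_tables(
    ["ACGTA", "CGTT"], {"X": {"ACG", "CGT"}, "Y": {"CGT", "TTT"}}, k=3)
```

Scanning once against the union avoids re-reading the sequencing data for every group, which is the speed-up the union list is meant to provide.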
In one embodiment, as shown in FIG. 16, before obtaining the sequencing data of the sample, the method further comprises the following steps:
step 1602, acquiring the occurrence count CG, in the pathogen operation group, of each k-mer contained in the pathogen operation group.
The occurrence count CG of a k-mer in a pathogen operation group is the number of times the k-mer occurs in the genomes of the pathogen operation group in which it is located; a k-mer that occurs N times in the same genome is counted N times. CG can be read from the k-mer occurrence count record table of each pathogen operation group, which records the occurrence count in the group of every k-mer the group contains.
Step 1604, acquiring the occurrence count CB, in the corpus, of each k-mer contained in the pathogen operation group.
The corpus occurrence count CB of a k-mer is the number of times the k-mer occurs in the genomes contained in the corpus of the target database. CB can be read from the corpus k-mer occurrence count record table, which records the occurrence count of each k-mer in the corpus.
Step 1606, calculating the corpus proportion F of each k-mer as the ratio of its occurrence count CG to its corpus occurrence count CB.
Step 1608, generating the k-mer occurrence count-to-corpus proportion table of each pathogen operation group from the corpus proportions F.
Step 1610, storing the k-mer occurrence count-to-corpus proportion table in the characteristic target sequence set of the corresponding pathogen operation group.
After the occurrence count CG and the corpus occurrence count CB of each k-mer are obtained, the corpus proportion F of each k-mer is calculated as F = CG/CB. The k-mer occurrence count-to-corpus proportion table of each pathogen operation group is then generated from the data of the k-mers contained in that group.
FIG. 17 shows a k-mer occurrence count-to-corpus proportion table. The leftmost column records the k-mers contained in pathogen operation group X. The second column records the occurrence count CG of each k-mer in the group, CG1, CG2, …. The third column records the corpus occurrence count CB of each k-mer, CB1, CB2, …. The fourth column records the corpus proportion F of each k-mer, calculated from CG and CB as F1 = CG1/CB1, F2 = CG2/CB2, ….
After the k-mer occurrence count-to-corpus proportion table of each pathogen operation group is generated, it can be stored in the characteristic target sequence set of that group in the target database. The stored data can then be retrieved from the target database whenever it is needed, which further improves detection efficiency.
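As a non-authoritative sketch, the table of FIG. 17 may be built as follows, assuming the group counts CG and the corpus counts CB are available as dictionaries keyed by k-mer. The function name and the column keys are hypothetical.

```python
def corpus_proportion_table(group_counts, corpus_counts):
    """Build rows (CG, CB, F = CG / CB) for every k-mer of one pathogen operation group."""
    table = {}
    for kmer, cg in group_counts.items():
        cb = corpus_counts[kmer]          # CB >= CG because the group is part of the corpus
        table[kmer] = {"CG": cg, "CB": cb, "F": cg / cb}
    return table


# Toy data (hypothetical): "ACG" occurs only in group X, "CGT" mostly elsewhere.
table_x = corpus_proportion_table({"ACG": 3, "CGT": 2},
                                  {"ACG": 3, "CGT": 8, "TTT": 5})
```

A k-mer that occurs only in the group has F = 1, while a k-mer shared widely across the corpus has F close to 0.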
In one embodiment, after obtaining the sequencing data of the sample, the method further comprises: acquiring the actual occurrence count in the sequencing data of each k-mer contained in each pathogen operation group; and generating the k-mer actual occurrence table of each pathogen operation group from these counts.
After the sequencing data of the sample is obtained, the actual occurrence count in the sequencing data of each k-mer contained in each pathogen operation group can be obtained, and a k-mer actual occurrence table is generated for each pathogen operation group. When the actual occurrence counts are needed later, they can be read from the k-mer actual occurrence table, which makes estimating the relative concentration of the pathogen operation group more efficient.
In one embodiment, as shown in FIG. 18, before obtaining the sequencing data of the sample, the method further comprises:
step 1802, acquiring the occurrence count, in each pathogen operation group, of each k-mer contained in that group.
Step 1804, generating the k-mer occurrence count record table of each pathogen operation group from the occurrence count of each k-mer in the group.
Step 1806, storing the k-mer occurrence count record table in the characteristic target sequence set of the corresponding pathogen operation group.
The occurrence count of each k-mer in its pathogen operation group is obtained as follows: if a k-mer occurs x times in a genome of the pathogen operation group, x is added to the counter for that k-mer in the k-mer occurrence count record table. After the occurrence counts of all k-mers contained in each pathogen operation group are obtained, a k-mer occurrence count record table is generated for each group; if the target database contains M pathogen operation groups, M tables are created. Each table is stored in the characteristic target sequence set of its pathogen operation group in the target database for subsequent use.
Acquiring the occurrence count CG, in the pathogen operation group, of each k-mer contained in the pathogen operation group then comprises: reading CG from the k-mer occurrence count record table of the pathogen operation group.
Therefore, whenever the occurrence count CG of a k-mer in the pathogen operation group in which it is located is needed, it can be read from the k-mer occurrence count record table of that group, which makes estimating the relative concentration of the pathogen operation group more efficient.
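The counting rule described above may be sketched as follows, assuming each pathogen operation group is given as a list of genome strings. The function name is hypothetical, and the sketch ignores details the description does not specify, such as reverse complements.

```python
from collections import Counter


def kmer_occurrence_table(genomes, k):
    """Record, for one pathogen operation group, how often each k-mer occurs."""
    table = Counter()
    for genome in genomes:
        for i in range(len(genome) - k + 1):
            # A k-mer occurring x times in a genome adds x to its counter.
            table[genome[i:i + k]] += 1
    return table


# Toy data (hypothetical): a group made of two short genomes.
cg_table = kmer_occurrence_table(["ACGACG", "ACGT"], k=3)
```

Applying the same function to all genomes of the corpus would give the corpus occurrence counts CB, and running it once per group gives the M tables mentioned above.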
In one embodiment, as shown in FIG. 19, before obtaining the sequencing data of the sample, the method further comprises:
step 1902, acquiring the corpus occurrence count of each k-mer contained in each pathogen operation group.
Step 1904, generating the corpus k-mer occurrence count record table from the corpus occurrence count of each k-mer.
Step 1906, storing the corpus k-mer occurrence count record table in the target database.
The corpus occurrence count of a k-mer is the number of times the k-mer occurs in all pathogen operation groups and non-pathogen operation groups in the corpus, i.e., in all genomes contained in the corpus. After the corpus occurrence counts of the k-mers contained in each pathogen operation group are obtained, the corpus k-mer occurrence count record table can be generated; it records the occurrence counts of all k-mers in the corpus. The table is then stored in the target database for subsequent use.
Acquiring the corpus occurrence count CB of each k-mer contained in the pathogen operation group then comprises: reading CB from the corpus k-mer occurrence count record table.
Because the corpus k-mer occurrence count record table is stored in the target database, the corpus occurrence count CB of each k-mer contained in a pathogen operation group can be read from that table whenever it is needed.
In one embodiment, as shown in fig. 20, there is provided a method of detecting a pathogen operating group, comprising the steps of:
Step 2002, establish a set of characteristic target sequences corresponding to each pathogen operation group.
Further, establishing a feature target sequence set corresponding to each pathogen operation group, as shown in fig. 21, includes the following steps:
Step 2002A, collect and curate high-confidence genomes.
When establishing the feature target sequence set for each pathogen operation group, high-confidence genome data must first be collected and curated. The high-confidence genomes may include both pathogen and non-pathogen genomes, for example high-confidence genomes of symbiotic bacteria, probiotics, humans, animals and plants. They may be drawn from the NCBI RefSeq dataset or from other public or private sources of high-confidence genomes.
High-confidence genomes can be identified and screened in the following three ways:
1. Screening by the proportion of non-deterministic characters in a piece of genome data. For DNA or RNA sequences, non-deterministic characters are all characters other than the deterministic characters A, C, G, T and U; for protein sequences, they are all characters other than the defined amino acid characters. For a DNA genome, for example, the proportion of non-deterministic characters is the proportion of non-ACGT characters, and a piece of DNA genome data whose proportion of non-ACGT characters is too high is suspected of being a low-confidence genome.
2. Screening by the number of genome data fragments that make up a complete chromosome. If too many fragments belong to the same chromosome, the genome is suspected of being a low-confidence genome.
3. Screening by average whole-genome coverage percentage. The genome is aligned at whole-genome level against several closely related genomes (for example, genomes whose genetic distance from it is below a certain threshold) to determine its average whole-genome coverage percentage among those genomes. A genome whose average coverage percentage is too low is suspected of low completeness, that is, low confidence. Genetic distance is an index that measures the overall genetic difference between species (or individuals).
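The first two screening criteria can be sketched as follows. The cut-off values `max_ratio` and `max_fragments` are hypothetical, since the text only says "too high" and "too many", and the third criterion is omitted because it requires whole-genome alignment.

```python
def nondeterministic_ratio(seq, alphabet="ACGT"):
    """Proportion of characters in seq that are not in the deterministic alphabet."""
    if not seq:
        return 1.0
    allowed = set(alphabet)
    bad = sum(1 for ch in seq.upper() if ch not in allowed)
    return bad / len(seq)

def is_suspected_low_confidence(fragments, max_ratio=0.01, max_fragments=100):
    """Flag a chromosome given as a list of fragments.

    max_ratio and max_fragments are illustrative thresholds, not values
    taken from the source.
    """
    joined = "".join(fragments)
    too_ambiguous = nondeterministic_ratio(joined) > max_ratio
    too_fragmented = len(fragments) > max_fragments
    return too_ambiguous or too_fragmented

clean = ["ACGT" * 10]
noisy = ["ACGTNNNN"]
```

With these thresholds, `clean` passes both checks, while `noisy` is flagged because half of its characters are non-ACGT.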
All the pooled high-confidence genomes can be collectively referred to as a corpus.
Step 2002B, determine the pathogen operation groups contained in the corpus.
A pathogen operation group may include one or more related genomes. Depending on clinical needs and/or taxonomic support, a pathogen operation group may therefore represent a genetic or taxonomic unit at any level of classification, such as a genus, a species, a subspecies, a subtype or a strain. When the pathogen operation groups are established, the related medical, clinical, biological and pathological information for each group can be determined at the same time. For a bacterium, for example, this may include its morphology, metabolic properties, aerobic or anaerobic nature, Gram stain, the diseases it commonly causes, conventional detection methods and conventional therapeutic agents.
Step 2002C, generating a genome occurrence index table of the corpus.
Using the corpus, a genome occurrence index table of the corpus can be generated; it records, for each k-mer contained in the corpus, the number of genomes in the corpus in which that k-mer occurs. A k-mer is a genomic subsequence of length k; k can be user-defined and generally ranges from 11 to 32. If genome data contain a total of a different deterministic characters, then for a particular k there are a^k possible different k-mers.
For example, DNA has four deterministic characters, A, C, G and T, so for a particular k there are 4^k possible different k-mers. A genome of length n contains at most n-k+1 different k-mers. Because genomes contain repetitive regions, however, the number of different k-mers in a genome of n characters is commonly much smaller than n-k+1. With conventional k-mer counting methods, a k-mer that appears several times in a given genome is counted several times. In the genome occurrence index table of the corpus established in this example, by contrast, a k-mer that occurs more than once in one genome is still counted only once. The count recorded for a k-mer in this table therefore represents the number of genomes in the corpus in which the k-mer occurs.
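The once-per-genome counting rule can be illustrated with a short sketch. The function name and the toy sequences are hypothetical, and reverse complements are left out here.

```python
from collections import Counter

def genome_occurrence_index(genomes, k):
    """For each k-mer, count the number of genomes in which it occurs.

    A k-mer that repeats inside one genome contributes only 1 for that genome.
    """
    index = Counter()
    for seq in genomes:
        distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        index.update(distinct)
    return index

genomes = ["ACGTACGT", "ACGTTT"]
index = genome_occurrence_index(genomes, k=4)
distinct_in_first = {genomes[0][i:i + 4] for i in range(len(genomes[0]) - 4 + 1)}
```

The first genome has n-k+1 = 5 windows but only 4 distinct k-mers, because `ACGT` repeats; `ACGT` is nevertheless indexed with a count of 2, one per genome.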
If DNA or RNA genomic sequences are used, then because nucleic acid sequences are reverse complementary, the reverse complement A' of a k-mer A should be regarded as having occurred whenever A occurs, and both A and A' should be recorded in the table. In the subsequent steps, for k-mers of DNA or RNA sequences, whenever an operation is described for a k-mer A, its reverse complement A' is by default also covered and the corresponding operation is performed on it.
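A minimal sketch of the reverse-complement rule for DNA k-mers follows; the helper names are hypothetical, and RNA would need a translation table with U in place of T.

```python
# Translation table mapping each DNA base to its complement.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complement A' of a DNA k-mer A."""
    return kmer.translate(_DNA_COMPLEMENT)[::-1]

def kmer_with_reverse_complement(kmer):
    """Return the set {A, A'} that is recorded whenever A occurs."""
    return {kmer, reverse_complement(kmer)}
```

For example, `reverse_complement("AACG")` gives `"CGTT"`, while a palindromic k-mer such as `ACGT` is its own reverse complement, so only one entry is recorded for it.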
Step 2002D, generating a genome occurrence index table corresponding to each pathogen operation group.
The genome occurrence index table of a pathogen operation group differs from the genome occurrence index table of the corpus in step 2002C. The corpus table records the number of genomes, across the whole corpus, in which a k-mer occurs. The table of a pathogen operation group records, for each k-mer contained in that group, the number of genomes of that group in which the k-mer occurs.
Step 2002E, generate a specific k-mer table for each pathogen operation group.
The specific k-mer table of each pathogen operation group records the k-mers of that group that satisfy a preset specificity condition, that is, the specific k-mers. A k-mer is selected as a specific k-mer only if it satisfies both of the following conditions:
1. If the pathogen operation group contains N genomes and the number of occurrences of the k-mer in the genome occurrence index table of the pathogen operation group is C1, then C1/N + P1 ≥ 1 must hold; that is, the ratio of the number of occurrences in the genome occurrence index table of the pathogen operation group to the number of genomes in the group, plus the first threshold P1, is 1 or more. The first threshold P1 is typically less than 5%.
2. If the number of occurrences of the k-mer in the genome occurrence index table of the pathogen operation group is C1 and its number of occurrences in the genome occurrence index table of the corpus is C2, then C1/C2 + P2 ≥ 1 must hold; that is, the ratio of the number of occurrences in the group's table to the number of occurrences in the corpus table, plus the second threshold P2, is 1 or more. The second threshold P2 is typically less than 5%.
The first threshold P1 and the second threshold P2 may or may not be equal. In this example, introducing these two parameters when selecting specific k-mers tolerates a certain error rate, that is, a certain degree of non-specificity in the specific k-mers. Without these two parameters no non-specificity could be tolerated, and it would often be difficult to find any specific k-mers for a particular pathogen operation group.
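The two specificity conditions of step 2002E can be sketched as a simple filter. The default values of `p1` and `p2` (3%) are illustrative choices below the 5% mentioned in the text, and the function names and toy counts are hypothetical.

```python
def is_specific(c1, n_genomes, c2, p1=0.03, p2=0.03):
    """Check both specificity conditions for one k-mer.

    c1: genomes of the group containing the k-mer
    n_genomes: number of genomes N in the group
    c2: genomes of the whole corpus containing the k-mer
    """
    covers_group = c1 / n_genomes + p1 >= 1   # condition 1
    rare_elsewhere = c1 / c2 + p2 >= 1        # condition 2
    return covers_group and rare_elsewhere

def select_specific_kmers(group_index, corpus_index, n_genomes, p1=0.03, p2=0.03):
    """Return the sorted specific k-mers of one pathogen operation group."""
    return sorted(
        kmer
        for kmer, c1 in group_index.items()
        if is_specific(c1, n_genomes, corpus_index[kmer], p1, p2)
    )

# Hypothetical index tables for a group of 10 genomes.
group_index = {"AAAA": 10, "CCCC": 10, "GGGG": 5}
corpus_index = {"AAAA": 10, "CCCC": 40, "GGGG": 5}
specific = select_specific_kmers(group_index, corpus_index, n_genomes=10)
```

In this toy data `CCCC` fails condition 2 because it also occurs in 30 genomes outside the group, and `GGGG` fails condition 1 because it is present in only half of the group's genomes.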
For a pathogen operation group for which n specific k-mers are found, assume that the non-occurrences allowed by P1 in condition 1 of this step are randomly distributed among the genomes of the pathogen operation group; the probability of a false negative for the group is then less than or equal to P1^n. For sufficiently large n, this probability is very small. Likewise, if n' specific k-mers of the pathogen operation group are actually detected in the end, assume that the occurrences allowed by P2 in condition 2 of this step are randomly distributed among the other genomes outside the pathogen operation group; the probability of a false positive for the group is then less than or equal to P2^n'. For sufficiently large n', this probability is very small. The false negative rate is the proportion of true positives that yield a negative test result, that is, the conditional probability of a negative test result given that the condition being tested for is present.
If a total of M pathogen manipulation groups are partitioned in step 2002B, M corresponding lists of specific k-mers are created in this step.
Step 2002F, generate a specific k-mer occurrence proportion table corresponding to each pathogen operation group.
For each pathogen operation group, the occurrence proportion of each specific k-mer selected in step 2002E within its pathogen operation group is calculated, and a specific k-mer occurrence proportion table is then established for that group. Similarly, if a total of M pathogen operation groups were defined in step 2002B, M corresponding specific k-mer occurrence proportion tables are established in this step.
Step 2002G, generate a k-mer occurrence count record table corresponding to each pathogen operation group.
Step 2002H, generate a k-mer occurrence count record table of the corpus.
Step 2002I, generate, for each pathogen operation group, a table of the proportion of k-mer occurrence counts relative to the corpus.
The k-mer occurrence count record table of each pathogen operation group records the k-mers contained in that group and the number of occurrences of each k-mer in the group. It differs from the genome occurrence index table of the pathogen operation group: the index table records the number of genomes of the group in which each k-mer occurs, so a k-mer that occurs X times (X greater than 1) in the same genome is still recorded as 1. The k-mer occurrence count record table, in contrast, records the number of occurrences of each k-mer in the genomes of the group, so a k-mer that occurs X times in the same genome is recorded X times.
The corpus k-mer occurrence count record table generated in step 2002H counts in the same way as the tables of step 2002G: a k-mer that occurs X times in the same genome is recorded X times. The difference is that the corpus table records the number of occurrences of each k-mer across the genomes of all pathogen operation groups; a single table is created for the whole corpus rather than one per group, and the number of occurrences of any k-mer across all genomes can be read from it. That is, if M pathogen operation groups were defined in step 2002B, step 2002G creates M k-mer occurrence count record tables, whereas step 2002H creates only one corpus k-mer occurrence count record table.
The table generated in step 2002I records, for each k-mer contained in a pathogen operation group, its occurrence count CG within the group, its corpus occurrence count CB in the target database, and the corpus proportion F, calculated as the ratio of CG to CB. Similarly, if a total of M pathogen operation groups were defined in step 2002B, M corresponding tables of the proportion of k-mer occurrence counts relative to the corpus are established in this step.
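The relationship between the tables of steps 2002G, 2002H and 2002I can be sketched as follows. The group names and sequences are hypothetical toy data, and the corpus is assumed to consist only of the genomes of the two groups shown.

```python
from collections import Counter

def occurrence_count_table(genomes, k):
    """Step 2002G/2002H style counting: every occurrence is counted."""
    table = Counter()
    for seq in genomes:
        table.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return table

def corpus_proportion_table(group_table, corpus_table):
    """Step 2002I: map each k-mer to (CG, CB, F) with F = CG / CB."""
    return {
        kmer: (cg, corpus_table[kmer], cg / corpus_table[kmer])
        for kmer, cg in group_table.items()
    }

# Hypothetical corpus made of two pathogen operation groups.
groups = {"group_a": ["ACGTACGT"], "group_b": ["ACGTTT"]}
group_tables = {name: occurrence_count_table(g, k=4) for name, g in groups.items()}
corpus_table = sum(group_tables.values(), Counter())
proportions_a = corpus_proportion_table(group_tables["group_a"], corpus_table)
```

For `group_a`, the k-mer `ACGT` has CG = 2 and CB = 3, so F is 2/3, while `CGTA` occurs only in this group and has F = 1.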
In addition, the three tables generated in step 2002G, step 2002H and step 2002I are established for the process of estimating the relative concentrations of the pathogen operation groups contained in the sequencing data, and if only the detection and determination of the pathogen operation groups contained in the sequencing data of the sample is required, and the estimation of the relative concentrations is not required, the three steps may not be required.
After the generation of the tables in steps 2002B-2002I is completed, it is considered that the creation of the feature target sequence set corresponding to each pathogen operation group is completed, and the process of creating the feature target sequence set may be collectively referred to as module a. Module a may run at irregular intervals to continually update the set of characteristic target sequences corresponding to each pathogen operating group, i.e., update the target database. Module a does not need to be run or updated as each sample is analyzed.
Step 2004, detect the pathogen operation groups contained in the sequencing data of the sample.
The sample can be a mixed metagenome sample or a complex genome sequencing sample. A mixed metagenome sample is a sample that is sequenced directly without any separation and may contain viruses, bacteria, fungi and various other prokaryotes and eukaryotes. A complex genome sequencing sample is one in which a sequenced genome may contain more than one organism or more than one individual, for example a sample that has not been completely isolated or a sample that has been contaminated. Sequencing data are the data output by a device such as a DNA sequencer, an RNA sequencer or a protein sequencing device after it reads the sequences of the biomolecules contained in the sample.
As shown in fig. 22, step 2004 includes:
Step 2004A, obtain sequencing data of the sample.
Step 2004B, call the specific k-mer occurrence proportion tables.
Step 2004C, generate a specific k-mer actual occurrence count record table corresponding to each pathogen operation group.
First, the sequencing data of the sample to be tested are obtained. During detection, the specific k-mer occurrence proportion tables generated in step 2002F are called; if step 2002F generated M such tables, one per pathogen operation group, all M tables are called. For the sequencing data of a given sample, the actual number of occurrences in the sequencing data of each specific k-mer listed in each table is obtained, and a specific k-mer actual occurrence count record table is generated for each pathogen operation group.
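Step 2004C can be sketched for a single pathogen operation group as below. The function name and reads are hypothetical, reads are assumed to be plain strings, and reverse complements are not handled in this simplified version.

```python
def specific_kmer_actual_occurrences(reads, specific_kmers, k):
    """Count how often each specific k-mer of one group occurs in the reads.

    Specific k-mers that never occur are kept in the table with a count of 0.
    """
    counts = {kmer: 0 for kmer in specific_kmers}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts

reads = ["ACGTACGT", "TTTT"]
record_table = specific_kmer_actual_occurrences(reads, ["ACGT", "GGGG"], k=4)
```

Because the lookup is a dictionary membership test per window, only k-mer to k-mer comparison is needed while scanning the reads.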
Step 2004D, exclude pathogen operation groups with a low probability of occurrence and select the suspected pathogen operation groups.
The specific k-mer actual occurrence count record tables generated in step 2004C give the actual number of occurrences of each specific k-mer in the sequencing data. For each pathogen operation group, the actual occurrence counts of all specific k-mers of that group are added to give the group's sum of actual occurrences. A threshold T is set; when a group's sum of actual occurrences is less than T, the group is judged to have a low probability of occurrence and is removed. The threshold T can be set by the technician and is typically a number greater than 5.
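The exclusion by threshold T can be sketched as follows; the record tables and the value T = 6 are hypothetical.

```python
def sum_of_actual_occurrences(record_table):
    """Add the actual occurrence counts of all specific k-mers of one group."""
    return sum(record_table.values())

def exclude_low_probability_groups(record_tables, threshold_t):
    """Keep only the groups whose sum of actual occurrences is at least T."""
    return {
        group: table
        for group, table in record_tables.items()
        if sum_of_actual_occurrences(table) >= threshold_t
    }

# Hypothetical record tables for two pathogen operation groups.
record_tables = {
    "group_a": {"ACGT": 4, "CGTA": 3},
    "group_b": {"GGGG": 2},
}
remaining = exclude_low_probability_groups(record_tables, threshold_t=6)
```

Here `group_a` has a sum of 7 and is kept, while `group_b` has a sum of 2 and is removed.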
After some pathogen operation groups have been excluded, the suspected pathogen operation groups are selected from the remaining groups as follows:
1. Let the actual number of occurrences of a specific k-mer be Ci. If Ci is greater than a threshold T', the specific k-mer is considered to have actually appeared in the sequencing data and is taken as a confirmed specific k-mer. T' can be set by the technician and is typically a number greater than 5.
2. If a pathogen operation group contains n confirmed specific k-mers, the probability of a false positive for that group is P2^n, where P2 is the second threshold. When n is large enough, this probability is very small.
Note that the n confirmed specific k-mers counted for each pathogen operation group are non-coincident specific k-mers obtained after independence correction. Specifically, the specific k-mers whose actual number of occurrences is greater than the threshold T' are first taken as confirmed specific k-mers; the confirmed specific k-mers are then checked to determine which of them are non-coincident specific k-mers after independence correction; the number n of confirmed specific k-mers that are non-coincident is then used to calculate the false positive probability of the pathogen operation group.
In actual sequencing data processing, the sequences recorded in the non-coincident specific regions are of unequal length. When the sequencing data to be analyzed are processed, only k-mer to k-mer comparison can be guaranteed to be fast; comparison of a k-mer against such a region cannot. Therefore the actual number of occurrences of the specific k-mers is determined first, and only then is it determined which confirmed specific k-mers belong to non-coincident specific regions. This gives the number of confirmed specific k-mers belonging to non-coincident specific regions in each pathogen operation group, from which the false positive probability of each group is finally calculated.
Once the false positive probability of each pathogen operation group has been calculated, the groups whose false positive probability is lower than a preset standard probability are selected as suspected pathogen operation groups. The standard probability is preset by the technician and is generally a value less than 0.05.
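The selection of suspected pathogen operation groups can be sketched as below. This simplified version assumes every confirmed specific k-mer counts toward n, so the independence correction for non-coincident specific k-mers is omitted; the values of `t_prime`, `p2` and `standard_probability` are hypothetical.

```python
def false_positive_probability(record_table, t_prime, p2):
    """Compute P2**n, where n is the number of confirmed specific k-mers."""
    n_confirmed = sum(1 for count in record_table.values() if count > t_prime)
    return p2 ** n_confirmed

def select_suspected_groups(record_tables, t_prime=6, p2=0.03,
                            standard_probability=0.05):
    """Keep groups whose false positive probability is below the standard."""
    suspected = {}
    for group, table in record_tables.items():
        probability = false_positive_probability(table, t_prime, p2)
        if probability < standard_probability:
            suspected[group] = probability
    return suspected

record_tables = {
    "group_a": {"ACGT": 10, "CGTA": 8, "GTAC": 1},
    "group_b": {"GGGG": 2},
}
suspected = select_suspected_groups(record_tables)
```

In this example `group_a` has two confirmed specific k-mers, giving 0.03^2 = 0.0009, whereas `group_b` has none, giving a probability of 1, so only `group_a` is retained.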
Step 2004E, select, from the suspected pathogen operation groups, those meeting preset conditions as the pathogen operation groups contained in the sequencing data.
To further reduce the possibility of false positives, a statistical goodness-of-fit test can be used to check whether the proportions of the actual occurrence counts of the specific k-mers in the sequencing data match their expected occurrence proportions, that is, the proportions recorded in the specific k-mer occurrence proportion table. The result of this statistical test is a p-value. Pathogen operation groups that fail the statistical test criterion PT, meaning that the actual occurrence proportions of their confirmed specific k-mers do not match the expected proportions, are removed. In other words, from the suspected pathogen operation groups obtained in step 2004D, those whose confirmed specific k-mers occur in proportions matching the expected proportions are selected as the pathogen operation groups contained in the sequencing data. PT is preset by the technician and is typically a value less than 0.05.
The goodness-of-fit test may be a chi-squared test, a likelihood-ratio test, or the like. To avoid test errors caused by very large sample sizes, the actual occurrence counts of the specific k-mers can be scaled down proportionally. The scaling factor is chosen so that the smallest of the actually detected specific k-mer occurrence counts is still not less than 1 after scaling. Here "proportionally" means that the ratios between the occurrence counts of the actually detected specific k-mers are preserved.
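The proportional scaling and a chi-squared statistic can be sketched as follows. Dividing by the smallest count is only one possible reading of the scaling rule, and the function names are hypothetical. The p-value is not computed here; it would be obtained from a chi-squared distribution with one degree of freedom fewer than the number of specific k-mers, for instance with a statistics library.

```python
def scale_down(counts):
    """Scale counts proportionally so that the smallest count becomes 1.

    Assumption: the scaling factor is the smallest observed count.
    """
    smallest = min(counts)
    if smallest <= 0:
        raise ValueError("all counts must be positive")
    return [count / smallest for count in counts]

def chi_squared_statistic(observed, expected_proportions):
    """Pearson chi-squared statistic of observed counts against proportions."""
    total = sum(observed)
    norm = sum(expected_proportions)
    statistic = 0.0
    for obs, prop in zip(observed, expected_proportions):
        expected = total * prop / norm
        statistic += (obs - expected) ** 2 / expected
    return statistic

observed = scale_down([200, 400, 600])        # [1.0, 2.0, 3.0]
matching = chi_squared_statistic(observed, [1, 2, 3])
mismatching = chi_squared_statistic([3.0, 2.0, 1.0], [1, 2, 3])
```

Counts that match the expected proportions exactly give a statistic of 0; the reversed counts give 16/3, and a larger statistic corresponds to a smaller p-value.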
After the pathogen operation groups contained in the sequencing data are obtained, a final pathogen operation group list can be obtained. The final pathogen operation group list comprises the name of each pathogen operation group, the calculated p-value and the acquired relevant information of medical, clinical application, biology, pathology and the like corresponding to each pathogen operation group.
Step 2006, estimate the relative concentrations of the pathogen operation groups contained in the sequencing data.
After the pathogen operation groups contained in the sequencing data have been obtained, their relative concentrations can be further estimated. During sequencing, a sample passes through multiple steps such as preparation, sequencing and signal processing, so the absolute concentration of the pathogens contained in the sample is difficult to calculate accurately; the relative concentration is therefore estimated instead.
As shown in fig. 23, step 2006 includes:
Step 2006A, obtain the final pathogen operation group list.
Step 2006B, obtain the k-mer occurrence count record table corresponding to each final pathogen operation group.
Calling the generated final pathogen operation group list gives the pathogen operation groups contained in the sequencing data; for ease of description these are referred to as the final pathogen operation groups. The k-mer occurrence count record table of each final pathogen operation group is then obtained. Although this table records both the k-mers contained in the group and their occurrence counts, only the list of k-mers is needed here, so only the k-mer list is read from the table.
Step 2006C, obtain the table of the proportion of k-mer occurrence counts relative to the corpus for each final pathogen operation group.
This table gives the corpus proportion of each k-mer contained in each final pathogen operation group. To speed up subsequent computation, a k-mer occurrence union list may be generated after step 2006C, that is, the k-mer lists of the individual pathogen operation groups are merged into one overall list.
Step 2006D, obtain the sum of occurrences CT of k-mers that occurred in the sequencing data.
When estimating the relative concentrations of the pathogen operation groups contained in the sequencing data, the data used are k-mers in general rather than only specific k-mers, so the occurrences in the sequencing data of the k-mers contained in each pathogen operation group can be obtained. The sum of the occurrence counts of the k-mers appearing in the sequencing data is CT. To increase calculation speed, it is not necessary to record which k-mers appear when calculating CT, that is, the sequence of each k-mer need not be known or recorded; the sum CT is calculated directly.
Step 2006E, generate a table of k-mer actual occurrences corresponding to each final pathogen operation group.
Step 2006F, obtain the sum CF of the estimated actual occurrences of the k-mers contained in each final pathogen operation group.
The target database also contains the table of the proportion of k-mer occurrence counts relative to the corpus for each pathogen operation group, which records the corpus proportion of each k-mer contained in each pathogen operation group and hence of each k-mer contained in the final pathogen operation groups. The corpus proportion of a k-mer is the ratio of its number of occurrences in its pathogen operation group to its number of occurrences in the corpus.
After the actual number of occurrences in the sequencing data of the k-mers contained in each final pathogen operation group has been obtained, a k-mer actual occurrence count table can be generated for each final pathogen operation group. The actual number of occurrences of a k-mer is the number of times it actually occurs in the sequencing data: if the k-mer occurs N times in the sequencing data, its actual number of occurrences is N.
After the k-mer actual occurrence count table of each final pathogen operation group has been generated, the actual number of occurrences of each k-mer is read from it. The estimated actual number of occurrences of each k-mer is then calculated from its actual number of occurrences and its corpus proportion, and a k-mer estimated actual occurrence count table is generated for each final pathogen operation group. From each such table, the sum CF of the estimated actual occurrences of all k-mers contained in that final pathogen operation group, that is, in a pathogen operation group contained in the sequencing data, can be obtained.
To speed up subsequent computation, a k-mer occurrence union list can alternatively be generated from the k-mers in the k-mer occurrence count record tables, that is, the k-mer lists of the individual pathogen operation groups are merged into one overall list. The union list thus contains the k-mers of all pathogen operation groups known to be contained in the sequencing data. After the actual number of occurrences in the sequencing data of each k-mer in the union list has been obtained, a k-mer occurrence count total record table can be generated, which records the actual number of occurrences in the sequencing data of all k-mers of the known pathogen operation groups. Because this total record table covers the k-mers of all known pathogen operation groups present in the sequencing data, the actual number of occurrences of the k-mers contained in each final pathogen operation group can be read from it.
Once the actual occurrence counts of the k-mers contained in each final pathogen operation group have been obtained, a k-mer actual occurrence count table can be generated for each final pathogen operation group for later use. For example, the estimated actual number of occurrences of each k-mer can be calculated from its corpus proportion and the actual number of occurrences recorded in this table, and the corresponding k-mer estimated actual occurrence count table generated.
The technician may choose between these two ways of generating the k-mer estimated actual occurrence count table according to the specific situation.
Step 2006G, calculate the relative concentration of each final pathogen operation group from the sum of occurrences CT and the sum of estimated actual occurrences CF.
To calculate the relative concentration of a pathogen operation group, the sum CT of the occurrence counts of the k-mers appearing in the sequencing data and the sum CF of the estimated actual occurrences of the k-mers contained in that group are obtained; the ratio of CF to CT is the relative concentration of the group. CF is the sum of the estimated numbers of occurrences in the sequencing data of the k-mers contained in the pathogen operation group; as the name suggests, it is an estimate rather than a measured value.
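The computation of CF and of the relative concentration can be sketched as below. The sketch assumes that the estimated actual number of occurrences of a k-mer is its actual count multiplied by its corpus proportion F; the text states only that the estimate is calculated from these two quantities, so the product is an assumption, and all names and numbers are hypothetical.

```python
def estimated_occurrence_sum(actual_counts, corpus_proportions):
    """Compute CF for one final pathogen operation group.

    Assumption: estimated occurrences of a k-mer = actual count * F.
    """
    return sum(
        count * corpus_proportions[kmer]
        for kmer, count in actual_counts.items()
    )

def relative_concentration(cf, ct):
    """Relative concentration of a group: the ratio of CF to CT."""
    if ct <= 0:
        raise ValueError("CT must be positive")
    return cf / ct

# Hypothetical actual counts in the sequencing data and corpus proportions.
actual_counts = {"ACGT": 30, "CGTA": 10}
corpus_proportions = {"ACGT": 0.5, "CGTA": 1.0}
cf = estimated_occurrence_sum(actual_counts, corpus_proportions)
concentration = relative_concentration(cf, ct=100)
```

With these numbers CF is 30 × 0.5 + 10 × 1.0 = 25, and with CT = 100 the relative concentration is 0.25.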
In the description above, step 2004 and step 2006 run serially, that is, step 2006 is executed after step 2004 has finished. Steps 2004 and 2006 may alternatively run in a parallel or hybrid mode. In that case step 2006A is modified to call the list of all pathogen operation groups contained in the corpus, and, after steps 2004 and 2006 have both finished, the relative concentration values calculated in step 2006 are looked up for the groups in the final pathogen operation group list obtained in step 2004.
According to the detection method of the pathogen operation group, sequencing data are compared with the characteristic target sequences corresponding to the pathogen operation groups, so that the pathogen operation group contained in the sequencing data is obtained, the comparison space is reduced, the analysis time is shortened, and the detection efficiency is improved. In addition, a corresponding characteristic target sequence set is established for each pathogen operation group, and the characteristic target sequence set of a certain pathogen operation group can ensure that the characteristic target sequences exist only in species in the pathogen operation group and do not exist in other species within a certain probability range, so that the pathogen operation group is represented with high probability, and the accuracy of detection results is improved.
In one embodiment, as shown in fig. 24, a method for detecting the relative concentration of a pathogen operating group is provided, comprising the steps of:
step 2402, obtaining sequencing data of the sample, and calculating the total occurrence frequency CT of k-mers appearing in the sequencing data.
Sequencing data of a sample refers to the data output by a device, such as a DNA sequencer, an RNA sequencer or a protein sequencing device, after it reads the sequences of the biomolecules contained in the sample. A set of sequencing data includes multiple pieces of data to be tested (possibly several million or more), and each piece of data to be tested can be abstracted as a character string. The sample may be, for example, a drop of blood, a sputum specimen or a clump of soil. A sequencer is an instrument capable of measuring the sequences in an input sample. The sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA.
Because sequencing a sample involves multiple steps, such as sample preparation, sequencing and signal processing, the absolute concentration of a pathogen operation group contained in the sample is difficult to calculate accurately. When the concentration of a pathogen in the sample is required, the relative concentration of the pathogen can be estimated instead.
The server obtains the sequencing data of the sample and calculates the total number of k-mer occurrences CT in the sequencing data. Specifically, if the sequencing data contains M pieces of data to be tested and each piece contains n characters, each piece contains n-k+1 k-mers, so summing over the M pieces gives CT = M × (n-k+1). To increase the calculation speed, it is not necessary to record which k-mers actually appear when calculating CT; that is, the sequence of each k-mer need not be known or recorded, and CT is calculated directly.
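The calculation of CT can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `total_kmer_occurrences` is hypothetical, and treating a piece of data shorter than k as contributing zero k-mers is an assumption the text does not state.

```python
def total_kmer_occurrences(reads, k):
    """Return CT, the total number of k-mer positions over all reads.

    A read of n characters contains n - k + 1 k-mers. The k-mer sequences
    themselves are never extracted; only read lengths are used.
    Assumption: reads shorter than k contribute 0.
    """
    return sum(max(len(read) - k + 1, 0) for read in reads)


# Three toy reads of lengths 8, 5 and 2 with k = 4.
reads = ["ACGTACGT", "ACGTA", "AC"]
ct = total_kmer_occurrences(reads, k=4)  # 5 + 2 + 0
```

Because only lengths are read, the cost is linear in the number of reads rather than in the number of k-mers.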
Step 2404, obtaining the number of occurrences GCT, in the sequencing data, of the k-mers contained in each pathogen operation group contained in the complete set.
The complete set refers to a set formed by collecting high-reliability genomes in advance, where a high-reliability genome is a genome meeting preset conditions. The high-reliability genomes include both pathogen genomes and non-pathogen genomes, such as the high-reliability genomes of symbionts, probiotics, humans, animals and plants.
The server obtains the number of occurrences GCT, which is an estimate, of the k-mers contained in each pathogen operation group of the complete set in the sequencing data. GCT may comprise two parts: the estimated total number of actual occurrences ECT of the k-mers of a pathogen operation group, and the confirmed total number of actual occurrences CCT of the k-mers of that pathogen operation group. In this case, GCT is obtained by calculating ECT and CCT for each pathogen operation group contained in the complete set and adding them, i.e., GCT is the sum of ECT and CCT. GCT may also comprise only one part: when the k of the k-mers contained in the pathogen operation groups is greater than a target value, i.e., when k is sufficiently large, GCT can be estimated entirely from CCT, so that GCT equals CCT and ECT need not be calculated.
Step 2406, calculating the relative concentration of the pathogen operation group contained in the complete set in the sequencing data according to the occurrence number GCT and the occurrence number sum CT.
For each pathogen operation group in the complete set, the ratio of the number of occurrences GCT of its k-mers in the sequencing data to the total number of k-mer occurrences CT in the sequencing data is calculated, which gives the relative concentration of that pathogen operation group in the sequencing data.
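As a small sketch of step 2406, the ratio GCT/CT can be computed per pathogen operation group as shown below. The function name, the dictionary layout and the group labels are illustrative assumptions.

```python
def relative_concentrations(gct_by_group, ct):
    """Map each pathogen operation group to GCT / CT.

    gct_by_group: hypothetical dict {group name: GCT for that group}.
    ct: total number of k-mer occurrences in the sequencing data.
    """
    if ct <= 0:
        raise ValueError("CT must be positive")
    return {group: gct / ct for group, gct in gct_by_group.items()}


# Toy values: 25 and 5 k-mer occurrences attributed out of 100 in total.
concentrations = relative_concentrations({"group_A": 25, "group_B": 5}, ct=100)
```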
In the above embodiment, the accuracy of calculating the relative concentration of the pathogen operation groups contained in the complete set in the sequencing data can be improved by obtaining the total sum CT of the occurrence times of k-mers in the sequencing data and the occurrence times GCT of k-mers in the sequencing data contained in the pathogen operation groups contained in the complete set, and further calculating the relative concentration of the pathogen operation groups contained in the complete set in the sequencing data.
In one embodiment, before step 2402, further comprising the step of:
and acquiring high-reliability genome data to obtain a complete set, wherein the high-reliability genome refers to a genome meeting preset conditions. The operational group of pathogens contained in the ensemble is determined.
The source of the high-reliability genome data may be the RefSeq dataset or other public or private high-reliability genomes. Collecting high-reliability genome data includes confirming the reliability of each genome and screening accordingly. A genome is retained as a high-reliability genome only if it passes the following screens: (1) screening by the proportion of non-deterministic characters in the genome data: for a DNA genome, for example, this is the proportion of non-ACGT characters, and if the proportion is too high, the genome is suspected of low reliability; (2) screening by the number of genome data fragments that make up one complete chromosome: if too many fragments belong to one chromosome, the genome is suspected of low reliability; (3) screening by average whole-genome coverage percentage: the genome is aligned at whole-genome level against several closely related genomes (for example, genomes whose genetic distance is smaller than a certain threshold) to determine its average whole-genome coverage percentage among those genomes, and a genome whose average coverage percentage is too low is suspected of low completeness, i.e., low reliability. After the genomes of low or suspected low reliability are removed, the remaining genomes are the high-reliability genomes.
The pathogen operation groups corresponding to the high-reliability genomes in the complete set are determined according to the genomes contained in genetic or taxonomic units at different classification levels, such as species, subspecies, subtypes, strains or virus strains. In this embodiment, high-reliability genome data can be collected in advance to obtain the complete set and determine the pathogen operation groups it contains, which facilitates direct subsequent use and improves efficiency.
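The three screens can be sketched as a simple filter. The record layout, the function names and all three cut-off values below are assumptions chosen for illustration; the text only says that the proportions or counts must not be "too high" or "too low", and the coverage percentage is taken as already computed by a separate alignment step.

```python
from dataclasses import dataclass


@dataclass
class GenomeRecord:
    name: str
    sequence: str             # concatenated genome sequence
    fragments: int            # number of fragments in the assembly
    chromosomes: int          # number of complete chromosomes represented
    mean_coverage_pct: float  # average whole-genome coverage vs. close relatives


def non_acgt_fraction(sequence):
    """Proportion of non-deterministic (non-ACGT) characters."""
    if not sequence:
        return 1.0
    bad = sum(1 for ch in sequence.upper() if ch not in "ACGT")
    return bad / len(sequence)


def is_high_reliability(g, max_non_acgt=0.01,
                        max_fragments_per_chromosome=100,
                        min_coverage_pct=90.0):
    """Apply screens (1)-(3); every threshold here is a hypothetical value."""
    if non_acgt_fraction(g.sequence) > max_non_acgt:
        return False  # screen (1): too many non-ACGT characters
    if g.fragments / max(g.chromosomes, 1) > max_fragments_per_chromosome:
        return False  # screen (2): chromosome split into too many fragments
    if g.mean_coverage_pct < min_coverage_pct:
        return False  # screen (3): low coverage among related genomes
    return True


genomes = [
    GenomeRecord("good", "ACGT" * 50, 2, 1, 97.5),
    GenomeRecord("many_n", "ACGT" * 25 + "N" * 100, 2, 1, 97.5),
    GenomeRecord("fragmented", "ACGT" * 50, 500, 1, 97.5),
    GenomeRecord("incomplete", "ACGT" * 50, 2, 1, 40.0),
]
complete_set = [g.name for g in genomes if is_high_reliability(g)]
```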
In one embodiment, after determining the pathogen operation group contained in the complete set, the method further comprises:
After determining the pathogen operation groups contained in the complete set, the server may compute the k-mers present in each of these pathogen operation groups and build a pathogen k-mer list for each pathogen operation group in the complete set. From these pathogen k-mer lists, a summary table of the k-mers of all pathogen operation groups in the complete set, namely the pathogen k-mer summary table, is established. The individual pathogen k-mer lists and the pathogen k-mer summary table are stored in a database.
A k-mer occurrence count table is then calculated for each pathogen operation group contained in the complete set. The k-mers here include not only the specific k-mers of the pathogen operation group but all k-mers occurring in the genomes of the pathogen operation group. That is, the k-mer occurrence count table records the k-mers contained in the pathogen operation group and the number of times each k-mer occurs in the pathogen operation group. If a k-mer occurs x times in a genome, x is added to the count entry for that k-mer. If the complete set contains M pathogen operation groups, M corresponding k-mer occurrence count tables are established. The k-mer occurrence count table obtained for each pathogen operation group is stored in the database.
A k-mer occurrence count table for the complete set is established according to the pathogen k-mer summary table in the database: the number of occurrences of each k-mer in the pathogen k-mer summary table is calculated from the k-mer occurrence count tables of the individual pathogen operation groups, and the resulting k-mer occurrence count table for the complete set is stored in the database.
A table of the proportion of the number of occurrences of each k-mer in a pathogen operation group to its number of occurrences in the complete set is established as follows. The number of occurrences CG of each k-mer in each pathogen operation group is obtained from the k-mer occurrence count table of that pathogen operation group. The number of occurrences CB of the same k-mer in the complete set is obtained from the k-mer occurrence count table of the complete set. The ratio of CG to CB gives the proportion F of the occurrences of the k-mer in the pathogen operation group relative to its occurrences in the complete set. A proportion table is established from CG, CB and F and stored in the database.
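A k-mer occurrence count table for one pathogen operation group can be sketched with a counter, as below. The helper names are hypothetical, and the genomes are toy strings; a real table would be persisted to the database rather than held in memory.

```python
from collections import Counter


def iter_kmers(sequence, k):
    """Yield every k-mer (substring of length k) of a sequence, in order."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def group_kmer_counts(genomes, k):
    """Count all k-mers over the genomes of one pathogen operation group.

    If a k-mer occurs x times in a genome, x is added to its count entry.
    """
    counts = Counter()
    for genome in genomes:
        counts.update(iter_kmers(genome, k))
    return counts


# Two toy genomes in one group, k = 3.
table = group_kmer_counts(["ACGTAC", "ACGA"], k=3)
```

Here `ACG` occurs once in each genome, so its entry is 2, while every other k-mer has a count of 1.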
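The proportion table can be sketched from per-group count tables as follows. As a simplifying assumption, the complete-set count CB is obtained here by summing the supplied per-group tables, so every genome of the complete set (including non-pathogen genomes) is assumed to belong to one of the supplied groups. The function name and the tuple layout (CG, CB, F) are illustrative.

```python
from collections import Counter


def proportion_table(counts_by_group):
    """Return {group: {kmer: (CG, CB, F)}} with F = CG / CB.

    counts_by_group: hypothetical dict {group: Counter of k-mer occurrences}.
    CB is taken as the sum of CG over all supplied groups (an assumption).
    """
    complete_set_counts = Counter()
    for counts in counts_by_group.values():
        complete_set_counts.update(counts)

    table = {}
    for group, counts in counts_by_group.items():
        table[group] = {
            kmer: (cg, complete_set_counts[kmer], cg / complete_set_counts[kmer])
            for kmer, cg in counts.items()
        }
    return table


counts_by_group = {
    "G1": Counter({"ACG": 3, "CGT": 1}),
    "G2": Counter({"ACG": 1}),
}
f_table = proportion_table(counts_by_group)
```

A k-mer found in only one group gets F = 1, while a shared k-mer is split in proportion to how often it occurs in each group.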
In the embodiment, each table is established in advance and stored in the database, so that the table can be directly used in the subsequent calculation of the relative concentration, and the efficiency of calculating the relative concentration is improved.
In one embodiment, as shown in fig. 25, step 2404 includes the following steps:
Step 2502, obtaining each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the total number of confirmed actual occurrences CCT according to the number of k-mers contained in each first marker data.
First marker data refers to a piece of data to be tested that contains specific k-mers of only one pathogen operation group; such a piece is marked as coming from that pathogen operation group. Marking every piece of data to be tested that contains specific k-mers of only one pathogen operation group as coming from that group yields the first marker data.
The server acquires each first marker data in the sequencing data and calculates the number of k-mers in each piece of data to be tested that is marked as coming from a single pathogen operation group. The k-mer counts of the pieces marked as coming from the same pathogen operation group are then summed to obtain the confirmed total number of actual occurrences CCT of the k-mers of each pathogen operation group.
Step 2504, obtaining the complete-set proportion of each k-mer contained in the pathogen operation groups contained in the complete set, wherein the complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set.
The database also stores a complete-set proportion table for each pathogen operation group. Each table contains, for every k-mer, its number of occurrences CG in the corresponding pathogen operation group, its number of occurrences CB in the complete set, and its complete-set proportion F. The complete-set proportion of each k-mer can therefore be read from this table.
Step 2506, obtaining each second marker data in the sequencing data, and obtaining the actual number of occurrences, in the second marker data, of each k-mer contained in the pathogen operation groups contained in the complete set.
Second marker data refers to a piece of data to be tested in the sequencing data that either contains no specific k-mer of any pathogen operation group or contains specific k-mers belonging to several pathogen operation groups; such a piece is marked as not belonging to any pathogen operation group.
The server acquires each second marker data in the sequencing data and identifies all k-mers it contains. Using the k-mer occurrence count tables of the pathogen operation groups pre-stored in the database, the server finds the pathogen operation groups that contain these k-mers, and obtains the actual number of occurrences in the second marker data of each k-mer contained in those pathogen operation groups.
Step 2508, calculating the estimated actual number of occurrences of each k-mer according to the complete-set proportion and the actual number of occurrences.
After obtaining the complete-set proportions and the actual numbers of occurrences, the server multiplies, for each pathogen operation group, the complete-set proportion of each k-mer by the actual number of occurrences of that k-mer in the second marker data, which gives the estimated actual number of occurrences in the second marker data of each k-mer of each pathogen operation group.
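The marking of data to be tested and the calculation of CCT can be sketched as follows. The function names and the lookup structure `specific_kmers` (a mapping from each specific k-mer to its pathogen operation group) are assumptions made for illustration.

```python
from collections import defaultdict


def split_marker_data(reads, specific_kmers, k):
    """Split reads into first marker data (per group) and all other reads.

    specific_kmers: hypothetical dict {specific k-mer: group name}.
    A read is first marker data if its specific k-mers all belong to
    exactly one group; every other read is returned in the second list.
    """
    first = defaultdict(list)
    other = []
    for read in reads:
        groups = {
            specific_kmers[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in specific_kmers
        }
        if len(groups) == 1:
            first[groups.pop()].append(read)
        else:
            other.append(read)
    return dict(first), other


def confirmed_totals(first, k):
    """CCT per group: total number of k-mers in its first marker data."""
    return {g: sum(len(r) - k + 1 for r in rs) for g, rs in first.items()}


specific = {"AAA": "G1", "CCC": "G2"}
reads = ["AAAT", "CCCG", "AAACCC", "GTGT"]
first, other = split_marker_data(reads, specific, k=3)
cct = confirmed_totals(first, k=3)
```

In this toy input, `AAACCC` carries specific k-mers of two groups and `GTGT` carries none, so neither contributes to CCT.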
Step 2510, calculating the total estimated actual occurrence times ECT according to the estimated actual occurrence times of each k-mer.
After the estimated actual number of occurrences in the second marker data has been calculated for each k-mer of each pathogen operation group, these values are summed per pathogen operation group to obtain the estimated total number of actual occurrences ECT of all k-mers of that pathogen operation group in the second marker data.
Step 2512, calculating the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation groups contained in the complete set according to the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT.
Once the estimated total number of actual occurrences ECT of the k-mers of each pathogen operation group in the second marker data and the confirmed total number of actual occurrences CCT of the corresponding pathogen operation group in the first marker data have been calculated, ECT and CCT are added to obtain the number of occurrences GCT, in the sequencing data, of the k-mers of each pathogen operation group contained in the complete set.
In this embodiment, GCT is estimated by combining the confirmed total number of actual occurrences CCT of each pathogen operation group in the first marker data with the estimated total number of actual occurrences ECT of the k-mers of each pathogen operation group in the second marker data, so that the relative concentration can be calculated.
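Steps 2506 to 2512 can be sketched as below. The function names and the layout of `proportions` (group, then k-mer, then F) are hypothetical, and the toy proportions are chosen only so that the arithmetic is easy to follow.

```python
from collections import Counter


def estimated_totals(second_reads, proportions, k):
    """ECT per group: sum over k-mers of F * occurrences in second marker data.

    proportions: hypothetical dict {group: {kmer: F}}, where F is the
    complete-set proportion CG / CB of that k-mer for that group.
    """
    observed = Counter(
        read[i:i + k]
        for read in second_reads
        for i in range(len(read) - k + 1)
    )
    return {
        group: sum(f * observed[kmer] for kmer, f in f_by_kmer.items())
        for group, f_by_kmer in proportions.items()
    }


def combine(ect, cct):
    """GCT per group: ECT + CCT, treating a missing entry as zero."""
    return {g: ect.get(g, 0.0) + cct.get(g, 0) for g in set(ect) | set(cct)}


proportions = {"G1": {"ACG": 0.75, "CGT": 1.0}, "G2": {"ACG": 0.25}}
ect = estimated_totals(["ACGT"], proportions, k=3)
gct = combine(ect, {"G1": 2})
```

The shared k-mer `ACG` is apportioned 0.75 to G1 and 0.25 to G2, so each observed occurrence is counted once in total across the groups.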
In one embodiment, step 2404 includes the steps of:
acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data. When the length of the k-mers contained in the pathogen operation groups contained in the complete set is greater than a target value, the confirmed total number of actual occurrences CCT is taken as the number of occurrences GCT, in the sequencing data, of the k-mers contained in those pathogen operation groups.
First marker data refers to a piece of data to be tested that contains specific k-mers of only one pathogen operation group; such a piece is marked as coming from that pathogen operation group. Marking every piece of data to be tested that contains specific k-mers of only one pathogen operation group as coming from that group yields the first marker data.
The server acquires each first marker data in the sequencing data and calculates the number of k-mers in each piece of data to be tested that is marked as coming from a single pathogen operation group. The k-mer counts of the pieces marked as coming from the same pathogen operation group are then summed to obtain the confirmed total number of actual occurrences CCT of the k-mers of each pathogen operation group. When the k of the k-mers contained in the pathogen operation groups contained in the complete set is greater than a target value, the calculated CCT can be used directly as the number of occurrences GCT of those k-mers in the sequencing data, i.e., GCT equals CCT. The target value is determined experimentally; for example, k may be greater than 23 or greater than 27. The estimated total number of actual occurrences ECT therefore need not be calculated, which improves the efficiency of calculating GCT.
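The choice between the two ways of obtaining GCT can be sketched as a single function. The name `occurrences_gct`, the parameter `target_k` and the values used in the example are assumptions; the text states only that the target value is determined experimentally.

```python
def occurrences_gct(cct, k, target_k, ect=None):
    """Return GCT per group.

    If k exceeds the target value, GCT is simply CCT and ECT is not needed.
    Otherwise GCT = CCT + ECT, and ECT must be supplied.
    """
    if k > target_k:
        return dict(cct)
    if ect is None:
        raise ValueError("ECT is required when k does not exceed the target value")
    return {g: cct.get(g, 0) + ect.get(g, 0.0) for g in set(cct) | set(ect)}


# Long k-mers: CCT alone is used.
gct_long = occurrences_gct({"G1": 10}, k=31, target_k=27)
# Shorter k-mers: the estimated part is added.
gct_short = occurrences_gct({"G1": 10}, k=21, target_k=27,
                            ect={"G1": 2.5, "G2": 1.0})
```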
In one embodiment, as shown in fig. 26, after step 2406, further comprising the steps of:
step 2602, obtaining pathogen manipulation groups contained in the sequencing data and obtaining relative concentrations of the pathogen manipulation groups contained in the sequencing data.
After the server has calculated the relative concentration in the sequencing data of each pathogen operation group contained in the complete set, it can obtain the pathogen operation groups contained in the sequencing data, which may have been detected in the sequencing data in advance. Each pathogen operation group contained in the sequencing data is looked up in the complete set, and when it is found, its relative concentration in the sequencing data is retrieved, which gives the relative concentrations of the pathogen operation groups contained in the sequencing data.
Step 2604, selecting, from the pathogen operation groups contained in the sequencing data, those whose relative concentration is higher than a preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data.
The server selects the pathogen operation groups whose relative concentration in the sequencing data is higher than the preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data, and sends them to the terminal for display.
In this embodiment, the pathogen operation groups contained in the sequencing data are further confirmed using their relative concentrations, which further improves the accuracy of the detected pathogen operation groups.
It should be understood that although the steps in the flow diagrams of figs. 1-26 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in these figures may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
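Steps 2602 and 2604 amount to a lookup followed by a threshold filter, which can be sketched as below. The function name, the group labels and the threshold value are illustrative assumptions; treating a detected group that is missing from the concentration table as not confirmed is also an assumption.

```python
def confirm_groups(detected_groups, concentration_by_group, threshold):
    """Keep detected groups whose relative concentration exceeds the threshold.

    detected_groups: groups previously detected in the sequencing data.
    concentration_by_group: relative concentration of each group of the
    complete set in the sequencing data.
    """
    return [
        group
        for group in detected_groups
        if group in concentration_by_group
        and concentration_by_group[group] > threshold
    ]


concentrations = {"G1": 0.20, "G2": 0.001, "G3": 0.05}
confirmed = confirm_groups(["G1", "G2", "G4"], concentrations, threshold=0.01)
```

Here G2 is dropped for low concentration, G4 because it has no entry, and G3 because it was never detected in the first place.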
In one embodiment, there is provided a detection apparatus for a pathogen operating group, as shown in fig. 27, the apparatus comprising:
a sequencing data acquisition module 2702 configured to acquire sequencing data of the sample.
A target sequence set obtaining module 2704, configured to obtain a feature target sequence set corresponding to each pathogen operation group stored in the target database, where the feature target sequence set includes a specific k-mer that meets a preset specific condition in the pathogen operation group, and the k-mer is a genome sequence with a length of k.
The occurrence number acquiring module 2706 is configured to acquire the occurrence number of the specific k-mer in the sequencing data, where the specific k-mer is included in the feature target sequence set corresponding to each pathogen operation group.
And the pathogen operation group selection module 2708 is used for selecting the pathogen operation group with the occurrence frequency exceeding a preset frequency threshold value as the pathogen operation group contained in the sequencing data.
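The cooperation of modules 2702 to 2708 can be sketched as a single function. This is only an illustration under stated assumptions: the function name and data layout are hypothetical, and the occurrence number of a pathogen operation group is taken here as the summed occurrences of all its specific k-mers, which is one possible reading of the text.

```python
from collections import Counter


def detect_groups(reads, feature_sets, k, min_occurrences):
    """Return groups whose specific k-mers occur more than min_occurrences times.

    feature_sets: hypothetical dict {group: set of specific k-mers}, standing
    in for the feature target sequence sets stored in the target database.
    """
    observed = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    hits = {
        group: sum(observed[kmer] for kmer in kmers)
        for group, kmers in feature_sets.items()
    }
    return [group for group, n in hits.items() if n > min_occurrences]


feature_sets = {"G1": {"AAA"}, "G2": {"CCC"}, "G3": {"GGG"}}
detected = detect_groups(["AAATAAA", "CCCG"], feature_sets, k=3, min_occurrences=1)
```

Only the specific k-mers are looked up, which reflects the reduced comparison space described for the method.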
In one embodiment, the detection apparatus of the pathogen operating group further comprises: the specificity k-mer selecting module is used for selecting k-mers meeting preset specificity conditions from the k-mers corresponding to each pathogen operation group; and the specificity k-mer storage module is used for storing the k-mers meeting the preset specificity condition into the characteristic target point sequence set corresponding to each pathogen operation group.
In one embodiment, the detection apparatus of the pathogen operating group further comprises: the complete set acquisition module is used for acquiring high-reliability genome data to obtain a complete set, wherein the high-reliability genome refers to a genome meeting preset conditions; and the pathogen operating group determining module is used for determining the pathogen operating group contained in the complete set.
In one embodiment, the detection apparatus of the pathogen operating group further comprises: the CT calculation module is used for acquiring sequencing data of the sample and calculating the total CT of the occurrence times of k-mers appearing in the sequencing data; the GCT acquisition module is used for acquiring the occurrence times GCT of the k-mers in the sequencing data in the pathogen operation group contained in the complete set; the concentration calculation module is used for calculating the relative concentration of the pathogen operation group contained in the complete set in the sequencing data according to the occurrence times GCT and the occurrence times total CT;
the detection apparatus of the pathogen operating group further comprising: the concentration obtaining module, used for obtaining the relative concentration of the pathogen operation groups contained in the sequencing data according to the relative concentration, in the sequencing data, of the pathogen operation groups contained in the complete set; and the pathogen operation group selection module, used for selecting the pathogen operation groups contained in the sequencing data whose relative concentration is higher than a preset threshold as the pathogen operation groups confirmed to be contained in the sequencing data.
In one embodiment, the GCT acquisition module includes:
the CCT calculation module is used for acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data; the proportion acquisition module is used for acquiring the complete-set proportion of each k-mer in the pathogen operation groups contained in the complete set, wherein the complete-set proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set; the actual occurrence number acquisition module is used for acquiring each second marker data in the sequencing data and acquiring the actual number of occurrences, in each second marker data, of each k-mer in the pathogen operation groups contained in the complete set; the number estimation module is used for calculating the estimated actual number of occurrences of each k-mer according to the complete-set proportion and the actual number of occurrences; the total number calculating module is used for calculating the estimated total number of actual occurrences ECT according to the estimated actual number of occurrences of each k-mer; and the GCT obtaining module is used for calculating the number of occurrences GCT, in the sequencing data, of the k-mers in the pathogen operation groups contained in the complete set according to the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT.
In one embodiment, the GCT acquisition module includes: the CCT calculation module, used for acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the confirmed total number of actual occurrences CCT according to the number of k-mers contained in each first marker data; and the length judging module, used for taking the confirmed total number of actual occurrences CCT as the number of occurrences GCT, in the sequencing data, of the k-mers in the pathogen operation groups contained in the complete set when the length of the k-mers contained in those pathogen operation groups is greater than a target value.
In one embodiment, a specific k-mer refers to a k-mer of a pathogen operation group whose number of occurrences in the genome occurrence number index table of the pathogen operation group satisfies a preset error condition. In one embodiment, a specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence number index table of the pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence number index table of the pathogen operation group and its number of occurrences in the genome occurrence number index table of the complete set satisfy a second preset error condition. The genome occurrence number index table of the pathogen operation group records, for each k-mer, the number of genomes in the pathogen operation group that contain the k-mer; the genome occurrence number index table of the complete set records the number of genomes in the complete set that contain the k-mer.
In one embodiment, the first preset error condition is: the sum of the first threshold and the ratio of the number of occurrences in the genome occurrence number index table of the pathogen operation group to the number of genomes contained in the pathogen operation group is greater than or equal to 1.
In one embodiment, the first threshold is less than 5%.
In one embodiment, the second preset error condition is: the sum of the second threshold and the ratio of the number of occurrences in the genome occurrence number index table of the pathogen operation group to the number of occurrences in the genome occurrence number index table of the complete set is greater than or equal to 1.
In one embodiment, the second threshold is less than 5%.
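The two preset error conditions can be sketched as a single predicate. The function and parameter names are hypothetical, and the default thresholds of 0.04 are merely example values below the 5% bound given above.

```python
def is_specific_kmer(group_genomes_with_kmer, group_genomes,
                     complete_set_genomes_with_kmer,
                     first_threshold=0.04, second_threshold=0.04):
    """Check both preset error conditions for one k-mer of one group.

    group_genomes_with_kmer: genomes in the group that contain the k-mer.
    group_genomes: genomes contained in the group.
    complete_set_genomes_with_kmer: genomes in the complete set that
    contain the k-mer.
    """
    if group_genomes <= 0 or complete_set_genomes_with_kmer <= 0:
        return False
    # First condition: the k-mer is present in nearly all genomes of the group.
    first_ok = group_genomes_with_kmer / group_genomes + first_threshold >= 1
    # Second condition: nearly all genomes containing the k-mer are in the group.
    second_ok = (group_genomes_with_kmer / complete_set_genomes_with_kmer
                 + second_threshold >= 1)
    return first_ok and second_ok


widely_present_and_exclusive = is_specific_kmer(99, 100, 99)
missing_from_many_members = is_specific_kmer(80, 100, 80)
shared_with_other_groups = is_specific_kmer(99, 100, 150)
```

The first condition tolerates a small fraction of group genomes lacking the k-mer, and the second tolerates a small fraction of occurrences outside the group.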
As shown in fig. 28, in one embodiment, the detection apparatus for pathogen operation groups further includes a generation module 2710, used for generating a genome occurrence number index table for each pathogen operation group, which records, for each k-mer, the number of genomes in the pathogen operation group that contain the k-mer, and for storing the genome occurrence number index table in the feature target sequence set corresponding to the pathogen operation group.
In one embodiment, the generating module 2710 of the feature target sequence set is further configured to select k-mers satisfying a predetermined specificity condition from the k-mers corresponding to each pathogen operation group; and storing the k-mers meeting the preset specificity condition into the characteristic target sequence set corresponding to each pathogen operation group.
In one embodiment, the generating module 2710 of the feature target sequence set is further configured to generate a genome occurrence number index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer, and to store the genome occurrence number index table of the complete set in the target database.
In one embodiment, the pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences of each specific k-mer included in each pathogen operation group in the sequencing data; selecting a pathogen operation group corresponding to the specific k-mer with the actual occurrence times in the sequencing data exceeding a first preset time threshold value to obtain a suspected pathogen operation group; and selecting the pathogen operation group meeting the preset conditions from the suspected pathogen operation groups as the pathogen operation group contained in the sequencing data.
In one embodiment, the pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences in the sequencing data of the specific k-mers contained in each pathogen operation group, and to generate, according to the actual numbers of occurrences, a specific k-mer actual occurrence count table corresponding to the pathogen operation group.
In one embodiment, the pathogen operation group selecting module 2708 is further configured to obtain a specific k-mer with an actual occurrence number greater than a preset actual occurrence number threshold as a specific k-mer for confirming occurrence; calculating the false positive probability of each pathogen operation group according to the number of the corresponding confirmed specific k-mers of each pathogen operation group; and selecting the pathogen operation group with the false positive probability lower than the preset standard probability as the pathogen operation group contained in the sequencing data.
In one embodiment, the pathogen operation group selection module 2708 is further configured to obtain the actual number of occurrences of each specific k-mer confirmed to occur; calculate the ratio of the actual number of occurrences of each confirmed specific k-mer to the sum of the actual numbers of occurrences of all confirmed specific k-mers; and select, from the pathogen operation groups whose false positive probability is lower than the preset standard probability, the pathogen operation groups whose confirmed specific k-mers have an actual occurrence ratio consistent with the expected occurrence ratio, as the pathogen operation groups contained in the sequencing data.
In one embodiment, the pathogen operation group selection module 2708 is further configured to, when it is determined that the confirmed specific k-mers include specific k-mers belonging to the same non-overlapping specific region, treat the specific k-mers belonging to the same non-overlapping specific region as a single specific k-mer, where a non-overlapping specific region consists of specific k-mers of which any two have a number of overlapping characters that meets a preset overlap threshold; and, once the number of confirmed specific k-mers has been determined according to the non-overlapping specific regions, calculate the false positive probability of each pathogen operation group according to that number.
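One possible reading of the non-overlapping specific region rule is sketched below. It assumes that each confirmed specific k-mer has a known start position on a reference genome, and it groups k-mers whose overlap with the preceding k-mer reaches a minimum number of characters, so that each region is counted once. The position-based representation, the chaining of neighbours, and the name `count_nonoverlapping_regions` are assumptions, not details given in the text.

```python
def count_nonoverlapping_regions(starts, k, min_overlap):
    """Count confirmed k-mers after merging overlapping ones into regions.

    starts      : start positions of confirmed specific k-mers on a reference
    k           : k-mer length
    min_overlap : overlap (in characters) at which two k-mers share a region
    """
    regions = 0
    prev = None
    for s in sorted(starts):
        # Two k-mers starting at prev and s share k - (s - prev) characters.
        if prev is None or k - (s - prev) < min_overlap:
            regions += 1  # no sufficient overlap: a new region begins
        prev = s
    return regions

# Six confirmed k-mers of length 31 that fall into three separate regions.
n_regions = count_nonoverlapping_regions([0, 1, 2, 50, 51, 200], k=31, min_overlap=1)
```

Counting regions instead of raw k-mers avoids treating a single matching read as many independent pieces of evidence.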
In one embodiment, the generating module 2710 of the set of characteristic target sequences is further configured to obtain the number of occurrences of each specific k-mer comprised in the pathogen manipulation group in the genome comprised in the pathogen manipulation group; calculating to obtain the ratio of the occurrence frequency corresponding to each specificity k-mer to the sum of the occurrence frequency of the specificity k-mers contained in the pathogen operation group; generating a specific k-mer appearance ratio table corresponding to the pathogen operation group according to the appearance times and the appearance ratio corresponding to each specific k-mer in the pathogen operation group; and storing the specific k-mer appearance proportion table to a characteristic target sequence set corresponding to the pathogen operation group.
In one embodiment, the generating module 2710 of the set of characteristic target sequences is further configured to obtain the number of occurrences CG of each k-mer in the pathogen operation group; obtain the number of occurrences CB, in the complete set, of each k-mer contained in the pathogen operation group; calculate the corpus proportion F of each k-mer as the ratio of the number of occurrences CG to the number of occurrences CB; generate, according to the corpus proportion F, a k-mer occurrence number-to-corpus proportion table corresponding to each pathogen operation group; and store the k-mer occurrence number-to-corpus proportion table in the characteristic target sequence set corresponding to the pathogen operation group.
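The proportion F can be written as F = CG / CB for each k-mer. A minimal sketch follows, assuming CG and CB are held in dictionaries keyed by k-mer; the name `corpus_proportion_table` and the toy counts are hypothetical.

```python
def corpus_proportion_table(cg, cb):
    """Return F = CG / CB for every k-mer of the group that occurs in the corpus.

    cg : occurrences of each k-mer within the pathogen operation group
    cb : occurrences of each k-mer within the complete set (corpus)
    """
    return {kmer: cg[kmer] / cb[kmer] for kmer in cg if cb.get(kmer, 0) > 0}

cg = {"ACG": 4, "CGT": 3}
cb = {"ACG": 4, "CGT": 12, "TTT": 7}
table = corpus_proportion_table(cg, cb)
```

A value of 1.0 means every occurrence of the k-mer in the corpus lies inside the group, while 0.25 means three quarters of its occurrences belong to other genomes.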
In an embodiment, the pathogen operation group selecting module 2708 is further configured to obtain a specific k-mer with an actual occurrence number greater than a preset actual occurrence number threshold as a specific k-mer for confirming occurrence; acquiring the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data; comparing the number of the specificity k-mers which are confirmed to appear and correspond to each pathogen operation group with the false positive distribution of the specificity k-mers contained in each pathogen operation group in the simulation test data to obtain the false positive detection probability of each pathogen operation group; and selecting the pathogen operation group with the false positive detection probability lower than a preset threshold value as the pathogen operation group contained in the sequencing data.
In one embodiment, the pathogen operation set selection module 2708 is further configured to: acquiring simulation test data corresponding to each pathogen operation group; the simulation test data is data obtained by randomly sampling genomes in the complete set, which do not belong to the pathogen operation group corresponding to the simulation test data; calculating the number of specific k-mers contained in each pathogen manipulation group in the corresponding simulated test data; and obtaining the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data according to the number of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data.
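The simulation-based false positive estimate can be sketched as follows. The sketch assumes reads are drawn uniformly at random from genomes outside the group, that the false positive distribution is the list of confirmed specific k-mer counts over repeated trials, and that the false positive probability is the fraction of trials reaching at least the observed count. These choices, the parameter values, and all function names are illustrative assumptions.

```python
import random

def confirmed_count(reads, specific_kmers, k, min_count=0):
    """Number of specific k-mers seen more than min_count times in the reads."""
    hits = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in specific_kmers:
                hits[kmer] = hits.get(kmer, 0) + 1
    return sum(1 for n in hits.values() if n > min_count)

def simulate_reads(genomes, n_reads, read_len, rng):
    """Randomly sample fixed-length reads from the given genomes."""
    reads = []
    for _ in range(n_reads):
        g = rng.choice(genomes)
        start = rng.randrange(len(g) - read_len + 1)
        reads.append(g[start:start + read_len])
    return reads

def false_positive_distribution(non_group_genomes, specific_kmers, k,
                                n_trials=200, n_reads=50, read_len=20, seed=0):
    """Confirmed specific k-mer counts over repeated simulated samples."""
    rng = random.Random(seed)
    return [
        confirmed_count(simulate_reads(non_group_genomes, n_reads, read_len, rng),
                        specific_kmers, k)
        for _ in range(n_trials)
    ]

def false_positive_probability(observed, distribution):
    """Fraction of simulated samples with at least the observed count."""
    return sum(1 for c in distribution if c >= observed) / len(distribution)

# Genomes outside the group never contain these two specific 5-mers.
dist = false_positive_distribution(["ACGT" * 20, "TTGCA" * 16], {"GGGGG", "CCCCC"}, k=5)
p = false_positive_probability(2, dist)
```

A group would then be retained when `p` falls below the preset threshold.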
In one embodiment, the pathogen operation set selection module 2708 is further configured to: acquiring specific k-mers with actual occurrence times larger than a preset actual occurrence time threshold value as the confirmed specific k-mers, and acquiring the number of the confirmed specific k-mers; acquiring the number of specific k-mers contained in each pathogen operation group; calculating the concentration of the specific k-mers confirmed to appear in each pathogen operation group according to the number of the specific k-mers confirmed to appear and the number of the specific k-mers contained in each pathogen operation group; and selecting the pathogen operation group with the confirmed occurrence of the specific k-mer concentration higher than a preset threshold value as the pathogen operation group contained in the sequencing data.
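The text does not give a formula for the concentration of confirmed specific k-mers. The sketch below assumes it is the number of confirmed specific k-mers divided by the total number of specific k-mers in the group; the function name and the example counts are hypothetical.

```python
def confirmed_kmer_concentration(actual_counts, specific_kmers, min_count=0):
    """Share of a group's specific k-mers that are confirmed to occur.

    actual_counts  : occurrences of each k-mer in the sequencing data
    specific_kmers : all specific k-mers of the pathogen operation group
    min_count      : a k-mer is confirmed when its count exceeds this value
    """
    confirmed = sum(
        1 for kmer in specific_kmers if actual_counts.get(kmer, 0) > min_count
    )
    return confirmed / len(specific_kmers)

specific_kmers = {"AAA", "CCC", "GGG", "TTT"}
actual_counts = {"AAA": 5, "CCC": 1, "GGG": 3}
concentration = confirmed_kmer_concentration(actual_counts, specific_kmers, min_count=2)
```

With a threshold of 2, only "AAA" and "GGG" are confirmed, giving a concentration of 0.5.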
In one embodiment, the apparatus for detecting pathogen operation group further comprises a pathogen operation group list generating module (not shown in the figure) for obtaining medical information of each pathogen operation group included in the sequencing data from the target database; generating a final pathogen operation group list according to the medical information of each pathogen operation group contained in the sequencing data; and outputting the final pathogen operation group list to the detection result of the sequencing data.
As shown in fig. 29, in one embodiment, the detection apparatus of the pathogen operation group further includes a relative concentration calculation module 2712 for obtaining the sum CT of the occurrences of the k-mers occurring in the sequencing data; obtaining an estimated actual occurrence sum CF of the k-mers contained in a pathogen operation group contained in the sequencing data, where the estimated actual occurrence sum CF is the sum of the estimated numbers of occurrences, in the sequencing data, of each k-mer contained in the pathogen operation group; and calculating the relative concentration of each pathogen operation group from the occurrence sum CT and the estimated actual occurrence sum CF.
In one embodiment, the relative concentration calculation module 2712 is further configured to take the pathogen operation groups contained in the sequencing data as final pathogen operation groups; obtain the corpus proportion of each k-mer contained in each final pathogen operation group, where the corpus proportion is the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set; calculate the estimated actual number of occurrences of each k-mer according to its corpus proportion and its actual number of occurrences; and calculate, from the estimated actual numbers of occurrences of the k-mers, the estimated actual occurrence sum CF of the k-mers contained in the pathogen operation group contained in the sequencing data.
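A minimal sketch of this calculation follows. It assumes that the estimated actual occurrence count of a k-mer is its actual count multiplied by its corpus proportion CG / CB, and that the relative concentration is CF divided by CT. The text only says that the values are calculated "according to" these quantities, so both formulas and the function names are assumptions.

```python
def estimated_sum_cf(actual_counts, proportion_f):
    """Sum of estimated occurrences: actual count weighted by F = CG / CB."""
    return sum(
        actual_counts.get(kmer, 0) * f for kmer, f in proportion_f.items()
    )

def relative_concentration(cf, ct):
    """Relative concentration of a group, taken here as CF / CT."""
    return cf / ct if ct else 0.0

actual_counts = {"ACG": 8, "CGT": 4}      # counts observed in the sequencing data
proportion_f = {"ACG": 1.0, "CGT": 0.25}  # corpus proportion of each k-mer
cf = estimated_sum_cf(actual_counts, proportion_f)
rel = relative_concentration(cf, ct=36)
```

Weighting by F discounts k-mers that are shared with genomes outside the group, so that shared sequence does not inflate the group's estimate.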
In one embodiment, the relative concentration calculation module 2712 is further configured to obtain the actual number of occurrences of k-mers included in the pathogen manipulation group in the sequencing data of the sample; and generating a k-mer actual occurrence number table corresponding to each pathogen operation group according to the actual occurrence number of the k-mers contained in the pathogen operation group in the sequencing data of the sample.
In one embodiment, the relative concentration calculation module 2712 is further configured to obtain the k-mer occurrence number record table corresponding to each final pathogen operation group; obtain the k-mers contained in the k-mer occurrence number record table; obtain the k-mer occurrence number-to-corpus proportion table corresponding to each final pathogen operation group, where the k-mer occurrence number-to-corpus proportion table includes the number of occurrences of each k-mer sequence in the pathogen operation group and the number of occurrences of each k-mer sequence in the complete set; and obtain, from the k-mer occurrence number-to-corpus proportion table, the corpus proportion of each k-mer in the k-mer occurrence number record table.
In one embodiment, the generating module 2710 of the set of characteristic target sequences is further configured to obtain the number of occurrences, in the pathogen operation group, of the k-mers contained in each pathogen operation group; generate a k-mer occurrence number record table corresponding to each pathogen operation group according to the number of occurrences of each k-mer in the pathogen operation group; and store the k-mer occurrence number record table in the characteristic target sequence set corresponding to the pathogen operation group. The relative concentration calculation module 2712 is further configured to obtain the number of occurrences CG of the k-mers in the pathogen operation group from the k-mer occurrence number record table corresponding to the pathogen operation group.
In one embodiment, the generating module 2710 of the set of characteristic target sequences is further configured to obtain the number of occurrences of k-mers in the ensemble for each pathogen operation group; generating a k-mer occurrence number recording table of the corpus according to the occurrence number of the corpus of each k-mer; and storing the occurrence frequency record table of the k-mers of the corpus into a target point database. The relative concentration calculation module 2712 is further configured to obtain the occurrence CB of the k-mers in the pathogen operation group contained in the target database from the k-mer occurrence record table of the corpus.
In one embodiment, the relative concentration calculation module 2712 is further configured to obtain the k-mers included in each final pathogen operation group according to the k-mer occurrence number record table corresponding to each final pathogen operation group; generating a k-mer occurrence union list according to the k-mer occurrence frequency record table corresponding to each pathogen operation group; acquiring the actual occurrence times of each k-mer in sequencing data, wherein the actual occurrence times of each k-mer are contained in a k-mer occurrence union list; acquiring actual occurrence times corresponding to k-mers contained in each final pathogen operation group; and generating a k-mer actual occurrence number table corresponding to the final pathogen operation group according to the actual occurrence number corresponding to each k-mer.
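The union-list step can be sketched as below: the k-mers of all final groups are merged into one set so that the sequencing data is scanned once, and the counts are then split back into one table per group. The read representation as plain strings, the function names, and the toy tables are assumptions.

```python
from collections import Counter

def count_kmers(reads, k, wanted):
    """Count occurrences in the reads of the k-mers listed in `wanted`."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in wanted:
                counts[kmer] += 1
    return counts

def actual_occurrence_tables(group_tables, reads, k):
    """Build one actual-occurrence table per group from a single scan.

    group_tables : group name -> k-mer occurrence record table of that group
    """
    # Union list: every k-mer that appears in any group's record table.
    union = set().union(*(table.keys() for table in group_tables.values()))
    actual = count_kmers(reads, k, union)
    return {
        group: {kmer: actual[kmer] for kmer in table}
        for group, table in group_tables.items()
    }

group_tables = {"groupA": {"ACG": 4, "CGT": 3}, "groupB": {"CGT": 2, "TTT": 5}}
tables = actual_occurrence_tables(group_tables, ["ACGT", "CGTT", "TTTT"], k=3)
```

Scanning against the union avoids one pass over the sequencing data per group, which matters when many groups share k-mers.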
In one embodiment, a device for detecting the relative concentration of a working group of pathogens is provided, as shown in fig. 30, the device comprising:
the sequencing data acquisition module is used for acquiring sequencing data of a sample and calculating the sum CT of the occurrence times of k-mers appearing in the sequencing data;
the occurrence number GCT acquisition module is used for acquiring the occurrence number GCT of the k-mers in the sequencing data in the pathogen operation group contained in the complete set;
and the relative concentration calculating module is used for calculating the relative concentration of the pathogen operation group contained in the complete set in the sequencing data according to the occurrence times GCT and the total occurrence times CT.
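The three modules above can be illustrated with a short sketch. It assumes CT is the total number of k-mer positions in the reads, that GCT is obtained here by simply counting positions whose k-mer belongs to the group, and that the relative concentration is GCT divided by CT. The text does not state the formula, so the ratio and all names are assumptions.

```python
def total_kmer_occurrences(reads, k):
    """CT: total number of k-mer occurrences across all reads."""
    return sum(max(len(read) - k + 1, 0) for read in reads)

def group_kmer_occurrences(reads, k, group_kmers):
    """GCT (naive version): occurrences of the group's k-mers in the reads."""
    return sum(
        1
        for read in reads
        for i in range(len(read) - k + 1)
        if read[i:i + k] in group_kmers
    )

def relative_concentration(gct, ct):
    """Relative concentration of a group, taken here as GCT / CT."""
    return gct / ct if ct else 0.0

reads = ["ACGTAC", "TTTTTT"]
ct = total_kmer_occurrences(reads, k=3)
gct = group_kmer_occurrences(reads, k=3, group_kmers={"ACG", "CGT"})
rel = relative_concentration(gct, ct)
```

Each read of length 6 contributes four 3-mers, so CT is 8; two of those positions match the group, giving a relative concentration of 0.25.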
In one embodiment, the above apparatus for detecting the relative concentration of the pathogen operation group is further configured to: acquire high-reliability genome data to obtain a complete set, where a high-reliability genome refers to a genome meeting preset conditions; and determine the pathogen operation groups contained in the complete set.
In an embodiment, the number of occurrences GCT obtaining module is further configured to: acquiring each first marker data in sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the total number of confirmed actual occurrences CCT according to the number of k-mers contained in each first marker data; acquiring a ratio of a total set ratio of each k-mer contained in a pathogen operation group contained in the total set, wherein the ratio of the total set ratio is the ratio of the occurrence frequency CG of the k-mer in the corresponding pathogen operation group to the occurrence frequency CB of the k-mer in the total set; acquiring each second marking data in the sequencing data, and acquiring the actual occurrence times of each k-mer in a pathogen operation group contained in a complete set in each second marking data; calculating to obtain the estimated actual occurrence times of each k-mer according to the ratio of the whole set to the total set and the actual occurrence times; calculating to obtain the total estimated actual occurrence times ECT according to the estimated actual occurrence times of each k-mer; the number of occurrences GCT of k-mers in the sequencing data contained in the pathogen manipulation group contained in the repertoire is calculated from the estimated total number of actual occurrences ECT and the confirmed total number of actual occurrences CCT.
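The combination of the confirmed total CCT and the estimated total ECT is not spelled out in the text. The sketch below assumes that first marker data are reads attributed unambiguously to the group, so all of their k-mers count in full, that second marker data are ambiguous reads whose k-mers are weighted by the corpus proportion CG / CB, and that GCT is the sum of CCT and ECT. All three points, and the function name, are assumptions for illustration.

```python
def occurrences_gct(first_marked_reads, second_marked_reads, k, proportion_f):
    """Return (CCT, ECT, GCT) under the assumptions stated above.

    first_marked_reads  : reads whose k-mers are all counted in full
    second_marked_reads : reads whose k-mers are weighted by F = CG / CB
    proportion_f        : corpus proportion F of each k-mer of the group
    """
    # CCT: number of k-mers contained in the first marker data.
    cct = sum(max(len(read) - k + 1, 0) for read in first_marked_reads)

    # ECT: each occurrence in the second marker data contributes its F value.
    ect = 0.0
    for read in second_marked_reads:
        for i in range(len(read) - k + 1):
            ect += proportion_f.get(read[i:i + k], 0.0)

    return cct, ect, cct + ect

cct, ect, gct = occurrences_gct(["ACGTAC"], ["CGTT"], k=3,
                                proportion_f={"ACG": 1.0, "CGT": 0.25})
```

In the example, the first read yields four confirmed k-mers and the second read contributes 0.25 for its single group k-mer "CGT".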
In one embodiment, the GCT acquisition module includes: a CCT calculation module, used for acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the confirmed actual occurrence total CCT according to the number of k-mers contained in each first marker data; and a length judging module, used for taking the confirmed actual occurrence total CCT as the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation group contained in the complete set when the length of the k-mers contained in the pathogen operation group contained in the complete set is greater than a target value.
In one embodiment, the above apparatus for detecting the relative concentration of the pathogen operating group is further configured to: obtaining a pathogen operation group contained in the sequencing data, and obtaining the relative concentration of the pathogen operation group contained in the sequencing data; and selecting the pathogen operation group with the relative concentration of the pathogen operation group contained in the sequencing data higher than a preset threshold value as the pathogen operation group confirmed to be contained in the sequencing data.
For specific limitations of the detection device of the pathogen manipulation group and the detection device of the relative concentration of the pathogen manipulation group, reference may be made to the above limitations of the detection method of the pathogen manipulation group and the detection method of the relative concentration of the pathogen manipulation group, which are not described herein again. The detection device of the pathogen operation group and the detection device of the relative concentration of the pathogen operation group can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 31. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store the data of the target database, such as the set of characteristic target sequences corresponding to each pathogen operation group. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of detecting a pathogen operation group and a method of detecting the relative concentration of a pathogen operation group.
Those skilled in the art will appreciate that the architecture shown in fig. 31 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program implementing the steps of the method for detection of a pathogen manipulation group and the method for detection of the relative concentration of a pathogen manipulation group as provided in any of the embodiments of the present application.
In one embodiment, a computer readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of the method for detection of a pathogen operation group and the method for detection of the relative concentration of a pathogen operation group provided in any one of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples express only several embodiments of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (26)

1. A method of detecting a manipulation group of pathogens, the method comprising:
obtaining sequencing data of a sample;
acquiring a characteristic target sequence set corresponding to each pathogen operation group stored in a target database, wherein the characteristic target sequence set comprises specific k-mers meeting preset specific conditions in the pathogen operation groups, and the k-mers refer to genome sequences with the length of k;
acquiring the occurrence times of specific k-mers in sequencing data, wherein the specific k-mers are contained in a feature target sequence set corresponding to each pathogen operation group;
selecting a pathogen operation group with the occurrence frequency exceeding a preset frequency threshold as a pathogen operation group contained in the sequencing data, wherein a k-mer in the specific k-mers meets the following two conditions:
the occurrence times in the genome occurrence time index table of the pathogen operation group meet a first preset error condition; the occurrence times in the genome occurrence time index table of the pathogen operation group and the occurrence times in the genome occurrence time index table of the corpus satisfy a second preset error condition;
the genome occurrence index table of the pathogen operation group records, for each k-mer, the number of genomes containing that k-mer among the genomes contained in the corresponding pathogen operation group; the genome occurrence index table of the corpus records the number of genomes containing the k-mer among the genomes contained in the corpus.
2. The method of claim 1, further comprising, prior to obtaining sequencing data for the sample:
selecting k-mers meeting preset specificity conditions from the k-mers corresponding to each pathogen operation group;
and storing the k-mers meeting the preset specificity condition into the characteristic target sequence set corresponding to each pathogen operation group.
3. The method of claim 2, further comprising, before selecting k-mers satisfying a predetermined specificity condition from among the k-mers corresponding to each pathogen operation group:
acquiring high-reliability genome data to obtain a complete set, wherein the high-reliability genome refers to a genome meeting preset conditions;
determining a panel of pathogen operations contained in the corpus.
4. The method of claim 3, further comprising, after determining the operational group of pathogens contained in the complete set:
obtaining sequencing data of the sample, and calculating the sum CT of the occurrence times of k-mers appearing in the sequencing data;
obtaining the number of occurrences GCT of k-mers in the sequencing data contained in a pathogen manipulation group contained in the repertoire;
calculating the relative concentration of the pathogen operation group contained in the complete set in the sequencing data according to the occurrence number GCT and the occurrence number sum CT;
after the pathogen operation group with the occurrence frequency exceeding the preset frequency threshold is selected as the pathogen operation group contained in the sequencing data, the method further comprises the following steps:
obtaining relative concentrations of pathogen operation groups contained in the sequencing data according to relative concentrations of pathogen operation groups contained in the complete set in the sequencing data;
and selecting the pathogen operation group with the relative concentration of the pathogen operation group contained in the sequencing data higher than a preset threshold value as the pathogen operation group confirmed to be contained in the sequencing data.
5. The method of claim 4, wherein said obtaining the number of occurrences GCT of k-mers in sequencing data contained in the pathogen manipulation group contained in said repertoire comprises:
acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the number of confirmed actual occurrence times CCT according to the number of k-mers contained in each first marker data;
acquiring a corpus proportion ratio of each k-mer in a pathogen operation group contained in the corpus, wherein the corpus proportion ratio is the ratio of the occurrence frequency CG of the k-mer in the corresponding pathogen operation group to the occurrence frequency CB of the k-mer in the corpus;
obtaining each second marker data in the sequencing data, obtaining an actual number of occurrences of each k-mer included in the pathogen operation group included in the ensemble in each second marker data;
calculating to obtain the estimated actual occurrence times of each k-mer according to the ratio of the full set to the total set and the actual occurrence times;
calculating to obtain the total estimated actual occurrence times ECT according to the estimated actual occurrence times of each k-mer;
and calculating the occurrence times GCT of the k-mers in the sequencing data in the pathogen operation group contained in the complete set according to the estimated actual total occurrence times ECT and the confirmed actual total occurrence times CCT.
6. The method of claim 4, wherein said obtaining the number of occurrences GCT of k-mers in sequencing data contained in the pathogen manipulation group contained in said repertoire comprises:
acquiring each first marker data in the sequencing data, calculating the number of k-mers contained in each first marker data, and calculating the number of confirmed actual occurrence times CCT according to the number of k-mers contained in each first marker data;
and when the lengths of the k-mers contained in the pathogen operation groups contained in the ensemble are larger than a target value, taking the total number of confirmed actual occurrences CCT as the number of occurrences GCT of the k-mers contained in the pathogen operation groups contained in the ensemble in sequencing data.
7. The method of claim 1, further comprising, prior to said obtaining sequencing data for the sample:
generating a genome occurrence index table corresponding to each pathogen operation group;
storing the genome occurrence index table to a characteristic target sequence set corresponding to the pathogen operation group;
generating a genome occurrence index table of the corpus;
and storing the genome occurrence index table of the complete set into the target database.
8. The method of claim 1, wherein the first predetermined error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table of the pathogen manipulation group to the number of genomes contained in the pathogen manipulation group and the first threshold value is 1 or more.
9. The method of claim 1, wherein the second predetermined error condition is: the sum of the ratio of the number of occurrences in the genome number of occurrences index table of the pathogen manipulation group to the number of occurrences in the genome number of occurrences index table of the corpus and the second threshold value is 1 or more.
10. The method of claim 9, wherein the second threshold is less than 5%.
11. The method of claim 1, wherein selecting a pathogen operation group with a frequency exceeding a predetermined threshold as the pathogen operation group included in the sequencing data comprises:
obtaining an actual number of occurrences of each specific k-mer contained in each pathogen manipulation group in the sequencing data;
selecting a pathogen operation group corresponding to the specific k-mer with the actual occurrence times in the sequencing data exceeding a first preset time threshold value to obtain a suspected pathogen operation group;
and selecting a pathogen operation group meeting preset conditions from the suspected pathogen operation groups as the pathogen operation group contained in the sequencing data.
12. The method of claim 11, wherein selecting the pathogen operation group meeting the predetermined condition from the suspected pathogen operation groups as the pathogen operation group contained in the sequencing data comprises:
acquiring the specific k-mer with the actual occurrence number larger than a preset actual occurrence number threshold as the specific k-mer with the occurrence confirmation;
acquiring the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data;
comparing the number of the specificity k-mers which are confirmed to appear and correspond to each pathogen operation group with the false positive distribution of the specificity k-mers contained in each pathogen operation group in the simulation test data to obtain the false positive detection probability of each pathogen operation group;
and selecting a pathogen operation group with the false positive detection probability lower than a preset threshold value as the pathogen operation group contained in the sequencing data.
13. The method of claim 12, wherein obtaining a false positive distribution in the simulated test data for each specific k-mer contained in each pathogen manipulation group comprises:
acquiring simulation test data corresponding to each pathogen operation group; the simulated test data is randomly sampled from the genome of the complete set which does not belong to the pathogen operation group corresponding to the simulated test data;
calculating the number of specific k-mers contained in each pathogen manipulation group in the corresponding simulated test data;
and obtaining the false positive distribution of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data according to the number of the specific k-mers contained in each pathogen operation group in the corresponding simulated test data.
14. The method of claim 11, wherein selecting the pathogen operation group meeting the predetermined condition from the suspected pathogen operation groups as the pathogen operation group contained in the sequencing data comprises:
taking the specific k-mers whose actual number of occurrences exceeds a preset occurrence threshold as confirmed specific k-mers, and obtaining the number of confirmed specific k-mers;
obtaining the number of specific k-mers contained in each pathogen operation group;
calculating the confirmed specific k-mer concentration of each pathogen operation group from the number of confirmed specific k-mers and the number of specific k-mers contained in that pathogen operation group;
and selecting each pathogen operation group whose confirmed specific k-mer concentration is higher than a preset threshold as a pathogen operation group contained in the sequencing data.
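As a minimal sketch of claim 14, the concentration can be read as the fraction of a group's specific k-mers that are confirmed. The claim only says the concentration is calculated "from" the two numbers, so the simple ratio below is an assumption, as is the name `confirmed_concentration`.

```python
from typing import Dict, Set


def confirmed_concentration(actual_counts: Dict[str, int],
                            group_kmers: Set[str],
                            min_occurrences: int) -> float:
    """Fraction of the group's specific k-mers seen more than min_occurrences times."""
    if not group_kmers:
        return 0.0
    confirmed = sum(1 for kmer in group_kmers
                    if actual_counts.get(kmer, 0) > min_occurrences)
    return confirmed / len(group_kmers)
```

A group would then be reported when this value exceeds the preset concentration threshold.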
15. The method of claim 1, further comprising, after selecting the pathogen operation group with the occurrence number exceeding a predetermined threshold number as the pathogen operation group included in the sequencing data:
acquiring medical information of each pathogen operation group contained in the sequencing data from a target database;
generating a final pathogen operation group list according to the medical information of each pathogen operation group contained in the sequencing data;
and outputting the final pathogen operation group list to the detection result of the sequencing data.
16. A method for detecting the relative concentration of a pathogen operation group, the method comprising:
obtaining sequencing data of a sample, and calculating the total number of occurrences CT of the k-mers appearing in the sequencing data;
obtaining the number of occurrences GCT, in the sequencing data, of the k-mers contained in a pathogen operation group of a complete set;
and calculating the relative concentration, in the sequencing data, of the pathogen operation group of the complete set from the number of occurrences GCT and the total number of occurrences CT.
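The calculation in claim 16 can be illustrated as follows, assuming the relative concentration is the plain ratio GCT / CT (the claim does not fix the formula). The helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_concentration(reads: Iterable[str], group_kmers: Set[str], k: int) -> float:
    """GCT / CT, where CT counts all k-mers and GCT only those of the group."""
    counts = count_kmers(reads, k)
    ct = sum(counts.values())
    gct = sum(n for kmer, n in counts.items() if kmer in group_kmers)
    return gct / ct if ct else 0.0
```

With reads `["ACGTA", "TTTT"]`, `k = 3`, and group k-mers `{"ACG", "TTT"}`, CT is 5 and GCT is 3.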
17. The method of claim 16, further comprising, before obtaining the sequencing data of the sample and calculating the total number of occurrences CT of the k-mers appearing in the sequencing data:
acquiring high-reliability genome data to obtain the complete set, wherein a high-reliability genome is a genome meeting preset conditions;
and determining the pathogen operation groups contained in the complete set.
18. The method of claim 16, wherein obtaining the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation group of the complete set comprises:
acquiring each item of first marker data in the sequencing data, calculating the number of k-mers contained in each item of first marker data, and calculating the total number of confirmed actual occurrences CCT from those numbers;
acquiring a complete-set proportion for each k-mer in the pathogen operation group of the complete set, the complete-set proportion being the ratio of the number of occurrences CG of the k-mer in the corresponding pathogen operation group to the number of occurrences CB of the k-mer in the complete set;
acquiring each item of second marker data in the sequencing data, and obtaining the actual number of occurrences, in each item of second marker data, of each k-mer contained in the pathogen operation group of the complete set;
calculating the estimated actual number of occurrences of each k-mer from its complete-set proportion and its actual number of occurrences;
calculating the total number of estimated actual occurrences ECT from the estimated actual number of occurrences of each k-mer;
and calculating the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation group of the complete set from the total number of estimated actual occurrences ECT and the total number of confirmed actual occurrences CCT.
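One way to read claim 18 is sketched below. Two points are assumptions rather than statements from the claim: the estimated occurrences of a k-mer are taken as its actual count multiplied by CG / CB, and GCT is taken as CCT + ECT. The name `estimate_gct` and the representation of marker data as plain read strings are also hypothetical.

```python
from typing import Dict, List, Tuple


def estimate_gct(first_marker_reads: List[str],
                 second_marker_reads: List[str],
                 cg: Dict[str, int],
                 cb: Dict[str, int],
                 k: int) -> Tuple[int, float, float]:
    """Return (CCT, ECT, GCT) for one pathogen operation group.

    cg[kmer]: occurrences of the k-mer in the group; cb[kmer]: in the complete set.
    """
    # CCT: every k-mer of a first-marker read is counted in full.
    cct = sum(max(len(read) - k + 1, 0) for read in first_marker_reads)

    # ECT: group k-mers in second-marker reads are weighted by CG / CB.
    ect = 0.0
    for read in second_marker_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in cg and cb.get(kmer, 0) > 0:
                ect += cg[kmer] / cb[kmer]

    return cct, ect, cct + ect  # assumed combination: GCT = CCT + ECT
```

The weighting reflects the idea that a k-mer shared with other groups should contribute only the share attributable to this group.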
19. The method of claim 16, wherein obtaining the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation group of the complete set comprises:
acquiring each item of first marker data in the sequencing data, calculating the number of k-mers contained in each item of first marker data, and calculating the total number of confirmed actual occurrences CCT from those numbers;
and when the length of the k-mers contained in the pathogen operation group of the complete set is greater than a target value, taking the total number of confirmed actual occurrences CCT as the number of occurrences GCT, in the sequencing data, of the k-mers contained in the pathogen operation group of the complete set.
20. The method of claim 16, further comprising, after calculating the relative concentration, in the sequencing data, of the pathogen operation group of the complete set from the number of occurrences GCT and the total number of occurrences CT:
obtaining the pathogen operation groups contained in the sequencing data, and obtaining the relative concentration of each of those pathogen operation groups;
and selecting each pathogen operation group whose relative concentration is higher than a preset threshold as a pathogen operation group confirmed to be contained in the sequencing data.
21. A device for detecting a pathogen operation group, the device comprising:
a sequencing data acquisition module, configured to acquire sequencing data of a sample;
a target sequence set acquisition module, configured to acquire the feature target sequence set corresponding to each pathogen operation group stored in a target database, wherein the feature target sequence set comprises the specific k-mers of the pathogen operation group that meet preset specificity conditions, and a k-mer is a genome sequence of length k;
a specific k-mer occurrence count acquisition module, configured to obtain the number of occurrences, in the sequencing data, of the specific k-mers contained in the feature target sequence set corresponding to each pathogen operation group;
and a pathogen operation group selection module, configured to select each pathogen operation group whose number of occurrences exceeds a preset count threshold as a pathogen operation group contained in the sequencing data, wherein each specific k-mer meets the following two conditions: its number of occurrences in the genome occurrence count index table of the pathogen operation group satisfies a first preset error condition; and its number of occurrences in the genome occurrence count index table of the pathogen operation group and its number of occurrences in the genome occurrence count index table of the complete set satisfy a second preset error condition; the genome occurrence count index table of the pathogen operation group records, for each k-mer, the number of genomes in the corresponding pathogen operation group that contain the k-mer; and the genome occurrence count index table of the complete set records the number of genomes in the complete set that contain the k-mer.
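The two conditions on specific k-mers can be sketched as below. The claim does not define the two preset error conditions in this passage, so the concrete forms used here are assumptions: condition 1 requires the k-mer to be present in nearly all genomes of the group, and condition 2 requires that few genomes outside the group contain it. The function and parameter names are hypothetical.

```python
from typing import Dict, Set


def select_specific_kmers(group_index: Dict[str, int],
                          complete_index: Dict[str, int],
                          n_group_genomes: int,
                          max_missing_frac: float,
                          max_outside_frac: float) -> Set[str]:
    """Pick k-mers that pass both (assumed) error conditions.

    group_index[kmer]: number of group genomes containing the k-mer.
    complete_index[kmer]: number of complete-set genomes containing the k-mer.
    """
    specific = set()
    for kmer, in_group in group_index.items():
        in_all = complete_index.get(kmer, in_group)
        # Assumed condition 1: present in nearly every genome of the group.
        cond1 = in_group >= (1 - max_missing_frac) * n_group_genomes
        # Assumed condition 2: almost no genome outside the group contains it.
        cond2 = (in_all - in_group) <= max_outside_frac * in_all
        if cond1 and cond2:
            specific.add(kmer)
    return specific
```

Both index tables count genomes, not raw occurrences, which is why the comparison is expressed in numbers of genomes.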
22. A device for detecting the relative concentration of a pathogen operation group, the device comprising:
a sequencing data acquisition module, configured to acquire sequencing data of a sample and calculate the total number of occurrences CT of the k-mers appearing in the sequencing data;
an occurrence count GCT acquisition module, configured to obtain the number of occurrences GCT, in the sequencing data, of the k-mers contained in a pathogen operation group of a complete set;
and a relative concentration calculation module, configured to calculate the relative concentration, in the sequencing data, of the pathogen operation group of the complete set from the number of occurrences GCT and the total number of occurrences CT.
23. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 15 when executing the computer program.
24. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 15.
25. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 16 to 20 when executing the computer program.
26. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 16 to 20.
CN201910316653.3A 2018-06-22 2019-04-19 Method and device for detecting pathogen operation group, computer equipment and storage medium Expired - Fee Related CN109949866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/087580 WO2019242445A1 (en) 2018-06-22 2019-05-20 Detection method, device, computer equipment and storage medium of pathogen operation group

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810649271 2018-06-22
CN2018106492718 2018-06-22

Publications (2)

Publication Number Publication Date
CN109949866A CN109949866A (en) 2019-06-28
CN109949866B CN109949866B (en) 2021-02-02

Family

ID=67015792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910316653.3A Expired - Fee Related CN109949866B (en) 2018-06-22 2019-04-19 Method and device for detecting pathogen operation group, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109949866B (en)
WO (1) WO2019242445A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473594B (en) * 2019-08-22 2020-05-05 广州微远基因科技有限公司 Pathogenic microorganism genome database and establishment method thereof
MX2022014017A (en) * 2020-05-08 2022-11-30 Illumina Inc Genome sequencing and detection techniques.
CN115578554B (en) * 2021-06-21 2024-02-02 数坤(上海)医疗科技有限公司 Vascular focus identification method, device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105441432A (en) * 2014-09-05 2016-03-30 天津华大基因科技有限公司 Composition and application thereof to sequencing and variation detection
CN106484865A (en) * 2016-10-10 2017-03-08 哈尔滨工程大学 Four-word linked-list dictionary tree search algorithm based on the DNA k-mer index problem
CN107133493A (en) * 2016-02-26 2017-09-05 中国科学院数学与系统科学研究院 Genome sequence assembly method, structural variation detection method and corresponding system
CN107750279A (en) * 2015-03-16 2018-03-02 个人基因组诊断公司 Nucleic acid analysis system and method
CN108137642A (en) * 2015-09-25 2018-06-08 语境基因组学有限公司 Application of molecular quality assurance methods in sequencing

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2561381C (en) * 2004-03-26 2015-05-12 Sequenom, Inc. Base specific cleavage of methylation-specific amplification products in combination with mass analysis
US20060095241A1 (en) * 2004-10-29 2006-05-04 Microsoft Corporation Systems and methods that utilize machine learning algorithms to facilitate assembly of aids vaccine cocktails
US20100217532A1 (en) * 2009-02-25 2010-08-26 University Of Delaware Systems and methods for identifying structurally or functionally significant amino acid sequences
WO2012155296A1 (en) * 2011-05-17 2012-11-22 深圳华大基因科技有限公司 Methods of acquiring genome size and error
CN102332064B (en) * 2011-10-07 2013-11-06 吉林大学 Biological species identification method based on genetic barcode
US20140106974A1 (en) * 2012-10-15 2014-04-17 Synblex, Llc Pathogen identification process and transport container
WO2016061396A1 (en) * 2014-10-16 2016-04-21 Counsyl, Inc. Variant caller
CN115206436A (en) * 2015-04-24 2022-10-18 犹他大学研究基金会 Method and system for multiple taxonomic classification
CN106971088A (en) * 2017-03-28 2017-07-21 泽塔生物科技(上海)有限公司 Molecular identification method and system for components of eukaryotic origin

Also Published As

Publication number Publication date
WO2019242445A1 (en) 2019-12-26
CN109949866A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
Simon et al. Benchmarking metagenomics tools for taxonomic classification
Kuchenbecker et al. IMSEQ—a fast and error aware approach to immunogenetic sequence analysis
Sharma et al. Gene loss rather than gene gain is associated with a host jump from monocots to dicots in the smut fungus Melanopsichium pennsylvanicum
DK3144672T3 (en) GENOME IDENTIFICATION SYSTEM
CN109949866B (en) Method and device for detecting pathogen operation group, computer equipment and storage medium
CN111341383B (en) Method, device and storage medium for detecting copy number variation
KR101828052B1 (en) Method and apparatus for analyzing copy-number variation (cnv) of gene
CN112259167B (en) Pathogen analysis method and device based on high-throughput sequencing and computer equipment
JP6644672B2 (en) Characterization of biological materials using unassembled sequence information, stochastic methods, and trait-specific database catalogs
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
Gleason et al. Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions
Orton et al. Bioinformatics tools for analysing viral genomic data
WO2019133937A1 (en) Microsatellite instabilty detection
CN113096728A (en) Method, device, storage medium and equipment for detecting tiny residual focus
CN114424287A (en) Single cell RNA-SEQ data processing
Arisdakessian et al. CoCoNet: an efficient deep learning tool for viral metagenome binning
Rajaby et al. SurVIndel: improving CNV calling from high-throughput sequencing data through statistical testing
CN110610741B (en) Human pathogen identification method and device and electronic equipment
CN109192246B (en) Method, apparatus and storage medium for detecting chromosomal copy number abnormalities
CN116802313A (en) Methods and systems for macrogenomic analysis
CN110021365B (en) Method, device, computer equipment and storage medium for determining detection target point
Schon et al. Bookend: precise transcript reconstruction with end-guided assembly
Hickl et al. binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets
KR20200133067A (en) Method and system for predicting disease from gut microbial data
CN116825182B (en) Method for screening bacterial drug resistance characteristics based on genome ORFs and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210202