CN113096737A - Method and system for automatically analyzing pathogen types - Google Patents
- Publication number
- CN113096737A, CN202110331835.5A
- Authority
- CN
- China
- Prior art keywords
- read data
- sequencing read
- pathogen
- data
- sequencing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type identification by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are compared against the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology, reduces the computation required for accurate alignment, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the requirement for rapid pathogen detection and analysis. The method can also analyze short-read data obtained by second-generation sequencing and therefore has good data compatibility.
Description
Technical Field
The present invention relates to the field of information technology, and more particularly, to a method and system for automated analysis of pathogen types.
Background
A pathogen identification method based on high-throughput sequencing data generally proceeds as follows. The sample undergoes experimental steps such as pretreatment, nucleic acid extraction, and sequencing library construction, and a high-throughput sequencer determines the nucleic acid sequences in the sample. Once the nucleic acid sequence data generated by the sequencer are obtained, sequence alignment software compares the sequences against a pathogen sequence database, and the alignment results are filtered under specified conditions to obtain credible results. The number of sequences in the sample data that belong to pathogens, their proportion among all sequences, and similar statistics are then calculated, and the presence of pathogens in the detected sample is finally determined.
Because metagenome and metatranscriptome sequencing produces large data volumes, clinical pathogen detection based on second-generation sequencing technology generally yields more than 1.5 Gb of sequencing data, and clinical pathogen detection based on third-generation long-read sequencing can yield even more. Therefore, to improve the timeliness and accuracy of clinical pathogen detection, a method and a system for rapidly aligning and analyzing metagenome or metatranscriptome sequencing data must be established.
Currently, commonly used sequence alignment software, including bwa, bowtie 2, SOAP, and BLAST, has been widely used to analyze second-generation sequencing data. The technical scheme of Chinese patent CN 108334750B, "Metagenome data analysis method and system", uses a k-mer algorithm to obtain a primary identification result and then uses the BLAST algorithm to perform secondary species identification on the sequences in the primary identification set; when the identification result of more than 50% of the sequences in the verification sequence set is consistent with the primary species identification result, the primary species identification result is considered verified and the species is reported as detected. The technical scheme of Chinese patent CN 109923217A, "Identification of pathogens and antibiotic characterization in metagenome samples", uses the bwa algorithm to align sequencing data against a pathogen sequence database. Fast short-sequence alignment algorithms such as bwa and bowtie 2 are built on the Burrows-Wheeler transform (BWT) and are optimized mainly for second-generation sequencing data to improve alignment speed. The BLAST algorithm is based on local sequence alignment; it uses a short-segment matching algorithm and an effective statistical model to find the optimal local alignment between a target sequence and a database.
However, how to realize automatic pathogen type analysis based on the third generation sequencing data is still a problem to be solved.
Disclosure of Invention
The invention provides a method and a system for automatically analyzing pathogen types, which are used for solving the problem of how to quickly and accurately determine the pathogen types in a sample to be detected on the basis of third-generation sequencing data.
In order to solve the above-mentioned problems, according to an aspect of the present invention, there is provided a method for automatically analyzing pathogen types, the method comprising:
obtaining at least one sequencing read data which is obtained by detecting a nucleic acid sequence and corresponds to a sample to be detected;
determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;
for any one second sequencing read data set, comparing each second sequencing read data in the any one second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set;
and determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data includes:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first matching length ratio, namely the ratio of the length within the any one first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of the any one first sequencing read data, is greater than or equal to a preset first ratio threshold, taking the any one first sequencing read data as one second sequencing read data.
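As an illustration of the selection rule described above, the following minimal Python sketch treats the two conditions as alternatives, either of which promotes a first sequencing read to a second sequencing read, and then groups the selected reads by identified pathogen type. The `PrimaryResult` structure, its field names, and the threshold values used in any call are hypothetical; the source does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class PrimaryResult:
    """Primary identification results for one first sequencing read (hypothetical layout)."""
    read_id: str
    read_length: int
    first_type: Optional[str]   # pathogen type from the first identification method
    second_type: Optional[str]  # pathogen type from the second identification method
    first_score: float = 0.0    # first alignment score against the reference of second_type
    matched_length: int = 0     # read bases matched to that reference


def select_type(r: PrimaryResult, preset_types: Set[str],
                score_threshold: float, ratio_threshold: float) -> Optional[str]:
    """Return the pathogen type under which the read is kept, or None if it is dropped."""
    # Condition 1: the first method identified a pathogen in the preset set.
    if r.first_type in preset_types:
        return r.first_type
    # Condition 2: the second method identified a preset pathogen AND both
    # the alignment score and the matching length ratio reach their thresholds.
    if r.second_type in preset_types and r.read_length > 0:
        ratio = r.matched_length / r.read_length
        if r.first_score >= score_threshold and ratio >= ratio_threshold:
            return r.second_type
    return None


def group_second_reads(results: Iterable[PrimaryResult], preset_types: Set[str],
                       score_threshold: float, ratio_threshold: float) -> Dict[str, List[str]]:
    """Build the second sequencing read data sets, keyed by pathogen type."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for r in results:
        pathogen = select_type(r, preset_types, score_threshold, ratio_threshold)
        if pathogen is not None:
            groups[pathogen].append(r.read_id)
    return dict(groups)
```

When both methods name a preset pathogen, this sketch keeps the type from the first method; the source does not state how such a case is resolved.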
Preferably, the comparison result comprises, for each second sequencing read data: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value relative to the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:
for any one second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in the set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be detected;
and determining the final pathogen type identification result according to all pathogen types determined to exist in the sample to be detected.
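The decision described above can be sketched in Python as follows: for each pathogen-specific set, take the read with the highest second alignment score and accept the pathogen only if that read passes all three thresholds. The `Alignment` structure and the threshold arguments are hypothetical names introduced for illustration, and treating the alignment expected value as a BLAST-style E-value is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alignment:
    """Comparison result of one second sequencing read (hypothetical layout)."""
    read_id: str
    score: float        # second alignment score
    match_ratio: float  # second matching length ratio
    similarity: float   # sequence similarity
    e_value: float      # alignment expected value (assumed to be an E-value)


def pathogen_present(alignments: List[Alignment], min_ratio: float,
                     min_similarity: float, max_e_value: float) -> bool:
    """Check the best-scoring read of one set against all three thresholds."""
    if not alignments:
        return False
    target = max(alignments, key=lambda a: a.score)  # target sequencing read
    return (target.match_ratio >= min_ratio
            and target.similarity >= min_similarity
            and target.e_value <= max_e_value)


def final_pathogen_types(sets_by_type: Dict[str, List[Alignment]], min_ratio: float,
                         min_similarity: float, max_e_value: float) -> List[str]:
    """Collect every pathogen type whose target read passes the checks."""
    return sorted(
        pathogen for pathogen, alignments in sets_by_type.items()
        if pathogen_present(alignments, min_ratio, min_similarity, max_e_value)
    )
```

Only one read per set is examined, so the cost of this final step grows with the number of candidate pathogen types rather than with the number of reads.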
Preferably, the method further comprises:
performing statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be detected, to obtain a statistical analysis result; wherein the statistical analysis result comprises, for the pathogen types present in the sample to be detected and at the genus level and species level of each pathogen type: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage.
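The source names these statistics without defining them. The Python sketch below uses common definitions as assumptions: RPM as reads per million total reads, composition ratio as a pathogen's share of all pathogen-assigned reads, and reference genome coverage as the fraction of reference positions covered by at least one aligned interval. All function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def reads_per_million(read_count: int, total_reads: int) -> float:
    """RPM: reads assigned to a pathogen, scaled to one million total reads."""
    return read_count * 1_000_000 / total_reads


def composition_ratios(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each pathogen among all pathogen-assigned reads (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def genome_coverage(intervals: Iterable[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of the reference covered by half-open [start, end) alignment intervals."""
    covered = 0
    current_end = 0  # rightmost reference position counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)    # clip to the reference length
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length
```

For example, 50 pathogen reads out of 2,000,000 total reads give an RPM of 25 under this definition.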
Preferably, the method further comprises:
obtaining annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be detected, and target object information of the sample to be detected, and generating a detection report according to the statistical analysis result, the annotation information, the information on the sample to be detected, and the target object information.
According to another aspect of the present invention, there is provided a system for automated analysis of pathogen type, the system comprising:
a sequencing read data acquisition unit for acquiring at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection;
the data cleaning unit is used for determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
the pathogen type identification unit is used for carrying out primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
a second sequencing read data set acquisition unit, configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, acquire at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to acquire second sequencing read data sets corresponding to different pathogen types;
a comparison result obtaining unit, configured to compare, for any one of the second sequencing read data sets, each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain a comparison result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets;
and the final pathogen type identification result determining unit is used for determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the data cleaning unit performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the second sequencing read data set obtaining unit selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first matching length ratio, namely the ratio of the length within the any one first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of the any one first sequencing read data, is greater than or equal to a preset first ratio threshold, taking the any one first sequencing read data as one second sequencing read data.
Preferably, the comparison result comprises, for each second sequencing read data: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value relative to the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the determining unit of the final pathogen type identification result determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:
for any one second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in the set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be detected;
and determining the final pathogen type identification result according to all pathogen types determined to exist in the sample to be detected.
Preferably, the system further comprises:
a statistical analysis unit, configured to perform statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be detected, to obtain a statistical analysis result; wherein the statistical analysis result comprises, for the pathogen types present in the sample to be detected and at the genus level and species level of each pathogen type: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage.
Preferably, the system further comprises:
a detection report generating unit, configured to obtain annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be detected, and target object information of the sample to be detected, and to generate a detection report according to the statistical analysis result, the annotation information, the information on the sample to be detected, and the target object information.
The invention provides a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type identification by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are compared against the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology and reduces the computation required for accurate alignment. It addresses the difficulty of achieving both accuracy and analysis speed when aligning long reads and sequencing read data with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical requirement for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing and therefore has good data compatibility.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
FIG. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the present invention;
FIG. 2 is a schematic diagram for automated pathogen type analysis according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of a detection report according to an embodiment of the present invention;
FIG. 4 is a flow diagram of automated pathogen type analysis according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a system 500 for automated pathogen type analysis, according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that this disclosure is thorough and complete and fully conveys the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to limit the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
FIG. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the invention. As shown in FIG. 1, the method for automatically analyzing pathogen types provided by the embodiment of the present invention can effectively process long-read data generated by third-generation sequencing technology and reduces the computation required for accurate alignment. It addresses the difficulty of achieving both accuracy and analysis speed when aligning long reads and sequencing read data with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical requirement for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing and therefore has good data compatibility. The method 100 starts at step 101, in which at least one sequencing read data corresponding to a sample to be detected and obtained by nucleic acid sequence detection is obtained.
The method can analyze third-generation long-read metagenome or metatranscriptome sequencing data corresponding to the sample to be detected, thereby determining the pathogen type. The principle of pathogen type identification is shown in FIG. 2. Referring to FIG. 2, in the present invention, the sequencing reads obtained by a sequencer performing nucleic acid sequencing on a sample are read and stored in FASTQ format, an internationally used sequencing data standard, to obtain a FASTQ data file; the sequencing read data can also be compressed with gzip to reduce storage usage, so as to obtain the fastq. When there are multiple samples to be detected, the method can identify the barcode information corresponding to each sample to be detected, split and identify the sequencing read data accordingly, and merge a plurality of FASTQ data files or compressed FASTQ. The sequencing reads can be second-generation data acquired by a second-generation sequencer and/or third-generation data acquired by a third-generation sequencer.
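As a minimal sketch of reading such input, the following Python generator yields records from either a plain or a gzip-compressed FASTQ file. It assumes the common four-line record layout and that compressed files carry a `.gz` suffix; neither detail is stated in the source, and the function name `read_fastq` is hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a plain or gzip-compressed FASTQ file."""
    # Assumption: a ".gz" suffix marks gzip compression.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'
```

Because the function is a generator, reads are streamed one at a time and the whole file never has to fit in memory, which matters for multi-gigabyte sequencing runs.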
In step 102, a data type of each sequencing read data in the at least one sequencing read data is determined, and data washing is performed on each sequencing read data in the at least one sequencing read data according to the data type and a preset data washing strategy, so as to obtain at least one first sequencing read data with qualified quality.
Preferably, performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
In the present invention, as shown in fig. 2, the data types of the sequencing read data include: second generation short read length data (i.e., second generation sequencing data) and third generation long read length data (i.e., third generation sequencing data). And for sequencing read data of different data types, cleaning according to a preset data cleaning strategy, and filtering low-quality data by using different data quality control software and parameters.
In the invention, the Nanofilt software is used for quality control detection and filtering of the third-generation sequencing read data file; the input file is an original FASTQ format file or a compressed FASTQ. The second sequencing read data quality standard is set according to actual conditions; for example, it may be set to Q ≥ 10, i.e., the error rate of all bases in the sequencing read is less than or equal to 10%. The Q value is calculated from the base error rate according to the formula $Q = -10 \times \log_{10} P$, where Q is the quality value and P is the error rate of a given base.
Quality control detection and filtering of the second-generation sequencing read data file are performed with the fastQC software; the input file is an original FASTQ format file or FASTQ. The first sequencing read data quality standard can be set according to actual conditions; for example, the error rate of all bases in the first sequencing read is set to be less than or equal to 0.1% (i.e., Q ≥ 30).
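The relationship between the quality value Q and the base error rate P, and its use as a filtering criterion, can be illustrated with a short sketch. Two assumptions are made here and are not stated in the text: quality strings use the common Phred+33 ASCII encoding, and a read is judged by its mean quality, computed by averaging per-base error rates before converting back to Q. All function names are hypothetical.

```python
import math


def phred_to_error(q: float) -> float:
    """Error rate P for a quality value Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Quality value Q for an error rate P."""
    return -10 * math.log10(p)


def mean_read_quality(quality: str, offset: int = 33) -> float:
    """Mean quality of a read (assumes Phred+33 encoding).

    Error rates are averaged first, because averaging Q values directly
    would understate the influence of low-quality bases.
    """
    if not quality:
        return 0.0
    errors = [phred_to_error(ord(ch) - offset) for ch in quality]
    return error_to_phred(sum(errors) / len(errors))


def passes_quality(quality: str, min_q: float) -> bool:
    """True if the read meets the quality standard, e.g. 10 or 30."""
    return mean_read_quality(quality) >= min_q
```

With these definitions, Q = 10 corresponds to an error rate of 10% and Q = 30 to 0.1%, matching the two example standards given above.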
In addition, the extracted high quality at least one first sequencing read data that passes data quality control can be stored in fastq.
In step 103, a first identification method and a second identification method are respectively used for performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality, so as to respectively obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data.
Referring to fig. 2, in the present invention, the first identification method is Kraken2 and the second identification method is Minimap2; the two different algorithms are each used to identify the pathogen type. The Kraken2 algorithm identifies bacteria, fungi, viruses and parasites based on species-specific k-mer sequences; the input is the high-quality sequence FASTQ.gz format file obtained in step 102, and the output is the species discrimination result corresponding to each sequencing read data, namely the first pathogen type identification result.
The Minimap2 algorithm is used for complementary alignment of pathogens with large genomic variation, such as viruses; pathogen type identification is based on minimizer (the seed with the minimum hash value in a sequence) hash table search, a chaining algorithm and a dynamic programming algorithm; the input is a high-quality sequence FASTQ. The second pathogen type identification result comprises: the pathogen type and the alignment score of each sequencing read data against the pathogen reference sequence.
In step 104, selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.
Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data includes:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, the first comparison score of the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio of the length of the pathogen reference sequence corresponding to the identified pathogen type in the any one first sequencing read data and the total length of the any one first sequencing read data is greater than or equal to a preset first ratio threshold, the any one first sequencing read data is used as one second sequencing read data.
In the present invention, the preset score threshold may be set to 50, and the first ratio threshold may be set to 50%.
Referring to fig. 2, in the present invention, for the first pathogen type identification result obtained based on the Kraken2 algorithm, for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, that is, the first pathogen type identification result is not empty or unknown, the any one of the first sequencing read data is used as a second sequencing read data.
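The selection based on the Kraken2 result can be sketched as below. The sketch assumes the standard tab-separated per-read Kraken2 output, in which the first column is "C" (classified) or "U" (unclassified), the second is the read identifier and the third is a numeric taxonomy ID; the function name and the representation of the preset pathogen type set as a set of taxonomy ID strings are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def select_kraken2_reads(lines: Iterable[str],
                         pathogen_taxids: Set[str]) -> Dict[str, str]:
    """Map read ID -> taxonomy ID for reads assigned to a preset pathogen.

    Reads that are unclassified, or classified to a taxon outside the
    preset pathogen set, are dropped.
    """
    selected: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in pathogen_taxids:
            selected[read_id] = taxid
    return selected
```

For example, with a preset set containing only taxonomy ID 562, a read classified to 562 is kept, while an unclassified read or a read classified to 9606 is discarded.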
For the second pathogen type identification result obtained based on the Minimap2 algorithm, any first sequencing read data is used as a second sequencing read data if all of the following hold: the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set; the first alignment score (MAPQ value) between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to the preset score threshold of 50; and the first length matching ratio Mi, determined from the CIGAR string as the ratio of the length within that first sequencing read data matched to the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to the preset first ratio threshold of 50%.
In the present invention, the second sequencing read data is a read suspected of being a pathogen.
In the present invention, the first length matching ratio Mi of a given sequencing read is calculated as follows:
$$\mathrm{Mi} = \frac{1}{L}\sum_{i=1}^{n} m_i \times 100\%$$
wherein n is the total number of fragments in the sequencing read data that match the corresponding pathogen reference sequence, $m_i$ is the length of the i-th matching fragment, and L is the length of the sequencing read data.
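The length matching ratio (summed lengths of matched fragments divided by the read length) can be derived from a CIGAR string roughly as follows. This is a sketch under stated assumptions: the operations M, = and X are counted as matched fragments, and the read length is taken as the sum of all operations that consume the read (M, I, S, =, X) plus hard-clipped bases (H). The text does not specify these choices, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MATCH_OPS = {"M", "=", "X"}                   # counted as matched fragments (assumption)
READ_LEN_OPS = {"M", "I", "S", "H", "=", "X"}  # bases belonging to the original read


def match_ratio(cigar: str) -> float:
    """Fraction of the read covered by matched fragments."""
    ops = CIGAR_OP.findall(cigar)
    matched = sum(int(n) for n, op in ops if op in MATCH_OPS)
    read_len = sum(int(n) for n, op in ops if op in READ_LEN_OPS)
    return matched / read_len if read_len else 0.0


def keep_alignment(mapq: int, cigar: str,
                   min_mapq: int = 50, min_ratio: float = 0.5) -> bool:
    """Apply the score threshold (50) and the ratio threshold (50%)."""
    return mapq >= min_mapq and match_ratio(cigar) >= min_ratio
```

For a CIGAR of `30M5I30M5D35S`, the matched length is 60 and the read length is 100, giving a ratio of 0.6; deletions consume only the reference and therefore do not count toward the read length.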
In the present invention, after at least one second sequencing read data suspected to be a pathogen is determined, the at least one second sequencing read data is further classified according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types. The data are split according to the identified or aligned pathogen species, and the second sequencing read data are extracted from the input FASTQ. Because identification is based on alignment against reference sequences of known species, all preliminarily identified species are known species; the splitting process stores the reads of each species in a separate FASTA format file as input to the subsequent Basic Local Alignment Search Tool (BLAST) alignment.
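The splitting of suspected pathogen reads into per-species sets can be sketched as follows. The inputs are assumed to be (read ID, sequence) pairs and a mapping from read ID to the identified pathogen type, for example as produced by the preliminary identification; the function names are hypothetical, and file naming is deliberately left to the caller.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read ID, nucleotide sequence)


def group_reads_by_pathogen(reads: Iterable[Read],
                            assignments: Dict[str, str]) -> Dict[str, List[Read]]:
    """Collect reads into one list per identified pathogen type.

    Reads without an assignment are not suspected pathogen reads and
    are skipped.
    """
    groups: Dict[str, List[Read]] = defaultdict(list)
    for read_id, sequence in reads:
        pathogen = assignments.get(read_id)
        if pathogen is not None:
            groups[pathogen].append((read_id, sequence))
    return dict(groups)


def to_fasta(records: Iterable[Read]) -> str:
    """Serialize reads as FASTA text, one header line and one sequence line each."""
    return "".join(f">{read_id}\n{sequence}\n" for read_id, sequence in records)
```

Each resulting group can then be written to its own FASTA file and passed to the alignment in the next step.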
In step 105, for any one second sequencing read data set, each second sequencing read data in the any one second sequencing read data set is compared with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set, and a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set is obtained.
Preferably, the alignment result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
Referring to fig. 2, in the present invention, each second sequencing read data in the FASTA format file of suspected pathogen reads separated by pathogen type in step 104 is subjected to accurate BLAST Alignment with the reference sequence of each species using Basic Local Alignment Search Tool (BLAST), so as to obtain the Alignment result corresponding to each second sequencing read data in each second sequencing read data set. And the comparison result corresponding to each second sequencing read data comprises a second comparison score, a second matching length ratio, a sequence similarity and a comparison expected value of the second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
In step 106, the final pathogen type identification result of the sample to be detected is determined according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:
for any sequencing read data set, selecting second sequencing read data with the highest second alignment score in any sequencing read data set as target sequencing read data, and judging whether the target sequencing read data simultaneously meet the condition that a second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to the determined pathogen types existing in all the samples to be detected.
In the present invention, the second ratio threshold is 80%, the sequence similarity threshold is 90%, and the alignment expected threshold is $10^{-5}$. For any sequencing read data set, the second sequencing read data with the highest second alignment score in that set is selected as the target sequencing read data. It is then judged whether the target sequencing read data simultaneously satisfy the following: the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of $10^{-5}$; if so, it is determined that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested. Finally, the results are summarized, and the final pathogen type identification result is determined according to all pathogen types determined to exist in the sample to be detected. The final pathogen type identification result includes: a pathogen type and the sequencing read data corresponding to the pathogen type.
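The decision rule for each second sequencing read data set can be illustrated with the following sketch. The `BlastHit` structure and function names are hypothetical; ratios are expressed as fractions (0.80 for 80%), and the default thresholds are the example values given above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlastHit:
    read_id: str
    score: float               # second alignment score
    match_length_ratio: float  # second matching length ratio, as a fraction
    identity: float            # sequence similarity, as a fraction
    e_value: float             # alignment expected value


def pathogen_present(hits: List[BlastHit],
                     min_ratio: float = 0.80,
                     min_identity: float = 0.90,
                     max_e_value: float = 1e-5) -> bool:
    """Test the highest-scoring read of one set against all three thresholds."""
    if not hits:
        return False
    best = max(hits, key=lambda h: h.score)
    return (best.match_length_ratio >= min_ratio
            and best.identity >= min_identity
            and best.e_value <= max_e_value)


def identify_pathogens(hit_sets: Dict[str, List[BlastHit]]) -> List[str]:
    """Return the pathogen types judged to be present in the sample."""
    return sorted(name for name, hits in hit_sets.items()
                  if pathogen_present(hits))
```

Note that only the highest-scoring read of each set is tested, as the text describes; a lower-scoring read that would pass all three thresholds does not rescue a set whose best read fails.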
Preferably, wherein the method further comprises:
performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.
Preferably, wherein the method further comprises:
and obtaining annotation information of each pathogen type, information of a sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.
The method of the present invention further comprises: performing statistical analysis, by species, on the sequencing read data with identified pathogen types obtained in step 106, and determining the pathogen types in the sample to be detected together with the read number Ni at the genus level and the species level, the composition ratio Pi, the RPMi value, the reference genome coverage, and the like. Pathogen types include: bacteria, fungi, viruses and parasites.
In the present invention, the composition ratio Pi of each pathogen type in the sample to be tested is calculated from the number of sequencing read data of each pathogen type as follows:
$$P_i = \frac{N_i}{\sum_{j=1}^{n} N_j} \times 100\%$$
wherein $P_i$ is the composition ratio of the i-th pathogen type, n is the total number of pathogen types detected in the sample, and $N_i$ is the number of reads of the i-th pathogen type.
The RPMi value of each pathogen type is calculated from the number of sequencing reads of each pathogen type and the total number of sequencing reads that pass quality control as follows:
$$\mathrm{RPM}_i = \frac{N_i}{N_t} \times 10^{6}$$
wherein RPMi is the RPM value for the ith pathogen type; ni is the number of reads for the ith pathogen type, and Nt is the total number of sequencing reads detected by quality control in the sample.
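The two statistics can be computed with a few lines of code. The sketch below assumes that the composition ratio is each pathogen type's read number divided by the summed read numbers of all detected pathogen types, which follows from the variable definitions given for Pi; the function names are hypothetical, and the ratios are returned as fractions rather than percentages.

```python
from typing import Dict


def composition_ratios(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Composition ratio of each pathogen type among all pathogen reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {name: 0.0 for name in read_counts}
    return {name: count / total for name, count in read_counts.items()}


def rpm(read_count: int, total_qc_reads: int) -> float:
    """Reads per million: Ni / Nt * 10^6, with Nt the reads passing quality control."""
    if total_qc_reads <= 0:
        raise ValueError("total_qc_reads must be positive")
    return read_count * 1_000_000 / total_qc_reads
```

As a worked example, 50 reads of one pathogen type among 200,000 quality-controlled reads give an RPM of 250, which allows comparison between samples sequenced to different depths.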
Based on the pathogen statistics obtained for the sample to be detected, the method can also add pathogen annotation information, sample information, target object information and the like so as to generate a detection report. The annotation information includes: the Chinese name and Latin name of the pathogen type, whether the genome is DNA or RNA, a text description of pathogenicity characteristics, and the like; the sample information includes: sample type, volume, sampling time, and the like; the target object information, i.e., patient information, includes: name, age, sex, disease diagnosis, and the like. The detection report is shown in fig. 3.
Fig. 4 is a flowchart of automatic pathogen type analysis according to an embodiment of the present invention, including: defining a data storage directory and a data type; reading a fastq file or a fastq.gz file under the directory; judging the data type; for third-generation sequencing read data, performing data quality control analysis with Nanofilt, and screening out and retaining the sequencing read data whose quality satisfies Q10; for second-generation sequencing read data, performing data quality control analysis with fastQC, and screening out and retaining the sequencing read data whose quality satisfies Q30; storing the retained sequencing read data in a past.fastq.gz file; performing species analysis based on Kraken2 and retaining the sequencing read data of known pathogens; performing species analysis based on Minimap2 and retaining the sequencing read data whose first alignment score (MAPQ value) is greater than or equal to 50; storing the retained sequencing read data in a taxi.fasta file according to the species taxi; using BLAST to align against each species genome to obtain a second alignment score, and retaining the sequencing read data corresponding to the highest second alignment score if it simultaneously satisfies that the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of $10^{-5}$; and performing statistical analysis of the pathogen types to generate a detection report.
The invention establishes a pathogen alignment and identification method for third-generation long-read sequencing data, and can effectively process the long-read data generated by third-generation sequencing technologies such as nanopore and PacBio. Compared with methods based on currently widely used alignment software such as bwa and bowtie 2, the method solves the problem that accuracy and analysis speed are difficult to achieve together in the alignment analysis of sequencing read data with long read length and high error rate, reduces the analysis time for typical data (1 Gb of nanopore sequencing data) to within 30 minutes, and meets the clinical requirement for rapid pathogen detection analysis of metagenomic or metatranscriptomic data obtained with third-generation sequencing technology. Meanwhile, the method can also analyze short-read data obtained by second-generation sequencing, and therefore has better data compatibility.
Fig. 5 is a schematic diagram of a system 500 for automated pathogen type analysis, according to an embodiment of the present invention. As shown in fig. 5, the present invention provides a system 500 for automated pathogen type analysis, comprising: a sequencing read data acquisition unit 501, a data washing unit 502, a pathogen type identification unit 503, a second sequencing read data set acquisition unit 504, an alignment result acquisition unit 505, and a final pathogen type identification result determination unit 506.
Preferably, the sequencing read data obtaining unit 501 is configured to obtain at least one sequencing read data obtained by detecting a nucleic acid sequence corresponding to a sample to be detected.
Preferably, the data washing unit 502 is configured to determine a data type of each sequencing read data in the at least one sequencing read data, and perform data washing on each sequencing read data in the at least one sequencing read data according to the data type and a preset data washing strategy to obtain at least one first sequencing read data with qualified quality.
Preferably, the data washing unit 502 performing data washing on each sequencing read data in the at least one sequencing read data according to the data type and a preset data washing strategy to obtain at least one first sequencing read data with qualified quality includes:
determining the sequencing read data quality of each sequencing read data with the data type of the second-generation sequencing data by using the fastQC, and screening out each sequencing read data with the sequencing read data quality meeting a preset first sequencing read data quality standard as first sequencing read data;
and determining the sequencing read data quality of each sequencing read data with the data type of the third-generation sequencing data by utilizing the Nanofilt, and screening each sequencing read data with the sequencing read data quality meeting a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the pathogen type identification unit 503 is configured to perform a preliminary pathogen type identification on each of the at least one qualified first sequencing read data by using a first identification method and a second identification method, respectively, so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each of the first sequencing read data, respectively.
Preferably, the second sequencing read data set obtaining unit 504 is configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, obtain at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.
Preferably, the second sequencing read data set obtaining unit 504 selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, the first comparison score of the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio of the length of the pathogen reference sequence corresponding to the identified pathogen type in the any one first sequencing read data and the total length of the any one first sequencing read data is greater than or equal to a preset first ratio threshold, the any one first sequencing read data is used as one second sequencing read data.
Preferably, the alignment result obtaining unit 505 is configured to, for any one of the second sequencing read data sets, align each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain an alignment result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets.
Preferably, the alignment result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the final pathogen type identification result determining unit 506 is configured to determine the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the determining unit 506 for final pathogen type identification result determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:
for any sequencing read data set, selecting second sequencing read data with the highest second alignment score in any sequencing read data set as target sequencing read data, and judging whether the target sequencing read data simultaneously meet the condition that a second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to the determined pathogen types existing in all the samples to be detected.
Preferably, wherein the system further comprises:
the statistical analysis unit is used for performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.
Preferably, wherein the system further comprises:
and the detection report generating unit is used for acquiring annotation information of each pathogen type, information of the sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.
The system 500 for automatically analyzing pathogen types according to the embodiment of the present invention corresponds to the method 100 for automatically analyzing pathogen types according to another embodiment of the present invention, and will not be described herein again.
The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (14)
1. A method for automated analysis of pathogen type, the method comprising:
obtaining at least one sequencing read data which is obtained by detecting a nucleic acid sequence and corresponds to a sample to be detected;
determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;
for any one second sequencing read data set, comparing each second sequencing read data in the any one second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set;
and determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
2. The method of claim 1, wherein the performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality comprises:
determining the sequencing read data quality of each sequencing read data with the data type of the second-generation sequencing data by using the fastQC, and screening out each sequencing read data with the sequencing read data quality meeting a preset first sequencing read data quality standard as first sequencing read data;
and determining the sequencing read data quality of each sequencing read data with the data type of the third-generation sequencing data by utilizing the Nanofilt, and screening each sequencing read data with the sequencing read data quality meeting a preset second sequencing read data quality standard as first sequencing read data.
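The type-dependent screening in claim 2 can be sketched as follows. This sketch does not call FastQC or NanoFilt; it substitutes a simple arithmetic-mean Phred filter to show the dispatch on data type. The threshold values, the `data_type` labels, and the record layout are all hypothetical, since the claim only speaks of "preset" quality standards.

```python
from statistics import mean

# Hypothetical per-type thresholds; the patent only says "preset quality standard".
MIN_MEAN_QUALITY = {"second_generation": 20.0, "third_generation": 7.0}


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of Phred scores decoded from a FASTQ quality string (Phred+33)."""
    return mean(ord(char) - offset for char in quality_string)


def clean_reads(reads):
    """Keep reads whose mean quality meets the threshold for their data type.

    Each read is a dict with 'id', 'data_type' and 'quality' keys (assumed layout).
    """
    passed = []
    for read in reads:
        threshold = MIN_MEAN_QUALITY[read["data_type"]]
        # Reads with an empty quality string are dropped rather than scored.
        if read["quality"] and mean_phred(read["quality"]) >= threshold:
            passed.append(read)
    return passed
```

Keeping one threshold table keyed by data type means a further sequencing platform could be supported by adding an entry rather than a new branch.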
3. The method of claim 1, wherein the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data comprises:
for any one first sequencing read data, if the first pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking that first sequencing read data as one second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first matching length ratio, namely the ratio of the length of the portion of that first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to a preset first ratio threshold, taking that first sequencing read data as one second sequencing read data.
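The selection rule of claim 3 can be sketched as below, reading the two clauses as alternative routes by which a first sequencing read data becomes a second sequencing read data (an interpretation, not something the claim states explicitly). The record layout and the key names `first_result`, `second_result`, `matched_length`, and `score` are hypothetical, and giving the first method precedence when both methods return a result is likewise an assumption.

```python
def select_second_reads(reads, pathogen_set, score_threshold, ratio_threshold):
    """Return (read_id, pathogen_type) pairs that qualify as second sequencing read data.

    Assumed layout: each read is a dict with 'id', 'length' (> 0),
    'first_result' (pathogen type or None) and 'second_result'
    (dict with 'pathogen', 'score', 'matched_length', or None).
    """
    selected = []
    for read in reads:
        first = read.get("first_result")
        second = read.get("second_result")
        # Route 1: the first method's result alone is sufficient.
        if first in pathogen_set:
            selected.append((read["id"], first))
            continue
        if second is None:
            continue
        # Route 2: the second method's result must also pass score and length-ratio checks.
        ratio = second["matched_length"] / read["length"]
        if (second["pathogen"] in pathogen_set
                and second["score"] >= score_threshold
                and ratio >= ratio_threshold):
            selected.append((read["id"], second["pathogen"]))
    return selected
```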
4. The method of claim 1, wherein the comparison result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
5. The method of claim 4, wherein determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set comprises:
for any second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected value threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to exist in the sample to be tested.
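The decision rule of claim 5 can be sketched as follows: for each per-pathogen read set, take the read with the highest alignment score and accept the pathogen only if that read passes all three threshold checks. The key names `score`, `match_ratio`, `similarity`, and `evalue` and the function name are hypothetical; the thresholds are passed in because the claim leaves them as preset values.

```python
def detect_pathogens(read_sets, min_ratio, min_similarity, max_evalue):
    """Return the sorted list of pathogen types judged present in the sample.

    read_sets maps a pathogen type to a list of comparison results, each a dict
    with 'score', 'match_ratio', 'similarity' and 'evalue' (assumed layout).
    """
    detected = []
    for pathogen, results in read_sets.items():
        if not results:
            continue
        # Target read: the one with the highest alignment score in this set.
        best = max(results, key=lambda result: result["score"])
        if (best["match_ratio"] >= min_ratio
                and best["similarity"] >= min_similarity
                and best["evalue"] <= max_evalue):
            detected.append(pathogen)
    return sorted(detected)
```

Note that only the top-scoring read is tested; under this reading, a set whose best read fails any one check is rejected even if a lower-scoring read would have passed.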
6. The method of claim 1, further comprising:
performing statistical analysis, according to the final pathogen type identification result of the sample to be tested, on the sequencing read data for which a pathogen type has been identified, to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage, at the genus level and the species level, for each pathogen type present in the sample to be tested.
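The statistics named in claim 6 can be sketched under common conventions, which the claim itself does not define: RPM is taken here as reads per million total reads, the composition ratio as a pathogen's share of all pathogen-assigned reads, and reference genome coverage as the fraction of reference positions covered by at least one aligned read. All function names and the half-open, 0-based interval convention are assumptions.

```python
def rpm(read_count, total_reads):
    """Reads per million: pathogen reads scaled to one million total reads."""
    return read_count * 1_000_000 / total_reads


def composition_ratios(read_counts):
    """Share of each pathogen among all pathogen-assigned reads."""
    total = sum(read_counts.values())
    return {pathogen: count / total for pathogen, count in read_counts.items()}


def genome_coverage(intervals, genome_length):
    """Fraction of the reference covered by the union of aligned intervals.

    Intervals are (start, end) pairs, 0-based and half-open (assumed convention).
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        # Count only the part of this interval not already covered.
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length
```

For example, 50 pathogen reads out of 2,000,000 total reads corresponds to an RPM of 25 under this definition.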
7. The method of claim 6, further comprising:
obtaining annotation information of each pathogen type in the final pathogen type identification result, information of the sample to be tested, and target object information of the sample to be tested, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be tested, and the target object information.
8. A system for automated analysis of pathogen type, the system comprising:
a sequencing read data acquisition unit for acquiring at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection;
a data cleaning unit, configured to determine the data type of each sequencing read data in the at least one sequencing read data, and to perform data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy, to obtain at least one first sequencing read data with qualified quality;
a pathogen type identification unit, configured to perform primary identification of the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality using a first identification method and a second identification method respectively, to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data;
a second sequencing read data set acquisition unit, configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, acquire at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to acquire second sequencing read data sets corresponding to different pathogen types;
a comparison result obtaining unit, configured to compare, for any one second sequencing read data set, each second sequencing read data in that set with the pathogen reference sequence of the pathogen type corresponding to that set, and to obtain a comparison result corresponding to each second sequencing read data in that set;
and a final pathogen type identification result determining unit, configured to determine the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
9. The system of claim 8, wherein the data cleaning unit performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy, to obtain at least one first sequencing read data with qualified quality, comprises:
determining, using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
10. The system of claim 8, wherein the second sequencing read data set obtaining unit selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and comprises:
for any one first sequencing read data, if the first pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking that first sequencing read data as one second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first matching length ratio, namely the ratio of the length of the portion of that first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to a preset first ratio threshold, taking that first sequencing read data as one second sequencing read data.
11. The system of claim 8, wherein the comparison result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
12. The system according to claim 11, wherein the final pathogen type identification result determining unit determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:
for any second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected value threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to exist in the sample to be tested.
13. The system of claim 8, further comprising:
a statistical analysis unit, configured to perform statistical analysis, according to the final pathogen type identification result of the sample to be tested, on the sequencing read data for which a pathogen type has been identified, to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage, at the genus level and the species level, for each pathogen type present in the sample to be tested.
14. The system of claim 8, further comprising:
a detection report generating unit, configured to acquire annotation information of each pathogen type in the final pathogen type identification result, information of the sample to be tested, and target object information of the sample to be tested, and to generate a detection report according to the statistical analysis result, the annotation information, the information of the sample to be tested, and the target object information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110331835.5A CN113096737B (en) | 2021-03-26 | 2021-03-26 | Method and system for automatically analyzing pathogen type |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110331835.5A CN113096737B (en) | 2021-03-26 | 2021-03-26 | Method and system for automatically analyzing pathogen type |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113096737A (en) | 2021-07-09 |
CN113096737B (en) | 2023-10-31 |
Family
ID=76670200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110331835.5A Active CN113096737B (en) | 2021-03-26 | 2021-03-26 | Method and system for automatically analyzing pathogen type |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096737B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539369A (en) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
CN114464253A (en) * | 2022-03-03 | 2022-05-10 | 予果生物科技(北京)有限公司 | Method, system and application for real-time pathogen detection based on long read-length sequencing |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017053446A2 (en) * | 2015-09-21 | 2017-03-30 | The Regents Of The University Of California | Pathogen detection using next generation sequencing |
WO2018126033A1 (en) * | 2016-12-28 | 2018-07-05 | Ascus Biosciences, Inc. | Methods, apparatuses, and systems for analyzing microorganism strains in complex heterogeneous communities, determining functional relationships and interactions thereof, and diagnostics and biostate management based thereon |
CN108334750A (en) * | 2018-04-19 | 2018-07-27 | 江苏先声医学诊断有限公司 | A metagenomic data analysis method and system |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017053446A2 (en) * | 2015-09-21 | 2017-03-30 | The Regents Of The University Of California | Pathogen detection using next generation sequencing |
WO2018126033A1 (en) * | 2016-12-28 | 2018-07-05 | Ascus Biosciences, Inc. | Methods, apparatuses, and systems for analyzing microorganism strains in complex heterogeneous communities, determining functional relationships and interactions thereof, and diagnostics and biostate management based thereon |
CN110392738A (en) * | 2016-12-28 | 2019-10-29 | 埃斯库斯生物科技股份公司 | For being analyzed the microorganism strain in complex heterogeneous group, being determined its functional relationship and interaction and determine the method, apparatus and system of diagnosis and biological aspect management based on this |
CN108334750A (en) * | 2018-04-19 | 2018-07-27 | 江苏先声医学诊断有限公司 | A metagenomic data analysis method and system |
Non-Patent Citations (1)
Title |
---|
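李林海; 陈丽丹; 肖斌; 孙朝晖: "Application of metagenomic sequencing in pathogen detection for infectious diseases" (宏基因组测序在感染性疾病病原体检测中的应用), 传染病信息 (Infectious Disease Information), no. 01 * |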
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539369A (en) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
CN113539369B (en) * | 2021-07-14 | 2022-03-25 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
WO2023283967A1 (en) * | 2021-07-14 | 2023-01-19 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
CN114464253A (en) * | 2022-03-03 | 2022-05-10 | 予果生物科技(北京)有限公司 | Method, system and application for real-time pathogen detection based on long read-length sequencing |
CN114464253B (en) * | 2022-03-03 | 2023-03-10 | 予果生物科技(北京)有限公司 | Method, system and application for real-time pathogen detection based on long-read-length sequencing |
Also Published As
Publication number | Publication date |
---|---|
CN113096737B (en) | 2023-10-31 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109767810B (en) | High-throughput sequencing data analysis method and device | |
CN113096736B (en) | Virus real-time automatic analysis method and system based on nanopore sequencing | |
CN104302781B (en) | A kind of method and device detecting chromosomal structural abnormality | |
CN106033502B (en) | The method and apparatus for identifying virus | |
CN113096737B (en) | Method and system for automatically analyzing pathogen type | |
CN103617256A (en) | Method and device for processing file needing mutation detection | |
CN104700033A (en) | Virus detection method and virus detection device | |
CN107944228B (en) | Visualization method for gene sequencing variation site | |
CN112289376B (en) | Method and device for detecting somatic cell mutation | |
CN112397151A (en) | Methylation marker screening and evaluating method and device based on target capture sequencing | |
CN111326212A (en) | Detection method of structural variation | |
CN112669903A (en) | HLA typing method and device based on Sanger sequencing | |
CN110556163A (en) | Analysis method of long-chain non-coding RNA translation small peptide based on translation group | |
CN114038507A (en) | Prediction method, training method of prediction model and related device | |
CN110970093B (en) | Method and device for screening primer design template and application | |
WO2014083018A1 (en) | Method and system for processing data for evaluating a quality level of a dataset | |
JP6356015B2 (en) | Gene expression information analyzing apparatus, gene expression information analyzing method, and program | |
CN114530200B (en) | Mixed sample identification method based on calculation of SNP entropy | |
CN106156539A (en) | The method and apparatus analyzing the immunity difference of individual two class states | |
CN108595914A (en) | One grows tobacco mitochondrial RNA (mt RNA) editing sites high-precision forecasting method | |
JP2013505012A5 (en) | ||
CN116646010B (en) | Human virus detection method and device, equipment and storage medium | |
RU2011117576A (en) | METHOD FOR DETERMINING CERTAIN INDICATOR FOR DISTINCTIVE FEATURES OBTAINED FROM CLINICAL DATA, AND APPLYING THE CERTIFICATE INDICATOR FOR PREFERENCE OF ONE DISTINCTIVE FEATURE OVER OTHER | |
CN117393171B (en) | Method and system for constructing prediction model of LARS development track after rectal cancer operation | |
CN116168761B (en) | Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||