
CN113096737A - Method and system for automatically analyzing pathogen types

Info

Publication number: CN113096737A
Application number: CN202110331835.5A
Authority: CN
Inventors: 杜鹏程; 余乐; 刘树青
Original assignee: Beijing Yuansheng Kangtai Gene Technology Co ltd
Current assignee: Beijing Yuansheng Kangtai Gene Technology Co ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2021-07-09
Anticipated expiration: 2041-03-26
Also published as: CN113096737B

Abstract

The invention discloses a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type determination by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are compared with the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology, reduces the computational load of accurate alignment, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the requirement of rapid pathogen detection and analysis. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.

Description

Method and system for automatically analyzing pathogen types

Technical Field

The present invention relates to the field of information technology, and more particularly, to a method and system for automated analysis of pathogen types.

Background

A pathogen identification method based on high-throughput sequencing data generally comprises the following steps. The sample undergoes experimental steps such as pretreatment, nucleic acid extraction and sequencing library construction, and a high-throughput sequencer determines the nucleic acid sequences in the sample. Once the nucleic acid sequence data generated by the sequencer are obtained, sequence alignment software compares the sequences against a pathogen sequence database, and the alignment results are screened under given filtering conditions to obtain credible results. The number of sequences in the sample's nucleic acid sequence data that belong to pathogens, their proportion among all sequences, and similar measures are then calculated, and the presence of pathogens in the tested sample is finally judged.

Metagenome and metatranscriptome sequencing produce large volumes of data: clinical pathogen detection based on second-generation sequencing technology generally yields more than 1.5 Gb of sequencing data, and clinical pathogen detection based on third-generation long-read sequencing can yield even more. Therefore, to improve the timeliness and accuracy of clinical pathogen detection, a method and a system for rapidly aligning and analyzing metagenome or metatranscriptome sequencing data must be established.

Currently, commonly used sequence alignment software, including bwa, bowtie 2, SOAP and BLAST, is widely used in the analysis of second-generation sequencing data. The technical scheme of Chinese patent CN 108334750B, "Metagenome data analysis method and system", uses a k-mer algorithm to obtain a primary identification result and then uses the BLAST algorithm to perform secondary species identification on the sequences in the primary identification set; when the identification results of more than 50% of the sequences in the verification sequence set are consistent with the primary species identification result, the primary species identification result is considered verified and the species is reported as detected. The technical scheme of Chinese patent application CN 109923217A, "Identification of pathogens and antibiotic characterization in metagenome samples", uses the bwa algorithm to align sequencing data against a pathogen sequence database. Fast short-sequence alignment algorithms represented by bwa and bowtie 2 are built on the Burrows-Wheeler transform (BWT) and are optimized mainly for second-generation sequencing data to increase alignment speed. The BLAST algorithm is a sequence alignment algorithm based on local alignment; it uses short-segment matching and an effective statistical model to find the optimal local alignment between a target sequence and a database.

However, how to realize automatic pathogen type analysis based on the third generation sequencing data is still a problem to be solved.

Disclosure of Invention

The invention provides a method and a system for automatically analyzing pathogen types, which are used for solving the problem of how to quickly and accurately determine the pathogen types in a sample to be detected on the basis of third-generation sequencing data.

In order to solve the above-mentioned problems, according to an aspect of the present invention, there is provided a method for automatically analyzing pathogen types, the method comprising:

obtaining at least one sequencing read data which is obtained by detecting a nucleic acid sequence and corresponds to a sample to be detected;

determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;

performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;

selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;

for any one second sequencing read data set, comparing each second sequencing read data in the any one second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set;

and determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, the performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:

determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;

and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.

Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data includes:

for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;

for any one first sequencing read data, if the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio, i.e., the ratio of the length of that first sequencing read data matching the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to a preset first ratio threshold, taking that first sequencing read data as one second sequencing read data.

Preferably, the comparison result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.

Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:

for any one second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfy the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be detected;

and determining the final pathogen type identification result according to all the pathogen types determined to exist in the sample to be detected.

Preferably, the method further comprises:

performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.

Preferably, the method further comprises:

and obtaining annotation information of each pathogen type, information of a sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.

According to another aspect of the present invention, there is provided a system for automated analysis of pathogen type, the system comprising:

a sequencing read data acquisition unit for acquiring at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection;

the data cleaning unit is used for determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;

the pathogen type identification unit is used for carrying out primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;

a second sequencing read data set acquisition unit, configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, acquire at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to acquire second sequencing read data sets corresponding to different pathogen types;

a comparison result obtaining unit, configured to compare, for any one of the second sequencing read data sets, each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain a comparison result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets;

and the final pathogen type identification result determining unit is used for determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, the data cleaning unit performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:

Preferably, the second sequencing read data set obtaining unit selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:

Preferably, the determining unit of the final pathogen type identification result determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:

Preferably, the system further comprises:

the statistical analysis unit is used for performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.

Preferably, the system further comprises:

and the detection report generating unit is used for acquiring annotation information of each pathogen type, information of the sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.

The invention provides a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type determination by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are compared with the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology and reduces the computational load of accurate alignment. It thereby addresses the difficulty of achieving both accuracy and analysis speed when aligning long reads and sequencing reads with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical requirement for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.

Drawings

A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:

FIG. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the present invention;

FIG. 2 is a schematic diagram for automated pathogen type analysis according to an embodiment of the present invention;

FIG. 3 is an exemplary diagram of a detection report according to an embodiment of the present invention;

FIG. 4 is a flow diagram of automated pathogen type analysis according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a system 500 for automated pathogen type analysis according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that this disclosure will be thorough and complete and will fully convey the scope of the present invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.

Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.

Fig. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the invention. As shown in fig. 1, the method provided by the embodiment of the present invention can effectively process long-read data generated by third-generation sequencing technology and reduces the computational load of accurate alignment. It thereby addresses the difficulty of achieving both accuracy and analysis speed when aligning long reads and sequencing reads with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical requirement for rapid pathogen detection and analysis of such data. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility. The method 100 starts with step 101, in which at least one sequencing read data corresponding to a sample to be detected and obtained by nucleic acid sequence detection is acquired.

The method can analyze third-generation long-read metagenome or metatranscriptome sequencing data corresponding to the sample to be detected, thereby determining the pathogen type. The principle of pathogen type identification is shown in fig. 2. Referring to fig. 2, in the present invention, the sequencing reads obtained by sequencing the nucleic acid of a sample with a sequencer are read and stored in FASTQ format, the internationally used standard format for sequencing data, to obtain a FASTQ data file; the sequencing read data can also be compressed with gzip to reduce storage usage, so as to obtain the fastq. When there are a plurality of samples to be detected, the method can identify the sample barcode information corresponding to each sample, split and identify the sequencing read data accordingly, and merge a plurality of FASTQ data files or compressed FASTQ. The sequencing reads can be second-generation data acquired by a second-generation sequencer and/or third-generation data acquired by a third-generation sequencer.
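The FASTQ input described above can be read with a few lines of code. The following is a minimal sketch, not the patented implementation: the names `FastqRecord`, `parse_fastq` and `open_fastq` are hypothetical, and it assumes the common four-line FASTQ layout (header, sequence, separator, quality) without line-wrapped sequences.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(it, None)
        separator = next(it, None)  # the '+' line, otherwise ignored
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        sequence, quality = sequence.rstrip("\n"), quality.rstrip("\n")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        name = header[1:].split()
        yield FastqRecord(name[0] if name else "", sequence, quality)


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Because `parse_fastq` accepts any iterable of lines, the same function serves plain and gzip-compressed files: `with open_fastq(path) as handle: records = list(parse_fastq(handle))`.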

In step 102, a data type of each sequencing read data in the at least one sequencing read data is determined, and data washing is performed on each sequencing read data in the at least one sequencing read data according to the data type and a preset data washing strategy, so as to obtain at least one first sequencing read data with qualified quality.

In the present invention, as shown in fig. 2, the data types of the sequencing read data include second-generation short-read data (i.e., second-generation sequencing data) and third-generation long-read data (i.e., third-generation sequencing data). Sequencing read data of different data types are cleaned according to the preset data cleaning strategy, with low-quality data filtered out using different quality control software and parameters.

In the invention, NanoFilt software is used for quality control detection and filtering of the third-generation sequencing read data file; the input file is an original FASTQ format file or a compressed FASTQ. The second sequencing read data quality standard is set according to actual conditions; for example, it may be set to Q ≥ 10, i.e., the error rate of every base in the sequencing read is less than or equal to 10%. The Q value is calculated from the base error rate according to the formula $Q = -10 \times \log_{10} P$, where Q is the quality value and P is the error rate of a given base.

FastQC software is used for quality control detection and filtering of the second-generation sequencing read data file; the input file is an original FASTQ format file or FASTQ. The first sequencing read data quality standard can be set according to actual conditions; for example, the error rate of every base in the first sequencing read may be required to be less than or equal to 0.1% (i.e., Q ≥ 30).
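The relationship between the quality value Q and the base error rate P, and the two example thresholds (Q ≥ 10 for third-generation data, Q ≥ 30 for second-generation data), can be illustrated as follows. This is a sketch under stated assumptions rather than a reproduction of what the quality control tools do internally: it assumes Phred+33 quality encoding and applies the threshold to every base, as the text describes; the function and dictionary names are hypothetical.

```python
import math
from typing import List

# Example thresholds from the text: Q >= 10 for third-generation reads,
# Q >= 30 for second-generation reads.
MIN_Q_BY_DATA_TYPE = {"third_generation": 10, "second_generation": 30}


def phred_to_error(q: float) -> float:
    """Invert Q = -10 * log10(P) to obtain the base error rate P."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Compute Q = -10 * log10(P) from the base error rate P."""
    return -10 * math.log10(p)


def base_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_quality(quality_string: str, data_type: str) -> bool:
    """True if every base meets the minimum Q for the given data type."""
    min_q = MIN_Q_BY_DATA_TYPE[data_type]
    quals = base_qualities(quality_string)
    return bool(quals) and min(quals) >= min_q
```

Whether a real pipeline applies the threshold per base or to the read average is a design choice; the per-base check here simply follows the wording of the description.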

In addition, the at least one high-quality first sequencing read data that pass data quality control can be extracted and stored in fastq.

In step 103, a first identification method and a second identification method are respectively used for performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality, so as to respectively obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data.

Referring to fig. 2, in the present invention the first identification method is Kraken2 and the second identification method is Minimap2; the two different algorithms are each used to identify the pathogen type. The Kraken2 algorithm identifies bacteria, fungi, viruses and parasites based on species-specific k-mer sequences. Its input is the high-quality sequence FASTQ.gz format file obtained in step 102, and its output is the species discrimination result corresponding to each sequencing read data, namely the first pathogen type identification result.
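To make the per-read output of the first identification method concrete, the sketch below parses one line of Kraken2's standard tab-delimited output (classification status, read ID, taxonomy ID, read length, k-mer mapping list). It assumes the default output format with numeric taxonomy IDs (i.e., without the option that replaces IDs with names); `KrakenCall` and `parse_kraken2_line` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class KrakenCall(NamedTuple):
    read_id: str
    classified: bool
    taxid: int  # 0 when the read is unclassified


def parse_kraken2_line(line: str) -> KrakenCall:
    """Parse one line of Kraken2 standard output (default format assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"unexpected Kraken2 line: {line!r}")
    status, read_id, taxid = fields[0], fields[1], fields[2]
    if status not in ("C", "U"):
        raise ValueError(f"unknown classification status: {status!r}")
    return KrakenCall(read_id, status == "C", int(taxid))
```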

The Minimap2 algorithm is used for complementary alignment of pathogens with large genomic variation, such as viruses. It identifies the pathogen type based on minimizer (the seed with the smallest hash value within a stretch of sequence) hash table lookup, a chaining algorithm and a dynamic programming algorithm; input is a high-quality sequence FASTQ. The second pathogen type identification result comprises: the pathogen type and the alignment score of each sequencing read data against the pathogen reference sequence.
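The idea of a minimizer can be shown with a toy example: within every window of w consecutive k-mers, the smallest k-mer is kept as a seed. The sketch below is illustrative only and is not Minimap2's implementation; as a simplifying assumption it orders k-mers lexicographically instead of by hash value and ignores the reverse strand. The function name `minimizers` is hypothetical.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for the smallest k-mer in each window
    of w consecutive k-mers, without repeating consecutive picks."""
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Ties are broken by the leftmost position in the window.
        best = min(window, key=lambda i: (kmers[i], i))
        if not picked or picked[-1][0] != best:
            picked.append((best, kmers[best]))
    return picked
```

Two sequences that share a long enough exact stretch select the same minimizers inside it, which is why a hash table of minimizers from the reference sequences can be used to find candidate matches for a read quickly.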

In step 104, selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.

In the present invention, the preset score threshold may be set to 50, and the first ratio threshold may be set to 50%.

Referring to fig. 2, in the present invention, for the first pathogen type identification result obtained based on the Kraken2 algorithm, for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, that is, the first pathogen type identification result is not empty or unknown, the any one of the first sequencing read data is used as a second sequencing read data.

For the second pathogen type identification result obtained based on the Minimap2 algorithm, any one first sequencing read data is taken as a second sequencing read data if all of the following hold: the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set; the first alignment score (the MAPQ value) between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to the preset score threshold of 50; and the first length matching ratio Mi, determined from the CIGAR string as the ratio of the length of that first sequencing read data matching the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to the preset first ratio threshold of 50%.
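The two selection rules of step 104 can be summarized in a short sketch. It assumes the example thresholds given in the text (score threshold 50, first ratio threshold 50%); the `PrimaryCalls` structure and the choice to consult the Kraken2 result before the Minimap2 result are assumptions made for illustration, since the description does not state an order of precedence.

```python
from dataclasses import dataclass
from typing import Optional, Set

SCORE_THRESHOLD = 50    # preset score threshold (MAPQ) from the text
RATIO_THRESHOLD = 0.50  # preset first ratio threshold (50%) from the text


@dataclass
class PrimaryCalls:
    kraken_taxon: Optional[str]   # None when Kraken2 left the read unclassified
    minimap_taxon: Optional[str]  # None when Minimap2 reported no alignment
    mapq: int = 0                 # first alignment score
    match_ratio: float = 0.0      # first length matching ratio, as a fraction


def select_taxon(calls: PrimaryCalls, known: Set[str]) -> Optional[str]:
    """Return the pathogen type under which the read is kept as second
    sequencing read data, or None if the read is discarded."""
    if calls.kraken_taxon in known:
        return calls.kraken_taxon
    if (calls.minimap_taxon in known
            and calls.mapq >= SCORE_THRESHOLD
            and calls.match_ratio >= RATIO_THRESHOLD):
        return calls.minimap_taxon
    return None
```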

In the present invention, the second sequencing read data is a read suspected of being a pathogen.

In the present invention, the first length matching ratio Mi of a given sequencing read data is calculated as follows:

$$\mathrm{Mi} = \frac{\sum_{i=1}^{n} m_i}{L} \times 100\%$$

wherein n is the total number of fragments in the sequencing read data that match the corresponding pathogen reference sequence, m_i is the length of the i-th matching fragment, and L is the length of the sequencing read data.
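A possible way to derive this ratio from a CIGAR string is sketched below. The mapping from CIGAR operations to "matching fragments" is an assumption: the sketch counts `M`, `=` and `X` operations as matched fragments, and takes the read length as the sum of all operations that account for read bases, including soft-clipped (`S`) and hard-clipped (`H`) bases. The function name `match_length_ratio` is hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_MATCH_OPS = {"M", "=", "X"}                # aligned (matching) fragments
_READ_OPS = {"M", "I", "S", "H", "=", "X"}  # operations that cover read bases


def match_length_ratio(cigar: str) -> float:
    """Return (sum of matched fragment lengths) / (total read length)."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if not ops:
        raise ValueError(f"empty or invalid CIGAR string: {cigar!r}")
    matched = sum(n for n, op in ops if op in _MATCH_OPS)
    read_len = sum(n for n, op in ops if op in _READ_OPS)
    return matched / read_len if read_len else 0.0
```

For example, a read with CIGAR `10M2I5M3S` has two matched fragments of lengths 10 and 5 and a total length of 20, so the ratio is 0.75 and the read would pass a 50% threshold.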

In the present invention, after the at least one second sequencing read data suspected of being a pathogen is determined, the at least one second sequencing read data is further classified according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types. Data splitting is carried out according to the identified or aligned pathogen species: the second sequencing read data are extracted from the input FASTQ. Because the identification is based on alignment with reference sequences of known species, every preliminary identification is a known species; the splitting process stores the reads of each species as a separate file in FASTA format, which serves as the input of the subsequent Basic Local Alignment Search Tool (BLAST) alignment.
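The splitting step can be sketched as grouping the selected reads by pathogen type and rendering each group as FASTA text. The function names and the input layout (triples of read ID, sequence and pathogen type) are assumptions for illustration; writing each group to its own file is left to the caller.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_taxon(
    reads: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (read_id, sequence, taxon) triples by taxon."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for read_id, sequence, taxon in reads:
        groups[taxon].append((read_id, sequence))
    return dict(groups)


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render (read_id, sequence) pairs as FASTA text, wrapping sequences."""
    lines: List[str] = []
    for read_id, sequence in records:
        lines.append(f">{read_id}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)
```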

In step 105, for any one second sequencing read data set, each second sequencing read data in the any one second sequencing read data set is compared with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set, and a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set is obtained.

Referring to fig. 2, in the present invention, each second sequencing read data in the FASTA format files of suspected pathogen reads, split by pathogen type in step 104, is accurately aligned against the reference sequence of the corresponding species using the Basic Local Alignment Search Tool (BLAST), so as to obtain the comparison result corresponding to each second sequencing read data in each second sequencing read data set. The comparison result corresponding to each second sequencing read data comprises the second alignment score, the second matching length ratio, the sequence similarity and the alignment expected value between the second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.

In step 106, the final pathogen type identification result of the sample to be detected is determined according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

In the present invention, the second ratio threshold is 80%, the sequence similarity threshold is 90%, and the alignment expected threshold is 10^-5. For any one second sequencing read data set, the second sequencing read data with the highest second alignment score in that set is selected as the target sequencing read data. It is then judged whether the target sequencing read data simultaneously satisfy the following conditions: the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of 10^-5. If so, it is determined that the pathogen type corresponding to the target sequencing read data exists in the sample to be detected. Finally, the results are summarized, and the final pathogen type identification result is determined according to all the pathogen types determined to exist in the sample to be detected. The final pathogen type identification result includes: the pathogen types and the sequencing read data corresponding to each pathogen type.
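The verification rule of step 106 (take the best-scoring read per pathogen type, then apply three thresholds to it) can be expressed as a short sketch. The `BlastHit` structure is a hypothetical container for values that would be derived from the BLAST output; the thresholds are the example values from the text, with the matching length ratio stored as a fraction and the sequence similarity as a percentage.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

MIN_MATCH_RATIO = 0.80  # preset second ratio threshold (80%)
MIN_IDENTITY = 90.0     # preset sequence similarity threshold (90%)
MAX_EVALUE = 1e-5       # preset alignment expected threshold


@dataclass
class BlastHit:
    read_id: str
    score: float        # second alignment score
    match_ratio: float  # second matching length ratio, as a fraction
    identity: float     # sequence similarity, in percent
    evalue: float       # alignment expected value


def taxon_is_present(hits: Iterable[BlastHit]) -> bool:
    """Apply the three thresholds to the highest-scoring read of one set."""
    hits = list(hits)
    if not hits:
        return False
    best = max(hits, key=lambda h: h.score)
    return (best.match_ratio >= MIN_MATCH_RATIO
            and best.identity >= MIN_IDENTITY
            and best.evalue <= MAX_EVALUE)


def final_identification(hits_by_taxon: Dict[str, List[BlastHit]]) -> List[str]:
    """Return the sorted pathogen types judged present in the sample."""
    return sorted(t for t, hits in hits_by_taxon.items() if taxon_is_present(hits))
```

Checking only the best-scoring read keeps the decision cheap: one convincing alignment per pathogen type is enough to report it, while sets whose best read fails any threshold are dropped.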

Preferably, the method further comprises:

performing statistical analysis, by species, on the sequencing read data with identified pathogen types obtained in step 106, to determine the pathogen types in the sample to be detected and, at the genus level and species level, the number of reads Ni, the composition ratio Pi, the RPMi value, the reference genome coverage and the like. Pathogen types include: bacteria, fungi, viruses and parasites.

In the present invention, the composition ratio Pi of each pathogen type in the sample to be detected is calculated from the number of sequencing read data of each pathogen type in the following manner:

Pi = Ni / (N1 + N2 + … + Nn),

wherein Pi is the composition ratio of the ith pathogen type; n is the total number of pathogen types detected in the sample; and Ni is the number of reads of the ith pathogen type.
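A short Python sketch of the composition ratio follows. It assumes, based on the symbol definitions above, that the ratio of one pathogen type is its read count divided by the summed read counts of all n detected pathogen types; the function name and the dictionary input are hypothetical.

```python
from typing import Dict


def composition_ratios(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Map each pathogen type to Ni divided by the sum of all Ni (assumed form)."""
    total = sum(read_counts.values())
    if total == 0:
        # No pathogen reads at all: report zero for every listed type.
        return {name: 0.0 for name in read_counts}
    return {name: count / total for name, count in read_counts.items()}
```

For example, read counts of 30 and 70 for two pathogen types give ratios of 0.3 and 0.7, which sum to 1.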

The RPMi value of each pathogen type is calculated from the number of sequencing reads of each pathogen type and the total number of sequencing reads that pass quality control in the following manner:

RPMi = Ni / Nt × 10^6,

wherein RPMi is the RPM value of the ith pathogen type; Ni is the number of reads of the ith pathogen type; and Nt is the total number of sequencing reads in the sample that pass quality control.
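The RPM formula above translates directly into code. The sketch below is illustrative; the function name and the error handling for a non-positive total are assumptions.

```python
def reads_per_million(pathogen_reads: int, total_qc_reads: int) -> float:
    """RPMi = Ni / Nt * 10^6, with Nt the number of reads passing quality control."""
    if total_qc_reads <= 0:
        raise ValueError("total_qc_reads must be positive")
    # Multiplying before dividing is algebraically identical and avoids
    # a small rounding error from the intermediate fraction.
    return pathogen_reads * 1_000_000 / total_qc_reads
```

For instance, 50 pathogen reads out of 200,000 quality-controlled reads correspond to an RPM of 250. Normalizing to one million reads makes values comparable between samples sequenced to different depths.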

The method can further supplement the statistical information on the pathogens in the sample to be detected with pathogen annotation information, sample information, target object information and the like, so as to generate a detection report. The annotation information includes: the Chinese name and Latin name of the pathogen type, whether the genome is DNA or RNA, a textual description of pathogenicity characteristics, and the like. The sample information includes: sample type, volume, sampling time, and the like. The target object information, i.e., patient information, includes: name, age, sex, disease diagnosis, and the like. The detection report is shown in fig. 3.

Fig. 4 is a flowchart of automatic pathogen type analysis according to an embodiment of the present invention, including: defining a data storage directory and a data type; reading the fastq or fastq.gz files under the directory; judging the data type; for third-generation sequencing read data, performing data quality control analysis using NanoFilt and retaining the sequencing read data whose quality satisfies Q10; for second-generation sequencing read data, performing data quality control analysis using FastQC and retaining the sequencing read data whose quality satisfies Q30; storing the retained sequencing read data in a past.fastq.gz file; performing species analysis based on Kraken2 and retaining the sequencing read data of known pathogens; performing species analysis based on Minimap2 and retaining the sequencing read data whose first alignment score (MAPQ value) is greater than or equal to 50; storing the retained sequencing read data in a taxi.fasta file according to species; aligning against each species genome using BLAST to obtain second alignment scores, and retaining the sequencing read data with the highest second alignment score if it simultaneously satisfies that the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of 10^-5; and performing statistical analysis of pathogen types to generate the detection report.
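The data-type-dependent quality control step of the flow can be illustrated with a small Python sketch. It does not reproduce how NanoFilt or FastQC work internally. It assumes that "satisfying Q10" or "satisfying Q30" means a minimum mean read quality, that quality strings use the Phred+33 encoding, and that the mean is taken over per-base error probabilities; the data type labels and function names are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)

# Quality thresholds from the text; the keys are hypothetical labels.
MIN_QUALITY = {"third_generation": 10.0, "second_generation": 30.0}


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality, averaged over per-base error probabilities."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(reads: Iterable[Read], data_type: str) -> Iterator[Read]:
    """Yield only the reads whose mean quality meets the data type's threshold."""
    threshold = MIN_QUALITY[data_type]
    for name, seq, qual in reads:
        if mean_phred(qual) >= threshold:
            yield name, seq, qual
```

Selecting the threshold from a table keyed by data type keeps the same filtering code usable for both long-read and short-read input, which mirrors the single branching point in the flowchart.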

The invention establishes a pathogen alignment and identification method for third-generation long-read sequencing data, and can effectively process the long-read data generated by third-generation sequencing technologies such as Nanopore and PacBio. Compared with methods based on widely used alignment software such as BWA and Bowtie 2, the method addresses the difficulty of balancing accuracy and analysis speed when aligning sequencing read data with long read lengths and high error rates, reduces the analysis time for typical data (1 Gb of nanopore sequencing data) to within 30 minutes, and meets the clinical need for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained with third-generation sequencing technology. Meanwhile, the method can also analyze short-read data obtained by second-generation sequencing, and thus has good data compatibility.

Fig. 5 is a schematic diagram of a system 500 for automated pathogen type analysis, according to an embodiment of the present invention. As shown in fig. 5, the present invention provides a system 500 for automated pathogen type analysis, comprising: a sequencing read data acquisition unit 501, a data washing unit 502, a pathogen type identification unit 503, a second sequencing read data set acquisition unit 504, an alignment result acquisition unit 505, and a final pathogen type identification result determination unit 506.

Preferably, the sequencing read data obtaining unit 501 is configured to obtain at least one sequencing read data obtained by detecting a nucleic acid sequence corresponding to a sample to be detected.

Preferably, the data washing unit 502 is configured to determine the data type of each sequencing read data in the at least one sequencing read data, and to perform data washing on each sequencing read data according to the data type and a preset data washing strategy, so as to obtain at least one first sequencing read data of qualified quality.

Preferably, the data washing unit 502 performing data washing on each sequencing read data in the at least one sequencing read data according to the data type and a preset data washing strategy, to obtain at least one first sequencing read data of qualified quality, includes:

Preferably, the pathogen type identification unit 503 is configured to perform a preliminary pathogen type identification on each of the at least one qualified first sequencing read data by using a first identification method and a second identification method, respectively, so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each of the first sequencing read data, respectively.

Preferably, the second sequencing read data set obtaining unit 504 is configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, obtain at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.

Preferably, the second sequencing read data set obtaining unit 504 selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:

Preferably, the alignment result obtaining unit 505 is configured to, for any second sequencing read data set, align each second sequencing read data in that set with the pathogen reference sequence of the pathogen type corresponding to that set, and obtain the alignment result corresponding to each second sequencing read data in that set.

Preferably, the final pathogen type identification result determining unit 506 is configured to determine the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, the final pathogen type identification result determining unit 506 determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:

Preferably, wherein the system further comprises:

The system 500 for automatically analyzing pathogen types according to the embodiment of the present invention corresponds to the method 100 for automatically analyzing pathogen types according to another embodiment of the present invention, and will not be described herein again.
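The correspondence between the six units of system 500 and the steps of the method can be pictured as a chain of stages. The Python sketch below is purely illustrative: the class name, the field names, and the idea of passing each unit's output directly to the next unit are assumptions, and each unit is a placeholder callable rather than an actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

Stage = Callable[[Any], Any]


@dataclass
class PathogenAnalysisSystem:
    """Hypothetical composition of units 501 to 506 as sequential stages."""
    acquire_reads: Stage         # sequencing read data acquisition unit 501
    wash_data: Stage             # data washing unit 502
    identify_preliminary: Stage  # pathogen type identification unit 503
    build_read_sets: Stage       # second sequencing read data set acquisition unit 504
    align_to_references: Stage   # alignment result acquisition unit 505
    determine_result: Stage      # final result determination unit 506

    def run(self, sample: Any) -> Any:
        """Pass the sample through every unit in order and return the final result."""
        data = sample
        for stage in (self.acquire_reads, self.wash_data,
                      self.identify_preliminary, self.build_read_sets,
                      self.align_to_references, self.determine_result):
            data = stage(data)
        return data
```

Modelling each unit as an interchangeable callable reflects the statement that the embodiments may be realized in hardware, software, or a combination of both.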

The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. A method for automated analysis of pathogen type, the method comprising:

2. The method of claim 1, wherein performing data washing on each of the at least one sequencing read data according to the data type and a preset data washing strategy, to obtain at least one first sequencing read data of qualified quality, comprises:

3. The method of claim 1, wherein the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data comprises:

4. The method of claim 1, wherein the alignment result corresponding to each second sequencing read data comprises: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expected value between the second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.

5. The method of claim 4, wherein determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set comprises:

6. The method of claim 1, further comprising:

7. The method of claim 6, further comprising:

8. A system for automated analysis of pathogen type, the system comprising:

9. The system of claim 8, wherein the data washing unit performs data washing on each sequencing read data in the at least one sequencing read data according to a data type and according to a preset data washing strategy to obtain at least one first sequencing read data with qualified quality, and comprises:

10. The system of claim 8, wherein the second sequencing read data set obtaining unit selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and comprises:

11. The system of claim 8, wherein the alignment result corresponding to each second sequencing read data comprises: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expected value between the second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.

12. The system according to claim 11, wherein the final pathogen type identification result determining unit determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:

13. The system of claim 8, further comprising:

14. The system of claim 8, further comprising: