CN113096737B

CN113096737B - Method and system for automatically analyzing pathogen type

Info

Publication number: CN113096737B
Application number: CN202110331835.5A
Authority: CN
Inventors: 杜鹏程; 余乐; 刘树青
Original assignee: Beijing Yuansheng Kangtai Gene Technology Co ltd
Current assignee: Beijing Yuansheng Kangtai Gene Technology Co ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2023-10-31
Anticipated expiration: 2041-03-26
Also published as: CN113096737A

Abstract

The invention discloses a method and a system for automatically analyzing pathogen types. First, the pathogen types of the first sequencing read data of different data types from a sample to be tested are preliminarily identified by two different algorithms, and second sequencing read data are selected accordingly; the second sequencing read data of each identified pathogen type are then aligned with the corresponding pathogen reference sequences for verification and screening, so that the pathogen types in the sample to be tested are determined. The method can effectively process long-read data generated by third-generation sequencing technology, reduces the computational cost of accurate alignment, can complete accurate analysis and report generation for typical third-generation metagenomic or metatranscriptomic data within 30 minutes, and meets the requirement of rapid pathogen detection and analysis. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.

Description

Method and system for automatically analyzing pathogen type

Technical Field

The present invention relates to the field of information technology, and more particularly, to a method and system for automated analysis of pathogen types.

Background

Sequencing-based pathogen detection mainly comprises the following technical steps. The sample undergoes experimental steps such as pretreatment, nucleic acid extraction and sequencing library construction, and the nucleic acid sequences in the sample are determined with a high-throughput sequencer. After the nucleic acid sequence data output by the sequencer are obtained, the sequences are aligned against a pathogen sequence database with sequence alignment software, and the alignment results are filtered under defined conditions to obtain credible results. The number of pathogen sequences in the nucleic acid sequence data of the sample, the proportion of pathogen sequences among all sequences, and similar statistics are then calculated, and the presence of pathogens in the tested sample is finally determined.

Metagenomic or metatranscriptomic sequencing produces a large volume of data: clinical pathogen detection based on second-generation sequencing technology generally yields more than 1.5 Gb of sequencing data, and rapid detection of clinical samples based on third-generation long-read sequencing can yield even more. The main technical difficulty of current rapid detection and analysis is therefore to align the input data quickly while ensuring accuracy and minimizing the dependence on hardware performance. To improve the timeliness and accuracy of clinical pathogen detection, a rapid alignment and analysis method and system for metagenomic or metatranscriptomic sequencing data must be established.

Currently, commonly used sequence alignment software includes BWA, Bowtie 2, SOAP and BLAST, all of which have been widely used in the analysis of second-generation sequencing data. In the technical scheme of Chinese patent CN 108334750B (a metagenomic data analysis method and system), a k-mer algorithm is used to obtain a preliminary identification result, and the BLAST algorithm is used to perform secondary species identification on the sequences in the preliminary identification set; when the identification results of more than 50% of the sequences in the verification sequence set are consistent with the preliminary species identification result, the preliminary result is considered verified and the species is reported as detected. The technical scheme of Chinese patent application CN 109923217A (identifying pathogens and characterizing antibiotics in metagenomic samples) uses the BWA algorithm to align sequencing data against a pathogen sequence database. Fast short-read alignment algorithms represented by BWA and Bowtie 2 are built on the Burrows-Wheeler transform (BWT) and are optimized mainly for second-generation sequencing data to improve alignment speed. The BLAST algorithm is a sequence alignment algorithm based on local alignment; it uses short-segment matching and an effective statistical model to find the optimal local alignment between a target sequence and a database.

However, how to achieve automated analysis of pathogen types based on third generation sequencing data remains a problem to be solved.

Disclosure of Invention

The invention provides a method and a system for automatically analyzing pathogen types, which solve the problem of how to quickly and accurately determine the pathogen types in a sample to be tested based on third-generation sequencing data.

In order to solve the above problems, according to an aspect of the present invention, there is provided a method for automatically analyzing a pathogen type, the method comprising:

acquiring at least one sequencing read data obtained by detecting a nucleic acid sequence corresponding to a sample to be detected;

determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data according to the preset data cleaning strategy corresponding to its data type, so as to obtain at least one first sequencing read data of qualified quality;

performing preliminary identification of pathogen type on each first sequencing read data in the at least one qualified first sequencing read data by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;

selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;

for any one of the second sequencing read data sets, comparing each second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtaining a comparison result corresponding to each second sequencing read data in the any one of the second sequencing read data sets;

and determining a final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, performing data cleaning on each sequencing read data in the at least one sequencing read data according to the preset data cleaning strategy corresponding to its data type, so as to obtain at least one first sequencing read data of qualified quality, includes:

determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and selecting each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;

determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and selecting each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.

Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtaining at least one second sequencing read data includes:

for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as second sequencing read data;

for any one of the first sequencing read data, taking that first sequencing read data as one of the second sequencing read data if all of the following conditions hold: the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set; the first alignment score between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset scoring threshold; and the first length matching ratio, i.e., the ratio of the length of that first sequencing read data matched to the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to a preset first ratio threshold.

Preferably, wherein the comparison result comprises: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expectation for each second sequencing read data to a pathogen reference sequence corresponding to the identified pathogen type.

Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:

for any second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second match length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested;

and determining the final pathogen type identification result according to all pathogen types determined to be present in the sample to be tested.

Preferably, wherein the method further comprises:

carrying out statistical analysis on sequencing read data of the identified pathogen type according to the final pathogen type identification result of the sample to be tested to obtain a statistical analysis result; wherein the statistical analysis result includes: the pathogen types present in the test sample, the genus level and species level of each pathogen type, the number, composition ratio, RPM value and reference genome coverage of sequencing read data.

Preferably, wherein the method further comprises:

and acquiring annotation information, sample information to be detected and target object information of the sample to be detected of each pathogen type in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the sample information to be detected and the target object information.

According to another aspect of the present invention, there is provided a system for automatic analysis of pathogen types, the system comprising:

the sequencing read data acquisition unit is used for acquiring at least one sequencing read data obtained through nucleic acid sequence detection corresponding to the sample to be detected;

the data cleaning unit is used for determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data according to the preset data cleaning strategy corresponding to its data type, so as to obtain at least one first sequencing read data of qualified quality;

the pathogen type identification unit is used for carrying out preliminary identification of pathogen type on each first sequencing read data in the at least one first sequencing read data with qualified quality by utilizing a first identification method and a second identification method respectively so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;

The second sequencing read data set acquisition unit is used for selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data, acquiring at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type so as to acquire second sequencing read data sets corresponding to different pathogen types;

the comparison result obtaining unit is used for comparing each second sequencing read data in any second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to any second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in any second sequencing read data set;

and the final pathogen type identification result determining unit is used for determining a final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, the data cleaning unit performs data cleaning on each sequencing read data in the at least one sequencing read data according to a preset data cleaning policy according to a data type to obtain at least one qualified first sequencing read data, and includes:

Preferably, the second sequencing read data set obtaining unit performs selection of second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:

Preferably, the final pathogen type recognition result determining unit determines a final pathogen type recognition result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, including:

Preferably, wherein the system further comprises:

the statistical analysis unit is used for carrying out statistical analysis on the sequencing read data of the identified pathogen type according to the final pathogen type identification result of the sample to be tested, and obtaining a statistical analysis result; wherein the statistical analysis result includes: the pathogen types present in the test sample, the genus level and species level of each pathogen type, the number, composition ratio, RPM value and reference genome coverage of sequencing read data.

Preferably, wherein the system further comprises:

and the detection report generation unit is used for acquiring annotation information, sample information to be detected and target object information of the sample to be detected of each pathogen type in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the sample information to be detected and the target object information.

The invention provides a method and a system for automatically analyzing pathogen types. First, the pathogen types of the first sequencing read data of different data types from a sample to be tested are preliminarily identified by two different algorithms, and second sequencing read data are selected accordingly; the second sequencing read data of each identified pathogen type are then aligned with the corresponding pathogen reference sequences for verification and screening, so that the pathogen types in the sample to be tested are determined. The method can effectively process long-read data generated by third-generation sequencing technology, reduces the computational cost of accurate alignment, and resolves the difficulty of reconciling accuracy with analysis speed when aligning long reads that have a relatively high error rate. It can complete accurate analysis and report generation for typical third-generation metagenomic or metatranscriptomic data within 30 minutes, meeting the clinical requirement for rapid pathogen detection and analysis of metagenomic or metatranscriptomic data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.

Drawings

Exemplary embodiments of the present invention may be more completely understood in consideration of the following drawings:

FIG. 1 is a flow chart of a method 100 for automated analysis of pathogen types according to an embodiment of the invention;

FIG. 2 is a schematic diagram for automated analysis of pathogen types according to an embodiment of the invention;

FIG. 3 is an exemplary diagram of a detection report according to an embodiment of the present invention;

FIG. 4 is a flow chart of automated analysis of pathogen types according to an embodiment of the invention;

FIG. 5 is a schematic diagram of a system 500 for automated analysis of pathogen types according to an embodiment of the invention.

Detailed Description

The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.

Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.

FIG. 1 is a flow chart of a method 100 for automated analysis of pathogen types according to an embodiment of the invention. As shown in FIG. 1, the method provided by the embodiment of the invention can effectively process long-read data generated by third-generation sequencing technology, reduces the computational cost of accurate alignment, and resolves the difficulty of reconciling accuracy with analysis speed when aligning long reads that have a relatively high error rate. It can complete accurate analysis and report generation for typical third-generation metagenomic or metatranscriptomic data within 30 minutes, meeting the clinical requirement for rapid pathogen detection and analysis of such data. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility. The method 100 starts at step 101, in which at least one sequencing read data obtained by nucleic acid sequence detection of a sample to be tested is acquired.

The method can analyze third-generation long-read metagenomic or metatranscriptomic sequencing data corresponding to the sample to be tested, thereby determining the pathogen types. The principle of pathogen type identification is shown in FIG. 2. Referring to FIG. 2, the sequencing reads obtained by measuring the nucleic acid sequences of the sample with a sequencer are read and stored in FASTQ format, the standard format for sequencing data, to obtain a FASTQ data file; the sequencing read data may also be compressed with gzip to reduce storage usage, yielding a FASTQ.gz file. When there are multiple samples to be tested, the sample barcode corresponding to each sample can be identified so that the sequencing read data are split and assigned to their samples, and multiple FASTQ data files or compressed FASTQ.gz files from the same sample are merged using the cat or zcat command. The sequencing reads may be second-generation data acquired by a second-generation sequencer and/or third-generation data acquired by a third-generation sequencer.
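The merging of several compressed files from one sample can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the function names `merge_fastq_gz` and `demo` and all file names are hypothetical. The sketch relies on the fact that gzip files concatenated byte for byte form a valid multi-member gzip file, which is also why `cat` can be applied to FASTQ.gz files directly.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def merge_fastq_gz(parts, merged):
    """Concatenate the FASTQ.gz files of one sample, like `cat a.gz b.gz > merged.gz`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # raw byte copy, no recompression
    return merged


def demo():
    """Write two tiny FASTQ.gz files, merge them, and return the decompressed bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        records = [b"@read1\nACGT\n+\nIIII\n", b"@read2\nTTGA\n+\nIIII\n"]
        parts = []
        for index, record in enumerate(records):
            path = tmp / f"part{index}.fastq.gz"  # hypothetical file names
            with gzip.open(path, "wb") as handle:
                handle.write(record)
            parts.append(path)
        merged = merge_fastq_gz(parts, tmp / "sample.fastq.gz")
        with gzip.open(merged, "rb") as handle:
            return handle.read()
```

Calling `demo()` returns the two records in order, showing that the merged file decompresses as one continuous FASTQ stream.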

In step 102, the data type of each sequencing read data in the at least one sequencing read data is determined, and data cleaning is performed on each sequencing read data according to the preset data cleaning strategy corresponding to its data type, so as to obtain at least one first sequencing read data of qualified quality.

In the present invention, as shown in FIG. 2, the data types of the sequencing read data include second-generation short-read data (i.e., second-generation sequencing data) and third-generation long-read data (i.e., third-generation sequencing data). Sequencing read data of different data types are cleaned according to the preset data cleaning strategies, and low-quality data are filtered out using different data quality control software and parameters.

In the invention, NanoFilt software is used to perform quality control detection and filtering on third-generation sequencing read data files. The input file is an original FASTQ file or a compressed FASTQ.gz file; sequencing read data whose quality meets the second sequencing read data quality standard are retained, and a FASTQ.gz file of the high-quality data and a quality control report are output. The second sequencing read data quality standard is set according to the actual situation; for example, it may be set to Q ≥ 10, i.e., the error rate of the bases in a sequencing read must be no more than 10%. The Q value is calculated from the base error rate by the formula $Q = -10 \times \log_{10} P$, where Q is the quality value and P is the base error rate.
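The relationship between the quality value Q and the base error rate P can be checked with a short Python sketch. The function names are hypothetical, and two points are assumptions not stated in the text: quality characters are taken to be Phred+33 encoded, and the per-read quality is obtained by averaging the per-base error rates before converting back to Q.

```python
import math


def error_to_phred(p):
    """Quality value from base error rate: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def phred_to_error(q):
    """Base error rate from quality value: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def mean_read_quality(quality_string, offset=33):
    """Mean quality of one read (assumption: average the error rates, then convert)."""
    errors = [phred_to_error(ord(ch) - offset) for ch in quality_string]
    return error_to_phred(sum(errors) / len(errors))


def passes_quality(quality_string, min_q=10.0):
    """True if the read meets the quality standard, e.g. Q >= 10 for long reads."""
    return mean_read_quality(quality_string) >= min_q
```

With these definitions, an error rate of 10% corresponds to Q = 10 and an error rate of 0.1% corresponds to Q = 30, which matches the two example thresholds used for third-generation and second-generation data.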

FastQC software is used to perform quality control detection and filtering on second-generation sequencing read data files. The input file is an original FASTQ file or a FASTQ.gz file; sequencing read data whose average quality meets the first sequencing read data quality standard are retained, and a FASTQ.gz file of the high-quality data and a quality control report are output. The first sequencing read data quality standard can be set according to the actual situation; for example, the error rate of the bases in a first sequencing read may be required to be no more than 0.1% (i.e., Q ≥ 30).

In addition, the at least one high-quality first sequencing read data extracted by data quality control may be stored in FASTQ.gz format for subsequent analysis.

In step 103, the first identification method and the second identification method are used for carrying out preliminary identification of the pathogen type on each first sequencing read data in the at least one qualified first sequencing read data respectively, so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively.

Referring to FIG. 2, in the present invention the first identification method is Kraken2 and the second identification method is Minimap2, so that pathogen types are identified by two different algorithms. The Kraken2 algorithm identifies bacteria, fungi, viruses and parasites based on species-specific k-mer sequences; its input is the high-quality FASTQ.gz file obtained in step 102, and its output is the species classification result for each sequencing read data, i.e., the first pathogen type identification result.

The Minimap2 algorithm is used for supplementary alignment of pathogens with large genomic variation, such as viruses. It identifies pathogen types based on minimizer (the seed with the minimum hash value in a sequence) hash table lookup, a chaining algorithm and a dynamic programming algorithm. Its input is the high-quality FASTQ.gz file obtained in step 102, and its output is the SAM-format alignment result of each sequencing read data against the reference sequences, i.e., the second pathogen type identification result. The second pathogen type identification result includes the pathogen type and the alignment score of each sequencing read data against the pathogen reference sequence.

In step 104, selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, obtaining at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.

In the present invention, the preset scoring threshold may be set to 50, and the first ratio threshold may be set to 50%.

Referring to fig. 2, in the present invention, for the first pathogen type recognition result obtained based on the Kraken2 algorithm, for any one of the first sequencing read data, if the first pathogen type recognition result corresponding to any one of the first sequencing read data indicates that the recognized pathogen type belongs to a preset pathogen type set, that is, the first pathogen type recognition result is not null or unknown, the any one of the first sequencing read data is taken as one of the second sequencing read data.
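The selection rule for Kraken2 results can be sketched as follows. This is a minimal illustration, assuming Kraken2's default tab-separated per-read output (classification flag `C` or `U`, read ID, taxonomy ID, read length, k-mer mapping); the function name `select_kraken2_reads` and the example taxonomy IDs are hypothetical choices for the demonstration.

```python
def select_kraken2_reads(kraken_lines, pathogen_taxids):
    """Return {read_id: taxid} for reads classified into the preset pathogen set.

    Reads flagged "U" (unclassified) or assigned to a taxon outside the
    preset set are dropped, mirroring the "not null or unknown" condition.
    """
    selected = {}
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in pathogen_taxids:
            selected[read_id] = taxid
    return selected


# Illustrative input: one pathogen read, one unclassified read, one host read.
example_lines = [
    "C\tread1\t1280\t1500\t1280:100\n",
    "U\tread2\t0\t900\t0:50\n",
    "C\tread3\t9606\t1200\t9606:80\n",
]
example_selection = select_kraken2_reads(example_lines, {"1280", "562"})
```

Here only `read1` is kept, because it is both classified and assigned to a taxon in the preset set.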

For the second pathogen type identification result obtained by the Minimap2 algorithm, any first sequencing read data is taken as one of the second sequencing read data if all of the following conditions hold: the second pathogen type identification result corresponding to that read indicates that the identified pathogen type belongs to the preset pathogen type set; the first alignment score (MAPQ value) of the read against the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to the preset scoring threshold of 50; and the first length matching ratio Mi, calculated from the CIGAR string as the ratio of the length of the read matched to that pathogen reference sequence to the total length of the read, is greater than or equal to the preset first ratio threshold of 50%.

In the present invention, the second sequencing read is a read suspected of being a pathogen.

In the present invention, the first length matching ratio Mi of a sequencing read is calculated as follows:

$$\mathrm{Mi} = \frac{\sum_{i=1}^{n} m_i}{L} \times 100\%$$

where n is the total number of fragments of the sequencing read that are matched to the corresponding pathogen reference sequence, $m_i$ is the length of the i-th matched fragment, and L is the length of the sequencing read.
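The length matching ratio and the Minimap2 selection rule can be sketched in Python as follows. The function names are hypothetical, and which CIGAR operations count as matched fragments is an assumption made here (`M`, `=` and `X`); the text only states that the ratio is derived from the CIGAR string. The thresholds default to the values given in the description (MAPQ ≥ 50, ratio ≥ 50%), and the ratio is returned as a fraction.

```python
import re

# One CIGAR element is a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MATCH_OPS = {"M", "=", "X"}  # assumption: operations counted as matched fragments


def match_length_ratio(cigar, read_length):
    """Sum of matched fragment lengths divided by the read length L."""
    matched = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in MATCH_OPS)
    return matched / read_length


def keep_minimap2_read(mapq, cigar, read_length, in_pathogen_set,
                       min_mapq=50, min_ratio=0.5):
    """Apply the three conditions: preset pathogen set, MAPQ, length matching ratio."""
    return (
        in_pathogen_set
        and mapq >= min_mapq
        and match_length_ratio(cigar, read_length) >= min_ratio
    )
```

For example, a 100-base read with CIGAR `50M10I40M` has two matched fragments of 50 and 40 bases, giving a ratio of 0.9, whereas `70S30M` gives 0.3 and would be rejected.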

In the present invention, after the at least one second sequencing read data suspected of being a pathogen is determined, the second sequencing read data also need to be classified according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types. The second sequencing read data are extracted from the input FASTQ.gz file, and the reads belonging to the same species are saved into the same FASTA file. Because the preliminary identification is performed by alignment against reference sequences of known species, every identified read is assigned to a known species; the splitting process therefore consists of saving the reads of each species as a separate FASTA file, which serves as the input of the basic local alignment (BLAST) in the next step.
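The splitting of suspected pathogen reads by species can be sketched as follows. This is a simplified in-memory illustration with hypothetical function names and toy data; it assumes the read-to-species assignments and the read sequences are already available as dictionaries, and it returns FASTA text rather than writing files.

```python
from collections import defaultdict


def group_reads_by_species(assignments, sequences):
    """Group reads by identified species.

    assignments: {read_id: species}, sequences: {read_id: sequence}.
    Returns {species: [(read_id, sequence), ...]}.
    """
    groups = defaultdict(list)
    for read_id, species in assignments.items():
        if read_id in sequences:  # ignore assignments without sequence data
            groups[species].append((read_id, sequences[read_id]))
    return dict(groups)


def to_fasta(records):
    """Render (read_id, sequence) pairs as FASTA text for one species."""
    return "".join(f">{read_id}\n{seq}\n" for read_id, seq in records)


# Toy example: two reads of one species and one read of another.
example_groups = group_reads_by_species(
    {"r1": "species_A", "r2": "species_B", "r3": "species_A"},
    {"r1": "ACGT", "r2": "GGCC", "r3": "TTAA"},
)
```

Each value of `example_groups` can be passed to `to_fasta` to produce the per-species FASTA content used as input to the subsequent alignment.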

In step 105, for any one of the second sequencing read data sets, each of the second sequencing read data in the any one of the second sequencing read data sets is aligned with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and a corresponding alignment of each of the second sequencing read data in the any one of the second sequencing read data sets is obtained.

Referring to FIG. 2, in the present invention the Basic Local Alignment Search Tool (BLAST) is used to accurately align each second sequencing read data in the FASTA files of suspected pathogen reads, split by pathogen type in step 104, with the reference sequence of the corresponding species, and the alignment result corresponding to each second sequencing read data in each second sequencing read data set is obtained. The alignment result corresponding to each second sequencing read data comprises the second alignment score, the second match length ratio, the sequence similarity and the alignment expected value of the second sequencing read data against the pathogen reference sequence corresponding to the identified pathogen type.

In step 106, a final pathogen type identification result of the sample to be tested is determined according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

In the invention, the second ratio threshold is 80%, the sequence similarity threshold is 90%, and the alignment expected threshold is $10^{-5}$. For any second sequencing read data set, the second sequencing read data with the highest second alignment score in the set is selected as the target sequencing read data. It is then judged whether the target sequencing read data simultaneously satisfies the following conditions: the second match length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of $10^{-5}$. If so, it is determined that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested. Finally, the results are summarized, and the final pathogen type identification result is determined according to all pathogen types present in the sample to be tested. The final pathogen type identification result includes the pathogen types and the sequencing read data corresponding to each pathogen type.
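The verification rule of step 106 can be sketched in Python as follows. The `BlastHit` record and the function `species_present` are hypothetical names introduced for illustration; the sketch assumes the four alignment quantities have already been extracted from the BLAST output for each read, with ratios expressed as fractions, and uses the threshold values given in the description as defaults.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Alignment result of one second sequencing read (hypothetical record)."""
    read_id: str
    score: float        # second alignment score
    match_ratio: float  # second match length ratio, as a fraction
    identity: float     # sequence similarity, as a fraction
    evalue: float       # alignment expected value


def species_present(hits, min_ratio=0.80, min_identity=0.90, max_evalue=1e-5):
    """Judge one species from the best-scoring read of its read set."""
    if not hits:
        return False
    best = max(hits, key=lambda hit: hit.score)  # target sequencing read
    return (
        best.match_ratio >= min_ratio
        and best.identity >= min_identity
        and best.evalue <= max_evalue
    )
```

Note that only the highest-scoring read is tested: a set whose best read fails any one of the three conditions is rejected even if lower-scoring reads would pass.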

Preferably, wherein the method further comprises:

The method of the invention further comprises: performing statistical analysis, by species, on the sequencing read data obtained in step 106 for which the pathogen type has been identified, and determining, for the sample to be tested, the pathogen type, the number of reads Ni at the genus level and at the species level, the composition ratio Pi, the RPMi value, the coverage of the reference genome, and the like. Pathogen types include: bacteria, fungi, viruses and parasites.

In the present invention, the composition ratio Pi of each pathogen type in the sample to be tested is calculated from the number of sequencing read data of each pathogen type as follows:

where Pi is the composition ratio of the ith pathogen type, n is the total number of pathogen types detected in the sample, and Ni is the number of reads of the ith pathogen type.

The RPMi value of each pathogen type is calculated from the number of sequencing reads of each pathogen type and the total number of sequencing reads that pass quality control, as follows:

RPMi = Ni / Nt × 10^6,

where RPMi is the RPM value of the ith pathogen type, Ni is the number of reads of the ith pathogen type, and Nt is the total number of sequencing reads in the sample that pass quality control.
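The two statistics can be illustrated with the following minimal sketch. The RPM calculation follows the formula given in the text. The formula for the composition ratio does not appear in this text, so the sketch assumes that Pi is Ni divided by the sum of the read counts of all n detected pathogen types and returns it as a fraction; whether the original expresses it as a percentage is not known. The function names and the example counts are hypothetical.

```python
from typing import Dict


def composition_ratios(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed form: Pi = Ni / (N1 + ... + Nn), returned as a fraction."""
    total = sum(read_counts.values())
    if total == 0:
        return {name: 0.0 for name in read_counts}
    return {name: count / total for name, count in read_counts.items()}


def rpm_values(read_counts: Dict[str, int], total_qc_reads: int) -> Dict[str, float]:
    """RPMi = Ni / Nt * 10^6, with Nt the number of reads passing quality control."""
    if total_qc_reads <= 0:
        raise ValueError("total_qc_reads must be positive")
    return {
        name: count * 1_000_000 / total_qc_reads
        for name, count in read_counts.items()
    }


# Illustrative counts only.
counts = {"pathogen_A": 300, "pathogen_B": 100}
ratios = composition_ratios(counts)
rpm = rpm_values(counts, total_qc_reads=200_000)
```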

According to the method, pathogen annotation information, sample information, target object information and the like can be added to the statistical information on the pathogens in the sample to be tested, so as to generate a detection report. The annotation information includes: the Chinese name and Latin name of the pathogen type, descriptive text on the DNA or RNA genome, pathogenicity characteristics, and the like. The sample information includes: sample type, volume, sampling time, and the like. The target object information, i.e., patient information, includes: name, age, sex, disease diagnosis, and the like. The detection report is shown in fig. 3.

As shown in fig. 4, the flow of automatic analysis of pathogen types according to an embodiment of the invention includes: defining a data storage directory and data types; reading the fastq files or fastq.gz files under the directory; judging the data type; for third-generation sequencing read data, performing data quality control analysis using NanoFilt and retaining the sequencing read data whose quality meets Q10, and for second-generation sequencing read data, performing data quality control analysis using FastQC and retaining the sequencing read data whose quality meets Q30; storing the retained sequencing read data in a passed.fastq.gz file; performing species analysis based on Kraken2 and retaining the sequencing read data identified as known pathogens, and performing species analysis based on Minimap2 and retaining the sequencing read data whose first comparison score (MAPQ value) is greater than or equal to 50; storing the retained sequencing read data, by species taxid, in taxid.fasta files; using BLAST to compare the reads with the genome of each species to obtain a second comparison score, and retaining the data if the sequencing read data corresponding to the highest second comparison score simultaneously satisfies a second matching length ratio (alignment) greater than or equal to the preset second ratio threshold of 80%, a sequence similarity (identity) greater than or equal to the preset sequence similarity threshold of 90%, and a comparison expected value less than or equal to the preset comparison expected threshold of 10^-5; and performing statistical analysis of pathogen types and generating the detection report.
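The quality control step of this flow, in which the threshold depends on the data type, can be illustrated with the following minimal sketch. It does not call NanoFilt or FastQC; it only shows the kind of per-read decision involved. Interpreting "quality meets Q10 / Q30" as a mean read quality, computing that mean from per-base error probabilities, and assuming Phred+33 encoding are all assumptions, and the names used are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple

# Thresholds named in the text: Q10 for third-generation, Q30 for second-generation.
MIN_QUALITY = {"third_generation": 10.0, "second_generation": 30.0}

FastqRecord = Tuple[str, str, str]  # (read id, sequence, quality string)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred quality obtained by averaging per-base error probabilities."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def quality_filter(
    records: Iterable[FastqRecord], data_type: str
) -> Iterator[FastqRecord]:
    """Yield only the reads whose mean quality meets the threshold for data_type."""
    threshold = MIN_QUALITY[data_type]
    for record in records:
        if mean_quality(record[2]) >= threshold:
            yield record


# Illustrative reads: '5' encodes Q20 and 'I' encodes Q40 under Phred+33.
reads = [
    ("r1", "ACGT", "5555"),
    ("r2", "ACGT", "IIII"),
]
kept_long = [r[0] for r in quality_filter(reads, "third_generation")]
kept_short = [r[0] for r in quality_filter(reads, "second_generation")]
```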

The invention establishes a pathogen comparison and identification method for third-generation long-read sequencing data, which can effectively process the long-read data generated by third-generation sequencing technologies such as Nanopore and PacBio. Compared with the currently widely used methods based on alignment software such as BWA and Bowtie 2, the method solves the problem that accuracy and analysis speed are difficult to reconcile in the comparison analysis of sequencing read data with long read lengths and high error rates, reduces the analysis time for typical data (1 Gb of nanopore sequencing data) to within 30 minutes, and meets the clinical need for rapid pathogen detection analysis based on metagenome or metatranscriptome data obtained by third-generation sequencing technology. Meanwhile, the method can also analyze short-read data obtained by second-generation sequencing, and therefore has better data compatibility.

Fig. 5 is a schematic diagram of a system 500 for automated analysis of pathogen types according to an embodiment of the invention. As shown in fig. 5, a system 500 for automatically analyzing pathogen types according to an embodiment of the invention includes: a sequencing read data acquisition unit 501, a data cleansing unit 502, a pathogen type recognition unit 503, a second sequencing read data set acquisition unit 504, a comparison result acquisition unit 505 and a final pathogen type recognition result determination unit 506.

Preferably, the sequencing read data obtaining unit 501 is configured to obtain at least one sequencing read data obtained by detecting a nucleic acid sequence corresponding to a sample to be tested.

Preferably, the data cleansing unit 502 is configured to determine a data type of each sequencing read data in the at least one sequencing read data, and conduct data cleansing on each sequencing read data in the at least one sequencing read data according to a preset data cleansing policy according to the data type, so as to obtain at least one qualified first sequencing read data.

Preferably, the data cleansing unit 502 performs data cleansing on each sequencing read data in the at least one sequencing read data according to a preset data cleansing policy according to a data type, so as to obtain at least one qualified first sequencing read data, including:

Preferably, the pathogen type recognition unit 503 is configured to perform preliminary pathogen type recognition on each of the at least one qualified first sequencing read data by using a first recognition method and a second recognition method, so as to obtain a first pathogen type recognition result and a second pathogen type recognition result corresponding to each of the first sequencing read data.
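One part of the preliminary identification described for fig. 4, retaining Minimap2 alignments whose MAPQ value is at least 50, can be sketched as follows. The sketch assumes Minimap2 output in PAF format, in which column 1 is the read name, column 6 is the target sequence name and column 12 is the mapping quality. How target names are mapped to pathogen types, and how this result is combined with the Kraken2 result, are not modeled here. The function name `reads_passing_mapq` is hypothetical.

```python
from typing import Dict, Iterable, Tuple

MIN_MAPQ = 50  # first comparison score threshold given in the text


def reads_passing_mapq(
    paf_lines: Iterable[str], min_mapq: int = MIN_MAPQ
) -> Dict[str, str]:
    """Map read id -> target name for PAF alignments with MAPQ >= min_mapq.

    If a read has several passing alignments, the one with the highest MAPQ is kept.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        read_id, target, mapq = fields[0], fields[5], int(fields[11])
        if mapq >= min_mapq and (read_id not in best or mapq > best[read_id][1]):
            best[read_id] = (target, mapq)
    return {read_id: target for read_id, (target, _) in best.items()}


# Illustrative PAF lines only; names and coordinates are made up.
paf = [
    "r1\t1000\t0\t950\t+\trefA\t5000\t100\t1050\t900\t950\t60",
    "r2\t1200\t0\t400\t-\trefB\t8000\t10\t410\t300\t400\t20",
]
passing = reads_passing_mapq(paf)
```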

Preferably, the second sequencing read data set obtaining unit 504 is configured to perform selection of second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, obtain at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to obtain second sequencing read data sets corresponding to different pathogen types.
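The classification of selected reads into per-pathogen sets can be illustrated with the following minimal sketch, which groups reads by taxid and produces FASTA text named after each taxid, in line with the taxid.fasta files mentioned for fig. 4. The sketch assumes that each selected read already carries its identified taxid; the function name `split_by_taxid` and the example reads are hypothetical, and nothing is written to disk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_taxid(reads: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Group (read id, sequence, taxid) tuples into FASTA text keyed by '<taxid>.fasta'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for read_id, sequence, taxid in reads:
        groups[taxid].append(f">{read_id}\n{sequence}\n")
    return {f"{taxid}.fasta": "".join(entries) for taxid, entries in groups.items()}


# Illustrative reads; the taxid values are examples only.
selected = [
    ("r1", "ACGT", "562"),
    ("r2", "TTAA", "1280"),
    ("r3", "GGCC", "562"),
]
files = split_by_taxid(selected)
```

Each resulting entry corresponds to one second sequencing read data set, which can then be compared with the reference sequence of the matching pathogen type.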

Preferably, the second sequencing read data set obtaining unit 504 performs selection of second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each piece of first sequencing read data, to obtain at least one piece of second sequencing read data, including:

Preferably, the comparison result obtaining unit 505 is configured to compare, for any one of the second sequencing read data sets, each second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain a comparison result corresponding to each second sequencing read data in the any one of the second sequencing read data sets.

Preferably, the final pathogen-type recognition result determining unit 506 is configured to determine a final pathogen-type recognition result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.

Preferably, the final pathogen type identification result determining unit 506 determines a final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, including:

Preferably, wherein the system further comprises:

The system 500 for automatically analyzing a pathogen type according to an embodiment of the present invention corresponds to the method 100 for automatically analyzing a pathogen type according to another embodiment of the present invention, and is not described again here.

The application has been described with reference to a few embodiments. However, as is well known to those skilled in the art, other embodiments than the above disclosed application are equally possible within the scope of the application, as defined by the appended patent claims.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise therein. All references to "a/an/the [ means, component, etc. ]" are to be interpreted openly as referring to at least one instance of said means, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A method for automated analysis of pathogen types, the method comprising:

2. The method of claim 1, wherein the performing data cleansing on each sequencing read data in the at least one sequencing read data according to a preset data cleansing policy according to a data type to obtain at least one qualified first sequencing read data comprises:

3. The method of claim 1, wherein the selecting the second sequencing read data based on the first pathogen type identification result and the second pathogen type identification result corresponding to each of the first sequencing read data, and obtaining at least one second sequencing read data, comprises:

4. The method of claim 1, wherein the comparison comprises: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expectation for each second sequencing read data to a pathogen reference sequence corresponding to the identified pathogen type.

5. The method of claim 4, wherein determining a final pathogen type identification result for the sample to be tested based on the comparison result for each second sequencing read data in each second sequencing read data set, comprises:

6. The method according to claim 1, wherein the method further comprises:

7. The method of claim 6, wherein the method further comprises:

8. A system for automated analysis of pathogen types, the system comprising:

9. The system of claim 8, wherein the data cleansing unit performs data cleansing on each sequencing read data in the at least one sequencing read data according to a preset data cleansing policy according to a data type to obtain at least one qualified first sequencing read data, comprising:

10. The system of claim 8, wherein the second sequencing read data set obtaining unit performs selection of second sequencing read data according to a first pathogen type recognition result and a second pathogen type recognition result corresponding to each of the first sequencing read data, and obtains at least one second sequencing read data, including:

11. The system of claim 8, wherein the comparison comprises: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expectation for each second sequencing read data to a pathogen reference sequence corresponding to the identified pathogen type.

12. The system of claim 11, wherein the final pathogen type recognition result determining unit determines a final pathogen type recognition result of the sample to be tested based on the comparison result corresponding to each second sequencing read data in each second sequencing read data set, comprising:

13. The system of claim 8, wherein the system further comprises:

14. The system of claim 8, wherein the system further comprises: