CN113096737A - Method and system for automatically analyzing pathogen types

Publication number
CN113096737A
Authority
CN
China
Prior art keywords
read data
sequencing read
pathogen
data
sequencing
Prior art date
Legal status
Granted
Application number
CN202110331835.5A
Other languages
Chinese (zh)
Other versions
CN113096737B (en
Inventor
杜鹏程
余乐
刘树青
Current Assignee
Beijing Yuansheng Kangtai Gene Technology Co ltd
Original Assignee
Beijing Yuansheng Kangtai Gene Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuansheng Kangtai Gene Technology Co ltd
Priority to CN202110331835.5A
Publication of CN113096737A
Application granted
Publication of CN113096737B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type identification by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are then compared against the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology and reduces the computation required for accurate comparison; it can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, meeting the need for rapid pathogen detection and analysis. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.

Description

Method and system for automatically analyzing pathogen types
Technical Field
The present invention relates to the field of information technology, and more particularly, to a method and system for automated analysis of pathogen types.
Background
A pathogen identification method based on high-throughput sequencing data generally proceeds as follows. The sample undergoes experimental steps such as pretreatment, nucleic acid extraction, and sequencing library construction, and a high-throughput sequencer determines the nucleic acid sequences in the sample. Once the nucleic acid sequence data generated by the sequencer are obtained, sequence alignment software aligns the sequences against a pathogen sequence database, and the alignment results are filtered under defined conditions to obtain credible results. The number of sequences in the sample's data that belong to pathogens, their proportion among all sequences, and similar measures are then calculated, and the presence of pathogens in the tested sample is finally determined.
Because metagenome or metatranscriptome sequencing produces large volumes of data, clinical pathogen detection based on second-generation sequencing technology generally yields more than 1.5 Gb of sequencing data, and clinical pathogen detection based on third-generation long-read sequencing can yield even more. Therefore, to improve the timeliness and accuracy of clinical pathogen detection, a method and a system for rapidly aligning and analyzing metagenome or metatranscriptome sequencing data must be established.
Currently, commonly used sequence alignment software, including BWA, Bowtie 2, SOAP, and BLAST, is widely used in the analysis of second-generation sequencing data. The technical scheme of Chinese patent CN 108334750B, "Metagenome data analysis method and system", uses a k-mer algorithm to obtain a primary identification result and then uses the BLAST algorithm to perform secondary species identification on the sequences in the primary identification set; when the identification results of more than 50% of the sequences in the verification sequence set agree with the primary species identification result, the primary result is considered verified and the species is reported as detected. The technical scheme of Chinese patent CN 109923217A, "Identification of pathogens and antibiotic characterization in metagenome samples", uses the BWA algorithm to align sequencing data against a pathogen sequence database. Fast short-sequence alignment algorithms such as BWA and Bowtie 2 are built on the Burrows-Wheeler transform (BWT) and are optimized mainly for second-generation sequencing data to increase alignment speed. The BLAST algorithm is a sequence alignment algorithm based on local alignment; it uses a short-segment matching algorithm and an effective statistical model to find the best local alignment between a target sequence and a database.
However, how to achieve automated pathogen type analysis based on third-generation sequencing data remains a problem to be solved.
Disclosure of Invention
The invention provides a method and a system for automatically analyzing pathogen types, which are used for solving the problem of how to quickly and accurately determine the pathogen types in a sample to be detected on the basis of third-generation sequencing data.
In order to solve the above-mentioned problems, according to an aspect of the present invention, there is provided a method for automatically analyzing pathogen types, the method comprising:
obtaining at least one sequencing read data which is obtained by detecting a nucleic acid sequence and corresponds to a sample to be detected;
determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;
for any one second sequencing read data set, comparing each second sequencing read data in the any one second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set;
and determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data includes:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first match length ratio, i.e., the ratio of the length of the portion of the any one first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of the any one first sequencing read data, is greater than or equal to a preset first ratio threshold, taking the any one first sequencing read data as one second sequencing read data.
Preferably, the comparison result comprises, for each second sequencing read data: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expected value with respect to the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:
for any one second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in the set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second match length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to be present in the sample to be tested.
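The final determination step can be sketched in Python as follows. This is a minimal illustration only: the `Alignment` record, the function names, and the three threshold values are hypothetical, since the text says only that the thresholds are "preset".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alignment:
    """Comparison result for one second sequencing read (field names are illustrative)."""
    read_id: str
    score: float        # second alignment score
    match_ratio: float  # second match length ratio, 0..1
    similarity: float   # sequence similarity, 0..1
    expect: float       # alignment expected value


# Hypothetical thresholds; the source does not give concrete values.
MIN_MATCH_RATIO = 0.8
MIN_SIMILARITY = 0.9
MAX_EXPECT = 1e-5


def pathogen_present(alignments: List[Alignment]) -> bool:
    """Take the read with the highest alignment score and test all three conditions."""
    if not alignments:
        return False
    target = max(alignments, key=lambda a: a.score)
    return (
        target.match_ratio >= MIN_MATCH_RATIO
        and target.similarity >= MIN_SIMILARITY
        and target.expect <= MAX_EXPECT
    )


def final_identification(read_sets: Dict[str, List[Alignment]]) -> List[str]:
    """Return the pathogen types whose read set passes the check."""
    return sorted(p for p, alns in read_sets.items() if pathogen_present(alns))
```

Only the top-scoring read of each set is checked, as the text describes, so a pathogen type is reported only if its best-aligned read meets all three thresholds at once.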
Preferably, the method further comprises:
performing statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be detected, to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage for the pathogen types present in the sample to be detected, at the genus level and the species level of each pathogen type.
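The statistics named above can be sketched as follows. The source does not define these quantities, so the definitions here are assumptions: RPM is taken as reads per million total reads, the composition ratio as each pathogen's share of all pathogen-assigned reads, and coverage as the fraction of reference bases covered by at least one aligned interval. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def rpm(pathogen_reads: int, total_reads: int) -> float:
    """Reads per million: pathogen reads scaled to one million total reads (assumed definition)."""
    return pathogen_reads * 1_000_000 / total_reads


def composition_ratio(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each pathogen among all pathogen-assigned reads (assumed definition)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def genome_coverage(intervals: List[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of the reference covered by half-open intervals [start, end)."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length
```

Merging overlapping intervals before summing avoids counting a reference base twice when several reads align to the same region.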
Preferably, the method further comprises:
obtaining annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be detected, and information on the target object from which the sample to be detected was taken, and generating a detection report according to the statistical analysis result, the annotation information, the information on the sample to be detected, and the target object information.
According to another aspect of the present invention, there is provided a system for automated analysis of pathogen type, the system comprising:
a sequencing read data acquisition unit for acquiring at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection;
the data cleaning unit is used for determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
the pathogen type identification unit is used for carrying out primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
a second sequencing read data set acquisition unit, configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, acquire at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to acquire second sequencing read data sets corresponding to different pathogen types;
a comparison result obtaining unit, configured to compare, for any one of the second sequencing read data sets, each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain a comparison result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets;
and the final pathogen type identification result determining unit is used for determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the data cleaning unit performs data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality, which includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the second sequencing read data set obtaining unit selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first match length ratio, i.e., the ratio of the length of the portion of the any one first sequencing read data that matches the pathogen reference sequence corresponding to the identified pathogen type to the total length of the any one first sequencing read data, is greater than or equal to a preset first ratio threshold, taking the any one first sequencing read data as one second sequencing read data.
Preferably, the comparison result comprises, for each second sequencing read data: a second alignment score, a second match length ratio, a sequence similarity, and an alignment expected value with respect to the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the determining unit of the final pathogen type identification result determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:
for any one second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in the set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second match length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to be present in the sample to be tested.
Preferably, the system further comprises:
a statistical analysis unit, configured to perform statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be detected, to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, the composition ratio, the RPM value, and the reference genome coverage for the pathogen types present in the sample to be detected, at the genus level and the species level of each pathogen type.
Preferably, the system further comprises:
a detection report generating unit, configured to obtain annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be detected, and information on the target object from which the sample to be detected was taken, and to generate a detection report according to the statistical analysis result, the annotation information, the information on the sample to be detected, and the target object information.
The invention provides a method and a system for automatically analyzing pathogen types. First sequencing read data of different data types from a sample to be detected are first given a preliminary pathogen type identification by two different algorithms; second sequencing read data are then selected; and the second sequencing read data identified for each pathogen type are then compared against the corresponding pathogen reference sequences for verification and screening, thereby determining the pathogen types present in the sample to be detected. The method can effectively process long-read data generated by third-generation sequencing technology and reduces the computation required for accurate comparison. It addresses the difficulty of achieving both accuracy and analysis speed when comparing long reads and sequencing read data with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical need for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
FIG. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the present invention;
FIG. 2 is a schematic diagram for automated pathogen type analysis according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of a detection report according to an embodiment of the present invention;
FIG. 4 is a flow diagram of automated pathogen type analysis according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a system 500 for automated pathogen type analysis, according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that this disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
Fig. 1 is a flow diagram of a method 100 for automated analysis of pathogen types according to an embodiment of the invention. As shown in fig. 1, the method provided by the embodiment of the present invention can effectively process long-read data generated by third-generation sequencing technology and reduces the computation required for accurate comparison. It addresses the difficulty of achieving both accuracy and analysis speed when comparing long reads and sequencing read data with a high error rate, can complete accurate analysis and report generation for typical third-generation metagenome or metatranscriptome data within 30 minutes, and meets the clinical need for rapid pathogen detection and analysis of metagenome or metatranscriptome data obtained by third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility. The method 100 starts at step 101, in which at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection is obtained.
The method can analyze third-generation long-read metagenome or metatranscriptome sequencing data corresponding to the sample to be detected, thereby determining the pathogen type. The principle of pathogen type identification is shown in fig. 2. Referring to fig. 2, in the present invention, the sequencing read sequences obtained by nucleic acid sequencing of a sample on a sequencer are read and stored in FASTQ format, an internationally used standard for sequencing data, to obtain a FASTQ data file; the sequencing read data can also be compressed with gzip to reduce storage occupation, so as to obtain the fastq. When there are multiple samples to be detected, the method can identify the sample barcode information corresponding to each sample to be detected, split and identify the sequencing read sequence data accordingly, and merge a plurality of FASTQ data files or compressed FASTQ. The sequencing read sequences can be second-generation data acquired on a second-generation sequencer and/or third-generation data acquired on a third-generation sequencer.
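Reading such files can be sketched with the Python standard library. This is a minimal sketch under two assumptions not stated in the source: each record occupies exactly four lines, and gzip-compressed files are recognized by a `.gz` suffix. The function name `read_fastq` is hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a plain or gzip-compressed FASTQ file."""
    # Assumption: compressed files carry a ".gz" suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a trailing blank line)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'
```

Because the function is a generator, reads are streamed one at a time, so large sequencing files are not loaded into memory at once.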
In step 102, the data type of each sequencing read data in the at least one sequencing read data is determined, and data cleaning is performed on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy, so as to obtain at least one first sequencing read data with qualified quality.
Preferably, the performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
In the present invention, as shown in fig. 2, the data types of the sequencing read data include: second-generation short-read data (i.e., second-generation sequencing data) and third-generation long-read data (i.e., third-generation sequencing data). Sequencing read data of different data types are cleaned according to a preset data cleaning strategy, with different data quality control software and parameters used to filter out low-quality data.
In the invention, the NanoFilt software is used to perform quality control detection and filtering on third-generation sequencing read data files; the input file is an original FASTQ format file or a compressed FASTQ. The second sequencing read data quality standard is set according to actual conditions; for example, it may be set to Q ≥ 10, i.e., the error rate of all bases in the sequencing read is less than or equal to 10%. The Q value is calculated from the base error rate by the formula $Q = -10 \log_{10} P$, where Q is the quality value and P is the error rate of a given base.
The FastQC software is used to perform quality control detection and filtering on second-generation sequencing read data files; the input file is an original FASTQ format file or FASTQ. The first sequencing read data quality standard can be set according to actual conditions; for example, the error rate of all bases in the first sequencing read may be set to be less than or equal to 0.1% (i.e., Q ≥ 30).
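The relation between the quality value Q and the error rate P, and a threshold check built on it, can be sketched as follows. Two assumptions are made that the source does not state: quality strings use the common Phred+33 ASCII encoding, and a read is judged by its mean quality (computed by averaging error probabilities) rather than base by base. The function names are hypothetical and do not reproduce the internals of NanoFilt or FastQC.

```python
import math


def phred_to_error(q: float) -> float:
    """Error rate P for quality value Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Quality value Q for error rate P."""
    return -10 * math.log10(p)


def mean_read_quality(quality: str, offset: int = 33) -> float:
    """Mean quality of a read: average the per-base error rates, then convert back to Q."""
    errors = [phred_to_error(ord(ch) - offset) for ch in quality]
    return error_to_phred(sum(errors) / len(errors))


def passes_quality(quality: str, min_q: float) -> bool:
    """True if the read's mean quality reaches the threshold (e.g. 10 or 30)."""
    return bool(quality) and mean_read_quality(quality) >= min_q
```

With these definitions, Q = 10 corresponds to an error rate of 10% and Q = 30 to 0.1%, matching the two example thresholds in the text.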
In addition, the extracted high quality at least one first sequencing read data that passes data quality control can be stored in fastq.
In step 103, a first identification method and a second identification method are respectively used for performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality, so as to respectively obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data.
Referring to fig. 2, in the present invention, the first identification method is Kraken2 and the second identification method is Minimap2; the two different algorithms are each used to identify the pathogen type. The Kraken2 algorithm identifies bacteria, fungi, viruses, and parasites based on species-specific k-mer sequences; its input is the high-quality sequence file in FASTQ.gz format obtained in step 102, and its output is the species discrimination result corresponding to each sequencing read data, namely the first pathogen type identification result.
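Collecting the per-read result can be sketched as below. The sketch assumes Kraken2's standard tab-separated per-read output, whose first three columns are a classified/unclassified flag (`C` or `U`), the read ID, and the assigned taxonomy ID; the function name and the choice to return `None` for unclassified reads are illustrative, not part of the source.

```python
from typing import Dict, Iterable, Optional


def parse_kraken2_output(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each read ID to its assigned taxonomy ID, or None if unclassified."""
    calls: Dict[str, Optional[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        calls[read_id] = taxid if status == "C" else None
    return calls
```

The resulting dictionary plays the role of the first pathogen type identification result; mapping taxonomy IDs to entries of the preset pathogen type set would be a separate lookup.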
Complementary comparison of pathogens with large genomic variation such as viruses is carried out by utilizing a Minimap2 algorithm, pathogen type identification is carried out based on minizer (a section of seed with the minimum hash value in a sequence) hash table search, a training algorithm and a dynamic programming algorithm, input is a high-quality sequence FASTQ. Wherein the second pathogen type identification result comprises: pathogen type and alignment score of each sequencing read data to pathogen reference sequence.
In step 104, selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.
Preferably, the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data includes:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, the first comparison score of the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio of the length of the pathogen reference sequence corresponding to the identified pathogen type in the any one first sequencing read data and the total length of the any one first sequencing read data is greater than or equal to a preset first ratio threshold, the any one first sequencing read data is used as one second sequencing read data.
In the present invention, the preset score threshold may be set to 50, and the first ratio threshold may be set to 50%.
Referring to fig. 2, in the present invention, for the first pathogen type identification result obtained based on the Kraken2 algorithm, for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, that is, the first pathogen type identification result is not empty or unknown, the any one of the first sequencing read data is used as a second sequencing read data.
For the second pathogen type identification result obtained based on the Minimap2 algorithm, any first sequencing read data is taken as a second sequencing read data if all of the following hold: the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set; the first alignment score (MAPQ value) of that first sequencing read data against the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to the preset score threshold of 50; and the first length matching ratio Mi, determined from the CIGAR string as the ratio of the length matched to that pathogen reference sequence to the total length of that first sequencing read data, is greater than or equal to the preset first ratio threshold of 50%.
In the present invention, the second sequencing read data is a read suspected of being a pathogen.
In the present invention, the first length matching ratio Mi of a given sequencing read data is calculated as follows:
$$\mathrm{Mi} = \frac{\sum_{i=1}^{n} m_i}{L}$$
where n is the total number of fragments in the given sequencing read data that match the corresponding pathogen reference sequence, $m_i$ is the length of the i-th matching fragment, and L is the length of the given sequencing read data.
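The length matching ratio and the Minimap2-based selection rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: it assumes that the CIGAR operations M, = and X are the ones counted as matched fragments, that L is supplied by the caller as the full read length, and that all function and parameter names are hypothetical.

```python
import re

# One CIGAR element: a length followed by a single operation letter.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def length_match_ratio(cigar: str, read_length: int) -> float:
    """Sum the lengths m_i of matched fragments and divide by the read length L.

    Assumption: only the operations M, = and X count as matched fragments.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    matched = sum(
        int(length)
        for length, op in _CIGAR_ELEMENT.findall(cigar)
        if op in "M=X"
    )
    return matched / read_length


def keep_aligned_read(
    in_pathogen_set: bool,
    mapq: int,
    cigar: str,
    read_length: int,
    min_mapq: int = 50,      # preset score threshold from the text
    min_ratio: float = 0.5,  # preset first ratio threshold (50%)
) -> bool:
    """Apply the three conditions: pathogen set membership, MAPQ and match ratio."""
    return (
        in_pathogen_set
        and mapq >= min_mapq
        and length_match_ratio(cigar, read_length) >= min_ratio
    )


if __name__ == "__main__":
    print(length_match_ratio("10S80M10S", 100))
    print(keep_aligned_read(True, 60, "10S80M10S", 100))
```

A read with 80 matched bases out of 100 and a MAPQ of 60 is retained, while the same read with a MAPQ below 50 or with fewer than half of its bases matched is discarded.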
In the present invention, after the at least one second sequencing read data suspected of being a pathogen is determined, it is further classified according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types. The data are split according to the identified or aligned pathogen species, and the second sequencing read data are extracted from the input FASTQ file. Because identification is based on alignment with reference sequences of known species, every preliminary identification result is a known species; the splitting process stores the reads of each species in a separate FASTA format file, which serves as the input of the subsequent Basic Local Alignment Search Tool (BLAST) alignment.
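The splitting step can be illustrated with a minimal sketch that groups reads by their preliminarily identified species and renders each group as FASTA text. The in-memory dictionaries and the function names are assumptions made for illustration; the actual data structures and file layout are not specified in the text.

```python
from collections import defaultdict


def split_reads_by_species(
    reads: dict[str, str],
    assignments: dict[str, str],
) -> dict[str, list[tuple[str, str]]]:
    """Group (read_id, sequence) pairs by the species assigned to each read.

    reads maps read IDs to sequences; assignments maps read IDs to species.
    Reads without an assignment are left out, since they are not suspected
    pathogen reads.
    """
    groups: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for read_id, species in assignments.items():
        if read_id in reads:
            groups[species].append((read_id, reads[read_id]))
    return dict(groups)


def to_fasta(records: list[tuple[str, str]]) -> str:
    """Render records as FASTA text: a '>' header line followed by the sequence."""
    return "".join(f">{read_id}\n{sequence}\n" for read_id, sequence in records)


if __name__ == "__main__":
    reads = {"r1": "ACGT", "r2": "GGCC", "r3": "TTAA"}
    assignments = {"r1": "species A", "r2": "species A", "r3": "species B"}
    for species, records in split_reads_by_species(reads, assignments).items():
        print(species)
        print(to_fasta(records), end="")
```

Each group would then be written to its own FASTA file and aligned against the reference sequence of the corresponding species.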
In step 105, for any one second sequencing read data set, each second sequencing read data in the any one second sequencing read data set is compared with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set, and a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set is obtained.
Preferably, the alignment result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
Referring to fig. 2, in the present invention, each second sequencing read data in the FASTA format file of suspected pathogen reads separated by pathogen type in step 104 is subjected to accurate BLAST Alignment with the reference sequence of each species using Basic Local Alignment Search Tool (BLAST), so as to obtain the Alignment result corresponding to each second sequencing read data in each second sequencing read data set. And the comparison result corresponding to each second sequencing read data comprises a second comparison score, a second matching length ratio, a sequence similarity and a comparison expected value of the second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
In step 106, the final pathogen type identification result of the sample to be detected is determined according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set includes:
for any sequencing read data set, selecting second sequencing read data with the highest second alignment score in any sequencing read data set as target sequencing read data, and judging whether the target sequencing read data simultaneously meet the condition that a second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to be present in the sample to be tested.
In the present invention, the second ratio threshold is 80%, the sequence similarity threshold is 90%, and the alignment expected threshold is $10^{-5}$. For any sequencing read data set, the second sequencing read data with the highest second alignment score in that set is selected as the target sequencing read data. It is then judged whether the target sequencing read data simultaneously satisfies the following: the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of $10^{-5}$. If so, it is determined that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested. Finally, the results are summarized, and the final pathogen type identification result is determined from all pathogen types determined to be present in the sample to be tested. The final pathogen type identification result includes: the pathogen type and the sequencing read data corresponding to that pathogen type.
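The decision rule for each second sequencing read data set can be sketched as follows. The record type `BlastHit`, its field names and the function names are hypothetical; only the thresholds (80%, 90% and 1e-5) and the rule of testing the highest-scoring read come from the text.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    read_id: str
    score: float        # second alignment score
    match_ratio: float  # second matching length ratio, as a fraction
    identity: float     # sequence similarity, as a fraction
    evalue: float       # alignment expected value


def pathogen_present(
    hits: list[BlastHit],
    min_ratio: float = 0.8,
    min_identity: float = 0.9,
    max_evalue: float = 1e-5,
) -> bool:
    """Test the highest-scoring read of one data set against all three thresholds."""
    if not hits:
        return False
    target = max(hits, key=lambda hit: hit.score)
    return (
        target.match_ratio >= min_ratio
        and target.identity >= min_identity
        and target.evalue <= max_evalue
    )


def final_pathogen_types(hits_by_type: dict[str, list[BlastHit]]) -> list[str]:
    """Summarize all pathogen types judged to be present in the sample."""
    return sorted(
        name for name, hits in hits_by_type.items() if pathogen_present(hits)
    )


if __name__ == "__main__":
    example = {
        "type A": [BlastHit("r1", 200.0, 0.95, 0.97, 1e-30)],
        "type B": [BlastHit("r2", 90.0, 0.60, 0.95, 1e-10)],
    }
    print(final_pathogen_types(example))
```

Note that only the read with the highest second alignment score is tested; a set whose best-scoring read fails any one threshold does not contribute a pathogen type to the final result.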
Preferably, wherein the method further comprises:
performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.
Preferably, wherein the method further comprises:
and obtaining annotation information of each pathogen type, information of a sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.
The method of the present invention further comprises: performing statistical analysis, by species, on the sequencing read data with identified pathogen types obtained in step 106, and determining the pathogen types present in the sample to be tested together with, at the genus level and the species level, the read count Ni, the composition ratio Pi, the RPMi value, the reference genome coverage, and the like. Pathogen types include: bacteria, fungi, viruses and parasites.
In the present invention, the composition ratio Pi of each pathogen type in the sample to be tested is calculated from the number of sequencing read data for each pathogen type as follows:
$$P_i = \frac{N_i}{\sum_{j=1}^{n} N_j}$$
where Pi is the composition ratio of the i-th pathogen type, n is the total number of pathogen types detected in the sample, and Ni is the number of reads for the i-th pathogen type.
The RPMi value of each pathogen type is calculated from the number of sequencing read data for each pathogen type and the total number of sequencing read data that passed quality control, as follows:
$$\mathrm{RPM}_i = \frac{N_i}{N_t} \times 10^{6}$$
where RPMi is the RPM value of the i-th pathogen type, Ni is the number of reads for the i-th pathogen type, and Nt is the total number of sequencing reads in the sample that passed quality control.
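The two statistics can be computed from per-type read counts as in the following sketch. The function names and the dictionary representation are assumptions made for illustration; the composition ratio is taken as the share of each type among all pathogen reads, and the RPM value as the read count scaled to one million quality-controlled reads.

```python
def composition_ratios(read_counts: dict[str, int]) -> dict[str, float]:
    """Return Pi = Ni / sum of all Nj for every pathogen type."""
    total = sum(read_counts.values())
    if total == 0:
        return {name: 0.0 for name in read_counts}
    return {name: count / total for name, count in read_counts.items()}


def rpm_values(read_counts: dict[str, int], total_qc_reads: int) -> dict[str, float]:
    """Return RPMi = Ni / Nt * 10^6, where Nt is the number of reads passing QC."""
    if total_qc_reads <= 0:
        raise ValueError("total_qc_reads must be positive")
    return {
        name: count * 1_000_000 / total_qc_reads
        for name, count in read_counts.items()
    }


if __name__ == "__main__":
    counts = {"type A": 30, "type B": 10}
    print(composition_ratios(counts))
    print(rpm_values(counts, total_qc_reads=2_000_000))
```

Because the RPM value is normalized by the total number of quality-controlled reads rather than by pathogen reads alone, it allows read counts to be compared between samples sequenced to different depths.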
The method can also supplement the statistical information on the pathogens in the sample to be tested with pathogen annotation information, sample information, target object information and the like, so as to generate a detection report. The annotation information includes: the Chinese name and Latin name of the pathogen type, whether the genome is DNA or RNA, a textual description of pathogenicity characteristics, and the like. The sample information includes: sample type, volume, sampling time, and the like. The target object information, i.e., patient information, includes: name, age, sex, disease diagnosis, and the like. The detection report is shown in fig. 3.
Fig. 4 is a flowchart of automatic pathogen type analysis according to an embodiment of the present invention, including: defining a data storage directory and a data type; reading the fastq or fastq.gz files under the directory; judging the data type; for third-generation sequencing read data, performing data quality control analysis using NanoFilt and retaining the sequencing read data whose quality satisfies Q10; for second-generation sequencing read data, performing data quality control analysis using FastQC and retaining the sequencing read data whose quality satisfies Q30; storing the retained sequencing read data in a past.fastq.gz file; performing species analysis based on Kraken2 and retaining the sequencing read data of known pathogens; performing species analysis based on Minimap2 and retaining the sequencing read data whose first alignment score (MAPQ value) is greater than or equal to 50; storing the retained sequencing read data in taxi.fasta files according to the species taxi; aligning against the genome of each species using BLAST to obtain second alignment scores, and retaining the sequencing read data with the highest second alignment score if it simultaneously satisfies the following: the second matching length ratio is greater than or equal to the preset second ratio threshold of 80%, the sequence similarity is greater than or equal to the preset sequence similarity threshold of 90%, and the alignment expected value is less than or equal to the preset alignment expected threshold of $10^{-5}$; and performing statistical analysis of the pathogen types and generating the detection report.
The invention establishes a pathogen alignment and identification method for third-generation long-read sequencing data, and can effectively process the long-read data generated by third-generation sequencing technologies such as nanopore and PacBio. Compared with methods based on currently widely used alignment software such as BWA and Bowtie 2, the method addresses the difficulty of achieving both accuracy and analysis speed when aligning sequencing read data with long read lengths and high error rates, reduces the analysis time for typical data (1 Gb of nanopore sequencing data) to within 30 minutes, and meets the clinical need for rapid pathogen detection analysis of metagenome or metatranscriptome data obtained with third-generation sequencing technology. The method can also analyze short-read data obtained by second-generation sequencing, and therefore has good data compatibility.
Fig. 5 is a schematic diagram of a system 500 for automated pathogen type analysis according to an embodiment of the present invention. As shown in fig. 5, the present invention provides a system 500 for automated pathogen type analysis, comprising: a sequencing read data acquisition unit 501, a data cleaning unit 502, a pathogen type identification unit 503, a second sequencing read data set acquisition unit 504, an alignment result acquisition unit 505, and a final pathogen type identification result determination unit 506.
Preferably, the sequencing read data obtaining unit 501 is configured to obtain at least one sequencing read data obtained by detecting a nucleic acid sequence corresponding to a sample to be detected.
Preferably, the data cleaning unit 502 is configured to determine the data type of each sequencing read data in the at least one sequencing read data, and to perform data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy, so as to obtain at least one first sequencing read data with qualified quality.
Preferably, the data cleaning unit 502 performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality includes:
determining, using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
Preferably, the pathogen type identification unit 503 is configured to perform a preliminary pathogen type identification on each of the at least one qualified first sequencing read data by using a first identification method and a second identification method, respectively, so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each of the first sequencing read data, respectively.
Preferably, the second sequencing read data set obtaining unit 504 is configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, obtain at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types.
Preferably, the second sequencing read data set obtaining unit 504 selects the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, and obtains at least one second sequencing read data, including:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, the first comparison score of the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio of the length of the pathogen reference sequence corresponding to the identified pathogen type in the any one first sequencing read data and the total length of the any one first sequencing read data is greater than or equal to a preset first ratio threshold, the any one first sequencing read data is used as one second sequencing read data.
Preferably, the alignment result obtaining unit 505 is configured to, for any one of the second sequencing read data sets, align each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain an alignment result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets.
Preferably, the alignment result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
Preferably, the final pathogen type identification result determining unit 506 is configured to determine the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
Preferably, the determining unit 506 for final pathogen type identification result determines the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set, and includes:
for any sequencing read data set, selecting second sequencing read data with the highest second alignment score in any sequencing read data set as target sequencing read data, and judging whether the target sequencing read data simultaneously meet the condition that a second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected threshold; if so, determining that the pathogen type corresponding to the target sequencing read data exists in the sample to be tested;
and determining the final pathogen type identification result according to all pathogen types determined to be present in the sample to be tested.
Preferably, wherein the system further comprises:
the statistical analysis unit is used for performing statistical analysis on the sequencing read data with the pathogen type identified according to the final pathogen type identification result of the sample to be detected to obtain a statistical analysis result; wherein the statistical analysis result comprises: the number of sequencing read data, composition ratio, RPM value and reference genome coverage for the pathogen types present in the test sample and the genus level and species level of each pathogen type.
Preferably, wherein the system further comprises:
and the detection report generating unit is used for acquiring annotation information of each pathogen type, information of the sample to be detected and the target object information of the sample to be detected in the final pathogen type identification result, and generating a detection report according to the statistical analysis result, the annotation information, the information of the sample to be detected and the target object information.
The system 500 for automatically analyzing pathogen types according to this embodiment of the present invention corresponds to the method 100 for automatically analyzing pathogen types according to another embodiment of the present invention, and the details are not repeated here.
The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (14)

1. A method for automated analysis of pathogen type, the method comprising:
obtaining at least one sequencing read data which is obtained by detecting a nucleic acid sequence and corresponds to a sample to be detected;
determining the data type of each sequencing read data in the at least one sequencing read data, and performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality;
performing primary identification on the pathogen type of each first sequencing read data in the at least one first sequencing read data with qualified quality by using a first identification method and a second identification method respectively to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data respectively;
selecting second sequencing read data according to a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data, and classifying the at least one second sequencing read data according to the identified pathogen type to obtain second sequencing read data sets corresponding to different pathogen types;
for any one second sequencing read data set, comparing each second sequencing read data in the any one second sequencing read data set with a pathogen reference sequence of a pathogen type corresponding to the any one second sequencing read data set to obtain a comparison result corresponding to each second sequencing read data in the any one second sequencing read data set;
and determining the final pathogen type identification result of the sample to be detected according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
2. The method of claim 1, wherein the performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy to obtain at least one first sequencing read data with qualified quality comprises:
determining, using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
3. The method of claim 1, wherein the selecting the second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data to obtain at least one second sequencing read data comprises:
for any one of the first sequencing read data, if the first pathogen type identification result corresponding to the any one of the first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking the any one of the first sequencing read data as a second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to the any one first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, the first comparison score of the any one first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first length matching ratio of the length of the pathogen reference sequence corresponding to the identified pathogen type in the any one first sequencing read data and the total length of the any one first sequencing read data is greater than or equal to a preset first ratio threshold, the any one first sequencing read data is used as one second sequencing read data.
4. The method of claim 1, wherein the comparison result comprises: a second alignment score, a second matching length ratio, a sequence similarity, and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
5. The method of claim 4, wherein determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set comprises:
for any second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected value threshold; if so, determining that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested;
and determining the final pathogen type identification result according to all the pathogen types determined to be present in the sample to be tested.
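The decision rule of claim 5 can be illustrated with a minimal Python sketch. The `Alignment` record, the function name `call_pathogens` and all default threshold values are hypothetical; the claims only state that the thresholds are preset, not what they are.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Comparison result for one second sequencing read (field names are illustrative)."""
    score: float        # second alignment score
    match_ratio: float  # second matching length ratio
    similarity: float   # sequence similarity
    evalue: float       # alignment expected value


def call_pathogens(read_sets, min_ratio=0.9, min_similarity=0.95, max_evalue=1e-5):
    """Return the pathogen types judged present in the sample.

    read_sets maps a pathogen type to the alignments of its second sequencing
    read data set. The default thresholds are placeholders, not values from
    the patent.
    """
    present = []
    for pathogen, alignments in read_sets.items():
        if not alignments:
            continue
        # Only the read with the highest second alignment score is examined.
        target = max(alignments, key=lambda a: a.score)
        if (target.match_ratio >= min_ratio
                and target.similarity >= min_similarity
                and target.evalue <= max_evalue):
            present.append(pathogen)
    return sorted(present)
```

Note that a set is rejected when its top-scoring read fails any one condition, even if a lower-scoring read in the same set would pass all three.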
6. The method of claim 1, further comprising:
performing statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be tested, to obtain a statistical analysis result; wherein the statistical analysis result comprises: for each pathogen type present in the sample to be tested, the number of sequencing read data, the composition ratio, the RPM value and the reference genome coverage, at both the genus level and the species level.
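The statistics named in claim 6 can be sketched as follows. The claims do not define them, so this sketch assumes common conventions: RPM as reads per million sequenced reads, composition ratio as a pathogen's share of all pathogen-assigned reads, and reference genome coverage as the fraction of reference positions covered by at least one read. All function names are hypothetical.

```python
def rpm(read_count, total_reads):
    """Reads per million: read_count scaled to one million total reads (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return read_count * 1_000_000 / total_reads


def composition_ratios(counts):
    """Share of each pathogen among all pathogen-assigned reads (assumed definition)."""
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


def genome_coverage(intervals, genome_length):
    """Fraction of the reference covered by at least one read.

    intervals are half-open (start, end) positions of aligned reads on the
    reference; overlapping intervals are counted once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length
```

The same functions apply at the genus level and the species level; only the grouping of the read counts changes.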
7. The method of claim 6, further comprising:
obtaining annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be tested, and information on the target object corresponding to the sample to be tested, and generating a detection report according to the statistical analysis result, the annotation information, the information on the sample to be tested and the target object information.
8. A system for automated analysis of pathogen type, the system comprising:
a sequencing read data acquisition unit, configured to acquire at least one sequencing read data corresponding to a sample to be tested and obtained by nucleic acid sequence detection;
a data cleaning unit, configured to determine the data type of each sequencing read data in the at least one sequencing read data, and to perform data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and a preset data cleaning strategy, so as to obtain at least one first sequencing read data of qualified quality;
a pathogen type identification unit, configured to perform preliminary identification of the pathogen type of each first sequencing read data in the at least one first sequencing read data of qualified quality by using a first identification method and a second identification method respectively, so as to obtain a first pathogen type identification result and a second pathogen type identification result corresponding to each first sequencing read data;
a second sequencing read data set acquisition unit, configured to select second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, acquire at least one second sequencing read data, and classify the at least one second sequencing read data according to the identified pathogen type, so as to acquire second sequencing read data sets corresponding to different pathogen types;
a comparison result obtaining unit, configured to compare, for any one of the second sequencing read data sets, each of the second sequencing read data in the any one of the second sequencing read data sets with a pathogen reference sequence of a pathogen type corresponding to the any one of the second sequencing read data sets, and obtain a comparison result corresponding to each of the second sequencing read data in the any one of the second sequencing read data sets;
and a final pathogen type identification result determining unit, configured to determine the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set.
9. The system of claim 8, wherein the data cleaning unit performing data cleaning on each sequencing read data in the at least one sequencing read data according to the data type and the preset data cleaning strategy, to obtain at least one first sequencing read data of qualified quality, comprises:
determining, by using FastQC, the sequencing read data quality of each sequencing read data whose data type is second-generation sequencing data, and screening out each sequencing read data whose quality meets a preset first sequencing read data quality standard as first sequencing read data;
and determining, by using NanoFilt, the sequencing read data quality of each sequencing read data whose data type is third-generation sequencing data, and screening out each sequencing read data whose quality meets a preset second sequencing read data quality standard as first sequencing read data.
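The type-dependent cleaning of claim 9 can be sketched in Python. This sketch does not call FastQC or NanoFilt; it substitutes a simple mean Phred quality check as a stand-in for both tools. The data type labels, the thresholds in `MIN_MEAN_QUALITY` and the Phred+33 encoding are assumptions, since the claims only refer to preset quality standards.

```python
# Hypothetical per-type quality standards (mean Phred score); not values from the patent.
MIN_MEAN_QUALITY = {
    "second_generation": 20.0,
    "third_generation": 7.0,
}


def mean_phred(quality, offset=33):
    """Arithmetic mean of the Phred scores in a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def clean_reads(reads):
    """Return the IDs of reads that pass the standard for their data type.

    reads is an iterable of (read_id, data_type, sequence, quality) tuples.
    Reads with an unknown data type or a malformed record are dropped.
    """
    kept = []
    for read_id, data_type, sequence, quality in reads:
        threshold = MIN_MEAN_QUALITY.get(data_type)
        if threshold is None:
            continue
        if len(sequence) == len(quality) and mean_phred(quality) >= threshold:
            kept.append(read_id)
    return kept
```

Keeping the standards in a single table keyed by data type mirrors the claim's structure: one quality standard per sequencing generation, applied by the same cleaning step.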
10. The system of claim 8, wherein the second sequencing read data set acquisition unit selecting second sequencing read data according to the first pathogen type identification result and the second pathogen type identification result corresponding to each first sequencing read data, to obtain at least one second sequencing read data, comprises:
for any one first sequencing read data, if the first pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to a preset pathogen type set, taking that first sequencing read data as one second sequencing read data;
for any one first sequencing read data, if the second pathogen type identification result corresponding to that first sequencing read data indicates that the identified pathogen type belongs to the preset pathogen type set, the first alignment score between that first sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type is greater than or equal to a preset score threshold, and the first matching length ratio, namely the ratio of the length of that first sequencing read data matched to the pathogen reference sequence corresponding to the identified pathogen type to the total length of that first sequencing read data, is greater than or equal to a preset first ratio threshold, taking that first sequencing read data as one second sequencing read data.
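The two selection rules of claim 10, together with the grouping by pathogen type described in claim 8, can be sketched in Python. The `ReadCall` record and the function `select_second_reads` are hypothetical names. The claims do not say how a read that satisfies both rules is handled; this sketch assumes it is grouped once, under the first identification result.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadCall:
    """Preliminary identification results for one first sequencing read (illustrative)."""
    read_id: str
    first_call: Optional[str]   # pathogen type from the first identification method
    second_call: Optional[str]  # pathogen type from the second identification method
    score: float = 0.0          # first alignment score (second method)
    matched_length: int = 0     # read length matched to the reference (second method)
    read_length: int = 0        # total read length


def select_second_reads(calls, target_set, min_score, min_ratio):
    """Group the selected read IDs by identified pathogen type."""
    groups = defaultdict(list)
    for call in calls:
        # Rule 1: the first method's call alone suffices if it is in the preset set.
        if call.first_call in target_set:
            groups[call.first_call].append(call.read_id)
            continue
        # Rule 2: the second method's call must also pass the score and ratio thresholds.
        ratio = call.matched_length / call.read_length if call.read_length else 0.0
        if (call.second_call in target_set
                and call.score >= min_score
                and ratio >= min_ratio):
            groups[call.second_call].append(call.read_id)
    return dict(groups)
```

Each key of the returned dictionary corresponds to one second sequencing read data set.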
11. The system of claim 8, wherein the comparison result comprises: a second alignment score, a second matching length ratio, a sequence similarity and an alignment expected value between each second sequencing read data and the pathogen reference sequence corresponding to the identified pathogen type.
12. The system according to claim 11, wherein the final pathogen type identification result determining unit determining the final pathogen type identification result of the sample to be tested according to the comparison result corresponding to each second sequencing read data in each second sequencing read data set comprises:
for any second sequencing read data set, selecting the second sequencing read data with the highest second alignment score in that set as target sequencing read data, and judging whether the target sequencing read data simultaneously satisfies the following conditions: the second matching length ratio is greater than or equal to a preset second ratio threshold, the sequence similarity is greater than or equal to a preset sequence similarity threshold, and the alignment expected value is less than or equal to a preset alignment expected value threshold; if so, determining that the pathogen type corresponding to the target sequencing read data is present in the sample to be tested;
and determining the final pathogen type identification result according to all the pathogen types determined to be present in the sample to be tested.
13. The system of claim 8, further comprising:
a statistical analysis unit, configured to perform statistical analysis on the sequencing read data for which a pathogen type has been identified, according to the final pathogen type identification result of the sample to be tested, to obtain a statistical analysis result; wherein the statistical analysis result comprises: for each pathogen type present in the sample to be tested, the number of sequencing read data, the composition ratio, the RPM value and the reference genome coverage, at both the genus level and the species level.
14. The system of claim 8, further comprising:
a detection report generating unit, configured to obtain annotation information for each pathogen type in the final pathogen type identification result, information on the sample to be tested, and information on the target object corresponding to the sample to be tested, and to generate a detection report according to the statistical analysis result, the annotation information, the information on the sample to be tested and the target object information.
CN202110331835.5A 2021-03-26 2021-03-26 Method and system for automatically analyzing pathogen type Active CN113096737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110331835.5A CN113096737B (en) 2021-03-26 2021-03-26 Method and system for automatically analyzing pathogen type

Publications (2)

Publication Number Publication Date
CN113096737A (en) 2021-07-09
CN113096737B (en) 2023-10-31

Family

ID=76670200

Country Status (1)

Country Link
CN (1) CN113096737B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017053446A2 (en) * 2015-09-21 2017-03-30 The Regents Of The University Of California Pathogen detection using next generation sequencing
WO2018126033A1 (en) * 2016-12-28 2018-07-05 Ascus Biosciences, Inc. Methods, apparatuses, and systems for analyzing microorganism strains in complex heterogeneous communities, determining functional relationships and interactions thereof, and diagnostics and biostate management based thereon
CN110392738A (en) * 2016-12-28 2019-10-29 埃斯库斯生物科技股份公司 Methods, apparatuses, and systems for analyzing microorganism strains in complex heterogeneous communities, determining functional relationships and interactions thereof, and diagnostics and biostate management based thereon
CN108334750A (en) * 2018-04-19 2018-07-27 江苏先声医学诊断有限公司 Metagenomic data analysis method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李林海; 陈丽丹; 肖斌; 孙朝晖: "Application of metagenomic sequencing in pathogen detection for infectious diseases", 传染病信息 (Infectious Disease Information), no. 01 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539369A (en) * 2021-07-14 2021-10-22 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing
CN113539369B (en) * 2021-07-14 2022-03-25 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing
WO2023283967A1 (en) * 2021-07-14 2023-01-19 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing
CN114464253A (en) * 2022-03-03 2022-05-10 予果生物科技(北京)有限公司 Method, system and application for real-time pathogen detection based on long read-length sequencing
CN114464253B (en) * 2022-03-03 2023-03-10 予果生物科技(北京)有限公司 Method, system and application for real-time pathogen detection based on long-read-length sequencing

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant