CN114420209A - Sequencing data-based pathogenic microorganism detection method and system - Google Patents

Sequencing data-based pathogenic microorganism detection method and system

Publication number
CN114420209A
Authority
CN
China
Prior art keywords
kmer
sequencing data
unique
quality control
microorganism detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210308562.7A
Other languages
Chinese (zh)
Inventor
刘卫国
常启鑫
张浩
殷泽坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210308562.7A
Publication of CN114420209A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Evolutionary Computation (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a pathogenic microorganism detection method and system based on sequencing data, belonging to the technical field of biological detection, which addresses the low detection efficiency of existing detection software. The method comprises: a unique kmer generation step, in which a unique kmer of a reference gene is generated; a quality control step, in which the task allocation of the producer-consumer model is re-divided and the sequencing data are preprocessed and quality-controlled to obtain a sequencing data file; and a microorganism detection step, in which the generated unique kmer file and the quality-controlled sequencing data file are used as input files for the pathogenic microorganism detection process. The invention adopts a memory-saving encoding and stores intermediate results on the hard disk to solve the problem of excessive memory usage during unique kmer generation.

Description

Sequencing data-based pathogenic microorganism detection method and system
Technical Field
The invention belongs to the technical field of biological detection, and particularly relates to a pathogenic microorganism detection method and system based on sequencing data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
As an important field of life science, gene sequencing technology has developed rapidly, and second-generation sequencing technology, with its high-throughput capability, is now widely applied in many fields. Compared with first-generation sequencing technology, second-generation sequencing (NGS) adds reversible terminators, sequences while synthesizing, and can sequence hundreds of thousands to millions of DNA molecules in parallel. Building on these characteristics, metagenomic sequencing has developed rapidly.
In 1998, Handelsman et al. proposed the concept of the metagenome, i.e. the sum of all microbial genomes in an environment. Compared with traditional gene sequencing methods, metagenomic sequencing requires no advance preparation, can obtain the virus to be detected directly from the environment, can detect multiple microorganisms in a sample, and can effectively analyze the relationships between different microorganisms and their environment or hosts. Since its first successful application to clinical diagnosis in 2014, metagenomic sequencing has been widely used to detect and identify newly emerging pathogens because of its short detection period, high accuracy, and wide pathogen coverage. Once the gene sequence of a virus has been obtained, using metagenomic technology to detect directly whether a sample contains the target virus is a fast and effective method, and it helps to determine the source of infection at the initial stage of virus transmission and to block that transmission.
In the prior art, the fastv software achieves very good results in microorganism detection and identification, subspecies identification and similar tasks, but its execution time limits its usefulness, and it cannot be applied to detection tasks on large-scale data.
Specifically, the detection of pathogenic microorganisms with fastv currently has the following main problems:
The running efficiency of fastv is poor and its thread scalability is weak, which limits its practical value.
The memory usage of fastv is very high, so it can only process small-scale data and cannot be applied to large-scale data.
Although fastv achieves high accuracy and precision in detecting pathogenic microorganisms, its detection standard still has some problems, which limits its practical value.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a pathogenic microorganism detection method based on sequencing data, which can process large-scale data and achieve higher accuracy and precision.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
In a first aspect, a method for detecting pathogenic microorganisms based on sequencing data is disclosed, comprising:
a unique kmer generation step: generating a unique kmer of a reference gene;
a quality control step: re-dividing the task allocation of the producer-consumer model, and performing preprocessing and quality control on the sequencing data to obtain a quality-controlled sequencing data file;
a microorganism detection step: taking the generated unique kmer file and the quality-controlled sequencing data file as input files to carry out the pathogenic microorganism detection process.
In a further technical scheme, a kmer that appears in only one reference genome and in no other reference genome is called a unique kmer, and the coverage of the unique kmers in the sequencing data is used as the detection standard.
In a further technical scheme, intermediate results generated while generating the unique kmers are stored on the hard disk.
In a second aspect, a pathogenic microorganism detection system based on sequencing data is disclosed, comprising:
a unique kmer generation module configured to: generating a unique kmer of a reference gene;
a quality control module configured to: re-divide the task allocation of the producer-consumer model, and perform preprocessing and quality control on the sequencing data to obtain a quality-controlled sequencing data file;
a microorganism detection module configured to: take the generated unique kmer file and the quality-controlled sequencing data file as input files to carry out the pathogenic microorganism detection process.
The above one or more technical solutions have the following beneficial effects:
The invention adopts a memory-saving encoding and stores intermediate results on the hard disk to solve the problem of excessive memory usage during unique kmer generation. In addition, an efficient implementation increases the running speed of the program and reduces the time consumed by unique kmer generation.
The invention re-divides the task allocation of the producer-consumer model to make full use of the processor's multiple cores, modifies the coding style to reduce the branch misprediction penalty, and vectorizes the core part for parallel processing, thereby improving the processing speed of the program.
The invention uses vectorization to accelerate the detection process. Because a bloom filter is very effective at filtering out data that does not exist in a data set, the bloom filter data structure is adopted to accelerate the query process.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; the exemplary embodiments and their description serve to explain the invention and do not limit it.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
This embodiment discloses a pathogenic microorganism detection method based on sequencing data. It is optimized specifically against the shortcomings of the fastv software: it makes better use of the hardware, improves the execution efficiency of the software, and uses several techniques to reduce memory usage so that large-scale data can be processed. In addition, the invention modifies the detection standard of fastv so that higher accuracy and precision can be achieved. The method comprises an efficient implementation of the whole pathogenic microorganism detection process.
Referring to FIG. 1, the method mainly comprises three steps: unique kmer generation, quality control, and microorganism detection.
Step one, generating a unique kmer:
Before pathogenic microorganisms can be detected, the unique kmers of the reference genes must first be generated. A kmer that occurs in only one reference genome and in no other reference genome is called a unique kmer, and the coverage of the unique kmers in the sequencing data is used as the detection standard. The main problem in generating unique kmers is excessive memory usage: the size of the intermediate result is usually the product of the size of the reference gene and the kmer length. When generating unique kmers for a large input such as a background reference genome, memory usage may exceed 1 TB, which is more than the memory of a typical server.
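The definition of a unique kmer can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy genomes, and the choice to ignore reverse complements are all assumptions made for illustration. It also keeps every kmer in memory, which is exactly the cost the paragraph above describes.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def unique_kmers(genomes, k):
    """Map each genome name to the set of k-mers found in no other genome."""
    owners = defaultdict(set)          # k-mer -> names of genomes containing it
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    result = {name: set() for name in genomes}
    for km, names in owners.items():
        if len(names) == 1:            # seen in exactly one reference genome
            result[next(iter(names))].add(km)
    return result

# Hypothetical toy references; GTAC and TACG occur in both and are dropped.
genomes = {"virus_a": "ACGTACGA", "virus_b": "TTGTACGG"}
uniq = unique_kmers(genomes, k=4)
```

In this toy input, `uniq["virus_a"]` keeps only the three 4-mers that do not occur in `virus_b`, and vice versa.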
To solve this problem, the invention adopts a memory-saving encoding and stores intermediate results on the hard disk.
Specifically, the kmers are first classified according to the minimum value of each kmer. The minimum value of each kmer is calculated, and consecutive kmers with the same minimum value are classified into the same class. At this point the individual kmers are not stored; instead, the string covered by those kmers is stored on the hard disk as the intermediate result, which saves memory while preserving correctness. In a second pass, the intermediate results are read back from the hard disk, and the number of intermediate results processed at a time can be chosen according to the memory limit.
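The grouping step can be sketched as follows. The sketch assumes that the "minimum value" of a kmer is its lexicographically smallest length-m substring (often called a minimizer); the text does not define it, so this reading, the parameter `m`, and the one-file-per-class layout are all hypothetical.

```python
import os
import tempfile

def minimizer(kmer, m):
    """Smallest length-m substring of kmer in lexicographic order (assumed definition)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def super_kmers(seq, k, m):
    """Yield (minimizer, substring) for each run of consecutive k-mers sharing a minimizer."""
    if len(seq) < k:
        return
    start = 0
    current = minimizer(seq[0:k], m)
    for i in range(1, len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if mz != current:
            # k-mers start..i-1 are all covered by seq[start : i - 1 + k]
            yield current, seq[start:i - 1 + k]
            start, current = i, mz
    yield current, seq[start:]

def spill_to_disk(seq, k, m, out_dir):
    """Append each covering substring to one bucket file per minimizer."""
    for mz, sub in super_kmers(seq, k, m):
        with open(os.path.join(out_dir, mz + ".txt"), "a") as fh:
            fh.write(sub + "\n")

runs = list(super_kmers("GTAACGTT", k=5, m=2))
with tempfile.TemporaryDirectory() as tmp:
    spill_to_disk("GTAACGTT", k=5, m=2, out_dir=tmp)
    buckets = sorted(os.listdir(tmp))
```

Here three consecutive 5-mers share the minimizer `AA` and are stored as the single 7-character string `GTAACGT` instead of 15 characters of separate kmers. A second pass could then load one bucket file at a time, sized to the available memory.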
In addition, the invention adopts an efficient implementation and common code optimization techniques, such as using a hash data structure to accelerate queries and using a producer-consumer model to output binary data.
This increases the execution speed of the program and reduces the time consumed by unique kmer generation.
Step two, quality control:
During gene library preparation and sequencing, errors are inevitably introduced because of equipment or operational problems. These errors affect and hinder downstream tasks, so preprocessing and quality control of the sequencing data are essential.
The invention mainly adopts the following quality control methods: primer trimming, base correction, sliding-window quality trimming, tail trimming, preprocessing, duplication evaluation, and overrepresented sequence analysis.
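As an illustration of one of these methods, the sketch below shows a simplified form of sliding-window quality trimming. The text gives no parameters or exact rule, so the window size, the mean-quality threshold, the Phred+33 offset, and the cut-at-first-bad-window rule are assumptions.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality]

def sliding_window_trim(seq, quality, window=4, min_mean=20):
    """Cut the read at the first window whose mean quality falls below min_mean."""
    scores = phred_scores(quality)
    for start in range(max(len(scores) - window + 1, 0)):
        w = scores[start:start + window]
        if sum(w) / window < min_mean:
            return seq[:start], quality[:start]
    return seq, quality        # no low-quality window found (or read shorter than window)

# 'I' encodes Phred 40 and '#' encodes Phred 2 under the assumed offset.
trimmed = sliding_window_trim("ACGTACGT", "IIII####")
```

The window starting at position 3 is the first whose mean drops below 20, so the read is cut there.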
The quality control process improves sequencing accuracy at the software level and can effectively improve accuracy during detection, but it also slows down the whole processing flow.
To this end, the invention uses various features of modern processors and various data structures to accelerate quality control. It re-divides the task allocation of the producer-consumer model to make full use of the processor's multiple cores, modifies the coding style to reduce the branch misprediction penalty, and vectorizes the core part for parallel processing, thereby improving the processing speed of the program.
The whole processing flow of the invention uses a producer-consumer model: the producer provides data, and the consumers perform quality processing and pathogenic microorganism detection after obtaining the data.
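The producer-consumer structure can be sketched with Python's standard `queue` and `threading` modules. This only shows the shape of the model: the chunk size, queue bound, and `process` callback are hypothetical, the re-divided task allocation the invention describes is not specified in the text, and Python threads will not show the multi-core speedup of a native implementation.

```python
import queue
import threading

def run_pipeline(reads, process, n_consumers=4, chunk_size=2):
    """One producer hands out chunks of reads; several consumers process them."""
    tasks = queue.Queue(maxsize=8)     # bounded so the producer cannot run far ahead
    results = []
    lock = threading.Lock()

    def producer():
        for i in range(0, len(reads), chunk_size):
            tasks.put(reads[i:i + chunk_size])
        for _ in range(n_consumers):
            tasks.put(None)            # one stop signal per consumer

    def consumer():
        while True:
            chunk = tasks.get()
            if chunk is None:
                break
            out = [process(r) for r in chunk]   # e.g. quality control + detection
            with lock:
                results.extend(out)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stand-in workload: the "processing" is just measuring read length.
lengths = sorted(run_pipeline(["ACGT", "TTGA", "CC", "G", "AAA"], len))
```

Handing out reads in chunks rather than one at a time reduces queue traffic; results arrive in no fixed order, so they are sorted here before use.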
Step three, detecting pathogenic microorganisms:
The generated unique kmer file and the quality-controlled sequencing data file are taken as input files for the pathogenic microorganism detection process. The final detection result is obtained by comparing the detection output with a set threshold.
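A minimal sketch of coverage-based detection is given below. The text states only that unique kmer coverage is the detection standard and that a threshold is applied; the exact coverage formula (fraction of a genome's unique kmers seen at least once in the reads) and the threshold value of 0.5 are assumptions.

```python
def unique_kmer_coverage(reads, unique_set, k):
    """Fraction of a genome's unique k-mers that appear at least once in the reads."""
    hit = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in unique_set:
                hit.add(km)
    return len(hit) / len(unique_set) if unique_set else 0.0

def detect(reads, unique_by_genome, k, threshold=0.5):
    """Report (coverage, detected?) per genome; the threshold is a hypothetical value."""
    report = {}
    for name, uniq in unique_by_genome.items():
        cov = unique_kmer_coverage(reads, uniq, k)
        report[name] = (cov, cov >= threshold)
    return report

# Hypothetical unique k-mer sets and reads.
unique_by_genome = {
    "virus_a": {"ACGT", "CGTA", "ACGA"},
    "virus_b": {"TTGT", "TGTA", "ACGG"},
}
report = detect(["ACGTAC", "GACGA"], unique_by_genome, k=4)
```

The two reads together contain all three unique kmers of `virus_a` and none of `virus_b`, so only `virus_a` is reported as detected.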
A problem in this process is that loading the unique kmer file is slow: the unique kmer file generated when processing large-scale data is large, so loading it with a single thread is very slow and slows down the whole processing flow.
The invention handles this problem with a producer-consumer model and designs multithreaded lock-free hash insertion to speed up the processing flow.
It should be noted that the producer-consumer model is also used when generating unique kmers and when loading unique kmer data.
During detection, each kmer must be mapped to a 64-bit integer, but a direct mapping hurts the execution efficiency of the program, so the invention designs an encoded mapping to accelerate this step.
Here the four bases are mapped to different values, each base being represented by 2 bits, which makes the mapping faster than the previous mapping method.
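The 2-bit encoding can be sketched as below. The particular assignment A=0, C=1, G=2, T=3 and the rolling update are assumptions; the text says only that the four bases map to different 2-bit values. At 2 bits per base, a 64-bit integer holds kmers of up to 32 bases.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed assignment, 2 bits per base

def encode_kmer(kmer):
    """Pack a k-mer (k <= 32) into an integer that fits in 64 bits."""
    if len(kmer) > 32:
        raise ValueError("a 64-bit integer holds at most 32 bases at 2 bits each")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def rolling_encode(seq, k):
    """Yield the code of every k-mer in seq, reusing the previous code each step."""
    mask = (1 << (2 * k)) - 1              # keeps only the lowest 2*k bits
    value = 0
    for i, base in enumerate(seq):
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield value

single = encode_kmer("ACGT")               # bits 00 01 10 11
codes = list(rolling_encode("ACGTA", 4))   # codes for ACGT and CGTA
```

The rolling form does constant work per position instead of re-encoding all k bases. This sketch raises `KeyError` on any base outside A, C, G, T, such as `N`; how the invention handles such bases is not stated.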
Some statistical information is generated during detection; it helps to assess the condition of the sequencing data and assists in analyzing the results.
The invention uses vectorization to accelerate this process. Because a bloom filter is very effective at filtering out data that does not exist in a data set, the bloom filter data structure is adopted to accelerate the query process.
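The role of the bloom filter as a cheap pre-check before an exact lookup can be sketched as follows. The filter size, number of hash functions, and the use of `hashlib.blake2b` are illustrative choices, not details from the text, and the `lookup` helper is hypothetical.

```python
import hashlib

class BloomFilter:
    """Bit-array filter: no false negatives, a small rate of false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

def lookup(kmer, bloom, table):
    """Consult the exact table only when the filter says the k-mer may be present."""
    if not bloom.might_contain(kmer):
        return None            # definitely absent, exact lookup skipped
    return table.get(kmer)     # still None if the filter gave a false positive

# Hypothetical unique k-mer table.
table = {"ACGT": "virus_a", "TTGA": "virus_b"}
bloom = BloomFilter()
for km in table:
    bloom.add(km)
found = lookup("ACGT", bloom, table)
missing = lookup("GGGG", bloom, table)
```

Most kmers in sequencing reads are not unique kmers of any reference, so rejecting them with a few bit tests avoids most of the more expensive hash-table probes.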
Effect verification: see Table 1.
Table 1. Comparison of fastv and RabbitV results
[Table 1 appears as an image in the original publication and is not reproduced here.]
Note: the marked entries in Table 1 mean that the program could not run because it used too much memory.
Example two
The object of this embodiment is to provide a pathogenic microorganism detection system based on sequencing data, comprising:
a unique kmer generation module configured to: generating a unique kmer of a reference gene;
a quality control module configured to: re-divide the task allocation of the producer-consumer model, and perform preprocessing and quality control on the sequencing data to obtain a sequencing data file;
a microorganism detection module configured to: take the generated unique kmer file and the quality-controlled sequencing data file as input files to carry out the pathogenic microorganism detection process.
In the unique kmer generation module, the kmers are classified according to the minimum value of each kmer: the minimum value of each kmer is calculated, and consecutive kmers with the same minimum value are classified into the same class;
when storing, the individual kmers are not stored; instead, the string covered by the kmers is stored on the hard disk as the result.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

Claims (10)

1. A pathogenic microorganism detection method based on sequencing data, characterized by comprising the following steps:
a unique kmer generation step: generating a unique kmer of a reference gene;
a quality control step: re-dividing the task allocation of the producer-consumer model, and performing preprocessing and quality control on the sequencing data to obtain a sequencing data file;
a microorganism detection step: taking the generated unique kmer file and the quality-controlled sequencing data file as input files to carry out the pathogenic microorganism detection process.
2. The method for detecting pathogenic microorganisms based on sequencing data of claim 1, wherein kmers that occur only in a certain reference genome but not in other reference genomes are called unique kmers, and the coverage of the unique kmers in the sequencing data is used as a detection standard.
3. The method of claim 1, wherein intermediate results generated during the generation of the unique kmer are stored in a hard disk.
4. The sequencing-data-based pathogenic microorganism detection method of claim 1, wherein generating the unique kmer of the reference gene comprises: classifying the kmers according to the minimum value of each kmer, by calculating the minimum value of each kmer and classifying consecutive kmers with the same minimum value into the same class.
5. A method for pathogenic microorganism detection based on sequencing data according to claim 3, characterized in that the generated intermediate results are stored in a hard disk, specifically: the individual kmers are not stored, and the string covered by the kmers is stored in the hard disk as an intermediate result.
6. The method for detecting pathogenic microorganisms based on sequencing data according to claim 3 or 5, further comprising: reading the intermediate results from the hard disk, and determining the number of intermediate results processed each time according to the memory limit.
7. The method of claim 1, wherein when preprocessing and quality control are performed on the sequencing data, the task allocation of the producer-consumer model is re-divided to fully utilize the multiple cores of the processor;
and the coding style is modified to reduce the branch misprediction penalty, and the core part of the preprocessing and quality control of the sequencing data is processed in parallel by vectorization.
8. The method for detecting pathogenic microorganisms based on sequencing data according to claim 1, wherein in the microorganism detection step the kmer is encoded and mapped as follows: the four bases are mapped to different numbers, each base being represented by 2 bits.
9. A pathogenic microorganism detection system based on sequencing data, characterized by comprising:
a unique kmer generation module configured to: generate a unique kmer of a reference gene;
a quality control module configured to: re-divide the task allocation of the producer-consumer model, and perform preprocessing and quality control on the sequencing data to obtain a sequencing data file;
a microorganism detection module configured to: take the generated unique kmer file and the quality-controlled sequencing data file as input files to carry out the pathogenic microorganism detection process.
10. The sequencing-data-based pathogenic microorganism detection system of claim 9, wherein the unique kmer generation module is configured to: classify the kmers according to the minimum value of each kmer, by calculating the minimum value of each kmer and classifying consecutive kmers with the same minimum value into the same class;
when storing, the individual kmers are not stored, and the string covered by the kmers is stored in the hard disk as the result.
CN202210308562.7A 2022-03-28 2022-03-28 Sequencing data-based pathogenic microorganism detection method and system Pending CN114420209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308562.7A CN114420209A (en) 2022-03-28 2022-03-28 Sequencing data-based pathogenic microorganism detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210308562.7A CN114420209A (en) 2022-03-28 2022-03-28 Sequencing data-based pathogenic microorganism detection method and system

Publications (1)

Publication Number Publication Date
CN114420209A true CN114420209A (en) 2022-04-29

Family

ID=81264444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308562.7A Pending CN114420209A (en) 2022-03-28 2022-03-28 Sequencing data-based pathogenic microorganism detection method and system

Country Status (1)

Country Link
CN (1) CN114420209A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115064218A (en) * 2022-08-17 2022-09-16 中国医学科学院北京协和医院 Method and device for constructing pathogenic microorganism data identification platform

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021997A (en) * 2016-05-17 2016-10-12 杭州和壹基因科技有限公司 Third-generation PacBio sequencing data comparison method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAO ZHANG ET AL.: "RabbitV fast detection of viruses and microorganisms in sequencing data on multi-core architectures", 《BIOINFORMATICS》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220429