CN114242173A

CN114242173A - Data processing method, device and storage medium for identifying microorganisms by using mNGS

Info

Publication number: CN114242173A
Application number: CN202111579973.1A
Authority: CN
Inventors: 黄毅; 杨振宇; 刘久成; 林小芳; 张丹; 易鑫; 杨玲
Original assignee: Shenzhen Guiinga Medical Laboratory
Current assignee: Shenzhen Guiinga Medical Laboratory
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-03-25
Anticipated expiration: 2041-12-22
Also published as: CN114242173B

Abstract

The application discloses a data processing method, a data processing device and a storage medium for identifying microorganisms by mNGS. The data processing method comprises: loading a database using the memory-mapped /dev/shm provided by the Linux system; before reading the database, checking the size of the database and, if it is smaller than the size of the database originally stored on the hard disk, activating the database by virtual memory touch so that the loaded database is completely cached in memory; loading a reference genome by memory mapping; and using Linux pipelines for output and input, so that temporary files are reduced and the analysis speed is increased. By optimizing the rate-limiting steps of the data processing, the method improves the speed and efficiency of identifying microorganisms by mNGS and reduces the dependence on high-performance hardware, so that microorganisms can be analyzed and identified by mNGS quickly, efficiently and accurately using only conventional hardware.

Description

Data processing method, device and storage medium for identifying microorganisms by using mNGS

Technical Field

The application relates to the technical field of microbial metagenome sequencing detection, in particular to a data processing method, a data processing device and a storage medium for identifying microbes by using mNGS.

Background

With the progress of sequencing technologies and the reduction of costs, more and more microorganisms are being sequenced, for example in the Human Microbiome Project (HMP), the human intestinal metagenome project MetaHIT (2008), the American Gut Project (AGP, 2012), and the Chinese microbiome initiative CAS-CMI (2017). The large number of species makes classification databases larger and larger, which challenges data analysis, especially the detection of pathogenic microorganisms, where turnaround time matters most.

Metagenomic sequencing, particularly metagenomic next-generation sequencing (abbreviated as mNGS), refers to directly extracting the nucleic acids of all microorganisms from an environmental or host sample, constructing a metagenomic library, and sequencing it with next-generation sequencing technology. Identifying microorganisms by metagenomic sequencing means using the metagenomic sequencing data directly to detect or identify the microorganisms carried by the environment or the host. Generally, the identification comprises the following steps: metagenomic sequencing, downloading the data, removing adapters, removing host sequencing data using a host reference sequence, classifying the sequences using a classification sequence library, annotating the classified data using a microorganism knowledge base, and finally filtering the results to obtain an interpretation report. Of these, the three steps of adapter removal, host sequence removal and sequence classification usually require a large amount of intensive computation.

mNGS sequences the extracted nucleic acids indiscriminately, whereas only the non-host sequences are of interest for detecting microorganisms that infect humans; therefore, the host sequences need to be removed during bioinformatics analysis, and species identification is performed only on the sequences that remain after host removal. For sequence classification, taking the Kraken2 software as an example, the official standard genome reference library includes archaea, bacteria, viruses, plasmids, the human host and vectors (https://benlangmead.github.io/aws-indexes/k2) and amounts to about 50.1 GB (gigabytes); it grows to about 53.2 GB once protozoa and some fungi are added, and to approximately 90 GB once other eukaryotes, including parasites, are added, covering about 16,000 microorganisms. Because sequence queries against such a large database require fast reads, Kraken2 loads the database into memory to accelerate reading, which requires creating a RAM disk and then loading the database onto it.

To increase the analysis speed of mNGS microbial identification, developers and enterprises have turned to GPUs or FPGAs to accelerate computation, for example MetaCache-GPU and terra-BLAST, but these approaches typically require purchasing and deploying new hardware.

Therefore, how to rapidly, efficiently and accurately perform metagenomic sequencing and identify microorganisms on the basis of the conventional hardware equipment is a problem to be solved urgently.

Disclosure of Invention

It is an object of the present application to provide an improved method, apparatus and storage medium for data processing for identifying microorganisms by mNGS.

In order to achieve the purpose, the following technical scheme is adopted in the application:

a first aspect of the present application discloses a data processing method for identifying a microorganism by a mNGS, comprising the steps of:

a database loading step, which comprises loading a database for identifying microorganisms using the memory-mapped /dev/shm provided by the Linux system; /dev/shm of the Linux system is a tmpfs file system, i.e., a temporary file system, for which all users have read and write permission and whose maximum writable size is half of the physical memory of the system;

a database checking step, which comprises checking the size of the database before reading it and, if the size is smaller than that of the originally loaded database, activating the database by virtual memory touch so that the loaded database is completely cached in memory;

a database comparison step, which comprises loading a reference genome by memory mapping, caching the reference index once when multiple alignments run simultaneously, and sharing the loading process and its result among the alignment processes; when a new alignment process is added, checking the index and, if it is already loaded in memory or is being loaded, accessing it by memory address or using it once loading is finished, without loading it again; when all parallel alignment processes are finished, managing the reference index automatically and releasing it from the cache when it is no longer actively accessed;

and a data transmission step, which comprises using Linux pipelines for output and input, so that fewer temporary files are generated and the analysis speed is increased.

It should be noted that, by using /dev/shm, the data processing method of the present application not only improves the database read speed but also removes the need for root privileges, which makes it convenient for ordinary users. The database checking step solves the stalls caused when the system automatically moves an inactive database, or part of it, to the hard disk, further improving the detection speed. The memory mapping in the database comparison step speeds up the alignment of a single sample and allows more samples to be aligned simultaneously under the same conditions, improving detection efficiency. Output and input through Linux pipelines reduce the temporary or intermediate files generated during analysis, so that writing and reading such files does not slow the analysis, further improving the detection speed. In short, the data processing method optimizes the key rate-limiting steps in the data processing of microorganism identification by metagenomic sequencing, improves the speed and efficiency of identification, and reduces the dependence on high-performance hardware, so that microorganisms can be analyzed and identified by metagenomic sequencing quickly, efficiently and accurately using only conventional hardware, for example 187 GB of memory and a 64-core CPU.

It can be understood that the key point of the present application is to optimize the key steps that limit the data processing speed. For the general steps of identifying microorganisms by metagenomic sequencing, such as high-throughput sequencing and quality control of the raw sequencing output, reference can be made to the prior art; the data processing method of the present application can also be adopted in other steps that involve database loading, alignment, data transmission and the like.

In one implementation of the present application, the database loading step further includes, before loading the database to /dev/shm, requesting in advance a memory space larger than the database and releasing the content cached by the system, so as to ensure that the database can be completely loaded to /dev/shm.

It should be noted that requesting in advance a memory space larger than the database serves the same purpose as the database checking step: both ensure that the database is completely cached in memory and avoid the stall caused by part of the database being moved to the hard disk.

In one implementation of the present application, the data processing method further includes a homologous region labeling step, which includes splitting the host reference genome in the database into short sequences to form a short sequence library; the genome of each eukaryote whose homologous regions are to be determined is aligned against the short sequence library, the regions that match the short sequence library are marked, and each contiguous marked region is replaced with "N" bases.

Preferably, the homologous region labeling step further comprises converting A, C, G and T in the sequence into binary digits, storing each short sequence as an unsigned integer, and preloading the short sequences into memory.

Preferably, the short sequence is 31bp in length.

In the present application, the key of the homologous region labeling step is to use short sequences (k-mers) to mark with the base "N" the regions of protist and fungal genomes that are homologous to the host, and to exclude the regions marked "N" from classification, which reduces false positive detections caused by the host, such as the common Toxoplasma gondii false positive. It is understood that converting A, C, G and T into binary digits and using a short sequence length of 31 bp are specific choices in one implementation of the present application; other ways of encoding each base, or short sequences of other lengths, may be used as desired.
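The labeling idea above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the 2-bit code table, the function names, the exact-match rule and the use of a Python set as the preloaded library are all assumptions made for illustration, and reverse-complement matching is left out. With 2 bits per base, a 31 bp short sequence occupies 62 bits and so fits in a 64-bit unsigned integer.

```python
# Sketch only: 2-bit k-mer encoding and "N" masking of host-homologous regions.
# ENCODE, the function names and the exact-match rule are illustrative assumptions.

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a k-mer into a non-negative integer, 2 bits per base.

    Returns None if the k-mer contains a base other than A, C, G or T.
    """
    value = 0
    for base in kmer.upper():
        code = ENCODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_kmer_library(host_sequence, k=31):
    """Split the host reference into overlapping k-mers held in an in-memory set."""
    library = set()
    for start in range(len(host_sequence) - k + 1):
        code = encode_kmer(host_sequence[start:start + k])
        if code is not None:
            library.add(code)
    return library


def mask_homologous(genome, library, k=31):
    """Replace every position covered by a k-mer found in the library with 'N'."""
    masked = list(genome)
    for start in range(len(genome) - k + 1):
        # Always look up k-mers of the original genome, not of the masked copy.
        code = encode_kmer(genome[start:start + k])
        if code is not None and code in library:
            masked[start:start + k] = "N" * k
    return "".join(masked)
```

For example, with k=4 and the toy host sequence "ACGTACGTAC", the toy genome "TTTTACGTTTTT" would have its two overlapping matches ("TACG" and "ACGT") merged into a single run of five "N" bases.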

In one implementation of the present application, the data processing method further includes a secondary alignment step, which includes, for each species whose preliminarily identified sequence support number meets the requirement, aligning all sequences assigned to the selected species to the genome reference sequence of that species, species by species, so as to accurately obtain the sequence support number of the selected species in the sequencing sample and from it calculate the coverage, depth distribution and dispersion of the species.

Preferably, the coverage is the ratio of the total length of the regions of the selected species' genome covered at 1× or more to the genome size L. When multiple genome versions of the selected species exist, the length of the longest genome, $L_{max}$, is used, and the covered positions are counted from the actual alignment positions $P_i$ on each genome; for species with multiple genome versions, the coverage C obtained is therefore an estimate, i.e., $C_{approx} = \sum_i P_i / L_{max}$.

Preferably, the dispersion is the ratio of the number n of reference genome windows covered by the sequences supporting the selected species to the total number of windows N, i.e., $D = n/N$.

Preferably, the requirement on the sequence support number is as follows: for parasites, the support number must be greater than or equal to 100, and for other species, the support number must be greater than or equal to 10.

Existing software and methods for identifying microorganisms by metagenomic sequencing cannot provide information such as the sequencing depth and genome coverage of the classified species. The present application creatively proposes a secondary alignment of the preliminarily identified species to accurately obtain the number of fragments of each species in the sequencing sample, from which the abundance of the species is calculated and information such as coverage, depth distribution and dispersion is obtained, which can greatly improve the classification accuracy.
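The quantities defined for the secondary alignment step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the half-open interval representation of aligned regions, and the window size passed to the dispersion function are not given in the application and are chosen here only for the example.

```python
import math


def coverage(covered_intervals, genome_length):
    """Fraction of the genome covered at 1x or more.

    covered_intervals holds half-open (start, end) aligned regions, which may overlap.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(covered_intervals):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / genome_length


def approx_coverage(covered_bases_per_genome, genome_lengths):
    """C_approx = sum(P_i) / L_max for a species with several genome versions."""
    return sum(covered_bases_per_genome) / max(genome_lengths)


def dispersion(read_starts, genome_length, window_size):
    """D = n / N: share of fixed-size windows that contain at least one read start."""
    total_windows = math.ceil(genome_length / window_size)
    hit_windows = {position // window_size for position in read_starts}
    return len(hit_windows) / total_windows


def passes_support(support_count, is_parasite):
    """Support threshold from the text: >= 100 for parasites, >= 10 otherwise."""
    return support_count >= (100 if is_parasite else 10)
```

A species would be passed to the secondary alignment only if `passes_support` holds for its preliminary read count; the three metrics are then computed from the alignment positions on its own reference genome.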

A second aspect of the present application discloses a data processing apparatus for identifying microorganisms by mNGS, the apparatus comprising a memory and a processor; the memory is configured to store a program; the processor is configured to execute the program stored in the memory to implement the data processing method for identifying microorganisms by mNGS of the present application.

A third aspect of the present application discloses a computer-readable storage medium having stored therein a program executable by a processor to implement the data processing method of the present application for identifying a microorganism by a mNGS.

Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:

The data processing method for identifying microorganisms by mNGS of the present application optimizes the key rate-limiting steps of the data processing, thereby improving the speed and efficiency of identifying microorganisms by metagenomic sequencing and reducing the dependence on high-performance hardware, so that microorganisms can be analyzed and identified by mNGS quickly, efficiently and accurately using only conventional hardware.

Drawings

FIG. 1 is a graph of the statistical results of analysis speed tests performed on 20 samples of SE50 sequencing data with an average data size of 34M in the examples of the present application.

Detailed Description

The present application will be described in further detail below with reference to the accompanying drawings by way of specific embodiments. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other devices, materials, methods, etc. in various instances. In some instances, certain operations related to the present application have not been shown or described in detail in this specification in order to avoid obscuring the core of the present application from excessive description, and a detailed description of such related operations is not necessary for those skilled in the art, and the related operations will be fully understood from the description in the specification and the general knowledge of the art.

In identifying microorganisms by mNGS, that is, by metagenomic sequencing, the many types of microorganisms involved and the large data volume greatly limit the speed and efficiency of identification, so that microbial information cannot be obtained by rapid analysis.

By analyzing the rate-limiting links of the mNGS analysis process, the present application improves the analysis speed of mNGS on conventional hardware by reducing writes and reads, sharing the database loaded into memory, multithreading and the like. Both the analysis speed of a single sample and the parallel analysis speed of multiple samples are improved by applying speed-up measures such as memory caching, pipeline transmission, RAM disk loading and cleanup, and memory sharing to the input and output of the rate-limiting links of the data processing.

Based on the research and the discovery, the application creatively provides a data processing method for identifying the microorganisms by the mNGS, which comprises a database loading step, a database checking step, a database comparison step and a data transmission step.

The database loading step comprises loading the database for identifying microorganisms using the memory-mapped /dev/shm provided by the Linux system.

It should be noted that the RAM disk mounting method recommended by Kraken2, mkdir /ramdisk && mount -t ramfs none /ramdisk, requires root privileges and is inconvenient for ordinary users. In addition, once that method has mounted part of the memory as a RAM disk, that part of the memory cannot be used directly by other processes, which wastes resources. The present application creatively loads the database using the memory-mapped /dev/shm provided by the Linux system: all users have read and write permission on this directory, its maximum writable size is half of the physical memory of the system, and even if the directory is filled completely, the system will not crash. Since kernel version 2.6 (the kernel version can be obtained with the uname -r command), Linux has supported using /dev/shm as a RAM-disk-style shared memory whose default size is half of the system's physical memory. Memory is markedly faster to read and write than hard disk storage: the read-write speed of DDR3 memory is about 10 GB/s, about 100 times that of a mechanical hard disk, and the read-write speed of DDR4 memory is about 500-1000 times that of a mechanical hard disk. Using /dev/shm instead of /tmp therefore improves analysis performance, and inter-process communication (IPC) through files is supported. In one implementation of the present application, /dev/shm in the test environment is 94 GB and the Kraken2 database including eukaryotes is 90 GB, so the whole database can be loaded into /dev/shm; it is deleted after the classification step is completed.
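The loading step can be sketched in Python as follows. This is an illustration, not the code of the application: the function names are hypothetical, and the check against the free space of the tmpfs mount is an assumption added so that a database larger than the available space is rejected before copying starts. The `shm_root` parameter defaults to /dev/shm but can point to any directory.

```python
import os
import shutil


def load_database(db_dir, shm_root="/dev/shm"):
    """Copy a database directory into a tmpfs mount if it fits; return the new path."""
    needed = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(db_dir)
        for name in files
    )
    free = shutil.disk_usage(shm_root).free
    if needed > free:
        raise MemoryError(
            f"database needs {needed} bytes, only {free} free in {shm_root}"
        )
    target = os.path.join(shm_root, os.path.basename(os.path.normpath(db_dir)))
    shutil.copytree(db_dir, target)  # no root privileges are required for /dev/shm
    return target


def unload_database(target):
    """Delete the in-memory copy once the classification step has finished."""
    shutil.rmtree(target)
```

The classifier would then be pointed at the returned path instead of the on-disk database, and `unload_database` would be called after classification to hand the memory back to the system.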

The database checking step comprises checking the size of the database before reading it and, if the size is smaller than that of the originally loaded database, activating the database by virtual memory touch so that the loaded database is completely cached in memory.

It should be noted that /dev/shm is managed by the system: if a file in it is inactive, the system moves part of the file to the hard disk, i.e., to swap space, and releases the memory to cache other active content. If a large amount of other content is cached in system memory, part of the Kraken2 database may also be placed in swap space on the hard disk while it is being copied to /dev/shm. In this case, reading the Kraken2 database in /dev/shm stalls: the classify module of Kraken2 drops to a read speed close to that of the hard disk, and classification appears to hang. For this situation, the present application designs two solutions. In the first, before the Kraken2 database is copied, a program requests a memory space larger than the database, which releases the content cached by the system, and the database is then copied, so that the database is not partially moved to the hard disk. In the second, the size of the database is checked before the classify module of Kraken2 reads it; if the size is smaller than that of the original database, part of the database cached in /dev/shm is actually in swap space on the hard disk, and the database is activated by virtual memory touch (vmtouch; https://github.com/hoytech/vmtouch), which reloads the swapped-out part into memory so that the complete database is cached in memory for classification.
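The second solution can be sketched in Python as follows. The sketch is a stand-in for vmtouch rather than a call to it: it touches a file by reading one byte from every page through a memory mapping, which makes the kernel bring each page back into memory. The function names are hypothetical, and how the resident size is measured is not specified here, so it is passed in by the caller as an assumption.

```python
import mmap
import os


def touch_pages(path):
    """Read one byte from every page of the file so the kernel faults it into RAM.

    Returns the number of pages touched.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0
    touched = 0
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for offset in range(0, size, mmap.PAGESIZE):
                view[offset]  # the read itself is what pulls the page in
                touched += 1
    return touched


def ensure_cached(path, resident_bytes, original_bytes):
    """Touch the file only when less of it is resident than the original size.

    resident_bytes must be measured by the caller (assumption of this sketch).
    Returns True if the file was touched.
    """
    if resident_bytes < original_bytes:
        touch_pages(path)
        return True
    return False
```

Calling `ensure_cached` on each database file immediately before classification would correspond to the check described above; when the database is already fully resident the call does nothing.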

In the first method, the program that requests the memory must prevent the allocation call malloc or calloc from being optimized away by the compiler, so an attribute that disables compiler optimization needs to be added before the relevant function, for example GCC: __attribute__((optimize("O0"))); Clang: __attribute__((optnone)).
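The first solution is described for a C program; the same idea can be sketched in Python, where no compiler attribute is involved. The sketch below allocates a block and writes one byte into every page so that the memory is actually backed by physical pages before it is released. The function name is hypothetical, and whether this displaces the system's cached content depends on the kernel and the current memory pressure, so it should be read as an illustration of the idea only.

```python
import mmap


def reserve_and_release(num_bytes):
    """Allocate num_bytes, write to every page so it is physically backed, then free it.

    Returns the number of pages that were written.
    """
    block = bytearray(num_bytes)
    for offset in range(0, num_bytes, mmap.PAGESIZE):
        block[offset] = 1  # writing forces the page to be committed
    pages = (num_bytes + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    del block  # hand the memory back before the database is copied
    return pages
```

In the workflow described above, such a call would be made with a size slightly larger than the database, immediately before the database is copied to /dev/shm.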

The database comparison step comprises loading a reference genome by memory mapping, caching the reference index once when multiple alignments run simultaneously, and sharing the loading process and its result among the alignment processes; when a new alignment process is added, checking the index and, if it is already loaded in memory or is being loaded, accessing it by memory address or using it once loading is finished, without loading it again; and when all parallel alignment processes are finished, managing the reference index automatically and releasing it from the cache when it is no longer actively accessed.

It should be noted that, taking the host sequence removal step as an example, each bwa-mem2 mem process independently reads the host reference genome into memory and then aligns the short sequences. For example, the human reference genome index file is about 16 GB, and host-removal alignment of 20M sequencing reads occupies about 32 GB of memory, of which the reference genome index accounts for half. With 187 GB of memory, at most 6 host-removal tasks can be executed in parallel; in practice, the system occupies part of the memory, and other processing, such as deduplication and sorting of the alignments with samtools, may also occupy memory, so fewer than 6 tasks can run in parallel. In one implementation of the present application, only 3-4 alignment tasks could run in parallel in actual testing.

The present application creatively proposes sharing among processes the reference genome index that the different processes need to read, which saves memory and allows more alignment tasks to run: theoretically at most 10 in parallel, since n × 16 + 16 ≤ 187 gives n = 10. On the basis of the bwa-mem2 code, the application replaces the way the reference genome is loaded, changing the original read into memory to memory mapping (mmap), so that when multiple bwa-mem2 mem instances run simultaneously, the system caches the reference index once and the processes share the loading process and its result. When a new process is added, it first checks the index; if the index is already loaded in memory or is being loaded, the process accesses it by memory address, or uses it once loading is finished, without loading it again. Finally, when the parallel bwa-mem2 mem processes finish, the system manages the reference index automatically and releases it from the cache when it is no longer actively accessed. With this scheme, in one implementation of the present application, host-removal analysis of 8-10 samples can run in parallel in the same system environment.
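The effect of replacing a private read with a memory mapping can be illustrated with a short Python sketch. It is not the modified bwa-mem2 code: the function name is hypothetical and the index is treated as an opaque file. A read-only mapping is backed by the kernel's page cache, so several processes that map the same file share one physical copy of its pages instead of each holding a private buffer, and the kernel may drop those pages once nothing is using them.

```python
import mmap


def open_shared_index(path):
    """Map an index file read-only.

    The returned object supports slicing like bytes, but its pages live in the
    shared page cache rather than in a private copy owned by this process.
    """
    with open(path, "rb") as handle:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
```

Each alignment worker would call `open_shared_index` on the same index file; the first access to a page loads it from storage, and later workers find it already cached.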

It should be noted that, in one implementation of the present application, trimadap, which supports writing to standard output, is used to remove the adapter ($adapter) from the input file ($input.fq); sdust is used instead of the NCBI tool DustMasker, which does not support writing to standard output, to mark low-complexity sequences, i.e., to mark them as "N", and an option (-d) is added to sdust so that the marked sequences are written to standard output; the result is passed through a Linux pipeline to the alignment software bwa-mem2, and the sequences aligned to the host reference genome ($reference.fa) are removed to obtain the host-removed sequences ($output.fa). Because Linux pipelines are fully used, fewer temporary files are generated, unnecessary files are not written out, and the detection speed is increased.
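The chaining of tools through pipes can be sketched in Python with the standard subprocess module. The helper below is a hypothetical illustration: it takes a list of command lines and connects the standard output of each stage to the standard input of the next, so that no intermediate file is written. The actual command lines of trimadap, sdust and bwa-mem2 are deliberately not reproduced here, since their exact options are not given in full in the text.

```python
import subprocess


def run_pipeline(stages, input_path):
    """Run a non-empty list of argv lists as a chain connected by OS pipes.

    The first stage reads input_path on stdin; the stdout of the last stage
    is returned as bytes. Raises RuntimeError if any stage exits non-zero.
    """
    processes = []
    with open(input_path, "rb") as source:
        upstream = source
        for argv in stages:
            process = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
            if upstream is not source:
                upstream.close()  # the child holds its own copy of the pipe end
            upstream = process.stdout
            processes.append(process)
    output = upstream.read()
    upstream.close()
    codes = [process.wait() for process in processes]
    if any(codes):
        raise RuntimeError(f"pipeline stage failed, exit codes: {codes}")
    return output
```

In the workflow described above, the stages would be the adapter trimming, low-complexity marking and alignment commands in that order, with the host-removed sequences taken from the output of the last stage.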

Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated on a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above methods may be implemented.

Accordingly, another implementation of the present application provides a data processing apparatus for identifying microorganisms by mNGS, the apparatus comprising a memory and a processor; the memory is configured to store a program; the processor is configured to execute the program stored in the memory to implement the following method: a database loading step, which comprises loading a database for identifying microorganisms using the memory-mapped /dev/shm provided by the Linux system; a database checking step, which comprises checking the size of the database before reading it and, if the size is smaller than that of the originally loaded database, activating the database by virtual memory touch so that the loaded database is completely cached in memory; a database comparison step, which comprises loading a reference genome by memory mapping, caching the reference index once when multiple alignments run simultaneously, and sharing the loading process and its result among the alignment processes; when a new alignment process is added, checking the index and, if it is already loaded in memory or is being loaded, accessing it by memory address or using it once loading is finished, without loading it again; when all parallel alignment processes are finished, managing the reference index automatically and releasing it from the cache when it is no longer actively accessed; and a data transmission step, which comprises using Linux pipelines for output and input, so that fewer temporary files are generated and the analysis speed is increased.

Another implementation further provides a computer-readable storage medium including a program, the program being executable by a processor to perform a method comprising: a database loading step, which comprises loading a database for identifying microorganisms using the memory-mapped /dev/shm provided by the Linux system; a database checking step, which comprises checking the size of the database before reading it and, if the size is smaller than that of the originally loaded database, activating the database by virtual memory touch so that the loaded database is completely cached in memory; a database comparison step, which comprises loading a reference genome by memory mapping, caching the reference index once when multiple alignments run simultaneously, and sharing the loading process and its result among the alignment processes; when a new alignment process is added, checking the index and, if it is already loaded in memory or is being loaded, accessing it by memory address or using it once loading is finished, without loading it again; when all parallel alignment processes are finished, managing the reference index automatically and releasing it from the cache when it is no longer actively accessed; and a data transmission step, which comprises using Linux pipelines for output and input, so that fewer temporary files are generated and the analysis speed is increased.

Examples

This example uses 20 samples of SE50 sequencing data with an average data size of 34M, processed according to the optimized data processing method for identifying microorganisms of this example. Data processing and analysis in this example run on a Linux system with 187 GB of memory and a 64-core CPU. The optimization in this example mainly covers key steps such as memory caching, pipeline transmission, RAM disk loading and cleanup, and memory sharing, as detailed below:

(1) Use of /dev/shm

Since kernel version 2.6 (the kernel version can be obtained with the uname -r command), Linux has supported using /dev/shm as a RAM-disk-style shared memory whose default size is half of the system's physical memory. Memory is markedly faster to read and write than hard disk storage: the read-write speed of DDR3 memory is about 10 GB/s, about 100 times that of a mechanical hard disk, and the read-write speed of DDR4 memory is about 500-1000 times that of a mechanical hard disk. Using /dev/shm instead of /tmp therefore improves analysis performance, and inter-process communication (IPC) through files is supported.

In this example, the database is loaded using the memory-mapped /dev/shm provided by the Linux system; all users have read and write permission on this directory, and even if the directory is filled completely, the system will not crash. In the test environment of this example, /dev/shm is 94 GB and the Kraken2 database including eukaryotes is 90 GB, so the whole database can be loaded into /dev/shm; it is deleted after the classification step is completed.

(2) Preventing Swap from hard disk by pre-applying memory

The/dev/shm is managed by the system, if the file in the file is in an inactive state, the system caches part of the file in a hard disk, namely swapspace cache space, and the released memory caches other active contents. Copying the Kraken2 database to/dev/shm may also occur when a large amount of other content is cached in system memory, with part of the content being placed in the hard disk cache space. In this case, if the Kraken2 database in Kraken/shm is read, a stuck condition occurs, and the class module class of Kraken2 enters a read speed close to the hard disk, and the class falls into a pause. For the situation, the first method is to apply for a memory space larger than the size of the database by a program before copying the Kraken2 database, release the content cached by the system, and then copy the database to avoid the situation that the database is partially cached to the hard disk; the second method is that the size of the database is checked before the classsify module of Kraken2 reads the database, if the size is smaller than the size of the original database, it indicates that part of the database cached in/dev/shm is actually in the hard disk cache, and the database cached in the hard disk is reloaded into the memory by activating the database in a virtual memory touch mode (vmtouch; https:// githu. com/hoytech/vmtouch), so as to obtain the complete database cached in the memory for classified use.

In the first method, the memory request made with malloc or calloc must not be optimized away by the compiler, so an attribute that disables optimization has to be added to the requesting code, for example __attribute__((optimize("O0"))) for GCC or __attribute__((optnone)) for clang.
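The idea behind the first method, requesting memory and making sure the request is really honored, can be illustrated in Python, where no compiler attribute is involved. The sketch below is an assumption-laden stand-in for the C program mentioned in the text: it maps anonymous memory and writes one byte per page so that the kernel has to back every page; the function name is hypothetical.

```python
import mmap


def preallocate(n_bytes):
    """Map n_bytes of anonymous memory and touch every page.

    Writing one byte per page forces the kernel to provide real memory,
    which is the effect the unoptimized malloc/calloc call aims for.
    """
    buf = mmap.mmap(-1, n_bytes)
    for offset in range(0, n_bytes, mmap.PAGESIZE):
        buf[offset] = 1
    return buf
```

Closing the returned buffer releases the memory again, after which the database can be copied.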

(3) Memory address mapping

To make full use of memory resources, cached files are shared between processes so that more tasks can run in parallel under a limited memory configuration. Take the host-removal step as an example. A single bwa-mem2 mem process independently reads the host reference genome into memory and then aligns the short reads. The human reference genome index file, for example, is about 16 GB, and host alignment of 20M sequencing reads occupies about 32 GB of memory, of which the reference genome index accounts for half. With 187 GB of memory, at most 6 host-removal tasks can run in parallel. In practice the system occupies part of the memory, and other processes that handle the output, such as deduplication and sorting with samtools, may occupy memory as well, so fewer than 6 tasks can actually run in parallel; in this example the tested number is 3-4. If the reference genome index that the different processes need to read is shared between them, memory is saved and more alignment tasks can run. In theory at most 10 tasks can run in parallel: n × 16 + 16 = 187 gives n ≈ 10.
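The memory arithmetic above can be checked with a few lines of Python. The function names are hypothetical, and the figures (187 GB total, 16 GB index, 16 GB of per-task working memory) are the ones given in the text.

```python
def ratio_without_sharing(total_gb, per_task_gb):
    """Memory divided by per-task footprint when every task loads its own index."""
    return total_gb / per_task_gb


def max_tasks_with_shared_index(total_gb, index_gb, working_gb):
    """Largest n satisfying n * working_gb + index_gb <= total_gb."""
    return (total_gb - index_gb) // working_gb


# 187 / 32 is about 5.8, the bound the text gives as "at most 6".
private_bound = ratio_without_sharing(187, 32)

# With one shared 16 GB index: n * 16 + 16 <= 187, so n = 10.
shared_bound = max_tasks_with_shared_index(187, 16, 16)
```

The shared case counts the index once, which is why the bound nearly doubles.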

In this embodiment, the way the reference genome is loaded is replaced in the bwa-mem2 code: instead of reading the index into each process's own memory, memory mapping (mmap) is used. When several bwa-mem2 mem instances run at the same time, the system caches the reference index once, and the processes share both the loading and its result. A new process first checks the index; if the index is already in memory or is being loaded, the new process accesses it by memory address, or uses it once loading has finished, and does not load it again. When the parallel bwa-mem2 mem processes finish, the system manages the reference index automatically and releases it from the cache once it is no longer actively accessed.

The bwa-mem2 mem modified in this way can run host-removal analysis for 8-10 samples in parallel in the above system environment.
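The memory-mapping approach can be illustrated with Python's mmap module. This is only a sketch of the general technique, not the modified bwa-mem2 code; the function name is hypothetical. A read-only mapping is served from the operating system's page cache, so several processes mapping the same index file share one cached copy instead of each holding a private one.

```python
import mmap


def map_index(path):
    """Map a file read-only and return the mapping.

    The pages come from the shared page cache; nothing is copied into
    private process memory until it is actually accessed.
    """
    with open(path, "rb") as handle:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
```

Each alignment process would call map_index on the same file and read the index by offset, leaving caching and eviction to the kernel.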

(4) Output through Linux pipes

Writing and reading temporary or intermediate files during analysis slows the analysis down. This embodiment reduces unnecessary temporary files by substituting or developing programs that support output and input through Linux pipes (the "|" operator), thereby increasing analysis speed. The flow from the raw data (fq), through adapter removal and host removal, to a clean fa sequence file is as follows:

trimadap -3 $adapter $input.fq | \
samtools fasta - 2>/dev/null | \
sdust -d | \
bwa-mem2 mem -z $reference.fa - | \
samtools view -f 0x4 -b | \
samtools fasta - 1>$output.fa 2>/dev/null

In the above flow, this example uses trimadap, which supports writing to standard output, to remove the adapter ($adapter) from the input file ($input.fq). The sdust software replaces the NCBI tool DustMasker, which does not support writing to standard output, for marking low-complexity regions in the sequences as "N"; support for outputting the marked sequences (-d) was added to sdust so that they are written to standard output. The sequences are then passed through the pipe to the alignment software bwa-mem2, and sequences that align to the host reference genome ($reference.fa) are removed, yielding the host-removed sequences ($output.fa). The method makes full use of Linux pipes to reduce temporary files, which avoids writing unnecessary extra files and increases detection speed.
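The same pipe-based chaining can be driven from a program. The sketch below uses Python's subprocess module to connect the standard output of each stage to the standard input of the next, so that no intermediate file is written. The function name is hypothetical, and the stages are passed in as argument lists rather than hard-coded, since the exact command lines depend on the installed tools.

```python
import subprocess


def run_pipeline(stages, input_path):
    """Run stages as a pipe chain fed from input_path; return the final output.

    stages is a non-empty list of argument lists, one per command.
    """
    procs = []
    with open(input_path, "rb") as source:
        upstream = source
        for argv in stages:
            proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
            if upstream is not source:
                # Close our copy so the upstream stage sees a broken pipe
                # if the downstream stage exits early.
                upstream.close()
            upstream = proc.stdout
            procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs:
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, proc.args)
    return output
```

Reading the input from a file and only the last stage's output through a pipe keeps the sketch free of deadlocks on large inputs.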

Based on the above acceleration and optimization, SE50 sequencing data of 20 samples with an average data volume of 34M were selected for an analysis speed test. The samples are listed in Table 1, and the test results are shown in FIG. 1.

TABLE 1 samples for analytical speed testing

The results in FIG. 1 show that, with 8-10 threads, the analysis from adapter processing through to generation of the annotated classification results takes about 30 minutes. With the improved data processing method, the whole data analysis flow can therefore be completed in about 30 minutes, which greatly increases the speed of identifying microorganisms by mNGS and meets the need to analyze clinical samples and output results quickly. Moreover, the data processing method of this embodiment allows 8-10 threads to be processed simultaneously, which greatly improves the efficiency of identifying microorganisms by mNGS.

It should be noted that the data processing method of this example is an optimization and improvement based on the Kraken2 software. For the remaining steps not mentioned here, reference may be made to the Kraken2 software or to existing methods for identifying microorganisms by metagenomic sequencing, and they are not described in detail here.

It can be understood that the key to this example is the optimization and improvement of the data processing method. In principle, the above data processing method is applicable to the following steps of identifying microorganisms by metagenomic sequencing: adapter removal, removal of host sequencing data using the host reference sequence, sequence classification of the data using the classification sequence library, and classification annotation of the data using the microorganism knowledge base. For the final filtering of results and interpretation of reports, reference may be made to the prior art, and details are not repeated here.

On the basis of the above optimization and improvement, this example inventively marks host-homologous regions in order to reduce false positive detection caused by the host. To assess the accuracy of the classification, this example also creatively performs a secondary alignment for the preliminarily identified species that meet the requirements, and obtains from the alignment file the exact number of fragments of each microorganism in the sequencing sample, so that related indexes such as microbial abundance can be calculated. The specific steps are as follows:

(5) Host homologous region marking

Regions of the database that are homologous to the host are prone to false positive detection when host sequences are present or host removal is incomplete. The genomes of eukaryotes such as parasites are larger than those of bacteria and viruses and contain many regions homologous to the host. Removing these regions not only reduces false positive detection caused by incomplete removal of host sequences but also reduces the size of the microbial classification library. Conventionally, homology is computed by genome alignment. However, the microbial library contains thousands of genomes, and each microorganism has multiple assembly versions, so aligning them all with the host genome is difficult and requires a large amount of alignment and post-alignment region calculation.

To address this problem, this example designs a method based on short sequences (k-mers) that marks with the base "N" the regions of protist and fungal genomes that are homologous to the host. These regions are thereby excluded from classification, which reduces host-derived false positives such as the common false positive for Toxoplasma gondii.

Specifically, the fa sequence of the host reference genome is split into 31 bp short sequences (31-mers), and A, C, G and T in the sequences are converted to 0, 1, 2 and 3, that is, binary 00, 01, 10 and 11. The 31 numbers are stored in a 64-bit unsigned integer (uint64_t) and preloaded into memory, occupying about 37 GB. Each eukaryotic genome whose homologous regions are to be calculated is then read in and matched against the k-mer library, and matching regions are marked as "N". Finally, each region of consecutive "N" marks is replaced by 10 "N" bases.
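The marking procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: all names are hypothetical, only the forward strand is considered, and a Python set stands in for the preloaded uint64_t library. Each 31-mer is packed at 2 bits per base, so it needs 62 bits and fits in a 64-bit unsigned integer.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k=31):
    """Yield (start, code) for every k-mer of seq that contains only A/C/G/T."""
    mask = (1 << (2 * k)) - 1
    code = 0
    valid = 0  # consecutive valid bases ending at the current position
    for i, base in enumerate(seq.upper()):
        value = CODE.get(base)
        if value is None:
            valid = 0
            code = 0
            continue
        code = ((code << 2) | value) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, code


def build_host_library(host_seq, k=31):
    """Set of 2-bit-packed k-mers of the host sequence."""
    return {code for _start, code in encode_kmers(host_seq, k)}


def mask_homologous(target_seq, host_kmers, k=31, run_length=10):
    """Replace every stretch covered by host k-mers with run_length N bases."""
    hit = bytearray(len(target_seq))
    for start, code in encode_kmers(target_seq, k):
        if code in host_kmers:
            hit[start:start + k] = b"\x01" * k
    out = []
    in_run = False
    for base, flagged in zip(target_seq, hit):
        if flagged:
            if not in_run:
                out.append("N" * run_length)
                in_run = True
        else:
            out.append(base)
            in_run = False
    return "".join(out)
```

Collapsing each marked stretch to a fixed number of N bases is what shrinks the processed genome while still separating the flanking sequences.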

Taking Toxoplasma gondii as an example, its genome was subjected to the homologous region marking described above. Comparison before and after treatment shows that the treated genome is 17% smaller, so false positive detection can be effectively reduced.

(6) Secondary alignment

Software based on exact k-mer classification, represented by Kraken2, markedly increases the speed of sequence classification. It supports including the genomes of multiple subspecies or isolates of the same species, which makes the species genome more representative and lowers the probability of classification failure caused by k-mer mismatches. However, it cannot report the sequencing depth or genome coverage of the classified species. This information can be obtained by alignment analysis and is very useful for judging the accuracy of the classification.

In view of this, after the data have been classified against the complete Kraken2 database, this example aligns in real time the preliminarily identified species whose sequence support number meets the minimum requirement, and thereby obtains indexes such as coverage, depth distribution and dispersion. For example, the sequence support number must be ≥ 100 for parasites and ≥ 10 for other species.
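The selection rule can be written as a small filter. The function name and the input layout (a mapping from taxonomy ID to sequence support number, plus a set of parasite taxonomy IDs) are assumptions of this sketch; the thresholds are the ones stated in the text.

```python
def select_for_realignment(support, parasite_taxids,
                           parasite_min=100, other_min=10):
    """Return the taxonomy IDs whose support number meets the minimum."""
    selected = []
    for taxid, count in support.items():
        minimum = parasite_min if taxid in parasite_taxids else other_min
        if count >= minimum:
            selected.append(taxid)
    return sorted(selected)
```

The returned list corresponds to the taxonomy ID list used to build the temporary reference for the secondary alignment.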

Specifically, in this example, reference genomes whose assembly level in the NCBI GenBank and RefSeq databases is Complete Genome or Chromosome are prepared. For Chromosome-level assemblies the chromosomes are concatenated in order. The sequences are named by species taxonomy ID and sequence ID (>TxID|SeqID), merged, compressed with bgzip and indexed with samtools faidx (fai) for later use. Then, based on the Kraken2 result, the species to be aligned are selected to obtain a list of taxonomy IDs. All sequences in the list are extracted and merged into a temporary reference sequence, and the host-removed sequences are aligned to it with minimap2 to obtain a BAM file, a common alignment format. Coverage (C), depth distribution and the dispersion index (D) are calculated from the BAM file.

Coverage is the ratio of the total length of the regions of the selected species' genome covered at 1× or more to the genome size L. Because one species may have multiple genome versions, the longest genome ($L_{max}$) is used as the genome size, and the covered positions are counted from the actual alignment positions ($P_i$) on each genome. For species with multiple genomes, the coverage obtained is therefore an estimate: $C_{approx} = \sum_i P_i / L_{max}$.
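One possible reading of the coverage estimate is sketched below. It assumes that P_i is the number of positions of genome version i covered by at least one alignment, computed here from half-open alignment intervals; this interpretation, the input layout and the function names are assumptions of the sketch.

```python
def covered_length(intervals):
    """Length of the union of half-open [start, end) intervals."""
    total = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start
            current_end = end
        elif end > current_end:
            total += end - current_end
            current_end = end
    return total


def approximate_coverage(genomes):
    """C_approx = sum of P_i over genome versions, divided by L_max.

    genomes maps a genome name to (length, list of alignment intervals).
    """
    l_max = max(length for length, _intervals in genomes.values())
    covered = sum(covered_length(iv) for _length, iv in genomes.values())
    return covered / l_max
```

Because positions are summed over all versions but divided by a single genome length, the value is an approximation, as the text notes.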

Dispersion is the ratio of the number of reference genome windows covered by the sequences supporting the species ($n$) to the total number of windows ($N$), that is, $D = n/N$.

D takes values in [0, 1]; the closer D is to 1, the more uniform the coverage, the better the dispersion and the higher the reliability.
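The dispersion index can be computed as in the following sketch. The window size is not stated in the text, so the default of 1000 bp is purely illustrative, as are the function name and the use of alignment start positions to decide which window a sequence covers.

```python
def dispersion(read_starts, genome_length, window=1000):
    """D = covered windows / total windows, for fixed-size windows."""
    total_windows = -(-genome_length // window)  # ceiling division
    covered = {pos // window for pos in read_starts
               if 0 <= pos < genome_length}
    return len(covered) / total_windows
```

Reads piled up in a single window give a value near 0, while reads spread evenly over the genome give a value near 1.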

The foregoing is a more detailed description of the present application in connection with specific embodiments thereof, and it is not intended that the present application be limited to the specific embodiments thereof. It will be apparent to those skilled in the art from this disclosure that many more simple derivations or substitutions can be made without departing from the spirit of the disclosure.

Claims

1. A data processing method for identifying microorganisms by using mNGS, characterized by comprising the following steps:

a database loading step, which comprises loading a database for identifying microorganisms by using the memory-mapped directory /dev/shm provided by a Linux system;

2. The data processing method of claim 1, wherein: the database loading step further comprises, before loading the database into /dev/shm, requesting in advance a memory space larger than the size of the database and releasing the content cached by the system, thereby ensuring that the database can be loaded completely into /dev/shm.

3. The data processing method of claim 1, wherein: the method further comprises a homologous region marking step, which comprises splitting a host reference genome in the database into short sequences to form a short sequence library; matching the genome of a eukaryote whose homologous regions are to be calculated against the short sequence library, marking regions that match the short sequence library as "N", and replacing each region continuously marked as "N" with "N" bases.

4. The data processing method of claim 2, wherein: the homologous region marking step further comprises converting A, C, G and T in the sequences into binary digits, storing each short sequence as an unsigned integer, and preloading the short sequences into memory;

preferably, the short sequence is 31bp in length.

5. The data processing method according to any one of claims 1 to 4, characterized by further comprising a secondary alignment step, which comprises, for preliminarily identified species whose sequence support numbers meet the requirements, aligning all sequences of a selected species, species by species, to the genome reference sequence of the selected species, accurately obtaining the sequence support number of the selected species in the sequencing sample, and calculating the coverage, depth distribution and dispersion of the species.

6. The data processing method of claim 5, wherein: the coverage is the ratio of the total length of the regions of the genome of the selected species covered at 1× or more to the genome size L;

when multiple genome versions of the selected species exist, the longest genome $L_{max}$ is used as the genome size, and the covered positions are counted from the actual alignment positions $P_i$ on each genome; for species with multiple genome versions, the coverage C obtained is an estimate, that is, $C_{approx} = \sum_i P_i / L_{max}$.

7. The data processing method of claim 5, wherein: the dispersion is the ratio of the number $n$ of reference genome windows covered by the sequences supporting the selected species to the total number of windows $N$, that is, $D = n/N$.

8. The data processing method of claim 5, wherein: for the species whose sequence support numbers meet the requirements, the sequence support number is required to be greater than or equal to 100 for parasites and greater than or equal to 10 for other species.

9. A data processing apparatus for identifying microorganisms by using mNGS, characterized in that: the apparatus comprises a memory and a processor;

the memory is configured to store a program;

the processor is configured to execute the program stored in the memory to implement the data processing method for identifying microorganisms by using mNGS according to any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that: the storage medium stores a program executable by a processor to implement the data processing method for identifying microorganisms by using mNGS according to any one of claims 1 to 8.