CN114242173B

CN114242173B - Data processing method and device for identifying microorganisms by mNGS and storage medium

Info

Publication number: CN114242173B
Application number: CN202111579973.1A
Authority: CN
Inventors: 黄毅; 杨振宇; 刘久成; 林小芳; 张丹; 易鑫; 杨玲
Original assignee: Shenzhen Guiinga Medical Laboratory
Current assignee: Shenzhen Guiinga Medical Laboratory
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2023-05-16
Anticipated expiration: 2041-12-22
Also published as: CN114242173A

Abstract

The application discloses a data processing method, a device and a storage medium for identifying microorganisms by mNGS. The data processing method comprises the steps of loading a database by using memory mapping/dev/shm provided by a Linux system; before the database is read, checking the size of the database, if the size of the database is smaller than the size of the database originally loaded in the hard disk, activating the database in a virtual memory touch mode, so that the loaded database is completely cached in a memory; loading a reference genome by adopting a memory mapping mode; and a Linux pipeline is adopted for outputting and reading, so that temporary files are reduced, and the analysis speed is improved. According to the method, through optimizing key steps of limiting the speed in the data processing process, the speed and efficiency of identifying microorganisms by using the mNGS are improved, the dependence on high-performance hardware equipment is reduced, and quick, efficient and accurate microorganism analysis and identification can be realized by using the conventional hardware equipment.

Description

Data processing method and device for identifying microorganisms by mNGS and storage medium

Technical Field

The application relates to the technical field of microorganism metagenome sequencing detection, in particular to a data processing method, a device and a storage medium for identifying microorganisms by using mNGS.

Background

With advances in sequencing technology and reduced costs, more and more microorganisms are being sequenced, such as human microbiome program HMP, human intestinal metagenome project MetaHIT 2008, american intestinal program AGP 2012, chinese microbiome program CAS-CMI 2017, and the like. The challenge with the large amount of species is that classification databases are becoming larger and larger, which presents challenges for data analysis, especially for pathogenic microorganism detection with high requirements on time efficiency.

Metagenome sequencing, particularly metagenome second generation sequencing (abbreviated mNGS), refers to directly extracting nucleic acids of all microorganisms from an environment or host sample, constructing a metagenome library, and sequencing by using a second generation sequencing technology. Metagenome sequencing identifies microorganisms by directly analyzing and detecting or identifying microorganisms carried by the environment or host using metagenome sequencing data. Generally, metagenome sequencing and identifying microorganisms comprises the steps of metagenome sequencing, downloading data, removing connectors, removing sequencing data of a host by using a host reference sequence, classifying the sequences of the data by using a classification sequence library, classifying and annotating the data by using a microorganism knowledge base, and finally obtaining an interpretation report by filtering the results. Among these, the three steps of decommissioning, removing sequencing data from the host, and sequence classification typically require extensive and computationally intensive computations.

The mNGS can indiscriminately sequence the extracted nucleic acid, and the non-host sequences are only significant for detecting human infectious microorganisms; therefore, the host sequence needs to be removed in the bioinformatic analysis process, and species identification is only performed on the removed sequence. In terms of sequence classification, taking the Krake 2 software as an example, its official supply of standard genomic reference libraries, including archaebacteria, bacteria, viruses, plasmids, human hosts and vectors, https:// benlangmea. Gilth. Io/aws-index/k 2, totals about 50.1GB (Gigabyte), plus about 53.2GB after protozoa and partial fungi, plus other eukaryotes, including parasites, etc., can amount to nearly 90GB, covering about 16000 microorganisms. The quick reading is required for the sequence query of the large data, the Krake 2 can load the database into the memory for quick reading, and the database needs to be loaded after the memory disc is manufactured, but the method requires root permission, so that the use of common users is limited.

To increase the speed of analysis of mNGS microbiological identification, developers or businesses have sought to speed up calculations using GPUs or FPGAs, such as MetaCache-GPU and Tera-BLAST, but often require the purchase and deployment of new hardware.

Therefore, how to rapidly, efficiently and accurately perform metagenomic sequencing to identify microorganisms based on the existing conventional hardware equipment is a problem to be solved.

Disclosure of Invention

It is an object of the present application to provide an improved data processing method, apparatus and storage medium for mNSS identification of microorganisms.

In order to achieve the above purpose, the present application adopts the following technical scheme:

the first aspect of the application discloses a data processing method for identifying microorganisms by using mNGS, which comprises the following steps:

a database loading step, which comprises loading a database for identifying microorganisms by using memory mapping/dev/shm provided by a Linux system; the system comprises a Linux system, a dev/shm system, a temporary file system, a storage system and a storage system, wherein the dev/shm of the Linux system is a tmpfs file system, all users of the directory have read-write permission, and the maximum writable size of the directory is half of the physical memory of the system;

the database checking step comprises the steps of checking the size of the database before the database is read, and activating the database by a virtual memory touch mode if the size of the database is smaller than the size of the database which is originally loaded, so that the loaded database is completely cached in a memory;

the database comparison step comprises loading a reference genome in a memory mapping mode, and uniformly caching the reference index when a plurality of comparison processes are operated simultaneously, wherein each comparison process shares the process and the result; when a new comparison process is added, index checking is also performed first, if the index is found to be already loaded in the memory or in the loading process, the index is accessed according to the memory address or used after waiting for loading, and the loading is not repeated; when all parallel comparison processes are finished, automatically managing the reference index, and releasing the reference index from the cache when the reference index is not actively accessed;

and the data transmission step comprises the steps of adopting a Linux pipeline to output and read in, so that the generation of temporary files is reduced, and the analysis speed is improved.

It should be noted that, the data processing method of the present application, through the use of/dev/shm, not only improves the database reading speed, but also solves the problem of requiring root authority, thereby facilitating the use of common users; through the database checking step, the problem of blocking caused by the fact that the system automatically caches the database or part of the database in the inactive state to the hard disk is solved, and the detection speed is further improved; through the memory address mapping of the database comparison step, the speed of comparing single samples is improved, more samples can be allowed to be compared simultaneously under the same condition, and the detection efficiency is improved; the temporary files or the process files generated in the analysis process are reduced through the output and the reading of the Linux pipeline, the influence of the input and the reading of the temporary files or the process files on the analysis speed is avoided, and the detection speed is further improved. In a word, in the data processing method, through optimizing and improving key steps for limiting the data processing speed in the data processing process of metagenome sequencing and identifying microorganisms, the speed and efficiency of metagenome sequencing and identifying microorganisms are improved, and the dependence on high-performance hardware equipment is reduced, so that the metagenome sequencing and identifying microorganisms can be rapidly, efficiently and accurately analyzed and identified by only adopting the conventional hardware equipment, such as 187G memory and 64-core CPU.

It will be appreciated that the key to the present application is the optimization and improvement of the key steps limiting the data processing speed, and that for general steps of identifying microorganisms by metagenomic sequencing, such as high throughput sequencing, on-line data quality control, etc., reference may be made to the prior art, while in other steps, parts related to database loading, alignment, transmission, etc., may employ the data processing method of the present application.

In one implementation manner of the application, the step of loading the database further includes pre-applying for a memory space larger than the size of the database before loading the database to the/dev/shm, and releasing the content cached by the system, so as to ensure that the database can be completely loaded to the/dev/shm.

It should be noted that, the pre-application of the memory space larger than the size of the database and the same function of the database checking step are both to ensure that the database is completely cached in the memory, so as to avoid the problem of blocking caused by the database being cached in the hard disk.

In one implementation mode of the application, the data processing method of the application further comprises a homologous region marking step, wherein the homologous region marking step comprises the steps of splitting a host reference genome in a database into short sequences to form a short sequence library; the eukaryotic genome whose homologous region needs to be calculated is aligned with the short sequence library, the region that can be matched with the short sequence library is marked as "N", and the region that is continuously marked as "N" is replaced with "N" bases.

Preferably, the step of marking the homologous region further comprises converting A, C, G, T in the sequence into binary digits, respectively, storing the short sequence in an unsigned integer manner, and preloading the short sequence into the memory.

Preferably, the short sequence is 31bp in length.

In this application, the key of the homologous region labeling step is to use a short sequence (kmer) to label the region homologous to the host in the genome of the protozoa and fungi by the base "N", and exclude the region labeled "N" from classification, thereby reducing false positive detection caused by the host, such as the common toxoplasma gondii Toxoplasma gondii false positive. It will be appreciated that the conversion of A, C, G, T to binary numbers, with a short sequence of 31bp in length, is a particularly useful scheme in one implementation of the present application, and does not exclude that other ways of converting each base or designing short sequences of different lengths as desired may be used.

In one implementation of the present application, the data processing method of the present application includes a secondary alignment step, which includes aligning, in units of species, all sequences of a selected species to a genomic reference sequence of the selected species for the species for which the number of sequence supports that are primarily identified meets the requirement, and accurately obtaining the number of sequence supports of the selected species in a sequencing sample, thereby calculating coverage, depth distribution, and dispersion of the species.

Preferably, the coverage is the ratio of the sum of the regions covered by the genome of the selected species by more than 1 x to the genome size L; in selectingWhen multiple genome versions exist in a given species, the longest genome L _max Calculating the actual comparison position P of the covered positions according to the respective genome _i Calculating; for species in which multiple genomic versions are present, the resulting coverage C is an estimate, C _approx ＝∑P _i /L _max 。

Preferably, the dispersion is the ratio of the number N of windows of the reference genome to the total window N, i.e. d=n/N, that the sequences supported by the selected species can cover.

Preferably, the number of sequence support is greater than or equal to 100 for species whose number of parasite sequence support is satisfactory, and greater than or equal to 10 for other species.

It should be noted that, the existing metagenome sequencing identification microorganism software or method cannot give information such as sequencing depth and genome coverage of the classified species; the application creatively provides that the secondary comparison is carried out on the initially identified species, the fragment number of the species in the sequencing sample is accurately obtained, so that the abundance of the species is calculated, the information such as coverage, depth distribution and dispersion is obtained, and the classification accuracy can be greatly improved.

In a second aspect the present application discloses a data processing device for identifying microorganisms by mNGS, the device comprising a memory and a processor; the memory includes a memory for storing a program; the processor includes a data processing method for implementing the mNGS identification of microorganisms of the present application by executing a program stored in the memory.

A third aspect of the present application discloses a computer-readable storage medium having stored therein a program executable by a processor to implement a data processing method of mNGS identification microorganisms of the present application.

Due to the adoption of the technical scheme, the beneficial effects of the application are that:

according to the data processing method for identifying the microorganisms by using the mNGS, through optimizing and improving key steps of limiting the speed in the data processing process, the speed and efficiency of identifying the microorganisms by using metagenome sequencing are improved, and the dependence on high-performance hardware equipment is reduced, so that the quick, efficient and accurate microorganism analysis and identification of the microorganisms by using the conventional hardware equipment are realized.

Drawings

FIG. 1 is a graph of the statistics of analysis speed testing of 20 samples of SE50 sequencing data with an average data size of 34M in the examples of the present application.

Detailed Description

The present application is described in further detail below with reference to the accompanying drawings by way of specific embodiments. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or substituted for other devices, materials, or methods in different situations. In some instances, some operations associated with the present application have not been shown or described in the specification to avoid obscuring the core portions of the present application, and may not be necessary for a person skilled in the art to describe in detail the relevant operations based on the description herein and the general knowledge of one skilled in the art.

The mNGS identifies microorganisms, namely metagenome sequencing identifies microorganisms, and the speed and the efficiency of microorganism identification are greatly influenced due to the fact that the related microorganisms are large in variety and large in data size, so that rapid analysis cannot be realized to obtain microorganism information.

According to the method, the speed limiting link of the mNGS analysis flow is analyzed, and the analysis speed of the mNGS is improved by means of reducing writing and reading, sharing a database loaded into a memory, multithreading and the like under the environment of conventional hardware. The speed-increasing processing of memory caching, pipeline transmission, memory disk loading/cleaning, memory sharing and the like is carried out on the input and output of the speed-limiting link in the data processing process, so that the analysis speed of a single sample and the parallel analysis speed of multiple samples are improved.

Based on the research and the discovery, the application creatively provides a data processing method for identifying microorganisms by using mNGS, which comprises a database loading step, a database checking step, a database comparing step and a data transmission step.

The database loading step comprises the step of loading a database for identifying microorganisms by using memory mapping/dev/shm provided by a Linux system.

It should be noted that the memory disk hanging mode recommended by Krake 2 requires root authority, and the mkdir/ramdisk & & mount-t ramfs none/ramdisk is inconvenient for common users to use. In addition, after the part of the memory is mounted on the memory disk in the mode, the part of the memory cannot be directly used by other processes, so that the resource waste is caused. The memory mapping/dev/shm provided by the Linux system is creatively utilized to load the database, all users of the directory have read-write permission, the maximum writable size is half of the physical memory of the system, and even if the system is fully written, the system is not down. For example, linux is from kernel version 2.6, and the name-r command is used to obtain the Linux kernel version, and the memory disk form of starting support/dev/shm is used as the shared memory, and the default size is half of the physical memory of the system. The read-write speed of the memory is obviously higher than that of the hard disk storage, the read-write speed of the DDR3 memory is about 10G per second and is 100 times that of the mechanical hard disk, the read-write speed of the DDR4 memory is about 500-1000 times that of the mechanical hard disk, the analysis performance is improved by fully utilizing/dev/shm to replace/tmp, and the inter-process file communication (IPC) is supported. In one implementation of the present application, the test environment has a/dev/shm size of 94G, including eukaryotes, and the Kraken2 database has a size of 90G, and may be loaded therein in its entirety, and deleted after the classification step is completed.

And the database checking step comprises the step of checking the size of the database before the database is read, and activating the database by a virtual memory touch mode if the size of the database is smaller than the size of the database which is originally loaded, so that the loaded database is completely cached in the memory.

It should be noted that, if the file is in inactive state, the system will cache part of the file to the hard disk, i.e. the swappace cache space, and the released memory will cache other active content. If a large amount of other content is cached in the system memory, it may happen that part of the content is placed in the hard disk cache space when the Krake 2 database is copied to/dev/shm. In this case, if the Kraken2 database in the v/mm is read, a stuck condition occurs, and the classification module classification of Kraken2 enters a reading speed close to that of the hard disk, classifying a stuck stop. In order to solve the situation, two solutions are designed, namely, before the Kraken2 database is copied, a memory space larger than the size of the database is applied through a program, the cached content of the system is released, and then the situation that the database is partially cached to a hard disk can be avoided by copying the database; the second method is to check the size of the database before the classify module of Kraken2 reads the database, if the size is smaller than the original database, it is indicated that the part of the database cached in the/dev/shm is actually in the hard disk cache, and the database cached in the hard disk is activated by a virtual memory touch mode (vmtuch; https:// gitub.com/hotech/vmtuch) to reload the database cached in the hard disk into the memory to obtain the complete database cached in the memory for classification.

Wherein, the program for applying the memory in the first method needs to prevent the application instruction malloc or calloc from being optimized by GCC, and needs to add a modification for preventing the compiler GCC from optimizing, such as GCC: __ attribute __ ((O0 "))) before the correlation function; clang __ attribute __ (optnone).

The database comparison step comprises loading a reference genome in a memory mapping mode, and uniformly caching the reference index when a plurality of comparison processes are operated simultaneously, wherein each comparison process shares the process and the result; when a new comparison process is added, index checking is also performed first, if the index is found to be already loaded in the memory or in the loading process, the index is accessed according to the memory address or used after waiting for loading, and the loading is not repeated; when all the parallel comparison processes are finished, the reference index is automatically managed, and released from the cache when the reference index is not actively accessed.

It should be noted that, taking the step of removing the host sequence as an example, a single process bwa-mem2mem can independently read the host reference genome into the memory and then perform short sequence alignment, for example, the size of a human reference genome index file is about 16G, and the host removal alignment of the 20M sequencing sequence needs to occupy about 32G of memory, wherein the reference genome index occupies half. Taking 187G memory as an example, 6 decompaction analysis tasks can be performed at most in parallel; in practical situations, the system occupies part of the memory, and possibly other processing processes, such as redundancy elimination, sequencing and the like for comparison by samtools, occupy the memory, so that the number of tasks which can be maximally parallel is less than 6; in one implementation of the application, the actual test may be performed in parallel with only 3 to 4 comparison tasks.

The application creatively proposes that if the reference genome index parts which need to be read by different processes are shared among the processes, the memory can be saved, so that more comparison tasks can be run, and theoretically, up to 10 parallel processes are performed, namely, n×16+16=187, and n=10. The loading mode of the reference genome is replaced on the basis of bwa-mem2 codes, and the original read-in memory is changed into a mode of using memory mapping (mmap), so that when a plurality of bwa-mem2mem examples are operated at the same time, the system uniformly caches the reference index, and all processes share the process and the result; when a new process is added, index checking is performed first, if the index is found to be already loaded in the memory or in the loading process, the index is accessed according to the memory address or used after the loading is waited, and the loading is not repeated; finally, when the parallel bwa-mem2mem process ends, the system automatically manages the reference index and releases it from the cache when it is not actively accessed. With the above scheme, in one implementation manner of the present application, 8-10 samples of the de-hosting analysis can be performed in parallel under the same system environment.

It should be noted that, in one implementation of the present application, a trimadap supporting output to a standard output is used to remove a linker ($adapter) in an input file ($input. Fq); the NCBI tool DustMasker which does not support output to the standard output is replaced by sdurt software to mark the low-complexity sequence in the sequence, namely the low-complexity sequence is marked as 'N', and the support (-d) of the sequence after the output mark is added in the sdurt is used for outputting the sequence to the standard output; the sequences aligned to the host reference genome ($reference. Fa) were removed by transferring to alignment software bwa-mem2 via a Linux pipeline to obtain the de-host sequence ($output. Fa). As the Linux pipeline is fully utilized, the generation of temporary files is reduced, and thus, extra unnecessary file writing is avoided, and the detection speed is improved.

Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by means of hardware, or by means of a computer program. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc., and the program is executed by a computer to realize the above-mentioned functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, when all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and all or part of the functions in the above methods may be implemented by downloading or copying the program into a memory of a local device or updating a version of a system of the local device, and executing the program in the memory by a processor.

Accordingly, in another implementation of the present application there is also provided a data processing apparatus for mNGS identification of microorganisms, the apparatus comprising a memory and a processor; a memory including a memory for storing a program; a processor comprising a program for implementing the following method by executing a program stored in a memory: a database loading step, which comprises loading a database for identifying microorganisms by using memory mapping/dev/shm provided by a Linux system; the database checking step comprises the steps of checking the size of the database before the database is read, and activating the database by a virtual memory touch mode if the size of the database is smaller than the size of the database which is originally loaded, so that the loaded database is completely cached in a memory; the database comparison step comprises loading a reference genome in a memory mapping mode, and uniformly caching the reference index when a plurality of comparison processes are operated simultaneously, wherein each comparison process shares the process and the result; when a new comparison process is added, index checking is also performed first, if the index is found to be already loaded in the memory or in the loading process, the index is accessed according to the memory address or used after waiting for loading, and the loading is not repeated; when all parallel comparison processes are finished, automatically managing the reference index, and releasing the reference index from the cache when the reference index is not actively accessed; and the data transmission step comprises the steps of adopting a Linux pipeline to output and read in, so that the generation of temporary files is reduced, and the analysis speed is improved.

There is also provided in another implementation of the present application a computer readable storage medium including a program executable by a processor to implement a method of: a database loading step, which comprises loading a database for identifying microorganisms by using memory mapping/dev/shm provided by a Linux system; the database checking step comprises the steps of checking the size of the database before the database is read, and activating the database by a virtual memory touch mode if the size of the database is smaller than the size of the database which is originally loaded, so that the loaded database is completely cached in a memory; the database comparison step comprises loading a reference genome in a memory mapping mode, and uniformly caching the reference index when a plurality of comparison processes are operated simultaneously, wherein each comparison process shares the process and the result; when a new comparison process is added, index checking is also performed first, if the index is found to be already loaded in the memory or in the loading process, the index is accessed according to the memory address or used after waiting for loading, and the loading is not repeated; when all parallel comparison processes are finished, automatically managing the reference index, and releasing the reference index from the cache when the reference index is not actively accessed; and the data transmission step comprises the steps of adopting a Linux pipeline to output and read in, so that the generation of temporary files is reduced, and the analysis speed is improved.

Examples

In this example, 20 samples of SE50 sequencing data with an average data size of 34M were used and processed according to the data processing method for mNSS-identified microorganisms optimized in this example. The operating system for data processing and analysis in this example is a Linux system, 187G memory, and a 64-core CPU, and the optimization and improvement in this example mainly includes optimization and improvement in key steps such as memory caching, pipeline transfer, memory disk loading/cleaning, memory sharing, and the like, and specifically includes the following steps:

(1) Use of/dev/shm

In the embodiment, a Linux self-kernel version 2.6 is adopted, a name-r command is used for obtaining the Linux self-kernel version, a memory disk form with a starting support/dev/shm is used as a shared memory, and the default size is half of the physical memory of the system. The read-write speed of the memory is obviously higher than that of the hard disk storage, the read-write speed of the DDR3 memory is about 10G per second and is 100 times that of the mechanical hard disk, the read-write speed of the DDR4 memory is about 500-1000 times that of the mechanical hard disk, the analysis performance is improved by using/dev/shm to replace/tmp, and inter-process file communication (IPC) is supported.

The memory mapping/dev/shm provided by the Linux system is used for loading the database, and all users of the directory have read-write permission and cannot cause system downtime even if the directory is fully written. The test environment of this example has a/dev/shm size of 94G, including eukaryotic organisms, and the Kraken2 database has a size of 90G, which can be loaded therein in its entirety and deleted after the classification step is completed.

(2) Preventing Swap to hard disk by pre-applying for memory

The system is managed by the dev/shm, if the file in the system is in an inactive state, the system will cache part of the file in the system to the hard disk, namely the swapppace cache space, and the released memory caches other active contents. If a large amount of other content is cached in the system memory, it also happens that part of the content is placed in the hard disk cache space when the Krake 2 database is copied to/dev/shm. In this case, if the Kraken2 database in the v/mm is read, a stuck condition occurs, and the classification module classification of Kraken2 enters a reading speed close to that of the hard disk, classifying a stuck stop. In order to solve the situation, two solutions are designed in the embodiment, firstly, before the Krake 2 database is copied, a memory space larger than the size of the database is applied through a program, the cached content of the system is released, and then the situation that the database is partially cached to a hard disk can be avoided by copying the database; the second method is to check the size of the database before the classify module of Kraken2 reads the database, if the size is smaller than the original database, it is indicated that the part of the database cached in the/dev/shm is actually in the hard disk cache, and the database cached in the hard disk is activated by a virtual memory touch mode (vmtuch; https:// gitub.com/hotech/vmtuch) to reload the database cached in the hard disk into the memory to obtain the complete database cached in the memory for classification.

The program in the first method, which applies for memory, needs to prevent the application instruction malloc or calloc from being optimized by GCC, needs to add a modification to prevent compiler GCC optimization, such as GCC: __ attribute __ ((optimal ("O0")))) before the correlation function; clang __ attribute __ (optnone).

(3) Memory address mapping

The memory resource is fully utilized, and the inter-process sharing of the cache file is increased, so that more tasks can be run in parallel under the limited memory configuration, and the step of host removal is taken as an example: the bwa-mem2mem single process can independently read the host reference genome into the memory and then perform short sequence comparison, for example, the size of a human reference genome index file is about 16G, the decompaction of 20M sequencing sequences needs to occupy about 32G memory, and the reference genome index occupies half. Taking 187G memory as an example, 6 tasks can be parallel at most, in which case the system occupies part of the memory, and possibly other processes for processing the comparison output occupy the memory, for example, samtools can be used for performing redundancy and sequencing of comparison, so that the number of tasks which can be parallel at most is less than 6, and in this example, the number of actual tests is 3-4. If the index parts of the reference genome which are needed to be read by different processes are shared among the processes, the memory can be saved so as to run more comparison tasks, and at most 10 parallel reference genome indexes are theoretically n multiplied by 16+16=187; n=10.

The loading mode of the reference genome is replaced on the basis of bwa-mem2 codes, and the original read-in memory is changed into a mode of using memory mapping (mmap), so that when a plurality of bwa-mem2mem examples are operated at the same time, the system uniformly caches the reference index, and all processes share the process and the result; the new process also checks the index first, if the index is found to be loaded in the memory or in the loading process, the index is accessed according to the memory address or is used after the loading is completed, and the loading is not repeated; when the parallel bwa-mem2mem process ends, the system automatically manages the reference index and releases it from the cache when it is not actively accessed.

The bwa-mem2mem modified in the above manner can be used for the parallel 8-10 sample de-hosting analysis under the system environment.

(4) Pipeline output by using Linux

The input and reading of temporary files or process files in the process analysis reduces the analysis speed, and the present example reduces unnecessary temporary file generation by replacing or developing programs supporting the output and reading of Linux pipes ("|" operators) to increase the analysis speed. The following is a procedure from raw data fq to decomplexing and obtaining a clean fa sequence file:

trimadap-3$adapter$input.fq|\\

samtools fasta-2>/dev/null|\\

sdust-d|\\

bwa-mem2 mem-z$reference.fa-|\\

samtools view-f0x4-b|\\

samtools fasta-1>$output.fa 2>/dev/null

in the above process, the present example uses trimadap supporting output to standard output to remove the linker ($adapter) in the input file ($input. Fq); the low complexity sequence in the sequence is marked with "N" by using the sdust software instead of the NCBI tool durtmaker which does not support the output to the standard output, and the support (-d) for the sequence after the output mark is added to the sdust outputs the sequence to the standard output, which is piped to the alignment software bwa-mem2 for alignment and removal of the sequence aligned to the host reference genome ($reference. Fa) to obtain the de-host sequence ($output. Fa). The Linux pipeline is fully utilized, so that the generation of temporary files is reduced, extra unnecessary file writing is avoided, and the detection speed is improved.

Based on the acceleration optimization process, 20 SE50 sequencing data samples with the average data size of 34M are selected for analysis speed test, the data samples are shown in table 1, and the test results are shown in fig. 1.

Table 1 samples for analysis speed test

The results of FIG. 1 show that 8-10 threads are suitable from joint processing to classification after annotation for about 30 minutes. Therefore, by adopting the improved data processing method, the whole data analysis process can be completed in about 30 minutes, so that the speed of identifying microorganisms by mNGS is greatly improved, and the use requirement of rapid analysis and rapid output of clinical samples can be met. Moreover, by adopting the data processing method, 8-10 threads can be processed simultaneously, and the efficiency of identifying microorganisms by using mNGS is greatly improved.

It should be noted that, the data processing method of this example is optimized and improved based on the Kraken2 software, so the rest of the steps not mentioned can refer to the Kraken2 software or the existing metagenome sequencing method for identifying microorganisms, which is not described here.

It will be appreciated that the key to this example lies in the optimisation and improvement of the data processing method which is applicable in principle to the step of decommissioning of a metagenomic sequencing identified microorganism, the step of removing host sequencing data using host reference sequences, the step of sequence classification of data using a classification sequence library, and the step of classifying and annotating data using a microbial knowledge base. Reference is made to the prior art for the final result filtering and interpretation report acquisition, which are not detailed herein.

Based on the above optimization and improvement, in order to reduce false positive detection caused by a host, the host homologous region labeling treatment is creatively performed. In order to determine the accuracy of judgment and classification, the inventive method performs secondary comparison on the primarily identified species meeting the requirements, and accurately obtains the fragment number of the microorganism in the sequencing sample through the comparison file, thereby realizing calculation of related indexes such as abundance of the microorganism. The method comprises the following steps:

(5) Host homology region markers

Regions of the database homologous to the host are susceptible to false positive detection due to the presence of the host or incomplete removal of the host. The genome of eukaryotes such as parasites is larger than that of bacteria and viruses, and the areas homologous to hosts are more numerous, so that the removal of the areas can reduce false positive detection caused by incomplete removal of host sequences and the size of a microbial classification library. Conventional homology calculations are performed by genome alignment, but for thousands of genomes in a microorganism library, each microorganism has multiple assembled versions, alignment with the host genome is difficult to achieve, and requires extensive alignment and post-alignment region calculations.

In order to solve the problem, a method for marking the homologous region of the genome of the protozoa and fungi with the base N on the basis of a short sequence (kmer) mode is designed, the region is excluded from classification, and false positive detection caused by the host, such as common toxoplasma gondii Toxoplasma gondii false positive, and the like, is reduced.

Specifically, the fa sequence of the host reference genome is split into 31bp short sequences (31 mers), A, C, G, T in the sequences are respectively converted into 0,1, 2 and 3 representations, namely binary 00, 01, 10 and 11, 31 digits are further stored into 64 bits uint64_t, the unsigned integer mode is preloaded into a memory, about 37G memory is occupied, then eukaryotic genomes needing to calculate homologous regions are respectively read in, all the eukaryotic genomes are marked as 'N' when matched with a kmer library, and finally the region with the continuous label 'N' is replaced by 10 'N' bases.

In this example, toxoplasma gondii is taken as an example, the genome is subjected to the homologous region marking treatment, and the results before and after the comparison treatment show that the number of the treated genome is reduced by 17%, so that false positive detection can be effectively reduced.

(6) Secondary comparison

The software based on accurate kmer classification, represented by Kraken2, significantly increases the speed of sequence classification, and supports the inclusion of genomes of multiple subspecies/isolates of the same species, increases the representativeness of the species genome, reduces the probability of classification failure due to kmer mismatch, but does not give information on sequencing depth and genome coverage of the classified species, which usually needs to be obtained through alignment analysis, and has great utility in judging the accuracy of classification.

In view of this, in this example, on the basis of classifying the data in the Kraken2 complete database, the species whose number of sequence support satisfies the minimum requirement, which is primarily identified, are compared in real time, so as to obtain indexes such as coverage, depth distribution, dispersion, and the like. For example, the number of sequence support of parasites needs to satisfy ≡ 100; the number of sequence supports of other species needs to be satisfied by ≡10.

Specifically, in this example, reference genomes with Complete genome or Chromosome (Chromosome) genome assembly levels in NCBIGenBank and RefSeq databases are prepared, chromosomes are sequentially connected, and the chromosomes are combined after naming by classification ID and sequence ID of species (> TxID|SeqID), compressed by bgzip, and an index fai is established by samroolsfasidx for standby; then, based on a Kraken2 result, selecting a species to be compared to obtain a taxonomy ID list; extracting all sequences according to the list, merging the sequences into a temporary reference sequence, and comparing the sequences after the host removal by using a minimap2 to obtain a BAM file, namely a common comparison format; and calculating coverage (C), depth distribution and dispersion index (D) based on the BAM file.

Wherein the coverage is the ratio of the sum of the regions covered by the genome of the selected species by more than 1 x to the genome size L, calculated as the longest genome (L since multiple versions of the genome are possible for the same species _max ) Calculated, the positions of the coverage are compared with the actual positions of the respective genomes (P _i ) Calculation, for the multiple genomic species, the resultThe coverage C is an estimated value, i.e. C _approx ＝∑P _i /L _max 。

The dispersion is the ratio of the number of windows (N) of the reference genome to the total window (N) that can be covered by the number of sequences supported by the species, i.e. d=n/N.

Wherein, the value range of D is [0,1], and the closer to 1, the more uniform the coverage, the better the dispersion and the high the credibility.

The foregoing is a further detailed description of the present application in connection with the specific embodiments, and it is not intended that the practice of the present application be limited to such descriptions. It will be apparent to those skilled in the art to which the present application pertains that several simple deductions or substitutions may be made without departing from the spirit of the present application.

Claims

1. A data processing method for identifying microorganisms by using mNGS, which is characterized by comprising the following steps of: comprises the steps of,

a database loading step, which comprises loading a database for identifying microorganisms by using memory mapping/dev/shm provided by a Linux system;

the database checking step comprises the steps of checking the size of the database before the database is read, and activating the database by a virtual memory touch mode if the size of the database is smaller than the size of the database originally loaded in the hard disk, so that the loaded database is completely cached in the memory;

the database comparison step comprises loading a reference genome in a memory mapping mode, and uniformly caching the reference index when a plurality of comparison processes are operated simultaneously, wherein each comparison process shares the process and the result; when a new comparison process is added, firstly checking indexes, and if the indexes are found to be already loaded in the memory or in the loading process, accessing according to the memory address or waiting for use after loading is finished, and not repeating loading; when all parallel comparison processes are finished, automatically managing the reference index, and releasing the reference index from the cache when the reference index is not actively accessed;

the data transmission step comprises the steps of adopting a Linux pipeline to output and read in, reducing the generation of temporary files, and improving the analysis speed;

a secondary alignment step, comprising aligning all sequences of a selected species to genome reference sequences of the selected species by taking the species as a unit for species whose number of the primarily identified sequence supports meets the requirement, and accurately obtaining the number of the sequence supports of the selected species in a sequencing sample, thereby calculating coverage, depth distribution and dispersion of the species;

the coverage is the ratio of the sum of the areas covered by the genome of the selected species by more than 1 x to the genome size L;

in the presence of multiple genome versions of a selected species, the longest genome L _max Calculating the actual comparison position P of the covered positions according to the respective genome _i Calculating; for species in which multiple genomic versions are present, the resulting coverage C is an estimate, C _approx ＝∑P _i /L _max 。

2. A data processing method according to claim 1, characterized in that: the step of loading the database further comprises the steps of pre-applying for a memory space larger than the size of the database and releasing the cached content of the system before loading the database to the/dev/shm, so that the database can be completely loaded to the/dev/shm.

3. A data processing method according to claim 1, characterized in that: the method also comprises a homologous region marking step, wherein the homologous region marking step comprises the steps of splitting a host reference genome in a database into short sequences to form a short sequence library; the eukaryotic genome whose homologous region needs to be calculated is aligned with the short sequence library, the region that can be matched with the short sequence library is marked as "N", and the region that is continuously marked as "N" is replaced with "N" bases.

4. A data processing method according to claim 2, characterized in that: the homologous region marking step further comprises the steps of respectively converting A, C, G, T in the sequence into binary digits, storing the short sequence in an unsigned integer mode and preloading the short sequence into a memory.

5. The data processing method according to claim 4, wherein: the length of the short sequence is 31bp.

6. A data processing method according to claim 1, characterized in that: the dispersion is the ratio of the number N of windows of the reference genome to the total window N, i.e. d=n/N, that the sequences supported by the selected species can cover.

7. A data processing method according to claim 1, characterized in that: the number of sequence support is greater than or equal to 100 for species whose number of sequence support meets the requirements, specifically greater than or equal to 10 for other species.

8. A data processing device for identifying microorganisms by mNGS, characterized in that: the apparatus includes a memory and a processor;

the memory comprises a memory for storing a program;

the processor comprising a data processing method for implementing the mNGS identification of microorganisms according to any one of claims 1 to 7 by executing a program stored in the memory.

9. A computer-readable storage medium, characterized by: stored in the storage medium is a program executable by a processor to implement the data processing method for identifying microorganisms by mNGS according to any one of claims 1 to 7.