CN111009286A - Method and apparatus for microbiological analysis of host samples - Google Patents

Method and apparatus for microbiological analysis of host samples

Publication number
CN111009286A
Authority
CN
China
Prior art keywords
sequence
sequences
sequencing data
database
alignment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811169458.4A
Other languages
Chinese (zh)
Other versions
CN111009286B (en)
Inventor
袁剑颖
王子榕
陈晴
孙瑞雪
周加利
吴红龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huada Biotechnology Wuhan Co ltd
Shenzhen Huada Yinyuan Pharmaceutical Technology Co Ltd
Original Assignee
Shenzhen Huada Yinyuan Pharmaceutical Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huada Yinyuan Pharmaceutical Technology Co Ltd
Priority to CN201811169458.4A
Publication of CN111009286A
Application granted
Publication of CN111009286B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the field of microbial detection, in particular to a method and a device for carrying out microbial analysis on a host sample. The method comprises the following steps: (1) performing a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that is alignable with the host genome database; (2) performing second filtering processing on the sequencing data set by adopting a homology database so as to remove the sequencing data which can be aligned with the homology database from the sequencing data set; (3) comparing the sequencing data set subjected to the first filtering treatment and the second filtering treatment with a microorganism genomic database to determine microorganism sequencing data from the microorganisms in the sequencing data set. The method and the device can realize the rapid and accurate analysis and detection of the microorganisms in the host.

Description

Method and apparatus for microbiological analysis of host samples
Technical Field
The invention relates to the field of microbial detection, in particular to a method and a device for carrying out microbial analysis on a host sample.
Background
Traditional clinical methods for diagnosing pathogenic microorganisms have several drawbacks: demanding experimental and technical requirements, long turnaround times, a high rate of missed detections, complex operation, low throughput, and low identification precision and resolution. These shortcomings impose a serious economic burden on patients and delay the optimal window for diagnosis and treatment; in addition, the abuse of antibiotics may lead to serious drug resistance.
Due to the development of sequencing technology, sequencing time and sequencing cost are gradually reduced, so that clinical application of the metagenome-based NGS becomes possible. However, methods for detecting pathogens based on metagenomic sequencing have yet to be further improved.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, an object of the present invention is to provide a method for rapid and accurate microbiological analysis of a host sample, which can improve the accuracy of detection of pathogenic microorganisms. According to the invention, the accuracy of metagenome detection is effectively improved by constructing a high-quality reference sequence library. The method provided by the invention is used for detecting pathogenic microorganisms and assisting in reading reports.
The present invention is based on the results of the following findings and studies by the inventors:
In the course of their research, the inventors found that existing analysis methods for detecting microorganisms by metagenomic sequencing are oriented mainly toward scientific research. They lack clinically targeted design and exploration of practical application, and cannot meet clinical requirements for accuracy, readability of result reports, or computational efficiency. Specifically:
A. The databases are not suitable for clinical use. The prior art generally performs simple collection and collation based on public genome databases, which has the following problems: 1) High demand for computing resources. If a large sequence library (such as the nt library) is used directly, the memory or computing time required for alignment or k-mer matching is too large. 2) Low sequence quality. If sequence quality control, representativeness analysis, level restriction, and the like are not performed, downstream analysis produces false positives. 3) Limited detection range. Referring only to a small database, or collecting and collating sequences only for the pathogen types of interest (for example, only for virus detection), results in false negatives in downstream analysis.
B. The detection algorithms have low accuracy. Sequences that are shared or similar among species, such as plasmids, drug resistance genes, and virulence factor sequences, are not unique to a single species and may lead to false positives when the genomes of some species in the pathogen database contain such sequences. In addition, sequence annotation is subject to systematic errors, for the following reasons: 1) Incomplete host sequence removal. Because human genome assemblies are incomplete and human populations are polymorphic, residual human sequences remain after filtering, and when a pathogen genome sequence is similar to a human sequence, false positive detections can result; 2) sequencing errors and biases caused by reagents and sequencing platforms; 3) laboratory contamination, such as fragment contamination caused by aerosols.
C. The result reports have poor readability. In the prior art, analysis results are generally in text format, the content and parameters are obscure, and the information is difficult to retrieve and compare across samples, which affects the accuracy and efficiency of interpretation by clinicians.
To this end, the invention provides a method for microbiological analysis of a host sample for the detection and determination of pathogenic microorganisms.
To this end, according to one aspect of the present invention, there is provided a method of performing a microbiological analysis of a host sample comprising: (1) performing a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that is alignable with the host genome database, the set of sequencing data resulting from metagenomic sequencing of the host sample; (2) performing a second filtering process on the sequencing data set using a homology database, the homology database including at least a portion of known plasmid sequences, drug-resistant sequences, bacterial virulence sequences, so as to remove the sequencing data from the sequencing data set that can be aligned with the homology database; (3) comparing the sequencing data set subjected to the first filtering treatment and the second filtering treatment with a microorganism genomic database to determine microorganism sequencing data from the microorganisms in the sequencing data set.
Metagenomic sequencing is performed on the host sample to obtain a sequencing data set, and the first and second filtering processes are applied to this data set to remove host genome sequences and homologous sequences. The remaining sequencing data are then aligned against a microbial genome database to determine which sequencing data originate from microorganisms, so that the type of microorganism can be confirmed. The method of the present invention can thus analyze and detect microorganisms in host samples quickly and accurately.
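The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `aligns_to` is a hypothetical stand-in for a real aligner (here a read "aligns" if it is an exact substring of a reference), and each database is modeled as a plain set of reference strings.

```python
from typing import Iterable, List, Set


def aligns_to(read: str, database: Set[str]) -> bool:
    """Hypothetical stand-in for an aligner: exact substring match only."""
    return any(read in reference for reference in database)


def analyze(reads: Iterable[str],
            host_db: Set[str],
            homology_db: Set[str],
            microbe_db: Set[str]) -> List[str]:
    """Return the reads attributed to microorganisms."""
    # Step (1): remove reads that align to the host genome database.
    kept = [r for r in reads if not aligns_to(r, host_db)]
    # Step (2): remove reads that align to the homology database
    # (plasmid, drug resistance, and virulence sequences).
    kept = [r for r in kept if not aligns_to(r, homology_db)]
    # Step (3): keep the reads that align to the microbial genome database.
    return [r for r in kept if aligns_to(r, microbe_db)]


host_db = {"AAAACCCC"}
homology_db = {"GGGGTTTT"}
microbe_db = {"ACGTACGTGGGG"}
reads = ["AAAC", "GGTT", "GGGG", "ACGT", "TTTTTT"]
print(analyze(reads, host_db, homology_db, microbe_db))
```

In this toy input, the read `GGGG` matches the microbial reference but is discarded in step (2) because it also matches the homology database, which is the behavior the second filter is meant to provide.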
Herein, the term "microorganism" is a general term for small organisms that cannot be distinguished by the naked eye, have a morphological structure, and can grow and reproduce in a suitable environment; it includes not only bacteria, fungi, viruses, and the like, but also parasites.
Herein, the terms "first filtering process" and "second filtering process" only distinguish filtering of the sequencing data set against different databases; they do not imply an order of filtering. One skilled in the art will recognize that when the sequencing data set from the host sample is subjected to the first filtering process and the second filtering process, host genomic sequences may be filtered out first and homologous sequences second, homologous sequences may be filtered out first and host genomic sequences second, or host genomic sequences and homologous sequences may be removed at the same time.
Because plasmid sequences, drug resistance sequences, and bacterial virulence sequences are non-specific, that is, the same sequence may be present in several microorganisms at once, these sequences are removed before the microorganism species is identified so that they do not interfere with subsequent identification. The homology database is used to remove one or more of the known plasmid sequences, drug resistance sequences, and bacterial virulence sequences, and thus the corresponding shared sequences. The homology database may be obtained by downloading plasmid sequences (e.g., from NCBI), drug resistance sequences (from CARD), and bacterial virulence sequences (from VFDB) from public databases and deleting ambiguous gene sequences. CARD (the Comprehensive Antibiotic Resistance Database) is a drug resistance sequence database, and VFDB (the Virulence Factor Database) is a bacterial virulence sequence database.
According to an embodiment of the present invention, the above method for performing microbiological analysis on a host sample may further be added with the following technical features:
in some embodiments of the invention, the host genome database is a human genome database. The method of the invention can be used for analyzing and determining the pathogenic microorganisms in the human body so as to quickly and accurately obtain the information of the microorganisms.
In some embodiments of the invention, the human genome database comprises human reference genomic sequences and Yanhuang genomic sequences. The human reference genome sequence, for example the human reference genome hg38, can be downloaded from the NCBI official website; the Yanhuang genome sequence can be downloaded from the official website of the Yanhuang genome public database.
In some embodiments of the invention, the microbial genome database is constructed by: (a) obtaining known microbial genome sequences and constructing a primary microbial genome database, wherein the microbial genome sequences comprise at least one of whole genome sequence information, chromosome sequence information, Scaffold sequence information and Contig sequence information; (b) performing redundancy removal on the sequences in the primary microbial genome database so as to obtain a redundancy-removed microbial genome database, wherein redundancy removal refers to the removal of sequences with similarity of more than 99%; (c) based on the redundancy-removed microbial genome database, for a species for which a plurality of genomic sequences exist, selecting the microbial genome sequence of a representative strain and removing the microbial genome sequences of the other strains of the species from the redundancy-removed microbial genome database, so as to obtain the microbial genome database. Genome sequences downloaded from public databases may include redundant genomes (that is, identical genomes or genomes with more than 99% similarity); these are removed first, and representative strains are then selected from the remaining database. If the sequences of all strains were used, the database would be too large for practical application. Selecting the most representative strain sequence effectively compresses the database and improves the representativeness of the reference genomes, making this an effective trade-off among computing resources, analysis precision, and analysis time.
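Step (b) can be illustrated with a greedy pass over the primary database. The sketch below is only an assumption about how such a pass might look: the text does not specify how similarity is computed, so `toy_similarity` (fraction of identical positions) is a hypothetical placeholder for a real genome comparison tool, and the 99% threshold is the only value taken from the text.

```python
from typing import Callable, Dict


def toy_similarity(a: str, b: str) -> float:
    """Toy similarity (assumption): identical positions over the longer length."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def remove_redundant(genomes: Dict[str, str],
                     similarity: Callable[[str, str], float] = toy_similarity,
                     threshold: float = 0.99) -> Dict[str, str]:
    """Keep a genome only if it is not more than `threshold` similar
    to any genome that has already been kept."""
    kept: Dict[str, str] = {}
    for name, seq in genomes.items():
        if all(similarity(seq, other) <= threshold for other in kept.values()):
            kept[name] = seq
    return kept


base = "ACGT" * 50                      # 200 bases
near_duplicate = base[:-1] + "A"        # 1 mismatch  -> similarity 0.995
distinct = base[:-4] + "AAAA"           # 3 mismatches -> similarity 0.985
primary_db = {"s1": base, "s2": near_duplicate, "s3": distinct}
print(list(remove_redundant(primary_db)))
```

A greedy pass depends on input order; which member of a redundant pair is retained is a design choice that the text leaves open.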
Because of redundancy of information recorded in the existing microbial genome database, when sequencing data is compared to the existing microbial genome database, the comparison efficiency and the comparison accuracy are affected. To this end, according to an embodiment of the present invention, the present invention utilizes known microbial genome sequences to remove sequences of non-representative strains and redundant sequences thereof by screening and sorting, and constructs a microbial genome database suitable for alignment of microbial sequencing data.
In some embodiments of the invention, the representative strain is obtained by: for a species for which genomic sequences of multiple strains exist, determining the average identity between the genomic sequences of every two strains; obtaining a similarity matrix between the genomic sequences of the strains in the species based on these pairwise average identities; and, based on the similarity matrix, selecting the strain with the greatest average similarity to the sequences of the other strains as the representative strain.
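The selection rule can be sketched directly from a similarity matrix. This is a minimal sketch: the strain names and matrix values below are invented for illustration, and the row-wise mean (excluding the diagonal) is one assumed reading of "greatest average similarity to the other strains".

```python
from typing import List


def pick_representative(strains: List[str], sim: List[List[float]]) -> str:
    """Return the strain whose mean similarity to all other strains is highest.

    sim[i][j] is the average identity from strain i to strain j.
    """
    best_name, best_mean = strains[0], float("-inf")
    for i, name in enumerate(strains):
        others = [sim[i][j] for j in range(len(strains)) if j != i]
        mean = sum(others) / len(others) if others else 0.0
        if mean > best_mean:
            best_name, best_mean = name, mean
    return best_name


strains = ["strain_A", "strain_B", "strain_C"]   # hypothetical names
sim = [
    [1.00, 0.98, 0.90],
    [0.98, 1.00, 0.96],
    [0.90, 0.96, 1.00],
]
print(pick_representative(strains, sim))
```

Here the row means are 0.94, 0.97, and 0.93, so `strain_B` is chosen as the representative.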
In some embodiments of the present invention, the step (3) further comprises performing a third filtering process on the sequencing data of the microorganism by using a high frequency alignment site database to remove the sequencing data aligned to the high frequency alignment site, wherein the high frequency alignment site database is constructed by the following steps: comparing metagenomic sequencing data of a plurality of samples to the microbial genome database, wherein the microbial genome is pre-divided into a plurality of predetermined windows, so as to determine the number of metagenomic sequencing data matching the windows; determining a plurality of high-frequency alignment sites constituting the high-frequency alignment site database based on the number of metagenomic sequencing data matching the window.
Host genome sequences such as the human genome database usually contain some undetermined regions, and genomes also differ between individuals (polymorphism). As a result, some host-specific sequences may remain even after host genome sequences have been removed from the metagenomic sequencing data, and these residual sequences may be annotated downstream as microbial. If a microbial genome sequence is similar to these residual sequences, false positive detection results, that is, a host nucleic acid sequence is misidentified as a microorganism. In addition, systematic errors such as reagent effects, sequencing errors or biases of the sequencing platform, and laboratory background contamination produce the same sequences repeatedly across experiments. Using the high-frequency alignment site database effectively removes these sequences and yields accurate results.
In some embodiments of the invention, the plurality of samples and the host sample belong to the same species. Thereby identifying high frequency alignment sites present in the same species.
In some embodiments of the present invention, an alignment site with an alignment frequency greater than 5% is used as the high-frequency alignment site, and the alignment frequency is a ratio of the number of samples aligned to the site to the total number of samples.
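The construction of the high-frequency alignment site database can be sketched as a per-window count over samples. In this sketch the 5% cutoff comes from the text, while the window size, the `(genome_id, window_index)` representation, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple

Window = Tuple[str, int]  # (genome id, window index); representation is an assumption


def window_of(genome_id: str, position: int, window_size: int = 1000) -> Window:
    """Map an alignment position to its window (window size is hypothetical)."""
    return (genome_id, position // window_size)


def high_frequency_windows(sample_hits: List[Set[Window]],
                           min_freq: float = 0.05) -> Set[Window]:
    """Return windows hit in more than `min_freq` of the samples.

    Each element of `sample_hits` is the set of windows that received at
    least one aligned read in one sample, so a sample counts once per window.
    """
    total = len(sample_hits)
    if total == 0:
        return set()
    counts = Counter()
    for hits in sample_hits:
        counts.update(hits)
    return {w for w, c in counts.items() if c / total > min_freq}


samples = [set() for _ in range(40)]
for s in samples[:3]:
    s.add(("g1", 0))   # hit in 3 of 40 samples: 7.5%
for s in samples[:2]:
    s.add(("g1", 1))   # hit in 2 of 40 samples: exactly 5%, not above the cutoff
print(high_frequency_windows(samples))
```

Reads that later align to a window in the returned set would be removed by the third filtering process.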
In some embodiments of the invention, step (3) when aligning the sequencing data set subjected to the first filtering process and the second filtering process with a microorganism genomic database, determining the microorganism sequencing data from the microorganism in the sequencing data set further based on at least one of:
retaining sequences in the sequencing data set whose alignment length ratio is greater than 90%;
retaining sequences in the sequencing data set whose proportion of mismatched bases is less than 5%;
retaining alignment-specific sequences, wherein the alignment scores of a sequence at the different positions to which it aligns are compared, and a sequence whose ratio of suboptimal alignment score to optimal alignment score is less than 0.8 is selected as an alignment-specific sequence. When a sequence is aligned with software such as BWA, an alignment score is calculated for each position to which the sequence aligns. The optimal alignment score and the suboptimal alignment score (the next best score after the optimal one) are taken; the greater the difference between them, the more specific the sequence.
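The three retention criteria above can be sketched as a single predicate. The thresholds (90%, 5%, 0.8) are from the text; everything else is an assumption for illustration: the `Alignment` record is hypothetical, the mismatch proportion is taken relative to the aligned length, a read with no second hit is treated as specific, and all three criteria are applied together although the text allows any subset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Hypothetical summary of one read's alignment result."""
    read_length: int
    aligned_length: int
    mismatches: int
    best_score: float
    second_best_score: Optional[float]  # None when there is no second hit


def keep_alignment(a: Alignment) -> bool:
    if a.read_length <= 0 or a.aligned_length <= 0 or a.best_score <= 0:
        return False
    # Criterion 1: alignment length ratio greater than 90%.
    length_ok = a.aligned_length / a.read_length > 0.90
    # Criterion 2: mismatched bases below 5% (denominator is an assumption).
    mismatch_ok = a.mismatches / a.aligned_length < 0.05
    # Criterion 3: suboptimal / optimal alignment score below 0.8.
    if a.second_best_score is None:
        specific = True
    else:
        specific = a.second_best_score / a.best_score < 0.8
    return length_ok and mismatch_ok and specific


print(keep_alignment(Alignment(100, 95, 2, 95.0, 60.0)))
print(keep_alignment(Alignment(100, 95, 2, 95.0, 90.0)))
```

The second alignment is rejected only because its suboptimal score is too close to the optimal one, meaning the read maps almost equally well to two places.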
In some embodiments of the invention, the method further comprises:
prior to performing step (1), pre-processing raw sequencing data from the host sample to obtain the set of sequencing data, the pre-processing comprising filtering out at least one of the following: sequences that share a run of 10 bp or more consecutive bases with the adapter sequence; reads whose length is below a predetermined threshold; and sequences in which bases with a quality value below 5 account for more than 50% of the total number of bases.
In some embodiments of the present invention, the predetermined threshold is 50-55 bp.
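The pre-processing rules can be sketched as a read-level filter. The numeric limits (10 bp shared run, quality value 5, 50% of bases, a length threshold in the 50-55 bp range) follow the text; the function names, the adapter string, and the use of exact substring matching to detect a shared run are assumptions for illustration.

```python
from typing import List


def shares_adapter_run(read: str, adapter: str, min_run: int = 10) -> bool:
    """True if the read contains any `min_run`-base stretch of the adapter."""
    return any(adapter[i:i + min_run] in read
               for i in range(len(adapter) - min_run + 1))


def passes_qc(read: str, quals: List[int], adapter: str,
              min_length: int = 50, low_q: int = 5,
              max_low_frac: float = 0.5) -> bool:
    """Assumes one quality value per base (len(quals) == len(read))."""
    if len(read) < min_length:            # read shorter than the threshold
        return False
    if shares_adapter_run(read, adapter):  # adapter contamination
        return False
    low = sum(1 for q in quals if q < low_q)
    if low / len(quals) > max_low_frac:    # too many low-quality bases
        return False
    return True


adapter = "AGATCGGAAGAGCACACGTC"          # placeholder adapter for the example
good = "ACGT" * 15                         # 60 bp, no adapter run
print(passes_qc(good, [30] * 60, adapter))
print(passes_qc(good + adapter[:10], [30] * 70, adapter))
```

The second read fails because it ends with a 10-base stretch of the adapter.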
In some embodiments of the invention, the method further comprises at least one of annotating and visualizing the sequencing data of the microorganism.
In some embodiments of the invention, the annotation process is selected from at least one of: the number of aligned sequences, which refers to the number of sequences aligned to a species at the species level; the number of uniquely aligned sequences, which refers to the number of sequences that align uniquely at the species or genus level; coverage, which refers to the percentage of the detected length of the nucleic acid sequence of the microorganism relative to the length of the entire genomic sequence of the microorganism; depth of coverage, which refers to the average depth of bases over the covered range of the genome; relative abundance, which refers to the proportion of a microorganism detected at the species or genus level among all microorganisms of the same type detected in the sample; and distribution randomness.
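Coverage, depth of coverage, and relative abundance as defined above can be computed from alignment intervals and read counts. The sketch below is illustrative only: the half-open `(start, end)` interval representation, the function names, and the species labels are assumptions, and the per-base loop favors clarity over speed.

```python
from typing import Dict, List, Tuple


def coverage_and_depth(intervals: List[Tuple[int, int]],
                       genome_length: int) -> Tuple[float, float]:
    """Return (coverage, mean depth over covered bases).

    Intervals are half-open [start, end) positions of aligned reads.
    """
    depth = [0] * genome_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    coverage = covered / genome_length if genome_length else 0.0
    mean_depth = sum(depth) / covered if covered else 0.0
    return coverage, mean_depth


def relative_abundance(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each taxon among taxa of the same type (e.g. all bacteria)."""
    total = sum(read_counts.values())
    return {k: v / total for k, v in read_counts.items()} if total else {}


cov, dep = coverage_and_depth([(0, 50), (25, 75)], genome_length=100)
print(cov, dep)
print(relative_abundance({"species_x": 30, "species_y": 10}))
```

In this example 75 of 100 bases are covered, and the 100 aligned bases spread over those 75 positions give a mean depth of about 1.33.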
According to another aspect of the present invention, there is provided a device for performing microbiological analysis of a host sample, comprising:
a host data filtering module that performs a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that can be aligned with the host genome database, the set of sequencing data resulting from performing metagenomic sequencing on the host sample;
a homologous data filtering module connected to the host data filtering module, wherein the homologous data filtering module performs a second filtering process on the sequencing data set by using a homologous database, and the homologous database includes at least a part of known plasmid sequences, drug-resistant sequences and bacterial virulence sequences, so as to remove the sequencing data that can be aligned with the homologous database from the sequencing data set;
and the microorganism data comparison module is connected with the host data filtering module or the homologous data filtering module, and compares the sequencing data set subjected to the first filtering treatment and the second filtering treatment with a microorganism genome database so as to determine the microorganism sequencing data from the microorganisms in the sequencing data set.
According to an embodiment of the present invention, the above apparatus for performing microbiological analysis on a host sample may further have the following technical features:
in some embodiments of the invention, the host genome database is a human genome database.
In some embodiments of the invention, the human genome database comprises human reference genomic sequences and Yanhuang genomic sequences.
In some embodiments of the invention, the microbial genome database is constructed by: (a) obtaining known microbial genome sequences and constructing a primary microbial genome database, wherein the microbial genome sequences in the primary microbial genome database comprise at least one of whole genome sequence information, chromosome sequence information, Scaffold sequence information and Contig sequence information; (b) performing redundancy removal on the sequences in the primary microbial genome database so as to obtain a redundancy-removed microbial genome database, wherein redundancy removal refers to the removal of sequences with similarity of more than 99%; (c) based on the redundancy-removed microbial genome database, for a species for which a plurality of genomic sequences exist, selecting the microbial genome sequence of a representative strain and removing the microbial genome sequences of the other strains of the species from the redundancy-removed microbial genome database, so as to obtain the microbial genome database.
In some embodiments of the invention, the representative strain is obtained by: for a species for which genomic sequences of multiple strains exist, determining the average identity between the genomic sequences of every two strains; obtaining a similarity matrix between the genomic sequences of the strains in the species based on these pairwise average identities; and, based on the similarity matrix, selecting the strain with the greatest average similarity to the sequences of the other strains as the representative strain.
In some embodiments of the present invention, the microorganism data alignment module further comprises a third filtering process on the microorganism sequencing data by using a high frequency alignment site database to remove sequencing data aligned to the high frequency alignment site, wherein the high frequency alignment site database is constructed by the following steps: comparing metagenomic sequencing data of a plurality of samples to the microbial genome database, wherein the microbial genome is pre-divided into a plurality of predetermined windows, so as to determine the number of metagenomic sequencing data matching the windows; determining a plurality of high-frequency alignment sites constituting the high-frequency alignment site database based on the number of metagenomic sequencing data matching the window.
In some embodiments of the invention, the plurality of samples and the host sample belong to the same species.
In some embodiments of the present invention, an alignment site with an alignment frequency greater than 5% is used as the high-frequency alignment site, and the alignment frequency is a ratio of the number of samples aligned to the alignment site to the total number of samples.
In some embodiments of the invention, when the microorganism data alignment module aligns the sequencing data set subjected to the first filtering process and the second filtering process with a microorganism genomic database, the microorganism sequencing data from the microorganisms in the sequencing data set are further determined based on at least one of: retaining sequences in the sequencing data set whose alignment length ratio is greater than 90%; retaining sequences in the sequencing data set whose proportion of mismatched bases is less than 5%; and retaining alignment-specific sequences, wherein the alignment scores of a sequence at the different positions to which it aligns are compared, and a sequence whose ratio of suboptimal alignment score to optimal alignment score is less than 0.8 is selected as an alignment-specific sequence.
In some embodiments of the invention, the apparatus further comprises: a sequencing data quality control module connected to the host data filtering module, the sequencing data quality control module pre-processing raw sequencing data from the host sample to obtain the set of sequencing data, the pre-processing including filtering out at least one of the following:
sequences that share a run of 10 bp or more consecutive bases with the adapter sequence;
reads whose length is below a predetermined threshold; preferably, the predetermined threshold is 50-55 bp;
and sequences in which bases with a quality value below 5 account for more than 50% of the total number of bases.
In some embodiments of the invention, the apparatus further comprises: and the data output module is connected with the microorganism data comparison module and is used for performing at least one of annotation processing and visualization processing on the microorganism sequencing data.
In some embodiments of the invention, the annotation process is selected from at least one of:
the number of aligned sequences, which refers to the number of sequences of the species aligned at the species level.
Unique number of aligned sequences, which refers to the number of sequences that are uniquely aligned to a species (genus).
Coverage, which is the percentage of the detected length of the nucleic acid sequence of the microorganism to the length of the entire genomic sequence of the microorganism.
Depth of coverage, which refers to the average depth of bases over the range of coverage on the genome.
Relative abundance, which refers to the proportion of the microorganisms detected at the species (genus) level among the same type of microorganisms detected throughout the sample.
And distribution randomness.
The beneficial effects obtained by the invention are as follows: the method and the device for carrying out the microbiological analysis on the host sample have high automation degree, reduce the requirements of experiments and technical conditions, and improve the detection flux, range and precision. The method is particularly suitable for performing high-precision analysis on the microorganisms by using metagenome sequencing data.
In addition, the method provided by the invention improves the accuracy, result report readability and calculation efficiency of microorganism detection through the following aspects:
(1) Constructing a high-quality genome database. Through calculation of average genome sequence similarity or cluster analysis, typical representative sequences are selected at the species level for four types of microorganisms (bacteria, fungi, viruses, and parasites), which expands the detection range, reduces the size of the sequence library, and improves the computational efficiency and accuracy of downstream alignment analysis.
(2) Developing a high-precision detection algorithm. Homologous sequence filtering, filtering against the high-frequency alignment site database, and methods such as detecting the randomness of sequence distribution using information entropy and depth ratio reduce the influence of systematic errors and contamination on detection and improve analysis accuracy.
(3) Comprehensive graphical result reports. Graphical reports of single-sample detection parameters, whole-genome distribution, and intra-batch detection are output, which improves interpretation accuracy and shortens the interpretation cycle.
Drawings
Fig. 1 is a diagram of a similarity network provided in accordance with an embodiment of the present invention.
FIG. 2 is a genome-wide alignment frequency map of a whipworm, provided according to an embodiment of the invention.
FIG. 3 is a visualization presentation provided in accordance with an embodiment of the present invention.
Fig. 4 is a graphical result diagram of parameters of partially detected species of sample 17S0270988 provided according to an embodiment of the present invention.
Fig. 5 is a graph showing the intra-batch detection results of the species Nocardia_farcinica across all samples, provided according to an embodiment of the present invention.
FIG. 6 is a genomic distribution map of the detected sequences of the species Nocardia_farcinica, provided according to an embodiment of the present invention.
FIG. 7 is a schematic view of an apparatus for performing microbiological analysis of a host sample provided in accordance with an embodiment of the present invention.
FIG. 8 is a schematic view of an apparatus for performing microbiological analysis of a host specimen provided in accordance with an embodiment of the present invention.
FIG. 9 is a schematic view of an apparatus for performing microbiological analysis of a host specimen provided in accordance with an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.
Construction of a microbial genome database
The invention provides a method for constructing a microbial genome database, which comprises the following steps: (a) obtaining known microbial genome sequences and constructing a primary microbial genome database, wherein the microbial genome sequences in the primary microbial genome database comprise at least one of whole genome sequence information, chromosome sequence information, Scaffold sequence information and Contig sequence information; (b) removing redundancy from the sequences in the primary microbial genome database so as to obtain a redundancy-removed microbial genome database, wherein redundancy removal refers to the removal of sequences with a similarity of more than 99%; (c) based on the redundancy-removed microbial genome database, for each species for which a plurality of genomic sequences exist, selecting the microbial genome sequence of a representative strain and removing the microbial genome sequences of the other strains of the species from the redundancy-removed microbial genome database, so as to obtain the microbial genome database.
In one embodiment of the present invention, the representative strain is obtained by the following method: for a species for which genomic sequences of multiple strains exist, determining the average identity between the genomic sequences of every two strains; obtaining a similarity matrix between the genomic sequences of the strains of the species based on the average identity between the genomic sequences of every two strains; and, based on the similarity matrix, selecting the strain having the greatest average similarity to the sequences of the other strains as the representative strain.
In this context, "average identity" is a measure of the similarity between the genomic sequences of two strains. The genome of strain A is cut into many short sequences, which are aligned to the reference genome sequence of strain B. All short sequences whose aligned length exceeds 90% and whose number of mismatched bases is below 5% are counted; if their total length is RLS, their total aligned length is MLS, and their total number of mismatched bases is ENS, then the average identity from A to B is (MLS-ENS)/RLS.
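The calculation of (MLS-ENS)/RLS can be illustrated with a minimal Python sketch. The `FragmentAlignment` record and the function name are hypothetical, and the sketch assumes that the 90% threshold is taken relative to the fragment length and the 5% threshold relative to the aligned length; the text does not state the denominators explicitly.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FragmentAlignment:
    """One short sequence cut from strain A and aligned to strain B (hypothetical record)."""
    fragment_length: int  # length of the short sequence
    aligned_length: int   # number of its bases aligned to strain B
    mismatches: int       # mismatched bases within the aligned part


def average_identity(alignments: Iterable[FragmentAlignment],
                     min_aligned_fraction: float = 0.90,
                     max_mismatch_fraction: float = 0.05) -> float:
    """Return (MLS - ENS) / RLS over the fragments that pass both thresholds."""
    rls = mls = ens = 0
    for a in alignments:
        if a.fragment_length == 0 or a.aligned_length == 0:
            continue
        # Assumed reading: aligned length must exceed 90% of the fragment length.
        if a.aligned_length / a.fragment_length <= min_aligned_fraction:
            continue
        # Assumed reading: mismatches must be below 5% of the aligned length.
        if a.mismatches / a.aligned_length >= max_mismatch_fraction:
            continue
        rls += a.fragment_length
        mls += a.aligned_length
        ens += a.mismatches
    return (mls - ens) / rls if rls else 0.0


example = [
    FragmentAlignment(100, 100, 2),  # kept
    FragmentAlignment(100, 95, 1),   # kept
    FragmentAlignment(100, 50, 0),   # dropped: only 50% aligned
]
identity_a_to_b = average_identity(example)  # (195 - 3) / 200
```

Because the fragments come from strain A and the reference from strain B, the value is directional; the identity from B to A would be computed by swapping the roles of the two genomes.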
In this context, the term "similarity matrix" refers to a matrix arrangement of all strains in which each entry is the average identity between the strains of the corresponding row and column. If a microorganism has M strains, numbered 1, 2, 3, ..., M, the average identity between every two strains is calculated to obtain a two-dimensional similarity matrix with M rows and M columns. The value in row i and column j of the matrix is the average identity from strain i to strain j. The average of the i-th row represents the degree of similarity between the i-th strain and the other strains: the strain with the largest row average is the most similar to the others, and the strain with the smallest row average is the least similar.
The above values are plotted as a similarity network, as shown in fig. 1. Each sphere represents a strain; a connecting line is drawn between two strains if their similarity exceeds 95%, and the thickness of the line indicates the degree of similarity (the thicker the line, the greater the similarity). The strain marked in the figure as having the highest average similarity to the other strains serves as the final representative strain in the database.
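The selection of a representative strain from the similarity matrix can be sketched as follows. The function name and the toy matrix are hypothetical; the sketch excludes the diagonal from each row average, which is an assumption and does not change the ranking when every diagonal entry is equal.

```python
from typing import List, Sequence, Tuple


def select_representative(strains: Sequence[str],
                          identity: List[List[float]]) -> Tuple[str, float]:
    """Pick the strain whose row of the similarity matrix has the largest mean.

    identity[i][j] is the average identity from strain i to strain j.
    """
    best_name, best_mean = "", float("-inf")
    for i, name in enumerate(strains):
        # Mean identity of strain i to every other strain (diagonal skipped).
        others = [identity[i][j] for j in range(len(strains)) if j != i]
        mean = sum(others) / len(others) if others else 0.0
        if mean > best_mean:
            best_name, best_mean = name, mean
    return best_name, best_mean


strains = ["strain_1", "strain_2", "strain_3"]  # illustrative names only
matrix = [
    [1.00, 0.98, 0.96],
    [0.97, 1.00, 0.99],
    [0.95, 0.96, 1.00],
]
representative, mean_similarity = select_representative(strains, matrix)
```

In this toy matrix the second strain has the highest row average, so it would be kept and the other two strains removed from the database.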
According to embodiments of the invention, the known microbial genome sequences may be downloaded from the public databases of NCBI, PATRIC, EuPathDB.
Method for microbiological analysis of host samples
According to another aspect of the present invention, there is provided a method of performing a microbiological analysis of a host sample comprising: (1) performing a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that is alignable with the host genome database, the set of sequencing data resulting from metagenomic sequencing of the host sample; (2) performing a second filtering process on the sequencing data set using a homology database, the homology database including at least a portion of known plasmid sequences, drug-resistant sequences, bacterial virulence sequences, so as to remove the sequencing data from the sequencing data set that can be aligned with the homology database; (3) comparing the sequencing data set subjected to the first filtering treatment and the second filtering treatment with a microorganism genomic database to determine microorganism sequencing data from the microorganisms in the sequencing data set.
According to an embodiment of the present invention, the host genome database comprises both the human reference genome (hg38) downloaded from the NCBI official website and the Yanhuang (YH) genome sequence downloaded from the official website of the YH genome public database. The main component of the sample nucleic acid is human nucleic acid. To improve the accuracy and efficiency of pathogenic microorganism detection, the sequencing data are first aligned to the human reference genome sequences and filtered, and only then are the pathogenic microorganism sequences aligned and annotated. Accordingly, in step (1) the sequencing data are aligned to the host reference genome file, and a sequence is filtered out when its aligned length reaches 90% of its full length.
According to an embodiment of the invention, the homologous sequence file is obtained by curating a homologous sequence library. According to embodiments of the invention, plasmid (NCBI), drug resistance (CARD) and bacterial virulence (VFDB) sequences can be downloaded from public databases, gene sequences with ambiguous definitions are removed, and the resulting homologous sequence library is used for filtering homologous sequences. A sequence is filtered out when its aligned length reaches 90% of its full length.
According to an embodiment of the present invention, high-frequency alignment site data are removed in step (3). The efficiency of host sequence filtering depends on the completeness of the sequence library. However, the human genome assembly still contains undetermined regions, and genomes show a certain degree of individual variation (polymorphism); as a result, some human-specific sequences survive the host removal module and are carried into downstream pathogen annotation. If the genomic sequence of a pathogenic microorganism is similar to these regions, a false-positive detection results, that is, a human nucleic acid sequence is mistaken for a pathogenic microorganism. In addition, some systematic errors, such as reagents, sequencing errors or bias introduced by the sequencing platform, and laboratory background contamination, produce the same sequences repeatedly across experiments. Based on this feature, aligned sequences falling on high-frequency (> 5%) alignment sites are filtered out to reduce false-positive detections.
The term "high-frequency alignment site" refers to a position in the genome of a species that is aligned at high frequency across a sample library (historical samples). Such alignments mainly reflect systematic errors caused by sequencing bias, alignment artifacts, homologous similarity and the like, and therefore need to be filtered out. The high-frequency alignment sites are obtained from a curated high-frequency alignment site library, which is constructed as follows: N clinical samples are selected, human sequences are removed from their raw sequencing data, and the remaining data are aligned to the microbial genome database. The microbial genome database is divided into a number of windows (sites), and the frequency of each window is counted as follows: if, according to the alignment results, k samples have sequences aligned to a window, the frequency of that window is k/N. Correspondingly, filtering with the library proceeds as follows: for a new sample, after the human sequences are removed, the remaining data are aligned to the microbial genome database; if the window containing the alignment position of a sequence has a frequency greater than 5%, the alignment result of that sequence is filtered out and does not take part in subsequent statistical analysis. In practice, the metagenomic sequencing data of more than 30 samples are aligned to the microbial genome database with alignment software, and the frequency with which aligned sequences occur across all samples is counted in fixed windows on each microbial genome, yielding a high-frequency alignment site library used downstream to filter high-frequency alignment site sequences. FIG. 2 shows the alignment frequency distribution over part of the whole genome of the whipworm; it can be seen that some genomic regions are hit by aligned sequences at high frequency in the historical samples.
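The construction and use of the high-frequency alignment site library can be sketched in Python as follows. The window size of 1000 bases, the `(genome_id, position)` representation of an alignment and all function names are assumptions made for illustration; the text specifies only fixed windows, the k/N frequency and the 5% threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int]     # (genome identifier, 0-based alignment position)
Window = Tuple[str, int]  # (genome identifier, window index)


def build_high_freq_library(samples: List[Iterable[Hit]],
                            window_size: int = 1000) -> Dict[Window, float]:
    """Return the frequency k/N of every window hit by at least one historical sample."""
    n = len(samples)
    counts: Dict[Window, int] = defaultdict(int)
    for hits in samples:
        # A sample counts once per window, however many reads it places there.
        for window in {(genome, pos // window_size) for genome, pos in hits}:
            counts[window] += 1
    return {window: k / n for window, k in counts.items()}


def filter_high_freq(hits: Iterable[Hit],
                     library: Dict[Window, float],
                     window_size: int = 1000,
                     max_frequency: float = 0.05) -> List[Hit]:
    """Drop alignments whose window frequency in the library exceeds the threshold."""
    return [(genome, pos) for genome, pos in hits
            if library.get((genome, pos // window_size), 0.0) <= max_frequency]


# Toy history of N = 40 samples with human sequences already removed.
history: List[List[Hit]] = [[] for _ in range(40)]
history[0] = [("genome_A", 1500), ("genome_A", 5200)]
history[1] = [("genome_A", 1501)]
history[2] = [("genome_A", 1999), ("genome_A", 1000)]
library = build_high_freq_library(history)

new_sample = [("genome_A", 1700), ("genome_A", 5900), ("genome_B", 10)]
kept = filter_high_freq(new_sample, library)
```

Here window 1 of `genome_A` is hit by 3 of 40 samples (7.5%), so the read at position 1700 is discarded, while the window containing position 5900 has a frequency of 2.5% and its read is kept.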
According to an embodiment of the invention, in step (3), when the sequencing data set subjected to the first and second filtering processes is aligned with the microbial genome database, the microorganism sequencing data in the sequencing data set are determined based on at least one of the following: retaining sequences with an aligned length of more than 90%, that is, sequences for which the length aligned to the reference sequence exceeds 90% of the full length of the sequence; retaining sequences with fewer than 5% mismatched bases, that is, sequences for which, within the aligned part, the proportion of bases that differ from the reference sequence (for example owing to sequencing errors) is below 5%; and retaining sequences with alignment specificity, where the alignment scores of a sequence aligned to different positions are collected and a sequence whose ratio of suboptimal to optimal alignment score is less than 0.8 is regarded as specifically aligned. Because genomes contain repeated and highly similar segments, and homologous sequences exist between different species, reads from such regions produce identical or similar alignment scores at several positions. Specifically aligned sequences are therefore screened by the score ratio of the multiple alignment results (suboptimal score divided by optimal score less than 0.8) to obtain "uniquely" aligned sequences.
For example, when aligning with the bwa software, in a tag such as AS:i:50 (where AS denotes the optimal alignment score in bwa and i indicates that the value is an integer), 50 is the optimal alignment score; in a tag such as XS:i:45 (where XS denotes the suboptimal alignment score), 45 is the suboptimal alignment score. The principle is as follows: if a sequence can be aligned to multiple positions in the database, the alignment software calculates a score for each aligned position and takes the two highest-ranking scores as the optimal and suboptimal alignment scores, which may be equal. The ratio of the suboptimal to the optimal alignment score is required to be less than 0.8 for the optimal alignment to be considered sufficiently specific.
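The three screening criteria can be combined in a short Python sketch. The `Hit` record and function names are hypothetical; the sketch applies all three criteria together, treats a missing XS tag as a suboptimal score of 0, and measures the mismatch proportion against the aligned length, all of which are assumptions rather than statements of the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """Summary of one read alignment (hypothetical record)."""
    read_length: int
    aligned_length: int
    mismatches: int
    best_score: int                  # value of the AS:i tag
    suboptimal_score: Optional[int]  # value of the XS:i tag, None if absent


def parse_scores(optional_fields: List[str]) -> Tuple[Optional[int], Optional[int]]:
    """Extract the AS and XS values from SAM optional fields such as 'AS:i:50'."""
    tags = {}
    for field in optional_fields:
        name, value_type, value = field.split(":", 2)
        if value_type == "i":
            tags[name] = int(value)
    return tags.get("AS"), tags.get("XS")


def is_high_quality(hit: Hit,
                    min_aligned: float = 0.90,
                    max_mismatch: float = 0.05,
                    max_score_ratio: float = 0.8) -> bool:
    """Apply the aligned-length, mismatch and specificity criteria together."""
    if hit.read_length <= 0 or hit.aligned_length <= 0 or hit.best_score <= 0:
        return False
    if hit.aligned_length / hit.read_length <= min_aligned:
        return False
    if hit.mismatches / hit.aligned_length >= max_mismatch:
        return False
    suboptimal = hit.suboptimal_score or 0
    return suboptimal / hit.best_score < max_score_ratio


best, second = parse_scores(["NM:i:0", "AS:i:50", "XS:i:45"])
ambiguous = Hit(read_length=50, aligned_length=50, mismatches=0,
                best_score=best, suboptimal_score=second)
specific = Hit(read_length=50, aligned_length=50, mismatches=1,
               best_score=50, suboptimal_score=30)
```

With the tag values from the text (AS 50, XS 45) the ratio is 0.9, so `ambiguous` is rejected as insufficiently specific, whereas `specific` has a ratio of 0.6 and passes all three criteria.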
According to an embodiment of the invention, the method further comprises: prior to performing step (1), pre-processing raw sequencing data from the host sample to obtain the set of sequencing data, the pre-processing comprising filtering out at least one of the following sequences:
sequences that share a run of 10 bp or more consecutive bases with the adaptor sequence; that is, if a sequence contains a contiguous fragment of 10 bp or more that is identical to a fragment of the adaptor sequence, the sequence is filtered out;
reads whose length is below a predetermined threshold; preferably, the predetermined threshold is 50-55 bp;
and sequences in which the ratio of the number of bases with a quality value of less than 5 to the total number of bases exceeds 50%. Each base in the sequencing data has a corresponding quality value; if the sequence length is N and the number of bases with a quality value below 5 is K, the sequence is filtered out when K/N is greater than 50%.
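The three pre-processing filters can be sketched as a single Python predicate. The function names and the adaptor string are placeholders chosen for illustration, not the adaptor of any particular platform, and the sketch treats a shared run of exactly 10 bases as sufficient for removal, which is one reading of the threshold.

```python
from typing import Sequence


def shares_adaptor_run(read: str, adaptor: str, min_run: int = 10) -> bool:
    """True if the read contains any contiguous adaptor fragment of min_run bases."""
    if len(adaptor) < min_run:
        return False
    adaptor_kmers = {adaptor[i:i + min_run]
                     for i in range(len(adaptor) - min_run + 1)}
    return any(read[i:i + min_run] in adaptor_kmers
               for i in range(len(read) - min_run + 1))


def passes_qc(read: str, qualities: Sequence[int], adaptor: str,
              min_length: int = 50, low_quality: int = 5,
              max_low_fraction: float = 0.5) -> bool:
    """Keep a read only if it passes the adaptor, length and quality filters."""
    if not read or len(read) < min_length:
        return False
    if shares_adaptor_run(read, adaptor):
        return False
    low = sum(1 for q in qualities if q < low_quality)  # K
    return low / len(qualities) <= max_low_fraction     # filter when K/N > 50%


ADAPTOR = "AGATCGGAAGAGCACACGTC"  # placeholder adaptor sequence for this example
clean_read = "ACGT" * 15          # 60 bp, no adaptor fragment
contaminated = "ACGT" * 10 + "AGATCGGAAGAG" + "ACGT" * 3
short_read = "ACGT" * 10          # 40 bp

results = [
    passes_qc(clean_read, [30] * 60, ADAPTOR),
    passes_qc(contaminated, [30] * len(contaminated), ADAPTOR),
    passes_qc(short_read, [30] * 40, ADAPTOR),
    passes_qc(clean_read, [2] * 31 + [30] * 29, ADAPTOR),
]
```

Only the first read is retained: the second carries a 12-base adaptor fragment, the third is shorter than 50 bp, and in the fourth 31 of 60 bases have a quality value below 5.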
In this context, the raw sequencing data are obtained by preparing a sequencing library from the nucleic acid of the sample to be tested and sequencing that library. According to an embodiment of the invention, obtaining the sequencing data comprises: obtaining the nucleic acid in the sample to be tested, preparing a sequencing library from the nucleic acid, and sequencing the sequencing library. The sequencing library is prepared according to the requirements of the selected sequencing method. Depending on the sequencing platform chosen, the sequencing method may be, but is not limited to, the HiSeq 2000/2500 platform of Illumina, the Ion Torrent platform of Life Technologies, the BGISEQ platform of BGI, or a single-molecule sequencing platform. The sequencing mode may be single-end or paired-end sequencing, and the data produced by the sequencer consist of sequenced fragments called reads.
The term "aligned" means matched. Known alignment software, such as SOAP, BWA or TeraMap, may be used for the alignment, and this embodiment is not limited in this respect. During alignment, depending on the alignment parameter settings, at most n base mismatches are allowed for a pair of reads or for a single read, where n is set, for example, to 1 or 2. If more than n bases of a pair of reads are mismatched, the pair of reads is considered not alignable to the reference sequence; if all n mismatched bases are located in one read of the pair, that read of the pair is considered not alignable to the reference sequence.
According to an embodiment of the present invention, the method for microbiological analysis of the host sample may further comprise performing annotation analysis or visualization processing on the microorganism sequencing data. The annotation analysis may include one or more of the following:
Number of aligned sequences (MRN): the number of sequences aligned to the species at the species level.
Number of uniquely aligned sequences (SMRN): the number of sequences uniquely aligned to a species or genus.
Coverage (CovRate): the percentage of the whole genome length of the microorganism that is covered by the detected nucleic acid sequences of that microorganism.
Depth of coverage (CovDepth): the average depth of the bases in the covered region of the genome.
Relative abundance (Re_Abu): the proportion of the microorganism detected at the species (genus) level among all microorganisms of the same type detected in the sample.
Distribution randomness: metagenomic sequencing is shotgun sequencing, so the sequencing depth of a true-positive pathogenic microorganism follows a Poisson distribution model. Based on the number of aligned sequences and the distribution of their alignment positions over the whole genome, the method calculates the ratio of the theoretical coverage depth (Formula I) to the actual depth of the detected pathogen (Depth_ratio) and the information entropy of the distribution of the detected sequences over the whole genome (Shannon_Index, Formula II), and uses them to identify false positives.
Formula I (theoretical depth):
Figure BDA0001822056850000131
In Formula I, N represents the number of sequenced bases of the genome, L represents the length of the genome, and e represents the natural constant.
Formula II (information entropy):
Figure BDA0001822056850000132
In Formula II, the genome is divided evenly into n regions, and Pi is the proportion of aligned sequences falling in the i-th region.
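As an illustration of the distribution entropy, the following Python sketch computes the standard Shannon entropy of aligned-read counts over n equal regions. It assumes the usual form of the entropy, the negative sum of Pi multiplied by the logarithm of Pi, with the natural logarithm; the base of the logarithm and the function name are assumptions, and the sketch is not a transcription of Formula II.

```python
import math
from typing import Iterable


def shannon_index(positions: Iterable[int], genome_length: int, n_regions: int) -> float:
    """Shannon entropy of aligned-read positions over n equal genome regions."""
    counts = [0] * n_regions
    for pos in positions:
        # Map a 0-based position to its region; clamp the last base into the last region.
        index = min(pos * n_regions // genome_length, n_regions - 1)
        counts[index] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    # Regions with no reads contribute nothing (p * log p tends to 0 as p tends to 0).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


evenly_spread = shannon_index([0, 250, 500, 750], genome_length=1000, n_regions=4)
clustered = shannon_index([10, 20, 30, 40], genome_length=1000, n_regions=4)
```

Reads spread evenly over all four regions give the maximum entropy, ln 4, whereas reads piled into a single region give 0; a low entropy therefore points to a non-random, suspicious distribution.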
According to an embodiment of the present invention, the visualization processing, also called visualization analysis, may include the following aspects. First aspect: the total amount of raw data from the sequencer and the amount of data remaining after each processing step are displayed visually, in order to judge whether the amount of sample data meets the standard. Second aspect: the pathogen detection results are visualized, with each parameter of the results displayed on a graph for judging false-positive detections; the pathogens detected in common across the samples of a batch are also counted to judge whether there is contamination within the batch. Third aspect: a reads distribution map is drawn for each detected pathogen, and the reliability of the detection is judged from the randomness of the distribution.
The method of the invention can therefore be used to perform microbiological analysis of a host sample, obtain data on the pathogenic microorganisms in the host sample, and generate a corresponding report. For example, a report in tex format can be generated automatically in the LaTeX language and converted into a detection analysis report in pdf format. The content of the report may include basic information of the subject, clinical information, sample information, detection results, result description, a list of suspected background microorganisms, references, and the like:
Basic information of the subject: may include name, type, age, hospital number, bed number, etc.
Clinical information: includes clinical manifestations, clinical tests (WBC, lymphocytes, neutrophils, CRP, PCT, culture results, identification results, microscopic examination results), clinical diagnosis, pathogens of concern, anti-infective medication, and the like.
Sample information: submitting unit, department, submitting doctor, sample collection date, report date, sample number, sample type, sample volume, etc.
Detection results: the listed species are all microorganisms detected in the sample, classified as bacteria, viruses, fungi, parasites, mycobacteria and atypical pathogens, and sorted from top to bottom within each class by the number of detected sequences; the higher a species ranks, the higher its relative content.
Result description: a summary of pathogen information from the interpretation database is given for the pathogenic microorganisms listed in the detection results.
List of suspected background microorganisms: the detected background microorganisms, i.e., those that occur with a frequency of greater than 50% in the sample library (historical samples) and have low pathogenicity, are listed in the appendix for reference.
References: the literature cited in the result description.
Device for microbiological analysis of host samples
According to another aspect of the present invention, there is provided a device for performing microbiological analysis on a host sample. According to a specific embodiment of the present invention, the apparatus includes a host data filtering module, a homologous data filtering module, and a microorganism data alignment module, wherein the microorganism data alignment module is connected to the host data filtering module or the homologous data filtering module. As shown in fig. 7, the homologous data filtering module is connected to the host data filtering module, and the microorganism data alignment module is connected to the homologous data filtering module. The host data filtering module performs a first filtering process on a sequencing data set from the host sample by using a host genome database so as to remove from the sequencing data set the sequencing data that can be aligned with the host genome database, the sequencing data set being obtained by metagenomic sequencing of the host sample. The homologous data filtering module performs a second filtering process on the sequencing data set by using a homology database, the homology database comprising at least one of known plasmid sequences, drug-resistance sequences and bacterial virulence sequences, so as to remove from the sequencing data set the sequencing data that can be aligned with the homology database. The microorganism data alignment module aligns the sequencing data set subjected to the first filtering process and the second filtering process with a microbial genome database to determine the microorganism sequencing data in the sequencing data set.
According to an embodiment of the present invention, the apparatus further comprises a sequencing data quality control module. As shown in fig. 8, the sequencing data quality control module is connected to the host data filtering module and preprocesses raw sequencing data from the host sample to obtain the sequencing data set.
According to an embodiment of the present invention, the apparatus further includes a data output module. As shown in fig. 9, the data output module is connected to the microorganism data alignment module and performs at least one of annotation processing and visualization processing on the microorganism sequencing data.
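The way the modules are chained can be illustrated with a toy Python sketch in which each module is a function from a list of reads to a list of reads. All names are hypothetical, and the prefix tests merely stand in for real alignment against the host, homology and microbial databases.

```python
from typing import Callable, List

Read = str
Module = Callable[[List[Read]], List[Read]]


def removal_module(is_alignable: Callable[[Read], bool]) -> Module:
    """Build a filtering module that drops reads alignable to a database."""
    def module(reads: List[Read]) -> List[Read]:
        return [r for r in reads if not is_alignable(r)]
    return module


def alignment_module(is_alignable: Callable[[Read], bool]) -> Module:
    """Build an alignment module that keeps only reads alignable to a database."""
    def module(reads: List[Read]) -> List[Read]:
        return [r for r in reads if is_alignable(r)]
    return module


def run_modules(reads: List[Read], modules: List[Module]) -> List[Read]:
    """Pass the reads through the connected modules in order."""
    for module in modules:
        reads = module(reads)
    return reads


# Toy stand-ins for aligners: a read "aligns" if it carries the given prefix.
host_filter = removal_module(lambda r: r.startswith("HOST"))
homolog_filter = removal_module(lambda r: r.startswith("PLASMID"))
microbe_align = alignment_module(lambda r: r.startswith("MICROBE"))

reads = ["HOST_1", "PLASMID_1", "MICROBE_1", "UNKNOWN_1", "MICROBE_2"]
microbial_reads = run_modules(reads, [host_filter, homolog_filter, microbe_align])
```

A quality control module would be prepended to the list and a data output module appended to it in the same way, mirroring the connections of figs. 8 and 9.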
The scheme of the invention is explained below with reference to examples. It will be appreciated by those skilled in the art that the following examples merely illustrate the invention and should not be taken as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the art or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Example 1
103 clinical infection samples were obtained from hospitals, of which 63 had known positive pathogens. The samples were sequenced on a BGISEQ-50 sequencing platform to obtain raw sequencing data, which were then analyzed with the method of the invention and with the control method, respectively.
First, the method of the invention is implemented
The raw sequencing data were analyzed using the following method:
(1) data quality control
The raw sequencing data were filtered in three respects:
The first aspect: sequences that share a run of 10 bp or more consecutive bases with the adaptor sequence are filtered out.
The second aspect: reads whose length is below a threshold (default 50 bp) are filtered out.
The third aspect: sequences in which the ratio of the number of bases with a quality value of less than 5 to the total number of bases exceeds 50% are filtered out.
(2) Host sequence removal
The sequences retained after step (1) are aligned to the host reference genome file, and a sequence is filtered out when its aligned length reaches 90% of its full length. The host genome database contains both the human reference genome (hg38) downloaded from the NCBI official website and the Yanhuang (YH) genome sequence downloaded from the official website of the YH genome public database.
(3) Removal of homologous sequences
The sequences remaining after step (2) are aligned to the homologous sequence file, and a sequence is filtered out when the proportion of its length that is aligned reaches 90%. The homologous sequence file is obtained by curating a homologous sequence library: plasmid (NCBI), drug resistance (CARD) and bacterial virulence (VFDB) sequences are downloaded from public databases, and gene sequences with ambiguous definitions are removed. The homologous sequence library thus obtained is used for filtering homologous sequences.
(4) Pathogen library comparison and quality control
A microbial genome database was first prepared as follows.
Pathogenic microorganism genome sequences were downloaded from the public databases of NCBI, PATRIC and EuPathDB, then screened and curated to build a clinically applicable pathogenic microorganism detection database. The curation method comprises the following steps:
The first aspect: genomes are screened by their Genome tag in descending order of completeness, with the priority Complete Genome/Chromosome/Scaffold/Contig.
The second aspect: genomic sequences that are completely identical or highly similar (more than 99% similarity) are removed as redundant.
The third aspect: one genome is cut into simulated short sequences, which are rapidly aligned to another genome with alignment software; the average identity between the two genomes is obtained from the alignment results; a similarity matrix is built for the genome sequences within the same species; and finally the genome with the greatest average similarity to the other genome sequences is selected as the representative sequence.
The microbial genome database is thereby obtained.
Next, the metagenomic sequencing data of more than 30 samples are aligned to the resulting microbial genome database with alignment software, and the frequency with which aligned sequences occur across all samples is counted in fixed windows on each microbial genome, yielding a high-frequency alignment site library for downstream filtering of high-frequency alignment site sequences.
The sequences remaining after step (3) are aligned to the microbial genome database (bacteria, fungi, parasites and viruses) with alignment software. The alignment results are sorted and PCR duplicates are removed with samtools; the de-duplicated alignment results are then quality-controlled, and high-quality alignment results are obtained according to the following screening principles:
Principle one: sequences with an aligned length of more than 90% are retained, i.e., for a single sequence, the length aligned to the reference sequence is more than 90% of the full length of the sequence.
Principle two: sequences with fewer than 5% mismatched bases are retained, i.e., within the aligned part, the proportion of bases that differ from the reference sequence (for example owing to sequencing errors) is less than 5%.
Principle three: sequences with alignment specificity are retained. Because genomes contain repeated and highly similar segments, and homologous sequences exist between different species, reads from such regions produce identical or similar alignment scores at several positions. Specifically aligned sequences are screened by the score ratio of the multiple alignment results (suboptimal alignment score divided by optimal alignment score less than 0.8).
The sequences that satisfy principles one, two and three are thus the uniquely aligned sequences.
Principle four: high-frequency alignment site sequences are filtered. Aligned sequences that fall on high-frequency (> 5%) alignment sites are filtered out to reduce false-positive detections.
The steps and workflow of the method are integrated into a software package named PMFISH, which runs under a Unix/Linux operating system and is invoked from the Unix/Linux command line.
The specific operation steps are as follows:
1. The following command is entered in a Linux terminal:
PMFISH <parameter file> <sample information file> <output directory>
Meaning of the PMFISH command line parameters:
<parameter file>: the parameter configuration file, which contains all analysis parameters.
<sample information file>: the sample information, including the number, type and sequencing data file of every sample in the batch.
<output directory>: the directory to which the results are written.
2. Data to be analyzed:
Sequencing data: the sequencing Fastq files of the 103 samples.
Databases: host.fa (reference sequence for host species), host.fa (pool of homologous sequences), bacteria.fa (representative sequence for bacterial species), virus.fa (representative sequence for viral species), fungi.fa (representative sequence for fungal species), protozoa.fa (representative sequence for parasitic species), highfreqmap.pos (pool of high frequency alignment sites for each species).
Sample initial information (sample information file): sample number, sample type, sample sequencing data.
(5) Pathogen annotation analysis
Based on the results of the alignment to the pathogenic microorganism reference genome database in step (4), the following indicators are calculated for each detected pathogenic microorganism:
Number of aligned sequences (MRN): the number of sequences aligned to the species at the species level.
Number of uniquely aligned sequences (SMRN): the number of sequences uniquely aligned to a species (genus).
Coverage (CovRate): the percentage of the whole genome length of the microorganism that is covered by the detected nucleic acid sequences of that microorganism.
Depth of coverage (CovDepth): the average depth of the bases in the covered region of the genome.
Relative abundance (Re_Abu): the proportion of the microorganism detected at the species (genus) level among all microorganisms of the same type detected in the sample.
Distribution randomness: metagenomic sequencing is shotgun sequencing, so the sequencing depth of a true-positive pathogenic microorganism follows a Poisson distribution model. Based on the number of aligned sequences and the distribution of their alignment positions over the whole genome, the method calculates the ratio of the theoretical coverage depth (Formula I) to the actual depth of the detected pathogen (Depth_ratio) and the information entropy of the distribution of the detected sequences over the whole genome (Shannon_Index, Formula II), and uses them to identify false positives.
Formula I (theoretical depth):
Figure BDA0001822056850000181
Formula II (information entropy):
Figure BDA0001822056850000182
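The coverage indicators of step (5) can be sketched in Python as follows. The half-open interval representation of aligned reads, the function names and the use of sequence counts for relative abundance are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def coverage_stats(intervals: Iterable[Tuple[int, int]],
                   genome_length: int) -> Tuple[float, float]:
    """Return (CovRate, CovDepth) for reads given as 0-based half-open intervals.

    Each interval must satisfy 0 <= start < end <= genome_length.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in intervals:
        delta[start] += 1
        delta[end] -= 1
    covered_bases = 0
    depth_sum = 0
    depth = 0
    for i in range(genome_length):
        depth += delta[i]
        if depth > 0:
            covered_bases += 1
            depth_sum += depth
    cov_rate = covered_bases / genome_length
    # Average depth is taken over the covered region only, as in the definition.
    cov_depth = depth_sum / covered_bases if covered_bases else 0.0
    return cov_rate, cov_depth


def relative_abundance(sequence_counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each species among the detected microorganisms of one type."""
    total = sum(sequence_counts.values())
    return {name: n / total for name, n in sequence_counts.items()} if total else {}


cov_rate, cov_depth = coverage_stats([(0, 50), (25, 75)], genome_length=100)
abundance = relative_abundance({"species_a": 30, "species_b": 10})
```

Two overlapping 50-base reads on a 100-base genome cover 75 bases, giving a coverage of 0.75 and an average depth of 100/75 over the covered region.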
(6) result visualization
The method comprises the following aspects:
The first aspect: the total amount of raw data from the sequencer and the amount of data remaining after each processing step are displayed visually, in order to judge whether the amount of sample data meets the standard (see fig. 3).
The second aspect: the pathogen detection results are visualized, with each parameter of the results displayed on a graph for judging false-positive detections (see fig. 4); the pathogens detected in common across the samples of a batch are also counted to judge whether there is contamination within the batch (see fig. 5).
The third aspect: a reads distribution map is drawn for each detected pathogen, and the reliability of the detection is judged from the randomness of the distribution (see fig. 6).
(7) Report generation
A report in .tex format is generated automatically using LaTeX and converted into a detection analysis report in PDF format. The report contains the following:
Basic information of the examinee: including name, type, age, hospital number, bed number, etc.
Clinical information: including clinical manifestations, clinical tests (WBC, lymphocytes, neutrophils, CRP, PCT, culture results, identification results, microscopic examination results), clinical diagnosis, pathogens of concern, anti-infective medication, and the like.
Sample information: submitting institution, department, submitting physician, sample collection date, report date, sample number, sample type, sample volume, etc.
Detection results: all microorganisms detected in the sample are listed, classified as bacteria, viruses, fungi, parasites, mycobacteria, and atypical pathogens; within each class they are sorted from top to bottom by number of detected sequences, so a species nearer the top has a higher relative content.
Result interpretation: for the pathogenic microorganisms listed in the detection results, a summary of the pathogen information held in the interpretation database is provided.
List of suspected background microorganisms: the detected background microorganisms are listed in an appendix.
References: the literature cited in the result interpretation.
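As an illustration of the report-generation step, the sketch below builds a small .tex document from a sample record and a list of detections, ordering the detections by class and then by descending number of detected sequences as described above. The dictionary keys (`sample_id`, `sample_type`, `category`, `species`, `reads`) and the document layout are hypothetical; the resulting text would then be compiled to PDF with a LaTeX engine such as pdflatex.

```python
# Characters that must be escaped in LaTeX text mode (subset sufficient for this sketch).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def tex_escape(text):
    """Escape LaTeX special characters in a value converted to text."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in str(text))


def build_report_tex(sample, detections):
    """Return the .tex source of a minimal detection report (hypothetical layout)."""
    # Group by class, then list the species with the most detected sequences first.
    rows = sorted(detections, key=lambda d: (d["category"], -d["reads"]))
    lines = [
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{Detection analysis report}",
        "Sample number: " + tex_escape(sample["sample_id"]) + r"\\",
        "Sample type: " + tex_escape(sample["sample_type"]),
        "",
        r"\begin{tabular}{lll}",
        r"Category & Species & Detected sequences \\ \hline",
    ]
    for d in rows:
        cells = (d["category"], d["species"], d["reads"])
        lines.append(" & ".join(tex_escape(c) for c in cells) + r" \\")
    lines += [r"\end{tabular}", r"\end{document}"]
    return "\n".join(lines) + "\n"
```

Escaping matters here because species identifiers such as Nocardia_farcinica contain underscores, which LaTeX would otherwise reject in text mode.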
Second, control method implementation
To better evaluate the key technical effects of the invention, a control method, ControlTest, is defined as follows: it uses the same database and alignment software as the invention, but with the following key techniques disabled: (1) homologous sequence filtering, (2) filtering against the high-frequency alignment site library, and (3) filtering of low-randomness detections using the information entropy value and the depth ratio. The control method runs on a Unix/Linux operating system from the command line.
The specific operation steps are as follows:
The following command is entered in a Linux terminal:
ControlTest <parameter file> <sample information file> <output directory>
Meaning of the ControlTest command-line parameters:
<parameter file>: the parameter configuration file, which contains all analysis parameters.
<sample information file>: the sample information, including the number, type, and sequencing data file of every sample in the batch.
<output directory>: the directory to which the results are written.
2. Data to be analyzed:
Sequencing data: FASTQ files of 103 samples.
Databases: host.fa (host species reference sequence), bacteria.fa (representative sequences of bacterial species), virus.fa (representative sequences of viral species), fungi.fa (representative sequences of fungal species), protozoa.fa (representative sequences of parasitic species).
Sample initial information (sample information): sample number, sample type, sample sequencing data.
Third, analysis results
Statistical analysis of the 103 samples with both methods showed that, relative to the control method, the method of the invention reduced false-positive detections by 36% overall (bacteria, 30%; viruses, 18%; fungi, 72%; parasites, 95%). Taking the bacterial detection results of sample 17S0270988 as an example, the outputs of the two methods are shown below:
Results of the control method
Table 1 shows the bacterial detection results for sample 17S0270988, reported with three parameters: coverage, coverage depth, and number of uniquely aligned sequences. A total of 33 bacteria with more than 10 detected sequences were detected.
TABLE 1 Detection results of the control method
Figure BDA0001822056850000201
Figure BDA0001822056850000211
Results of the method of the invention
Table 2 shows the detection results for sample 17S0270988 using the method of the invention. A species is reported as present in the sample when more than 10 sequences align to its reference genome and both the depth ratio (Depth_ratio) and the entropy value (Shannon_Index) exceed 0.75. A total of 20 bacteria were detected, about 39% fewer than with the control method, showing that false positives were effectively filtered out. Comparison with the clinical test data showed that the result is consistent with the clinical findings.
TABLE 2 Detection results of the method of the invention
Figure BDA0001822056850000212
Figure BDA0001822056850000221
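The reporting rule stated for Table 2 (more than 10 aligned sequences, with Depth_ratio and Shannon_Index both above 0.75) and the reported reduction from 33 to 20 bacteria can be checked with a few lines. This is a minimal sketch; the record keys and the function names are hypothetical and are not taken from the patent.

```python
def passes_filters(hit, min_reads=10, min_depth_ratio=0.75, min_shannon=0.75):
    """Apply the three thresholds described for Table 2 to one detection record."""
    return (
        hit["reads"] > min_reads
        and hit["depth_ratio"] > min_depth_ratio
        and hit["shannon_index"] > min_shannon
    )


def reduction(control_count, filtered_count):
    """Fractional reduction in detections relative to the control method."""
    return (control_count - filtered_count) / control_count


# 33 bacteria with the control method versus 20 with the method of the invention.
print(f"{reduction(33, 20):.1%}")
```

The printed value rounds to the figure of about 39% given in the text.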
In addition, the invention provides graphical results. The total amount of raw sequencing (off-machine) data and the amount of data after each processing step are displayed graphically to judge whether the data volume of each sample meets the standard, as shown in FIG. 3. The ordinate of FIG. 3 is the sample number and sample type (Sample ID). The first histogram shows the number of raw reads (Raw Reads); the two dotted lines mark 8M and 12M and are used to check whether the raw data volume of each sample meets the requirement. The second histogram shows the proportion of data filtered in quality control (Filter rate), and the third shows the proportion of human-derived sequences (Host rate), each plotted along the abscissa.
FIG. 4 shows the results for the sample graphically before filtering. The ordinate of FIG. 4 is the name of each pathogen detected in the sample, given as both the Latin name and the Chinese name. In the first (scatter) plot, the position of each point along the abscissa gives the coverage (Cover Rate); the size of the solid circle represents the covered length range, the size of the hollow circle represents the genome length range, and the circle sizes in the legend give the scale. The second histogram shows the normalized number of aligned sequences (SDMRN) and the normalized number of uniquely aligned sequences (SDSMRN), with SDMRN drawn solid and SDSMRN drawn hollow in the legend. The third histogram shows the depth ratio (Depth_ratio) and the information entropy (Shannon_Index), with solid representing the depth ratio and hollow representing the information entropy.
FIG. 5 shows the detection of Nocardia farcinica across the batch. As FIG. 5 shows, among all samples in the batch Nocardia_farcinica was detected only in this sample and not in the others, so cross contamination is ruled out and the detection of the species is confirmed. The left ordinate of FIG. 5 is the sample number and sample type. The first histogram shows the coverage (Cover Rate) of the pathogen in each sample; the second histogram shows the normalized number of aligned sequences (SDMRN) and the normalized number of uniquely aligned sequences (SDSMRN) for each sample, with SDMRN drawn solid and SDSMRN drawn hollow in the legend; the third histogram shows the depth ratio (Depth_ratio) and the information entropy (Shannon_Index) for each sample, with solid representing the depth ratio and hollow representing the information entropy.
FIG. 6 shows the genome-wide distribution of the detected sequences of Nocardia_farcinica (Nocardia farcinica), which cover 1.08% of the whole genome; the abscissa is the base position on the Nocardia_farcinica genome and the ordinate is the number of reads. The detected sequences are evenly distributed across the genome, which indicates that the detection of this species is highly reliable and further confirms that the species was detected with good randomness.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method of performing a microbiological analysis of a host sample comprising:
(1) performing a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that is alignable with the host genome database, the set of sequencing data resulting from metagenomic sequencing of the host sample;
(2) performing a second filtering process on the sequencing data set using a homology database, the homology database including at least a portion of known plasmid sequences, drug-resistant sequences, bacterial virulence sequences, so as to remove the sequencing data from the sequencing data set that can be aligned with the homology database;
(3) comparing the sequencing data set subjected to the first filtering treatment and the second filtering treatment with a microorganism genomic database to determine microorganism sequencing data from the microorganisms in the sequencing data set;
optionally, the host genome database is a human genome database;
optionally, the human genome database comprises human reference genomic sequences and Yanhuang genomic sequences.
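The three steps of claim 1 can be pictured as a chain of filters followed by an assignment. The sketch below is only illustrative: the three callables stand in for real alignment against the host genome database, the homology database, and the microorganism genome database, and their names and signatures are hypothetical.

```python
def analyze(reads, aligns_to_host, aligns_to_homolog, assign_microbe):
    """Filter host reads, then homologous reads, then assign the rest to species.

    aligns_to_host / aligns_to_homolog: callables returning True when a read
    can be aligned to the respective database (placeholders for an aligner).
    assign_microbe: callable returning a species name, or None if unaligned.
    """
    # Step (1): first filtering, remove reads alignable to the host genome.
    non_host = [r for r in reads if not aligns_to_host(r)]
    # Step (2): second filtering, remove plasmid / resistance / virulence homologs.
    candidates = [r for r in non_host if not aligns_to_homolog(r)]
    # Step (3): compare the remaining reads with the microorganism genome database.
    hits = {}
    for r in candidates:
        species = assign_microbe(r)
        if species is not None:
            hits.setdefault(species, []).append(r)
    return hits
```

Keeping the databases behind callables makes the order of the two filtering steps explicit while leaving the choice of aligner open.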
2. The method according to claim 1, wherein the microorganism genome database is constructed by the steps of:
(a) obtaining a known microbial genome sequence, and constructing a primary microbial genome database, wherein the microbial genome sequence in the primary microbial genome database comprises at least one part of whole genome sequence information, chromosome sequence information, Scaffold sequence information and Contig sequence information;
(b) performing redundancy removal on sequences in the primary microorganism genome database so as to obtain a redundancy-removed microorganism genome database, wherein the redundancy removal refers to removal of sequences with similarity of more than 99%;
(c) selecting, based on the redundancy-removed microorganism genome database, the microbial genome sequence of a representative strain for each species for which a plurality of genomic sequences exist, and removing the microbial genome sequences of the other strains of the species from the redundancy-removed microorganism genome database, so as to obtain the microbial genome database.
3. The method according to claim 2, wherein the representative strain is obtained by:
determining the average identity between the genomic sequences of each two strains for a species in which there are multiple strains of genomic sequences;
obtaining a similarity matrix between the genome sequences of the plurality of strains in the species based on the average identity between the genome sequences of each two strains;
based on the similarity matrix, the strain having the greatest average similarity to the sequences of the other strains is selected as the representative strain.
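The selection in claim 3 can be sketched as follows, assuming the pairwise average identities have already been computed by some external tool. The data layout (a dictionary keyed by strain pairs) and the function name are hypothetical.

```python
def representative_strain(strains, identity):
    """Pick the strain with the highest mean identity to all other strains.

    identity maps an unordered strain pair, stored as (a, b) or (b, a),
    to the average identity between the two genome sequences.
    """
    def sim(a, b):
        return identity[(a, b)] if (a, b) in identity else identity[(b, a)]

    if len(strains) == 1:
        return strains[0]
    best, best_score = None, float("-inf")
    for s in strains:
        others = [t for t in strains if t != s]
        score = sum(sim(s, t) for t in others) / len(others)  # row mean of the matrix
        if score > best_score:
            best, best_score = s, score
    return best
```

For three strains with identities A-B 0.99, A-C 0.95, and B-C 0.97, the row means are 0.97, 0.98, and 0.96, so strain B would be chosen.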
4. The method according to any one of claims 1 to 3, further comprising performing a third filtering process on the sequencing data of the microorganism by using a high-frequency alignment site database to remove the sequencing data aligned to the high-frequency alignment sites in step (3), wherein the high-frequency alignment site database is constructed by the following steps:
comparing metagenomic sequencing data of a plurality of samples to the microbial genome database, wherein the microbial genome is pre-divided into a plurality of predetermined windows, so as to determine the number of metagenomic sequencing data matching the windows;
determining a plurality of high-frequency alignment sites constituting the high-frequency alignment site database based on the number of metagenomic sequencing data matching the window;
optionally, the plurality of samples are of the same species as the host sample;
optionally, using an alignment site with an alignment frequency of more than 5% as the high-frequency alignment site, wherein the alignment frequency is a ratio of the number of samples aligned to the alignment site to the total number of samples;
optionally, in step (3), when aligning the sequencing data set subjected to the first filtering process and the second filtering process with the microorganism genomic database, the microorganism sequencing data from the microorganisms in the sequencing data set are further determined based on at least one of:
retaining sequences in the sequencing data set with an alignment length ratio greater than 90%;
retaining sequences in the sequencing data set with a proportion of mismatched bases of less than 5%;
retaining sequences with alignment specificity, wherein the alignment scores of a sequence aligned to different positions are tallied, and a sequence whose ratio of suboptimal alignment score to optimal alignment score is less than 0.8 is selected as a sequence with alignment specificity.
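The construction of the high-frequency alignment site database and the read-retention criteria of claim 4 can be sketched as below. Several points are assumptions: a site is represented as a (genome identifier, window index) pair, the alignment length ratio is taken relative to the read length, the mismatch proportion is taken relative to the aligned length, and all three retention criteria are applied together although the claim requires only at least one of them. All names are hypothetical.

```python
from collections import Counter


def high_frequency_windows(sample_hits, min_frequency=0.05):
    """Return windows hit in more than min_frequency of the samples.

    sample_hits: one collection per sample of (genome_id, window_index)
    pairs to which that sample's reads aligned.
    """
    n_samples = len(sample_hits)
    counts = Counter()
    for windows in sample_hits:
        counts.update(set(windows))  # count each window once per sample
    return {w for w, k in counts.items() if k / n_samples > min_frequency}


def keep_alignment(aln_len, read_len, mismatches, best_score, second_score=None):
    """Apply the three optional retention criteria to one aligned read."""
    if aln_len / read_len <= 0.90:  # alignment length ratio must exceed 90%
        return False
    if mismatches / aln_len >= 0.05:  # mismatched bases must be below 5%
        return False
    if second_score is not None and second_score / best_score >= 0.8:
        return False  # suboptimal hit scores too close to the best: not specific
    return True
```

A read whose alignment falls inside a window returned by `high_frequency_windows` would then be discarded in the third filtering process.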
5. The method of any one of claims 1 to 4, further comprising:
prior to performing step (1), pre-processing raw sequencing data from the host sample to obtain the set of sequencing data, the pre-processing comprising filtering out at least one of the following sequences:
sequences sharing 10 bp or more of consecutive bases with the adapter (linker) sequence;
reads whose length is below a predetermined threshold, the predetermined threshold preferably being 50-55 bp;
sequences in which bases with a quality value below 5 account for more than 50% of the total number of bases in the sequence;
optionally, the method further comprises at least one of annotating and visualizing the microorganism sequencing data;
optionally, the annotation process is selected from at least one of:
number of aligned sequences, which refers to the number of sequences aligned to a species at the species level;
number of uniquely aligned sequences, which refers to the number of sequences uniquely aligned to a given species or genus level;
coverage, which refers to the percentage of the detected length of the nucleic acid sequence of the microorganism to the length of the entire genomic sequence of the microorganism;
depth of coverage, which refers to the average depth of bases over the range of coverage on the genome;
relative abundance, which refers to the proportion of microorganisms detected at the species or genus level among the same type of microorganisms detected throughout the sample;
and distribution randomness.
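The pre-processing filters of claim 5 can be sketched as simple predicates on a read and its per-base quality values. This is a minimal illustration that applies all three filters together, although the claim requires only at least one of them; the function names are hypothetical, the 50 bp default is the lower end of the preferred range, and the adapter string in the usage example is only a placeholder.

```python
def shares_adapter(read, adapter, min_shared=10):
    """True if the read contains any stretch of min_shared consecutive adapter bases."""
    if len(adapter) < min_shared:
        return False
    kmers = {adapter[i:i + min_shared] for i in range(len(adapter) - min_shared + 1)}
    return any(read[i:i + min_shared] in kmers
               for i in range(len(read) - min_shared + 1))


def low_quality(quals, min_q=5, max_fraction=0.5):
    """True if more than max_fraction of the bases have quality below min_q."""
    if not quals:
        return True
    low = sum(1 for q in quals if q < min_q)
    return low / len(quals) > max_fraction


def passes_qc(read, quals, adapter, min_len=50):
    """Keep a read only if it survives the length, adapter, and quality filters."""
    return (
        len(read) >= min_len
        and not shares_adapter(read, adapter)
        and not low_quality(quals)
    )


adapter = "AGATCGGAAGAGC"  # placeholder adapter sequence for illustration
print(passes_qc("ACGT" * 15, [30] * 60, adapter))
```

The set of adapter k-mers is built once per call, so checking a read costs one lookup per read position.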
6. An apparatus for performing microbiological analysis of a host sample, comprising:
a host data filtering module that performs a first filtering process on a set of sequencing data from the host sample using a host genome database to remove sequencing data from the set of sequencing data that can be aligned with the host genome database, the set of sequencing data resulting from performing metagenomic sequencing on the host sample;
a homologous data filtering module connected to the host data filtering module, wherein the homologous data filtering module performs a second filtering process on the sequencing data set by using a homologous database, and the homologous database includes at least a part of known plasmid sequences, drug-resistant sequences and bacterial virulence sequences, so as to remove the sequencing data that can be aligned with the homologous database from the sequencing data set;
a microbial data comparison module connected to the host data filtering module or the homologous data filtering module, the microbial data comparison module comparing the sequencing data set subjected to the first filtering process and the second filtering process with a microbial genome database to determine microbial sequencing data from the microbes in the sequencing data set;
optionally, the host genome database is a human genome database;
optionally, the human genome database comprises human reference genomic sequences and Yanhuang genomic sequences.
7. The apparatus of claim 6, wherein the microorganism genome database is constructed by:
(a) obtaining a known microbial genome sequence, and constructing a primary microbial genome database, wherein the microbial genome sequence in the primary microbial genome database comprises at least one part of whole genome sequence information, chromosome sequence information, Scaffold sequence information and Contig sequence information;
(b) performing redundancy removal on sequences in the primary microorganism genome database so as to obtain a primary-processed microorganism genome database, wherein the redundancy removal refers to removal of sequences with similarity of more than 99%;
(c) selecting, based on the primary-processed microorganism genome database, the microbial genome sequence of a representative strain for each species for which a plurality of genomic sequences exist, and removing the microbial genome sequences of the other strains of the species from the primary-processed microorganism genome database, so as to obtain the microbial genome database.
8. The apparatus according to claim 7, wherein the representative strain is obtained by:
determining the average identity between the genomic sequences of each two strains for a species in which there are multiple strains of genomic sequences;
obtaining a similarity matrix between the genome sequences of the plurality of strains in the species based on the average identity between the genome sequences of each two strains;
based on the similarity matrix, the strain having the greatest average similarity to the sequences of the other strains is selected as the representative strain.
9. The apparatus according to any one of claims 6 to 8, wherein the microbial data comparison module further performs a third filtering process on the microbial sequencing data using a high-frequency alignment site database, so as to remove the sequencing data aligned to high-frequency alignment sites, wherein the high-frequency alignment site database is constructed by the following steps:
comparing metagenomic sequencing data of a plurality of samples to the microbial genome database, wherein the microbial genome is pre-divided into a plurality of predetermined windows, so as to determine the number of metagenomic sequencing data matching the windows;
determining a plurality of high-frequency alignment sites constituting the high-frequency alignment site database based on the number of metagenomic sequencing data matching the window;
optionally, the plurality of samples are of the same species as the host sample;
optionally, using an alignment site with an alignment frequency of more than 5% as the high-frequency alignment site, wherein the alignment frequency is a ratio of the number of samples aligned to the alignment site to the total number of samples;
optionally, in the microbial data comparison module, when aligning the sequencing data set subjected to the first filtering process and the second filtering process with the microorganism genomic database, the microorganism sequencing data from the microorganisms in the sequencing data set are further determined based on at least one of:
retaining sequences in the sequencing data set with an alignment length ratio greater than 90%;
retaining sequences in the sequencing data set with a proportion of mismatched bases of less than 5%;
retaining sequences with alignment specificity, wherein the alignment scores of a sequence aligned to different positions are tallied, and a sequence whose ratio of suboptimal alignment score to optimal alignment score is less than 0.8 is selected as a sequence with alignment specificity.
10. The apparatus of any one of claims 6-9, further comprising:
a sequencing data quality control module connected to the host data filtering module, the sequencing data quality control module pre-processing raw sequencing data from the host sample to obtain the set of sequencing data, the pre-processing including filtering out at least one of the following sequences:
sequences sharing 10 bp or more of consecutive bases with the adapter (linker) sequence;
reads whose length is below a predetermined threshold, the predetermined threshold preferably being 50-55 bp;
sequences in which bases with a quality value below 5 account for more than 50% of the total number of bases in the sequence;
optionally, the device further comprises:
a data output module connected to the microbial data comparison module, the data output module performing at least one of annotation processing and visualization processing on the microbial sequencing data;
optionally, the annotation process is selected from at least one of:
number of aligned sequences, which refers to the number of sequences aligned to a species at the species level;
number of uniquely aligned sequences, which refers to the number of sequences uniquely aligned to a given species or genus level;
coverage, which refers to the percentage of the detected length of the nucleic acid sequence of the microorganism to the length of the entire genomic sequence of the microorganism;
depth of coverage, which refers to the average depth of bases over the range of coverage on the genome;
relative abundance, which refers to the proportion of the microorganism detected at the species or genus level that is among the same type of microorganism detected in the entire sample;
and distribution randomness.
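The annotation parameters listed above (coverage, depth of coverage, relative abundance) can be computed from alignment intervals and per-species read counts. The sketch below uses a per-base depth array for clarity, which suits small genomes only, and assumes that relative abundance is calculated from read counts within one microorganism type; the function names and input layouts are hypothetical.

```python
def coverage_and_depth(intervals, genome_len):
    """Return (coverage in percent, mean depth over the covered bases).

    intervals: half-open (start, end) positions of aligned reads on the genome.
    """
    depth = [0] * genome_len
    for start, end in intervals:
        for i in range(max(start, 0), min(end, genome_len)):
            depth[i] += 1
    covered = sum(1 for d in depth if d > 0)
    if covered == 0:
        return 0.0, 0.0
    return 100.0 * covered / genome_len, sum(depth) / covered


def relative_abundance(reads_by_species):
    """Share of each species among all detected microorganisms of the same type."""
    total = sum(reads_by_species.values())
    if total == 0:
        return {species: 0.0 for species in reads_by_species}
    return {species: n / total for species, n in reads_by_species.items()}
```

For example, two reads covering positions 0-10 and 5-15 of a 100 bp genome give a coverage of 15% and a mean depth over the covered region of 20/15, about 1.33.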
CN201811169458.4A 2018-10-08 2018-10-08 Method and apparatus for microbiological analysis of a host sample Active CN111009286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811169458.4A CN111009286B (en) 2018-10-08 2018-10-08 Method and apparatus for microbiological analysis of a host sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811169458.4A CN111009286B (en) 2018-10-08 2018-10-08 Method and apparatus for microbiological analysis of a host sample

Publications (2)

Publication Number Publication Date
CN111009286A true CN111009286A (en) 2020-04-14
CN111009286B CN111009286B (en) 2023-04-28

Family

ID=70110580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811169458.4A Active CN111009286B (en) 2018-10-08 2018-10-08 Method and apparatus for microbiological analysis of a host sample

Country Status (1)

Country Link
CN (1) CN111009286B (en)

Also Published As

Publication number Publication date
CN111009286B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111009286A (en) Method and apparatus for microbiological analysis of host samples
CN111951895B (en) Pathogen analysis method based on metagenomics, analysis device, apparatus, and storage medium
EP2926288B1 (en) Accurate and fast mapping of targeted sequencing reads
Turakhia et al. Pandemic-scale phylogenomics reveals elevated recombination rates in the SARS-CoV-2 spike region
US20200294628A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
KR20020075265A (en) Method for providing clinical diagnostic services
CN115064215B (en) Method for tracing strains and identifying attributes through similarity
Kearse et al. The Geneious 6.0.3 read mapper
WO2019242445A1 (en) Detection method, device, computer equipment and storage medium of pathogen operation group
CN115132276A (en) Solid tumor mutant gene detection and analysis method and system
CN110970093B (en) Method and device for screening primer design template and application
Connor et al. Towards increased accuracy and reproducibility in SARS-CoV-2 next generation sequence analysis for public health surveillance
CN116469462A (en) Ultra-low frequency DNA mutation identification method and device based on double sequencing
CN116013420A (en) Virulence factor database construction method, device, equipment and medium
Kaiser et al. Automated structural variant verification in human genomes using single-molecule electronic DNA mapping
CN114496089B (en) Pathogenic microorganism identification method
RU2772912C1 (en) Method for analysing mitochondrial dna for non-invasive prenatal testing
US20220042091A1 (en) Mitochondrial DNA Quality Control
Marić et al. Approaches to metagenomic classification and assembly
Listgarten et al. PERSONALIZED MEDICINE: FROM GENOTYPES AND MOLECULAR PHENOTYPES TOWARDS THERAPY
Gonzalez et al. Essentials in Metagenomics (Part II)
CN117524312A (en) Analysis method and device for pathogen metagenome sequencing data and application thereof
US20210214774A1 (en) Method for the identification of organisms from sequencing data from microbial genome comparisons
Bálint et al. Purging genomes of contamination eliminates systematic bias from evolutionary analyses of ancestral genomes
Cervi et al. The MetaGens algorithm for metagenomic database lossy compression and subject alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40016761

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20210527

Address after: 518057 room 201203a5, building w2a, building B, building a, Gaoxin industrial village, 025 Gaoxin South 4th Road, Gaoxin community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Huada Yinyuan Pharmaceutical Technology Co.,Ltd.

Applicant after: Huada Biotechnology (Wuhan) Co.,Ltd.

Address before: 518057 room 201203a5, building w2a, building B, building a, Gaoxin industrial village, 025 Gaoxin South 4th Road, Gaoxin community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Huada Yinyuan Pharmaceutical Technology Co.,Ltd.

GR01 Patent grant