CN113223618A

CN113223618A - Method and system for detecting virulence genes of clinically important pathogenic bacteria based on metagenome

Info

Publication number: CN113223618A
Application number: CN202110579642.1A
Authority: CN
Inventors: 夏涵; 官远林; 江月; 樊淑; 杨静; 胡煜
Original assignee: Yuguo Microcode Biotechnology Co ltd Of Xixian New Area; Yuguo Zhizao Technology Beijing Co ltd; Yuguo Biotechnology Beijing Co ltd
Current assignee: Yuguo Microcode Biotechnology Co ltd Of Xixian New Area; Yuguo Zhizao Technology Beijing Co ltd; Yuguo Biotechnology Beijing Co ltd
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2021-08-06
Anticipated expiration: 2041-05-26
Also published as: CN113223618B

Abstract

The invention discloses a method and a system for detecting virulence genes of clinically important pathogenic bacteria based on metagenomic sequencing. The method comprises the following steps: S10, establishing a virulence gene database of clinical pathogenic bacteria; S20, acquiring raw metagenomic sequencing data of a clinical sample and preprocessing the raw data to obtain target data; S30, analyzing the target data with a preset multiple-alignment annotation system for metagenomic sequencing data and identifying virulence genes; S40, establishing an association database of important virulence genes, virulence factors, and characterization (function/clinical phenotype); and S50, generating a virulence gene identification report from the virulence gene identification result and the association database with a preset clinical automated reporting system. The system can identify virulence genes in metagenomic sequencing data from different types of clinical infection samples (cerebrospinal fluid and the like), can identify multiple important virulence genes of multiple pathogenic bacteria in a sample in a single run, has good sensitivity and accuracy, and helps doctors make timely diagnosis, treatment, and prognosis decisions.

Description

Method and system for detecting virulence genes of clinically important pathogenic bacteria based on metagenome

Technical Field

The invention belongs to the technical field of bioinformatics algorithm software. It can be applied to clinical pathogen detection products, namely the analysis of virulence genes of clinical pathogenic bacteria in pathogen metagenomic testing, and covers hundreds of virulence genes of various clinical pathogens. Fields of application: identification and source tracing of virulence genes of pathogenic bacteria detected by pathogen metagenomic testing of samples such as tissues and body fluids (cerebrospinal fluid, alveolar lavage fluid, blood, and sputum) from patients with various infectious diseases, assisting clinicians in accurate diagnosis, treatment selection, and prognosis, and providing useful information for the monitoring of bacterial infectious diseases.

Background

Bacterial infection can cause a variety of acute and chronic diseases. Bacteria can also act as opportunistic pathogens, causing disease through interactions among pathogen, host, and environmental factors, and certain clinical pathogenic bacteria seriously endanger human life and health. Staphylococcus aureus, for example, often causes pyogenic infections in humans and can directly cause pneumonia, pseudomembranous enteritis, pericarditis, and even systemic infections such as septicemia and sepsis. Klebsiella pneumoniae is widely present on animal mucosal surfaces (such as the human gastrointestinal tract) and in the environment, and is a major pathogen of healthcare-associated infections and severe community-acquired infections. In China, Klebsiella pneumoniae accounts for 11.9% of pathogens isolated from ventilator-associated pneumonia and intensive care unit-acquired pneumonia. Streptococcus pneumoniae is one of the main pathogens of community-acquired pneumonia, otitis media, meningitis, abscesses, septicemia, and the like. In developing countries, more than 1.1 million children die each year from pneumonia, with Streptococcus pneumoniae accounting for approximately 20%. Pseudomonas aeruginosa is another pathogen that is difficult to treat, has a high mortality rate, and often exhibits multidrug or pan-drug resistance; it is a gram-negative bacterium that readily colonizes and infects the respiratory tract, particularly in immunocompromised persons. During infection of humans, these clinical pathogens often exert their pathogenicity through multiple virulence genes, leading to the development of disease.

"Virulence factors" is a general term for the effector or regulatory molecules (proteins, lipid molecules, compounds, and the like), and the functional units formed by their combinations, that are produced by pathogenic microorganisms and cause disease in the host. The genes encoding these virulence factors are referred to as virulence genes. For example, Staphylococcus aureus achieves adhesion to, infection of, and dissemination among host cells by producing various virulence factors, and can evade the host immune system or antibiotics by forming a biofilm. Its important virulence genes include pvl, sea, seb, and the like. pvl promotes lysis of neutrophils and confers strong pathogenicity on the strain; it is associated with suppurative infections of skin and soft tissue and, in severe cases, can cause necrotizing pneumonia with high lethality. The enterotoxin genes sea, seb, and the like can stimulate the vomiting center to cause acute gastroenteritis with vomiting as the main symptom, and are a main cause of bacterial food poisoning in humans. Streptococcus pneumoniae has various virulence genes, such as the capsular polysaccharide synthetase A gene, the pneumolysin gene (ply), lytA, nanA, and the like. Capsule-related genes such as cps4A are a prerequisite for the pathogenicity of Streptococcus pneumoniae; pneumolysin (a hemolysin) can cause host cell lysis, alveolar edema, and hemorrhage, induce pneumonia, and cause bacteremia by driving bacteria into the blood; lytA is involved in bacterial autolysis, which releases pneumolysin and other components and may cause a strong inflammatory response in the host. Differences in the virulence genes carried by different strains of Klebsiella pneumoniae lead to differences in their pathogenicity. One important virulence gene of Klebsiella pneumoniae is rmpA, which regulates the synthesis of capsular polysaccharide and produces the hypermucoviscous phenotype, giving the strain strong pathogenicity; together with virulence genes such as iutA, it influences the formation of liver abscesses. These virulence genes encode virulence factors such as toxins and surface proteins that help bacteria adhere to and invade host cells, improve bacterial survival and propagation within host cells, cause toxic death of host cells, and so on, ultimately causing various infectious diseases in the host. Therefore, detecting and identifying the virulence genes of high-frequency or important pathogenic bacteria in clinical samples helps to identify potential pathogens, evaluate the virulence of clinical strains, and support the selection and implementation of specific measures for clinical infectious diseases such as diagnosis, precise treatment, and prognosis management. In the field of public health, the detection of virulence genes and the identification of their specific virulence profiles also provide useful information for monitoring bacterial infectious diseases, judging the probability of epidemic outbreaks, and evaluating epidemic severity, and help in proposing and implementing reasonable disease control measures.

Methods for detecting and identifying virulence genes of clinical pathogenic bacteria mainly include: single- or multi-gene detection based on polymerase chain reaction (PCR) and its derivative technologies, loop-mediated isothermal amplification, gene chips, and metagenomic detection based on next-generation sequencing. Among these, PCR is the most widely used in clinical practice. Specific primers are designed against a conserved region of the nucleic acid sequence of a specific virulence gene, and the nucleic acid of a clinical sample or an isolated strain is used as the template for amplification and detection. The technology allows rapid gene detection and is highly sensitive. Its main clinical applications are: 1) multiplex PCR: two or more primer pairs are added to the same PCR reaction system so that multiple nucleic acid fragments are amplified simultaneously and two or more virulence genes can be detected and identified at the same time; 2) fluorescent quantitative PCR: real-time detection of the fluorescence signal of the product of each cycle is added to conventional PCR, enabling quantitative and qualitative analysis of the initial template DNA. However, both PCR technologies are complex to operate, place high demands on instruments and personnel, and are unsuitable for rapid on-site diagnosis.

Loop-mediated isothermal amplification (LAMP) is a nucleic acid amplification technology distinct from PCR. It relies on a DNA polymerase with strand displacement activity and 2 pairs of specially designed primers, requires neither repeated temperature cycling nor expensive instruments, and completes the amplification reaction efficiently and quickly under isothermal conditions. It is now widely applied to the detection and identification of pathogens such as bacteria, viruses, and parasites. Compared with conventional PCR, LAMP offers high specificity, high sensitivity, simple operation, low instrument requirements, and rapid nucleic acid amplification at constant temperature. Its drawbacks are demanding primer design, difficulty in distinguishing non-specific amplification, and high susceptibility to contamination.

A gene chip, also called a DNA microarray, is a dense molecular array formed by fixing a large number of DNA probes, such as gene fragments and artificially synthesized oligonucleotides, on a carrier in a pre-designed arrangement by in-situ synthesis, micro-spotting, or other methods. The array is hybridized with a nucleic acid sample labeled with fluorescein or by other means, and the presence and quantity of a target gene in the sample are determined from the strength of the hybridization signal. Gene chip technology is now used in fields such as gene expression analysis and mutation and polymorphism analysis. Compared with PCR or LAMP, it can detect a large number of genes in one experiment and is rapid, highly parallel, versatile, and automatable. On the other hand, gene chips have high detection cost, demanding operation, and poor sensitivity, which limits their range of application.

PCR, LAMP, and gene chips all require prior knowledge of the sample and can detect only specific virulence genes of specific bacteria; they cope poorly with genes that are highly variable or that carry unpredicted variation, and cannot fully cover the virulence genes of clinical significance. Metagenomic sequencing based on next-generation sequencing, which has developed rapidly in recent years, has unique advantages in overcoming these shortcomings. Metagenomic sequencing does not require isolation and culture of the pathogen; clinical samples can be analyzed directly after nucleic acid extraction and purification, and comprehensive virulence gene annotation and identification is carried out by sequence homology alignment.

A clinical infectious disease sample contains multiple pathogenic bacteria whose pathogenic mechanisms involve different virulence factors, and pathogenicity arises from the synergistic regulation of multiple virulence genes. The prior art for detecting microbial virulence genes includes PCR and its derivative technologies, loop-mediated isothermal amplification, gene chips, and the like, and suffers from problems such as a limited number and range of detectable virulence genes, the need for prior knowledge, and susceptibility to cross contamination. Current products for virulence gene identification are based primarily on PCR, which can detect only a limited range of bacteria and a limited number of virulence genes. In particular, conventional PCR can detect only one virulence gene of one bacterium per experiment; for example, Chinese patent publication CN110669853A detects only the ampR gene of non-mucoid Klebsiella pneumoniae. Even with multiplex PCR, an excessive number of primer pairs readily forms dimers and reduces amplification efficiency, so the number of virulence genes that can be detected remains small. For example, Chinese patent publication CN111876509A uses multiplex PCR to detect four virulence genes of Acinetobacter baumannii, such as abaR, CsuA, and bap, at a time; the multiplex PCR product of Chinese patent publication CN109554449A detects 7 virulence genes of Aeromonas; and Chinese patent publication CN108707680A designs a seven-plex PCR detection primer set that covers only specific regions of 21 virulence genes of Streptococcus agalactiae, such as sip, fbsA, and hylB.

Multiplex fluorescent PCR is also applied to virulence gene detection because its results are convenient to interpret. For example, Chinese patent publication CN112430677A, filed in 2020, designs a multiplex fluorescent PCR that quantitatively detects three virulence genes of Klebsiella pneumoniae: iucA, rmpA1, and rmpA2. Loop-mediated isothermal amplification, developed in recent years, is likewise applied to clinical virulence gene detection. For example, Chinese patent publication CN11150075A uses 2 pairs of primers to amplify 6 different regions of the peg-344 gene of hypervirulent Klebsiella pneumoniae in order to identify clinical hypervirulent strains. Gene chips cost more and are less used in clinical virulence gene detection. The product of Chinese patent CN105950732B, filed in 2016, is designed to identify 17 virulence genes of 9 pathogenic bacteria from animal-derived food, including Salmonella, Enterococcus, Clostridium perfringens, and the like. All of these prior techniques require specific primers for one or more known genes to be designed or selected before the experiment, so only virulence genes within a preset range can be detected. Clinically, a more sensitive and comprehensive virulence gene detection strategy for infectious pathogenic bacteria is needed to meet China's requirements for diagnosis, treatment, and epidemiological monitoring of important pathogenic bacteria with high incidence and high virulence.

Metagenomic sequencing, developed in recent years, takes the whole microbial community in a specific habitat as the object of study and directly extracts the DNA of all microorganisms in a clinical sample for sequencing, annotation, and comparative analysis. The technology makes up for the shortcomings of the prior methods, requires neither culture nor prior knowledge of the sample, and allows comprehensive virulence gene scanning and identification across the metagenome of clinical pathogens. Among published Chinese patent applications and granted patents, there is no product or similar project for detecting virulence genes based on the metagenome; the research, development, and popularization of such products will help meet the need for diagnosing highly virulent pathogenic bacteria in clinical infectious diseases.

Disclosure of Invention

The patent provides a method and a system for detecting virulence genes of clinically important pathogenic bacteria based on the metagenome, including but not limited to the identification of hundreds of important virulence genes of various pathogenic bacteria such as Klebsiella pneumoniae, Streptococcus pneumoniae, Escherichia coli, Haemophilus influenzae, and Staphylococcus aureus, for example rmpA, iucA, ply, cps, stx1A, bexA, lukF-PV, hly, ompA, plc, cylL, ctxA, eccA1, lipA, slo, acm, icmTlef, toxA, pgm, and the like. The method comprises the following main parts: 1) establishing a virulence gene database of clinical pathogenic bacteria; 2) obtaining raw metagenomic sequencing data of a clinical sample and preprocessing it to obtain target data; 3) analyzing the target data with a preset multiple-alignment annotation system for metagenomic sequencing data and identifying virulence genes; 4) establishing an association database of important virulence genes, virulence factors, and characterization (function/clinical phenotype); 5) generating a virulence gene identification report from the virulence gene identification result and the association database with a preset clinical automated reporting system. The method is suitable for multiple clinical infectious disease sample types (cerebrospinal fluid, alveolar lavage fluid, blood, and the like) and can identify multiple high-frequency and important virulence genes of multiple clinical pathogenic bacteria in a single run, reducing additional screening time. The in-depth association database and the multiple-alignment strategy give higher sensitivity and accuracy, and the automated reporting system generates reports quickly, helping doctors to identify, diagnose, treat, and assess the prognosis of infections with highly virulent pathogenic strains in time.

The invention discloses a method for detecting virulence genes of clinically important pathogenic bacteria based on metagenome, which comprises the following steps:

S10, establishing a virulence gene database of clinical pathogenic bacteria;

S20, acquiring raw metagenomic sequencing data of a clinical sample, and preprocessing it to obtain target data;

S30, analyzing the target data with a preset multiple-alignment annotation system for metagenomic sequencing data, and identifying virulence genes;

S40, establishing an association database of important virulence genes, virulence factors, and characterization (function/clinical phenotype);

and S50, generating a virulence gene identification report from the S30 virulence gene identification result and the S40 association database with a preset clinical automated reporting system.

In some embodiments of the present invention, the S10 includes the following steps:

obtaining the virulence genes and sequences of clinical pathogenic bacteria from a virulence gene database;

acquiring all genomes, gene sequences and annotation information of the clinical pathogenic bacteria from a public database;

filtering pseudogenes, fragments, and misannotated sequences in the gene sequence;

clustering each gene unit sequence by multiple thresholds, and performing cross comparison and de-duplication in groups;

testing the reference gene sequences of each gene unit with a simulated data set, and adjusting and supplementing the gene unit reference sequences;

repeating the multi-threshold clustering of each gene unit sequence, with cross comparison and de-duplication in groups;

clustering reference sequences of all gene units, and filtering abnormal sequences;

extracting public database annotation information such as gene names and species names of the reference sequences, and proofreading and standardizing reference sequence annotations of each gene unit;

establishing reference sequence indexes of all virulence gene units;

and optionally, building software that automates sequence download, cluster de-duplication, and updating and standardization of the database.
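The clustering and de-duplication steps above can be illustrated with a minimal sketch. The text does not name a clustering tool, an identity measure, or threshold values, so everything here is an assumption: `identity` uses Python's `difflib.SequenceMatcher` ratio as a crude stand-in for alignment identity, `greedy_cluster` keeps one representative per cluster in the style of greedy incremental clustering, and the thresholds `(1.0, 0.95)` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Sequence


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: Iterable[str], threshold: float) -> List[str]:
    """Keep one representative per cluster (longest sequences are seeded first)."""
    representatives: List[str] = []
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        # A sequence joins an existing cluster if it is similar enough to its
        # representative; otherwise it founds a new cluster.
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


def multi_threshold_dedup(seqs: Iterable[str],
                          thresholds: Sequence[float] = (1.0, 0.95)) -> List[str]:
    """Apply the clustering once per threshold (hypothetical values)."""
    reps = list(seqs)
    for threshold in thresholds:
        reps = greedy_cluster(reps, threshold)
    return reps
```

In practice a dedicated sequence clustering tool would replace `identity`; the sketch only shows how exact duplicates and near-duplicates within one gene unit could be collapsed in successive passes.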

In some embodiments of the invention, the S20 includes:

filtering out reads in which bases with a quality value below 2 make up 40% of the total bases of the read;

trimming bases within a sliding window (5 bp) whose average quality is below 20;

filtering out reads with an average quality below 20, more than 5 N bases, or a length below 50.
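As an illustration of the three preprocessing rules above, here is a minimal sketch for a single read. It assumes Phred+33 quality encoding, treats the 40% limit as "at or above", and cuts the read at the first 5-base window whose mean quality falls below 20; the function and parameter names are hypothetical and are not taken from the patent.

```python
from typing import List, Optional, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def phred_scores(qual: str) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def preprocess_read(seq: str, qual: str,
                    low_q: int = 2, low_q_fraction: float = 0.40,
                    window: int = 5, window_min_q: float = 20.0,
                    min_mean_q: float = 20.0, max_n: int = 5,
                    min_len: int = 50) -> Optional[Tuple[str, str]]:
    """Return the cleaned (seq, qual) pair, or None if the read is discarded."""
    scores = phred_scores(qual)
    if not scores:
        return None
    # Rule 1: discard reads dominated by very low quality bases.
    low_count = sum(1 for q in scores if q < low_q)
    if low_count / len(scores) >= low_q_fraction:
        return None
    # Rule 2: cut at the first sliding window whose mean quality is too low.
    cut = len(scores)
    for start in range(0, len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < window_min_q:
            cut = start
            break
    seq, qual, scores = seq[:cut], qual[:cut], scores[:cut]
    # Rule 3: discard reads that are too short, too ambiguous, or too noisy.
    if not seq or len(seq) < min_len:
        return None
    if seq.upper().count("N") > max_n:
        return None
    if sum(scores) / len(scores) < min_mean_q:
        return None
    return seq, qual
```

Reads that pass all three rules correspond to the "clean reads" used in the alignment step.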

In some embodiments of the present invention, the S30 includes the following steps:

the set of reference sequences for a particular virulence gene is defined as: {s₁, s₂, …, s_n}, wherein s_n is reference sequence n;

aligning the high-quality reads (clean reads) of the metagenome to the reference sequence set with a multiple alignment algorithm, with an e-value threshold of 1e-5;

the alignment results of each read are: {R₁, R₂, …, R_m} ∈ g_i, wherein 0 ≤ m ≤ n; R_m: the m-th alignment result; g_i: the i-th gene unit;

filtering strategy for the detected virulence genes (VF-result):

wherein id = the sequence similarity score (%);

score = the quality score of the pairwise sequence alignment;

filtering condition: a VF-result belonging to g_i, i > 1, is discarded and yields no result;

and optionally, building software to implement automated alignment, filtering, and result list generation.

In some embodiments of the invention, the alignment of S30 comprises:

when the alignment yields a single result (m = 1), taking that result (R_m) as the final result (r);

when there are multiple alignment results (m > 1) and the target reference sequences belong to the same gene unit of the same species, the final result r_i after scoring and sorting is as follows:

when there are multiple alignment results (m > 1) and the target reference sequences belong to different gene units (g_i, i > 1) of the same species, the final result r is the union of the best results in each gene group, {r₁, r₂, …, r_i}, wherein the result within group g_i is as follows:
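The selection logic described above can be sketched as follows for the alignments of a single read. The exact scoring formula is not given in this text, so the sketch assumes that the "best" result within a gene unit is the alignment with the highest score, with sequence identity as a tie-breaker. The `Hit` record and its field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One alignment of a read to a reference sequence (hypothetical record)."""
    read_id: str
    ref_id: str
    gene_unit: str
    identity: float  # sequence similarity (%)
    score: float     # pairwise alignment quality score
    evalue: float


def select_final_results(hits: List[Hit], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the best hit per gene unit for one read.

    A single surviving hit is returned as is; several hits in one gene unit
    collapse to the top-scoring one; hits in different gene units yield the
    union of the per-unit best hits.
    """
    by_unit: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit.evalue <= max_evalue:  # e-value threshold from the text
            by_unit[hit.gene_unit].append(hit)
    return {
        unit: max(group, key=lambda h: (h.score, h.identity))
        for unit, group in by_unit.items()
    }
```

The returned dictionary can then be passed to whatever identity and score thresholds the second filtering step applies.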

In some embodiments of the present invention, the S40 includes the following steps:

collecting metagenome sequencing data of a pathogenic bacteria clinical sample;

analyzing the data based on the S20 and S30, and constructing a virulence gene spectrum of the single pathogenic bacteria of each sample;

extracting and standardizing corresponding clinical phenotype and physiological and biochemical indexes of the sample;

analyzing and extracting gene characteristics by using a maximum likelihood method;

combining clinical routine detection indexes and sequence characteristics (PAAC and PSSM-C) of virulence genes, and applying a multi-machine learning strategy to construct a virulence gene characteristic spectrum related to clinical diagnosis;

clustering synergistic virulence genes into single virulence factor units, correlating the corresponding characteristics (functional/clinical phenotype);

constructing a virulence factor-virulence gene and virulence factor-characterization (function/clinical phenotype) association table, and establishing a clinically important virulence gene-virulence factor-characterization (function/clinical phenotype) association database;

and optionally, automated alignment, filtering, and result list generation are implemented by software.

In some embodiments of the present invention, the analyzing and extracting gene features using the maximum likelihood method in S40 includes:

Extraction of the physicochemical features of the protein sequence (PAAC) of a virulence gene:

a_i: the physicochemical feature set of the 20 amino acids,

the n-th physicochemical feature; N: the total number of amino acid physicochemical features;

wherein the physicochemical features of a single amino acid are as follows:

for any two amino acids R_b and R_d, the correlation is:

F_k(R_b) is the physicochemical feature of R_b at the q-th position;

for an amino acid sequence of length L, the sequence position correlation parameter θ_h is defined as follows:

then, the physicochemical feature extraction formula for amino acid e in the (20 + λ)-dimensional sequence (λ = 2) is as follows:

wherein f_e: the frequency of amino acid e in the sequence; ω: the weight parameter for amino acid position in the sequence, with a default value of 0.1.
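For orientation, the symbol definitions above match the widely used pseudo amino acid composition. Under the assumption that the method follows that standard form, the three relations can be written as below. The symbols Θ and X_e are introduced here for illustration only, and the feature values F_k are assumed to be normalized beforehand; none of this is confirmed by the text.

```latex
% Correlation of two residues, averaged over the N physicochemical features
\Theta(R_b, R_d) = \frac{1}{N} \sum_{k=1}^{N} \bigl[ F_k(R_b) - F_k(R_d) \bigr]^2

% Sequence position correlation of order h for a sequence of length L
\theta_h = \frac{1}{L-h} \sum_{b=1}^{L-h} \Theta(R_b, R_{b+h}),
\qquad h = 1, \dots, \lambda

% (20 + \lambda)-dimensional feature vector
X_e =
\begin{cases}
\dfrac{f_e}{\sum_{i=1}^{20} f_i + \omega \sum_{h=1}^{\lambda} \theta_h},
  & 1 \le e \le 20 \\[2ex]
\dfrac{\omega\, \theta_{e-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{h=1}^{\lambda} \theta_h},
  & 21 \le e \le 20 + \lambda
\end{cases}
```

With λ = 2 and ω = 0.1 as stated above, each protein is represented by 22 values: 20 weighted composition terms and 2 sequence-order terms.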

Extraction of the evolutionary features (PSSM) of the virulence protein sequence:

the protein sequence of a virulence gene within a gene unit is converted into the original PSSM matrix as follows:

wherein L: the length of the sequence; 20: the number of columns, representing the 20 natural amino acids; p_u,v: the likelihood of evolutionary mutation of the u-th amino acid of the sequence to the v-th amino acid;

PSSM-C transforms the PSSM matrix into a 20×20 matrix, in which row u, Z_u, for amino acid a_u, is calculated as follows:

wherein z_t: the value at the t-th position of the original PSSM table; p_t: the amino acid at position t in the sequence; L: the length of the sequence; a_u: the u-th of the 20 amino acids.
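A minimal sketch of the PSSM-C transformation, assuming it follows the usual PSSM composition form: row u of the 20×20 matrix is the sum of the PSSM rows z_t at all positions t where the residue p_t equals a_u, divided by the sequence length L. The alphabetical amino acid order in `AMINO_ACIDS` is an assumption and should be matched to the column order of the actual PSSM file.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: row/column order of the matrix


def pssm_composition(sequence: str,
                     pssm: Sequence[Sequence[float]]) -> List[List[float]]:
    """Collapse an L x 20 PSSM into a 20 x 20 composition matrix (PSSM-C)."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    if not sequence:
        raise ValueError("sequence must not be empty")
    length = len(sequence)
    row_of = {aa: u for u, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, z_t in zip(sequence, pssm):
        u = row_of.get(residue)
        if u is None:  # skip non-standard residues such as X
            continue
        for v in range(20):
            matrix[u][v] += z_t[v]
    # Normalize by sequence length so proteins of different size are comparable.
    return [[value / length for value in row] for row in matrix]
```

Flattening the result gives a fixed-length vector of 400 values per protein, regardless of sequence length.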

In some embodiments of the present invention, the S50 includes the following steps:

importing the result list obtained in the S30, comparing the result list with the S40 association database to generate a virulence gene result, wherein the virulence gene result comprises pathogenic bacteria species (species Latin name and Chinese name) and gene information (gene name, virulence factor, characterization, support score and the like);

importing the result into a corresponding table of a report template;

importing the client information of the database into a report template;

and finally generating a virulence gene identification report (PDF format) for the specific pathogenic bacteria in the clinical sample.
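The lookup at the core of this step can be sketched as a join between the S30 result list and the S40 association database. The table layout, field names, and CSV output below are assumptions made for illustration (PDF rendering and client information are omitted); the two example entries paraphrase the rmpA and pvl descriptions in the Background section and are not the contents of the actual database.

```python
import csv
import io
from typing import Dict, List, Tuple

FIELDS = ["species", "gene", "virulence_factor", "characterization", "support_score"]

# Hypothetical association entries keyed by (species, gene).
ASSOCIATION_DB: Dict[Tuple[str, str], Dict[str, str]] = {
    ("Klebsiella pneumoniae", "rmpA"): {
        "virulence_factor": "regulator of mucoid phenotype",
        "characterization": "hypermucoviscous phenotype",
    },
    ("Staphylococcus aureus", "pvl"): {
        "virulence_factor": "Panton-Valentine leukocidin",
        "characterization": "neutrophil lysis; skin and soft tissue infection",
    },
}


def build_report_rows(result_list: List[dict],
                      association: Dict[Tuple[str, str], Dict[str, str]]) -> List[dict]:
    """Annotate each detected gene; genes absent from the association table are skipped."""
    rows = []
    for item in result_list:
        info = association.get((item["species"], item["gene"]))
        if info is None:
            continue
        rows.append({
            "species": item["species"],
            "gene": item["gene"],
            "virulence_factor": info["virulence_factor"],
            "characterization": info["characterization"],
            "support_score": item["support_score"],
        })
    return rows


def render_csv(rows: List[dict]) -> str:
    """Render the report table as CSV text (a stand-in for the report template)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Skipping genes that have no association entry is one possible design choice; a production report might instead list them separately for manual review.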

The second aspect of the invention discloses a metagenome-based system for detecting virulence genes of clinically important pathogenic bacteria, which comprises the following components:

a clinical pathogen virulence gene database;

important virulence genes-virulence factors-characterization (functional/clinical phenotype) association databases;

a multiple alignment annotation system for metagenomic sequencing data;

a clinical automated reporting system.

The beneficial technical effects of the invention are as follows:

(1) a metagenome-based system for detecting virulence genes of clinically important pathogenic bacteria is established, overcoming multiple limitations of the prior art and methods. Virulence genes can be detected and identified in clinical infection samples of different types (cerebrospinal fluid, lung lavage fluid, blood, throat swabs, and the like) and with low nucleic acid content, and hundreds of important virulence genes of various clinical pathogenic bacteria can be identified in a single run, reducing additional screening time. The in-depth database and the two-step alignment strategy improve the sensitivity and accuracy of identification. The clinical automated reporting system generates reports quickly, helping doctors with timely diagnosis, treatment, and prognosis;

(2) a comprehensive, manually curated virulence gene database of clinically important pathogenic bacteria is constructed, comprising all reference sequences of hundreds of important virulence genes of various clinically important pathogenic bacteria, together with corrected species and function annotation information;

(3) machine learning algorithms are applied to mine the literature and clinical big data, identifying the high-frequency and important virulence gene profile of each pathogenic bacterium together with its virulence factors and characterization (function/clinical phenotype). The genes are divided into different virulence factor units according to their synergistic effects, and a strongly associated knowledge base of important virulence genes, virulence factors, and characterization (function/clinical phenotype) is established, which offers greater reference value for clinical diagnosis and prognosis management;

(4) an alignment-score threshold filtering algorithm derived from large-sample analysis improves the sensitivity of virulence gene detection while maintaining alignment accuracy, overcomes the limitation of the prior related technology in identifying low-abundance and short-read samples, and is particularly suitable for virulence gene detection and identification in clinical samples (such as cerebrospinal fluid) with single-end short reads (50-75 bp) and low nucleic acid content;

(5) the alignment results of the metagenomic data and the virulence factor and characterization (function/clinical phenotype) information of important virulence genes are integrated in the clinical automated reporting system, giving the system higher reliability and clinical practicability.

Drawings

FIG. 1 is a flow chart of a method for detecting a virulence gene of a clinically important pathogen according to an embodiment of the present invention;

FIG. 2 is a flow chart of the operation of the gene detection system for virulence of clinically important pathogenic bacteria according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Example 1

As shown in figure 1, the method for detecting the virulence gene of clinically important pathogenic bacteria based on metagenome mainly comprises the following steps:

1. establishing clinical pathogenic bacteria virulence gene database

1.1. Acquiring 1761 virulence genes and sequences of 24 important pathogenic bacteria (covering 18 genera, 10 gram-negative bacteria and 14 gram-positive bacteria) from a virulence database such as VFDB;

1.2. downloading from a public database (NCBI RefSeq) all genomes and gene sequences and annotation information for filtering 24 pathogens;

1.3. filtering pseudogenes, segments and misannotated sequences in the downloaded sequence by using self-developed software;

1.4. clustering each gene unit sequence by multiple thresholds, and performing cross comparison and de-duplication in groups;

1.5. simulating a data set to test a gene unit reference gene sequence, and adjusting a supplementary gene unit reference sequence;

1.6. 1.4, clustering the gene unit sequences, and performing cross comparison and duplication removal in groups;

1.7. clustering reference sequences of all gene units, and filtering abnormal sequences;

1.8. extracting NCBI annotation information such as gene names and species names of the reference sequences by using a regular expression, and proofreading and standardizing the annotation of the reference sequences of all gene units;

1.9. establishing reference sequence indexes of all virulence gene units;

1.10. the software (VF_MKDB) automates sequence downloading, clustering and de-duplication, and updating and standardization of the database.
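The annotation extraction of step 1.8 can be illustrated with a minimal Python sketch. It is not the VF_MKDB implementation. The header layout (an accession, optional `[gene=...]` and other `[key=value]` fields, and a trailing `[organism]` field) is an assumption made for illustration, and `parse_header` is a hypothetical helper name.

```python
import re

# Assumed header layout (illustrative, not taken from the patent):
#   >ACCESSION free text [gene=NAME] [key=value] ... [Genus species strain]
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")
# A trailing bracketed field that is not a key=value pair is treated as the organism.
ORGANISM_RE = re.compile(r"\[(?!\w+=)([^\]]+)\]\s*$")


def parse_header(header):
    """Extract accession, gene name and binomial species name from a FASTA header."""
    text = header.lstrip(">").strip()
    accession = text.split(None, 1)[0] if text else None
    gene = GENE_RE.search(text)
    organism = ORGANISM_RE.search(text)
    species = None
    if organism:
        # Standardize to "Genus species", dropping strain or serovar suffixes.
        species = " ".join(organism.group(1).split()[:2])
    return {
        "accession": accession,
        "gene": gene.group(1).strip() if gene else None,
        "species": species,
    }
```

Headers that lack either field yield `None` for that field, so such records can be routed to manual proofreading rather than silently standardized.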

2. Obtaining the raw metagenomic sequencing data of the clinical sample, and preprocessing the raw data to obtain the target data

2.1. filtering out reads in which bases with a quality value below 2 account for 40% of the total read;

2.2. trimming bases where the average quality within a sliding window (5 bp) is below 20;

2.3. filtering out reads with an average quality below 20, more than 5 N bases, or a length of less than 50.
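The three preprocessing rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented pipeline: quality strings are assumed to be Phred+33 encoded, "40%" is read as "at least 40%", and the sliding-window step is assumed to cut the read from the first failing window to its 3' end. All function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(c) - offset for c in quality_line]


def sliding_window_trim(seq, quals, window=5, min_mean=20):
    """Cut the read at the first window whose mean quality is below min_mean."""
    for start in range(0, max(len(seq) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


def preprocess_read(seq, quality_line, low_q=2, low_frac=0.40,
                    window=5, min_mean=20, max_n=5, min_len=50):
    """Return (sequence, qualities) for a read that passes, otherwise None."""
    quals = phred_scores(quality_line)
    if not quals:
        return None
    # Rule 2.1: too many very-low-quality bases.
    low = sum(1 for q in quals if q < low_q)
    if low / len(quals) >= low_frac:
        return None
    # Rule 2.2: sliding-window trimming.
    seq, quals = sliding_window_trim(seq, quals, window, min_mean)
    # Rule 2.3: length, N count and average quality of the trimmed read.
    if len(seq) < min_len:
        return None
    if seq.upper().count("N") > max_n:
        return None
    if sum(quals) / len(quals) < min_mean:
        return None
    return seq, quals
```

The length check runs before the average-quality check so that an empty trimmed read never reaches the division.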

3. Analyzing the target data by using a preset multiple alignment annotation system for metagenomic sequencing data to identify the virulence genes

The two-step alignment strategy and judgment method based on metagenomic sequencing reads is as follows:

3.1. the set of reference sequences for a particular virulence gene is defined as $\{s_1, s_2, \ldots, s_n\}$, where $s_n$ is reference sequence n;

3.2. aligning the high-quality reads (clean reads) of the metagenome to the reference sequence set with a multiple alignment algorithm (e-value threshold of 1e-5);

3.3. the alignment results of each read are $\{R_1, R_2, \ldots, R_m\} \in g_i$, where $0 \le m \le n$, $R_m$ is the m-th alignment result and $g_i$ is the i-th gene unit;

3.4. step one:

if there is a single alignment result (m = 1), that result ($R_m$) is taken as the final result (r);

3.5. if there are multiple alignment results (m > 1), there are two cases:

3.5.1. the target reference sequences belong to the same gene unit of the same species; after scoring and ranking, the final result $r_i$ is as follows:

3.5.2. the target reference sequences belong to different gene units ($g_i$, $i > 1$) of the same species; the final result r is the union of the best results in each gene group, $\{r_1, r_2, \ldots, r_i\}$, where the result within group $g_i$ is as follows:

3.6. step two:

filtering strategy for the virulence gene detection result (VF-result):

where id = sequence similarity score (%);

score = the quality score of the pairwise sequence alignment;

filtering condition: a VF-result belonging to $g_i$, $i > 1$, is discarded and the result is None;

3.7. the software (VF_Finder) implements automated alignment, filtering and result list generation.
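The two-step strategy can be sketched in Python as shown below. This is an illustrative reading only, not the VF_Finder implementation: the scoring and filtering formulas are not reproduced in this text, so the sketch assumes that "best" means the highest alignment score within a gene unit, and that step two applies caller-supplied identity and score cut-offs before discarding reads that still map to more than one gene unit. The `Hit` record, the function names and the cut-off parameters are hypothetical.

```python
from collections import namedtuple

# One alignment of a read to a reference sequence (hypothetical record layout).
Hit = namedtuple("Hit", "read_id ref_id species gene_unit identity score evalue")


def step_one(hits, max_evalue=1e-5):
    """Keep the best-scoring hit per (species, gene unit) for a single read."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue:
            continue  # fails the e-value threshold of step 3.2
        key = (h.species, h.gene_unit)
        if key not in best or h.score > best[key].score:
            best[key] = h
    # Union of the best results over all gene units hit by the read.
    return list(best.values())


def step_two(best_hits, min_identity, min_score):
    """Apply the assumed cut-offs; return one hit, or None if absent or ambiguous."""
    passed = [h for h in best_hits
              if h.identity >= min_identity and h.score >= min_score]
    if len(passed) != 1:
        return None  # no hit left, or the read still belongs to several gene units
    return passed[0]
```

The cut-offs are deliberately left without defaults, because the threshold values of the original filtering formula are not available here.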

4. Establishing important virulence gene-virulence factor-characterization (function/clinical phenotype) correlation database

4.1. collecting metagenomic sequencing data of clinical samples of the 24 pathogens (approximately 50 samples per pathogen);

4.2. analyzing the data based on the metagenome sequencing data multiple comparison annotation system, and constructing a virulence gene profile of a single pathogenic bacterium in each sample;

4.3. extracting and standardizing the corresponding clinical phenotypes and physiological and biochemical indexes of the samples, which mainly include: white blood cell count, neutrophil count, monocyte fraction, lymphocyte fraction, C-reactive protein, endotoxin, etc.;

4.4. analyzing and extracting gene characteristics by using a maximum likelihood method:

4.4.1. extraction of the physicochemical features of the protein sequences (PAAC) of virulence genes:

$a_i$: the set of physicochemical features of the 20 amino acids,

for any two amino acids $R_b$ and $R_d$, the correlation is:

where $F_k(R_b)$ is the physicochemical feature of $R_b$ at the q-th position;

4.4.2. extraction of the evolutionary features (PSSM) of virulence protein sequences:

the protein sequences of the virulence genes within a gene unit are converted by PSI-BLAST into the original PSSM (position-specific scoring matrix) as follows:

PSSM-C (PSSM-composition) converts the PSSM into a 20x20 matrix, in which the amino acid $Z_u$ of the u-th row is calculated as follows:

where

4.5. constructing a virulence gene feature profile related to clinical diagnosis by combining routine clinical test indexes with the sequence features (PAAC and PSSM-C) of the virulence genes and applying multiple machine learning strategies (multi-task logistic regression, random forest, support vector machine, etc.);

4.6. clustering synergistic virulence genes into single virulence factor units, and associating them with the corresponding characterization (function/clinical phenotype);

4.7. constructing virulence factor-virulence gene and virulence factor-characterization (function/clinical phenotype) association tables, and establishing a clinically important virulence gene-virulence factor-characterization (function/clinical phenotype) association database;

4.8. the software (VF-KDB) implements the collection, analysis and updating of data.
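The two sequence feature types of step 4.4 can be illustrated with the Python sketch below. Because the formulas are not reproduced in this text, the sketch follows the commonly used definitions of pseudo amino acid composition and PSSM-composition, which are assumed rather than confirmed to match the patent. The physicochemical property table `props` is supplied by the caller, and `lam` (the number of correlation tiers) and both function names are hypothetical. Only the default weight `omega = 0.1` comes from the claims.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def paac(seq, props, lam=2, omega=0.1):
    """Pseudo amino acid composition: 20 frequency terms plus lam correlation terms.

    props maps each amino acid letter to a list of normalized property values.
    """
    length = len(seq)
    if not 0 < lam < length:
        raise ValueError("lam must satisfy 0 < lam < len(seq)")
    n_props = len(next(iter(props.values())))

    def corr(a, b):
        # Mean squared difference of the property values of two residues.
        return sum((props[a][k] - props[b][k]) ** 2 for k in range(n_props)) / n_props

    # Sequence-order correlation factor for each tier h = 1..lam.
    theta = [
        sum(corr(seq[i], seq[i + h]) for i in range(length - h)) / (length - h)
        for h in range(1, lam + 1)
    ]
    freq = [seq.count(a) / length for a in AMINO_ACIDS]
    denom = sum(freq) + omega * sum(theta)
    return [f / denom for f in freq] + [omega * t / denom for t in theta]


def pssm_composition(seq, pssm):
    """Collapse an L x 20 PSSM into a 20 x 20 matrix, one row per amino acid type.

    Row u is the sum of the PSSM rows at positions where the residue is amino
    acid u, divided by the sequence length L (normalization is an assumption).
    """
    length = len(seq)
    if len(pssm) != length:
        raise ValueError("PSSM must have one row per residue")
    rows = {a: [0.0] * 20 for a in AMINO_ACIDS}
    for residue, scores in zip(seq, pssm):
        if residue in rows:  # residues outside the 20 standard letters are skipped
            rows[residue] = [r + s for r, s in zip(rows[residue], scores)]
    return [[v / length for v in rows[a]] for a in AMINO_ACIDS]
```

Flattening the 20x20 PSSM-C matrix and concatenating it with the PAAC vector gives one fixed-length feature vector per protein, which is the form that classifiers such as those named in step 4.5 expect.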

5. Generating a virulence gene identification report based on the virulence gene identification result and the associated database by using a preset clinical automation report system

5.1. importing the result list obtained by the multiple alignment annotation system for metagenomic sequencing data, comparing it against the important virulence gene-virulence factor-characterization (function/clinical phenotype) association database, and automatically generating a gene result list (text format) that includes the pathogenic bacteria species (Latin name and Chinese name of the species) and gene information (gene name, virulence factor, characterization, support score, etc.);

5.2. the program automatically imports the results into the corresponding tables of the report template;

5.3. the program automatically imports the client information from the database into the report template;

5.4. generating the final virulence gene identification report (PDF format) for the specific pathogenic bacteria of the clinical sample.
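Step 5.1 amounts to a lookup of each identified gene in the association database followed by export as text. The Python sketch below shows one possible shape of that step. The field names, the dictionary-based database and the tab-separated layout are assumptions for illustration, and the PDF rendering of the final report is not covered.

```python
import csv
import io

# Hypothetical column layout of the text-format gene result list.
FIELDS = ["species_latin", "species_chinese", "gene",
          "virulence_factor", "characterization", "support_score"]


def build_result_list(hits, association_db):
    """Join identified genes with the gene-factor-characterization database.

    hits: dicts with species_latin, gene and support_score.
    association_db: dict keyed by (species_latin, gene).
    """
    rows = []
    for hit in hits:
        info = association_db.get((hit["species_latin"], hit["gene"]))
        if info is None:
            continue  # assumption: genes absent from the database are not reported
        rows.append({
            "species_latin": hit["species_latin"],
            "species_chinese": info["species_chinese"],
            "gene": hit["gene"],
            "virulence_factor": info["virulence_factor"],
            "characterization": info["characterization"],
            "support_score": hit["support_score"],
        })
    return rows


def to_text(rows):
    """Serialize the result list as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the join separate from the serialization lets the same rows feed both the text list and the tables of the report template.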

As shown in fig. 2, the metagenome-based system for detecting virulence genes of clinically important pathogenic bacteria comprises the following components:

1. a clinical pathogen virulence gene database;

2. an important virulence gene-virulence factor-characterization (function/clinical phenotype) association database;

3. a multiple alignment annotation system for metagenomic sequencing data;

4. a clinical automated reporting system.

While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to the embodiments and examples, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A method for detecting virulence genes of clinically important pathogenic bacteria based on metagenome is characterized by comprising the following steps:

S10, establishing a clinical pathogenic bacterium virulence gene database;

and S50, generating a virulence gene identification report based on the virulence gene identification result and the associated database by using a preset clinical automation report system.

2. The method according to claim 1, wherein the S10 includes:

extracting public database annotation information of the reference sequence, and checking and standardizing reference sequence annotations of each gene unit;

establishing reference sequence indexes of all virulence gene units;

3. The method according to claim 1, wherein the S20 includes:

4. The method according to claim 1, wherein the S30 includes:

aligning the high-quality reads of the metagenome to a reference sequence set by using a multiple alignment algorithm, wherein the e-value threshold is 1e-5;

filtering strategy for the virulence gene detection results:

where id = sequence similarity score (%);

score = the quality score of the pairwise sequence alignment;

5. The method of claim 4, wherein the comparing in S30 comprises:

when there is a single alignment result, taking that result as the final result;

when there are multiple alignment results and the target reference sequences belong to the same gene unit of the same species, after scoring and ranking, the final result $r_i$ is as follows:

when there are multiple alignment results and the target reference sequences belong to different gene units of the same species, the final result is the union of the best results in each gene group, where the result within group $g_i$ is as follows:

6. The method according to claim 1, wherein the S40 includes:

collecting metagenome sequencing data of a pathogenic bacteria clinical sample;

combining clinical routine detection indexes and sequence characteristics of virulence genes, and applying a multiple machine learning strategy to construct a virulence gene characteristic spectrum related to clinical diagnosis;

7. The method according to claim 6, wherein the extracting the gene features by applying the maximum likelihood analysis in S40 comprises:

extraction of the physicochemical features of the protein sequences of virulence genes:

$a_i$: the set of physicochemical features of the 20 amino acids,

n = 1, 2, …, N; x = 1, …, n;

for any two amino acids $R_b$ and $R_d$, the correlation is:

where $F_k(R_b)$ is the physicochemical feature of $R_b$ at the q-th position;

h = 1, 2, …, L-1;

where $f_e$ is the frequency of amino acid e in the sequence, and ω is the weighting parameter for amino acid positions in the sequence, 0.1 by default;

extraction of the evolutionary features of the virulence protein sequences:

the PSSM is transformed into a 20x20 matrix, in which the amino acid $Z_u$ of the u-th row is calculated as follows:

where

(u = 1, …, 20; t = 1, …, L)

8. The method according to claim 1, wherein the S50 includes:

importing the result list obtained in the S30, and comparing the result list with the S40 association database to generate a virulence gene result which comprises pathogenic bacteria species and gene information;

importing the result into a corresponding table of a report template;

importing the client information of the database into a report template;

generating a virulence gene identification report of the specific pathogenic bacteria of the final clinical sample.

9. A system for detecting virulence genes of clinically important pathogenic bacteria based on metagenome comprises the following components:

a clinical pathogen virulence gene database;

a multiple alignment annotation system for metagenomic sequencing data;

a clinical automated reporting system.