CN114333987A - Metagenome sequencing-based data analysis method for predicting drug resistance phenotype - Google Patents
Metagenome sequencing-based data analysis method for predicting drug resistance phenotype Download PDFInfo
- Publication number
- CN114333987A CN114333987A CN202111680866.8A CN202111680866A CN114333987A CN 114333987 A CN114333987 A CN 114333987A CN 202111680866 A CN202111680866 A CN 202111680866A CN 114333987 A CN114333987 A CN 114333987A
- Authority
- CN
- China
- Prior art keywords
- drug
- gene
- genes
- resistant
- drug resistance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000012163 sequencing technique Methods 0.000 title claims abstract description 54
- 238000000034 method Methods 0.000 title claims description 59
- 238000007405 data analysis Methods 0.000 title description 4
- 238000004902 predicting drug resistance Methods 0.000 title description 2
- 108090000623 proteins and genes Proteins 0.000 claims abstract description 408
- 239000003814 drug Substances 0.000 claims abstract description 243
- 229940079593 drug Drugs 0.000 claims abstract description 243
- 238000001514 detection method Methods 0.000 claims abstract description 107
- 244000052616 bacterial pathogen Species 0.000 claims abstract description 35
- 238000010276 construction Methods 0.000 claims abstract description 5
- 206010059866 Drug resistance Diseases 0.000 claims description 152
- 230000035945 sensitivity Effects 0.000 claims description 81
- 241000894007 species Species 0.000 claims description 48
- 238000012549 training Methods 0.000 claims description 36
- 244000052769 pathogen Species 0.000 claims description 33
- 230000001717 pathogenic effect Effects 0.000 claims description 31
- 241000588747 Klebsiella pneumoniae Species 0.000 claims description 24
- 238000001914 filtration Methods 0.000 claims description 24
- 230000001580 bacterial effect Effects 0.000 claims description 19
- 238000003205 genotyping method Methods 0.000 claims description 18
- 238000012216 screening Methods 0.000 claims description 17
- 238000004088 simulation Methods 0.000 claims description 17
- 230000008569 process Effects 0.000 claims description 16
- 238000004458 analytical method Methods 0.000 claims description 13
- 238000007481 next generation sequencing Methods 0.000 claims description 13
- 108010065885 aminoglycoside N(3')-acetyltransferase Proteins 0.000 claims description 11
- 238000004364 calculation method Methods 0.000 claims description 11
- 238000007672 fourth generation sequencing Methods 0.000 claims description 10
- -1 AAC (3) -IId Proteins 0.000 claims description 9
- 239000003153 chemical reaction reagent Substances 0.000 claims description 9
- 101100108294 Caenorhabditis elegans aex-5 gene Proteins 0.000 claims description 8
- 150000007523 nucleic acids Chemical group 0.000 claims description 8
- 239000013612 plasmid Substances 0.000 claims description 8
- 238000002360 preparation method Methods 0.000 claims description 8
- 101100231599 Azotobacter vinelandii (strain DJ / ATCC BAA-1303) mphE gene Proteins 0.000 claims description 7
- 101100026178 Caenorhabditis elegans egl-3 gene Proteins 0.000 claims description 7
- 101100472523 Proteus mirabilis rmtC gene Proteins 0.000 claims description 7
- 101100472522 Serratia marcescens rmtB gene Proteins 0.000 claims description 7
- 241001123248 Arma Species 0.000 claims description 6
- 238000004422 calculation algorithm Methods 0.000 claims description 6
- 238000002864 sequence alignment Methods 0.000 claims description 6
- CEAZRRDELHUEMR-URQXQFDESA-N Gentamicin Chemical compound O1[C@H](C(C)NC)CC[C@@H](N)[C@H]1O[C@H]1[C@H](O)[C@@H](O[C@@H]2[C@@H]([C@@H](NC)[C@@](C)(O)CO2)O)[C@H](N)C[C@@H]1N CEAZRRDELHUEMR-URQXQFDESA-N 0.000 claims description 5
- 229930182566 Gentamicin Natural products 0.000 claims description 5
- 229960000484 ceftazidime Drugs 0.000 claims description 5
- NMVPEQXCMGEDNH-TZVUEUGBSA-N ceftazidime pentahydrate Chemical compound O.O.O.O.O.S([C@@H]1[C@@H](C(N1C=1C([O-])=O)=O)NC(=O)\C(=N/OC(C)(C)C(O)=O)C=2N=C(N)SC=2)CC=1C[N+]1=CC=CC=C1 NMVPEQXCMGEDNH-TZVUEUGBSA-N 0.000 claims description 5
- 150000001875 compounds Chemical class 0.000 claims description 5
- 229960002518 gentamicin Drugs 0.000 claims description 5
- 229960005404 sulfamethoxazole Drugs 0.000 claims description 5
- JLKIGFTWXXRPMT-UHFFFAOYSA-N sulphamethoxazole Chemical compound O1C(C)=CC(NS(=O)(=O)C=2C=CC(N)=CC=2)=N1 JLKIGFTWXXRPMT-UHFFFAOYSA-N 0.000 claims description 5
- 229960000707 tobramycin Drugs 0.000 claims description 5
- NLVFBUXFDBBNBW-PBSUHMDJSA-N tobramycin Chemical compound N[C@@H]1C[C@H](O)[C@@H](CN)O[C@@H]1O[C@H]1[C@H](O)[C@@H](O[C@@H]2[C@@H]([C@@H](N)[C@H](O)[C@@H](CO)O2)O)[C@H](N)C[C@@H]1N NLVFBUXFDBBNBW-PBSUHMDJSA-N 0.000 claims description 5
- 238000012937 correction Methods 0.000 claims description 4
- 238000009826 distribution Methods 0.000 claims description 4
- 238000003908 quality control method Methods 0.000 claims description 4
- 230000001404 mediated effect Effects 0.000 claims description 3
- 108091028043 Nucleic acid sequence Proteins 0.000 claims description 2
- 239000000203 mixture Substances 0.000 claims description 2
- 239000004973 liquid crystal related substance Substances 0.000 claims 1
- 239000000463 material Substances 0.000 claims 1
- 238000012360 testing method Methods 0.000 description 18
- 230000003115 biocidal effect Effects 0.000 description 17
- 230000007246 mechanism Effects 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 6
- 208000015181 infectious disease Diseases 0.000 description 6
- 239000011159 matrix material Substances 0.000 description 5
- 108020004707 nucleic acids Proteins 0.000 description 5
- 102000039446 nucleic acids Human genes 0.000 description 5
- 238000012795 verification Methods 0.000 description 5
- YZBQHRLRFGPBSL-RXMQYKEDSA-N carbapenem Chemical class C1C=CN2C(=O)C[C@H]21 YZBQHRLRFGPBSL-RXMQYKEDSA-N 0.000 description 4
- 238000010219 correlation analysis Methods 0.000 description 4
- 238000011896 sensitive detection Methods 0.000 description 4
- 238000012070 whole genome sequencing analysis Methods 0.000 description 4
- 241000894006 Bacteria Species 0.000 description 3
- 108090000790 Enzymes Proteins 0.000 description 3
- 102000004190 Enzymes Human genes 0.000 description 3
- 206010033307 Overweight Diseases 0.000 description 3
- 238000012098 association analyses Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 3
- 238000003745 diagnosis Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000012268 genome sequencing Methods 0.000 description 3
- 241000588626 Acinetobacter baumannii Species 0.000 description 2
- 241000588724 Escherichia coli Species 0.000 description 2
- 241000589517 Pseudomonas aeruginosa Species 0.000 description 2
- 229940126575 aminoglycoside Drugs 0.000 description 2
- 238000003556 assay Methods 0.000 description 2
- 108010068385 carbapenemase Proteins 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000003317 immunochromatography Methods 0.000 description 2
- 230000002458 infectious effect Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 230000000813 microbial effect Effects 0.000 description 2
- 238000009629 microbiological culture Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000010200 validation analysis Methods 0.000 description 2
- 108700026220 vif Genes Proteins 0.000 description 2
- IZXIZTKNFFYFOF-UHFFFAOYSA-N 2-Oxazolidone Chemical class O=C1NCCO1 IZXIZTKNFFYFOF-UHFFFAOYSA-N 0.000 description 1
- 101100162340 Anguilla japonica l-2 gene Proteins 0.000 description 1
- 241000588697 Enterobacter cloacae Species 0.000 description 1
- 241001078637 Enterobacter cloacae complex Species 0.000 description 1
- 241000588921 Enterobacteriaceae Species 0.000 description 1
- 241000194032 Enterococcus faecalis Species 0.000 description 1
- 241000194031 Enterococcus faecium Species 0.000 description 1
- 102000002068 Glycopeptides Human genes 0.000 description 1
- 108010015899 Glycopeptides Proteins 0.000 description 1
- 241000606768 Haemophilus influenzae Species 0.000 description 1
- 206010034133 Pathogen resistance Diseases 0.000 description 1
- 241000191967 Staphylococcus aureus Species 0.000 description 1
- 241000191963 Staphylococcus epidermidis Species 0.000 description 1
- 241000193998 Streptococcus pneumoniae Species 0.000 description 1
- 241000193996 Streptococcus pyogenes Species 0.000 description 1
- 239000004098 Tetracycline Substances 0.000 description 1
- WKDDRNSBRWANNC-UHFFFAOYSA-N Thienamycin Natural products C1C(SCCN)=C(C(O)=O)N2C(=O)C(C(O)C)C21 WKDDRNSBRWANNC-UHFFFAOYSA-N 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 239000003242 anti bacterial agent Substances 0.000 description 1
- 238000009635 antibiotic susceptibility testing Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 210000004369 blood Anatomy 0.000 description 1
- 239000008280 blood Substances 0.000 description 1
- 229940041011 carbapenems Drugs 0.000 description 1
- 125000001271 cephalosporin group Chemical group 0.000 description 1
- 210000001175 cerebrospinal fluid Anatomy 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 238000013211 curve analysis Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 229940032049 enterococcus faecalis Drugs 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 239000012530 fluid Substances 0.000 description 1
- 238000012120 genotypic test Methods 0.000 description 1
- PCHJSUWPFVWCPO-UHFFFAOYSA-N gold Chemical compound [Au] PCHJSUWPFVWCPO-UHFFFAOYSA-N 0.000 description 1
- 229940047650 haemophilus influenzae Drugs 0.000 description 1
- 238000012165 high-throughput sequencing Methods 0.000 description 1
- ZSKVGTPCRGIANV-ZXFLCMHBSA-N imipenem Chemical compound C1C(SCC\N=C\N)=C(C(O)=O)N2C(=O)[C@H]([C@H](O)C)[C@H]21 ZSKVGTPCRGIANV-ZXFLCMHBSA-N 0.000 description 1
- 229960002182 imipenem Drugs 0.000 description 1
- 238000001727 in vivo Methods 0.000 description 1
- 230000002779 inactivation Effects 0.000 description 1
- 239000003112 inhibitor Substances 0.000 description 1
- 238000009533 lab test Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 229960002260 meropenem Drugs 0.000 description 1
- DMJNNHOOLUXYBV-PQTSNVLCSA-N meropenem Chemical compound C=1([C@H](C)[C@@H]2[C@H](C(N2C=1C(O)=O)=O)[C@H](O)C)S[C@@H]1CN[C@H](C(=O)N(C)C)C1 DMJNNHOOLUXYBV-PQTSNVLCSA-N 0.000 description 1
- 244000005700 microbiome Species 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 244000045947 parasite Species 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 150000007660 quinolones Chemical class 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000010561 standard procedure Methods 0.000 description 1
- 238000003860 storage Methods 0.000 description 1
- 229940031000 streptococcus pneumoniae Drugs 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 229940124530 sulfonamide Drugs 0.000 description 1
- 150000003456 sulfonamides Chemical class 0.000 description 1
- 235000019364 tetracycline Nutrition 0.000 description 1
- 150000003522 tetracyclines Chemical class 0.000 description 1
- 229940040944 tetracyclines Drugs 0.000 description 1
- 238000001269 time-of-flight mass spectrometry Methods 0.000 description 1
- 210000004881 tumor cell Anatomy 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A50/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
- Y02A50/30—Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Pharmacology & Pharmacy (AREA)
- Analytical Chemistry (AREA)
- Crystallography & Structural Chemistry (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Medicinal Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The application relates to the technical field of bioinformatics, in particular to a model construction method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing reads comparison. The invention combines the drug-resistant phenotype related important characteristic genes to construct and realize the direct comparison detection and identification of target pathogenic bacteria and drug-resistant genes carried by the target pathogenic bacteria based on a sequencing read sequence and the prediction of the drug-resistant phenotype.
Description
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing read comparison.
Technical Field
Accurate detection of infectious pathogens and drug resistance thereof is the key to guide clinical accurate medication. At present, laboratory tests for drug resistance of infectious pathogenic bacteria are classified into phenotypic tests and genotypic tests. In the aspect of a phenotype detection method, a gold standard method of microbial culture and drug sensitivity test is mainly clinically adopted, the method can provide a powerful basis for diagnosis and treatment of clinical infection, but has some limitations, such as time consumption for culture (generally 2-4 days), low positive rate of pathogen culture, easy influence of various uncertain factors on the culture process and the like, and some difficultly-cultured or rare pathogens even cannot be cultured successfully. Other detection techniques such as Carba NP test, modified carbapenem inactivation test (including mCIM and eCIM), carbapenemase inhibitor enhancement test, time-of-flight mass spectrometry technology and the like are detection aiming at clinically common carbapenemases of Enterobacteriaceae, and are only suitable for detecting carbapenem drug resistance caused by production of related enzymes, and can not detect drug resistance caused by other mechanisms (such as efflux pump mechanism). And (3) genotype detection, including an enzyme immunochromatography technology and a gene detection technology. The enzyme immunochromatography technology and the conventional gene detection technology (such as GeneXpert, Verigene, Filmerray and other detection systems) aim at a specific target gene and have the characteristic of rapidness and easiness in reading, but if the gene to be detected is different from the target gene, a false negative result is easy to appear.
With the continuous development of sequencing technology and the reduction of sequencing cost, the detection of pathogenic bacteria and drug resistance genes thereof based on sequencing is also becoming popular. Microbial genome sequencing, including both Whole Genome Sequencing (WGS) and metagenome sequencing (mNGS) strategies. For whole genome sequencing of bacterial strains, research shows that a model (rule-based model or machine-learning model) is constructed based on genome data of all strains of a single population and corresponding drug-resistant phenotype data, and then the model is used for analyzing and predicting the drug-resistant phenotype of the new strains based on the genome data of the new strains, so that the method has a good effect (the accuracy can reach more than 95%), but the method has a limitation that the whole genome sequencing similarly avoids the limitation of pathogen culture enrichment. The pathogen metagenome high-throughput sequencing (mNGS) is a novel pathogen detection means developed in more than ten years, is different from microbial culture, and mNGS can read the base sequences of all pathogen nucleic acids from a small amount of samples at one time without screening pure cultures of all pathogens in the environment, and provides information such as the types of pathogens. The mNGS has the characteristics of rapid detection, wide coverage pathogen and no deviation, and a plurality of related application research articles are published in a high-level SCI journal, and moreover, a plurality of published articles are analyzed and discussed aiming at the detection of a drug resistance gene, which shows that the mNGS has the potential value of predicting the drug sensitivity characteristic of pathogenic bacteria. However, in clinical specimens (alveolar lavage fluid, blood, cerebrospinal fluid and the like), the specimens are known to be seriously polluted by hosts (the total nucleic acid of the extracted specimens accounts for more than 95 percent), the microbial load contained in the specimens is low, and the bacterial genomes measured by the clinical specimens cover much more than 1X under the conventional sequencing quantity of 20M or so reads. Then, in the case of this amount of sequencing data, it is necessary to further investigate the detection efficacy of the mNGS, whether the mNGS can identify the pathogenic bacteria well and detect the drug resistance gene accurately at the same time to predict the drug resistance of the pathogenic bacteria.
Disclosure of Invention
In order to solve the technical problems, the invention combines important characteristic genes related to the drug-resistant phenotype which is screened and determined by BGWAS to construct a data analysis method and a data analysis system for directly carrying out comparison detection and identification on target pathogenic bacteria and drug-resistant genes carried by the target pathogenic bacteria based on an NGS or nanopore sequencing read sequence and predicting the drug-resistant phenotype. Aiming at all strain genomes of a training set brought into an early BGWAS, detecting a target drug-resistant gene by simulating NGS or nopore sequencing read sequence comparison (read-based) and genome Contig sequence comparison (assembly-based), and verifying and optimizing a read-based detection process by taking an assembly-based method detection result as a reference so as to realize the purpose of accurately detecting the genotyping of the read-based. And then, calculating the Score by a self-defined formula, taking the Score as an interpretation index for predicting the drug sensitivity property of the antibiotic drug, and carrying out ROC analysis by combining a read sequence simulation test to determine an optimal cutoff threshold value and simultaneously evaluating the accuracy and the performance of a prediction model. Finally, the effectiveness of the assay system is assessed using clinical specimens or specimens of pure pathogenic bacterial strains isolated by culture for validation.
Specifically, the application provides the following technical scheme:
the application firstly provides a method for detecting and identifying drug-resistant genes and predicting the construction of a drug-resistant phenotype model based on gene sequencing reads comparison, which comprises the following steps:
step 1): combining classification information of drug resistance genes such as CARD drug resistance database, arranging and correcting drug resistance gene family information and consistency annotation information among gene sequences, preferably, arranging and correcting source species information and/or a mediation mode;
step 2): calculating a drug-resistant gene family weight coefficient, namely calculating the gene family weight coefficient based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set;
step 3): detection of drug resistance genes and process correction.
Further, in the step 1),
the gene family is defined according to the family information of the drug resistance gene recorded by the drug resistance database, and the information of the family of the drug resistance gene recorded by the NCBI NDARO database and the MEGARes database is referred to for combing and correcting;
comparing all the drug-resistant reference genes with an NCBI NT library, and reserving hit with the identity > of 95% and the subject coverage > of 95% to obtain all species annotation information of each drug-resistant gene;
the mediation mode is to inquire whether the drug resistance gene is mediated by the plasmid in the reference sequence description information on the comparison;
the information of the consistency annotation among the gene sequences is that all the drug-resistant gene sequences are compared pairwise to obtain the consistency value among all the gene sequences.
Further, in step 2), the weight coefficient defines the following formula:
in the formula, arg _ Ni is the number of samples detected by corresponding genes in a target gene family in a BGWAS model training set, arg _ Wi is a weight coefficient of the corresponding genes in the family, j represents j key genes in the family, and j + k represents the number of all genes in the family.
Further, the step 3) of detecting the drug-resistant gene and correcting the process comprises the following steps:
d) sequencing Reads data simulation;
e) sequence alignment and annotation statistics;
f) screening and filtering drug-resistant genes.
Further, the a) sequencing Reads data simulation is to perform data simulation on a strain sample based on a BGWAS model training set; preferably:
simulating bacterial strain genome NGS sequencing short reads sequence data by ART _ Illumina software;
simulating bacterial strain genome Nanopore sequencing reads sequence data using ReadSim software;
more preferably, the simulation is to simulate gradient data volumes of 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X.
Further, the b) sequence alignment and annotation statistics comprise:
comparing the simulated reads sequence with a drug-resistant gene library, and filtering low-quality hit; performing final gene annotation, counting to obtain the detected specific reads number, multiple comparison reads number and the reads number of the family to which the drug-resistant gene belongs in the sample, and calculating the coverage of the detected gene;
preferably, the final gene annotation of the reads sequence using the optimal alignment and LCA algorithm is: selecting the hit, namely best hit, of the highest score of each read sequence as final hit, and if the best hit has a plurality of same values, namely multiple alignments, performing final annotation on the read sequence by adopting an LCA algorithm on the plurality of hits, namely for a single read sequence, annotating the genotype to a higher level as a gene family level due to the multiple alignments;
more preferably, it is a mixture of more preferably,
aiming at NGS sequencing data, comparing a simulated reads sequence with a drug-resistant gene library by using blastn software, and filtering to reserve hit with the identity higher than 90%;
alignment to the drug-resistant reference gene library was performed using minimap2 software for Nanopore sequencing data, filtering out hits with identity below 0.7 or subject coverage below 0.4.
Further, the c) screening and filtering of the drug-resistant gene comprises the following steps: screening and filtering the drug-resistant genes of the upper reads sequence compared in the step b);
preferably, any one or more of the following screening filter criteria are included:
A) evaluating the influence of sequence consistency among different typing genes in a drug-resistant gene reference library on the accurate detection of the drug-resistant gene typing by read-based: drug-resistant genes with different maximum sequence consistencies in a database are picked, short reads sequences are simulated to carry out read-based flow detection analysis, and the number of specific reads detected by a target gene and the number of reads of all the compared target genes are counted; aiming at the strategy of detecting and identifying the genotyping according to whether a specific reads sequence exists, 95% identity is selected as a threshold standard for easily realizing accurate genotyping of a target gene;
B) screening and filtering based on drug resistance genotyping identification: aiming at target genes with high similarity with other genes in the database, adopting all rows of first bits of reads capable of comparing the target genes as true positive detection, and calculating non-first bits of reserved genes based on accurate comparison reads to obtain genes with 100% coverage as true positive detection; secondly, aiming at the target genes with low similarity with other genes in the database, judging whether the target genes are true positive results or not according to the detected number of the specific reads; direct filtration for some gene families where no specific reads were detected;
C) evaluating the accuracy of the read-based detection for identifying the drug-resistant genotyping: taking the result of the assembly-based detection of the drug-resistant gene in the BGWAS model training process as reference, counting the important genes or gene families screened out by the BGWAS model to obtain the accuracy, sensitivity and specificity indexes of the read-based detection of the drug-resistant gene or gene family,
further, step 4): defining and calculating a Score value of a negative and positive judgment index, and determining a report rule and a cutoff threshold value based on ROC analysis;
the Score values were calculated as follows:
wherein arg _ Wi represents a weight coefficient of a corresponding genotype, and genefamly _ Wi represents a weight coefficient of a corresponding gene family; when the genotype is detected and the genotype weight coefficient is greater than 0, calculating by the genotype weight coefficient; when the genotype is detected, but the genotype weighting factor is 0 or no weighting factor, the gene family weighting factor is calculated.
Further, the sequencing reads are primary, secondary and tertiary sequencing reads, preferably NGS or Nanopore sequencing reads; more preferred are NGS or Nanopore metagenomic sequencing reads.
The invention also provides a model construction method for predicting the attribution of the drug-resistant gene species, which is characterized by comprising the following steps:
step 1): comparing the target pathogenic bacteria genome sequence and calculating the detected sequence number, genome coverage and coverage depth;
step 2): counting the copy number of the drug-resistant gene carried by the target pathogenic strain based on the detection result of the drug-resistant gene of the BGWAS model training set sample;
step 3): and calculating the copy number of the drug-resistant gene and judging the species attribution based on the assumed gene-species attribution relation.
Further, the step 1) is specifically that
Selecting clinically common pathogenic bacteria as target pathogens, searching and downloading a target pathogen reference genome from an NCBI genome database, and taking the target pathogen reference genome as a reference sequence library for identification of target pathogen strains;
comparing each sequencing reads sequence with the reference sequence library, and calculating the detected sequence number, genome coverage and coverage depth of the target pathogenic bacteria in comparison; and (4) counting to obtain the total reads sequence number, the genome coverage and the coverage depth of the detected pathogenic bacteria.
Further, the step 2) is specifically as follows:
and counting to obtain the detection distribution and copy variation range of the drug resistance genes and the drug resistance gene families of the target pathogenic strains based on the detection results of the assembly-based drug resistance genes of the training set samples during the BGWAS model training.
Further, the step 3) is specifically:
when assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode annotation of the reference gene in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads, and the copy number of the drug-resistant gene is calculated according to the following formula:
and if the calculated copy number of the drug-resistant gene falls within the normal copy number range of the target gene family obtained based on the statistics of the BGWAS model training set, accepting the assumed gene-species attribution relationship, and otherwise, rejecting the assumed gene-species attribution relationship.
The invention also provides a method for detecting the drug resistance of the metagenome sequencing data, which comprises the following steps:
1) performing quality control and human source nucleic acid sequence removal on sample sequencing data;
2) detecting and identifying the drug resistance gene contained in the sample: carrying out drug-resistant gene comparison and annotation statistics on a sample sequence based on the detection and identification method, and detecting and identifying drug-resistant genes contained in the sample;
3) predicting the species attribution of the detected drug-resistant genes in the sample: identifying target pathogenic bacteria contained in the sample according to the species attribution prediction method, and predicting the species attribution of the detected drug-resistant gene;
4) according to the detected drug resistance gene carrying condition of the target pathogenic bacteria, the score value of the target species-antibiotic drug is calculated and obtained according to the score calculation mode, and is compared with the cutoff value: when score > -cutoff, then predict as R; score < cutoff, S is predicted if the detected pathogen genome coverage is higher than the minimum genome coverage or data volume required for model stability, otherwise is reported as unknown.
The invention also provides a model for detecting and identifying drug-resistant genes based on gene sequencing reads comparison, which comprises the following modules:
module 1): the method is used for combining classification information of drug resistance genes in the CARD drug resistance database, sorting and correcting drug resistance gene family information and consistency annotation information among gene sequences, and preferably, sorting and correcting source species information and/or a mediation mode;
module 2): the method is used for calculating the weight coefficient of the drug-resistant gene family, and the weight coefficient of the gene family is calculated based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set;
module 3): the method is used for detecting drug resistance genes and correcting the process.
The invention also provides a prediction model of drug-resistant gene species affiliation, and the method comprises the following steps:
module 1): the method is used for comparing the target pathogenic bacteria genome sequence and calculating the detected sequence number, genome coverage and coverage depth;
module 2): the method is used for counting the copy number of the drug-resistant gene carried by a target pathogenic strain based on the detection result of the drug-resistant gene of a BGWAS model training set sample;
module 3): the method is used for calculating the copy number of the drug-resistant gene and judging the species attribution based on the assumed gene-species attribution relationship.
The further definition of each module in the model is the same as that of each step in any one of the above methods.
The present invention also provides an apparatus comprising: at least one memory for storing a program; at least one processor configured to load the program to perform the method of any of the above.
The invention also provides a storage medium having stored therein processor-executable instructions for implementing a method as described in any one of the above when executed by a processor.
The invention also provides the following:
the application of the genes AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33 and ANT (2') -Ia as non-core type drug-resistant genes in auxiliary drug sensitivity prediction of Klebsiella pneumoniae;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to a gentamicin drug.
The application of detection reagents aiming at non-core type drug-resistant genes AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33 and ANT (2') -Ia in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to a gentamicin drug;
further preferably, the genes AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33 and ANT (2') -Ia which are high in weight and mainly mediate the generation of drug resistance are simultaneously detected, and if the detection results are negative, sensitivity is presumed.
The application of the genes AAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 as non-core type drug resistance genes in auxiliary drug sensitivity prediction of Klebsiella pneumoniae;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to tobramycin drugs.
The application of detection reagents aiming at non-core type drug resistance genes AAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to tobramycin drugs;
further preferably, the genes AAAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 which have high frequency occurrence and high weight and mainly mediate the generation of drug resistance are simultaneously detected, and if the detection results are negative, the sensitivity is presumed.
The application of the genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 as non-core type drug resistance genes in auxiliary drug sensitivity prediction of Klebsiella pneumoniae.
The drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is directed to a ceftazidime drug.
The application of the detection reagent for the non-core type drug resistance genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 in the preparation of the Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction is a drug resistance prediction;
preferably, the drug sensitivity is directed to a ceftazidime drug.
More preferably, the drug resistance is estimated by detecting the CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 genes which mainly mediate the high frequency occurrence of the drug resistance and have higher weight.
Use of genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 as non-core type drug resistance genes in assisted drug susceptibility prediction of klebsiella pneumoniae;
the drug susceptibility prediction is a drug resistance prediction;
preferably, the drug sensitivity is directed to a compound sulfamethoxazole drug.
The application of detection reagents aiming at non-core type drug resistance genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA2 and dfrA1 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction is a drug resistance prediction;
preferably, the drug sensitivity is directed to a compound sulfamethoxazole drug.
More preferably, the drug resistance is estimated by simultaneously detecting the higher-weighted genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1, which mainly mediate the high frequency occurrence of drug resistance.
The application has the beneficial technical effects that:
1) the invention relates to a method for detecting drug resistance based on nucleic acid molecules, which bypasses the limitation of traditional culture, directly performs metagenomic sequencing detection on clinical specimens to identify target pathogenic strains and drug resistance gene carrying conditions thereof, and further predicts drug sensitivity results of antibiotic drugs based on the detection characteristics of the drug resistance genes.
2) When the drug-resistant gene is detected and identified, the drug-resistant gene is directly detected by comparing the reads sequence sequenced on the basis of NGS or nonapore, and compared with the detection by comparing the sequence based on genome contig, the method bypasses the step of genome assembly and has higher detection sensitivity. Specifically, a set of read-based drug resistance gene comparison detection method and a corresponding database are constructed, and meanwhile, the drug resistance gene detection result of the empty-based method is used as a reference to verify and evaluate the performance of the read-based process, so that the accuracy of the read-based detection of the drug resistance gene is ensured.
3) Aiming at read-based drug resistance gene detection, the invention constructs a specific drug resistance gene reference database, particularly the arrangement of drug resistance gene typing and classification and realizes the annotation strategy that LCA can be adopted for a sequence to be inquired. Specifically, all gene sequences of the CARD drug-resistant public library are collected as reference genes, a gene multilevel level labeling mode of a MEGARes database and family information of the drug-resistant genes recorded in an NCBI NDARO database are referred, and 6-level labeling is performed on each reference gene, so that an LCA (local common indicators) annotation strategy of the drug-resistant genes can be realized, namely, the number of detected specific reads at each level is obtained. For the example of the OXA-181 gene, the levels at level 6 are labeled as: OXA-181__1(L1_ geneST), OXA-181(L2_ gene), OXA-48subfamily (L3_ subgroup), OXA family (L4_ Group), Class _ D _ betaactams (L5_ Mechanism), betaactams (L6_ Class).
4) Aiming at the accurate detection of read-based drug resistance genotyping, the invention adopts a strategy of combining two rules, and can effectively improve the genotyping capability and accuracy of the detection process. Firstly, aiming at target genes with high similarity (such as consistency exceeding 95%) with other genes in a database, the first-order ARG-like reads (namely all reads capable of being compared with the target genes or specific reads plus multiple comparison reads) is adopted as a true positive detection, and for non-first-order genes, the genes with the coverage of 100% calculated based on the accurate comparison reads are reserved as the true positive detection; and secondly, aiming at the target genes with low similarity (the consistency is lower than 95%) with other genes in the database, judging whether the target genes are true positive results or not according to the detected number of specific reads. For some families where specific reads were not detected, they were considered false positives and were directly filtered out.
5) The invention defines a method (or formula) for calculating the corresponding weight coefficient of the drug-resistant gene family based on the weight coefficient of the drug-resistant gene typing, and aims at the condition that the gene typing is possibly inaccurate, the gene family weight is used for calculating and predicting the drug sensitivity result instead, so that the false positive caused by inaccurate gene typing is effectively avoided.
6) The invention defines a method rule for predicting the drug sensitivity result, namely a calculation formula of a negative and positive interpretation index Score is defined, and the performance of a prediction model is evaluated and a cutoff threshold value is determined by combining gradient simulation tests of different sequencing data volumes, so that the antibiotic drug sensitivity result can be effectively predicted finally.
7) The invention aims at the metagenome sequencing (mixed flora sequencing) of a clinical specimen, defines a method (or formula) for calculating the copy number of a drug-resistant gene based on the detection sequence conditions of target pathogenic bacteria and the drug-resistant gene, and predicts and estimates the possible pathogenic species source of the drug-resistant gene according to the calculated copy number of the drug-resistant gene. Specifically, when the drug resistance gene-pathogen species correspondence is inferred, the drug resistance gene is assumed to be from a certain target species according to actually detected information of the drug resistance gene and the pathogen species, then the copy number of the drug resistance gene is calculated, whether the calculated copy number falls within a normal range is checked, if the copy number is normal, the attribution relationship is considered to be accepted, and if the copy number is not normal, the attribution relationship is rejected. When assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode information of the reference gene marked in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads.
8) By taking the detection of the drug resistance of the Klebsiella pneumoniae antibiotic drug as an example, the method can accurately identify Klebsiella pneumoniae strains and drug resistance genes carried by the Klebsiella pneumoniae strains, can effectively predict the drug sensitivity results of carbapenems (imipenem and meropenem) and aminoglycosides (gentamicin and tobramycin) and predict the drug resistance properties of ceftazidime and compound sulfamethoxazole, and has the prediction accuracy rate of over 90 percent. Clinical specimen sampling verification shows that the accuracy rate of carbapenem drug sensitivity prediction can reach 100%, and the sample proportion prompted by a drug sensitivity prediction result is definitely over 80%. The invention can assist the clinical detection and diagnosis of infection drug-resistant bacteria.
Drawings
FIG. 1 is a technical roadmap of the present invention;
FIG. 2 is a graph showing the results of a drug-resistant gene detection test performed by simulating 100X target gene read data for target genes with different identities from other genes in the database. In the figure, ARG-like represents the ratio of the detected number of sequences of a target gene or a non-target gene to the detected number of sequences of a family to which the target gene belongs, and specificity represents the detected number of Specific sequences of the target gene.
FIG. 3 is a graph showing the accuracy of the read-based identification of genotyping or genotyping of gene families, with reference to the result of the assay-based drug resistance gene detection.
FIG. 4 technical flow chart of Score index calculation and report rules based on read-based drug resistance gene detection
FIG. 5 is a graph of performance (AUC value) change of 6 antibiotic resistance prediction models under simulation of different sequencing data amounts
Figure 6 is a graph of the performance (AUC values) of a predictive model of 6 antibiotic drugs at 30X genomic data volume for the training and validation sets.
Detailed Description
Embodiments of the present application will be described in detail below with reference to examples, but those skilled in the art will appreciate that the following examples are only illustrative of the present application and should not be construed as limiting the scope of the present application. The examples, in which specific conditions are not specified, were conducted under conventional conditions or conditions recommended by the manufacturer. The reagents or instruments used are not indicated by manufacturers, and are all conventional products available on the market.
Definition of partial terms
Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present application are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present application.
As used in this application, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.
Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.
The term "about" in the present application denotes an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.
Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
The term "drug resistance" as used herein refers to the tolerance of microorganisms, parasites and tumor cells to the action of drugs, which is significantly reduced once drug resistance is developed. Preferred in this application means that the infection in vivo is bacterial resistance to antibiotic drugs.
The term "drug-resistant phenotype" as used herein generally refers to a characteristic of drug resistance exhibited by a living organism referred to as a drug-resistant phenotype (resistance genotype), and a drug-resistant gene possessed by the living organism referred to as a drug-resistant genotype (resistance genotype).
The "non-core gene" as referred to herein is a gene which exists only in a part of strains with respect to a certain bacterial population, and corresponds to a core gene, i.e., a gene existing in all strains. The drug resistance gene detected by the method is mainly directed to the non-core gene.
The "important characteristic gene" refers to the non-core drug resistance gene, i.e., a drug resistance characteristic or a drug resistance gene significantly related to a drug resistance phenotype of an antibiotic drug.
The "read-based" of the present invention refers to the alignment of read sequences: and (4) directly comparing the sequencing read sequence with a drug-resistant gene library to detect and analyze the drug-resistant gene.
The "assembly-based" of the invention refers to the sequence alignment of the genome Contig; and (3) assembling the sequencing read sequence with a species genome to obtain contig, and then comparing a drug-resistant gene library based on the contig sequence to detect and analyze the drug-resistant gene.
The invention relates to a BGWAS or BGWAS model, which refers to the correlation analysis of a bacterial whole genome, namely the correlation analysis of bacterial genome data and drug-resistant phenotype data is carried out to screen and find important drug-resistant characteristics or drug-resistant genes which are obviously related to a drug-resistant phenotype.
Correspondingly, the "BGWAS model training set" refers to all bacterial strain data used in performing bacterial whole genome association analysis, i.e., the model training set.
With regard to the "BGWAS" or "BGWAS model", see also the applicant's earlier patent CN202111400540.5 for details, the model specifically includes the following modules:
module 1) for obtaining target bacterial strain genome data and collecting corresponding drug sensitivity test result data;
module 2) for performing alignment annotation of a drug-resistant database based on contig sequences of bacterial genomes;
module 3) is used for carrying out genotype and drug-resistant phenotype data correlation analysis aiming at the target drug, screening important characteristic genes related to drug resistance generation and calculating weight coefficients of the important characteristic genes; preferably, the important characteristic gene is a non-core type drug resistance gene.
Module 4), ROC analysis and evaluation of model performance of predicting drug sensitivity results based on the screened important genes.
The ROC analysis is as follows: defining and calculating a Score value based on the matrix of the important gene weight coefficients obtained in the step 3), and calculating the Score value according to the Score valueAs a negative and positive interpretation index, drawing an ROC curve, determining a cut off value, and verifying and evaluating the model performance by using a verification set sample; the above-mentionedWherein arg _ WiThe weight coefficient value indicating the detection of the corresponding gene.
Further, in the step 1), the number of the bacterial strain genomes > is 100, the bacterial strain sources cover various subtypes, and the ratio of the number of the resistant strains to the number of the sensitive strains is balanced; in some preferred forms, the obtaining is from public database searches downloading published target genomic sequences, or by sequencing assembly of bacterial strains identified by currently collected clinical cultures; in some more preferred ways, the searching and downloading from the public database is: the bacterial strain information recorded with the drug susceptibility test results is collected from the NCBI NDARO database and the PATRIC database platform, the phenotype data is collated, and the genome data is downloaded in batches from the NCBI genome database according to the genome assembly id number or from the PATRIC database according to the PATRIC id. Further, the alignment annotations in the step 2) are: comparing the contig sequence with the CARD drug-resistant gene reference sequence library, filtering the bits with low identity and coverage (preferably, filtering the bits with identity less than 90% or reference gene coverage less than 90%), selecting best bit in each contig comparison region as the final comparison result of the contig region, and adding annotation information of the drug-resistant gene. Further, the association analysis in the step 3) adopts a inhaul cable regression model for association analysis. Further, the method for analyzing the relevance of the cable regression model in the step 3) specifically comprises the following steps: taking a gene detection distribution matrix and an antibiotic susceptibility test result data matrix as input, performing correlation analysis on genotype and drug-resistant phenotype data by using a glmnet program package, performing k (preferably k is 5-15) repeated cross validation, screening to obtain important characteristic genes related to the drug-resistant phenotype, and calculating weight coefficients of the important characteristic genes; further, the important characteristic genes are specifically: and selecting the number genes corresponding to the position where the CV error rate is lowest and the model AUC value is relatively stable at the moment as the important characteristic genes according to the model CV error rate and the AUC change curve under different number of characteristic genes. Further, step 3) may further include a manual recall, where the manual recall is: the genes having a higher PPV than the drug-resistant phenotype (preferably, PPV > -0.8) are manually recalled, and the weight coefficients of the recalled genes are calculated based on the weight coefficient values of the important genes obtained above. Further, the bacteria described herein include, but are not limited to, Escherichia coli, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter cloacae complex, Staphylococcus aureus, enterococcus faecium, enterococcus faecalis, Streptococcus pneumoniae, Streptococcus pyogenes, Haemophilus influenzae, Staphylococcus epidermidis; preferred is klebsiella pneumoniae. Further, the drug-resistant phenotypes described herein include, but are not limited to, carbapenem-, cephalosporin-, penicillin-, β -lactam-antibiotic-inhibitors-, aminoglycosides, sulfonamides, tetracyclines, quinolones, glycopeptides, oxazolidinones, polymyxin-drug-resistant phenotypes; preferably, the drug-resistant phenotype is a carbapenem drug-resistant phenotype.
The application is illustrated below with reference to specific examples.
Example 1 establishment of the method of the invention
Fig. 1 is a technical route diagram of the present invention, and the steps are described as follows:
firstly), constructing a drug-resistant gene detection and phenotype prediction process based on sequencing reads comparison based on screened important genes of BGWAS, and carrying out test verification and determining a positive cutoff value through simulation data.
1.1 combining the classification information of the drug resistance genes of the CARD drug resistance database (V3.1.0), rearranging and correcting annotation information such as family to which the drug resistance genes belong, possible source species, mediation mode, consistency between gene sequences and the like. The gene family definition is defined according to the family information recorded by a drug resistance database (CARD), and the combing and the correction are carried out by referring to the family information recorded by an NCBI NDARO database and a MEGARes database. Such as: the OXA family may be subdivided into families such as OXA-48family, OXA-51 family, etc., and the contribution of different subfamilies to the determination of resistance may vary depending on the antibiotic drug, thus requiring the definition of OXA specific typing genes to the level of the subfamily, such as OXA-181 and OXA-232 for OXA-48 family. Secondly, all drug-resistant reference genes are compared with the NCBI NT library, the hit with the identity > of 95% and the subject coverage > of 95% is reserved to obtain all species annotation information of each drug-resistant gene, and a keyword 'plasmid' is searched in the reference sequence description information on the comparison to inquire whether the drug-resistant genes are mediated by plasmids. Finally, pairwise comparison is carried out on all drug-resistant gene sequences by using blastn software to obtain a consistency value among all the gene sequences.
1.2 weight coefficient calculation of drug resistance gene family. The weight coefficient of the gene family is calculated based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set (see the applicant's prior patent CN202111400540.5 specifically). The calculation formula is as follows:
in the formula, arg _ NiThe number of samples detected in the training set (namely samples used for model training in early-stage BGWAS analysis) of corresponding genes in the target gene family, arg _ WiIs the weight coefficient of the corresponding gene in the family. j indicates that there are j key genes in the family, and j + k indicates the number of all genes in the family.
1.3 simulation of NGS or ONT sequencing reads for detection of drug resistance genes and flow correction
1.3.1Reads data simulation
Based on the training set strain samples used in the previous BGWAS model, the ART _ Illumina software (Version 2.5.8) is used to simulate the genome sequencing short reads sequence (Ilumina SE75) data of the bacterial strain (parameter setting: ss NS50-l 75-f 5-nf 0-rs 1). Nanopore sequencing reads sequences were simulated using ReadSim software (Version 1.6), with parameters set: -rev _ strd on-tech nanopore-read _ mu 3000-read _ dist normal. Given that the genome depth of pathogenic bacteria detected under the data amount of 20Mreads of the routine sequencing of clinical specimens does not exceed 1X generally, the data amount of gradients of 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X and the like are simulated.
1.3.2 sequence alignment and Annotation statistics
Comparing the simulated Illumina reads sequence with a drug-resistant gene database by using blastn (version 2.9.0+) software (parameter setting: evalue 1e-5-outfmt 6), firstly filtering to only reserve hit with the identity higher than 90%, then selecting the highest score hit, namely best hit, of each read sequence as final hit, if the highest socre hit has a plurality of same values (namely multiple comparison), carrying out final annotation on the read sequences by adopting an LCA algorithm on the plurality of hits (namely, for a single read sequence, the genotype cannot be annotated due to multiple comparison, and further, the genotype is annotated to a higher level as gene family level), then counting the detected specific reads number and multiple comparison reads number of the drug-resistant gene in the sample and the specific reads number of the drug-resistant gene family, and calculating the coverage index of each detected gene.
For the simulated nanopore reads sequence data, an alignment to the drug-resistant reference gene bank was performed using minimap2 software (version 2.17), with parameters set: -c-x map-ont-L-secondary no, then filter out hits with identity below 0.7 or with subject coverage below 0.4. And finally performing gene annotation on the reads sequence by adopting the optimal comparison or LCA algorithm, counting to obtain the detected specific reads number and the multiple comparison reads number of the drug-resistant gene in the sample and the reads number of the family to which the drug-resistant gene belongs, and calculating the coverage of the detected gene.
1.3.3 screening filtration of drug-resistant genes with aligned reads sequences
A) Evaluating the influence of sequence consistency among different typing genes in a drug-resistant gene reference library on the accurate detection of the drug-resistant gene typing of the read-based. Drug-resistant genes with different maximum sequence consistencies in a database are picked, short reads sequences are simulated to carry out read-based flow detection analysis, and the number of specific reads and the number of ARG-like reads detected by a target gene (namely the number of reads of the target gene on all comparisons) are counted. Finally, the strategy of identifying genotyping for the presence of specific reads sequences can be selected as a threshold criterion for whether accurate genotyping can be easily achieved by 95% identity (see fig. 2).
B) The identification of drug resistance genotyping adopts a strategy of combining two rules: firstly, aiming at target genes with high similarity (such as consistency exceeding 95%) with other genes in a database, adopting ARG-like reads (namely all reads capable of being compared with the target genes or specific reads plus multiple comparison reads) as true positive detection, and reserving genes with 100% coverage calculated based on accurate comparison reads as true positive detection if the genes are not the first ones; and secondly, aiming at the target genes with low similarity (if the consistency is lower than 95%) with other genes in the database, judging whether the target genes are true positive results or not mainly according to the number of detected specific reads. For some gene families in which specific reads were not detected, they were considered false positives and were directly filtered out.
C) Evaluating the accuracy of the read-based detection for identifying the drug-resistant genotyping. By taking the result of the assembly-based detection of the drug-resistant gene in the BGWAS model training process as reference, the accuracy, sensitivity and specificity indexes of the read-based detection of the drug-resistant gene or gene family are obtained through statistics aiming at the important gene or gene family screened out by the BGWAS model, and as shown in figure 3, the read-based detection process has better performance when detecting most of the drug-resistant gene or gene family.
1.4, defining and calculating a Score value of a negative and positive judgment index, performing ROC analysis by combining simulation tests under different sequencing data quantities, and determining a report rule and a cutoff threshold value.
1.4.1 Score index values are defined and calculated from the results of drug resistance gene detection of samples based on the important genes (and gene families) obtained by screening and their weight coefficients. The calculation formula is as follows:
in the formula, arg _ WiWeight coefficient indicating the corresponding genotype, geneffamine _ WiRepresenting the weight coefficients of the corresponding gene families. When genotype (i.e. gt) is detected and the genotype weight coefficient>0, calculating by using a genotype weight coefficient; when a genotype is detected but the genotype weighting factor is 0 or no weighting factor, it is calculated as a gene family weighting factor (0 is noted when no weighting factor is given to a gene family).
Specifically, according to the rule of fig. 4, Score calculation is performed based on the read-based drug resistance gene detection typing results and the gene weight coefficient matrix, and meanwhile, the positive coincidence rate (PPV) factor of each gene and the drug sensitivity results in the BGWAS model training set is considered. (in FIG. 4,: gt represents ARG type, i.e., drug resistance genotyping, and gf represents ARG family, i.e., drug resistance gene family).
1.4.2 aiming at the simulation test under different data volumes based on the training set strains, carrying out ROC analysis based on the drug resistance gene detection result and the actual drug sensitivity result of the training set strains, evaluating the drug sensitivity prediction model performance (AUC value) under different sequencing data volumes, and determining the cutoff threshold.
Specifically, thresholds were set for reporting "drug resistance" and "sensitivity" respectively, in view of model performance and sequencing data volume influencing factors. The threshold for reporting "drug resistance" is set as R _ cutoff when the amount of data for genome sequencing of the target strain is sufficient and the model performance is stable (e.g., 30X), and the Score value corresponding to the maximum value of Youden index is the threshold. Reporting the threshold setting of "sensitive", two sets of threshold criteria are used. Firstly, the R _ cutoff determined under the condition that the sequencing data amount is enough (30X data amount, model is stable) is taken as a threshold value for reporting sensitivity and is marked as S _ cutoff2, and the NPV is required to be satisfied at the moment and is more than 0.9, otherwise, the sensitivity cannot be reported. The second is the minimum sequencing data volume (marked as gf _ LOD1) corresponding to the stable model performance found by directly calculating the Score value based on the gene family weight coefficient, and the maximum Score value (the Score value must be less than or equal to S _ cutoff2) meeting the NPV exceeding 0.9 is taken as the report 'sensitive' threshold value under the data volume and is marked as S _ cutoff 1. Second, for each simulated sequencing data volume between gf _ LOD1 and 30X, the feasibility of reporting a "sensitive" threshold criterion with S _ cutoff2 was examined, i.e., finding the minimum data volume that satisfies the NPV >0.9 correspondence, denoted gf _ LOD 2. Then, finally, two sets of threshold criteria for reporting "sensitivity" can be determined, i.e., when the amount of sequencing data is above gf _ LOD2, S _ cutoff2 is used as the threshold for reporting "sensitivity", and when the amount of sequencing data is between gf _ LOD1 and gf _ LOD2, S _ cutoff1 is used as the threshold for reporting "sensitivity".
The specific report rule for predicting drug sensitivity of a certain antibiotic is as follows: when the Score value is greater than R _ cutoff, it is reported as "potential drug resistance"; "sensitivity is reported when the genomic coverage of the detected pathogen is greater than or equal to gf _ LOD2 and the Score value is less than S _ cutoff2, or" sensitivity is reported when the genomic coverage of the detected pathogen is between gf _ LOD1 and gf _ LOD2 and the Score value is less than S _ cutoff 1; when the genomic coverage of the detected pathogen was less than gf _ LOD1 and Score was less than S _ cutoff1, then it was reported as "/", i.e. unknown.
II) prediction of drug resistance gene species affiliation
2.1 detection of target pathogenic bacteria species and prediction of species affiliation of drug-resistant genes
2.1.1 comparison of genome sequence of target pathogenic bacteria and calculation of detected sequence number, genome coverage and coverage depth
Clinical common pathogenic bacteria (Klebsiella pneumoniae, Escherichia coli, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter cloacae and the like) are selected as target pathogens, and a target pathogen reference genome is searched and downloaded from an NCBI genome database and is used as a reference sequence library for identifying target pathogen strains.
The Illumina reads sequence is aligned with the target pathogen reference genome sequence library set above (alignment parameters: -x sr-a-second ═ no-L) by using minimap2 software (v2.17), and then the detected sequence number, genome coverage and coverage depth of the target pathogen species on the alignment are calculated. For alignment of nanopore sequencing reads, parameters were set to-x map-ont-a-second no-L.
Then, the total reads sequence number, the genome coverage and the coverage depth of the detected pathogenic bacteria are obtained through statistics.
2.1.2 based on the BGWAS model training set specimen drug-resistant gene detection result, counting the copy number of the drug-resistant gene carried by the target pathogenic bacteria
And counting to obtain the detection distribution and copy variation range of the drug resistance genes and the drug resistance gene families of the target pathogenic strains based on the detection results of the assembly-based drug resistance genes of the training set samples during the BGWAS model training.
2.1.3 drug resistance Gene copy number calculation and species affiliation judgment based on the presumed Gene-species affiliation relationship
When assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode annotation of the reference gene in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads. Then, the copy number of the drug-resistant gene is calculated according to the following formula:
and if the calculated copy number of the drug-resistant gene falls within the normal copy number range of the target gene family obtained based on the statistics of the BGWAS model training set, accepting the assumed gene-species attribution relationship, and otherwise, rejecting the assumed gene-species attribution relationship.
Thirdly) detecting and verifying drug resistance gene by metagenome sequencing of clinical specimen
Collecting clinical samples which are cultured and tested for drug sensitivity at the same time, transferring the samples to a medical examination laboratory according to requirements for pretreatment, nucleic acid extraction, library establishment and on-machine sequencing (Illumina CN500 SE75 or nanopore sequencing), and then performing comparison analysis on target species genome and drug-resistant gene database on reads sequences obtained by sequencing to identify pathogenic bacteria and drug-resistant gene conditions carried by the pathogenic bacteria in the samples.
3.1 quality control filtration and removal of human-derived nucleic acid sequences from sequenced original low-quality sequences
Sequence data measured for the Illumina platform were processed as follows:
a. processing the sequencing off-machine data by using BCL2fastq (v2.20.0.422), and converting the BCL format data into fastq format sequence data;
b. filtering the obtained original fastq sequence data by using fastp (v0.19.5) software (parameter setting: q 15-u 40-l read _ length 0.67), and removing low-quality and short sequences; meanwhile, komplexy (v0.3.6) software is used for calculating the complexity of sequence information (parameter setting: -F-t 0.4), and low-complexity sequences are filtered.
c. The clean sequence obtained by quality control filtration is aligned with the human reference genome sequence (human _38) by using bowtie2(v2.3.4.3) software (parameter setting: mm-very-sensitive-k 1) to filter out human sequences.
3.2 detection and identification of drug-resistant genes contained in the sample and species affiliation thereof
Drug-resistant gene alignment and annotation statistics of sample sequences are carried out by using blastn software according to the step 1.3.2, target species genome alignment is carried out by using minimap2 software according to the step 2.1.1, detected genome coverage is calculated, and then species attribution of the drug-resistant genes which are predicted to be detected is evaluated according to the step 2.1.3.
3.3 for the pathogenic bacteria concerned and detected, according to the detection of the drug resistance genes, calculating and obtaining the score value of the target species-antibiotic drug according to the score calculation mode defined by 1.4.1, and comparing with the cutoff value: when score > cutoff, then it is predicted as R; score < cutoff, S is predicted if the detected pathogen genomic coverage is higher than the minimum genomic coverage or data volume (e.g. > 40%) required for model stability determined based on the 1.4.2 step, otherwise is reported as "/" (indicating unknown)
And 3.4, finally, comparing the drug sensitivity prediction result with the actual drug sensitivity test result of the clinical specimen collected at the same time, counting the accuracy of the prediction result and the sample number proportion reported effectively, and evaluating the performance of the drug resistance detection process.
Example 2 detection of drug-resistant genes and phenotypic prediction analysis of Klebsiella pneumoniae against clinical specimens
1. Based on Klebsiella pneumoniae genome, BGWAS screens to obtain antibiotic-related important drug resistance genes and corresponding gene families, and calculates to obtain weight coefficients of the genes or the gene families
From NCBI NDARO database and PATRIC database, collecting and downloading Klebsiella pneumoniae strain genome data with drug sensitivity test result information, screening and filtering to obtain 3072 (training set, verification set are 2410 and 662) strain samples, screening by using a machine learning method to obtain important genes related to antibiotic drug resistance and a weight coefficient matrix thereof, and calculating the weight coefficients of families to which the important genes belong according to a formula of technical scheme 1.2, wherein the results are as follows:
2. based on 2410 cases of Klebsiella pneumoniae genome in the training set, 75bp short reads (NGS sequencing platform) are simulated according to the steps of the technical scheme 1.3 to carry out test verification of the read-based drug resistance gene detection process, and different gradient data volumes of 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X and the like are simulated. And then carrying out drug resistance gene detection to obtain a drug resistance gene detection result of each simulated sample, calculating a score value according to the steps of the technical scheme 1.4.1, carrying out ROC curve analysis to obtain model performance AUC values of each antibiotic drug under different data quantities, and then drawing a model performance AUC value change curve as shown in a figure 5. At data volume 30X, the model performance has stabilized, and the AUC values for each antibiotic model performance at this time are shown in fig. 6 below. Finally, according to the steps of the technical scheme 1.4.2, the report rule and cutoff threshold value of each antibiotic drug are determined as shown in the following table.
The cutoff threshold is predicted by the mNGS susceptibility of 6 antibiotic drugs of Klebsiella pneumoniae.
Note: "/" indicates that "sensitivity" cannot be reported.
3. A total of 48 cases of culture and identification of clinical specimens containing Klebsiella pneumoniae were collected and stored in a freezer at-80 ℃. Then, after nucleic acid extraction, the 48 specimens were used to construct a macro-gene secondary (insert length 200-400bp) library, which was subjected to secondary (Illumina nextsseq CN500 SE75) machine sequencing. And identifying the pathogen and the drug resistance gene carried by the pathogen according to off-machine data, and then calculating a Score value and predicting and judging a drug sensitivity result. Finally obtaining the target pathogenic bacteria of each specimen and the detection and identification result and the drug sensitivity prediction result of the drug resistance gene thereof. Some of the sample results are shown in the following table:
description of the drawings: ND means not detected, "/" means unknown.
The statistics shows that the accuracy of drug sensitivity prediction and the ratio of reportable samples are shown in the following table:
the result shows that the invention can effectively and accurately identify the pathogenic bacteria and the drug-resistant gene carried by the pathogenic bacteria aiming at the clinical specimen and predict the drug sensitivity result of the antibiotic drug, and can be used for assisting the clinical detection and diagnosis of the infection drug-resistant bacteria.
Further, as can be seen from the above results, for gentamicin, AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33, ANT (2 ") -Ia are important drug resistance genes, and by combining the gene weights, gene family weights, gene occurrence frequencies, and all possible mechanisms of drug resistance generation, in practice, the genes with higher weights can be simultaneously detected for AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armma, rmtF, rmtB, AAC (6') -33, and ANT (2") -Ia, which mainly mediate high frequency occurrence of drug resistance generation, and if the detection results are negative, drug sensitivity can be presumed, and especially, drug sensitivity can be achieved.
Further, as can be seen from the above results, Tobramycin, AAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 are important drug resistance genes, and the combination of gene weights, gene family weights, frequency of gene generation and all possible mechanisms of drug resistance generation can be practically detected simultaneously by using AAAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-Hangzhou, AAC (6') -4, mphE, ANT (2) -Ia and ada24 genes which have high frequency and high weight for mainly mediating drug resistance generation, if the detection results are negative, the sensitivity can be presumed, and the purpose of drug sensitivity, especially sensitive detection, is realized.
Further, from the above results, it can be seen that ceftazidime, CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3, and KPC-2 are important drug resistance genes, and the high-frequency generation and high-weight CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, and KPC-3 are practically induced by the high-frequency generation mainly mediating drug resistance and all possible mechanisms of drug resistance, And (3) detecting SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and/or KPC-2 genes, and if the detection result is positive, presuming drug resistance. Meanwhile, as can be seen from the results of the 30X genome simulation reads test, it is not possible to find a suitable threshold value, so that when the calculated Score value is < the threshold value, the model NPV is high (e.g., >0.9), and thus the purpose of sensitive detection cannot be achieved.
Further, from the above results, it is found that, in the case of the compound sulfamethoxazole, dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27, and dfrA1 are important drug resistance genes, and in practice, the resistance genes can be detected as positive if all of the genes are detected at the same time by combining the gene weights, gene family weights, gene occurrence frequencies, and all possible mechanisms of resistance generation, such as dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfa 15b, dfa 14, dfr22, dfrA27, and/or dfrA1, which are high-weighted for mainly mediating the generation of high frequency of resistance generation. Thus realizing the purpose of drug sensitivity, especially sensitive detection. Meanwhile, as can be seen from the results of the 30X genome simulation reads test, it is not possible to find a suitable threshold value, so that when the calculated Score value is < the threshold value, the model NPV is high (e.g., >0.9), and thus the purpose of sensitive detection cannot be achieved.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (14)
1. A model construction method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing reads comparison is characterized by comprising the following steps:
step 1): combining the classification information of drug resistance genes of a drug resistance database, sorting and correcting drug resistance gene family information and consistency annotation information among gene sequences, preferably, sorting and correcting source species information and/or a mediation mode;
step 2): calculating a drug-resistant gene family weight coefficient, namely calculating the gene family weight coefficient based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set;
the weight coefficient defining formula is as follows:
in the formula, arg _ Ni is the number of samples detected by corresponding genes in a target gene family in a BGWAS model training set, arg _ Wi is a weight coefficient of the corresponding genes in the family, j represents j key genes in the family, and j + k represents the number of all genes in the family;
step 3): detecting drug-resistant genes and correcting the process;
preferably, the sequencing reads are NGS or Nanopore sequencing reads; more preferred are NGS or Nanopore metagenomic sequencing reads.
2. The method of claim 1,
in the step 1) described above, the step of,
the drug resistance database is a CARD drug resistance database;
the gene family is defined according to the family information of the drug resistance gene recorded by the drug resistance database, and the information of the family of the drug resistance gene recorded by the NCBI NDARO database and the MEGARes database is referred to for combing and correcting;
comparing all the drug-resistant reference genes with an NCBI NT library, and reserving hit with the identity > of 95% and the subject coverage > of 95% to obtain all species annotation information of each drug-resistant gene;
the mediation mode is to inquire whether the drug resistance gene is mediated by the plasmid in the reference sequence description information on the comparison;
the information of the consistency annotation among the gene sequences is that all the drug-resistant gene sequences are compared pairwise to obtain the consistency value among all the gene sequences.
3. The method of any one of claims 1 to 2,
the detection and process correction of the drug resistance gene in the step 3) comprises the following steps:
a) sequencing Reads data simulation;
b) sequence alignment and annotation statistics;
c) screening and filtering drug-resistant genes.
4. The method of claim 3,
the a) sequencing Reads data simulation is based on BGWAS model training set strain samples for data simulation;
preferably, the first and second liquid crystal materials are,
simulating bacterial strain genome NGS sequencing short reads sequence data by ART _ Illumina software;
bacterial strain genomic Nanopore sequencing reads sequence data was simulated using ReadSim software.
5. The method of any one of claims 3 to 4,
the b) sequence alignment and annotation statistics comprise:
comparing the simulated reads sequence with a drug-resistant gene library, and filtering low-quality hit; performing final gene annotation, counting to obtain the detected specific reads number, multiple comparison reads number and the reads number of the family to which the drug-resistant gene belongs in the sample, and calculating the coverage of the detected gene;
preferably, the final gene annotation of the reads sequence using the optimal alignment and LCA algorithm is: selecting the hit, namely best hit, of the highest score of each read sequence as final hit, and if the best hit has a plurality of same values, namely multiple alignments, performing final annotation on the read sequence by adopting an LCA algorithm on the plurality of hits, namely for a single read sequence, annotating the genotype to a higher level as a gene family level due to the multiple alignments;
more preferably, it is a mixture of more preferably,
aiming at NGS sequencing data, comparing a simulated reads sequence with a drug-resistant gene library by using blastn software, and filtering to reserve hit with the identity higher than 90%;
alignment to the drug-resistant reference gene library was performed using minimap2 software for Nanopore sequencing data, filtering out hits with identity below 0.7 or subject coverage below 0.4.
6. The method of any one of claims 3 to 5,
the c) screening and filtering of the drug-resistant gene are as follows: screening and filtering the drug-resistant genes of the upper reads sequence compared in the step b);
preferably, any one or more of the following screening filter criteria are included:
A) evaluating the influence of sequence consistency among different typing genes in a drug-resistant gene reference library on the accurate detection of the drug-resistant gene typing by read-based: drug-resistant genes with different maximum sequence consistencies in a database are picked, short reads sequences are simulated to carry out read-based flow detection analysis, and the number of specific reads detected by a target gene and the number of reads of all the compared target genes are counted; aiming at the strategy of detecting and identifying the genotyping according to whether a specific reads sequence exists, 95% identity is selected as a threshold standard for easily realizing accurate genotyping of a target gene;
B) screening and filtering based on drug resistance genotyping identification: aiming at target genes with high similarity with other genes in the database, adopting all rows of first bits of reads capable of comparing the target genes as true positive detection, and calculating non-first bits of reserved genes based on accurate comparison reads to obtain genes with 100% coverage as true positive detection; secondly, aiming at the target genes with low similarity with other genes in the database, judging whether the target genes are true positive results or not according to the detected number of the specific reads; direct filtration for some gene families where no specific reads were detected;
C) evaluating the accuracy of the read-based detection for identifying the drug-resistant genotyping: and taking the result of the assembly-based detection of the drug-resistant gene in the BGWAS model training process as reference, and counting the important gene or gene family screened out by the BGWAS model to obtain the accuracy, sensitivity and specificity indexes of the read-based detection of the drug-resistant gene or gene family.
7. The method of any one of claims 1 to 6, further comprising the steps of:
step 4): defining and calculating a Score value of a negative and positive judgment index, and determining a report rule and a cutoff threshold value based on ROC analysis;
the Score values were calculated as follows:
wherein arg _ Wi represents a weight coefficient of a corresponding genotype, and genefamly _ Wi represents a weight coefficient of a corresponding gene family; when the genotype is detected and the genotype weight coefficient is greater than 0, calculating by the genotype weight coefficient; when the genotype is detected, but the genotype weighting factor is 0 or no weighting factor, the gene family weighting factor is calculated.
8. A model construction method for drug-resistant gene species attribution prediction is characterized by comprising the following steps:
step 1): comparing the target pathogenic bacteria genome sequence and calculating the detected sequence number, genome coverage and coverage depth;
step 2): counting the copy number of the drug-resistant gene carried by the target pathogenic strain based on the detection result of the drug-resistant gene of the BGWAS model training set sample;
step 3): calculating the copy number of the drug-resistant gene and judging the species affiliation based on the assumed gene-species affiliation relationship;
preferably, the step 1) is specifically
Selecting clinically common pathogenic bacteria as target pathogens, searching and downloading a target pathogen reference genome from an NCBI genome database, and taking the target pathogen reference genome as a reference sequence library for identification of target pathogen strains;
comparing each sequencing reads sequence with the reference sequence library, and calculating the detected sequence number, genome coverage and coverage depth of the target pathogenic bacteria in comparison; and (4) counting to obtain the total reads sequence number, the genome coverage and the coverage depth of the detected pathogenic bacteria.
Preferably, the step 2) is specifically:
and counting to obtain the detection distribution and copy variation range of the drug resistance genes and the drug resistance gene families of the target pathogenic strains based on the detection results of the assembly-based drug resistance genes of the training set samples during the BGWAS model training.
Preferably, the step 3) is specifically:
when assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode annotation of the reference gene in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads, and the copy number of the drug-resistant gene is calculated according to the following formula:
and if the calculated copy number of the drug-resistant gene falls within the normal copy number range of the target gene family obtained based on the statistics of the BGWAS model training set, accepting the assumed gene-species attribution relationship, and otherwise, rejecting the assumed gene-species attribution relationship.
9. A method for drug resistance detection of metagenomic sequencing data is characterized by comprising the following steps:
1) performing quality control and human source nucleic acid sequence removal on sample sequencing data;
2) detecting and identifying the drug resistance gene contained in the sample: carrying out drug-resistant gene alignment and annotation statistics on sample sequences based on the method of any one of claims 1-6, and detecting and identifying drug-resistant genes contained in the samples;
3) predicting the species attribution of the detected drug-resistant genes in the sample: identifying the target pathogenic bacteria contained in the sample according to the method of claim 8, and predicting the species assignment of the detected drug resistance gene;
4) calculating the score value of the target species-antibiotic drug according to the score calculation mode in claim 7 according to the detected drug resistance gene carrying condition of the target pathogenic bacteria, and comparing the score value with the cutoff value: when score > -cutoff, then predict as R; score < cutoff, S is predicted if the detected pathogen genome coverage is higher than the minimum genome coverage or data volume required for model stability, otherwise is reported as unknown.
10. An apparatus, comprising: at least one memory for storing a program; at least one processor configured to load the program to perform the method of any of claims 1-9.
11. The application of detection reagents aiming at non-core type drug-resistant genes AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33 and ANT (2') -Ia in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to a gentamicin drug;
further preferably, the genes AAC (3) -IIe, AAC (3) -IV, AAC (3) -IId, rmtC, armA, rmtF, rmtB, AAC (6') -33 and ANT (2') -Ia which are high in weight and mainly mediate the generation of drug resistance are simultaneously detected, and if the detection results are negative, sensitivity is presumed.
12. The application of detection reagents aiming at non-core type drug resistance genes AAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;
more preferably, the drug sensitivity is to tobramycin drugs;
further preferably, the genes AAAC (3) -IV, AAC (3) -IId, AAC (6') -Ib', AAC (6') -Ib-cr, AAC (6') -Ib-Hangzhou, AAC (6') -Ib4, mphE, ANT (2') -Ia and aadA24 which have high frequency occurrence and high weight and mainly mediate the generation of drug resistance are simultaneously detected, and if the detection results are negative, the sensitivity is presumed.
13. The application of the detection reagent for the non-core type drug resistance genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 in the preparation of the Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction is a drug resistance prediction;
preferably, the drug sensitivity is directed to a ceftazidime drug.
More preferably, the drug resistance is estimated by detecting the CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 genes which mainly mediate the high frequency occurrence of the drug resistance and have higher weight.
14. The application of detection reagents aiming at non-core type drug resistance genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;
the drug susceptibility prediction is a drug resistance prediction;
preferably, the drug sensitivity is to a compound sulfamethoxazole drug;
more preferably, the drug resistance is estimated by simultaneously detecting dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and/or dfrA1 genes, which are high in weight and mainly mediate high frequency occurrence of drug resistance.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310625499.4A CN116631500A (en) | 2021-12-30 | 2021-12-30 | Non-core drug-resistant gene |
CN202111680866.8A CN114333987B (en) | 2021-12-30 | 2021-12-30 | Data analysis method for predicting drug resistance phenotype based on metagenomic sequencing |
CN202310625501.8A CN116631501A (en) | 2021-12-30 | 2021-12-30 | Model construction method for drug-resistant gene species attribution prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111680866.8A CN114333987B (en) | 2021-12-30 | 2021-12-30 | Data analysis method for predicting drug resistance phenotype based on metagenomic sequencing |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310625501.8A Division CN116631501A (en) | 2021-12-30 | 2021-12-30 | Model construction method for drug-resistant gene species attribution prediction |
CN202310625499.4A Division CN116631500A (en) | 2021-12-30 | 2021-12-30 | Non-core drug-resistant gene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114333987A true CN114333987A (en) | 2022-04-12 |
CN114333987B CN114333987B (en) | 2023-05-12 |
Family
ID=81022835
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310625499.4A Pending CN116631500A (en) | 2021-12-30 | 2021-12-30 | Non-core drug-resistant gene |
CN202111680866.8A Active CN114333987B (en) | 2021-12-30 | 2021-12-30 | Data analysis method for predicting drug resistance phenotype based on metagenomic sequencing |
CN202310625501.8A Pending CN116631501A (en) | 2021-12-30 | 2021-12-30 | Model construction method for drug-resistant gene species attribution prediction |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310625499.4A Pending CN116631500A (en) | 2021-12-30 | 2021-12-30 | Non-core drug-resistant gene |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310625501.8A Pending CN116631501A (en) | 2021-12-30 | 2021-12-30 | Model construction method for drug-resistant gene species attribution prediction |
Country Status (1)
Country | Link |
---|---|
CN (3) | CN116631500A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115323067A (en) * | 2022-09-19 | 2022-11-11 | 北京金匙医学检验实验室有限公司 | Characteristic gene combination, kit and sequencing method for predicting antibiotic drug sensitive phenotype of enterobacter cloacae |
CN115798575A (en) * | 2023-02-06 | 2023-03-14 | 中国医学科学院北京协和医院 | System and method for predicting sensitivity of Klebsiella to ceftazidime |
CN115910216A (en) * | 2022-12-01 | 2023-04-04 | 杭州瑞普基因科技有限公司 | Method and system for identifying genome sequence classification errors based on machine learning |
CN116597893A (en) * | 2023-06-14 | 2023-08-15 | 北京金匙医学检验实验室有限公司 | Method for predicting drug resistance gene-pathogenic microorganism attribution |
CN116825182A (en) * | 2023-06-14 | 2023-09-29 | 北京金匙医学检验实验室有限公司 | Method for screening bacterial drug resistance characteristics based on genome ORFs and application |
CN118098369A (en) * | 2024-03-26 | 2024-05-28 | 杭州洛兮医学检验实验室有限公司 | Method for analyzing drug-resistant phenotype of pathogenic microorganism |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117612747B (en) * | 2024-01-24 | 2024-05-03 | 杭州广科安德生物科技有限公司 | Drug sensitivity prediction method and device for klebsiella pneumoniae |
CN118212987B (en) * | 2024-05-21 | 2024-08-20 | 中国医学科学院北京协和医院 | Gene data processing method and device, storage medium and electronic equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160085912A1 (en) * | 2014-05-27 | 2016-03-24 | Opgen, Inc. | Systems, apparatus, and methods for generating and analyzing resistome profiles |
CN106886689A (en) * | 2015-12-15 | 2017-06-23 | 浙江大学 | A kind of pathogenic microorganism genome rapid analysis method and system |
US20200273576A1 (en) * | 2019-02-26 | 2020-08-27 | Tempus | Systems and methods for using sequencing data for pathogen detection |
US20200303078A1 (en) * | 2019-03-22 | 2020-09-24 | Inflammatix, Inc. | Systems and Methods for Deriving and Optimizing Classifiers from Multiple Datasets |
US20210193262A1 (en) * | 2019-12-24 | 2021-06-24 | Koninklijke Philips N.V. | System and method for predicting antimicrobial phenotypes using accessory genomes |
CN113035269A (en) * | 2021-04-16 | 2021-06-25 | 北京计算科学研究中心 | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology |
WO2021214071A1 (en) * | 2020-04-20 | 2021-10-28 | Nec Oncoimmunity As | Method and system for identifying one or more candidate regions of one or more source proteins that are predicted to instigate an immunogenic response, and method for creating a vaccine |
-
2021
- 2021-12-30 CN CN202310625499.4A patent/CN116631500A/en active Pending
- 2021-12-30 CN CN202111680866.8A patent/CN114333987B/en active Active
- 2021-12-30 CN CN202310625501.8A patent/CN116631501A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160085912A1 (en) * | 2014-05-27 | 2016-03-24 | Opgen, Inc. | Systems, apparatus, and methods for generating and analyzing resistome profiles |
CN106886689A (en) * | 2015-12-15 | 2017-06-23 | 浙江大学 | A kind of pathogenic microorganism genome rapid analysis method and system |
US20200273576A1 (en) * | 2019-02-26 | 2020-08-27 | Tempus | Systems and methods for using sequencing data for pathogen detection |
US20200303078A1 (en) * | 2019-03-22 | 2020-09-24 | Inflammatix, Inc. | Systems and Methods for Deriving and Optimizing Classifiers from Multiple Datasets |
US20210193262A1 (en) * | 2019-12-24 | 2021-06-24 | Koninklijke Philips N.V. | System and method for predicting antimicrobial phenotypes using accessory genomes |
WO2021214071A1 (en) * | 2020-04-20 | 2021-10-28 | Nec Oncoimmunity As | Method and system for identifying one or more candidate regions of one or more source proteins that are predicted to instigate an immunogenic response, and method for creating a vaccine |
CN113035269A (en) * | 2021-04-16 | 2021-06-25 | 北京计算科学研究中心 | Genome metabolism model construction, optimization and visualization method based on high-throughput sequencing technology |
Non-Patent Citations (1)
Title |
---|
陈伟 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115323067A (en) * | 2022-09-19 | 2022-11-11 | 北京金匙医学检验实验室有限公司 | Characteristic gene combination, kit and sequencing method for predicting antibiotic drug sensitive phenotype of enterobacter cloacae |
CN115323067B (en) * | 2022-09-19 | 2023-11-17 | 北京金匙医学检验实验室有限公司 | Characteristic gene combination, kit and sequencing method for predicting antibiotic drug sensitivity phenotype of enterobacter cloacae |
CN115910216A (en) * | 2022-12-01 | 2023-04-04 | 杭州瑞普基因科技有限公司 | Method and system for identifying genome sequence classification errors based on machine learning |
CN115910216B (en) * | 2022-12-01 | 2023-07-25 | 杭州瑞普基因科技有限公司 | Method and system for identifying genome sequence classification errors based on machine learning |
CN115798575A (en) * | 2023-02-06 | 2023-03-14 | 中国医学科学院北京协和医院 | System and method for predicting sensitivity of Klebsiella to ceftazidime |
CN116597893A (en) * | 2023-06-14 | 2023-08-15 | 北京金匙医学检验实验室有限公司 | Method for predicting drug resistance gene-pathogenic microorganism attribution |
CN116825182A (en) * | 2023-06-14 | 2023-09-29 | 北京金匙医学检验实验室有限公司 | Method for screening bacterial drug resistance characteristics based on genome ORFs and application |
CN116597893B (en) * | 2023-06-14 | 2023-12-15 | 北京金匙医学检验实验室有限公司 | Method for predicting drug resistance gene-pathogenic microorganism attribution |
CN116825182B (en) * | 2023-06-14 | 2024-02-06 | 北京金匙医学检验实验室有限公司 | Method for screening bacterial drug resistance characteristics based on genome ORFs and application |
CN118098369A (en) * | 2024-03-26 | 2024-05-28 | 杭州洛兮医学检验实验室有限公司 | Method for analyzing drug-resistant phenotype of pathogenic microorganism |
Also Published As
Publication number | Publication date |
---|---|
CN116631501A (en) | 2023-08-22 |
CN114333987B (en) | 2023-05-12 |
CN116631500A (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114333987A (en) | Metagenome sequencing-based data analysis method for predicting drug resistance phenotype | |
Sangiovanni et al. | From trash to treasure: detecting unexpected contamination in unmapped NGS data | |
Wagner et al. | Mitochondrial DNA mutation analysis from exome sequencing—A more holistic approach in diagnostics of suspected mitochondrial disease | |
Greninger et al. | Metagenomics to assist in the diagnosis of bloodstream infection | |
Heyse et al. | Coculturing bacteria leads to reduced phenotypic heterogeneities | |
Charalampous et al. | Evaluating the potential for respiratory metagenomics to improve treatment of secondary infection and detection of nosocomial transmission on expanded COVID-19 intensive care units | |
Mallawaarachchi et al. | Genomic diagnostics in polycystic kidney disease: an assessment of real-world use of whole-genome sequencing | |
US20230141128A1 (en) | Molecular technology for predicting a phenotypic trait of a bacterium from its genome | |
WO2021227329A1 (en) | Classification unit component computing method for sequencing data | |
Hasan et al. | A metagenomics-based diagnostic approach for central nervous system infections in hospital acute care setting | |
WO2018160899A1 (en) | Systems and methods for metagenomic analysis | |
JP2016518822A (en) | Characterization of biological materials using unassembled sequence information, probabilistic methods, and trait-specific database catalogs | |
Hertzberg et al. | TADA—a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs | |
CN113278706A (en) | Method for distinguishing somatic mutation from germline mutation | |
CN114496089B (en) | Pathogenic microorganism identification method | |
CN115943215A (en) | Systems and methods for analyzing the presence of microorganisms | |
Abdelrazik et al. | Benchmarking of Antimicrobial Resistance Gene Detection Tools in Assembled Bacterial Whole Genomes | |
CN116597893B (en) | Method for predicting drug resistance gene-pathogenic microorganism attribution | |
CN117219157B (en) | Characteristic gene for predicting drug sensitivity phenotype of pseudomonas aeruginosa carbapenem drugs, kit and application | |
KR101853916B1 (en) | Method for determining pathway-specificity of protein domains, and its appication for identifying disease genes | |
Wojciechowski et al. | The correctness of large scale analysis of genomic data | |
Batisti Biffignandi et al. | Optimising machine learning prediction of minimum inhibitory concentrations in Klebsiella pneumoniae | |
CN115418406B (en) | Characteristic gene combination, kit and sequencing method for predicting drug sensitivity phenotype of klebsiella pneumoniae to antibiotics | |
Tournoud et al. | Clinical metagenomics bioinformatics pipeline for the identification of hospital-acquired pneumonia pathogens antibiotic resistance genes from bronchoalveolar lavage samples | |
Ross et al. | Assessing the potential of parentage testing using portable long read sequencing technologies. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |