CN114333987A

CN114333987A - Metagenome sequencing-based data analysis method for predicting drug resistance phenotype

Info

Publication number: CN114333987A
Application number: CN202111680866.8A
Authority: CN
Inventors: 饶冠华; 韩朋; 高建鹏; 陈方媛; 蒋智
Original assignee: Tianjin Jinke Medical Technology Co ltd
Current assignee: Tianjin Jinke Medical Technology Co ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-12
Anticipated expiration: 2041-12-30
Also published as: CN116631501A; CN114333987B; CN116631500A

Abstract

The application relates to the technical field of bioinformatics, in particular to a model construction method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing reads comparison. The invention combines the drug-resistant phenotype related important characteristic genes to construct and realize the direct comparison detection and identification of target pathogenic bacteria and drug-resistant genes carried by the target pathogenic bacteria based on a sequencing read sequence and the prediction of the drug-resistant phenotype.

Description

Metagenome sequencing-based data analysis method for predicting drug resistance phenotype

Technical Field

The application relates to the technical field of bioinformatics, in particular to a method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing read comparison.

Background Art

Accurate detection of infectious pathogens and their drug resistance is the key to guiding precise clinical medication. At present, laboratory tests for drug resistance of infectious pathogenic bacteria are divided into phenotypic tests and genotypic tests. For phenotypic testing, the gold-standard method of microbial culture plus drug susceptibility testing is mainly used clinically. This method provides a strong basis for the diagnosis and treatment of clinical infection, but it has limitations: culture is time-consuming (generally 2-4 days), the positive rate of pathogen culture is low, the culture process is easily affected by various uncertain factors, and some fastidious or rare pathogens cannot be cultured successfully at all. Other detection techniques, such as the Carba NP test, the modified carbapenem inactivation method (including mCIM and eCIM), the carbapenemase inhibitor enhancement test and time-of-flight mass spectrometry, target the carbapenemases commonly seen clinically in Enterobacteriaceae; they are suitable only for detecting carbapenem resistance caused by production of the relevant enzymes and cannot detect resistance caused by other mechanisms (such as efflux pumps). Genotypic testing includes enzyme immunochromatography and gene detection techniques. Enzyme immunochromatography and conventional gene detection techniques (such as the GeneXpert, Verigene and FilmArray detection systems) target specific genes and are rapid and easy to read, but if the gene to be detected differs from the target gene, false negative results readily occur.

With the continuous development of sequencing technology and the reduction of sequencing cost, sequencing-based detection of pathogenic bacteria and their drug resistance genes is becoming widespread. Microbial genome sequencing includes two strategies: whole genome sequencing (WGS) and metagenomic sequencing (mNGS). For whole genome sequencing of bacterial strains, studies have shown that a model (rule-based or machine-learning) built from the genome data of all strains of a single population and the corresponding drug-resistant phenotype data can then be used to predict the drug-resistant phenotype of new strains from their genome data, with good results (accuracy can exceed 95%). However, whole genome sequencing likewise cannot avoid the need for culture enrichment of the pathogen. Pathogen metagenomic high-throughput sequencing (mNGS) is a pathogen detection approach developed over the past decade or so. Unlike microbial culture, mNGS can read the base sequences of all pathogen nucleic acids from a small amount of sample in a single run, without isolating pure cultures of the pathogens, and provides information such as pathogen species. mNGS is rapid, covers a wide range of pathogens and is unbiased, and many application studies have been published in high-level SCI journals. In addition, a number of published articles analyze and discuss the detection of drug resistance genes, indicating that mNGS has potential value for predicting the drug susceptibility of pathogenic bacteria.
However, clinical specimens (bronchoalveolar lavage fluid, blood, cerebrospinal fluid and the like) are known to be heavily contaminated with host nucleic acid (more than 95% of the total nucleic acid extracted from the specimen), and their microbial load is low; at a conventional sequencing volume of about 20M reads, the bacterial genome coverage obtained from clinical specimens is mostly below 1X. At this data volume, the detection performance of mNGS needs further study: whether mNGS can both identify the pathogenic bacteria reliably and detect the drug resistance genes accurately, so as to predict the drug resistance of the pathogenic bacteria.

Disclosure of Invention

In order to solve the above technical problems, the invention combines the important feature genes related to the drug-resistant phenotype, screened and determined by BGWAS, to construct a data analysis method and system that directly aligns NGS or Nanopore sequencing reads to detect and identify target pathogenic bacteria and the drug-resistant genes they carry, and predicts the drug-resistant phenotype. For all strain genomes of the training set used in the earlier BGWAS, the target drug-resistant genes are detected both by alignment of simulated NGS or Nanopore sequencing reads (read-based) and by alignment of genome contig sequences (assembly-based); the read-based detection process is verified and optimized with the assembly-based results as reference, so that read-based genotyping is accurate. A Score is then calculated by a custom formula and used as the interpretation index for predicting antibiotic susceptibility; ROC analysis combined with read simulation tests determines the optimal cutoff threshold and evaluates the accuracy and performance of the prediction model. Finally, the effectiveness of the analysis system is validated using clinical specimens or pure pathogenic bacterial strains isolated by culture.

Specifically, the application provides the following technical scheme:

the application first provides a method for constructing a model that detects and identifies drug-resistant genes and predicts the drug-resistant phenotype based on alignment of gene sequencing reads, which comprises the following steps:

step 1): based on drug resistance gene classification information such as that of the CARD drug resistance database, sorting and correcting drug resistance gene family information and inter-sequence identity annotation information; preferably, also sorting and correcting source species information and/or the mediation mode;

step 2): calculating a drug-resistant gene family weight coefficient, namely calculating the gene family weight coefficient based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set;

step 3): detection of drug resistance genes and process correction.

Further, in the step 1),

the gene family is defined according to the drug resistance gene family information recorded in the drug resistance database, and is sorted and corrected with reference to the drug resistance gene family information recorded in the NCBI NDARO database and the MEGARes database;

all drug-resistant reference genes are aligned against the NCBI NT library, and hits with identity > 95% and subject coverage > 95% are retained, to obtain all species annotation information for each drug-resistant gene;

the mediation mode is determined by checking, in the description of the aligned reference sequence, whether the drug resistance gene is plasmid-mediated;

the inter-sequence identity annotation information is obtained by aligning all drug-resistant gene sequences pairwise to get the identity value between every pair of gene sequences.
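The species annotation step above can be sketched as a simple hit filter. This is a minimal illustration, not the applicant's implementation: the `Hit` record and its field names are hypothetical, and identity and subject coverage are assumed to be percentages already reported by the aligner.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    gene: str           # drug-resistant reference gene used as query
    species: str        # species annotation of the matched NT sequence
    identity: float     # percent identity reported by the aligner (0-100)
    subject_cov: float  # percent subject coverage reported by the aligner (0-100)

def species_annotations(hits, min_identity=95.0, min_subject_cov=95.0):
    """Collect, per gene, the species of every hit passing both thresholds."""
    annotations = {}
    for h in hits:
        # Thresholds follow the text: identity > 95% and subject coverage > 95%.
        if h.identity > min_identity and h.subject_cov > min_subject_cov:
            annotations.setdefault(h.gene, set()).add(h.species)
    return annotations
```

Genes with no hit passing both thresholds are simply absent from the returned dictionary, so a caller can tell "no species annotation" apart from an empty set.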

Further, in step 2), the weight coefficient is defined by the following formula:

in the formula, arg_Ni is the number of samples in the BGWAS model training set in which the corresponding gene of the target gene family was detected, arg_Wi is the weight coefficient of the corresponding gene in the family, j is the number of key genes in the family, and j + k is the number of all genes in the family.
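The formula itself is not reproduced in this text, so its exact form is unknown. Purely as an illustration of how the defined quantities could combine, the sketch below assumes a detection-frequency-weighted mean: the weights of the j key genes, each multiplied by its sample count, divided by the sample counts of all j + k genes. This form is an assumption, not the formula of the application, and the function name is hypothetical.

```python
def gene_family_weight(key_genes, other_gene_counts):
    """ASSUMED form: sum(arg_Ni * arg_Wi over j key genes) / sum(arg_Ni over all j + k genes).

    key_genes: list of (arg_Ni, arg_Wi) pairs for the j key genes.
    other_gene_counts: list of arg_Ni values for the remaining k genes.
    """
    weighted = sum(n * w for n, w in key_genes)
    total = sum(n for n, _ in key_genes) + sum(other_gene_counts)
    # A family never detected in the training set gets weight 0 in this sketch.
    return weighted / total if total else 0.0
```

Under this assumed form, a family dominated by frequently detected, highly weighted genes approaches the weight of those genes, while rarely weighted families are pulled toward 0.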

Further, the step 3) of detecting the drug-resistant gene and correcting the process comprises the following steps:

a) sequencing reads data simulation;

b) sequence alignment and annotation statistics;

c) screening and filtering drug-resistant genes.

Further, a) sequencing reads data simulation means simulating data for the strain samples of the BGWAS model training set; preferably:

NGS short-read sequencing data of the strain genomes are simulated with ART_Illumina software;

Nanopore sequencing read data of the strain genomes are simulated with ReadSim software;

more preferably, a gradient of data volumes is simulated: 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X.
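To give a sense of scale for this gradient, the number of reads needed for a given mean depth follows from depth = reads x read length / genome size. The sketch below only does this arithmetic; it does not call any simulator, and the genome size and read length in the usage note are example values, not figures from the application.

```python
import math

# Depth gradient listed in the text (in X).
DEPTHS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 5, 10, 30]

def reads_for_depth(depth, genome_size, read_length):
    """Reads needed for a mean depth: depth * genome_size / read_length, rounded up."""
    return math.ceil(depth * genome_size / read_length)

def simulation_plan(genome_size, read_length):
    """Map each depth in the gradient to the number of reads to simulate."""
    return {d: reads_for_depth(d, genome_size, read_length) for d in DEPTHS}
```

For a hypothetical 5 Mb genome and 150 bp reads, `simulation_plan(5_000_000, 150)` maps 1X to 33,334 reads and 30X to 1,000,000 reads, which shows how few reads a sub-1X bacterial genome contributes.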

Further, b) sequence alignment and annotation statistics comprises:

aligning the simulated reads to the drug-resistant gene library and filtering out low-quality hits; performing final gene annotation, counting the number of specific reads, the number of multiply-aligned reads and the number of reads of the family to which each detected drug-resistant gene belongs in the sample, and calculating the coverage of the detected gene;

preferably, the final gene annotation of each read uses the best alignment and the LCA algorithm: the highest-scoring hit of each read (the best hit) is selected as the final hit; if several hits tie for the best score (multiple alignment), the LCA algorithm is applied to those hits for the final annotation, that is, because of the multiple alignment the read is annotated at a higher level, such as the gene family level, rather than to a single genotype;

more preferably,

for NGS sequencing data, the simulated reads are aligned to the drug-resistant gene library using blastn, and only hits with identity above 90% are retained;

for Nanopore sequencing data, the reads are aligned to the drug-resistant reference gene library using minimap2, and hits with identity below 0.7 or subject coverage below 0.4 are filtered out.
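The best-hit plus LCA annotation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each reference gene is given a lineage listed from the broadest level to the gene itself, hits arrive as (gene, score) pairs that have already passed the identity filter, and the function names are hypothetical.

```python
def lca(lineages):
    """Deepest label shared by all lineages (each ordered from broad to specific)."""
    shared = None
    for labels in zip(*lineages):
        if len(set(labels)) != 1:
            break  # lineages diverge at this level
        shared = labels[0]
    return shared

def annotate_read(hits, lineage_of):
    """hits: list of (gene, score) for one read; returns the final annotation label."""
    if not hits:
        return None
    best = max(score for _, score in hits)
    best_genes = sorted({gene for gene, score in hits if score == best})
    if len(best_genes) == 1:
        return best_genes[0]  # unique best hit: a specific read for that gene
    # Tied best hits (multiple alignment): fall back to the lowest common ancestor.
    return lca([lineage_of[gene] for gene in best_genes])
```

With lineages such as betalactams > Class_D_betalactams > OXA family > OXA-48 subfamily > OXA-181, a read that ties between OXA-181 and another member of the OXA-48 subfamily is counted at the subfamily level instead of being forced onto one genotype.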

Further, c) screening and filtering of drug-resistant genes comprises: screening and filtering the drug-resistant genes to which reads were aligned in step b);

preferably, any one or more of the following screening and filtering criteria are included:

A) evaluating how sequence identity between different genotypes in the drug-resistant gene reference library affects accurate read-based genotyping: drug-resistant genes with different maximum sequence identities to other genes in the database are selected, short reads are simulated and run through the read-based detection workflow, and the number of specific reads detected for the target gene and the number of all reads aligned to the target gene are counted; for the strategy of identifying a genotype by the presence of specific reads, 95% identity is selected as the threshold criterion for readily achieving accurate genotyping of a target gene;

B) screening and filtering based on drug resistance genotyping: first, for target genes highly similar to other genes in the database, the gene ranked first by the number of all reads that can be aligned to it is taken as a true positive detection, and among the non-first-ranked genes, those with 100% coverage calculated from exactly aligned reads are retained as true positive detections; second, for target genes with low similarity to other genes in the database, whether the gene is a true positive is judged by the number of specific reads detected; gene families for which no specific reads are detected are filtered out directly;

C) evaluating the accuracy of read-based drug resistance genotyping: taking the assembly-based drug-resistant gene detection results from BGWAS model training as reference, the accuracy, sensitivity and specificity of read-based detection are calculated for the important genes or gene families screened out by the BGWAS model;

further, step 4): defining and calculating the Score, the index for positive/negative interpretation, and determining the reporting rule and cutoff threshold based on ROC analysis;

the Score is calculated as follows:

wherein arg_Wi represents the weight coefficient of the corresponding genotype, and genefamily_Wi represents the weight coefficient of the corresponding gene family; when a genotype is detected and its weight coefficient is greater than 0, the genotype weight coefficient is used; when a genotype is detected but its weight coefficient is 0 or it has no weight coefficient, the gene family weight coefficient is used.
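The Score formula is not reproduced in this text, but the weight selection rule is stated explicitly. The sketch below implements that rule and, as an assumption only, combines the selected weights by summation; the actual aggregation in the application may differ. Function names and the dictionary-based inputs are hypothetical.

```python
def applicable_weight(gene, family, gene_weights, family_weights):
    """Use the genotype weight if it is > 0; otherwise fall back to the family weight."""
    w = gene_weights.get(gene, 0.0)  # a missing genotype weight is treated like 0
    if w > 0:
        return w
    return family_weights.get(family, 0.0)

def score(detected, gene_weights, family_weights):
    """detected: iterable of (gene, family) pairs found in the sample.

    ASSUMPTION: the Score is the sum of the applicable weights.
    """
    return sum(applicable_weight(g, f, gene_weights, family_weights) for g, f in detected)
```

The fallback is what lets an uncertain genotype call still contribute through its family, which matches the stated purpose of the gene family weight.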

Further, the sequencing reads are first-, second- or third-generation sequencing reads, preferably NGS or Nanopore sequencing reads; more preferred are NGS or Nanopore metagenomic sequencing reads.

The invention also provides a method for constructing a model that predicts the species attribution of drug-resistant genes, which comprises the following steps:

step 1): comparing the target pathogenic bacteria genome sequence and calculating the detected sequence number, genome coverage and coverage depth;

step 2): counting the copy number of the drug-resistant gene carried by the target pathogenic strain based on the detection result of the drug-resistant gene of the BGWAS model training set sample;

step 3): and calculating the copy number of the drug-resistant gene and judging the species attribution based on the assumed gene-species attribution relation.

Further, step 1) is specifically:

selecting clinically common pathogenic bacteria as target pathogens, searching for and downloading the target pathogen reference genomes from the NCBI genome database, and using them as the reference sequence library for identifying target pathogen strains;

aligning the sequencing reads to the reference sequence library, and calculating, for each target pathogenic bacterium with aligned reads, the number of detected sequences, the genome coverage and the coverage depth; the statistics give the total number of reads, the genome coverage and the coverage depth of each detected pathogenic bacterium.
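The three statistics named above can be computed from alignment positions alone. The following is a minimal sketch, assuming alignments are given as half-open (start, end) intervals on a single reference sequence, that genome coverage means the covered fraction of the genome, and that coverage depth means aligned bases divided by genome length; these definitions are common ones but are not spelled out in the text.

```python
def coverage_stats(alignments, genome_length):
    """Return (read_count, genome_coverage, mean_depth) for one reference genome.

    alignments: list of (start, end) half-open intervals of aligned reads.
    """
    covered = 0       # number of reference positions covered by at least one read
    total_bases = 0   # sum of aligned lengths, used for mean depth
    cur_start = cur_end = None
    for start, end in sorted(alignments):
        total_bases += end - start
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping read extends the interval
    if cur_end is not None:
        covered += cur_end - cur_start
    return len(alignments), covered / genome_length, total_bases / genome_length
```

Merging sorted intervals keeps the computation linear after sorting and avoids allocating a per-base array for a multi-megabase genome.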

Further, the step 2) is specifically as follows:

based on the assembly-based drug resistance gene detection results of the training set samples during BGWAS model training, the detection distribution and copy number variation range of the drug resistance genes and drug resistance gene families of the target pathogenic strains are obtained by statistics.

Further, the step 3) is specifically:

when hypothesizing a drug resistance gene-species correspondence, the main basis is: a. checking whether the species annotated for the reference gene in the database include the target species, and if so, accepting the hypothesized gene-species attribution; b. if a is not satisfied, checking the mediation mode annotation of the reference gene in the database for plasmid mediation, and if present, accepting the hypothesized gene-species attribution; c. if neither a nor b is satisfied, inferring the species source from the species annotation of the ARG-like reads; the copy number of the drug-resistant gene is calculated according to the following formula:

if the calculated copy number of the drug-resistant gene falls within the normal copy number range of the target gene family obtained from the statistics of the BGWAS model training set, the hypothesized gene-species attribution is accepted; otherwise it is rejected.
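The copy number formula is not reproduced in this text. As an illustration only, the sketch below assumes the common depth-ratio estimate: read density on the gene divided by read density on the genome of the hypothesized source species. That form, the function names and the example range are assumptions; only the accept/reject rule against the training-set range comes from the text.

```python
def estimated_copy_number(gene_reads, gene_length, genome_reads, genome_length):
    """ASSUMED form: (reads per base on the gene) / (reads per base on the genome)."""
    gene_density = gene_reads / gene_length
    genome_density = genome_reads / genome_length
    return gene_density / genome_density

def accept_attribution(copy_number, normal_range):
    """Accept the hypothesized gene-species attribution if the copy number is in range."""
    low, high = normal_range
    return low <= copy_number <= high
```

For example, 4 reads on a 1,000 bp gene against 10,000 reads on a 5 Mb genome gives an estimate of about 2 copies, which would be accepted for a hypothetical normal range of 1 to 3 and rejected for a range of 1 to 1.5.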

The invention also provides a method for detecting drug resistance from metagenomic sequencing data, which comprises the following steps:

1) performing quality control on the sample sequencing data and removing human nucleic acid sequences;

2) detecting and identifying the drug resistance genes contained in the sample: performing drug resistance gene alignment and annotation statistics on the sample sequences according to the detection and identification method described above;

3) predicting the species attribution of the drug resistance genes detected in the sample: identifying the target pathogenic bacteria contained in the sample according to the species attribution prediction method described above, and predicting the species attribution of the detected drug resistance genes;

4) according to the drug resistance genes found to be carried by the target pathogenic bacterium, calculating the Score for the target species-antibiotic pair in the manner described above and comparing it with the cutoff: when Score >= cutoff, the prediction is R (resistant); when Score < cutoff, the prediction is S (susceptible) if the detected pathogen genome coverage is higher than the minimum genome coverage or data volume required for model stability, and otherwise the result is reported as unknown.
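The reporting rule in step 4) reduces to a three-way decision. The sketch below is a minimal illustration, assuming that a Score equal to the cutoff counts as resistant and that stability is checked on genome coverage alone; the function and parameter names are hypothetical.

```python
def predict_phenotype(score, cutoff, genome_coverage, min_coverage):
    """Return 'R', 'S' or 'unknown' for one species-antibiotic pair."""
    if score >= cutoff:
        return "R"  # resistance evidence is reported regardless of coverage
    if genome_coverage > min_coverage:
        return "S"  # enough of the genome was seen to trust the absence of evidence
    return "unknown"  # low coverage: a low Score may only reflect missing data
```

The asymmetry is deliberate in the rule as described: a detected resistance signal is informative at any depth, whereas a susceptible call depends on having sequenced enough of the pathogen genome to have had a chance of seeing the genes.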

The invention also provides a model for detecting and identifying drug-resistant genes based on alignment of gene sequencing reads, which comprises the following modules:

module 1): used for sorting and correcting drug resistance gene family information and inter-sequence identity annotation information based on the drug resistance gene classification information of the CARD drug resistance database, and preferably also sorting and correcting source species information and/or the mediation mode;

module 2): used for calculating the drug-resistant gene family weight coefficient, which is calculated based on the weight coefficients of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set;

module 3): used for detecting drug resistance genes and correcting the process.

The invention also provides a prediction model of drug-resistant gene species attribution, which comprises the following modules:

module 1): used for aligning to the target pathogenic bacteria genome sequences and calculating the number of detected sequences, the genome coverage and the coverage depth;

module 2): used for counting the copy number of the drug-resistant genes carried by a target pathogenic strain based on the drug-resistant gene detection results of the BGWAS model training set samples;

module 3): used for calculating the copy number of the drug-resistant gene and judging the species attribution based on the hypothesized gene-species attribution relationship.

The further definition of each module in the model is the same as that of each step in any one of the above methods.

The present invention also provides an apparatus comprising: at least one memory for storing a program; at least one processor configured to load the program to perform the method of any of the above.

The invention also provides a storage medium having stored therein processor-executable instructions for implementing a method as described in any one of the above when executed by a processor.

The invention also provides the following:

the application of the genes AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2')-Ia as non-core drug-resistant genes in auxiliary drug susceptibility prediction of Klebsiella pneumoniae;

the drug susceptibility prediction comprises drug resistance prediction and sensitivity prediction, preferably sensitivity prediction;

more preferably, the drug susceptibility is susceptibility to gentamicin.

The application of detection reagents for the non-core drug-resistant genes AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2')-Ia in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;

more preferably, the drug susceptibility is susceptibility to gentamicin;

further preferably, the genes AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2')-Ia, which have high weights and mainly mediate drug resistance, are detected simultaneously, and if all detection results are negative, sensitivity is presumed.

The application of the genes AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-cr, AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2')-Ia and aadA24 as non-core drug resistance genes in auxiliary drug susceptibility prediction of Klebsiella pneumoniae;

more preferably, the drug susceptibility is susceptibility to tobramycin.

The application of detection reagents for the non-core drug resistance genes AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-cr, AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2')-Ia and aadA24 in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;

more preferably, the drug susceptibility is susceptibility to tobramycin;

further preferably, the genes AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-cr, AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2')-Ia and aadA24, which occur at high frequency, have high weights and mainly mediate drug resistance, are detected simultaneously, and if all detection results are negative, sensitivity is presumed.

The application of the genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 as non-core type drug resistance genes in auxiliary drug sensitivity prediction of Klebsiella pneumoniae.

more preferably, the drug sensitivity is directed to a ceftazidime drug.

The application of the detection reagent for the non-core type drug resistance genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 in the preparation of the Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;

the drug susceptibility prediction is a drug resistance prediction;

preferably, the drug sensitivity is directed to a ceftazidime drug.

More preferably, the drug resistance is estimated by detecting the CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 genes which mainly mediate the high frequency occurrence of the drug resistance and have higher weight.

Use of genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 as non-core type drug resistance genes in assisted drug susceptibility prediction of klebsiella pneumoniae;

the drug susceptibility prediction is a drug resistance prediction;

preferably, the drug sensitivity is directed to a compound sulfamethoxazole drug.

The application of detection reagents for the non-core drug resistance genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;

the drug susceptibility prediction is a drug resistance prediction;

More preferably, the drug resistance is estimated by simultaneously detecting the higher-weighted genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1, which mainly mediate the high frequency occurrence of drug resistance.

The application has the beneficial technical effects that:

1) the invention relates to a method for detecting drug resistance based on nucleic acid molecules, which bypasses the limitation of traditional culture, directly performs metagenomic sequencing detection on clinical specimens to identify target pathogenic strains and drug resistance gene carrying conditions thereof, and further predicts drug sensitivity results of antibiotic drugs based on the detection characteristics of the drug resistance genes.

2) When detecting and identifying drug-resistant genes, the invention detects them directly by aligning NGS or Nanopore sequencing reads; compared with detection by aligning genome contig sequences, this bypasses the genome assembly step and has higher detection sensitivity. Specifically, a read-based drug resistance gene alignment and detection method and a corresponding database are constructed, and the drug resistance gene detection results of the assembly-based method are used as reference to verify and evaluate the performance of the read-based process, ensuring the accuracy of read-based drug resistance gene detection.

3) For read-based drug resistance gene detection, the invention constructs a dedicated drug resistance gene reference database, in particular organizing the typing and classification of drug resistance genes so that an LCA annotation strategy can be applied to query sequences. Specifically, all gene sequences of the public CARD drug resistance database are collected as reference genes; with reference to the multi-level gene labeling scheme of the MEGARes database and the drug resistance gene family information recorded in the NCBI NDARO database, each reference gene is given a 6-level label, so that an LCA (lowest common ancestor) annotation strategy can be implemented for drug resistance genes, that is, the number of specific reads detected at each level is obtained. Taking the OXA-181 gene as an example, its 6 levels are labeled: OXA-181__1 (L1_geneST), OXA-181 (L2_gene), OXA-48 subfamily (L3_subgroup), OXA family (L4_Group), Class_D_betalactams (L5_Mechanism), betalactams (L6_Class).

4) For accurate read-based drug resistance genotyping, the invention combines two rules, which effectively improves the genotyping capability and accuracy of the detection process. First, for target genes highly similar to other genes in the database (for example, identity above 95%), the gene ranked first by ARG-like reads (all reads that can be aligned to the target gene, i.e. specific reads plus multiply-aligned reads) is taken as a true positive detection, and among non-first-ranked genes, those with 100% coverage calculated from exactly aligned reads are retained as true positive detections. Second, for target genes with low similarity to other genes in the database (identity below 95%), whether the gene is a true positive is judged by the number of specific reads detected. Families for which no specific reads are detected are considered false positives and are filtered out directly.

5) The invention defines a method (or formula) for calculating the weight coefficient of a drug-resistant gene family from the weight coefficients of its drug-resistant genotypes; where genotyping may be inaccurate, the gene family weight is used instead to calculate and predict the drug susceptibility result, effectively avoiding false positives caused by inaccurate genotyping.

6) The invention defines a method rule for predicting the drug sensitivity result, namely a calculation formula of a negative and positive interpretation index Score is defined, and the performance of a prediction model is evaluated and a cutoff threshold value is determined by combining gradient simulation tests of different sequencing data volumes, so that the antibiotic drug sensitivity result can be effectively predicted finally.

7) The invention aims at the metagenome sequencing (mixed flora sequencing) of a clinical specimen, defines a method (or formula) for calculating the copy number of a drug-resistant gene based on the detection sequence conditions of target pathogenic bacteria and the drug-resistant gene, and predicts and estimates the possible pathogenic species source of the drug-resistant gene according to the calculated copy number of the drug-resistant gene. Specifically, when the drug resistance gene-pathogen species correspondence is inferred, the drug resistance gene is assumed to be from a certain target species according to actually detected information of the drug resistance gene and the pathogen species, then the copy number of the drug resistance gene is calculated, whether the calculated copy number falls within a normal range is checked, if the copy number is normal, the attribution relationship is considered to be accepted, and if the copy number is not normal, the attribution relationship is rejected. When assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode information of the reference gene marked in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads.

8) Taking detection of antibiotic resistance in Klebsiella pneumoniae as an example, the method accurately identifies Klebsiella pneumoniae strains and the drug resistance genes they carry, effectively predicts drug susceptibility results for carbapenems (imipenem and meropenem) and aminoglycosides (gentamicin and tobramycin), and predicts resistance to ceftazidime and compound sulfamethoxazole, with prediction accuracy above 90%. Validation on clinical specimens shows that the accuracy of carbapenem susceptibility prediction can reach 100%, and the proportion of samples for which a definite susceptibility prediction can be reported exceeds 80%. The invention can assist clinical detection and diagnosis of infections with drug-resistant bacteria.

Drawings

FIG. 1 is a technical roadmap of the present invention;

FIG. 2 is a graph showing the results of a drug-resistant gene detection test performed by simulating 100X target gene read data for target genes with different identities from other genes in the database. In the figure, ARG-like represents the ratio of the detected number of sequences of a target gene or a non-target gene to the detected number of sequences of a family to which the target gene belongs, and specificity represents the detected number of Specific sequences of the target gene.

FIG. 3 is a graph showing the accuracy of read-based identification of genotypes or gene families, taking the assembly-based drug resistance gene detection results as reference.

FIG. 4 is a technical flow chart of the Score index calculation and report rules based on read-based drug resistance gene detection.

FIG. 5 is a graph of the change in performance (AUC value) of the 6 antibiotic resistance prediction models under different simulated sequencing data amounts.

FIG. 6 is a graph of the performance (AUC values) of the prediction models for the 6 antibiotic drugs at 30X genomic data volume for the training and validation sets.

Detailed Description

Embodiments of the present application will be described in detail below with reference to examples, but those skilled in the art will appreciate that the following examples are only illustrative of the present application and should not be construed as limiting its scope. Where specific conditions are not specified in the examples, conventional conditions or conditions recommended by the manufacturer were used. Reagents or instruments for which no manufacturer is indicated are all conventional, commercially available products.

Definition of partial terms

Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present application are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present application.

As used in this application, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.

Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.

The term "about" in the present application denotes an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.

The term "drug resistance" as used herein refers to the tolerance of microorganisms, parasites and tumor cells to the action of drugs; once drug resistance develops, the effect of the drug is significantly reduced. In this application it preferably refers to the resistance of bacteria infecting the body to antibiotic drugs.

The term "drug-resistant phenotype" as used herein generally refers to the drug resistance characteristic exhibited by an organism (resistance phenotype), whereas the drug-resistant genes possessed by the organism are referred to as its drug-resistant genotype (resistance genotype).

The "non-core gene" as referred to herein is a gene that exists in only some strains of a given bacterial population, as opposed to a core gene, i.e., a gene present in all strains. The drug resistance genes detected by the method are mainly non-core genes.

The "important characteristic gene" refers to a non-core drug resistance gene, i.e., a drug resistance characteristic or drug resistance gene significantly related to the drug resistance phenotype for an antibiotic drug.

The "read-based" of the present invention refers to read sequence alignment: the sequencing reads are compared directly with a drug-resistant gene library to detect and analyze drug-resistant genes.

The "assembly-based" of the present invention refers to genome contig sequence alignment: the sequencing reads are first assembled into contigs of the species genome, and the contig sequences are then compared with a drug-resistant gene library to detect and analyze drug-resistant genes.

The "BGWAS" or "BGWAS model" of the present invention refers to bacterial genome-wide association analysis, i.e., association analysis of bacterial genome data and drug-resistant phenotype data to screen for important drug-resistant characteristics or drug-resistant genes that are significantly related to a drug-resistant phenotype.

Correspondingly, the "BGWAS model training set" refers to all bacterial strain data used in performing bacterial whole genome association analysis, i.e., the model training set.

For details of the "BGWAS" or "BGWAS model", see the applicant's earlier patent CN202111400540.5. The model specifically includes the following modules:

module 1) for obtaining target bacterial strain genome data and collecting corresponding drug sensitivity test result data;

module 2) for performing alignment annotation of a drug-resistant database based on contig sequences of bacterial genomes;

module 3) is used for carrying out genotype and drug-resistant phenotype data correlation analysis aiming at the target drug, screening important characteristic genes related to drug resistance generation and calculating weight coefficients of the important characteristic genes; preferably, the important characteristic gene is a non-core type drug resistance gene.

module 4) is used for ROC analysis, evaluating the performance of the model in predicting drug sensitivity results based on the screened important genes.

The ROC analysis is as follows: a Score value is defined and calculated from the matrix of important gene weight coefficients obtained in module 3); the Score value is used as the negative/positive interpretation index; an ROC curve is drawn; a cutoff value is determined; and the model performance is verified and evaluated with validation set samples.

In the Score calculation, arg_W_i denotes the weight coefficient of the corresponding detected gene.

Further, in step 1), the number of bacterial strain genomes is ≥ 100, the strain sources cover various subtypes, and the ratio of resistant to sensitive strains is balanced. In some preferred forms, the genomes are obtained by searching public databases and downloading published target genome sequences, or by sequencing and assembling bacterial strains identified from currently collected clinical cultures. In some more preferred forms, the public database search and download is carried out as follows: bacterial strain information with recorded drug susceptibility test results is collected from the NCBI NDARO database and the PATRIC database platform, the phenotype data are collated, and the genome data are downloaded in batches from the NCBI genome database by genome assembly ID or from the PATRIC database by PATRIC ID. Further, the alignment annotation in step 2) is: align the contig sequences against the CARD drug-resistant gene reference sequence library, filter out hits with low identity and coverage (preferably, hits with identity below 90% or reference gene coverage below 90%), select the best hit in each contig alignment region as the final alignment result for that region, and add the annotation information of the drug-resistant gene. Further, the association analysis in step 3) uses a lasso regression model.
Further, the lasso regression association analysis in step 3) specifically comprises: taking the gene detection distribution matrix and the antibiotic susceptibility test result data matrix as input, performing association analysis of genotype and drug-resistant phenotype data with the glmnet package, repeating cross validation k times (preferably k = 5-15), screening to obtain the important characteristic genes related to the drug-resistant phenotype, and calculating their weight coefficients. Further, the important characteristic genes are selected as follows: from the curves of model CV error rate and AUC against the number of characteristic genes, the set of genes at which the CV error rate is lowest and the model AUC is relatively stable is taken as the important characteristic genes. Further, step 3) may also include a manual recall: genes with a high PPV for the drug-resistant phenotype (preferably PPV ≥ 0.8) are manually recalled, and the weight coefficients of the recalled genes are calculated from the weight coefficient values of the important genes obtained above. Further, the bacteria described herein include, but are not limited to, Escherichia coli, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter cloacae complex, Staphylococcus aureus, Enterococcus faecium, Enterococcus faecalis, Streptococcus pneumoniae, Streptococcus pyogenes, Haemophilus influenzae and Staphylococcus epidermidis; Klebsiella pneumoniae is preferred.
Further, the drug-resistant phenotypes described herein include, but are not limited to, resistance phenotypes for carbapenems, cephalosporins, penicillins, β-lactam antibiotic-inhibitor combinations, aminoglycosides, sulfonamides, tetracyclines, quinolones, glycopeptides, oxazolidinones and polymyxins; preferably, the drug-resistant phenotype is a carbapenem drug-resistant phenotype.
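
The manual recall step described above can be illustrated with a short sketch. It assumes that the PPV of a gene is the fraction of gene-positive training strains that are phenotypically resistant; the function names, the gene names and the toy data are hypothetical and are not taken from the application.

```python
def gene_ppv(gene_present, resistant):
    """PPV of a gene for the resistant phenotype.

    gene_present, resistant: equal-length lists of booleans, one per strain.
    Returns None when the gene is detected in no strain.
    """
    carriers = [r for g, r in zip(gene_present, resistant) if g]
    if not carriers:
        return None
    return sum(carriers) / len(carriers)


def recall_genes(presence_by_gene, resistant, selected, min_ppv=0.8):
    """Genes not already selected by the model whose PPV is >= min_ppv."""
    recalled = []
    for gene, present in presence_by_gene.items():
        if gene in selected:
            continue
        ppv = gene_ppv(present, resistant)
        if ppv is not None and ppv >= min_ppv:
            recalled.append(gene)
    return sorted(recalled)


# Toy data (hypothetical): five strains, three genes.
resistant = [True, True, True, False, False]
presence = {
    "geneA": [True, True, False, False, False],  # 2 of 2 carriers resistant
    "geneB": [True, False, False, True, False],  # 1 of 2 carriers resistant
    "geneC": [True, True, True, True, False],    # already selected by the model
}
recalled = recall_genes(presence, resistant, selected={"geneC"})
```

In this toy case only geneA passes the PPV ≥ 0.8 condition and is recalled.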

The application is illustrated below with reference to specific examples.

Example 1 establishment of the method of the invention

Fig. 1 is a technical route diagram of the present invention, and the steps are described as follows:

I) Based on the important genes screened by BGWAS, construct a drug-resistant gene detection and phenotype prediction process based on sequencing read alignment, test and verify it on simulated data, and determine the positive cutoff value.

1.1 Using the drug resistance gene classification information of the CARD drug resistance database (v3.1.0), re-collate and correct annotation information such as the family to which each drug resistance gene belongs, its possible source species, its mediation mode, and the identity between gene sequences. First, gene families are defined according to the family information recorded in CARD, and are collated and corrected with reference to the family information recorded in the NCBI NDARO and MEGARes databases. For example, the OXA family can be subdivided into subfamilies such as the OXA-48 family and the OXA-51 family, and the contribution of different subfamilies to resistance can vary with the antibiotic drug, so OXA typing genes need to be assigned at the subfamily level, e.g., OXA-181 and OXA-232 belong to the OXA-48 family. Second, all drug-resistant reference genes are aligned against the NCBI NT library, and hits with identity ≥ 95% and subject coverage ≥ 95% are retained to obtain all species annotations for each drug-resistant gene; the keyword "plasmid" is searched in the description of the aligned reference sequences to determine whether the drug-resistant gene is plasmid-mediated. Finally, all drug-resistant gene sequences are aligned pairwise with blastn to obtain the identity values between all gene sequences.
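
The species and mediation-mode annotation in 1.1 can be sketched as follows. This is a minimal illustration, assuming the NT alignment hits have already been parsed into dictionaries; the field names (`identity`, `subject_cov`, `species`, `description`) and the example hits are hypothetical.

```python
def annotate_gene(hits, min_identity=95.0, min_subject_cov=95.0):
    """Collect species and plasmid evidence for one reference gene.

    hits: list of dicts with 'identity' and 'subject_cov' in percent,
    plus 'species' and 'description' strings (hypothetical field names).
    """
    kept = [h for h in hits
            if h["identity"] >= min_identity
            and h["subject_cov"] >= min_subject_cov]
    species = sorted({h["species"] for h in kept})
    # Plasmid mediation is inferred from the keyword in the hit description.
    plasmid = any("plasmid" in h["description"].lower() for h in kept)
    return {"species": species, "plasmid_mediated": plasmid}


# Hypothetical hits for one reference gene.
hits = [
    {"identity": 100.0, "subject_cov": 100.0,
     "species": "Klebsiella pneumoniae",
     "description": "Klebsiella pneumoniae strain X plasmid pX, complete sequence"},
    {"identity": 99.1, "subject_cov": 97.0,
     "species": "Escherichia coli",
     "description": "Escherichia coli strain Y chromosome"},
    {"identity": 91.0, "subject_cov": 99.0,
     "species": "Acinetobacter baumannii",
     "description": "Acinetobacter baumannii strain Z plasmid"},
]
annotation = annotate_gene(hits)
```

The third hit falls below the identity threshold, so it contributes neither a species nor plasmid evidence.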

1.2 weight coefficient calculation of drug resistance gene family. The weight coefficient of the gene family is calculated based on the weight coefficient of the member genes in the family and the sample detection frequency of the corresponding genes in the BGWAS model training set (see the applicant's prior patent CN202111400540.5 specifically). The calculation formula is as follows:

In the formula, arg_N_i is the number of training set samples (i.e., samples used for model training in the earlier BGWAS analysis) in which the corresponding gene of the target gene family is detected, and arg_W_i is the weight coefficient of the corresponding gene in the family; j is the number of key genes in the family, and j + k is the total number of genes in the family.

1.3 simulation of NGS or ONT sequencing reads for detection of drug resistance genes and flow correction

1.3.1 Reads data simulation

Based on the training set strain samples used in the earlier BGWAS model, ART_Illumina software (version 2.5.8) is used to simulate genome sequencing short reads (Illumina SE75) of the bacterial strains (parameters: -ss NS50 -l 75 -f 5 -nf 0 -rs 1). Nanopore sequencing reads are simulated using ReadSim software (version 1.6) with parameters: -rev_strd on -tech nanopore -read_mu 3000 -read_dist normal. Given that the genome depth of pathogenic bacteria detected at the 20 M reads data volume of routine clinical specimen sequencing generally does not exceed 1X, gradient data volumes of 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X and so on are simulated.
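
To make the depth gradient concrete, the number of reads needed for a given depth follows from the usual relation depth = reads × read length / genome length. The sketch below assumes a genome length of 5,500,000 bp, which is an illustrative value only and does not come from the application.

```python
import math


def reads_for_depth(depth, genome_length, read_length=75):
    """Number of single-end reads needed to reach a mean depth (rounded up)."""
    return math.ceil(depth * genome_length / read_length)


GENOME_LENGTH = 5_500_000  # assumed genome size in bp (illustrative only)
DEPTHS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 5, 10, 30]

# Read count for each simulated depth in the gradient.
reads_per_depth = {d: reads_for_depth(d, GENOME_LENGTH) for d in DEPTHS}
```

Under this assumption, 1X corresponds to about 73 thousand 75 bp reads, which illustrates how small the pathogen fraction of a 20 M read clinical run typically is.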

1.3.2 sequence alignment and Annotation statistics

The simulated Illumina reads are aligned against the drug-resistant gene database using blastn (version 2.9.0+; parameters: -evalue 1e-5 -outfmt 6). Hits are first filtered to retain only those with identity above 90%, and the highest-scoring hit (best hit) of each read is then selected as the final hit. If several hits share the highest score (multiple alignment), the read is annotated by applying an LCA algorithm to those hits (i.e., when a single read cannot be assigned a genotype because of multiple alignment, it is annotated at a higher level, the gene family level). The number of specific reads and the number of multiply aligned reads detected for each drug-resistant gene in the sample, and the number of specific reads of each drug-resistant gene family, are then counted, and the coverage of each detected gene is calculated.

For the simulated nanopore reads, alignment to the drug-resistant reference gene library is performed using minimap2 (version 2.17; parameters: -c -x map-ont -L --secondary=no), and hits with identity below 0.7 or subject coverage below 0.4 are filtered out. The reads are then annotated by best alignment or the LCA algorithm, the number of specific reads and multiply aligned reads detected for each drug-resistant gene in the sample and the number of reads of the family to which it belongs are counted, and the coverage of each detected gene is calculated.
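
The best-hit and LCA annotation of a single read can be sketched as follows. This is a simplified illustration with a two-level hierarchy (gene, then gene family); the tuple layout of a hit, the gene-to-family mapping and the handling of ties that span several families are assumptions made for the example.

```python
from collections import Counter


def annotate_read(hits, gene_family, min_identity=90.0):
    """Annotate one read from its alignment hits.

    hits: list of (gene, identity_percent, score) tuples.
    Returns ("gene", name), ("family", name), or None.
    """
    kept = [h for h in hits if h[1] > min_identity]
    if not kept:
        return None
    top = max(h[2] for h in kept)
    best_genes = {h[0] for h in kept if h[2] == top}
    if len(best_genes) == 1:
        return ("gene", next(iter(best_genes)))      # specific read
    families = {gene_family[g] for g in best_genes}
    if len(families) == 1:
        return ("family", next(iter(families)))      # LCA is the family
    return None  # tie across families: left unannotated in this sketch


# Hypothetical gene-to-family mapping and reads.
gene_family = {"KPC-2": "KPC", "KPC-3": "KPC", "OXA-181": "OXA-48 family"}
reads = {
    "r1": [("KPC-2", 100.0, 139.0), ("KPC-3", 98.7, 133.0)],
    "r2": [("KPC-2", 100.0, 139.0), ("KPC-3", 100.0, 139.0)],
    "r3": [("OXA-181", 85.0, 60.0)],
}
annotations = {r: annotate_read(h, gene_family) for r, h in reads.items()}
counts = Counter(a for a in annotations.values() if a is not None)
```

Here r1 is a specific read for KPC-2, r2 is a multiply aligned read resolved to the KPC family, and r3 is discarded by the identity filter.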

1.3.3 screening filtration of drug-resistant genes with aligned reads sequences

A) Evaluate the influence of sequence identity between different typing genes in the drug-resistant gene reference library on accurate read-based detection of drug-resistant genotypes. Drug-resistant genes with different maximum sequence identities to other genes in the database are selected, short reads are simulated and analyzed with the read-based process, and the number of specific reads and the number of ARG-like reads detected for the target gene (i.e., the number of all reads aligned to the target gene) are counted. On this basis, 95% identity can be chosen as the threshold for deciding whether accurate genotyping can readily be achieved with the strategy of identifying a genotype by the presence of specific reads (see FIG. 2).

B) The identification of drug resistance genotyping adopts a strategy of combining two rules: firstly, aiming at target genes with high similarity (such as consistency exceeding 95%) with other genes in a database, adopting ARG-like reads (namely all reads capable of being compared with the target genes or specific reads plus multiple comparison reads) as true positive detection, and reserving genes with 100% coverage calculated based on accurate comparison reads as true positive detection if the genes are not the first ones; and secondly, aiming at the target genes with low similarity (if the consistency is lower than 95%) with other genes in the database, judging whether the target genes are true positive results or not mainly according to the number of detected specific reads. For some gene families in which specific reads were not detected, they were considered false positives and were directly filtered out.

C) Evaluating the accuracy of the read-based detection for identifying the drug-resistant genotyping. By taking the result of the assembly-based detection of the drug-resistant gene in the BGWAS model training process as reference, the accuracy, sensitivity and specificity indexes of the read-based detection of the drug-resistant gene or gene family are obtained through statistics aiming at the important gene or gene family screened out by the BGWAS model, and as shown in figure 3, the read-based detection process has better performance when detecting most of the drug-resistant gene or gene family.

1.4, defining and calculating a Score value of a negative and positive judgment index, performing ROC analysis by combining simulation tests under different sequencing data quantities, and determining a report rule and a cutoff threshold value.

1.4.1 Score index values are defined and calculated from the results of drug resistance gene detection of samples based on the important genes (and gene families) obtained by screening and their weight coefficients. The calculation formula is as follows:

In the formula, arg_W_i is the weight coefficient of the corresponding genotype and genefamily_W_i is the weight coefficient of the corresponding gene family. When a genotype (gt) is detected and its genotype weight coefficient is > 0, the genotype weight coefficient is used; when a genotype is detected but its genotype weight coefficient is 0 or absent, the gene family weight coefficient is used (taken as 0 when the gene family has no weight coefficient).
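
The weight selection rule above can be sketched in a few lines. The sketch assumes that the Score is the sum of the selected weight over all detected genotypes; it does not model family-only detections or the PPV factor handled by the FIG. 4 rules. All gene names and weights are hypothetical.

```python
def score(detected, gene_weight, family_weight, gene_family):
    """Sum the applicable weight for each detected genotype.

    Assumption: Score is a plain sum over detected genotypes.
    """
    total = 0.0
    for gene in detected:
        w = gene_weight.get(gene, 0.0)
        if w > 0:
            total += w  # genotype weight takes precedence when > 0
        else:
            # Fall back to the family weight; 0 when the family has none.
            total += family_weight.get(gene_family.get(gene), 0.0)
    return total


# Hypothetical weights.
gene_weight = {"g1": 0.5, "g2": 0.0}
family_weight = {"famA": 0.25}
gene_family = {"g1": "famA", "g2": "famA", "g3": "famB"}

sample_score = score(["g1", "g2", "g3"], gene_weight, family_weight, gene_family)
```

In this example g1 contributes its genotype weight, g2 falls back to the weight of famA, and g3 contributes 0 because neither the gene nor its family has a weight.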

Specifically, following the rules of FIG. 4, the Score is calculated from the read-based drug resistance gene typing results and the gene weight coefficient matrix, while also taking into account the positive coincidence rate (PPV) of each gene with the drug sensitivity results in the BGWAS model training set. (In FIG. 4, gt denotes ARG type, i.e., drug resistance genotype, and gf denotes ARG family, i.e., drug resistance gene family.)

1.4.2 aiming at the simulation test under different data volumes based on the training set strains, carrying out ROC analysis based on the drug resistance gene detection result and the actual drug sensitivity result of the training set strains, evaluating the drug sensitivity prediction model performance (AUC value) under different sequencing data volumes, and determining the cutoff threshold.

Specifically, separate thresholds were set for reporting "drug resistance" and "sensitivity", taking into account model performance and the influence of sequencing data volume. The threshold for reporting "drug resistance", denoted R_cutoff, is set at a data volume at which genome sequencing of the target strain is sufficient and model performance is stable (e.g., 30X), and is the Score value corresponding to the maximum Youden index. For reporting "sensitivity", two sets of threshold criteria are used. First, the R_cutoff determined at sufficient sequencing data volume (30X, model stable) is taken as a threshold for reporting sensitivity and denoted S_cutoff2; NPV > 0.9 must be satisfied at this threshold, otherwise sensitivity cannot be reported. Second, the Score value is calculated directly from the gene family weight coefficients, and the minimum sequencing data volume at which model performance is stable is found and denoted gf_LOD1; at this data volume, the maximum Score value that satisfies NPV > 0.9 (and is less than or equal to S_cutoff2) is taken as the threshold for reporting "sensitivity" and denoted S_cutoff1. Next, for each simulated sequencing data volume between gf_LOD1 and 30X, the feasibility of using S_cutoff2 as the threshold for reporting "sensitivity" is examined, i.e., the minimum data volume at which NPV > 0.9 is satisfied is found and denoted gf_LOD2. The two sets of threshold criteria for reporting "sensitivity" are thus: when the sequencing data volume is above gf_LOD2, S_cutoff2 is used as the threshold; when it is between gf_LOD1 and gf_LOD2, S_cutoff1 is used.
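
The choice of R_cutoff by the Youden index and the NPV check can be sketched as follows. The sketch assumes that a sample is called resistant when its Score is above the threshold and is called sensitive when its Score is below it; the function names and the toy scores are hypothetical.

```python
def youden_cutoff(scores, resistant):
    """Threshold maximising sensitivity + specificity - 1.

    A sample is called resistant when score > threshold.
    """
    pos = sum(resistant)
    neg = len(resistant) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, r in zip(scores, resistant) if r and s > t)
        tn = sum(1 for s, r in zip(scores, resistant) if not r and s <= t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


def npv(scores, resistant, threshold):
    """Fraction of samples with score < threshold that are truly sensitive."""
    called = [r for s, r in zip(scores, resistant) if s < threshold]
    if not called:
        return None
    return sum(1 for r in called if not r) / len(called)


# Toy data (hypothetical): three sensitive and three resistant strains.
scores = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
resistant = [False, False, False, True, True, True]
r_cutoff, j_max = youden_cutoff(scores, resistant)
npv_at_cutoff = npv(scores, resistant, r_cutoff)
```

With these toy scores the two classes separate cleanly, so the cutoff lands on the highest sensitive score and the NPV condition (> 0.9) is met.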

The specific report rule for predicting the drug sensitivity of a given antibiotic is as follows: when the Score value is greater than R_cutoff, "potential drug resistance" is reported; when the genome coverage of the detected pathogen is greater than or equal to gf_LOD2 and the Score value is less than S_cutoff2, or when the genome coverage is between gf_LOD1 and gf_LOD2 and the Score value is less than S_cutoff1, "sensitivity" is reported; when the genome coverage of the detected pathogen is less than gf_LOD1 and the Score value is less than S_cutoff1, "/" (unknown) is reported.
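
The report rule can be written as a small decision function. This is a sketch under two assumptions that the text does not spell out: any combination not covered by the stated rules is reported as "/", and a threshold of `None` stands for an antibiotic for which "sensitivity" cannot be reported. The numeric thresholds in the example are hypothetical.

```python
def report(score, coverage, r_cutoff, s_cutoff1, s_cutoff2, gf_lod1, gf_lod2):
    """Apply the report rule to one pathogen-antibiotic pair."""
    if score > r_cutoff:
        return "potential drug resistance"
    if coverage >= gf_lod2 and s_cutoff2 is not None and score < s_cutoff2:
        return "sensitivity"
    if (gf_lod1 <= coverage < gf_lod2
            and s_cutoff1 is not None and score < s_cutoff1):
        return "sensitivity"
    return "/"  # unknown (assumed for every case the rule does not cover)


# Hypothetical thresholds for one antibiotic.
params = dict(r_cutoff=0.5, s_cutoff1=0.2, s_cutoff2=0.5,
              gf_lod1=0.1, gf_lod2=0.4)

calls = [
    report(0.8, 0.05, **params),  # high score: resistance at any coverage
    report(0.3, 0.60, **params),  # coverage >= gf_LOD2, score < S_cutoff2
    report(0.3, 0.20, **params),  # mid coverage, score not below S_cutoff1
    report(0.1, 0.20, **params),  # mid coverage, score < S_cutoff1
    report(0.1, 0.05, **params),  # coverage below gf_LOD1
]
```

Note that "potential drug resistance" does not depend on genome coverage, whereas a "sensitivity" call requires enough coverage to trust a low Score.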

II) prediction of drug resistance gene species affiliation

2.1 detection of target pathogenic bacteria species and prediction of species affiliation of drug-resistant genes

2.1.1 comparison of genome sequence of target pathogenic bacteria and calculation of detected sequence number, genome coverage and coverage depth

Clinical common pathogenic bacteria (Klebsiella pneumoniae, Escherichia coli, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter cloacae and the like) are selected as target pathogens, and a target pathogen reference genome is searched and downloaded from an NCBI genome database and is used as a reference sequence library for identifying target pathogen strains.

The Illumina reads are aligned against the target pathogen reference genome sequence library set up above using minimap2 (v2.17; alignment parameters: -x sr -a --secondary=no -L), and the number of detected sequences, genome coverage and coverage depth of the aligned target pathogen species are then calculated. For alignment of nanopore sequencing reads, the parameters are set to -x map-ont -a --secondary=no -L.

Then, the total reads sequence number, the genome coverage and the coverage depth of the detected pathogenic bacteria are obtained through statistics.
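
Genome coverage and coverage depth can be derived from the aligned intervals of the reads. The sketch below assumes that coverage is the fraction of reference positions covered by at least one read and that depth is the total number of aligned bases divided by the genome length; the application does not define these terms further, and the interval representation is hypothetical.

```python
def coverage_and_depth(intervals, genome_length):
    """Return (covered fraction, mean depth) for one reference genome.

    intervals: list of (start, end) half-open alignment positions.
    """
    total_bases = sum(end - start for start, end in intervals)
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the overlapping block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length, total_bases / genome_length


# Three 75 bp reads on a 1,000 bp toy reference; the first two overlap.
cov, depth = coverage_and_depth([(0, 75), (50, 125), (200, 275)], 1000)
```

Overlapping reads increase depth but not coverage, which is why the two values differ in the example.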

2.1.2 based on the BGWAS model training set specimen drug-resistant gene detection result, counting the copy number of the drug-resistant gene carried by the target pathogenic bacteria

2.1.3 drug resistance Gene copy number calculation and species affiliation judgment based on the presumed Gene-species affiliation relationship

When assuming a drug resistance gene-species correspondence, the main basis is: a. annotating in the database the species of the reference gene for inclusion of the target species, and if so, accepting an assumption of the gene-species affiliation; b. if a is not satisfied, checking the mediation mode annotation of the reference gene in the database, judging whether the mediation mode of the plasmid is contained, and if so, accepting the hypothesis of the attribution relationship of the gene and the species; c. if a and b are not satisfied, the species source is presumed according to the species annotation of the ARG-like reads. Then, the copy number of the drug-resistant gene is calculated according to the following formula:
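
Rules a to c and the subsequent copy-number check can be sketched as a decision function. The sketch takes the copy number as an already computed input rather than computing it. It also assumes that rule c accepts the hypothesis when the target species appears among the species annotations of the ARG-like reads, and that the normal range is supplied by the caller; the data structures and example values are hypothetical.

```python
def accept_species(gene, target, db_species, db_plasmid, read_species,
                   copy_number, normal_range):
    """Return (accepted, basis) for the hypothesis 'gene comes from target'."""
    if target in db_species.get(gene, ()):
        basis = "a"                      # database species include the target
    elif db_plasmid.get(gene, False):
        basis = "b"                      # plasmid-mediated per the database
    elif target in read_species.get(gene, ()):
        basis = "c"                      # species annotation of ARG-like reads
    else:
        return False, None               # no basis for the hypothesis
    low, high = normal_range
    return low <= copy_number <= high, basis


# Hypothetical annotations.
db_species = {"geneX": {"Klebsiella pneumoniae"},
              "geneY": {"Escherichia coli"}}
db_plasmid = {"geneY": True}
read_species = {}

result_x = accept_species("geneX", "Klebsiella pneumoniae", db_species,
                          db_plasmid, read_species, 1.2, (0.5, 3.0))
result_y = accept_species("geneY", "Klebsiella pneumoniae", db_species,
                          db_plasmid, read_species, 10.0, (0.5, 3.0))
```

In the example, geneX is accepted under rule a, while geneY is hypothesised under rule b but rejected because its copy number is outside the assumed normal range.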

III) Detection and verification of drug resistance genes by metagenome sequencing of clinical specimens

Collecting clinical samples which are cultured and tested for drug sensitivity at the same time, transferring the samples to a medical examination laboratory according to requirements for pretreatment, nucleic acid extraction, library establishment and on-machine sequencing (Illumina CN500 SE75 or nanopore sequencing), and then performing comparison analysis on target species genome and drug-resistant gene database on reads sequences obtained by sequencing to identify pathogenic bacteria and drug-resistant gene conditions carried by the pathogenic bacteria in the samples.

3.1 Quality control filtering of low-quality raw sequences and removal of human-derived nucleic acid sequences

Sequence data from the Illumina platform were processed as follows:

a. the raw sequencer output was processed with bcl2fastq (v2.20.0.422) to convert the BCL format data into fastq format sequence data;

b. the raw fastq sequence data were filtered with fastp (v0.19.5; parameters: -q 15 -u 40 -l read_length*0.67) to remove low-quality and short sequences; in addition, komplexity (v0.3.6; parameters: -F -t 0.4) was used to calculate sequence complexity and filter out low-complexity sequences;

c. the clean sequences obtained from quality control filtering were aligned against the human reference genome (human_38) with bowtie2 (v2.3.4.3; parameters: --mm --very-sensitive -k 1) to filter out human sequences.

3.2 detection and identification of drug-resistant genes contained in the sample and species affiliation thereof

Drug-resistant gene alignment and annotation statistics of the sample sequences are carried out with blastn according to step 1.3.2, target species genome alignment is carried out with minimap2 according to step 2.1.1 and the detected genome coverage is calculated, and the species attribution of the detected drug-resistant genes is then predicted and evaluated according to step 2.1.3.

3.3 For each pathogen of interest that is detected, the score of the target species-antibiotic pair is calculated from the detected drug resistance genes as defined in 1.4.1 and compared with the cutoff value: when score > cutoff, R is predicted; when score < cutoff, S is predicted if the detected pathogen genome coverage is higher than the minimum genome coverage or data volume required for model stability as determined in step 1.4.2 (e.g., > 40%); otherwise "/" (unknown) is reported.

3.4 Finally, the drug sensitivity prediction results are compared with the actual drug sensitivity test results of the clinical specimens collected at the same time, the accuracy of the prediction results and the proportion of samples with a valid report are counted, and the performance of the drug resistance detection process is evaluated.
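
The two summary statistics in 3.4 can be computed as in the sketch below. It assumes that accuracy is calculated only over samples with a reported prediction (R or S) and that the reportable proportion is the fraction of samples not reported as "/"; the example predictions are hypothetical.

```python
def evaluate(predicted, actual):
    """Return (accuracy among reported samples, proportion reported)."""
    reported = [(p, a) for p, a in zip(predicted, actual) if p != "/"]
    proportion = len(reported) / len(predicted)
    if not reported:
        return None, proportion
    accuracy = sum(1 for p, a in reported if p == a) / len(reported)
    return accuracy, proportion


# Hypothetical predictions against culture-based susceptibility results.
predicted = ["R", "S", "/", "R"]
actual = ["R", "S", "S", "S"]
accuracy, proportion = evaluate(predicted, actual)
```

Here three of four samples receive a report, and two of those three agree with the susceptibility test.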

Example 2 detection of drug-resistant genes and phenotypic prediction analysis of Klebsiella pneumoniae against clinical specimens

1. Based on Klebsiella pneumoniae genome, BGWAS screens to obtain antibiotic-related important drug resistance genes and corresponding gene families, and calculates to obtain weight coefficients of the genes or the gene families

Klebsiella pneumoniae strain genome data with drug sensitivity test results were collected and downloaded from the NCBI NDARO and PATRIC databases and filtered to obtain 3072 strain samples (2410 in the training set and 662 in the validation set). Important genes related to antibiotic resistance and their weight coefficient matrix were obtained by screening with a machine learning method, and the weight coefficients of the families to which the important genes belong were calculated according to the formula of technical scheme 1.2. The results are as follows:

2. Based on the 2410 Klebsiella pneumoniae genomes in the training set, 75 bp short reads (NGS sequencing platform) were simulated according to technical scheme 1.3 to test and verify the read-based drug resistance gene detection process, at gradient data volumes of 0.05X, 0.1X, 0.2X, 0.3X, 0.4X, 0.5X, 0.6X, 0.7X, 0.8X, 0.9X, 1X, 2X, 3X, 5X, 10X, 30X and so on. Drug resistance gene detection was then carried out to obtain the result for each simulated sample, the Score value was calculated according to technical scheme 1.4.1, and ROC curve analysis was performed to obtain the model performance AUC value of each antibiotic drug at each data volume; the resulting AUC curves are shown in FIG. 5. At a data volume of 30X the model performance has stabilized, and the AUC values of each antibiotic model at this data volume are shown in FIG. 6. Finally, according to technical scheme 1.4.2, the report rule and cutoff thresholds of each antibiotic drug were determined as shown in the following table.

Cutoff thresholds for mNGS-based drug sensitivity prediction for 6 antibiotic drugs in Klebsiella pneumoniae.

Note: "/" indicates that "sensitivity" cannot be reported.

3. A total of 48 clinical specimens identified by culture as containing Klebsiella pneumoniae were collected and stored in a freezer at -80 °C. After nucleic acid extraction, metagenomic next-generation sequencing libraries (insert length 200-400 bp) were constructed from the 48 specimens and sequenced on a next-generation platform (Illumina NextSeq CN500, SE75). Pathogens and the drug resistance genes they carry were identified from the sequencer output, and the Score value was then calculated and the drug sensitivity result predicted. The detection and identification results for the target pathogen and its drug resistance genes, and the drug sensitivity prediction results, were thus obtained for each specimen. Results for some of the samples are shown in the following table:

Note: ND means not detected; "/" means unknown.

The accuracy of drug sensitivity prediction and the proportion of reportable samples are shown in the following table:

The results show that, for clinical specimens, the invention can effectively and accurately identify pathogenic bacteria and the drug-resistant genes they carry and predict the drug sensitivity results for antibiotic drugs, and can be used to assist the clinical detection and diagnosis of infections with drug-resistant bacteria.

Further, the above results show that, for gentamicin, AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2'')-Ia are important drug resistance genes. Taking into account the gene weights, gene family weights, gene occurrence frequencies and all possible mechanisms of resistance, in practice the high-frequency, high-weight genes that mainly mediate resistance, namely AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2'')-Ia, can be detected simultaneously. If all detection results are negative, drug sensitivity can be presumed, thereby achieving drug susceptibility prediction, in particular the detection of sensitivity.

Further, the above results show that, for tobramycin, AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-cr, AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2'')-Ia and aadA24 are important drug resistance genes. Taking into account the gene weights, gene family weights, gene occurrence frequencies and all possible mechanisms of resistance, in practice the high-frequency, high-weight genes that mainly mediate resistance, namely AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2'')-Ia and aadA24, can be detected simultaneously. If all detection results are negative, drug sensitivity can be presumed, thereby achieving drug susceptibility prediction, in particular the detection of sensitivity.
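The negative-panel rule described for gentamicin and tobramycin amounts to checking that none of the listed genes was detected. The sketch below shows this for the gentamicin gene list only; the same check would apply to the tobramycin list. The function name and the use of plain gene-name strings are assumptions for illustration, and the sketch does not replace the Score-based prediction described elsewhere in this document.

```python
# Gentamicin panel as listed in the text
GENTAMICIN_PANEL = frozenset({
    "AAC(3)-IIe", "AAC(3)-IV", "AAC(3)-IId", "rmtC", "armA",
    "rmtF", "rmtB", "AAC(6')-33", "ANT(2'')-Ia",
})


def presumed_sensitive(detected_genes, panel):
    """True when none of the panel genes is among the detected genes."""
    return panel.isdisjoint(detected_genes)


# Hypothetical detection results for two samples
sample_without_panel_gene = {"SHV-11", "dfrA14"}
sample_with_panel_gene = {"SHV-11", "armA"}
```

Here `presumed_sensitive(sample_without_panel_gene, GENTAMICIN_PANEL)` evaluates to `True`, while the second sample, which carries armA, evaluates to `False`.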

Further, the above results show that, for ceftazidime, CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 are important drug resistance genes. Taking into account all possible mechanisms of resistance, in practice the high-frequency, high-weight genes that mainly mediate resistance, namely CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and/or KPC-2, can be detected simultaneously; if the detection result is positive, drug resistance can be presumed. At the same time, the results of the 30X genome simulated-reads test show that no suitable threshold can be found such that the model NPV is high (e.g. >0.9) when the calculated Score value is below the threshold, so the purpose of detecting sensitivity cannot be achieved.

Further, the above results show that, for compound sulfamethoxazole (trimethoprim-sulfamethoxazole), dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 are important drug resistance genes. Taking into account the gene weights, gene family weights, gene occurrence frequencies and all possible mechanisms of resistance, in practice the high-frequency, high-weight genes that mainly mediate resistance, namely dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and/or dfrA1, can be detected simultaneously; if the detection result is positive, drug resistance can be presumed. This achieves the purpose of drug susceptibility prediction, specifically resistance prediction. At the same time, the results of the 30X genome simulated-reads test show that no suitable threshold can be found such that the model NPV is high (e.g. >0.9) when the calculated Score value is below the threshold, so the purpose of detecting sensitivity cannot be achieved.
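The statement that no suitable threshold can be found corresponds to a search over candidate thresholds for one at which samples scoring below it have a high negative predictive value (NPV). The sketch below is one possible form of that search, not the procedure of technical scheme 1.4.2: it assumes that candidate thresholds are the observed Score values and that the largest qualifying threshold is preferred. The function name and example data are hypothetical.

```python
def find_sensitive_threshold(scores, is_resistant, min_npv=0.9):
    """Largest observed score usable as a threshold with NPV > min_npv.

    NPV here is the fraction of truly sensitive samples among those with
    score < threshold. Returns None when no candidate qualifies, in which
    case a "sensitive" result cannot be reported for this antibiotic.
    """
    best = None
    for threshold in sorted(set(scores)):
        below = [r for s, r in zip(scores, is_resistant) if s < threshold]
        if not below:
            continue  # nothing would be called sensitive at this threshold
        npv = below.count(False) / len(below)
        if npv > min_npv:
            best = threshold
    return best


# Hypothetical data: scores separate the phenotypes well
separable = find_sensitive_threshold(
    [0.0, 0.1, 0.2, 0.8, 0.9], [False, False, False, True, True])

# Hypothetical data: resistant samples also occur at the lowest scores
not_separable = find_sensitive_threshold(
    [0.0, 0.0, 0.5, 0.5], [True, False, True, False])
```

With the first data set a threshold of 0.8 is returned; with the second the function returns `None`, which mirrors the situation described for ceftazidime and compound sulfamethoxazole.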

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A model construction method for detecting and identifying drug-resistant genes and predicting drug-resistant phenotypes based on gene sequencing reads comparison is characterized by comprising the following steps:

step 1): combining the classification information of drug resistance genes of a drug resistance database, sorting and correcting drug resistance gene family information and consistency annotation information among gene sequences, preferably, sorting and correcting source species information and/or a mediation mode;

the weight coefficient defining formula is as follows:

in the formula, arg_Ni is the number of samples in the BGWAS model training set in which the corresponding gene of the target gene family was detected, arg_Wi is the weight coefficient of the corresponding gene in the family, j denotes the j key genes in the family, and j + k denotes the total number of genes in the family;

step 3): detecting drug-resistant genes and correcting the process;

preferably, the sequencing reads are NGS or Nanopore sequencing reads; more preferred are NGS or Nanopore metagenomic sequencing reads.

2. The method of claim 1,

in the step 1) described above, the step of,

the drug resistance database is a CARD drug resistance database;

3. The method of any one of claims 1 to 2,

the detection and process correction of the drug resistance gene in the step 3) comprises the following steps:

a) sequencing Reads data simulation;

b) sequence alignment and annotation statistics;

c) screening and filtering drug-resistant genes.

4. The method of claim 3,

the a) sequencing Reads data simulation is based on BGWAS model training set strain samples for data simulation;

preferably,

bacterial strain genomic Nanopore sequencing reads sequence data was simulated using ReadSim software.

5. The method of any one of claims 3 to 4,

the b) sequence alignment and annotation statistics comprise:

more preferably,

6. The method of any one of claims 3 to 5,

the c) screening and filtering of drug-resistant genes is: screening and filtering the drug-resistant genes to which reads were aligned in step b);

C) evaluating the accuracy of the read-based detection for identifying the drug-resistant genotyping: and taking the result of the assembly-based detection of the drug-resistant gene in the BGWAS model training process as reference, and counting the important gene or gene family screened out by the BGWAS model to obtain the accuracy, sensitivity and specificity indexes of the read-based detection of the drug-resistant gene or gene family.

7. The method of any one of claims 1 to 6, further comprising the steps of:

step 4): defining and calculating a Score value of a negative and positive judgment index, and determining a report rule and a cutoff threshold value based on ROC analysis;

the Score values were calculated as follows:

8. A model construction method for drug-resistant gene species attribution prediction is characterized by comprising the following steps:

step 3): calculating the copy number of the drug-resistant gene and judging the species affiliation based on the assumed gene-species affiliation relationship;

preferably, the step 1) is specifically

Preferably, the step 2) is specifically:

Preferably, the step 3) is specifically:

9. A method for drug resistance detection of metagenomic sequencing data is characterized by comprising the following steps:

2) detecting and identifying the drug resistance gene contained in the sample: carrying out drug-resistant gene alignment and annotation statistics on sample sequences based on the method of any one of claims 1-6, and detecting and identifying drug-resistant genes contained in the samples;

3) predicting the species attribution of the detected drug-resistant genes in the sample: identifying the target pathogenic bacteria contained in the sample according to the method of claim 8, and predicting the species assignment of the detected drug resistance gene;

4) calculating the score value of the target species-antibiotic drug according to the score calculation mode in claim 7, based on the detected drug resistance genes carried by the target pathogenic bacteria, and comparing the score value with the cutoff value: when score >= cutoff, the result is predicted as R; when score < cutoff, the result is predicted as S if the detected pathogen genome coverage is higher than the minimum genome coverage or data volume required for model stability, and otherwise is reported as unknown.

10. An apparatus, comprising: at least one memory for storing a program; at least one processor configured to load the program to perform the method of any of claims 1-9.

11. The application of detection reagents aiming at non-core type drug-resistant genes AAC(3)-IIe, AAC(3)-IV, AAC(3)-IId, rmtC, armA, rmtF, rmtB, AAC(6')-33 and ANT(2'')-Ia in the preparation of a Klebsiella pneumoniae auxiliary drug susceptibility prediction kit;

more preferably, the drug sensitivity is to a gentamicin drug;

12. The application of detection reagents aiming at non-core type drug resistance genes AAC(3)-IV, AAC(3)-IId, AAC(6')-Ib', AAC(6')-Ib-cr, AAC(6')-Ib-Hangzhou, AAC(6')-Ib4, mphE, ANT(2'')-Ia and aadA24 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;

more preferably, the drug sensitivity is to tobramycin drugs;

13. The application of the detection reagent for the non-core type drug resistance genes CTX-M-55, CTX-M-11, CTX-M-15, SHV-155, SHV-5, SHV-11, SHV-12, SHV-76, SHV-30, SHV-53, SHV-124, SHV-182, DHA-1, KPC-3 and KPC-2 in the preparation of the Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;

the drug susceptibility prediction is a drug resistance prediction;

preferably, the drug sensitivity is directed to a ceftazidime drug.

14. The application of detection reagents aiming at non-core type drug resistance genes dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and dfrA1 in the preparation of a Klebsiella pneumoniae auxiliary drug sensitivity prediction kit;

the drug susceptibility prediction is a drug resistance prediction;

preferably, the drug sensitivity is to a compound sulfamethoxazole drug;

more preferably, drug resistance is presumed by simultaneously detecting the dfrA12, dfrA15, dfrA17, dfrA19, dfrA30, dfrA8, dfrA5, dfrA15b, dfrA14, dfr22, dfrA27 and/or dfrA1 genes, which have high weight and high frequency and mainly mediate drug resistance.