CN115171792A

CN115171792A - Hybrid prediction method of virulence factor and antibiotic resistance gene

Info

Publication number: CN115171792A
Application number: CN202210781902.8A
Authority: CN
Inventors: 彭绍亮; 姬博亚; 皮文定; 刘文娟; 赵雄君
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-11

Abstract

The invention discloses a hybrid prediction method for virulence factors and antibiotic resistance genes, belonging to the technical field of deep learning and bioinformatics. The method comprises the following steps: S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases; S2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture; S3, taking the three types of sequence data from S1 as samples and dividing them into a training data set and a test data set; S4, obtaining a new training data set by using multiple classification methods, constructing a classification model on the new training data set, and obtaining the performance evaluation indices of the classification model. The hybrid prediction method achieves good prediction performance and high prediction accuracy.

Description

Hybrid prediction method of virulence factor and antibiotic resistance gene

Technical Field

The invention relates to the technical field of deep learning and bioinformatics, in particular to a hybrid prediction method of virulence factors and antibiotic resistance genes.

Background

Microbiomes are essential to the internal ecosystems of hosts such as humans, animals and plants, as well as to maintaining the external environment. In particular, pathogenic microorganisms carry virulence factors (VFs) and antibiotic resistance genes (ARGs) that cause disease and may even threaten the life of the host. Accurate and timely identification of VFs and ARGs can effectively guide medical treatment, reduce host morbidity and mortality, and reduce economic losses in fields such as animal husbandry and aquaculture.

Furthermore, although their evolutionary pathways differ, VFs and ARGs share common features that pathogenic bacteria need in order to adapt to and survive in a competitive microbial environment. In particular, both VFs and ARGs are often transferred between bacteria by horizontal gene transfer (HGT), and both use similar systems (i.e., two-component systems, efflux pumps, cell wall alterations and porins) to activate or inhibit the expression of various genes. Pathogens use VFs to cause disease in their hosts, and they can colonize environments under selective antibiotic pressure by acquiring or expressing ARGs. Thus, to understand the causal relationship between microbiome composition, function and disease, VFs and ARGs must be identified together; predicting both simultaneously also saves pathogen monitoring time, particularly for on-site detection of epidemic pathogens. However, conventional bioinformatics tools usually predict either ARGs or VFs independently, are relatively outdated, and have relatively low precision and recall. In addition, conventional prediction methods for VFs and ARGs suffer from a high false negative rate, high sensitivity to the cut-off threshold and the ability to identify only conserved genes, which leads to relatively poor prediction performance. A hybrid prediction method for virulence factors and antibiotic resistance genes is therefore needed.

Disclosure of Invention

To solve the above problems, the present invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes. It addresses the technical problems of the prior art, namely relatively outdated prediction tools, relatively low precision and recall, and relatively poor prediction performance, by using a computational approach that combines machine learning with deep learning neural networks, and thereby achieves better prediction performance.

In order to achieve the purpose, the technical scheme of the invention is as follows:

The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes, comprising the following steps:

S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases;

S2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture from the core gene features;

S3, taking the three types of sequence data from S1 as samples, randomly drawing samples to form a data set, and randomly dividing the data set into five parts, wherein in each split four parts form the training data set and the remaining part forms the test data set;

S4, obtaining a new training data set by using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and obtaining the performance evaluation indices of the classification model.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S1 specifically comprises the following steps:

S11, obtaining known antibiotic resistance gene sequence data from the ARDB, CARD and UniProt databases;

S12, obtaining known virulence factor sequence data from the VFDB, PATRIC, Victors and UniProt databases;

S13, obtaining negative sample gene sequence data from the UniProt database.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S2 comprises the following specific steps:

S21, using the gene sequence information to calculate similarity features based on alignment scores, simple features based on one-hot encoding of the gene sequence, features based on gene evolution information, and features based on gene sequence information;

S22, constructing a deep learning network architecture from the similarity features based on alignment scores and the simple features based on one-hot encoding of the gene sequence, and training a neural network classification model in an end-to-end manner;

S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models with this prior feature information.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the calculation of the similarity features based on alignment scores in S21 comprises the following specific steps:

the DIAMOND program is selected, and the gene sequences in the training data set are aligned, under sensitive parameters, against the remaining 12724 known ARGs and 30945 known VFs used as the reference set;

the training data set has been de-duplicated against the reference set using the CD-HIT program, and the alignment scores are normalized to the [0,1] interval;

the bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed-length feature vector of 12724+30945=43669 dimensions.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene evolution information in S21 consist of three specific features based on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;

wherein the PSSM-composition feature eliminates the variation due to protein sequence length by summing and averaging all rows of the original PSSM profile for each naturally occurring amino acid type, as defined below:

wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;

the RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea comes from the residue probe approach, in which each amino acid corresponding to a particular column of the PSSM is regarded as a probe. The original PSSM is transformed into a 400-dimensional feature vector using the definition below:

wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k denotes the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;

the AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. The AAC-PSSM is converted into a fixed-length 20-dimensional feature vector by averaging the columns of the original PSSM profile, defined as follows:

wherein x_j represents the j-th component of the AAC-PSSM feature vector, i.e., the average proportion of amino acid mutations during evolution, and p_{i,j} represents the element in row i and column j of the original PSSM;

the DPC-PSSM is converted into a fixed-length 400-dimensional feature vector to avoid the information loss caused by X in proteins, defined as follows:

the AADP-PSSM is converted into a fixed-length feature vector of 20+400=420 dimensions by combining the two components.
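The formula images for these PSSM-based features did not survive extraction. The forms below are a reconstruction that is consistent with the symbol definitions above and with the usual definitions of these descriptors; they are not a verbatim copy of the patent's formulas. The sequence length $L$, the indicator $\delta$ and the symbol $y_{i,j}$ for the DPC-PSSM entries are notation assumed here.

```latex
\begin{align*}
% PSSM-composition: r_k is the k-th row of the normalized PSSM
R_i &= \frac{1}{L}\sum_{k=1}^{L} r_k\,\delta(p_k, a_i), & i &= 1,\dots,20 \\
% RPM-PSSM: m_k is the k-th row of the PSSM after negative values are set to 0
M_i &= \frac{1}{L}\sum_{k=1}^{L} m_k\,\delta(p_k, a_i), & i &= 1,\dots,20 \\
% indicator selecting the positions occupied by amino acid a_i
\delta(p_k, a_i) &= \begin{cases} 1, & p_k = a_i \\ 0, & \text{otherwise} \end{cases} \\
% AAC-PSSM: column average of the original PSSM
x_j &= \frac{1}{L}\sum_{i=1}^{L} p_{i,j}, & j &= 1,\dots,20 \\
% DPC-PSSM: averaged products of scores at adjacent positions
y_{i,j} &= \frac{1}{L-1}\sum_{k=1}^{L-1} p_{k,i}\,p_{k+1,j}, & 1 &\le i,j \le 20
\end{align*}
```

Under this reading, each $R_i$ and $M_i$ is a 20-dimensional row, so stacking the 20 rows gives the 400 dimensions stated for RPM-PSSM, and the 20 values $x_j$ together with the 400 values $y_{i,j}$ give the 420 dimensions of AADP-PSSM.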

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;

the amino acid composition feature represents the frequency of the 20 natural amino acids in the protein sequence, calculated as follows:

wherein N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector;

the dipeptide composition feature represents the frequency of each dipeptide in a protein or polypeptide sequence, calculated as follows:

wherein N_ab denotes the number of occurrences of the given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the dipeptide deviation from expected mean (DDE) feature is a combination of three features: the theoretical mean TM, the dipeptide composition DPC and the theoretical variance TV;

the formula for calculating the TM feature is as follows:

wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N, equal to 61, is the total number of possible codons excluding the three stop codons.

The formula for calculating the TV feature is as follows:

wherein TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide.

The formula for calculating the DDE feature is as follows:

wherein DPC represents the DPC feature, TM represents the TM feature, and TV represents the TV feature.
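The formula images for the sequence-based features are likewise missing from this text. The following reconstruction follows directly from the symbol definitions above and from the commonly used definitions of AAC, DPC and DDE; the $N-1$ denominators (the number of adjacent residue pairs in a sequence of length $N$) are an assumption taken from those common definitions rather than from the patent text.

```latex
\begin{align*}
f(a) &= \frac{N(a)}{N}, & a &\in \{A, C, D, \dots, Y\} \\
D(a,b) &= \frac{N_{ab}}{N-1}, & a, b &\in \{A, C, D, \dots, Y\} \\
TM(a,b) &= \frac{C_a}{C_N}\times\frac{C_b}{C_N} \\
TV(a,b) &= \frac{TM(a,b)\,\bigl(1-TM(a,b)\bigr)}{N-1} \\
DDE(a,b) &= \frac{DPC(a,b)-TM(a,b)}{\sqrt{TV(a,b)}}
\end{align*}
```

Here $DPC(a,b)$ is the same quantity as $D(a,b)$, so DDE standardizes each observed dipeptide frequency against the frequency expected from codon usage alone.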

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the classical machine learning classification models trained with the prior feature information in S23 comprise a random forest classification algorithm, an extremely randomized trees (Extra Trees) classification algorithm, an XGBoost classification algorithm, a gradient boosting classification algorithm and an AdaBoost classification algorithm.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S4 comprises the following steps:

S41, running a stacking algorithm over multiple classification methods, and taking the prediction scores of the different classification methods on the training data as a new training data set;

S42, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), scoring the model on the test data set, repeating the experiment five times, and taking the average result of the five experiments as the performance evaluation index of the model.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S41 specifically comprises the following steps:

S411, integrating multiple base-level classification models through a meta-model;

S412, training the base-level classification models on the whole training data set, and training the meta-model using the outputs of the base-level classification models as training features;

S413, training each base-level classification model using 5-fold cross-validation.

By adopting the technical scheme, the invention has the following advantages:

1. The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes that makes full use of multiple key core gene features and combines the strengths of classical ensemble learning and deep learning, so that potential virulence factors and antibiotic resistance genes are predicted efficiently and simultaneously, with a sound scientific basis and high accuracy.

2. The invention can accurately predict virulence factors, antibiotic resistance genes and negative sample genes (genes that are neither virulence factors nor antibiotic resistance genes) at the same time, and can also flexibly predict each class independently. It overcomes the drawbacks of the traditional best-hit method, namely its high false negative rate, its high sensitivity to the cut-off threshold and its ability to identify only conserved genes, and achieves better prediction performance.

3. Compared with existing conventional prediction tools, the invention achieves higher precision and recall on novel virulence factors and resistance genes, on virulence factors and resistance genes in real metagenomic data, and on pseudo virulence factors and resistance genes (gene fragments). The invention uses a computational approach combining machine learning and deep learning neural networks, and its results are competitive with those of the most advanced prediction tools.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.

FIG. 1 is a flow chart of the hybrid prediction method for virulence factors and antibiotic resistance genes according to the invention;

FIG. 2 is a histogram comparing the results of the hybrid prediction method of the invention with those of other computational methods for the simultaneous prediction of virulence factors and antibiotic resistance genes.

Detailed Description

The technical solutions of the present invention are described in detail below with reference to embodiments, and the detailed features and advantages of the present invention are described in detail in the embodiments, which are sufficient for any person skilled in the art to understand the technical contents of the present invention and implement the present invention, and the related objects and advantages of the present invention can be easily understood by those skilled in the art according to the description, the claims and the attached drawings disclosed in the present specification.

Referring to FIG. 1, a hybrid prediction method for virulence factors and antibiotic resistance genes in microbial data comprises the following steps:

S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data (genes that are neither antibiotic resistance genes nor virulence factors);

S1 comprises the following specific steps:

S13, obtaining negative sample gene sequence data from the UniProt database.

S2 comprises the following specific steps:

S21, the multiple core gene features comprise similarity features based on alignment scores, features based on gene evolution information, features based on gene sequence information, and simple features based on one-hot encoding of the gene sequence; accordingly, the gene sequence information is used to calculate each of these four feature types.

The similarity feature based on alignment scores consists of the alignment scores of a candidate sequence against known virulence factors and antibiotic resistance genes. It takes into account the distribution of similarities over the sequences in the ARG and VF databases, not just the best hit. The alignment (bit) score is used as the similarity index because, unlike the e-value, it reflects the degree of identity between sequences and is independent of the size of the database.

Calculating the similarity features based on alignment scores in S21 comprises the following specific steps:

selecting the DIAMOND program, which is faster than BLAST, and aligning the gene sequences in the training data set, under sensitive parameters, against the remaining 12724 known ARGs and 30945 known VFs used as the reference set;

the training data set has been de-duplicated against the reference set using the CD-HIT program to avoid possible label leakage, and the alignment scores are normalized to the [0,1] interval to express sequence similarity in terms of distance;

the bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed-length feature vector of 12724+30945=43669 dimensions, where each dimension is the alignment score output by the DIAMOND program between the full-length gene sequence and one of the available ARGs and VFs in the reference set.
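The construction of the fixed-length similarity vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the hits are assumed to be already parsed into `(query, subject, bit_score)` tuples, the function name `alignment_feature_vectors` is hypothetical, and scaling each query by its own maximum bit score is only one possible way to reach the [0,1] interval, since the text does not specify the normalization.

```python
from collections import defaultdict


def alignment_feature_vectors(hits, reference_ids):
    """Build one fixed-length, [0,1]-scaled bit-score vector per query.

    hits: iterable of (query_id, subject_id, bit_score) tuples, e.g. parsed
          from an aligner's tabular output (parsing is not shown here).
    reference_ids: ordered list of all reference ARG/VF sequence IDs.
    """
    column = {ref: j for j, ref in enumerate(reference_ids)}
    best = defaultdict(dict)
    for query, subject, score in hits:
        if subject in column:
            # Keep the highest bit score if a pair is reported more than once.
            previous = best[query].get(subject, 0.0)
            best[query][subject] = max(previous, float(score))

    vectors = {}
    for query, scores in best.items():
        top = max(scores.values())
        vector = [0.0] * len(reference_ids)  # references without a hit stay at 0
        for subject, score in scores.items():
            # Assumption: scale by the query's own maximum bit score.
            vector[column[subject]] = score / top if top > 0 else 0.0
        vectors[query] = vector
    return vectors


# Toy data: three references instead of the 12724 + 30945 used in the text.
refs = ["ARG_1", "ARG_2", "VF_1"]
hits = [("q1", "ARG_1", 50.0), ("q1", "VF_1", 200.0), ("q2", "ARG_2", 80.0)]
features = alignment_feature_vectors(hits, refs)
```

Queries with no hit at all do not appear in the returned dictionary in this sketch; a caller would have to assign them an all-zero vector.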

The features based on gene evolution information consist of three specific features based on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature. The PSSM-composition feature eliminates the variation due to protein sequence length by summing and averaging all rows of the original PSSM profile for each naturally occurring amino acid type, and is defined as follows:

wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.

The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea is derived from the residue probe approach, in which each amino acid corresponding to a particular column of the PSSM is regarded as a probe. Finally, the original PSSM is converted into a 400-dimensional feature vector using the definition given below:

wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k denotes the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.

The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. First, the AAC-PSSM is converted into a fixed-length 20-dimensional feature vector by averaging the columns of the original PSSM profile, defined as follows:

wherein x_j represents the j-th component of the AAC-PSSM feature vector, i.e., the average proportion of amino acid mutations during evolution, and p_{i,j} represents the element in row i and column j of the original PSSM. Second, the DPC-PSSM is converted into a fixed-length 400-dimensional feature vector to avoid the information loss caused by X in the protein, defined as follows:
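The formula images for the PSSM-based features are missing from this text, so the following Python sketch works from the verbal descriptions and the commonly used definitions of these descriptors. It assumes the PSSM is given as a list of 20-column rows (one per residue) with columns ordered as in `AMINO_ACIDS`; the function names and the averaging over the sequence length are assumptions of this sketch, and any PSSM normalization is left to the caller.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the PSSM


def pssm_composition(sequence, pssm, clip_negative=False):
    """Return a 20x20 matrix flattened to 400 values.

    Row i is the sum of the PSSM rows at positions whose residue is amino
    acid i, divided by the sequence length. With clip_negative=True,
    negative scores are set to 0 first, which corresponds to the RPM-PSSM
    variant described in the text.
    """
    length = len(sequence)
    totals = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = AMINO_ACIDS.find(residue)
        if i < 0:
            continue  # skip non-standard residues such as X
        for j, score in enumerate(row):
            if clip_negative and score < 0:
                score = 0.0
            totals[i][j] += score
    return [value / length for row in totals for value in row]


def aac_pssm(pssm):
    """Column averages of the PSSM: 20 values."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def dpc_pssm(pssm):
    """Averaged products of scores at adjacent positions: 400 values."""
    length = len(pssm)
    out = []
    for i in range(20):
        for j in range(20):
            total = sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            out.append(total / (length - 1))
    return out


def aadp_pssm(pssm):
    """Concatenation of AAC-PSSM and DPC-PSSM: 20 + 400 = 420 values."""
    return aac_pssm(pssm) + dpc_pssm(pssm)


# Toy PSSM for a three-residue sequence (real PSSMs come from a profile search).
toy_seq = "ACA"
toy_pssm = [[1.0] * 20, [-2.0] * 20, [3.0] * 20]
comp = pssm_composition(toy_seq, toy_pssm)
rpm = pssm_composition(toy_seq, toy_pssm, clip_negative=True)
aadp = aadp_pssm(toy_pssm)
```

Because the DPC-PSSM term only uses PSSM columns and never looks up the residue letter, it is unaffected by X residues in the sequence, which matches the motivation given in the text.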

The features based on gene sequence information include the amino acid composition feature (AAC), the dipeptide composition feature (DPC), the dipeptide deviation from expected mean feature (DDE), the pseudo-amino acid composition feature (PAAC) and the quasi-sequence-order feature (QSO).

The amino acid composition feature (AAC) represents the frequency of the 20 natural amino acids (i.e., ACDEFGHIKLMNPQRSTVWY) in a protein sequence and can be calculated as:

wherein N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector.

The DPC feature represents the frequency of each dipeptide in a protein or polypeptide sequence and can be calculated as:

wherein N_ab denotes the number of occurrences of the given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
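As a small illustration of the two composition features, the sketch below computes AAC and DPC for a protein string. The function names are hypothetical, and the DPC denominator of N-1 (the number of adjacent residue pairs) is an assumption based on the common definition, since the formula image is not available in this text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Frequency of each of the 20 amino acids: f(a) = N(a) / N."""
    n = len(sequence)
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def dpc(sequence):
    """Frequency of each of the 400 dipeptides: D(a, b) = N_ab / (N - 1)."""
    n = len(sequence)
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        if pair in counts:  # pairs containing non-standard residues are ignored
            counts[pair] += 1
    return [counts[a + b] / (n - 1) for a in AMINO_ACIDS for b in AMINO_ACIDS]


toy = "ACAC"
toy_aac = aac(toy)  # A and C each make up half of the sequence
toy_dpc = dpc(toy)  # pairs are AC, CA, AC
```

The 400 DPC values are ordered so that dipeptide ab sits at index `20 * AMINO_ACIDS.index(a) + AMINO_ACIDS.index(b)`.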

The DDE feature is a combination of three features: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV). Specifically, the TM feature is calculated as follows:

wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N, equal to 61, is the total number of possible codons excluding the three stop codons.

The TV feature is calculated as follows:

wherein TM represents the TM feature, calculated as described above, and N represents the sequence length of the protein or peptide.

The DDE feature is calculated as follows:

wherein TM represents the TM feature and TV represents the TV feature, both calculated as described above.
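The DDE computation can be sketched as below. The codon counts are those of the standard genetic code (they sum to 61, matching C_N in the text); the formulas TM = (C_a/C_N)(C_b/C_N), TV = TM(1-TM)/(N-1) and DDE = (DPC-TM)/sqrt(TV) follow the commonly used definition of this descriptor and are an assumption here, because the formula images are missing. The function name `dde` is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of codons per amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2,
    "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4,
    "W": 1, "Y": 2,
}
C_N = 61  # all codons except the three stop codons


def dde(sequence):
    """Dipeptide deviation from expected mean: 400 values."""
    n = len(sequence)
    counts = {}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        counts[pair] = counts.get(pair, 0) + 1

    features = []
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            dpc = counts.get(a + b, 0) / (n - 1)          # observed frequency
            tm = (CODONS[a] / C_N) * (CODONS[b] / C_N)    # theoretical mean
            tv = tm * (1.0 - tm) / (n - 1)                # theoretical variance
            features.append((dpc - tm) / math.sqrt(tv))
    return features


toy_dde = dde("ACAC")
```

A positive value means the dipeptide occurs more often than codon usage alone would predict, and a negative value means it occurs less often.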

S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models with this prior feature information.

The classical machine learning classification models trained with the prior feature information in S23 comprise a Random Forest classification algorithm, an extremely randomized trees (Extra Trees) classification algorithm, an XGBoost classification algorithm, a gradient boosting classification algorithm and an AdaBoost classification algorithm.

S3, taking the known antibiotic resistance gene data as the first class of samples, the known virulence factor sequence data as the second class of samples and the known negative sample gene sequence data as the third class of samples, randomly drawing samples from the three classes, and randomly dividing the whole data set into five parts, wherein in each split four parts are used as the training data set and the remaining part is used as the test data set;
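A five-part split of this kind can be sketched with the standard library alone. The function name and the fixed seed are illustrative assumptions; this sketch shuffles all samples together and does not stratify by class, which the text does not specify either way.

```python
import random


def five_fold_splits(samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Toy data set with the three sample classes named in the text.
samples = [("seq%d" % i, label) for i, label in enumerate(["ARG", "VF", "NS"] * 5)]
splits = list(five_fold_splits(samples))
```

Each sample lands in the test part exactly once across the five splits, so the five test scores can be averaged as described for S42.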

S4, obtaining a new training data set by using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and obtaining the performance evaluation indices of the classification model.

S4 specifically comprises the following steps:

To obtain excellent prediction performance for virulence factors and antibiotic resistance genes, the strengths of classical machine learning and deep learning are integrated in a stacking algorithm.

S41 specifically includes the following steps:

S413, training each base-level classification model using 5-fold cross-validation to counter overfitting in the final prediction. In a specific embodiment, the pseudocode of the stacking algorithm of the invention is shown in Table 1 below.

Table 1: Pseudocode of the stacking algorithm
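The pseudocode table itself is not present in this text. As a hedged sketch of the general stacking idea described in S41 (not a reconstruction of Table 1), the following pure-Python example collects out-of-fold class scores from several base learners and trains a meta-model on them. `CentroidClassifier` is a toy stand-in invented for this illustration; in the patent the base learners are the classifiers listed for S23 plus the deep network, and the meta-model is based on Extra Trees.

```python
import random


class CentroidClassifier:
    """Toy base learner (hypothetical, for illustration only)."""

    def __init__(self, power=1.0):
        self.power = power

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = {}
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids_[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            weights = []
            for c in self.classes_:
                d = sum((a - b) ** 2 for a, b in zip(x, self.centroids_[c])) ** 0.5
                weights.append(1.0 / (1e-9 + d) ** self.power)
            total = sum(weights)
            out.append([w / total for w in weights])
        return out


def stack_features(make_models, X, y, k=5, seed=0):
    """Out-of-fold class scores of every base model, concatenated per sample."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    meta = [[] for _ in X]
    for make in make_models:
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            model = make().fit([X[i] for i in train], [y[i] for i in train])
            scores = model.predict_proba([X[i] for i in held_out])
            for i, row in zip(held_out, scores):
                meta[i].extend(row)  # each sample is scored by a model that never saw it
    return meta


# Toy data: three well-separated clusters standing in for ARG, VF and NS samples.
X = ([[0.1 * i, 0.0] for i in range(5)]
     + [[5.0 + 0.1 * i, 5.0] for i in range(5)]
     + [[10.0 + 0.1 * i, 0.0] for i in range(5)])
y = ["ARG"] * 5 + ["VF"] * 5 + ["NS"] * 5

base_factories = [lambda: CentroidClassifier(1.0), lambda: CentroidClassifier(2.0)]
meta_X = stack_features(base_factories, X, y)
meta_model = CentroidClassifier().fit(meta_X, y)  # stand-in for the meta-model
```

Using out-of-fold scores as meta-features means the meta-model is trained on predictions for samples the base learners did not see, which is the overfitting safeguard that S413 refers to.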

Example Two

To better illustrate the effectiveness of the prediction method of the invention, a rigorous procedure was followed and a cross-validation step was used so that the method could be evaluated without bias. Table 2 lists the results of this example for the hybrid prediction of virulence factors and antibiotic resistance genes under five-fold cross-validation:

Table 2: Results of the invention for the simultaneous prediction of virulence factors and antibiotic resistance genes under five-fold cross-validation

In Table 2: Precision: precision rate; Recall: recall rate; F1-score: F1 score; VFs: virulence factors; ARGs: antibiotic resistance genes; NSs: negative sample genes; Micro-average: micro average

As can be seen from Table 2, this embodiment obtains high evaluation scores across the repeated cross-validation experiments. The results indicate that the invention not only predicts virulence factors and antibiotic resistance genes simultaneously but also performs well in terms of precision and recall.

Example Three

To test the ability of the invention to predict unknown virulence factors (VFs), antibiotic resistance genes (ARGs) and negative sample genes (NSs), an independent data set comprising 209 ARGs, 209 VFs and 209 NSs was constructed. These unknown genes are completely independent of the genes in the training data set: all identical or duplicated sequences were removed by setting the CD-HIT identity threshold to 100%. Furthermore, the currently available VRprofile model (the most recent computational model) was introduced as a comparison method, and the traditional "best hit" approach, run with the DIAMOND sequence alignment tool under three different parameter settings, was used as a baseline. Table 3 lists the results of the invention (HyperVR), the VRprofile model and the DIAMOND baseline under the three parameter settings for the simultaneous prediction of unknown virulence factors and antibiotic resistance genes:

Table 3: Results of the embodiment of the invention (HyperVR), the VRprofile model and the baseline comparison methods for the simultaneous prediction of unknown virulence factors and antibiotic resistance genes

In Table 3: Precision: precision rate; Recall: recall rate; F1-score: F1 score; VFs: virulence factors; ARGs: antibiotic resistance genes; NSs: negative sample genes; Micro-average: micro average

As can be seen from Table 3, compared with the VRprofile model and with the DIAMOND-based baseline under three different parameter settings, the embodiment of the invention (HyperVR) obtains the highest evaluation scores in all experiments and performs better in terms of precision and recall than the other comparison methods.

FIG. 2 shows, as a histogram, the results of the simultaneous prediction of unknown virulence factors and antibiotic resistance genes by the embodiment of the invention (HyperVR), the VRprofile model (the most recent computational model) and the baseline comparison methods (three different parameter settings). In FIG. 2, bar a represents the F1 score, bar b the recall and bar c the precision; Diamond-81%, Diamond-64% and Diamond-21% denote the baselines using the DIAMOND sequence alignment tool under the three parameter settings. The height of each bar indicates the prediction performance of the method. As the comparison in FIG. 2 shows, the embodiment of the invention (HyperVR) has higher prediction performance than the most recent computational model (VRprofile) and the baseline comparison methods, and its overall performance is superior to that of the other models.

Finally, it should be noted that while the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that the above embodiments are only for illustrating the present invention and are not to be construed as limiting the present invention, and various equivalent changes and substitutions may be made therein without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments within the spirit and scope of the present invention be covered by the appended claims.

Claims

1. A hybrid prediction method of virulence factor and antibiotic resistance genes is characterized by comprising the following steps:

2. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein S1 comprises the following steps:

S12, obtaining known virulence factor sequence data from the VFDB, PATRIC, Victors and UniProt databases;

S13, obtaining negative sample gene sequence data from the UniProt database.

3. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein S2 comprises the following steps:

4. The method of claim 3, wherein calculating the similarity features based on alignment scores in S21 comprises the following steps:

the training data set has been de-duplicated against the reference set using the CD-HIT program, and the alignment scores are normalized to the [0,1] interval;

5. The method of claim 3, wherein the features based on gene evolution information in S21 consist of three specific features based on the position-specific scoring matrix (PSSM), including the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;

wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;

wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k represents the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;

the AADP-PSSM is converted into a fixed-length feature vector of 20+400=420 dimensions by combining the two components.

6. The method according to claim 3, wherein the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;

wherein the amino acid composition feature represents the frequency of the 20 natural amino acids in the protein sequence, and is calculated as follows:

wherein N_ab denotes the number of occurrences of the given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.

7. The method of claim 6, wherein the dipeptide deviation from expected mean (DDE) feature is a combination of three features: the theoretical mean TM, the dipeptide composition DPC and the theoretical variance TV;

the formula for calculating the TM feature is as follows:

wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N, equal to 61, is the total number of possible codons excluding the three stop codons.

The formula for calculating the TV feature is as follows:

wherein TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide.

The formula for calculating the DDE feature is as follows:

8. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 3, wherein the classical machine learning classification models trained with the prior feature information in S23 comprise a random forest classification algorithm, an extremely randomized trees (Extra Trees) classification algorithm, an XGBoost classification algorithm, a gradient boosting classification algorithm and an AdaBoost classification algorithm.

9. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein S4 comprises the following steps:

S42, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), scoring the model on the test data set, repeating the experiment five times, and taking the average result of the five experiments as the performance evaluation index of the model.

10. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 9, wherein S41 comprises the following steps: