CN115171792A - Hybrid prediction method of virulence factor and antibiotic resistance gene

Publication number
CN115171792A
Authority
CN
China
Prior art keywords
pssm
gene
data set
antibiotic resistance
feature
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210781902.8A
Other languages
Chinese (zh)
Inventor
彭绍亮
姬博亚
皮文定
刘文娟
赵雄君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 - Supervised data analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G06N20/20 - Ensemble learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 - ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 - Sequence alignment; Homology search
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physiology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a hybrid prediction method for virulence factors and antibiotic resistance genes, belonging to the technical field of deep learning and bioinformatics. The method comprises the following steps: S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases; S2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture; S3, taking the three types of sequence data from S1 as samples and dividing them into a training data set and a test data set; S4, obtaining a new training data set from the outputs of multiple classification methods, constructing a classification model on the new training data set, and obtaining the performance evaluation indices of the classification model. The hybrid prediction method achieves good prediction performance and high prediction accuracy.

Description

Hybrid prediction method of virulence factor and antibiotic resistance gene
Technical Field
The invention relates to the technical field of deep learning and bioinformatics, in particular to a hybrid prediction method of virulence factors and antibiotic resistance genes.
Background
The microbiome is essential to the internal ecosystem of hosts such as humans, animals and plants, as well as to maintaining the external environment. In particular, pathogenic microorganisms carry virulence factors (VFs) and antibiotic resistance genes (ARGs) that cause disease and can even threaten the life of the host. Accurate and timely identification of VFs and ARGs can effectively guide medical treatment, reduce host morbidity and mortality, and reduce economic losses in animal husbandry, aquaculture and related sectors.
Furthermore, although their evolutionary pathways differ, VFs and ARGs share common features that pathogenic bacteria need in order to adapt to and survive in a competitive microbial environment. In particular, both VFs and ARGs are often transferred between bacteria by horizontal gene transfer (HGT), and both use similar systems (i.e., two-component systems, efflux pumps, cell wall alterations and porins) to activate or inhibit the expression of various genes. Pathogens use VFs to cause disease in their host, and they can colonize environments under selective antibiotic pressure by acquiring or expressing ARGs. Thus, to understand the causal relationship between microbiome composition, function and disease, VFs and ARGs must be determined together; predicting both at once also saves pathogen monitoring time, particularly for on-site detection of epidemic pathogens. However, conventional bioinformatics tools usually predict either ARGs or VFs independently, the tools are relatively outdated, and their precision and recall are relatively low. In addition, conventional prediction methods for VFs and ARGs suffer from a high false negative rate, high sensitivity to the cut-off threshold, and the ability to identify only conserved genes, so their prediction performance is relatively poor. A hybrid prediction method for virulence factors and antibiotic resistance genes is therefore needed.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a hybrid prediction method for virulence factors and antibiotic resistance genes. It addresses the technical problems of the prior art, namely outdated prediction tools, relatively low prediction precision and recall, and relatively poor prediction performance, by using a computational approach that combines machine learning with deep learning neural networks, which yields better prediction performance.
To achieve this purpose, the technical solution of the invention is as follows:
The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes, comprising the following steps:
S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases;
S2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture from these core gene features;
S3, taking the three types of sequence data from S1 as samples, randomly drawing samples to form a data set, and randomly dividing the data set into five parts, wherein in each of five rounds four parts serve as the training data set and the remaining part serves as the test data set;
S4, obtaining a new training data set from the outputs of multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and obtaining the performance evaluation indices of the classification model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S1 specifically comprises the following steps:
S11, obtaining known antibiotic resistance gene sequence data from the ARDB, CARD and UniProt databases;
S12, obtaining known virulence factor sequence data from the VFDB, PATRIC, Victors and UniProt databases;
S13, obtaining negative sample gene sequence data from the UniProt database.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S2 comprises the following specific steps:
S21, using the gene sequence information to calculate, respectively, similarity features based on alignment scores, simple features based on one-hot encoding of the gene sequence, features based on gene evolution information, and features based on gene sequence information;
S22, constructing a deep learning network architecture from the alignment-score similarity features and the one-hot features, and training a neural network classification model end to end;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the calculation of the similarity features based on alignment scores in S21 comprises the following specific steps:
the DIAMOND program is selected and, using sensitive parameter settings, the gene sequences in the training data set are aligned against the remaining 12724 known ARGs and 30945 known VFs, which serve as the comparison data set;
the training data set and the comparison data set are de-duplicated against each other using the CD-HIT program, and the alignment scores are normalized to the interval [0,1];
the bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed-length feature vector of 12724 + 30945 = 43669 dimensions.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene evolution information in S21 consist of three specific features derived from the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;
wherein the PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging all rows of the original PSSM profile that correspond to each naturally occurring amino acid type, as defined below:
$$R_i = \frac{1}{L} \sum_{k=1}^{L} r_k \, \delta(p_k, a_i), \quad i = 1, 2, \ldots, 20$$
$$\delta(p_k, a_i) = \begin{cases} 1, & p_k = a_i \\ 0, & \text{otherwise} \end{cases}$$
wherein R_i denotes the i-th row of the PSSM-composition feature matrix, r_k denotes the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, a_i denotes the i-th of the 20 standard amino acids, and L denotes the length of the protein sequence;
the RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea derives from the residue probing method, in which each amino acid corresponding to a particular column of the PSSM is treated as a probe. The original PSSM is converted into a 400-dimensional feature vector using the definitions below:
Figure BDA0003723441950000033
Figure BDA0003723441950000034
wherein M_i denotes the i-th row of the RPM-PSSM feature matrix, m_k denotes the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i denotes the i-th of the 20 standard amino acids;
the AADP-PSSM feature extends the traditional amino acid composition (AAC) and dipeptide composition (DPC) concepts to the PSSM. AAC-PSSM is converted into a fixed-length 20-dimensional feature vector by averaging the columns of the original PSSM profile, and is defined as follows:
$$x_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}, \quad j = 1, 2, \ldots, 20$$
wherein x_j denotes the j-th element of the AAC-PSSM feature vector and represents the average proportion of amino acid mutations during evolution, p_{i,j} denotes the element in row i and column j of the original PSSM, and L denotes the length of the protein sequence;
DPC-PSSM is converted into a fixed-length 400-dimensional feature vector, which avoids the loss of information caused by the unknown residue X in protein sequences, and is defined as follows:
$$y_{i,j} = \frac{1}{L-1} \sum_{k=1}^{L-1} p_{k,i} \, p_{k+1,j}, \quad 1 \le i, j \le 20$$
wherein y_{i,j} denotes the DPC-PSSM element for the amino acid pair (i, j);
AADP-PSSM is obtained by combining the two components into a fixed-length feature vector of 20 + 400 = 420 dimensions.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;
the amino acid composition feature represents the frequency of each of the 20 natural amino acids in the protein sequence, calculated as follows:
$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\}$$
wherein N(a) denotes the number of occurrences of amino acid a, N denotes the sequence length of the protein or peptide, and f(a) denotes the resulting 20-dimensional feature vector;
the dipeptide composition feature represents the frequency of each dipeptide in a protein or polypeptide sequence, calculated as follows:
$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\}$$
wherein N_{ab} denotes the number of occurrences of the dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the dipeptide deviation from expected mean (DDE) feature is a combination of three features: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV);
the TM feature is calculated as follows:
$$TM(a, b) = \frac{C_a}{C_N} \times \frac{C_b}{C_N}$$
wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
$$TV(a, b) = \frac{TM(a, b) \, (1 - TM(a, b))}{N - 1}$$
wherein TM denotes the TM feature, TV denotes the TV feature, and N denotes the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
$$DDE(a, b) = \frac{DPC(a, b) - TM(a, b)}{\sqrt{TV(a, b)}}$$
wherein DPC denotes the DPC feature, that is, D(a, b) above, TM denotes the TM feature, and TV denotes the TV feature.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the classical machine learning classification models trained on the prior feature information in S23 include a random forest classification algorithm, an extremely randomized trees (Extra Trees) classification algorithm, an XGBoost classification algorithm, a gradient boosting classification algorithm and an AdaBoost classification algorithm.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S4 comprises the following steps:
S41, performing a stacking algorithm over multiple classification methods, and taking the prediction scores of the different classification methods on the training data as a new training data set;
S42, constructing a classification model on the new training data set based on extremely randomized trees, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation index of the model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta-model;
S412, training the base-level classification models on the whole training data set, and training the meta-model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation.
With this technical solution, the invention has the following advantages:
1. The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes that makes full use of multiple key core gene features and combines the strengths of classical ensemble learning and deep learning to predict potential virulence factors and antibiotic resistance genes simultaneously and efficiently; the method is scientifically sound and its predictions are highly accurate.
2. The invention can accurately predict virulence factors, antibiotic resistance genes and negative sample genes (which are neither virulence factors nor antibiotic resistance genes) at the same time, and can also predict each class independently and flexibly. It overcomes the drawbacks of the traditional best-hit method, namely a high false negative rate, high sensitivity to the cut-off threshold and the ability to identify only conserved genes, and achieves better prediction performance.
3. The invention achieves higher precision and recall than existing traditional prediction tools on novel virulence factors and antibiotic resistance genes, on virulence factors and antibiotic resistance genes in real metagenomic data, and on pseudo virulence factors and antibiotic resistance genes (gene fragments). The invention uses a computational approach that combines machine learning with deep learning neural networks, and its results are competitive with the most advanced prediction tools.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of the hybrid prediction method for virulence factors and antibiotic resistance genes according to the invention;
FIG. 2 is a histogram comparing the results of the hybrid prediction method of the invention with those of other computational methods in simultaneously predicting virulence factors and antibiotic resistance genes.
Detailed Description
The technical solutions of the invention are described in detail below with reference to embodiments. The embodiments set out the features and advantages of the invention in sufficient detail for a person skilled in the art to understand and implement it, and the related objects and advantages can be readily understood from the description, the claims and the accompanying drawings.
Referring to FIG. 1, a hybrid prediction method for virulence factors and antibiotic resistance genes in microbial data comprises the following steps:
S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data (genes that are neither antibiotic resistance genes nor virulence factors);
S1 comprises the following specific steps:
S11, obtaining known antibiotic resistance gene sequence data from the ARDB, CARD and UniProt databases;
S12, obtaining known virulence factor sequence data from the VFDB, PATRIC, Victors and UniProt databases;
S13, obtaining negative sample gene sequence data from the UniProt database.
S2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture from these core gene features;
S2 comprises the following specific steps:
S21, the multiple core gene features comprise similarity features based on alignment scores, features based on gene evolution information, features based on gene sequence information, and simple features based on one-hot encoding of the gene sequence; the gene sequence information is therefore used to calculate each of these four feature types in turn.
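As an illustration of the simplest of these feature families, the following minimal sketch one-hot encodes a sequence over the 20 standard amino acids. It assumes, as for the other features, that the input is a protein sequence. The fixed length `max_len`, the zero padding, the all-zero row for non-standard residues and the alphabetical residue order are assumptions made for the example; the source does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue order


def one_hot_encode(sequence, max_len=8):
    """Return a max_len x 20 matrix; row k marks the residue at position k."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for residue in sequence[:max_len]:       # truncate long sequences
        row = [0] * len(AMINO_ACIDS)
        if residue in index:                 # non-standard residues stay all-zero
            row[index[residue]] = 1
        matrix.append(row)
    while len(matrix) < max_len:             # zero-pad short sequences
        matrix.append([0] * len(AMINO_ACIDS))
    return matrix


encoded = one_hot_encode("ACXY", max_len=6)
```

A fixed-size matrix of this kind can be fed directly to a neural network, which is presumably why this feature is paired with the deep learning branch in S22.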
The similarity feature based on alignment scores consists of the alignment scores of a candidate gene against known virulence factors and antibiotic resistance genes. It takes into account the distribution of similarity across the sequences in the ARG and VF databases rather than only the best hit. The alignment score (bit score) is used as the similarity index because, unlike the e-value, it reflects the degree of identity between sequences and is independent of the size of the database.
The calculation of the similarity features based on alignment scores in S21 comprises the following specific steps:
the DIAMOND program, which is faster than BLAST, is selected and, using sensitive parameter settings, the gene sequences in the training data set are aligned against the remaining 12724 known ARGs and 30945 known VFs, which serve as the comparison data set;
the training data set and the comparison data set are de-duplicated against each other using the CD-HIT program to avoid possible label leakage, and the alignment scores are normalized to the interval [0,1] to represent the similarity between sequences as a distance;
the bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed-length feature vector of 12724 + 30945 = 43669 dimensions, where each dimension is the alignment score output by DIAMOND between the full-length gene sequence and one of the ARGs or VFs in the comparison data set.
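The construction of this fixed-length vector can be sketched as follows. The example assumes the DIAMOND hits for one query have already been parsed into `(reference_id, bit_score)` pairs, and it normalizes by the largest bit score of the query; the source states only that scores are mapped to [0,1], so this scaling choice and the function name are assumptions.

```python
def bitscore_feature_vector(hits, reference_ids):
    """Map the alignment hits of one query onto a fixed-length vector.

    hits:          iterable of (reference_id, bit_score) pairs for the query
    reference_ids: ordered IDs of all comparison sequences (ARGs, then VFs)
    """
    index = {ref: i for i, ref in enumerate(reference_ids)}
    vector = [0.0] * len(reference_ids)      # references without a hit stay 0
    for ref, score in hits:
        i = index.get(ref)
        if i is not None and score > vector[i]:
            vector[i] = float(score)         # keep the best score per reference
    top = max(vector, default=0.0)
    if top > 0:
        vector = [v / top for v in vector]   # assumed scaling to [0, 1]
    return vector


references = ["ARG_1", "ARG_2", "VF_1"]      # 43669 entries in the source
hits = [("ARG_2", 50.0), ("VF_1", 200.0), ("ARG_2", 80.0), ("UNKNOWN", 999.0)]
vector = bitscore_feature_vector(hits, references)
```

With the comparison set described above, `reference_ids` would hold 12724 + 30945 = 43669 entries, so every query yields a vector of that length regardless of how many hits it has.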
The features based on gene evolution information consist of three specific features derived from the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature. The PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging all rows of the original PSSM profile that correspond to each naturally occurring amino acid type, and is defined as follows:
$$R_i = \frac{1}{L} \sum_{k=1}^{L} r_k \, \delta(p_k, a_i), \quad i = 1, 2, \ldots, 20$$
$$\delta(p_k, a_i) = \begin{cases} 1, & p_k = a_i \\ 0, & \text{otherwise} \end{cases}$$
wherein R_i denotes the i-th row of the PSSM-composition feature matrix, r_k denotes the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, a_i denotes the i-th of the 20 standard amino acids, and L denotes the length of the protein sequence.
The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea derives from the residue probing method, in which each amino acid corresponding to a particular column of the PSSM is treated as a probe. The original PSSM is finally converted into a 400-dimensional feature vector using the definitions given below:
Figure BDA0003723441950000073
Figure BDA0003723441950000074
wherein M_i denotes the i-th row of the RPM-PSSM feature matrix, m_k denotes the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i denotes the i-th of the 20 standard amino acids.
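The two row-grouping features, PSSM-composition and RPM-PSSM, can be sketched with one helper. The sketch assumes the PSSM is given as one 20-column row per residue, that it has already been normalized where the PSSM-composition feature requires it, and that the RPM-PSSM rows are averaged over the sequence length in the same way as PSSM-composition; the last point is an assumption, since the defining equations are not reproduced in the text. The residue order is also an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order of the 20 x 20 output


def residue_grouped_pssm(sequence, pssm, clip_negative=False):
    """Average the PSSM rows belonging to each residue type (400 values).

    clip_negative=False: PSSM-composition style (pssm assumed normalized).
    clip_negative=True:  RPM-PSSM style (negative scores are set to 0 first).
    """
    length = len(sequence)
    if length == 0 or len(pssm) != length:
        raise ValueError("pssm needs one 20-column row per residue")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    grouped = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:                 # skip non-standard residues such as X
            continue
        for j, score in enumerate(row):
            if clip_negative and score < 0:
                score = 0.0
            grouped[i][j] += score
    # Flatten row by row and divide by the sequence length.
    return [value / length for row in grouped for value in row]


toy_pssm = [[1.0] * 20, [-2.0] * 20, [4.0] * 20]   # toy scores, not real PSSM output
composition = residue_grouped_pssm("AAC", toy_pssm)
rpm = residue_grouped_pssm("AAC", toy_pssm, clip_negative=True)
```

Both calls return 400 values, so proteins of any length map to vectors of the same size.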
The AADP-PSSM feature extends the traditional amino acid composition (AAC) and dipeptide composition (DPC) concepts to the PSSM. First, AAC-PSSM is converted into a fixed-length 20-dimensional feature vector by averaging the columns of the original PSSM profile, and is defined as follows:
$$x_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}, \quad j = 1, 2, \ldots, 20$$
wherein x_j denotes the j-th element of the AAC-PSSM feature vector and represents the average proportion of amino acid mutations during evolution, p_{i,j} denotes the element in row i and column j of the original PSSM, and L denotes the length of the protein sequence. Second, DPC-PSSM is converted into a fixed-length 400-dimensional feature vector, which avoids the loss of information caused by the unknown residue X in protein sequences, and is defined as follows:
$$y_{i,j} = \frac{1}{L-1} \sum_{k=1}^{L-1} p_{k,i} \, p_{k+1,j}, \quad 1 \le i, j \le 20$$
wherein y_{i,j} denotes the DPC-PSSM element for the amino acid pair (i, j). AADP-PSSM is obtained by combining the two components into a fixed-length feature vector of 20 + 400 = 420 dimensions.
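A minimal sketch of the AADP-PSSM construction follows. It uses the commonly cited definitions of AAC-PSSM (column means) and DPC-PSSM (mean product of scores in consecutive rows); treating these as the definitions intended by the source is an assumption, and the function name is illustrative.

```python
def aadp_pssm(pssm):
    """Concatenate AAC-PSSM (20 values) and DPC-PSSM (400 values).

    pssm: list of L rows with 20 scores each, one row per residue.
    """
    length = len(pssm)
    if length < 2:
        raise ValueError("DPC-PSSM needs at least two residues")
    # AAC-PSSM: mean of every column.
    aac = [sum(row[j] for row in pssm) / length for j in range(20)]
    # DPC-PSSM: for each column pair (i, j), mean of p[k][i] * p[k+1][j].
    dpc = [
        sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1)) / (length - 1)
        for i in range(20)
        for j in range(20)
    ]
    return aac + dpc


toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]   # toy scores, not real PSSM output
features = aadp_pssm(toy_pssm)
```

Because both parts are computed from PSSM columns rather than from residue identities, a position holding the unknown residue X still contributes its scores.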
The features based on gene sequence information include the amino acid composition (AAC) feature, the dipeptide composition (DPC) feature, the dipeptide deviation from expected mean (DDE) feature, the pseudo-amino acid composition (PAAC) feature, and the quasi-sequence-order (QSO) feature.
The amino acid composition (AAC) feature represents the frequency of each of the 20 natural amino acids (i.e., ACDEFGHIKLMNPQRSTVWY) in a protein sequence, and is calculated as:
$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\}$$
wherein N(a) denotes the number of occurrences of amino acid a, N denotes the sequence length of the protein or peptide, and f(a) denotes the resulting 20-dimensional feature vector.
The DPC feature represents the frequency of each dipeptide in a protein or polypeptide sequence, and is calculated as:
$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\}$$
wherein N_{ab} denotes the number of occurrences of the dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
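The AAC and DPC features can be computed directly from the sequence string. The sketch below is a minimal illustration; the function names and the output ordering (alphabetical by one-letter code) are choices made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: f(a) = N(a) / N for the 20 standard residues."""
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: D(a, b) = N_ab / (N - 1) for all 400 pairs."""
    n = len(sequence)
    if n < 2:
        raise ValueError("need at least two residues")
    counts = {}
    for k in range(n - 1):                    # overlapping dipeptides
        pair = sequence[k:k + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / (n - 1) for a in AMINO_ACIDS for b in AMINO_ACIDS]


aac_vector = aac("AAC")
dpc_vector = dpc("AAC")
```

For the toy sequence `AAC`, the two dipeptides `AA` and `AC` each account for half of the DPC mass, and all other entries are zero.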
The DDE feature is a combination of three features: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV). Specifically, the TM feature is calculated as follows:
$$TM(a, b) = \frac{C_a}{C_N} \times \frac{C_b}{C_N}$$
wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
$$TV(a, b) = \frac{TM(a, b) \, (1 - TM(a, b))}{N - 1}$$
wherein TM denotes the TM feature calculated as described above, and N denotes the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
$$DDE(a, b) = \frac{DPC(a, b) - TM(a, b)}{\sqrt{TV(a, b)}}$$
wherein DPC denotes the DPC feature, that is, D(a, b) above, TM denotes the TM feature and TV denotes the TV feature, each calculated as described above.
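The DDE computation can be sketched as below. The codon counts are those of the standard genetic code (they sum to 61); the formulas follow the commonly used definition of DDE, which is assumed here to match the source's equations, and the function name is illustrative.

```python
import math

# Number of codons per amino acid in the standard genetic code (sum = 61).
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
    "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # sense codons, i.e. excluding the three stop codons


def dde(sequence):
    """Dipeptide deviation from expected mean for all 400 dipeptides."""
    n = len(sequence)
    if n < 2:
        raise ValueError("need at least two residues")
    counts = {}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        counts[pair] = counts.get(pair, 0) + 1
    features = []
    for a in CODONS:
        for b in CODONS:
            dpc = counts.get(a + b, 0) / (n - 1)           # observed frequency
            tm = (CODONS[a] / C_N) * (CODONS[b] / C_N)     # theoretical mean
            tv = tm * (1 - tm) / (n - 1)                   # theoretical variance
            features.append((dpc - tm) / math.sqrt(tv))
    return features


dde_vector = dde("AAC")
```

Dipeptides that occur more often than codon usage alone would predict receive positive values, and absent dipeptides receive negative values.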
S22, constructing a deep learning network architecture from the alignment-score similarity features and the one-hot features, and training a neural network classification model end to end;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
The classical machine learning classification models trained on the prior feature information in S23 include a random forest (Random Forest) classification algorithm, an extremely randomized trees (Extra Trees) classification algorithm, an XGBoost classification algorithm, a gradient boosting (GradientBoosting) classification algorithm and an AdaBoost classification algorithm.
S3, taking the known antibiotic resistance gene data as the first class of samples, the known virulence factor sequence data as the second class of samples, and the known negative sample gene sequence data as the third class of samples, randomly drawing samples from the three classes, and randomly dividing the whole data set into five parts, wherein in each of five rounds four parts serve as the training data set and the remaining part serves as the test data set;
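The five-way division described in S3 can be sketched as a plain random five-fold split. The seed and the function name are illustrative, and the split is not stratified by class because the source does not say whether class proportions are preserved.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                  # random division
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # The other four parts together form the training set.
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


splits = list(five_fold_splits(10))
```

Each sample appears in exactly one test part, so averaging the five test scores uses every sample once for evaluation.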
S4, obtaining a new training data set from the outputs of multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees, and obtaining the performance evaluation indices of the classification model.
S4 specifically comprises the following steps:
S41, performing a stacking algorithm over multiple classification methods, and taking the prediction scores of the different classification methods on the training data as a new training data set;
in order to obtain excellent prediction performance for virulence factors and antibiotic resistance genes, the strengths of classical machine learning methods and deep learning are integrated in a stacking algorithm.
S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta-model;
S412, training the base-level classification models on the whole training data set, and training the meta-model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation to counter overfitting in the final prediction. In a specific embodiment, the pseudo code of the stacking algorithm of the invention is shown in Table 1 below.
Table 1: Pseudo code of the stacking algorithm
Figure BDA0003723441950000101
S42, constructing a classification model on the new training data set based on extremely randomized trees, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation index of the model.
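The core of steps S411 to S413, producing out-of-fold prediction scores that become the meta-model's training features, can be sketched as follows. This is not the pseudo code of Table 1, which is available only as an image. The base learner here is a deliberately simple, hypothetical nearest-class-mean scorer standing in for the real base models, and the sketch assumes every class occurs in each training split.

```python
import random


def centroid_learner(feature_index):
    """Toy base learner (hypothetical): scores classes by distance to class means."""
    def fit(X, y, classes):
        means = {}
        for c in classes:
            values = [x[feature_index] for x, label in zip(X, y) if label == c]
            means[c] = sum(values) / len(values)   # assumes every class occurs in X
        return lambda x: [-abs(x[feature_index] - means[c]) for c in classes]
    return fit


def out_of_fold_meta_features(X, y, learners, n_folds=5, seed=0):
    """Return one row of base-model scores per sample, predicted while held out."""
    classes = sorted(set(y))
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    meta = [[] for _ in X]
    for fit in learners:
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], classes)
            for i in held_out:
                meta[i].extend(model(X[i]))        # scores from a model that never saw i
    return meta


# Toy data: three well-separated classes standing in for ARG / VF / NS.
X = [[10.0 * c + 0.1 * k, -10.0 * c - 0.1 * k] for c in range(3) for k in range(5)]
y = [c for c in range(3) for _ in range(5)]
meta_X = out_of_fold_meta_features(X, y, [centroid_learner(0), centroid_learner(1)])
```

In the method described above, rows such as `meta_X` would be the new training data set on which the Extra Trees meta-model of S42 is fitted; that final step is omitted here to keep the sketch dependency-free.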
Example two
To better illustrate the effectiveness of the prediction method of the present invention, we have implemented a rigorous procedure, and cross-validation steps were used to make the validity of the present invention be unbiased evaluated, and table 2 lists the results of the present example on the mixed prediction of virulence factors and drug resistance genes under quintuplet cross-validation method:
TABLE 2 results of the present invention on simultaneous prediction of virulence factors and drug resistance genes under quintupling cross validation
Figure BDA0003723441950000102
In Table 2: Precision: precision; Recall: recall; F1-score: F1 score; VFs: virulence factors; ARGs: drug resistance genes; NSs: negative sample genes; Micro-average: micro average
As can be seen from Table 2, this embodiment obtains high evaluation scores across the repeated cross-validation experiments. The results indicate that the invention can predict virulence factors and antibiotic resistance genes simultaneously and performs well in both precision and recall.
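The metrics reported in Table 2 can be computed as in the following sketch. The function name `classification_report` and the toy labels are illustrative assumptions, and the values below are not results of the invention. For single-label three-class prediction, the micro-averaged precision, recall and F1 score all coincide with the overall accuracy.

```python
def classification_report(y_true, y_pred, labels):
    """Per-class precision, recall and F1, plus their micro-average."""

    def prf(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    report = {}
    total_tp = total_fp = total_fn = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        report[label] = prf(tp, fp, fn)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    # Micro-average: pool the counts over all classes before dividing.
    report["micro-average"] = prf(total_tp, total_fp, total_fn)
    return report


# Toy labels only, for illustration.
y_true = ["VF", "VF", "ARG", "ARG", "NS", "NS"]
y_pred = ["VF", "ARG", "ARG", "ARG", "NS", "VF"]
report = classification_report(y_true, y_pred, ["VF", "ARG", "NS"])
```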
Example three
To test the ability of the invention to predict unknown virulence factors (VFs), drug resistance genes (ARGs) and negative sample genes (NSs), an independent data set comprising 209 ARGs, 209 VFs and 209 NSs was constructed. These unknown genes are completely independent of the genes in the training data set: with the identity threshold of CD-HIT set to 100%, all identical or duplicated sequences were removed. In addition, the currently available VRprofile model (the most recent computational model) was introduced as a comparison method, and the traditional "best hit" method (using the DIAMOND sequence alignment tool under three different parameter settings) was used as a baseline. Table 3 lists the results of the invention (HyperVR), the VRprofile model and the DIAMOND-based baseline under the three parameter settings for the simultaneous prediction of unknown virulence factors and drug resistance genes:
Table 3: Results of the embodiment of the invention, the VRprofile model and the baseline comparison methods for the simultaneous prediction of unknown virulence factors and drug resistance genes
Figure BDA0003723441950000111
In Table 3: Precision: precision; Recall: recall; F1-score: F1 score; VFs: virulence factors; ARGs: drug resistance genes; NSs: negative sample genes; Micro-average: micro average
As can be seen from Table 3, compared with the VRprofile model and the DIAMOND-based baseline under three different parameter settings, the embodiment of the invention (HyperVR) obtains the highest evaluation scores and performs better in both precision and recall than the other comparison methods.
FIG. 2 shows, as a histogram, the results of the simultaneous prediction of unknown virulence factors and drug resistance genes by the embodiment of the invention (HyperVR), the VRprofile model (the most recent computational model) and the baseline comparison method (under three different parameter settings). In FIG. 2, bar a represents the F1 score, bar b the recall and bar c the precision; Diamond-81%, Diamond-64% and Diamond-21% denote the baseline using the DIAMOND sequence alignment tool under the three parameter settings. The height of each bar represents the prediction performance of the method. Comparison of the bars in FIG. 2 shows that the embodiment of the invention (HyperVR) has higher prediction performance than the latest computational model (VRprofile) and the baseline comparison methods, and that its overall performance is superior to that of the other models.
Finally, it should be noted that while the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that the above embodiments are only for illustrating the present invention and are not to be construed as limiting the present invention, and various equivalent changes and substitutions may be made therein without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments within the spirit and scope of the present invention be covered by the appended claims.

Claims (10)

1. A hybrid prediction method of virulence factor and antibiotic resistance genes is characterized by comprising the following steps:
s1, respectively obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from a database;
s2, calculating multiple core gene features from the gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture from the core gene features;
s3, taking the three types of sequence data in S1 as samples, randomly sampling them to form a data set, and performing five random divisions of the data set, wherein in each division four parts form the training data set and the remaining part forms the test data set;
s4, obtaining a new training data set using a plurality of classification methods; and constructing a classification model on the new training data set based on extremely randomized trees, and obtaining the performance evaluation index of the classification model.
2. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein step S1 comprises the following steps:
s11, acquiring known antibiotic resistance gene sequence data from the ARDB, CARD and UniProt databases;
s12, acquiring known virulence factor sequence data from the VFDB, PATRIC, Victors and UniProt databases;
s13, acquiring negative sample gene sequence data from the UniProt database.
3. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein step S2 comprises the following steps:
s21, using the gene sequence information to calculate, respectively, similarity features based on alignment scores, simple features based on one-hot encoding of the gene sequence, features based on gene evolution information, and features based on gene sequence information;
s22, constructing a deep learning network architecture from the alignment-score-based similarity features and the one-hot-encoded simple features, and training a neural network classification model in an end-to-end manner;
s23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models using prior feature information.
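The one-hot-encoded simple feature of S21 can be illustrated with the sketch below. The alphabet order, the `max_length` parameter and the use of all-zero rows for padding and for non-standard residues are assumptions made for illustration; the source does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot_encode(sequence, max_length):
    """Encode a protein sequence as a max_length x 20 binary matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for position in range(max_length):
        row = [0] * len(AMINO_ACIDS)
        if position < len(sequence):
            i = index.get(sequence[position].upper())
            if i is not None:  # non-standard residues stay all-zero
                row[i] = 1
        encoded.append(row)  # positions past the sequence end are zero padding
    return encoded


matrix = one_hot_encode("ACX", max_length=4)
```

Truncating or padding to a fixed length gives every sequence the same input shape, which a neural network trained end to end typically requires.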
4. The method of claim 3, wherein calculating the alignment-score-based similarity features in S21 comprises the following steps:
selecting the DIAMOND program and aligning the gene sequences in the training data set, under sensitive parameters, against the remaining 12724 known ARGs and 30945 known VFs, which serve as the comparison data set;
de-duplicating the training data set against the comparison data set using the CD-HIT program, and normalizing the alignment scores to the interval [0, 1];
converting the bit-score-based similarity feature of each gene sequence in the training data set into a fixed-length feature vector of 12724 + 30945 = 43669 dimensions.
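A sketch of how alignment hits might be turned into such a fixed-length vector follows. It assumes the 12-column BLAST-style tabular output that DIAMOND writes by default, with the bit score in the last column. The function name `bitscore_features` and the toy identifiers are hypothetical, and dividing by the largest bit score of each query is only one possible way to reach the [0, 1] interval, since the source does not state the normalization used.

```python
def bitscore_features(hit_lines, reference_ids):
    """Map each query to a vector with one normalized bit score per reference."""
    column = {ref: j for j, ref in enumerate(reference_ids)}
    features = {}
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if subject not in column:
            continue
        row = features.setdefault(query, [0.0] * len(reference_ids))
        j = column[subject]
        row[j] = max(row[j], bitscore)  # keep the best hit per reference
    for row in features.values():
        top = max(row)
        if top > 0:
            row[:] = [value / top for value in row]  # assumed normalization
    return features


# Toy hits: query, subject, identity, length, mismatches, gap opens,
# query start/end, subject start/end, e-value, bit score.
hits = [
    "q1\tARG1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0",
    "q1\tVF1\t40.0\t80\t48\t0\t1\t80\t1\t80\t1e-5\t50.0",
]
vectors = bitscore_features(hits, ["ARG1", "ARG2", "VF1"])
```

References without a hit keep a score of 0, so the vector length always equals the number of reference sequences (43669 in the claim).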
5. The method of claim 3, wherein the features based on gene evolution information in S21 consist of three features derived from the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;
wherein the PSSM-composition feature eliminates the variation in protein sequence length by summing and averaging the rows of the original PSSM profile for each naturally occurring amino acid type, defined as follows:
Figure FDA0003723441940000021
Figure FDA0003723441940000022
wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, and α_i represents the i-th of the 20 standard amino acids;
the RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea behind the RPM-PSSM feature comes from the residue probing method, in which each amino acid corresponding to a particular column of the PSSM is regarded as a probe. The original PSSM is transformed into a 400-dimensional feature vector using the definitions below:
Figure FDA0003723441940000023
Figure FDA0003723441940000024
wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k represents the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and α_i represents the i-th of the 20 standard amino acids;
the AADP-PSSM feature extends the traditional amino acid composition (AAC) and dipeptide composition (DPC) concepts to the PSSM. AAC-PSSM is converted into a fixed-length 20-dimensional feature vector by averaging the columns of the original PSSM profile, defined as follows:
Figure FDA0003723441940000025
wherein x_j represents the j-th component of the AAC-PSSM feature, i.e. the average proportion of amino acid mutations during evolution, and p_{i,j} represents the entry in row i and column j of the original PSSM;
DPC-PSSM is converted into a fixed-length 400-dimensional feature vector so as to avoid the information loss caused by X residues in the protein sequence, defined as follows:
Figure FDA0003723441940000031
AADP-PSSM is converted into a fixed-length feature vector of 20+400=420 dimensions by combining the two components.
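Because the formula images are not reproduced in this text, the following sketch reconstructs the three PSSM features from the verbal definitions only. It should be read as an assumption-laden illustration: the column order `ARNDCQEGHILKMFPSTWYV`, division by the sequence length L in the row-grouped features, and the DPC-PSSM form (products of consecutive rows averaged over L - 1 positions) are assumptions, as are all function names.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed PSSM column order


def residue_grouped_average(sequence, pssm, clip_negative=False):
    """PSSM-composition (or RPM-PSSM when clip_negative=True), 400 values.

    Rows of the L x 20 PSSM are summed per residue type and divided by L.
    """
    length = len(sequence)
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:  # skip non-standard residues such as X
            continue
        for j, value in enumerate(row):
            if clip_negative and value < 0:
                value = 0.0
            matrix[i][j] += value / length
    return [value for row in matrix for value in row]


def aac_pssm(pssm):
    """Column averages of the PSSM, 20 values."""
    return [sum(row[j] for row in pssm) / len(pssm) for j in range(20)]


def dpc_pssm(pssm):
    """Averaged products of consecutive rows, 400 values (assumed form)."""
    n = len(pssm)
    return [
        sum(pssm[k][i] * pssm[k + 1][j] for k in range(n - 1)) / (n - 1)
        for i in range(20)
        for j in range(20)
    ]


def aadp_pssm(pssm):
    """Concatenate AAC-PSSM and DPC-PSSM into 20 + 400 = 420 values."""
    return aac_pssm(pssm) + dpc_pssm(pssm)


# Toy 3-residue example with constant rows, for illustration only.
toy_sequence = "AAR"
toy_pssm = [[2.0] * 20, [-1.0] * 20, [3.0] * 20]
composition = residue_grouped_average(toy_sequence, toy_pssm)
rpm = residue_grouped_average(toy_sequence, toy_pssm, clip_negative=True)
aadp = aadp_pssm(toy_pssm)
```

Because DPC-PSSM works on PSSM columns rather than on residue identities, positions holding an X still contribute their profile rows, which is consistent with the stated aim of avoiding information loss.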
6. The method according to claim 3, wherein the features based on gene sequence information in S21 include an amino acid composition feature, a dipeptide composition feature, and a dipeptide deviation from expected mean feature;
wherein the amino acid composition feature represents the frequencies of the 20 natural amino acids in the protein sequence, and is calculated as follows:
Figure FDA0003723441940000032
wherein N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector;
the dipeptide composition feature represents the frequency of each dipeptide in the protein or polypeptide sequence, and is calculated as follows:
Figure FDA0003723441940000033
wherein N_ab denotes the number of occurrences of the given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
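The two composition features can be sketched as follows. The formula images are not reproduced in this text, so the dipeptide denominator N - 1 (the number of adjacent residue pairs) is an assumption based on the usual definition of dipeptide composition; the alphabet order and function names are also illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def amino_acid_composition(sequence):
    """f(a) = N(a) / N for each of the 20 amino acids."""
    n = len(sequence)
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """D(a, b) = N_ab / (N - 1) for each of the 400 ordered pairs."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return [pairs.count(a + b) / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
```

For the toy sequence `ACAC`, the overlapping pairs are AC, CA and AC, so the dipeptide AC has frequency 2/3 and CA has frequency 1/3.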
7. The method of claim 6, wherein the dipeptide deviation from expected mean (DDE) feature is computed from three quantities: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV);
the formula for calculating the TM feature is as follows:
Figure FDA0003723441940000034
wherein C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N, equal to 61, is the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
Figure FDA0003723441940000035
wherein TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
Figure FDA0003723441940000041
wherein DPC represents the DPC feature, TM represents the TM feature, and TV represents the TV feature.
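The DDE computation can be sketched as below, using the commonly published definitions TM(a, b) = (C_a / C_N)(C_b / C_N), TV(a, b) = TM(1 - TM) / (N - 1) and DDE(a, b) = (DPC - TM) / sqrt(TV). Since the formula images are not reproduced here, these exact forms are assumptions that match the variable descriptions in the claim; the function name `dde` is illustrative. The codon counts are those of the standard genetic code.

```python
from itertools import product
from math import sqrt

# Number of codons per amino acid in the standard genetic code (total 61).
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}
TOTAL_CODONS = 61  # C_N: all codons except the three stop codons


def dde(sequence):
    """Return the 400 DDE values keyed by dipeptide (sequence length >= 2)."""
    n = len(sequence)
    pairs = [sequence[i:i + 2] for i in range(n - 1)]
    features = {}
    for a, b in product(CODONS, repeat=2):
        dpc = pairs.count(a + b) / (n - 1)                  # observed frequency
        tm = (CODONS[a] / TOTAL_CODONS) * (CODONS[b] / TOTAL_CODONS)
        tv = tm * (1 - tm) / (n - 1)                        # theoretical variance
        features[a + b] = (dpc - tm) / sqrt(tv)
    return features


values = dde("AAAA")
```

A positive value means the dipeptide occurs more often than codon usage alone would predict, and a negative value means it occurs less often.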
8. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 3, wherein the classical machine learning classification models trained with prior feature information in S23 comprise a random forest classification algorithm, an extremely randomized trees classification algorithm, an XGBoost classification algorithm, a GradientBoosting classification algorithm and an AdaBoost classification algorithm.
9. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 1, wherein step S4 comprises the following steps:
s41, performing a stacking algorithm using a plurality of classification methods, and taking the prediction scores of the different classification methods on the training data as a new training data set;
and S42, constructing a classification model on the new training data set based on extremely randomized trees, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation index of the model.
10. The hybrid prediction method of virulence factors and antibiotic resistance genes according to claim 9, wherein S41 comprises the following steps:
s411, integrating a plurality of base-level classification models through a meta-model;
s412, training the base-level classification models on the whole training data set, and training the meta-model using the outputs of the base-level classification models as training features;
and S413, training each base-level classification model using 5-fold cross-validation.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210781902.8A CN115171792A (en) 2022-06-30 2022-06-30 Hybrid prediction method of virulence factor and antibiotic resistance gene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210781902.8A CN115171792A (en) 2022-06-30 2022-06-30 Hybrid prediction method of virulence factor and antibiotic resistance gene

Publications (1)

Publication Number Publication Date
CN115171792A true CN115171792A (en) 2022-10-11

Family

ID=83490457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210781902.8A Pending CN115171792A (en) 2022-06-30 2022-06-30 Hybrid prediction method of virulence factor and antibiotic resistance gene

Country Status (1)

Country Link
CN (1) CN115171792A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541785A (en) * 2023-07-05 2023-08-04 北京建工环境修复股份有限公司 Toxicity prediction method and system based on deep integration machine learning model
CN116541785B (en) * 2023-07-05 2023-09-12 北京建工环境修复股份有限公司 Toxicity prediction method and system based on deep integration machine learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination