CN115662503A - Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning
- Publication number
- CN115662503A (application CN202211172959.4A)
- Authority
- CN
- China
- Prior art keywords
- bacterial
- data
- bacteria
- predicting
- prediction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data, comprising the following steps: (1) performing protein domain analysis on the bacterial genome data; (2) reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria; (3) dividing the integrated bacterial data from step (2) into a training set and a prediction set according to whether the characteristic type is known; (4) training models on the training set data with machine learning algorithms to select an optimal prediction model, and using the optimal prediction model to predict the characteristics of the prediction set data. The method enables batch prediction of the characteristics of bacteria that are difficult to isolate and culture and of bacterial genomes assembled from metagenomes, yielding a large amount of bacterial characteristic information and saving substantial experimental cost and time.
Description
Technical Field
The invention relates to the technical field of high-throughput biological data analysis, and in particular to a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data.
Background
Microorganisms, including bacteria, archaea, algae, fungi, protozoa and viruses, play an important role in maintaining the health of humans, animals and plants, but some are also the direct cause of many diseases. Culturing and studying microorganisms through biological experiments is indispensable for understanding their roles in the world around us, and remains essential even in the bioinformatics era.
Currently, when researchers study microorganisms with culture-based methods, only a few can be obtained in pure culture, because many culture conditions are unknown or because microorganisms depend on one another. This seriously hinders research on the life activities of microorganisms and the development and use of microbial resources. In addition, metagenome binning of metagenomic sequencing data, especially second- and third-generation data, can yield complete metagenome-assembled genomes (MAGs), and recently published bacterial single-cell sequencing technology can yield bacterial reference genomes in large batches. As high-throughput sequencing produces ever more bacterial genomes, the phenotypic characteristics, such as aerobic type, of the many new species represented by new genomes that are poorly identified by ribosomal sequences are difficult to obtain. A few earlier reports predicted bacterial characteristics with a naive Bayes algorithm, for example predicting the aerobic type of bacteria from their genomes, but the reported accuracies were 88% for anaerobic bacteria, 87% for aerobic bacteria and 35% for facultative bacteria. This accuracy needs to be improved before such methods are practical.
Disclosure of Invention
To solve the above problems, the invention provides an improved machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data.
The invention achieves this purpose through the following technical scheme:
The invention provides a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data, comprising the following steps:
(1) Performing protein domain analysis on the bacterial genome data;
(2) Reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria;
(3) Dividing the reconstructed protein domain feature data set into a training set and a prediction set according to whether the characteristic type is known;
(4) Training models on the training set data with a machine learning algorithm to select an optimal prediction model, and predicting the characteristics of the prediction set data with the optimal prediction model.
As a further optimization of the invention, in step (1) the bacterial genome data are derived from a bacterial genome database.
As a further optimization of the invention, in step (1) the open-source software pfam_scan is used to perform protein domain prediction analysis on the bacterial genome data with default parameters: pfam_scan.pl -fasta protein.fa -dir /data3/Pfam-A.hmm > protein.Pfam
As a further optimization of the invention, in step (2), based on the protein domain annotations obtained for each bacterium, the protein domain composition of each gene is reconstructed, the frequency of each reconstructed domain composition in each bacterium is counted, and a reconstructed protein domain composition frequency matrix is built across the bacteria.
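The reconstruction in step (2) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it reads "reconstructing and combining" as joining the domains found on one gene, in positional order, into a single label, and the `hits` table with its column names (`genome`, `gene`, `start`, `domain`) and values is a made-up stand-in for parsed pfam_scan output.

```r
# Toy stand-in for parsed pfam_scan hits; column names and values are illustrative.
hits <- data.frame(
  genome = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"),
  gene   = c("a",  "a",  "b",  "c",  "x",  "x",  "y"),
  start  = c(120,  5,    10,   8,    3,    90,   15),
  domain = c("PF00072", "PF00005", "PF00005", "PF00005",
             "PF00005", "PF00072", "PF00072"),
  stringsAsFactors = FALSE
)

# Order hits along each gene so the combined label reflects domain order.
hits <- hits[order(hits$genome, hits$gene, hits$start), ]

# One row per gene: its domains joined into a single combined label.
arch <- aggregate(domain ~ genome + gene, data = hits,
                  FUN = paste, collapse = "+")

# Count how often each combined label occurs in each genome
# (rows = combined domain labels, columns = genomes).
freq <- table(architecture = arch$domain, genome = arch$genome)
freq_matrix <- unclass(freq)
print(freq_matrix)
```

With features in rows and bacteria in columns, the matrix has the orientation that the R script in Example 1 expects when it reads pfam.txt and selects samples by column name.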
As a further optimization of the invention, in step (3) the characteristic types include, but are not limited to, aerobic type and optimal growth temperature.
As a further optimization of the invention, in step (4) the machine learning algorithms include, but are not limited to, random forest, support vector machine, decision tree, gradient boosting tree, naive Bayes and conditional inference tree.
As a further optimization of the invention, in step (4) the random forest algorithm is selected as the machine learning algorithm.
The beneficial effects of the invention are as follows: the method can accurately predict unknown characteristics of bacteria in batches, which not only saves substantial experimental cost and time but also yields phenotypic characteristics for large numbers of bacteria that have sequenced genomes but are difficult to culture.
Drawings
FIG. 1 is a schematic diagram of the technical scheme provided by the invention;
FIG. 2 shows the results of predicting the 1571 bacteria of known aerobic type with the optimal prediction model provided in Example 1 of the invention;
FIG. 3 shows the results of predicting the 1542 bacteria of unknown aerobic type with the optimal prediction model provided in Example 1 of the invention;
FIG. 4 shows the results of predicting the optimal growth temperature of bacteria with the random forest algorithm, as provided in Example 2 of the invention;
FIG. 5 shows the results of evaluating the accuracy of the prediction method using the absolute error, as provided in Example 2 of the invention.
Detailed Description
The present application is described in further detail below with reference to the drawings. The following detailed description is illustrative only and is not to be construed as limiting the scope of the application; those skilled in the art can make insubstantial modifications and adaptations to the application based on the above disclosure.
Example 1
As shown in FIG. 1, this example provides a machine-learning-based method for predicting the aerobic type of bacteria from bacterial genomes, comprising the following steps:
(1) Download large-scale bacterial genome data from a bacterial genome database and perform protein domain analysis on it with the open-source software pfam_scan using default parameters: pfam_scan.pl -fasta protein.fa -dir /data3/Pfam/Pfam-A.hmm > protein.
(2) Integrate the bacterial data: the bacterial protein domains obtained from the analysis are reconstructed and combined for each gene coding sequence, their frequencies are counted, and a reconstructed protein domain composition frequency matrix is built, giving the data set pfam.txt that contains the reconstructed protein domains of multiple bacteria.
(3) Look up bacterial phenotype data: the aerobic type and optimal growth temperature of each bacterium are queried one by one in the standardized bacterial information database BacDive, or data published in existing studies are collected and collated directly. Bacteria with a known aerobic type and optimal growth temperature are assigned to the training set, and bacteria for which these are unknown are assigned to the prediction set.
(4) From pfam.txt, the data of known phenotype are divided into a training set (group1) and a test set (group2), and the ratio of training set to test set is adjusted (1:1 to 3:1) to obtain the optimal prediction model.
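The partitioning in steps (3) and (4) can be sketched in base R as follows. This is an illustration under assumptions rather than the patent's code: the phenotype table is made up, unknown phenotypes are encoded as `NA`, and the `"predict"` label for the prediction set is hypothetical. The `Group` and `Type` column names and the `group1`/`group2` labels follow the R script in this example.

```r
# Toy phenotype table; NA marks bacteria whose aerobic type is unknown.
pheno <- data.frame(
  Type = c("aerobe", "anaerobe", "aerobe", NA,
           "facultative", "anaerobe", NA, "aerobe"),
  row.names = paste0("strain", 1:8),
  stringsAsFactors = FALSE
)

known   <- rownames(pheno)[!is.na(pheno$Type)]  # usable for training and testing
unknown <- rownames(pheno)[is.na(pheno$Type)]   # prediction set

# Split the known samples at a chosen training:test ratio
# (2:1 here; the text tries ratios from 1:1 to 3:1).
ratio <- 2
set.seed(315)
n_train   <- round(length(known) * ratio / (ratio + 1))
train_ids <- sample(known, n_train)

design <- pheno
design$Group <- "predict"               # hypothetical label for the prediction set
design[known, "Group"]     <- "group2"  # test set by default
design[train_ids, "Group"] <- "group1"  # training set
print(table(design$Group))
```

Repeating the split for several values of `ratio` and comparing test-set performance is one way to carry out the ratio adjustment described in step (4).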
Taking a random forest model in R (the randomForest package) as an example, with the test set predictions stored in RF_prediction_binary.txt:
design=read.table("design.txt",header=T,row.names=1)  # sample metadata with Group and Type columns
otu_table=read.table("pfam.txt",header=T,row.names=1)  # domain frequency matrix: rows are features, columns are bacteria
design_sub=subset(design,Group%in%c("group1"))  # training set
design_sub$Type=as.factor(design_sub$Type)
idx=rownames(design_sub)%in%colnames(otu_table)  # keep only samples present in the matrix
design_sub=design_sub[idx,]
otu_sub=otu_table[,rownames(design_sub)]
library(randomForest)
set.seed(315)
rf=randomForest(t(otu_sub),design_sub$Type,importance=TRUE,proximity=T,ntree=1000)  # classification forest with 1000 trees
print(rf)
set.seed(315)
result=rfcv(t(otu_sub),design_sub$Type,cv.fold=5)  # 5-fold cross-validation with stepwise feature reduction
result$error.cv
imp=as.data.frame(rf$importance)
imp=imp[order(imp[,1],decreasing=T),]  # sort features by the first importance column
write.table(imp,file="importance_class.txt",quote=F,sep='\t',row.names=T,col.names=T)
design_test=subset(design,Group%in%c("group2"))  # test set
design_test$Type=as.factor(design_test$Type)
idx=rownames(design_test)%in%colnames(otu_table)
design_test=design_test[idx,]
otu_sub=otu_table[,rownames(design_test)]
otutab_t=as.data.frame(t(otu_sub))
otutab_t$Type=design[rownames(otutab_t),]$Type
set.seed(315)
otutab.pred=predict(rf,t(otu_sub))  # predict the phenotype of the test set
pre_tab=table(observed=otutab_t[,"Type"],predicted=otutab.pred)  # confusion matrix
pred_df=data.frame(Type=otutab_t[,"Type"],predicted=otutab.pred)  # renamed so it does not mask predict()
write.table("SampleID\t",file=paste("RF_prediction_binary.txt",sep=""),append=F,quote=F,eol="",row.names=F,col.names=F)  # header cell for the row-name column
write.table(pred_df,file="RF_prediction_binary.txt",append=T,quote=F,row.names=T,col.names=T,sep="\t")
Taking a data set of 3113 bacteria as an example, 1571 of which have a known aerobic type, the test set predictions of the optimal model are shown in FIG. 2. The optimal random forest model achieved recalls of 88.87%, 96.43% and 88.03% for aerobic, anaerobic and facultative bacteria, respectively, with a Kappa coefficient of 0.87, indicating high prediction accuracy. The optimal model was then used to predict the 1542 bacteria of unknown aerobic type; the results are shown in FIG. 3.
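For readers who want to reproduce the two reported metrics, the sketch below computes per-class recall and a Kappa coefficient from a confusion matrix in base R. It assumes the Kappa coefficient is Cohen's kappa, which the text does not state, and the counts are made up for illustration; they are not the patent's data.

```r
# Made-up confusion matrix (rows = observed, columns = predicted).
classes <- c("aerobe", "anaerobe", "facultative")
conf <- matrix(c(40,  2,  3,
                  1, 45,  1,
                  4,  2, 30),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed = classes, predicted = classes))

# Recall per class: correctly predicted samples divided by
# all observed samples of that class.
recall <- diag(conf) / rowSums(conf)

# Cohen's kappa: agreement beyond what the row and column
# totals would produce by chance.
n          <- sum(conf)
p_observed <- sum(diag(conf)) / n
p_expected <- sum(rowSums(conf) * colSums(conf)) / n^2
kappa_coef <- (p_observed - p_expected) / (1 - p_expected)

print(round(recall, 4))
print(round(kappa_coef, 4))
```

The same calculation applies to the `pre_tab` table produced by the script above, since it is also laid out with observed classes in rows and predicted classes in columns.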
Example 2
This example provides a machine-learning-based method for predicting the optimal growth temperature of bacteria from bacterial genomes; the prediction procedure follows that given in Example 1.
In this example, the machine learning model was built with 378 bacterial samples in the training set and 377 in the test set. After comparing multiple algorithms, the random forest algorithm gave the most accurate prediction of the optimal growth temperature; the results are shown in FIG. 4.
Evaluated at absolute error thresholds of 15 °C, 10 °C and 5 °C, the prediction accuracy on the test set reached 100%, 97% and 78%, respectively, showing that the prediction method provided by the application predicts the optimal growth temperature of bacteria relatively accurately; the results are shown in FIG. 5.
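The threshold-based evaluation can be sketched in base R as follows. The temperatures are made up for illustration, and treating a prediction as correct when its absolute error is at or below the threshold (inclusive) is an assumption, since the text does not say whether the bound is inclusive.

```r
# Made-up observed and predicted optimal growth temperatures (degrees Celsius).
observed  <- c(37, 30, 55, 25, 70, 10, 42, 28)
predicted <- c(35, 33, 48, 26, 58, 21, 41, 40)

abs_err <- abs(predicted - observed)

# Share of test samples whose absolute error is at or below each threshold.
thresholds <- c(15, 10, 5)
accuracy <- sapply(thresholds, function(th) mean(abs_err <= th))
names(accuracy) <- paste0("<=", thresholds, "C")
print(accuracy)
```

Reporting several thresholds side by side, as the example does, shows how quickly accuracy falls as the tolerance tightens.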
The above embodiments describe only several implementations of the invention in specific detail and are not to be construed as limiting the scope of the invention. A person skilled in the art can make variations and modifications without departing from the inventive concept, and these fall within the scope of protection of the invention.
Claims (7)
1. A machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data, comprising the following steps:
(1) Performing protein domain analysis on the bacterial genome data;
(2) Reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria;
(3) Dividing the reconstructed protein domain feature data set obtained in step (2) into a training set and a prediction set according to whether the characteristic type is known;
(4) Training models on the training set data with a machine learning algorithm to select an optimal prediction model, and predicting the characteristics of the prediction set data with the optimal prediction model.
2. The method according to claim 1, wherein in step (1) the bacterial genome data are derived from a bacterial genome database.
3. The method according to claim 1, wherein in step (1) protein domain analysis is performed on the bacterial genome data using pfam_scan.
4. The method according to claim 1, wherein in step (2) the bacterial protein domains obtained from the analysis are reconstructed and combined for each coding sequence, their frequencies are counted, and a reconstructed protein domain composition frequency matrix is built for the bacteria.
5. The method according to claim 1, wherein in step (3) the characteristic types include, but are not limited to, aerobic type and optimal growth temperature.
6. The method according to claim 1, wherein in step (4) the machine learning algorithm includes, but is not limited to, random forest, support vector machine, decision tree, gradient boosting tree, naive Bayes and conditional inference tree.
7. The method according to claim 1, wherein in step (4) the machine learning algorithm is the random forest algorithm.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2022104651858 | 2022-04-29 | ||
CN202210465185 | 2022-04-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115662503A (en) | 2023-01-31 |
Family
ID=84986490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211172959.4A Pending CN115662503A (en) | 2022-04-29 | 2022-09-26 | Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115662503A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3879537A1 (en) * | 2020-03-12 | 2021-09-15 | bioMérieux | Molecular technology for predicting a phenotypic nature of a bacterium from its genome |
WO2021180771A1 (en) * | 2020-03-12 | 2021-09-16 | bioMérieux | Molecular technology for predicting a phenotypic trait of a bacterium from its genome |
CN114067912A (en) * | 2021-11-23 | 2022-02-18 | 天津金匙医学科技有限公司 | Method for screening important characteristic genes related to drug-resistant phenotype of bacteria based on machine learning |
CN114388062A (en) * | 2021-12-17 | 2022-04-22 | 予果生物科技(北京)有限公司 | Method, equipment and application for predicting antibiotic resistance phenotype based on machine learning |
Non-Patent Citations (2)
Title |
---|
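SHRIKANT HARNE ET AL.: "MreB5 is a determinant of rod-to-helical transition in the cell-wall-less bacterium Spiroplasma", 《CURRENT BIOLOGY》, vol. 30, no. 23, 7 December 2020 (2020-12-07), pages 4753 - 4762 * |
韩国民 (Han Guomin): "Discussion on integrating modern information technology into the teaching of comprehensive microbiology experiments", 《现代农业科技》 (Modern Agricultural Science and Technology), 31 August 2018 (2018-08-31), pages 276 - 277 * |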
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116721695A (en) * | 2023-03-07 | 2023-09-08 | 安徽农业大学 | Identification method, device, equipment and medium of candidate gene for regulating bacterial shape |
CN116721695B (en) * | 2023-03-07 | 2024-03-08 | 安徽农业大学 | Identification method, device, equipment and medium of candidate gene for regulating bacterial shape |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||