CN115662503A - Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning

Info

Publication number
CN115662503A
CN115662503A CN202211172959.4A CN202211172959A CN115662503A CN 115662503 A CN115662503 A CN 115662503A CN 202211172959 A CN202211172959 A CN 202211172959A CN 115662503 A CN115662503 A CN 115662503A
Authority
CN
China
Prior art keywords
bacterial
data
bacteria
predicting
prediction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211172959.4A
Other languages
Chinese (zh)
Inventor
韩国民
刘海达
万思琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-04-29
Filing date
2022-09-26
Publication date
2023-01-31
Application filed by Anhui Agricultural University AHAU
Publication of CN115662503A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data, comprising the following steps: (1) performing protein domain analysis on the bacterial genome data; (2) reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria; (3) dividing the integrated bacterial data from step (2) into a training set and a prediction set according to whether the feature type of each bacterium is known; (4) training models on the training set with machine learning algorithms to select an optimal prediction model, and predicting the features of the prediction set with that model. The method enables batch prediction of the characteristics of bacteria that are difficult to isolate and culture and of bacterial genomes assembled from metagenomes, yielding a large amount of bacterial characteristic information while saving substantial experimental cost and time.

Description

Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning
Technical Field
The invention relates to the technical field of high-throughput biological data analysis, and in particular to a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data.
Background
Microorganisms, including bacteria, archaea, algae, fungi, protozoa, and viruses, play an important role in maintaining the health of humans, animals, and plants, but some are also the direct cause of many diseases. Culturing and studying microorganisms through biological experiments is indispensable for understanding their roles in the world around us, and remains essential even in the bioinformatics era.
Currently, when researchers study microorganisms with culture-based methods, only a small fraction can be obtained in pure culture, because many of the conditions required for culturing are unknown or because microorganisms depend on one another. This seriously hinders research on the life activities of microorganisms and the development and use of microbial resources. In addition, metagenome binning of metagenomic sequencing data, especially second- and third-generation data, can yield some complete metagenome-assembled genomes (MAGs), and recently published bacterial single-cell sequencing technology can produce bacterial reference genomes in large batches. As the number of bacterial genomes obtained by high-throughput sequencing grows, phenotypic characteristics such as aerobic type are difficult to obtain for the many new species that these new genomes represent and that ribosomal sequences identify poorly. There have been a few reports of predicting bacterial characteristics with a naive Bayes algorithm, for example predicting the aerobic type of bacteria from their genomes, but the reported prediction accuracies were 88% for anaerobic bacteria, 87% for aerobic bacteria, and 35% for facultative bacteria. The accuracy needs to be improved, and practical application is therefore difficult.
Disclosure of Invention
To solve the above problems, the invention provides an improved machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data.
The invention achieves this purpose through the following technical scheme:
The invention provides a machine-learning-based method for predicting bacterial phenotypic characteristics from bacterial genome data, comprising the following steps:
(1) performing protein domain analysis on the bacterial genome data;
(2) reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria;
(3) dividing the reconstructed protein domain feature data set into a training set and a prediction set according to whether the feature type of each bacterium is known;
(4) training models on the training set with machine learning algorithms to select an optimal prediction model, and predicting the features of the prediction set with that model.
As a further optimization scheme of the invention, in step (1), the bacterial genome data are derived from a bacterial genome database.
As a further optimization scheme of the invention, in step (1), the open-source software pfam_scan is used to perform protein domain prediction analysis on the bacterial genome data with default parameters: pfam_scan.pl -fasta protein.fa -dir /data3/Pfam-A.hmm > protein.pfam
As a further optimization scheme of the invention, in step (2), based on the protein domains identified for each bacterium, the domain composition of each gene is reconstructed, the frequency of each reconstructed domain composition in each bacterium is counted, and a frequency matrix of reconstructed protein domain compositions is built for the bacteria.
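The counting described for step (2) can be illustrated with a minimal base-R sketch. It assumes the domain hits have already been parsed into a table with one row per hit. The column names (genome, gene, start, domain), the "|" separator used to join the domains of one gene, the helper name build_domain_matrix, and the toy values are all assumptions made for illustration and are not taken from the patent.

```r
# Toy domain hits: one row per protein domain hit (hypothetical values).
hits <- data.frame(
  genome = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"),
  gene   = c("a",  "a",  "b",  "c",  "x",  "x",  "y"),
  start  = c(50,   5,    10,   7,    3,    40,   12),
  domain = c("PF00002", "PF00001", "PF00003", "PF00003",
             "PF00001", "PF00002", "PF00004"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-genome frequency of per-gene domain compositions.
build_domain_matrix <- function(hits, sep = "|") {
  # Order hits along each gene so the combined label follows domain order.
  hits <- hits[order(hits$genome, hits$gene, hits$start), ]
  # One combined ("reconstructed") domain composition per gene.
  key <- paste(hits$genome, hits$gene, sep = "\r")
  combo <- tapply(hits$domain, key, paste, collapse = sep)
  genome_of_gene <- tapply(hits$genome, key, function(g) g[1])
  # Count how often each composition occurs in each genome.
  counts <- table(
    composition = as.vector(combo),
    genome = as.vector(genome_of_gene[names(combo)])
  )
  # Rows are compositions, columns are genomes.
  unclass(counts)
}

freq <- build_domain_matrix(hits)
print(freq)
```

The result has one row per domain composition and one column per bacterium, which matches how the R script in Example 1 indexes pfam.txt by column before transposing it.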
As a further optimization scheme of the invention, in step (3), the feature types include, but are not limited to, aerobic type and optimal growth temperature.
As a further optimization scheme of the invention, in step (4), the machine learning algorithms include, but are not limited to, random forest, support vector machine, decision tree, gradient boosting tree, naive Bayes, and conditional inference tree.
As a further optimization scheme of the invention, in step (4), the random forest algorithm is selected as the machine learning algorithm.
The beneficial effects of the invention are as follows: the method can accurately predict unknown characteristics of bacteria in batches, which not only saves substantial experimental cost and time but also yields a large number of phenotypic characteristics for bacteria that have sequenced genomes but are difficult to culture.
Drawings
FIG. 1 is a schematic diagram of the technical scheme provided by the invention;
FIG. 2 shows the results of predicting the 1571 bacteria of known aerobic type with the optimal prediction model provided in Example 1 of the invention;
FIG. 3 shows the results of predicting the 1542 bacteria of unknown aerobic type with the optimal prediction model provided in Example 1 of the invention;
FIG. 4 shows the results of predicting the optimal growth temperature of bacteria with the random forest algorithm, as provided in Example 2 of the invention;
FIG. 5 shows the accuracy of the prediction method of the invention evaluated by the absolute value of the error, as provided in Example 2 of the invention.
Detailed Description
The present application is described in further detail below with reference to the drawings. The following detailed description is given for illustrative purposes only and is not to be construed as limiting the scope of the application; those skilled in the art may make insubstantial modifications and adaptations based on the above disclosure.
Example 1
As shown in FIG. 1, this example provides a machine-learning-based method for predicting the aerobic type of bacteria from bacterial genomes, comprising the following steps:
(1) Download large-scale bacterial genome data from a bacterial genome database and perform protein domain analysis on them using the open-source software pfam_scan with default parameters: pfam_scan.pl -fasta protein.fa -dir /data3/Pfam/Pfam-A.hmm > protein.pfam
(2) Integrate the bacterial data: for each gene coding sequence, reconstruct and combine the protein domains obtained from the analysis, count the frequencies, and build a frequency matrix of reconstructed protein domain compositions, giving a data set pfam.txt that contains the reconstructed protein domains of many bacteria.
(3) Look up bacterial phenotype data: query the aerobic type and optimal growth temperature of each bacterium in the standardized bacterial information database BacDive, or directly collect and collate data published in existing studies. Bacteria whose aerobic type and optimal growth temperature are known are assigned to the training set, and bacteria for which they are unknown are assigned to the prediction set.
(4) From pfam.txt, divide the data of known phenotype type into a training set (group1) and a test set (group2), and adjust the ratio between them (1:1 to 3:1) to obtain the optimal prediction model.
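The split described in steps (3) and (4) can be sketched in base R as follows. This is only an illustration under stated assumptions: unknown phenotypes are recorded as NA, the 1:1 to 3:1 proportion is read as the training-to-test ratio, and the helper name assign_groups, the label "predict" for the prediction set, and the toy data are hypothetical. Only the names Group, Type, group1, and group2 come from the script in this example.

```r
# Toy phenotype table; NA marks a bacterium whose aerobic type is unknown.
design <- data.frame(
  Type = c("aerobe", "anaerobe", "aerobe", NA, "facultative", "anaerobe",
           NA, "aerobe", "facultative", "anaerobe"),
  row.names = paste0("bac", 1:10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each bacterium as group1, group2, or predict.
assign_groups <- function(design, train_ratio = 3, seed = 315) {
  known <- which(!is.na(design$Type))
  # train_ratio : 1 is the assumed training-to-test ratio (between 1 and 3).
  n_train <- round(length(known) * train_ratio / (train_ratio + 1))
  set.seed(seed)
  train_idx <- known[sample.int(length(known), n_train)]
  design$Group <- "predict"             # unknown phenotype: prediction set
  design$Group[known] <- "group2"       # known phenotype: test set by default
  design$Group[train_idx] <- "group1"   # sampled subset: training set
  design
}

design <- assign_groups(design, train_ratio = 3)
print(table(design$Group))
```

Repeating the call with train_ratio set to 1, 2, and 3 and comparing the test set results is one way to explore the stated ratio range.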
Taking the random forest model in R as an example, the test set prediction results are stored in RF_prediction_binary.txt:
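# Load sample metadata (columns Group and Type) and the domain frequency matrix
# (rows: reconstructed protein domains, columns: bacteria).
library(randomForest)
design=read.table("design.txt",header=T,row.names=1)
otu_table=read.table("pfam.txt",header=T,row.names=1)
# Training set (group1): keep only bacteria that are present in the matrix.
design_sub=subset(design,Group%in%c("group1"))
idx=rownames(design_sub)%in%colnames(otu_table)
design_sub=design_sub[idx,]
# Convert to a factor after filtering so that no empty class levels remain.
design_sub$Type=as.factor(design_sub$Type)
otu_sub=otu_table[,rownames(design_sub)]
# Train the random forest classifier (bacteria in rows after transposing).
set.seed(315)
rf=randomForest(t(otu_sub),design_sub$Type,importance=TRUE,proximity=T,ntree=1000)
print(rf)
# 5-fold cross-validation error as the number of features is reduced.
set.seed(315)
result=rfcv(t(otu_sub),design_sub$Type,cv.fold=5)
result$error.cv
# Save feature importance, sorted by the first importance column.
imp=as.data.frame(rf$importance)
imp=imp[order(imp[,1],decreasing=T),]
write.table(imp,file="importance_class.txt",quote=F,sep='\t',row.names=T,col.names=T)
# Test set (group2): same filtering, then predict with the trained model.
design_test=subset(design,Group%in%c("group2"))
design_test$Type=as.factor(design_test$Type)
idx=rownames(design_test)%in%colnames(otu_table)
design_test=design_test[idx,]
otu_sub=otu_table[,rownames(design_test)]
otutab_t=as.data.frame(t(otu_sub))
otutab_t$Type=design[rownames(otutab_t),]$Type
# Seed is set because tied votes are broken at random during prediction.
set.seed(315)
otutab.pred=predict(rf,t(otu_sub))
# Confusion matrix of observed versus predicted types.
pre_tab=table(observed=otutab_t[,"Type"],predicted=otutab.pred)
print(pre_tab)
pred_out=data.frame(Type=otutab_t[,"Type"],predicted=otutab.pred)
# Write the header cell first, then append the prediction table.
write.table("SampleID\t",file="RF_prediction_binary.txt",append=F,quote=F,eol="",row.names=F,col.names=F)
write.table(pred_out,file="RF_prediction_binary.txt",append=T,quote=F,row.names=T,col.names=T,sep="\t")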
Taking a data set of 3113 bacteria as an example, 1571 of them have a known aerobic type. The test set prediction results of the optimal model are shown in FIG. 2: the recall of the optimal random forest model for aerobic, anaerobic, and facultative bacteria is 88.87%, 96.43%, and 88.03%, respectively, and its Kappa coefficient is 0.87, indicating high prediction accuracy. The optimal model is then used to predict the 1542 bacteria of unknown aerobic type; the results are shown in FIG. 3.
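The per-class recall and Kappa coefficient reported above can be computed from a confusion matrix of observed versus predicted types. The sketch below uses base R and the standard definitions of recall and Cohen's kappa. The confusion matrix values are hypothetical and do not reproduce the patent's data, and the function name cohen_kappa is an assumption.

```r
# Hypothetical confusion matrix: rows = observed, columns = predicted.
classes <- c("aerobe", "anaerobe", "facultative")
conf <- matrix(c(45,  1,  4,
                  2, 27,  1,
                  3,  1, 16),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed = classes, predicted = classes))

# Recall per class: correctly predicted / all observed members of that class.
recall <- diag(conf) / rowSums(conf)

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
cohen_kappa <- function(conf) {
  n  <- sum(conf)
  po <- sum(diag(conf)) / n                        # observed agreement
  pe <- sum(rowSums(conf) * colSums(conf)) / n^2   # chance agreement
  (po - pe) / (1 - pe)
}

kappa_value <- cohen_kappa(conf)
print(round(recall, 4))
print(round(kappa_value, 4))
```

A table built as in the script above, with observed types in rows and predicted types in columns, can be passed to cohen_kappa in the same way.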
Example 2
This example provides a machine-learning-based method for predicting the optimal growth temperature of bacteria from bacterial genomes; the prediction procedure follows the one given in Example 1.
In this example, the machine learning model is built with 378 bacterial samples in the training set and 377 in the test set. After comparing several algorithms, the random forest algorithm was found to predict the optimal growth temperature of bacteria most accurately; the results are shown in FIG. 4.
When evaluated at absolute error thresholds of 15 ℃, 10 ℃, and 5 ℃, the prediction accuracy on the test set reaches 100%, 97%, and 78%, respectively, showing that the method provided by the application can predict the optimal growth temperature of bacteria fairly accurately; the results are shown in FIG. 5.
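The threshold-based evaluation can be sketched in base R as below. It assumes that a prediction counts as correct when its absolute error is less than or equal to the threshold; the patent does not state whether the bound is inclusive. The temperatures are hypothetical and the function name accuracy_within is an assumption.

```r
# Hypothetical observed and predicted optimal growth temperatures (degrees C).
observed  <- c(37, 30, 55, 25, 70, 28)
predicted <- c(35, 33, 48, 26, 58, 28)

# Share of samples whose absolute prediction error is within each tolerance.
accuracy_within <- function(observed, predicted, tolerance) {
  abs_err <- abs(observed - predicted)
  sapply(tolerance, function(tol) mean(abs_err <= tol))
}

acc <- accuracy_within(observed, predicted, tolerance = c(15, 10, 5))
names(acc) <- c("within_15", "within_10", "within_5")
print(acc)
```

Reporting several tolerances at once shows how quickly accuracy falls as the allowed error shrinks, which is the pattern behind the 15 ℃, 10 ℃, and 5 ℃ figures.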
The above embodiments describe only several implementations of the invention. Although the description is specific and detailed, it is not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make variations and modifications without departing from the inventive concept, and these fall within the scope of protection of the invention.

Claims (7)

1. A method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data, comprising the steps of:
(1) performing protein domain analysis on the bacterial genome data;
(2) reconstructing and combining the bacterial protein domains obtained from the analysis, counting their frequencies, and converting the frequencies into a matrix to obtain a reconstructed protein domain feature data set for the bacteria;
(3) dividing the reconstructed protein domain feature data set obtained in step (2) into a training set and a prediction set according to whether the feature type of each bacterium is known;
(4) training models on the training set with a machine learning algorithm to select an optimal prediction model, and predicting the features of the prediction set with the optimal prediction model.
2. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in the step (1), the bacterial genome data is derived from a bacterial genome database.
3. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in step (1), the bacterial genome data is subjected to protein domain analysis using pfam_scan.
4. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in the step (2), the bacterial protein structural domains obtained by analysis are reconstructed and combined according to each coding sequence, the frequency is counted, and the reconstructed protein structural domains of the bacteria are constructed to form a frequency matrix.
5. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in step (3), the characteristic types include, but are not limited to, aerobic types and optimal growth temperatures.
6. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in step (4), the machine learning algorithm includes, but is not limited to, a random forest, a support vector machine, a decision tree, a gradient boosting tree, naive Bayes, and a conditional inference tree.
7. The method for predicting a bacterial phenotypic characteristic based on machine-learned bacterial genome data of claim 1, wherein: in step (4), the machine learning algorithm is the random forest algorithm.
CN202211172959.4A 2022-04-29 2022-09-26 Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning Pending CN115662503A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022104651858 2022-04-29
CN202210465185 2022-04-29

Publications (1)

Publication Number Publication Date
CN115662503A true CN115662503A (en) 2023-01-31

Family

ID=84986490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211172959.4A Pending CN115662503A (en) 2022-04-29 2022-09-26 Method for predicting bacterial phenotypic characteristics based on bacterial genome data of machine learning

Country Status (1)

Country Link
CN (1) CN115662503A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3879537A1 (en) * 2020-03-12 2021-09-15 bioMérieux Molecular technology for predicting a phenotypic nature of a bacterium from its genome
WO2021180771A1 (en) * 2020-03-12 2021-09-16 bioMérieux Molecular technology for predicting a phenotypic trait of a bacterium from its genome
CN114067912A (en) * 2021-11-23 2022-02-18 天津金匙医学科技有限公司 Method for screening important characteristic genes related to drug-resistant phenotype of bacteria based on machine learning
CN114388062A (en) * 2021-12-17 2022-04-22 予果生物科技(北京)有限公司 Method, equipment and application for predicting antibiotic resistance phenotype based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHRIKANT HARNE ET AL.: "MreB5 is a determinant of rod-to-helical transition in the cell-wall-less bacterium sprioplasma", 《CURRENT BIOLOGY》, vol. 30, no. 23, 7 December 2020 (2020-12-07), pages 4753 - 4762 *
韩国民: "将现代信息技术融入微生物学综合实验的教学探讨", 《现代农业科技》, 31 August 2018 (2018-08-31), pages 276 - 277 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721695A (en) * 2023-03-07 2023-09-08 安徽农业大学 Identification method, device, equipment and medium of candidate gene for regulating bacterial shape
CN116721695B (en) * 2023-03-07 2024-03-08 安徽农业大学 Identification method, device, equipment and medium of candidate gene for regulating bacterial shape

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination