CN114520031A

CN114520031A - Method for predicting permeability of compound placental membrane based on machine learning

Info

Publication number: CN114520031A
Application number: CN202210079167.6A
Authority: CN
Inventors: 庄树林; 高雨晨; 崔世璇; 张家晨; 苟艺源; 赵启明
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-01-24
Filing date: 2022-01-24
Publication date: 2022-05-20
Anticipated expiration: 2042-01-24
Also published as: CN114520031B

Abstract

The invention discloses a machine-learning-based method for predicting the placental membrane permeability of compounds, comprising the following steps: (1) establishing a criterion for judging the placental membrane permeability of a compound; (2) collecting compounds to build the BPBData data set, obtaining sample data and sample labels, and preprocessing the sample data; (3) constructing a prediction model based on a machine learning algorithm, and training it on the preprocessed sample data under the supervision of the sample labels so as to optimize the model parameters; (4) predicting the placental membrane permeability of the compound to be tested. The method uses two parameters, F/M (the fetal-to-maternal blood concentration ratio of the compound in vivo) and CI (clearance index), to establish the permeability criterion, and then constructs a model relating the molecular structural features of a compound to its placental membrane permeability, thereby achieving high-throughput, fast, low-cost and high-accuracy prediction of placental membrane permeability.

Description

Method for predicting permeability of compound placental membrane based on machine learning

Technical Field

The invention belongs to the technical field of compound attribute prediction, and particularly relates to a method for predicting the permeability of a compound placental membrane based on machine learning.

Background

The placenta is an important organ for intrauterine development and the maintenance of pregnancy, and is the structure through which substances are exchanged between mother and fetus. The blood-placental barrier (BPB) performs multiple functions and plays an important role in fetal growth and development. Because the placenta is one of the most species-specific organs, in vitro models are considered more suitable than in vivo assays for assessing the potential of chemical substances to transfer across the human placental barrier.

However, in current studies, in vitro and ex vivo methods cannot directly predict in vivo results, which makes it difficult to assess the placental membrane permeability of compounds, and in vivo studies of the risk of chemical transfer from mother to fetus across the placental membrane should not be performed. There is therefore an urgent need for an evaluation method that requires no human testing and that yields useful information on the placental membrane permeability of compounds while keeping the procedure practical and the cost under control.

Machine learning is widely applied in fields such as natural language understanding, non-monotonic reasoning, machine vision and pattern recognition; by analyzing complex and diverse data in depth, it makes efficient use of the information the data contain.

Chinese patent publication No. CN101339180A discloses a method for predicting the explosion characteristics of organic compounds based on a support vector machine, which includes: the parameterization of molecular structure information is realized by taking a molecular group of an organic compound as a structure descriptor for describing the molecular structure characteristics; utilizing a support vector machine to respectively simulate the internal quantitative relation between each explosion characteristic and the structure descriptor thereof, and establishing a corresponding support vector machine prediction model based on molecular groups; and inputting the molecular group of the organic compound to be predicted as an input parameter into the obtained prediction model to obtain the related explosion characteristic value. The invention can accurately and quickly predict the explosion characteristics of the organic compound according to the molecular structure of the organic compound.

Chinese patent publication No. CN113255770A discloses a method for training a compound attribute prediction model, which includes: acquiring space structure information formed by atoms and chemical bonds of a first sample compound; taking the first sample compound as an input sample and the corresponding spatial structure information as an output sample, and training to obtain a spatial structure prediction model; taking a second sample compound as an input sample and corresponding attribute information as an output sample, and training to obtain a compound attribute prediction model on the basis of the spatial structure prediction model; and predicting the attribute of the compound to be detected by using the compound attribute prediction model.

Disclosure of Invention

The invention provides a machine-learning-based method for predicting the placental membrane permeability of compounds. The method enables batch prediction of compounds in a short time, at low cost and with high accuracy; it overcomes the current difficulty of evaluating the placental membrane permeability of compounds experimentally and fills a technical gap in the construction of placental membrane permeability prediction models.

The technical scheme is as follows:

a method for predicting the permeability of a compound placental membrane based on machine learning, comprising the steps of:

(1) establishing a compound placental membrane permeability judgment standard;

(2) collecting a compound to establish a BPBData data set, cleaning the data set, evaluating whether a sample in the data set has placental membrane permeability according to the standard established in the step (1), taking an evaluation result as a sample label, calculating a molecular fingerprint after deriving a SMILES expression of the sample, extracting a molecular descriptor of the sample as sample data, and preprocessing the sample data;

(3) constructing a prediction model based on a machine learning algorithm, and training the prediction model under the supervision of a sample label by utilizing the preprocessed sample data so as to optimize parameters of the prediction model;

(4) deriving the SMILES expression of the compound to be tested, calculating its molecular fingerprint, extracting its molecular descriptors as the data to be tested, inputting these data into the parameter-optimized prediction model, and predicting the placental membrane permeability of the compound to be tested.

In the prior art, the placental membrane permeability of compounds has been little studied and data are very difficult to obtain. The method therefore uses two parameters, F/M (the fetal-to-maternal blood concentration ratio of the compound in vivo) and CI (clearance index), to establish a criterion for the placental membrane permeability of a compound; classifies the samples according to whether they can cross the placental membrane; and builds a machine-learning model relating the molecular structural features of compounds to placental membrane permeability, enabling batch prediction of the placental membrane permeability of compounds. Treating the problem as classification reduces computational complexity and improves computational stability. In addition, the method requires neither in vivo experiments nor organoid models, and its prediction accuracy is high.

In step (1), the placental membrane permeability of a compound is judged using two parameters, F/M and CI, where F/M is the fetal-to-maternal blood concentration ratio of the compound in vivo and CI is the clearance index:

F/M = concentration of the compound in fetal blood / concentration of the compound in maternal blood

CI = placental permeability of the compound / placental permeability of antipyrine

Wherein the priority of the parameter F/M is greater than the priority of the parameter CI;

when the parameter F/M is less than or equal to 0.15, the compound does not have the permeability of the placental membrane, and when the parameter F/M is more than or equal to 0.3, the compound has the permeability of the placental membrane;

if the parameter F/M is not available: when the parameter CI is greater than 0.80, the compound is shown to have placental membrane permeability; when the parameter CI is less than or equal to 0.80, the compound does not have the permeability of the placental membrane.
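The two-parameter criterion above can be sketched as a small labeling function. This is only an illustrative sketch: the function name is hypothetical, the labels "C" (permeable) and "NC" (not permeable) follow the notation used in the detailed description, and returning `None` for F/M values strictly between 0.15 and 0.3 is an assumption, since the text does not say how such values are handled.

```python
from typing import Optional

FM_NOT_PERMEABLE_MAX = 0.15  # F/M <= 0.15 -> no placental membrane permeability
FM_PERMEABLE_MIN = 0.3       # F/M >= 0.3  -> placental membrane permeability
CI_THRESHOLD = 0.80          # CI > 0.80   -> placental membrane permeability


def label_permeability(fm: Optional[float], ci: Optional[float]) -> Optional[str]:
    """Return "C", "NC", or None when the criterion gives no decision."""
    if fm is not None:
        # F/M has priority over CI whenever it is available.
        if fm <= FM_NOT_PERMEABLE_MAX:
            return "NC"
        if fm >= FM_PERMEABLE_MIN:
            return "C"
        # Assumption: values between the two cut-offs stay unlabeled,
        # because the text does not specify how to treat them.
        return None
    if ci is not None:
        # CI is consulted only when F/M is not available.
        return "C" if ci > CI_THRESHOLD else "NC"
    return None


print(label_permeability(fm=0.5, ci=None))   # F/M decides
print(label_permeability(fm=None, ci=0.85))  # falls back to CI
```

Note that a low F/M value overrides a high CI value in this sketch, which is one way of expressing the stated priority of F/M over CI.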

In step (2), the compounds are collected from literature experimental data or a PubChem compound database, and the like.

Considering the generalization capability of the prediction model, various types of compounds are selected when the compounds are collected, so that the trained model has wider application range.

In step (2), the data set is cleaned as follows: blank values in the BPBData data set are filled in; inorganic substances, salts and neutral molecules are removed; and zero values, zero-variance values and highly correlated values are removed, so that they do not distort the calculation results or cause overfitting;

for samples with more than 1 parameter F/M: if the number of the parameters F/M is 2, taking a weighted average value; if the number of the parameters F/M is more than 2, selecting the parameter F/M with the highest occurrence frequency;

for samples with more than 1 parameter CI: if the number of the parameters CI is 2, taking a weighted average value; if the number of the parameters CI is more than 2, the parameter CI with the highest frequency of occurrence is selected.
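The rule for samples with several reported F/M or CI values might be implemented as below. This is a sketch under stated assumptions: the function name is hypothetical, the weights for the two-value case are not given in the text (equal weights are assumed unless the caller supplies others), and ties among the most frequent values are resolved by first occurrence, which the text also leaves open.

```python
from collections import Counter
from typing import Optional, Sequence


def aggregate_measurements(values: Sequence[float],
                           weights: Optional[Sequence[float]] = None) -> float:
    """Collapse repeated measurements of one parameter into a single value."""
    if not values:
        raise ValueError("at least one measurement is required")
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        # Two values: weighted average (equal weights are an assumption).
        w = list(weights) if weights is not None else [1.0, 1.0]
        if len(w) != 2 or sum(w) <= 0:
            raise ValueError("two weights with a positive sum are required")
        return (values[0] * w[0] + values[1] * w[1]) / sum(w)
    # More than two values: take the most frequently occurring one.
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(values).most_common(1)[0][0]


print(aggregate_measurements([0.2, 0.4]))
print(aggregate_measurements([0.5, 0.7, 0.5]))
```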

The molecular fingerprints and molecular descriptors can be calculated with software such as ChemoPy, Mordred or RDKit.

The preprocessing comprises standardization and normalization: standardization processes the data by columns of the feature matrix and converts the feature values of the samples to a common scale; normalization processes the data by rows of the feature matrix and maps the data to a specified range.
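The column-wise and row-wise preprocessing described above can be illustrated with a minimal sketch. The concrete formulas are assumptions, since the text does not give them: columns are rescaled to zero mean and unit variance, and each row is scaled to unit Euclidean length, which maps every entry into the range [-1, 1]. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize_columns(matrix: Matrix) -> Matrix:
    """Column-wise: subtract the column mean, divide by the column std."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_rows
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n_rows)
        for i in range(n_rows):
            # A constant column carries no information; map it to 0.
            result[i][j] = (matrix[i][j] - mean) / std if std > 0 else 0.0
    return result


def normalize_rows(matrix: Matrix) -> Matrix:
    """Row-wise: scale each sample vector to unit Euclidean norm."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        result.append([x / norm if norm > 0 else 0.0 for x in row])
    return result


features = [[1.0, 10.0], [3.0, 30.0]]
print(standardize_columns(features))
print(normalize_rows([[3.0, 4.0]]))
```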

In the step (3), the preprocessed sample data is divided into a training set and a testing set, the prediction model is trained by the training set, the goodness of the prediction model is evaluated by the testing set, and the parameters of the prediction model are optimized.

In the step (3), the machine learning algorithm is selected from a random forest algorithm, a logistic regression algorithm, a naive Bayes algorithm, a support vector machine algorithm or a neural network algorithm.

Preferably, the machine learning algorithm is a neural network algorithm, and the neural network algorithm comprises an input layer, a hidden layer and an output layer; further preferably, the hidden layer is 1 layer, and the number of neuron nodes is 29.

Further preferably, the transfer function of the hidden layer is the logistic activation function, and the Adam algorithm is used as the weight optimizer.

The invention also provides application of the prediction method of the compound placental membrane permeability based on machine learning in prediction of the compound placental membrane permeability.

Compared with the prior art, the invention has the beneficial effects that:

(1) The prediction method overcomes the current difficulty of evaluating the placental membrane permeability of compounds experimentally. It can be used to predict the placental permeability of many kinds of compounds, such as drugs, quasi-drugs and pollutants, and so has a wide range of application; it enables batch prediction of compounds in a short time at low cost and with high accuracy, and fills a technical gap in the construction of placental membrane permeability prediction models.

(2) The parameter-optimized prediction model reaches an accuracy of 0.833, a precision of 0.893, a recall of 0.847 and an F1 score of 0.870, indicating strong predictive performance.

(3) The invention establishes a criterion for the placental membrane permeability of compounds, classifies samples according to whether they have placental membrane permeability, and then builds a neural-network model relating the molecular structural features of compounds to placental membrane permeability; when predicting the placental membrane permeability of a compound, only its SMILES expression is needed.

Drawings

FIG. 1 is a flow chart of the method for predicting the placental membrane permeability of compounds based on machine learning according to the present invention.

FIG. 2 is a visualization of the classification capability of the prediction model of the present invention.

FIG. 3 is a graph of the number of hidden layer neuron nodes as a function of the accuracy of the prediction model.

Detailed Description

The invention is further elucidated with reference to the figures and the examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention.

The flow chart of the prediction method of the compound placental membrane permeability based on machine learning in the invention is shown in figure 1, and comprises the following four steps:

(1) establishing a compound placental membrane permeability judgment standard;

The placental membrane permeability of a compound is determined using two parameters, F/M (the fetal-to-maternal blood concentration ratio of the compound in vivo) and CI (clearance index);

F/M = concentration of the compound in fetal blood / concentration of the compound in maternal blood

CI = placental permeability of the compound / placental permeability of antipyrine

In the criterion of the permeability of the placental membrane, the priority of the parameter F/M is greater than the priority of the parameter CI;

when the parameter F/M is less than or equal to 0.15, the compound does not have the permeability of the placental membrane and is expressed as NC; when the parameter F/M is more than or equal to 0.3, the compound has the placental membrane permeability and is represented as C;

if the parameter F/M is not available, when the parameter CI is >0.80, the compound is indicated to have placental membrane permeability, indicated as C; when the parameter CI is less than or equal to 0.80, the compound does not have the permeability of the placental membrane and is expressed as NC.

(2) Compounds are collected from literature experimental data, the PubChem compound database and similar sources to establish the BPBData data set.

The data set is cleaned as follows: blank values in the BPBData data set are filled in; inorganic substances, salts and neutral molecules are removed; and zero values, zero-variance values and highly correlated values are removed.

After data set cleaning and placental membrane permeability evaluation, the BPBData data set contains 248 sample compounds in total: 200 C compounds and 48 NC compounds.

The SMILES expression, a linear representation of the molecular structure, is derived for each sample, and RDKit is used to extract the structural features of the sample: the linear molecular structure is the input and the RDKit molecular descriptors are the output, with each column of data corresponding to one molecular descriptor. This finally yields 197 columns of molecular descriptors, that is, a feature matrix of 248 rows and 197 columns.
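The cleaning step mentions removing zero-variance and highly correlated values. Applied to the columns of a descriptor matrix, such a filter might look like the sketch below. Whether the filter is applied to descriptor columns in exactly this way, the use of the Pearson correlation, the greedy keep-first strategy and the 0.95 cut-off are all assumptions for illustration; none of them is stated in the text.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_columns(matrix: List[List[float]],
                   max_abs_corr: float = 0.95) -> List[int]:
    """Return the indices of the columns that survive both filters."""
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    kept: List[int] = []
    for j, col in enumerate(columns):
        if max(col) == min(col):
            continue  # zero variance (this also covers all-zero columns)
        if any(abs(pearson(col, columns[k])) > max_abs_corr for k in kept):
            continue  # nearly redundant with a column that was already kept
        kept.append(j)
    return kept


toy = [[1.0, 0.0, 2.0, 5.0],
       [2.0, 0.0, 4.0, 3.0],
       [3.0, 0.0, 6.0, 4.0]]
print(select_columns(toy))  # column 1 is constant, column 2 duplicates column 0
```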

The feature matrix is preprocessed to avoid the loss of model performance caused by excessively large differences between feature values. The preprocessing comprises standardization and normalization: standardization processes the data by columns of the feature matrix and converts the feature values of the samples to a common scale; normalization processes the data by rows of the feature matrix and maps the data to a specified range.

(3) And constructing a prediction model based on a machine learning algorithm, and training the prediction model under the supervision of a sample label by utilizing the preprocessed sample data so as to optimize parameters of the prediction model.

A neural network is selected to construct the prediction model; it comprises an input layer, one hidden layer and an output layer.
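To make the architecture concrete, the following sketch runs one forward pass through a network with 197 inputs (one per descriptor column), a single hidden layer of 29 logistic units, and one output unit. It is not the trained model of the invention: the random weight initialization, the logistic output unit, and the reading of the output as the probability of class C are assumptions, and all names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def logistic(z: float) -> float:
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(n_inputs: int, n_hidden: int,
                 seed: int = 0) -> Tuple[List[Vector], Vector, Vector, float]:
    """Small random weights and zero biases (an assumed initialization)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return w_hidden, b_hidden, w_out, 0.0


def forward(x: Vector, w_hidden: List[Vector], b_hidden: Vector,
            w_out: Vector, b_out: float) -> float:
    """Input layer -> logistic hidden layer -> logistic output unit."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


params = init_network(n_inputs=197, n_hidden=29)
probability_c = forward([0.0] * 197, *params)
print(probability_c)
```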

The preprocessed sample data are divided into a training set and a test set, preferably in a ratio of 8:2. The transfer function of the hidden layer is the logistic activation function, and Adam is selected as the weight optimizer. The prediction model is trained on the training set, its performance is evaluated on the test set, and its parameters are optimized.
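An 8:2 split of the 248 samples could be produced as in the sketch below. Stratifying by label, so that the C and NC classes keep roughly the same proportion in both sets, is an assumption that seems reasonable for a data set with 200 C and 48 NC compounds; the text only gives the ratio. The function name and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), split separately per class."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["C"] * 200 + ["NC"] * 48
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these counts the sketch places 40 C and 10 NC compounds in the test set; the actual partition used for the reported results is not given in the text.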

The model evaluation metrics comprise accuracy, precision, recall and F1 score. Accuracy is the proportion of all sample data that are predicted correctly; precision measures the probability that a sample predicted to be positive is truly positive, and it runs opposite to the false alarm rate, that is, the higher the precision, the lower the false alarm rate; the F1 score is the harmonic mean of precision and recall and gives a more comprehensive evaluation of the model.
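The four evaluation metrics can be computed from the counts of a binary confusion matrix, as in this sketch. It uses the standard definitions (recall being the share of truly positive samples that are predicted positive); the confusion counts in the example are made up for illustration and are not results from the invention. The second helper checks the harmonic-mean relation against the precision (0.893) and recall (0.847) reported in this document.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=6, fn=4))

# Consistency check on the reported precision and recall.
print(round(f1_from(0.893, 0.847), 3))
```

The recomputed F1 value agrees with the reported 0.870 to within the rounding of the three-decimal inputs.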

During training, the output values of each layer serve as the node values of the next layer, and the calculation continues until the final predicted value is output; the model is trained by iteratively updating the weights with the Adam optimization algorithm to obtain the optimal weights.
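The iterative weight update can be illustrated with a single Adam step written out in plain Python. The update rule is the commonly published form of Adam; the hyperparameter values (learning rate, beta1, beta2, epsilon) are the usual defaults and are assumptions here, since the text does not state them. The toy objective (w - 3)^2 merely stands in for the network's loss.

```python
import math
from typing import List, Tuple


def adam_step(weights: List[float], grads: List[float],
              m: List[float], v: List[float], t: int,
              lr: float = 0.001, beta1: float = 0.9,
              beta2: float = 0.999, eps: float = 1e-8
              ) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; t is the 1-based iteration counter."""
    new_w, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weights, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


# Toy example: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 1001):
    grad = [2.0 * (w[0] - 3.0)]
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w[0])
```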

The classification ability visualization graph of the prediction model of the present invention is shown in fig. 2, and the classification ability of the prediction model of the present invention is excellent.

Parameters are tuned for the maximum number of iterations, the number of hidden-layer neuron nodes and the alpha value; the number of hidden-layer neuron nodes is tuned by traversing candidate values. The relationship between the number of hidden-layer neuron nodes and the accuracy of the prediction model is shown in FIG. 3: with 29 hidden-layer neuron nodes, the model accuracy reaches its peak of 83.3%.
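The traversal over hidden-layer sizes amounts to scoring each candidate and keeping the best one. The sketch below shows only that loop: `score` is a hypothetical callback that would train a model with the given number of hidden nodes and return its test accuracy, and the toy scoring function used in the example is a stand-in, not data from the invention. The range of candidates actually traversed is not stated in the text.

```python
from typing import Callable, Iterable, Tuple


def best_hidden_size(candidates: Iterable[int],
                     score: Callable[[int], float]) -> Tuple[int, float]:
    """Return the candidate with the highest score, and that score."""
    best_n, best_score = None, float("-inf")
    for n in candidates:
        s = score(n)
        if s > best_score:  # keeps the first candidate when scores tie
            best_n, best_score = n, s
    if best_n is None:
        raise ValueError("no candidates were supplied")
    return best_n, best_score


def toy_score(n_hidden: int) -> float:
    """Stand-in for 'train the model and measure test accuracy'."""
    return -float((n_hidden - 7) ** 2)


print(best_hidden_size(range(1, 11), toy_score))
```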

After parameter optimization, the prediction model reaches an accuracy of 0.833, a precision of 0.893, a recall of 0.847 and an F1 score of 0.870.

The embodiments described above are intended to illustrate the technical solutions of the present invention in detail, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modification, supplement or similar substitution made within the scope of the principles of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for predicting the permeability of a compound placental membrane based on machine learning, comprising the steps of:

(1) establishing a compound placental membrane permeability judgment standard;

2. The method for predicting the placental membrane permeability of a compound based on machine learning according to claim 1, wherein in step (1), the compound placental membrane permeability is determined using two parameters, F/M and CI, wherein F/M is the fetal-maternal blood concentration ratio of the compound in vivo, and CI is the clearance index;

CI = placental permeability of the compound / placental permeability of antipyrine

3. The method of predicting placental membrane permeability based on machine learning of claim 1, wherein the compound is collected from literature experimental data or compound databases.

4. The method for predicting the placental membrane permeability of a compound based on machine learning according to claim 1, wherein the step of cleaning the data set comprises: filling in blank values in the BPBData data set; removing inorganic substances, salts and neutral molecules; and removing zero values, zero-variance values and highly correlated values;

5. The method according to claim 1, wherein in step (3), the preprocessed sample data is divided into a training set and a test set, the prediction model is trained by the training set, and the prediction model is evaluated by the test set to optimize parameters of the prediction model.

6. The method for predicting the placental membrane permeability of a compound based on machine learning according to claim 1, wherein the machine learning algorithm is selected from a random forest algorithm, a logistic regression algorithm, a naive Bayes algorithm, a support vector machine algorithm, or a neural network algorithm.

7. The method of predicting placental membrane permeability based on machine learning of claim 6, wherein the machine learning algorithm is a neural network algorithm comprising an input layer, a hidden layer and an output layer; the hidden layer is 1 layer, and the number of the neuron nodes is 29.

8. The method for predicting the placental membrane permeability of a compound based on machine learning according to claim 7, wherein the transfer function of the hidden layer is the logistic activation function and the Adam algorithm is used as the weight optimizer.

9. Use of a machine learning based prediction method of compound placental membrane permeability according to any one of claims 1-8 for predicting compound placental membrane permeability.