CN111798935A

CN111798935A - Universal compound structure-property correlation prediction method based on neural network

Info

Publication number: CN111798935A
Application number: CN201910280668.9A
Authority: CN
Inventors: 王晓华; 杨民民
Original assignee: Pharmablock Sciences (nanjing) Inc
Current assignee: Pharmablock Sciences (nanjing) Inc
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2020-10-20

Abstract

The invention discloses a universal compound structure-property correlation prediction method based on a neural network, which comprises the following steps: step 1, transforming a molecular descriptor into a characteristic vector form to construct a data set; step 2, dividing the data set into a training set and a testing set, sending the training set into a fully-connected neural network for training, and determining parameters of a fully-convolutional model; step 3, transforming the molecular descriptors to be predicted into a characteristic vector form, and predicting according to the full convolution model. The prediction method has higher accuracy in the prediction of solubility.

Description

Universal compound structure-property correlation prediction method based on neural network

Technical Field

The invention relates to a construction method of a prediction model, in particular to a universal compound structure-property correlation prediction method based on a neural network.

Background

Solubility is an essential property of compounds, particularly small molecule compounds. In general, different compounds have different solubilities under the same conditions in the same solvent because of their structure and spatial arrangement. Determining solubility plays an important role in the chemical industry, for example in chemical processes, process preparation, and the migration of chemical substances in medicines and in the environment.

However, because the variety of compounds is very large and different solutions require different storage and measurement conditions, it is impractical to measure the solubility of all compounds experimentally. An accurate, reliable and fast universal method for determining solubility from existing data is therefore urgently needed.

Such methods may be collectively referred to as quantitative structure-property correlation prediction (hereinafter, abbreviated as QSPR). QSPR is the latest solubility calculation and prediction method at present, that is, a fitting model is established according to the quantitative relationship between the calculated molecular structure parameter (molecular descriptor) of the compound and a specific property (such as solubility) for prediction, and the QSPR research is generally divided into three steps:

(1) calculating a molecular descriptor;

(2) establishing a prediction model;

(3) analyzing the prediction accuracy.

Disclosure of Invention

The invention aims to provide a universal compound structure-property correlation prediction method based on a neural network, which has higher accuracy in the prediction of solubility.

In order to achieve the above purpose, the solution of the invention is:

a universal compound structure-property correlation prediction method based on a neural network comprises the following steps:

step 1, transforming a molecular descriptor into a characteristic vector form to construct a data set;

step 2, dividing the data set into a training set and a testing set, sending the training set into a fully-connected neural network for training, and determining parameters of a fully-convolutional model;

step 3, transforming the molecular descriptors to be predicted into a characteristic vector form, and predicting according to the full convolution model.

The specific content of the step 1 is as follows: the molecular descriptor is divided into 2 parts, which are respectively corresponding to the molecular fingerprint calculated by the molecular structural formula and the general descriptor, and the 2 parts are connected to form a characteristic vector.

In step 2, the architecture of the fully-connected neural network is as follows: the first layer is an input layer, then a plurality of convolution layers are arranged, finally, a full convolution network with the convolution kernel size of [1,1] is used as a classification network, and the mean value of each layer is used for representing the category represented by the layer.

In the step 2, a back propagation algorithm or a gradient descent algorithm is adopted to train the data set.

In the step 2, the training set and the test set are divided in the following manner: 90% of the data set was taken as the training set and 10% of the data set was taken as the test set.

With this scheme, the descriptors are designed, calculated and combined in a general way, so that different description methods can be used interchangeably and combined. Because the prediction model adopts a deep-learning convolutional neural network, it can learn from and predict on inputs of indefinite length, which makes the model provided by the invention highly universal.

The invention has the following characteristics:

(1) the most common molecular structure notation at present, SMILES, may be of unlimited length, so the method has universality;

(2) descriptor data can be freely added, and only the same descriptor is added to the same data set, so that the method has universality;

(3) the descriptor characteristics are automatically extracted by the convolutional neural network, and meanwhile, the descriptor characteristics are trained by combining the labels, so that the method has simplicity;

(4) the convolutional neural network is ingenious in design, can achieve perfect accuracy rate in a very short training period, and has high practicability;

(5) the method is originally applied to QSPR, and the accuracy rate reaches the advanced level of the world.

Drawings

FIG. 1 is a schematic diagram of the structure of building molecular descriptors;

FIG. 2 is a schematic diagram of a fully connected layer classifier;

FIG. 3 is a diagram of a similar molecule structure;

FIG. 4 is a schematic diagram of a full convolution neural network architecture;

FIG. 5 is a schematic diagram of a full convolution neural network structure with parameters.

Detailed Description

1. Molecular descriptors

Molecular descriptors are the basis of QSPR. A molecular descriptor is a property or measure of a molecule, in one or more aspects, that is expressed numerically so that it can be understood and analyzed by a computer. Molecular descriptors can be direct numerical representations of the physicochemical properties of a molecule, or can be calculated from a variety of data indices according to a particular algorithm. The former include physicochemical indices of the compound such as boiling point and melting point; the latter relate more to quantities such as the outer energy of the molecule and the outer electronic charge distribution between bonds.

There are many ways to calculate molecular descriptors. At present, different software programs and packages can calculate almost six thousand physicochemical parameters covering the characteristic properties and structural features of compounds.

RDKit is free, open-source cheminformatics and machine learning software that provides C++ and Python APIs. It includes its own molecular descriptor calculation methods, whose result converts a molecular structural formula into a vector of 279 features.

As shown in fig. 1, the molecular descriptor in the model is divided into 2 parts, which are respectively the molecular fingerprint calculated corresponding to the molecular structural formula and the general descriptor, and then the two parts are connected through a connection operation to form a feature vector for representing the feature molecule.
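The connection operation described above can be sketched in plain Python. This is only an illustration: it assumes the fingerprint bits and the general descriptor values have already been calculated by a tool such as RDKit, and the function name `build_feature_vector` and all example values are hypothetical, not taken from the patent.

```python
from typing import List, Sequence


def build_feature_vector(fingerprint: Sequence[int],
                         general_descriptors: Sequence[float]) -> List[float]:
    """Concatenate fingerprint bits and general descriptors into one vector."""
    if any(bit not in (0, 1) for bit in fingerprint):
        raise ValueError("fingerprint must contain only 0/1 bits")
    # Part 1: molecular fingerprint; part 2: general descriptors.
    return [float(bit) for bit in fingerprint] + [float(v) for v in general_descriptors]


# Hypothetical values for a single molecule (illustration only).
fingerprint_bits = [0, 1, 1, 0, 0, 1, 0, 0]
descriptor_values = [46.07, -0.0014, 20.23]
feature_vector = build_feature_vector(fingerprint_bits, descriptor_values)
```

Every molecule in one data set would have to use the same fingerprint length and the same list of general descriptors, so that position k of the feature vector always has the same meaning.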

2. Full convolution neural network model

2.1 fully-connected neural networks

A neural network is generally built from basic modules such as convolutional layers, activation layers and fully connected layers. The convolutional layer is responsible for feature extraction; its key component is the convolution kernel, which essentially extracts features from the signal at each level. Multiple convolution kernels together form a convolutional layer, which is connected through the neurons of its kernels to the previous convolutional layer and to the next convolutional layer or fully connected layer. The features of the previous layer are convolved with a learnable convolution kernel and passed through an activation function to give the output feature map; each output feature map may combine the convolved values of several input feature maps.

$$x_j^{l} = F\left(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right) \tag{1}$$

In formula (1), $x_i^{l-1}$ is the output of a convolution kernel of the previous convolutional layer; it is convolved with the convolution kernel $k_{ij}^{l}$ of the current convolutional layer, and the bias $b_j^{l}$ is then added to the result. The function F is generally called an activation function and is usually the tanh or ReLU function. The structure thus formed is shown in FIG. 2.
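The convolutional layer computation described above (convolve the previous layer's outputs with the current kernels, add a bias, apply an activation function) can be sketched for one-dimensional feature maps as follows. This is a minimal sketch under stated assumptions: ReLU is chosen as the activation, the kernel is applied without flipping (as most deep learning libraries do), and the function names and example numbers are hypothetical.

```python
from typing import List


def relu(value: float) -> float:
    """ReLU activation: negative values are clipped to zero."""
    return value if value > 0.0 else 0.0


def conv1d_forward(inputs: List[List[float]],
                   kernels: List[List[List[float]]],
                   biases: List[float]) -> List[List[float]]:
    """inputs[i] is input feature map i; kernels[j][i] links input map i to output map j."""
    outputs = []
    for kernel_set, bias in zip(kernels, biases):
        width = len(kernel_set[0])
        out_len = len(inputs[0]) - width + 1  # "valid" convolution, stride 1
        feature_map = []
        for pos in range(out_len):
            total = bias
            # Sum the contributions of every input feature map.
            for in_map, kernel in zip(inputs, kernel_set):
                total += sum(in_map[pos + t] * kernel[t] for t in range(width))
            feature_map.append(relu(total))
        outputs.append(feature_map)
    return outputs


# One input map, two output maps (illustrative numbers).
x_prev = [[1.0, 2.0, 3.0, 4.0]]
k = [[[1.0, -1.0]], [[0.5, 0.5]]]
b = [0.0, 1.0]
x_next = conv1d_forward(x_prev, k, b)
```

The first kernel computes differences of neighbouring values, which are all negative here and are clipped to zero by ReLU; the second computes neighbouring sums scaled by 0.5 and shifted by the bias.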

The fully connected layer is used to classify the data. The features extracted by the previous layer are flattened into a one-dimensional vector, which is connected to a number of fully connected units chosen according to the designer's experience, and the result is computed by matrix multiplication. The final classification is computed using a softmax or sigmoid function as the activation function, and the final class is output according to the resulting probabilities.
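A fully connected classifier of the kind just described can be sketched as follows: flatten the feature maps, multiply by a weight matrix, add biases and apply softmax. The weights, the function names and the three-class setup are hypothetical and serve only to illustrate the mechanism.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected_classify(feature_maps: List[List[float]],
                             weights: List[List[float]],
                             biases: List[float]) -> List[float]:
    """Flatten the feature maps, apply one dense layer, return class probabilities."""
    flat = [v for fmap in feature_maps for v in fmap]
    # The weight matrix fixes the input size: every input must flatten to this length.
    if any(len(row) != len(flat) for row in weights):
        raise ValueError("flattened input length does not match the weight matrix")
    scores = [sum(w * x for w, x in zip(row, flat)) + bias
              for row, bias in zip(weights, biases)]
    return softmax(scores)


maps = [[1.0, 2.0], [0.5, 0.0]]      # flattens to 4 values
w = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]           # 3 classes x 4 inputs
probabilities = fully_connected_classify(maps, w, [0.0, 0.0, 0.0])
predicted_class = max(range(3), key=probabilities.__getitem__)
```

The explicit length check makes one property of this design visible: an input that flattens to a different length cannot be processed without changing the weight matrix.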

In addition, in order to train the neural network, a downsampling layer and a regularization layer are generally added between convolution layers to process an activation result, so that the fitting result is improved, overfitting is reduced, and the training speed is increased.

2.2 full convolution neural network architecture

The classification function of a traditional neural network is performed by a final fully connected layer, which maps the features extracted by the convolutional layers to a specific label space. This has the advantage that, after a matrix multiplication, the result can conveniently be computed with a softmax or sigmoid function. However, it introduces a large number of redundant parameters and makes poor reuse of the data. Another significant drawback of the fully connected layer is that when the input data is flattened into a one-dimensional vector, the structural relationship between the data is lost. In addition, because a fully connected layer is computed as a matrix multiplication, all of its inputs must be vectors of the same dimension.

A conventional convolutional neural network obtains rotation invariance in image recognition through spatial pooling. In chemistry, however, the properties of a molecule depend strongly on the positions of its structural elements: as shown in FIG. 3, even slightly different structures can have quite different properties.

This document uses a new neural network based on full convolution to classify molecular descriptor feature vectors. Starting from feature extraction by the convolutional layers, the feature maps with the highest response are obtained, the features of the corresponding layer are summarized by global pooling, and the strongest resulting feature is taken as the classification result. As shown in FIG. 4 and FIG. 5, the specific architecture is as follows: the first layer is an input layer that receives the input converted into molecular fingerprints; it is followed by several convolutional layers; finally, a full convolution network with a convolution kernel size of [1,1] is used as the classification network, and the mean value of each layer is used to represent the category represented by that layer.
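The classification head described above (a [1,1] convolution followed by taking the mean of each resulting layer) can be sketched as follows. This is an illustrative reading of the text, not the patented implementation: the [1,1] convolution is modelled as the same linear map applied at every position, the mean is a global average over positions, and the weights and inputs are hypothetical.

```python
from typing import List


def pointwise_conv(feature_map: List[List[float]],
                   weights: List[List[float]],
                   biases: List[float]) -> List[List[float]]:
    """[1,1] convolution: feature_map[pos][channel] -> scores[pos][class]."""
    return [[sum(w * v for w, v in zip(row, position)) + bias
             for row, bias in zip(weights, biases)]
            for position in feature_map]


def global_average_pool(class_maps: List[List[float]]) -> List[float]:
    """Average each class map over all positions."""
    count = len(class_maps)
    num_classes = len(class_maps[0])
    return [sum(pos[c] for pos in class_maps) / count for c in range(num_classes)]


def classify(feature_map, weights, biases) -> int:
    """Return the index of the class whose map has the largest mean."""
    means = global_average_pool(pointwise_conv(feature_map, weights, biases))
    return max(range(len(means)), key=means.__getitem__)


w = [[1.0, 0.0], [0.0, 1.0]]          # 2 classes, 2 input channels
b = [0.0, 0.0]
short_input = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]                        # 3 positions
long_input = [[0.0, 1.0], [0.0, 2.0], [1.0, 3.0], [0.0, 1.0], [0.0, 3.0]]  # 5 positions
short_class = classify(short_input, w, b)
long_class = classify(long_input, w, b)
```

Because the weights act per position and the pooling averages over however many positions exist, the same parameters handle inputs of different lengths, which a fully connected layer cannot do.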

2.3 training of full convolution neural networks

The training of convolutional neural networks is actually to find a set of optimal solutions in a data space that is assumed to exist, so that the value of the calculated objective function (loss function) is minimized. In theory, the data space is infinite, and the combination of solutions is infinite, so that it is impossible to artificially set a set of optimal solutions.

The common training methods for neural networks are the back propagation (BP) algorithm and the gradient descent (GD) algorithm, and the same holds for the full convolution neural network herein. The input data is propagated forward, the error between the network output and the actual value is calculated by a loss function and propagated backwards layer by layer, the partial derivative of the error with respect to each convolution kernel value is calculated, and the weights and biases are updated according to these partial derivatives.

Here, conv is the convolution operation, and the two remaining symbols denote the parameters of the convolution kernel and the output of the previous convolution kernel, respectively. When the error is propagated backwards, the current convolution kernel rotated by 180 degrees is used as the weight multiplier.
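The back propagation step for a convolution can be illustrated with a single one-dimensional kernel. The sketch below is an assumption-laden toy: it uses a squared-error loss, no activation function, and hypothetical names and numbers. It shows the two facts mentioned above: the kernel gradient is obtained from the layer input and the output error, and the error passed to the previous layer is obtained with the flipped kernel (the 1-D analogue of a 180-degree rotation).

```python
from typing import List, Tuple


def forward(x: List[float], kernel: List[float], bias: float) -> List[float]:
    """Valid 1-D convolution (kernel applied without flipping) plus bias."""
    width = len(kernel)
    return [sum(x[p + t] * kernel[t] for t in range(width)) + bias
            for p in range(len(x) - width + 1)]


def loss(y: List[float], target: List[float]) -> float:
    """Squared-error loss E = 0.5 * sum((y - target)^2)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, target))


def backward(x: List[float], kernel: List[float],
             delta: List[float]) -> Tuple[List[float], float, List[float]]:
    """delta[p] = dE/dy[p]; returns (dE/dkernel, dE/dbias, dE/dx)."""
    width = len(kernel)
    grad_kernel = [sum(delta[p] * x[p + t] for p in range(len(delta)))
                   for t in range(width)]
    grad_bias = sum(delta)
    # Error for the previous layer: slide the flipped kernel over the padded error.
    flipped = kernel[::-1]
    padded = [0.0] * (width - 1) + list(delta) + [0.0] * (width - 1)
    grad_input = [sum(padded[n + t] * flipped[t] for t in range(width))
                  for n in range(len(x))]
    return grad_kernel, grad_bias, grad_input


x = [1.0, 2.0, 3.0, 4.0]
kernel, bias, target, rate = [0.5, -0.5], 0.0, [1.0, 1.0, 1.0], 0.01
y = forward(x, kernel, bias)
delta = [a - b for a, b in zip(y, target)]
grad_kernel, grad_bias, grad_input = backward(x, kernel, delta)
# One gradient descent update of the weights and the bias.
new_kernel = [k - rate * g for k, g in zip(kernel, grad_kernel)]
new_bias = bias - rate * grad_bias
loss_before = loss(y, target)
loss_after = loss(forward(x, new_kernel, new_bias), target)
```

In a multi-layer network, `grad_input` would be the error handed to the preceding layer, where the same procedure is repeated.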

2.4 discussion of some details of the full convolution neural network

The neural network used herein is a full convolution neural network. Current research finds three main factors affecting a convolutional neural network: the number of convolutional layers, the number of convolution kernels, and the organization of the neural network. In practice, ResNet (from Microsoft Research) adds the input of a layer to the learned residual, which solves the problem of gradients vanishing as the number of convolutional layers increases. Google's Inception and its subsequent versions design a network with a good local topology: several convolution or pooling operations are applied to the input image in parallel and all output results are concatenated into a very deep feature map, which greatly increases the number of convolution kernels without greatly increasing the number of parameters.

The fully convolutional neural network proposed herein changes the organization of the conventional neural network by replacing the final fully connected classification layer with convolutional layers, which is why it is called a "fully convolutional" neural network.

3. QSPR model training based on full convolution neural network

3.1 data Structure and data conversion

SMILES (Simplified Molecular Input Line Entry System) is a specification for unambiguously describing a molecular structure using ASCII character strings. SMILES was developed by Arthur Weininger and David Weininger in the late 1980s and was later modified and extended by others, in particular by Daylight Chemical Information Systems Inc.

TABLE 1

Table 1 shows SMILES strings and the corresponding compound structures as displayed by the software. Different SMILES strings correspond to different molecular structures, and the corresponding descriptors can be calculated from a SMILES string by software (RDKit herein), as shown in Table 2.

TABLE 2

For a single ordinary SMILES string, the generated molecular descriptor is a vector of dimension [1,200], which represents the molecule and serves as the input data of the model. Solubility itself is a continuous value, so it is manually divided into classes that are encoded with one-hot coding; for simplicity, 10 classes are used herein.
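The conversion of a continuous solubility value into one of 10 one-hot classes can be sketched as follows. The text does not say how the class boundaries are chosen, so this sketch assumes equal-width bins between a lower and an upper bound; the bounds, the function names and the example values are hypothetical.

```python
from typing import List


def to_class_index(value: float, low: float, high: float, num_classes: int = 10) -> int:
    """Map a continuous value to an equal-width bin index in [0, num_classes - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    index = int(fraction * num_classes)
    # Values at or beyond the bounds are clamped into the first or last bin.
    return min(max(index, 0), num_classes - 1)


def one_hot(index: int, num_classes: int = 10) -> List[int]:
    """Return a one-hot label vector with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(num_classes)]


# Hypothetical log-solubility values and bounds (not from the patent's data).
log_solubilities = [-11.6, -3.0, 1.58, 2.0]
class_indices = [to_class_index(v, low=-12.0, high=2.0) for v in log_solubilities]
labels = [one_hot(i) for i in class_indices]
```

Other binning schemes, for example quantile-based bins that put the same number of records in each class, would fit the description equally well.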

3.2 specific design and parameters of the full convolution model

To address the problem of features not being distinct enough during training, the network uses multiple dense connections: during feature extraction by the convolutional layers, a direct connection is established between any two layers, and the input of each layer is the union of the outputs of all preceding layers. All feature information extracted by a layer is likewise passed on to the following layers, up to the final global convolutional layer. The global convolutional layer spatially maps the extracted features, mapping the most significant features onto different spatial levels in order to determine the corresponding category. After sufficient extraction and mapping, each input value is thus assigned to a specific target interval.
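The dense connection pattern described above, in which each layer receives the outputs of all preceding layers, can be sketched as follows. The sketch only shows the wiring: `toy_layer` is a hypothetical stand-in for a real convolutional layer, and the number of layers and of new channels per layer (`growth`) are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Channel = List[float]


def toy_layer(layer_index: int, inputs: List[Channel], growth: int) -> List[Channel]:
    """Hypothetical stand-in for a convolutional layer: scaled ReLU of the channel mean."""
    length = len(inputs[0])
    mean = [sum(ch[p] for ch in inputs) / len(inputs) for p in range(length)]
    return [[max(0.0, (g + 1) * m) for m in mean] for g in range(growth)]


def dense_block(input_channels: List[Channel], num_layers: int, growth: int,
                layer: Callable[[int, List[Channel], int], List[Channel]] = toy_layer
                ) -> Tuple[List[Channel], List[int]]:
    """Run layers so that each one sees the union of all earlier outputs."""
    features = [list(ch) for ch in input_channels]
    input_widths = []
    for layer_index in range(num_layers):
        input_widths.append(len(features))       # channels visible to this layer
        features.extend(layer(layer_index, features, growth))
    return features, input_widths


features, input_widths = dense_block([[1.0, -2.0, 3.0]], num_layers=3, growth=2)
```

With one input channel, three layers and two new channels per layer, the layers see 1, 3 and 5 channels respectively, and 7 channels reach the final classification stage.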

3.3 full convolution model QSPR experimental results and analysis

3.3.1 Experimental data set

There are 3 data sets used herein, in order:

1) Abraham octanol solubility dataset

2) Delaney water solubility dataset

3) Tox21 toxicity data set

The Abraham and Delaney data sets have 283 and 1144 records, respectively; the structural formulas are given as SMILES, and the solubility values are given as calculated log values.

Tox21 is derived from the Tox21 program of the NIH Chemical Genomics Center (NCGC) of the National Institutes of Health in Rockville, Maryland, USA, from which 12 groups of data were selected (nr-ahr, nr-ar, nr-ar-lbd, nr-aromatase, nr-er, nr-er-lbd, nr-ppar-gamma, sr-are, sr-atad5, sr-hse, sr-mmp, sr-p53). Each group contains approximately 8000 records.

All data sets were divided into a training set and a test set, with 90% of each data set used for training and 10% for testing.
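The 90%/10% division can be sketched as follows. The text does not state whether the records are shuffled before splitting; this sketch assumes a seeded random shuffle, and the function name and the placeholder records are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence, train_fraction: float = 0.9,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder records with the same count as the Delaney data set (1144).
records = list(range(1144))
train_set, test_set = split_dataset(records)
```

For 1144 records this gives 1030 training records and 114 test records; for the 283 records of the Abraham data set it would give 255 and 28.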

3.3.2 Experimental results and analysis

The full convolution neural network model was implemented with the TensorFlow library as the basic framework. The main hardware was two NVIDIA 1070 graphics cards used as graphics processors; the batch size was set to 50, the number of training iterations was not limited, the learning rate was 0.0001, and the optimizer was a stochastic gradient descent optimizer (GradientDescentOptimizer).

In each iteration of the model training, the parameters of all convolutional layers are involved in the calculation and are updated, and all the parameters are parameters of the convolutional filter. And the model simultaneously calculates the accuracy of the training set and the test set, and stops the model training when the accuracy on the training set is more than 0.9999. The results of the verification are shown in the following table:
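The training schedule described above (batches of 50 records, training continued until the accuracy on the training set exceeds 0.9999) can be sketched as a generic loop. The `train_step` and `accuracy` callables stand for the real model, which is not reproduced here; the toy "memorizing" model, the `max_epochs` safety cap and all names are assumptions added for illustration.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Record = Tuple[int, int]  # (record id, class label), hypothetical layout


def iterate_batches(records: Sequence[Record], batch_size: int = 50) -> Iterator[Sequence[Record]]:
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def train_until_accurate(records: Sequence[Record],
                         train_step: Callable[[Sequence[Record]], None],
                         accuracy: Callable[[Sequence[Record]], float],
                         threshold: float = 0.9999,
                         max_epochs: int = 1000) -> List[float]:
    """Train epoch by epoch and stop once training accuracy exceeds the threshold."""
    history = []
    for _ in range(max_epochs):              # safety cap, not part of the source text
        for batch in iterate_batches(records):
            train_step(batch)
        history.append(accuracy(records))
        if history[-1] > threshold:
            break
    return history


# Toy stand-in model: memorizes at most 25 new records per batch.
memory: Dict[int, int] = {}


def toy_train_step(batch: Sequence[Record]) -> None:
    for key, label in [r for r in batch if r[0] not in memory][:25]:
        memory[key] = label


def toy_accuracy(records: Sequence[Record]) -> float:
    return sum(memory.get(key) == label for key, label in records) / len(records)


records = [(i, i % 10) for i in range(120)]
history = train_until_accurate(records, toy_train_step, toy_accuracy)
```

Stopping on training accuracy alone says nothing about generalization, so the accuracy on the held-out test set still has to be evaluated separately.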

TABLE 3

Data set	SVM	Logistic regression	Full convolution model
Abraham	0.38	0.17	0.79+
Tox21	0.76	0.21	0.9999+
Delaney	0.51	0.15	0.92+

As can be seen from Table 3, the full convolution model proposed herein achieves very good accuracy on each data set. On the Tox21 data set in particular, the average of the 12 verification results reaches essentially 100% accuracy, which is essential for toxicity verification. The results on the Abraham and Delaney data sets are less ideal than those on Tox21, most likely because these data sets are too small to cover all the required training points.

4. Conclusion

Experiments were performed on different sets of molecular activity data using a fully convolutional neural network. Experiments show that: compared with the traditional machine learning tool, the full convolution neural network can obtain the best accuracy rate on small data, and the training speed is not obviously reduced.

The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims

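1. A universal compound structure-property correlation prediction method based on a neural network, characterized by comprising the following steps: step 1, transforming a molecular descriptor into a characteristic vector form to construct a data set; step 2, dividing the data set into a training set and a testing set, sending the training set into a fully-connected neural network for training, and determining parameters of a fully-convolutional model; step 3, transforming the molecular descriptors to be predicted into a characteristic vector form, and predicting according to the full convolution model.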

2. The prediction method of claim 1, wherein: the specific content of the step 1 is as follows: the molecular descriptor is divided into 2 parts, which are respectively corresponding to the molecular fingerprint calculated by the molecular structural formula and the general descriptor, and the 2 parts are connected to form a characteristic vector.

3. The prediction method of claim 1, wherein: in step 2, the architecture of the fully-connected neural network is as follows: the first layer is an input layer, then a plurality of convolution layers are arranged, finally, a full convolution network with the convolution kernel size of [1,1] is used as a classification network, and the mean value of each layer is used for representing the category represented by the layer.

4. The prediction method of claim 1, wherein: in the step 2, a back propagation algorithm or a gradient descent algorithm is adopted to train the data set.

5. The prediction method of claim 1, wherein: in step 2, the training set and the test set are divided in the following manner: 90% of the data set was taken as the training set and 10% of the data set was taken as the test set.