CN111986735B

CN111986735B - Calculation method for predicting atomic multipole moments in RNA using an ARDGPR model

Info

Publication number: CN111986735B
Application number: CN202010837717.7A
Authority: CN
Inventors: 袁永娜; 刘振宇
Original assignee: Lanzhou University
Current assignee: Lanzhou University
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2023-05-26
Anticipated expiration: 2040-08-19
Also published as: CN111986735A

Abstract

The invention relates to a calculation method for predicting high-order atomic multipole moments in RNA based on an ARDGPR model. The method comprises the following steps: optimizing the structures of all small RNA molecular fragments with the quantum mechanical calculation software Gaussian09, and computing the high-order multipole moments of the atoms in each fragment by integration with the AIMALL software; for each small molecular fragment, selecting the coordinates of all atoms in a portion of the fragment conformations, together with the high-order multipole moments of those atoms, to train an ARDGPR model; and verifying the predictions of the ARDGPR model using the remaining fragment conformations as a test set. By replacing quantum mechanical calculation with ARDGPR model prediction, the invention can, on the basis of force-field-based molecular mechanics simulation, rapidly provide physicochemical parameters such as the energy and the high-order atomic multipole moments for different conformations. Predicting the high-order atomic multipole moments with the trained ARDGPR model is fast, inexpensive, accurate and simple, saves substantial manpower, material and financial resources, and provides a basic tool and a fast route for improving the simulation accuracy of RNA molecular force fields.

Description

Calculation method for predicting atomic multipole moments in RNA using an ARDGPR model

Technical Field

The invention belongs to the field combining quantum mechanical and molecular mechanical calculation, and particularly relates to a calculation method for predicting high-order atomic multipole moments in RNA based on an ARDGPR model.

Background

RNA analysis is an important topic in modern analytical science and is the basis for elucidating RNA function and exploring the molecular mechanisms of disease. Traditional experimental methods for determining RNA structure are costly and cannot observe and record the molecular state at every moment of an RNA biological process, so they cannot provide sufficient information on RNA secondary structure. With the development of interdisciplinary science, computer technology is increasingly applied in chemistry, biology and related fields to obtain molecular information that traditional chemical and biological experiments obtain only with difficulty or not at all. Computational chemistry uses computer simulation to obtain molecular properties such as vibrational frequencies, interatomic interactions and energies, giving researchers more chemical information and compensating for the shortcomings of conventional experiments.

Common computer simulation methods include quantum mechanical (Quantum Mechanics, QM) calculation and molecular mechanical (Molecular Mechanics, MM) calculation. Quantum mechanics describes the motion of the electrons in a molecule accurately, so its results are accurate but very time-consuming to obtain. Molecular mechanics does not consider the motion of the electrons; its results are less accurate than those of quantum mechanics, but it is much faster and therefore better suited to simulating biological macromolecular systems. In molecular mechanics, RNA secondary structure is studied mainly by simulation with a molecular force field. Intramolecular non-bonded interaction energies, comprising the interatomic electrostatic interaction energy and van der Waals interactions, play the dominant role in stabilizing RNA structure. Widely used molecular force fields such as AMBER, CHARMM and OPLS currently calculate the interatomic electrostatic interaction energy from atomic point charges, which cannot accurately describe the actual distribution and polarization of the electron cloud in a molecule, so the simulation results are inaccurate. To improve the accuracy of molecular force field simulation, and thus obtain more reliable structure predictions when simulating RNA, the invention improves the RNA molecular force field by calculating the interatomic electrostatic interaction energy from high-order atomic multipole moments.

Calculating high-order atomic multipole moments by quantum chemical methods is, however, very time-consuming and computationally expensive. To address this, the invention provides a calculation method for predicting high-order atomic multipole moments in RNA molecules based on an ARDGPR model. The method uses the conventional ARDGPR machine learning model, trained on a data set of high-order atomic multipole moments in RNA molecules, so that these moments can be predicted rapidly and accurately. This greatly reduces the cost of obtaining high-order atomic multipole moments, removes the high cost of calculating them by quantum chemical methods, and accelerates the improvement of molecular force fields.

Disclosure of Invention

Existing molecular force fields calculate the interatomic electrostatic interaction energy from atomic point charges, and their simulation results are not accurate enough. Calculating high-order atomic multipole moments by quantum chemical methods would improve the simulation accuracy of the RNA molecular force field, but at enormous cost. The invention aims to provide a machine learning method based on an ARDGPR model that predicts the high-order multipole moments of atoms (including dipole, quadrupole, octupole and hexadecapole moments and so on) rapidly, efficiently, accurately and at low cost, so as to improve the simulation accuracy of the RNA molecular force field.

The principle of the invention is to predict the high-order atomic multipole moments in RNA molecules by a machine learning method, thereby establishing a machine learning model that can be used for this prediction. The following technical scheme is adopted:

A calculation method for predicting high-order atomic multipole moments in RNA molecules based on an ARDGPR model, the method comprising the following steps:

selecting RNA molecules of different structures and sizes from a database, cutting the RNA molecules into different small molecular fragments, and obtaining the molecular structure information of the small molecular fragments and the spatial positions of the target atoms to be predicted within the fragments using the Discovery Studio software; obtaining the ALF coordinate information of the target atoms by ALF coordinate conversion; and obtaining the high-order multipole moments of the target atoms by quantum mechanical calculation;

selecting from the data set the atomic coordinates and high-order atomic multipole moments of a portion of the small molecular fragments to train an ARDGPR model and obtain the parameters of the ARDGPR model; and verifying the predictions of the ARDGPR model using the remaining small molecular fragments as a test set;

wherein verifying the predictions of the ARDGPR model using the remaining small molecular fragments as a test set comprises:

selecting the remaining small molecular fragments from the data set and verifying the predictions of the ARDGPR model by calculating the high-order multipole moments of each atom;

selecting the remaining small molecular fragments from the data set and verifying the predictions of the ARDGPR model by calculating the interatomic electrostatic interaction;

selecting the remaining small molecular fragments from the data set and comparing the ARDGPR model against the test-set performance of four machine learning models, Bagging, RBFNN, GRNN and GPR, trained on the same training set.

Preferably, obtaining the high-order multipole moments of the target atom by quantum mechanical calculation comprises: according to the molecular structure information of the small molecular fragments, inputting the hydrogen-saturated fragment molecules into the Gaussian09 software and performing the calculation at the B3LYP/apc-1 level of theory; and finally inputting the result into the AIMALL software, which integrates over the electron cloud distribution of each atom to give the high-order multipole moments of the target atom.

Preferably, selecting the atomic coordinates and high-order atomic multipole moments of a portion of the small molecular fragments to train the ARDGPR model comprises: converting the coordinates of each atom in the small molecular fragment from the global coordinate system to the atomic local coordinate system through ALF; taking the ALF coordinate information of the small molecular fragment as the input of the ARDGPR model and the high-order multipole moments of the atoms in the fragment as its output; and selecting part of the data as the training set for the ARDGPR model by Sobol sequence sampling of size 100.

Preferably, selecting the remaining small molecular fragments from the data set and verifying the predictions of the ARDGPR model by the high-order multipole moments of each atom in the fragments comprises the following steps:

selecting the remaining small molecular fragments and computing the high-order multipole moments of the atoms in each fragment by integration with the AIMALL software, recorded as parameter A; recording the corresponding high-order multipole moments predicted by the ARDGPR model as parameter B; and, taking parameter A as the true value and parameter B as the predicted value, verifying the predictions by analyzing the errors.

Preferably, selecting the remaining small molecular fragments from the data set and verifying the predictions of the ARDGPR model by calculating the interatomic electrostatic interaction comprises the following steps:

selecting the remaining small molecular fragments and calculating the interatomic electrostatic interaction energy C from the true value A;

calculating the interatomic electrostatic interaction energy D from the predicted value B by program calculation;

taking the interatomic electrostatic interaction energy C as the true value and the interatomic electrostatic interaction energy D as the predicted value, and verifying the predictions by analyzing the errors.

Preferably, selecting the remaining small molecular fragments from the data set and comparing the ARDGPR model against the test-set performance of the four machine learning models Bagging, RBFNN, GRNN and GPR trained on the same training set comprises the following steps:

training the four machine learning models Bagging, RBFNN, GRNN and GPR on the same training data set used to build the ARDGPR model; selecting the remaining small molecular fragments as the test set, and recording the high-order atomic multipole moments predicted for them by the four machine learning models as parameter E;

calculating the interatomic electrostatic interaction energy F from the predicted parameter E by program calculation;

comparing the errors of parameters B and E, and comparing the errors of the interatomic electrostatic interaction energies D and F, to verify the predictions of the ARDGPR model.

Preferably, the small molecular fragments comprise: phosphoric acid molecules, pentose molecules, four base molecules, phosphoric acid-pentose molecules, four pentose-base molecules, nucleotide molecules, and pairs of two base-paired nucleotide molecules.

Preferably, the exemplary small molecular fragment is a pentose molecule.

Preferably, the high-order multipole moments include the point charge, dipole moment, quadrupole moment, octupole moment, hexadecapole moment and so on.

Preferably, the physicochemical parameters include the molecular energy and the atomic point charge, dipole moment, quadrupole moment, octupole moment, hexadecapole moment and so on.

The beneficial effects of the invention are as follows:

(1) The invention obtains, by quantum mechanical calculation, the energy of each small molecular fragment of the RNA molecule and the parameters of its high-order atomic multipole expansion.

(2) The ARDGPR model is trained on the parameters obtained by quantum mechanical calculation, so that the parameters of the high-order atomic multipole expansion of a small molecular fragment can be obtained rapidly from the atomic coordinates in the fragment.

(3) Compared with existing molecular force fields, the method takes the high-order multipole moments of atoms into account, so the calculated parameters are more accurate and more useful in molecular dynamics simulation.

(4) The method can train the ARDGPR model with relatively few RNA molecular structures while still ensuring high prediction accuracy. Predicting the high-order atomic multipole moments with the trained ARDGPR model is fast, inexpensive, accurate and simple, saves substantial manpower, material and financial resources, and provides a basic tool and a fast route for improving the simulation accuracy of RNA molecular force fields.

Drawings

FIG. 1 is a flow chart of a calculation method of the present invention;

FIG. 2 is a schematic diagram of the pentose molecular structure and the atomic local frame (ALF);

FIG. 3 is a scatter plot of the O1 atom data sets (point charge);

FIG. 4 shows the RBFNN training error;

FIG. 5 shows the GRNN training error;

FIG. 6 shows the Bagging training error;

FIG. 7 shows the Bagging training error for selected numbers of leaf nodes;

FIG. 8 is the GPR residual plot;

FIG. 9 is the ARDGPR residual plot;

FIG. 10 is the Bagging residual plot;

FIG. 11 is the GRNN residual plot;

FIG. 12 is the RBFNN residual plot;

FIG. 13 is a plot of the absolute error of the O1 atomic point charge predicted by the five models versus the data ratio;

FIG. 14 is a plot of the absolute prediction error versus the data ratio for the 25 components;

FIG. 15 is a plot of the absolute prediction error versus the data ratio for the five multipole moments of the O1 atom.

Detailed Description

The invention is further illustrated by the following examples.

The invention provides a calculation method for predicting high-order atomic multipole moments in RNA based on an ARDGPR model. The high-order atomic multipole moments comprise the atomic point charge, dipole moment, quadrupole moment, octupole moment, hexadecapole moment and so on.

As shown in fig. 1, the present invention includes the steps of:

First, RNA molecules of different structures and sizes are selected from the Protein Data Bank (PDB) and cut into small molecular fragments, which comprise: phosphoric acid molecules, pentose molecules, the four base molecules (A, C, G and U), phosphoric acid-pentose molecules, the four pentose-base molecules, nucleotide molecules, and nucleotide…nucleotide pairs. Next, the structures of all small molecular fragments are optimized with the quantum mechanical calculation software Gaussian09, and the high-order multipole moments of the atoms in each fragment, including point charges, dipole moments, quadrupole moments, octupole moments, hexadecapole moments and so on, are computed by integration with the AIMALL software. Then, for each small molecular fragment, the atomic coordinates and high-order atomic multipole moments of a portion of the fragment structures are used to train the ARDGPR model and four other machine learning models: Bagging, RBFNN, GRNN and GPR. The remaining structures serve as the test set, on which the ARDGPR model and the four other models predict the high-order atomic multipole moments; the interatomic electrostatic interaction energy is then obtained from these predictions by program calculation. Finally, the predictive performance of the ARDGPR model is verified on its predictions of the high-order atomic multipole moments and interatomic electrostatic interaction energies for the remaining structures, and is compared with that of the Bagging, RBFNN, GRNN and GPR models to evaluate the practical value of the ARDGPR model.

All RNA molecules were downloaded from the PDB database (http://www.rcsb.org/). The invention is further illustrated below by taking as an example the O1 atom of the pentose fragments obtained by cutting the selected RNA molecules. In the verification stage, because the interatomic electrostatic interaction can be calculated from the high-order atomic multipole moments, the predictive performance of the ARDGPR model is verified here from only two angles: by the high-order multipole moments that the ARDGPR model predicts for atoms in the remaining small molecular fragments, and by comparison with the predictions of the four machine learning models Bagging, RBFNN, GRNN and GPR.

The method comprises the following specific steps:

(1) Obtaining the high-order multipole moments of the target atom to be predicted by quantum mechanical calculation

First step: cutting the RNA molecules to obtain pentose molecular fragments

We randomly selected 300 RNA structures from the PDB database. The pentose fragments in the RNA molecules were then extracted with an extraction program written by our research group. Because the H atom is small, existing experimental techniques have difficulty determining its position accurately, so the structures of some RNAs in the PDB database contain coordinates only for the C, O, P and N atoms and none for the H atoms. To make the calculation of high-order atomic multipole moments possible and to restore the original neutral structure of the pentose, the H atom positions of the pentose structures are recovered in batches with the automatic hydrogen saturation function of the Discovery Studio software. This finally yields the required pentose molecular structure information.

Each structure is recorded as a list of atoms with their Cartesian coordinates, where $\mathrm{atom}_i$ is the $i$-th atom in pentose, $i = 1, 2, \ldots, n$, and $x_i$, $y_i$, $z_i$ give the position of the $i$-th atom in space.

Second step: obtaining the ALF coordinate information of the target atom O1 by ALF coordinate conversion

From the spatial positions of the atoms obtained in the first step, and because the target atom under study is the O1 atom of pentose, an ALF (atomic local frame) is constructed with the O1 atom as the origin, as shown in FIG. 2. Both atoms bonded to O1 are carbon atoms. Under the Cahn-Ingold-Prelog rules, the neighbor with the higher atomic number has the higher priority; when the atomic numbers are equal, the atomic numbers of the atoms bonded to those neighbors are compared, and so on. Because C1 is bonded to N1, C1 has the higher priority. The direction from O1 to C1 is therefore chosen as the x-axis, the plane through O1, C1 and C4 as the xy-plane, and the direction perpendicular to the xy-plane as the z-axis. The three ALF atoms are the central atom O1, the highest-priority atom C1 and the next-highest-priority atom C4, and their relative positions are fully described by $R_{AX}$, $R_{AY}$ and $v$. The remaining atoms in the molecule are "non-ALF" atoms, which are described by spherical polar coordinates centered on the O1 atom. For a non-ALF atom $i$, these coordinates are the three parameters $\gamma$, $\theta$ and $\varphi$. In FIG. 2, taking C5 as an example, the distance between C5 and O1 is $\gamma$, the angle between the O1-C5 direction and the z-axis is $\theta$, and the angle between the projection of C5 onto the xy-plane and the x-axis is $\varphi$.

For a molecule of $N$ atoms, the ALF coordinate system thus contains three ALF coordinates and $3(N-3)$ spherical polar coordinates, that is, $3 + 3(N-3) = 3N - 6$ molecular geometric descriptors in total. The pentose molecules we studied have 27 atoms after hydrogen saturation, giving $3 \times 27 - 6 = 75$ molecular geometric descriptors, which serve as the feature vector of the data set.

In the feature vector, $n$ indexes the pentose molecular conformation and $i$ indexes the non-ALF atoms within each conformation, $i = 1, 2, \ldots, N - 3$.
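The ALF conversion described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the inventors' program: the function name `alf_features` and its arguments are hypothetical, and the third ALF descriptor is assumed here to be the angle at the origin atom between the two ALF neighbors, which the source does not define explicitly.

```python
import math

def _sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def _dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def _norm(a):
    return math.sqrt(_dot(a, a))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def alf_features(coords, origin, x_atom, xy_atom):
    """Return the 3N-6 ALF descriptors for one conformation.

    coords  : list of (x, y, z) Cartesian positions, one per atom
    origin  : index of the central atom (O1 in the text)
    x_atom  : index of the highest-priority neighbor (C1), fixing the x-axis
    xy_atom : index of the next neighbor (C4), fixing the xy-plane
    """
    o = coords[origin]
    ax, ay = _sub(coords[x_atom], o), _sub(coords[xy_atom], o)
    r_ax, r_ay = _norm(ax), _norm(ay)
    ex = [c / r_ax for c in ax]                      # local x-axis
    proj = _dot(ay, ex)
    ey_raw = [ay[k] - proj * ex[k] for k in range(3)]
    ey = [c / _norm(ey_raw) for c in ey_raw]         # local y-axis (Gram-Schmidt)
    ez = _cross(ex, ey)                              # local z-axis
    # ASSUMPTION: third ALF descriptor taken as the angle x_atom-origin-xy_atom.
    cos_a = max(-1.0, min(1.0, _dot(ax, ay) / (r_ax * r_ay)))
    features = [r_ax, r_ay, math.acos(cos_a)]
    for i, p in enumerate(coords):
        if i in (origin, x_atom, xy_atom):
            continue
        v = _sub(p, o)
        r = _norm(v)
        lx, ly, lz = _dot(v, ex), _dot(v, ey), _dot(v, ez)
        theta = math.acos(max(-1.0, min(1.0, lz / r)))  # angle to the z-axis
        phi = math.atan2(ly, lx)                        # azimuth from the x-axis
        features += [r, theta, phi]
    return features
```

For a 27-atom pentose conformation the returned list has 3 × 27 − 6 = 75 entries, matching the descriptor count given in the text. Because the descriptors are built from distances and angles in the local frame, they do not change when the whole molecule is translated or rotated.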

Third step: performing quantum mechanical calculation with the Gaussian09 and AIMALL software to obtain the multipole moments of the O1 atom, such as the point charge, dipole moment, quadrupole moment, octupole moment and hexadecapole moment

According to the pentose molecular structure information obtained in the first step, the hydrogen-saturated pentose molecules are input into the Gaussian09 software and the calculation is performed at the B3LYP/apc-1 level of theory. The result is then input into the AIMALL software, which integrates over the electron cloud distribution of each atom to give the point charge, dipole moment, quadrupole moment, octupole moment, hexadecapole moment and other multipole moments of the target atom O1.

$$\mathrm{Dat}_{ch} = [\,ch_{n}\,]$$

where $\mathrm{Dat}_{ch}$ is the point charge data for all pentose molecular conformations and $n$ denotes the $n$-th pentose molecular conformation.

$$\mathrm{Dat}_{di} = [\,di_{1n}\;\; di_{2n}\;\; di_{3n}\,]$$

where $\mathrm{Dat}_{di}$ is the dipole moment data for all pentose molecular conformations; the dipole moment has three components, and $n$ denotes the $n$-th pentose molecular conformation.

$$\mathrm{Dat}_{qu} = [\,qu_{1n}\;\; qu_{2n}\;\; qu_{3n}\;\; qu_{4n}\;\; qu_{5n}\,]$$

where $\mathrm{Dat}_{qu}$ is the quadrupole moment data for all pentose molecular conformations; the quadrupole moment has five components, and $n$ denotes the $n$-th pentose molecular conformation.

$$\mathrm{Dat}_{oct} = [\,oct_{1n}\;\; oct_{2n}\;\; oct_{3n}\;\; oct_{4n}\;\; oct_{5n}\;\; oct_{6n}\;\; oct_{7n}\,]$$

where $\mathrm{Dat}_{oct}$ is the octupole moment data for all pentose molecular conformations; the octupole moment has seven components, and $n$ denotes the $n$-th pentose molecular conformation.

$$\mathrm{Dat}_{hex} = [\,hex_{1n}\;\; hex_{2n}\;\; hex_{3n}\;\; hex_{4n}\;\; hex_{5n}\;\; hex_{6n}\;\; hex_{7n}\;\; hex_{8n}\;\; hex_{9n}\,]$$

where $\mathrm{Dat}_{hex}$ is the hexadecapole moment data for all pentose molecular conformations; the hexadecapole moment has nine components, and $n$ denotes the $n$-th pentose molecular conformation.
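As a consistency check on the component counts listed above: in spherical-tensor form a multipole moment of rank $\ell$ has $2\ell + 1$ independent components, so the five ranks from the point charge ($\ell = 0$) to the hexadecapole ($\ell = 4$) give

$$\sum_{\ell=0}^{4} (2\ell + 1) = 1 + 3 + 5 + 7 + 9 = 25 .$$

This total presumably corresponds to the "25 components" referred to in the caption of FIG. 14, although the source does not state the connection explicitly.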

(2) Training the five machine learning models ARDGPR, Bagging, RBFNN, GRNN and GPR

First step: determining the evaluation indices of the prediction models

(a) Residual: the difference between the true value and the predicted value.

$$e_i = y_i - \hat{y}_i$$

where $y_i$ is the $i$-th true value and $\hat{y}_i$ is the corresponding predicted value.

(b) Mean absolute error MAE (Mean Absolute Error).

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$

where $m$ is the number of predicted samples, $y_i$ is the $i$-th true value and $\hat{y}_i$ is the corresponding predicted value.

(c) Mean squared error MSE (Mean Squared Error).

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$

(d) Root mean squared error RMSE (Root Mean Squared Error).

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$

(e) Coefficient of determination $R^2$.

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the mean of the true values; in (c) to (e), $m$, $y_i$ and $\hat{y}_i$ have the same meaning as in (b).
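The five evaluation indices named above follow their standard definitions and can be computed together. The sketch below is illustrative only; the function name `regression_metrics` and the returned dictionary keys are choices made for this example, not names from the source.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute residuals, MAE, MSE, RMSE and R^2 using the standard definitions."""
    m = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]   # true minus predicted
    mae = sum(abs(r) for r in residuals) / m
    mse = sum(r * r for r in residuals) / m
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / m
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)        # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"residuals": residuals, "MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

MAE and RMSE are in the same units as the predicted multipole moment, whereas $R^2$ is dimensionless, which makes it the more convenient index for comparing moments of different rank.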

Second step: selecting the training set from the experimental data

A total of 4553 pentose molecular structures were finally obtained. The atomic ALF coordinate information of each structure is used as the input of the data set, and the high-order atomic multipole moments are the prediction targets.

Here $n$ indexes the pentose molecular conformations, $n = 1, 2, \ldots, 4553$.

We first take $\mathrm{Dat}_{ch}$ as the prediction target to evaluate and screen the five machine learning models ARDGPR, Bagging, RBFNN, GRNN and GPR, and then apply the ARDGPR model to predicting the high-order multipole moments of the O1 atom.

We divided the experimental data set by Sobol sequence sampling of size 100. FIG. 3 shows scatter plots of the O1 atom data sets (point charge): from top to bottom, the original data set, the training set and the test set. Our original data set comprises 4553 pentose molecular structures, of which 3000 were sampled by the Sobol sequence as the training set; the remaining 1553 structures form the test set. In FIG. 3 the ordinate is the atomic point charge value, that is, M00, and the abscissa is the index of the data element. The ordinate ranges in the figure show that the training set, the test set and the original data set are similarly distributed, with the data concentrated between 0.35 and 0.50 on the ordinate.
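A low-discrepancy train/test split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' sampling procedure: it uses the base-2 van der Corput sequence, which coincides with the first dimension of an unscrambled Sobol sequence up to ordering, and it does not model the "size 100" setting mentioned in the text, whose exact meaning the source does not specify. The function names are hypothetical.

```python
def van_der_corput(k):
    """Base-2 radical inverse of the non-negative integer k, a value in [0, 1)."""
    value, denom = 0.0, 1.0
    while k:
        denom *= 2.0
        value += (k & 1) / denom
        k >>= 1
    return value

def low_discrepancy_split(n_total, n_train):
    """Pick n_train distinct indices spread evenly over range(n_total).

    Returns (train_indices, test_indices); requires n_train <= n_total.
    """
    train, seen, k = [], set(), 0
    while len(train) < n_train:
        idx = int(van_der_corput(k) * n_total)   # map the point in [0, 1) to an index
        if idx not in seen:                      # skip indices already chosen
            seen.add(idx)
            train.append(idx)
        k += 1
    test = [i for i in range(n_total) if i not in seen]
    return train, test

# Sizes taken from the text: 4553 structures, 3000 for training, 1553 for testing.
train_idx, test_idx = low_discrepancy_split(4553, 3000)
```

Compared with pseudo-random sampling, a low-discrepancy sequence covers the index range evenly at every prefix length, which is consistent with the observation in the text that the training and test sets are distributed like the original data set.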

Third step: training the five machine learning models ARDGPR, Bagging, RBFNN, GRNN and GPR

(a) RBFNN model training

We used 10-fold cross-validation with 3000 training samples of 75 features each. The MSE goal uses the default value of 0, the maximum number of neurons is set to 500 starting from 0, and the number of neurons added at each step is the default value of 25. Choosing the RBFNN spread coefficient SPREAD reasonably is important, because its value determines how widely each neuron responds to the region covered by the input vectors. The optimal SPREAD value is found by enumeration: starting from 20 and going up to a maximum of 500 in steps of 5, each value is tried in turn. The training error results are shown in FIG. 4.

FIG. 4 consists of two panels, A and B; the abscissa is the SPREAD value and the ordinate is the network training error MSE for that value. Panel A covers SPREAD values from 20 to 500 in steps of 5. It shows that when SPREAD is below 55 the MSE is greater than 0.1 and fluctuates strongly, because SPREAD is too small for the neurons to respond to the region covered by the input vectors. When SPREAD exceeds 100 the MSE tends to stabilize, which indicates that the neurons respond to the region covered by the input vectors. To show the data more clearly, the results for SPREAD greater than 130 are replotted in panel B, where the MSE can be seen to have stabilized from SPREAD = 270 onward. The SPREAD value with the smallest MSE, SPREAD = 470 as marked in FIG. 4, is chosen to build the model.

(b) GRNN model training

There are 3000 training samples with 75 features each, and the only parameter that GRNN needs to search is the smoothing factor SPREAD. We again use enumeration to find the optimal SPREAD value, starting from 0.001 and going up to a maximum of 1 in steps of 0.001.

The GRNN training error results are shown in FIG. 5, which consists of two panels, A and B; the abscissa is the smoothing factor SPREAD and the ordinate is the network training error MSE for that value. Panel A covers SPREAD values from 0.001 to 1 in steps of 0.001. It shows that the MSE reaches its minimum for SPREAD between 0 and 0.1 and then rises gradually until it levels off, because the larger the SPREAD value, the smoother the network's approximation of the sample data becomes. The network therefore approximates the samples best when SPREAD lies between 0 and 0.1. To show the data more clearly, the results for SPREAD between 0.04 and 0.08 are replotted in panel B. As FIG. 5 shows, the MSE changes relatively gently in this range, and the minimum MSE is reached at SPREAD = 0.059.
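The enumeration over SPREAD described above can be illustrated with a small GRNN written from its textbook definition (a Gaussian-weighted average of the training targets). This is a sketch under stated assumptions rather than the toolbox routine used by the inventors: the Gaussian width convention `exp(-d^2 / (2 * spread^2))` may differ from theirs by a constant factor, and the error is measured here on a held-out validation set, which is an assumption because the source does not say how its training error was evaluated. The function names are hypothetical.

```python
import math

def grnn_predict(X_train, y_train, x, spread):
    """GRNN output at x: Gaussian-kernel weighted mean of the training targets."""
    d2 = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    d2_min = min(d2)
    # Subtracting the smallest distance keeps the largest weight at 1 and
    # avoids a 0/0 division when spread is very small.
    w = [math.exp(-(d - d2_min) / (2.0 * spread ** 2)) for d in d2]
    return sum(wi * yi for wi, yi in zip(w, y_train)) / sum(w)

def select_spread(X_train, y_train, X_val, y_val, candidates):
    """Enumerate candidate SPREAD values; return (best_spread, best_mse)."""
    best_spread, best_mse = None, float("inf")
    for spread in candidates:
        errors = [(grnn_predict(X_train, y_train, x, spread) - t) ** 2
                  for x, t in zip(X_val, y_val)]
        mse = sum(errors) / len(errors)
        if mse < best_mse:
            best_spread, best_mse = spread, mse
    return best_spread, best_mse

# Grid from the text: 0.001 to 1 in steps of 0.001.
spread_grid = [k / 1000.0 for k in range(1, 1001)]
```

A small SPREAD makes the prediction follow the nearest training samples closely, while a large SPREAD averages over many samples and smooths the fit, which is the trade-off visible in FIG. 5.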

(c) Bagging model training

The base learner of the Bagging model built here is a decision tree. Out-of-bag samples can be used to assist pruning, or to estimate the posterior probability of each node in the decision tree to help handle nodes that receive no training samples. The parameters to be tuned for the Bagging model are the number of learners and the number of leaf nodes of the decision tree. The optimal parameter combination is found by grid search: the number of leaf nodes starts at 1 and goes up to a maximum of 50 in steps of 1, and the number of learners starts at 10 and goes up to a maximum of 300 in steps of 5.

The Bagging training error is shown in FIG. 6. The abscissa is the number of leaf nodes and the ordinate is the number of learners; the color of each square represents the training error of the model for the corresponding combination, with darker colors indicating smaller MSE values and lighter colors larger ones. The figure shows that the color varies smoothly, indicating that the training error increases step by step as the number of leaf nodes increases. The dark squares are concentrated mainly between 5 and 7 leaf nodes, indicating that the minimum MSE lies in that range. To find the optimal combination of leaf nodes and learners, the data for 5 to 7 leaf nodes are replotted in FIG. 7.

FIG. 7 shows the training error for 5 to 7 leaf nodes, with the number of learners on the abscissa and the MSE of the trained model on the ordinate. The MSE is smallest with 5 leaf nodes and 160 learners, which is the optimal parameter combination for the model.
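The Bagging grid search described above can be sketched with a small bootstrap-aggregated regression tree. This is a simplified illustration, not the inventors' implementation: the leaf parameter is interpreted here as a minimum number of samples per leaf (`min_leaf`), which is an assumption about what the source calls the "number of leaf nodes", the error is measured on a held-out validation set, and all function names are hypothetical.

```python
import random
from statistics import mean

def sse(y, idx):
    """Sum of squared deviations of y[idx] from their mean."""
    m = mean(y[i] for i in idx)
    return sum((y[i] - m) ** 2 for i in idx)

def fit_tree(X, y, min_leaf):
    """Grow a regression tree; every leaf keeps at least min_leaf (>= 1) samples."""
    n, best = len(y), None
    for f in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][f])
        for cut in range(min_leaf, n - min_leaf + 1):
            lo, hi = order[:cut], order[cut:]
            if X[lo[-1]][f] == X[hi[0]][f]:
                continue  # no threshold separates equal feature values
            score = sse(y, lo) + sse(y, hi)
            if best is None or score < best[0]:
                best = (score, f, (X[lo[-1]][f] + X[hi[0]][f]) / 2, lo, hi)
    if best is None:
        return mean(y)  # leaf node: predict the mean target
    _, f, thr, lo, hi = best
    left = fit_tree([X[i] for i in lo], [y[i] for i in lo], min_leaf)
    right = fit_tree([X[i] for i in hi], [y[i] for i in hi], min_leaf)
    return (f, thr, left, right)

def predict_tree(node, x):
    while isinstance(node, tuple):       # internal nodes are tuples, leaves are floats
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node

def fit_bagging(X, y, n_learners, min_leaf, seed=0):
    rng = random.Random(seed)
    n, trees = len(y), []
    for _ in range(n_learners):
        boot = rng.choices(range(n), k=n)  # bootstrap: sample with replacement
        trees.append(fit_tree([X[i] for i in boot], [y[i] for i in boot], min_leaf))
    return trees

def predict_bagging(trees, x):
    return mean(predict_tree(t, x) for t in trees)  # average over the ensemble

def grid_search(X_tr, y_tr, X_val, y_val, leaf_sizes, learner_counts):
    """Return (mse, min_leaf, n_learners) with the smallest validation MSE."""
    best = None
    for min_leaf in leaf_sizes:
        for n_learners in learner_counts:
            trees = fit_bagging(X_tr, y_tr, n_learners, min_leaf)
            mse = mean((predict_bagging(trees, x) - t) ** 2
                       for x, t in zip(X_val, y_val))
            if best is None or mse < best[0]:
                best = (mse, min_leaf, n_learners)
    return best
```

With the grid from the text, `leaf_sizes` would be `range(1, 51)` and `learner_counts` would be `range(10, 301, 5)`. This pure-Python tree is far too slow for 3000 samples with 75 features, so it serves only to make the search procedure concrete.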

(d) GPR model training

The GPR model is specified by its mean function (mean function) m (x) and covariance function (covariance function) k (x, x'), which we set to zero. k (x, x') is also a kernel function of GPR, we choose an exponential kernel.

x is the input set, x' is the new input, σ _f Sigma is the regression function variance or called maximum covariance _n For noise variance, δ (x-x') is the Kronecker function. Here, an isotropic kernel, so l is the width of the kernel, the range over which the control function responds to the input vector.

In GPR, the hyperparameters to be determined are θ = (l, σ_f, σ_n). In terms of the posterior probability, the optimal θ = (l, σ_f, σ_n) is the value at which p(θ|x, y) reaches its maximum. According to Bayes' theorem, the optimal hyperparameters are obtained by minimizing the negative log-likelihood function.
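For reference, the negative log-likelihood of a zero-mean Gaussian process takes the standard form below. The notation is an assumption of this sketch: K_θ denotes the n × n covariance matrix of the n training inputs (noise term included) and y the vector of training targets.

```latex
-\log p(y \mid X, \theta)
  = \tfrac{1}{2}\, y^{\top} K_{\theta}^{-1} y
  + \tfrac{1}{2} \log \lvert K_{\theta} \rvert
  + \tfrac{n}{2} \log 2\pi ,
\qquad
\theta^{*} = \arg\min_{\theta} \bigl[ -\log p(y \mid X, \theta) \bigr]
```

The first term rewards fitting the data, and the second penalizes overly flexible covariance structures, so minimizing the sum balances fit against complexity.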

(e) ARDGPR model training

The hyperparameters to be solved for the GPR model are θ = (l, σ_f, σ_n), and they are obtained by minimizing the negative log-likelihood function. Adding the automatic relevance determination (ARD) framework to the GPR model yields the ARDGPR model, in which a single relevance coefficient η_i is assigned to each input variable (feature). The parameters η_i are then added to the hyperparameter set θ = (l, σ_f, σ_n), giving the new hyperparameter set θ = (η_1, …, η_i, σ_f, σ_n). Finally, when the optimal hyperparameters are sought by the maximum likelihood method, the degree of relevance between each input feature and the target can be inferred from the data.
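A minimal sketch of how per-feature relevance coefficients change the kernel is shown below. The exact ARD kernel is not given in the text, so the functional form is an assumption: a squared-exponential kernel in which each squared coordinate difference is weighted by η_i. The function name `ard_kernel` is hypothetical, and the Kronecker term is approximated by an equality check on the two inputs.

```python
import math


def ard_kernel(x, x_prime, eta, sigma_f, sigma_n):
    """ARD squared-exponential kernel (assumed form, see lead-in).

    eta[i] weights the i-th input feature; eta[i] == 0 switches that
    feature off, so it no longer influences the covariance.
    """
    if len(x) != len(x_prime) or len(x) != len(eta):
        raise ValueError("x, x_prime and eta must have the same length")
    # Relevance-weighted squared distance between the two inputs.
    weighted_sq_dist = sum(e * (a - b) ** 2 for e, a, b in zip(eta, x, x_prime))
    # Noise contributes only when both inputs coincide (Kronecker term).
    noise = sigma_n ** 2 if list(x) == list(x_prime) else 0.0
    return sigma_f ** 2 * math.exp(-0.5 * weighted_sq_dist) + noise
```

After optimization, a coefficient η_i close to zero suggests that the corresponding input coordinate has little bearing on the predicted multipole component, which is the sense in which relevance is inferred from the data.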

The hyperparameters of the GPR model and the ARDGPR model are calculated in the same way: maximizing the log-likelihood function is converted into minimizing the negative log-likelihood function. For both models, the negative log-likelihood function of the conditional probability of the training samples is established and its minimum is computed.

(3) Verification of ARDGPR model

The first step analyzes the prediction performance of the five machine learning models: ARDGPR, Bagging, RBFNN, GRNN and GPR.

The residual diagram of the GPR model is shown in FIG. 8, the ordinate is the residual value, and the abscissa is the true value of the predicted target. As can be seen from fig. 8, the residual values are approximately randomly distributed in the horizontal banded regions, without any obvious trend, indicating that the GPR model fits well to the test set data.

The residual plot of the ARDGPR model is shown in FIG. 9; the ordinate is the residual value and the abscissa is the true value of the predicted target. As can be seen from FIG. 9, the residual values are approximately randomly distributed within a horizontal band with no obvious trend, indicating that the ARDGPR model fits the test set data well. Compared with the residual plot of the GPR model, the residual values are generally smaller, and only a few individual residuals are large. Relative to the maximum absolute error of the ARDGPR model, its remaining prediction errors are generally small.

The residual plot of the Bagging model is shown in FIG. 10; the ordinate is the residual value and the abscissa is the true value of the predicted target. As can be seen from FIG. 10, the distribution of residual values differs little from those of the ARDGPR and GPR models: the residuals are again approximately randomly distributed within a horizontal band with no obvious trend, indicating that the Bagging, ARDGPR and GPR models fit the test set data about equally well, and all perform well. Comparing with the residual plots of the ARDGPR and GPR models (note that the ordinate ranges of the three plots differ), the residuals of the Bagging model are smaller overall than those of the GPR model but not noticeably different from those of the ARDGPR model, so the Bagging model fits better than the GPR model and nearly as well as the ARDGPR model.

The residual plot of the GRNN model is shown in FIG. 11. As can be seen from FIG. 11, in contrast to the Bagging, ARDGPR and GPR models, the distribution of the residual values shows a clear trend, which indicates that the GRNN model fits the test set data poorly and is essentially unusable for predicting this data set.

The residual plot of the RBFNN model is shown in FIG. 12. As can be seen from FIG. 12, the residual distribution is similar to that of the GRNN model and shows a marked trend, which indicates that the RBFNN model fits the test set data poorly and is essentially unusable for predicting this data set.

The second step compares the prediction performance of the five machine learning models: ARDGPR, Bagging, RBFNN, GRNN and GPR.

Table 1 below shows the results of predicting the O1 atomic point charge with the five machine learning models RBFNN, GRNN, Bagging, GPR and ARDGPR: the maximum absolute error, minimum absolute error, MAE, MSE, RMSE and R² of the predictions. The MSE and RMSE values show that the ARDGPR and Bagging models differ little and perform well, the RBFNN and GRNN models differ little and perform worst, and the GPR model lies in between. This pattern is even clearer in R²: the R² values of the RBFNN and GRNN models approach 0, those of the ARDGPR and Bagging models approach 1, and the GPR model is in the middle. This shows that the ARDGPR model fits the test set best, followed by the Bagging model, while the RBFNN and GRNN models fit the test set worst and are essentially unusable for predicting the target value. In addition, the maximum and minimum absolute errors and the MAE show that the ARDGPR model has a smaller overall prediction error than the other four models.
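The six quantities reported in Table 1 can be computed from true and predicted values as sketched below. The function name `regression_metrics` and the sample values in the usage lines are illustrative assumptions, not data from the experiments.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return max/min absolute error, MAE, MSE, RMSE and R^2 for paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "max_abs_error": max(abs_errors),
        "min_abs_error": min(abs_errors),
        "mae": sum(abs_errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

An R² near 1 means the model explains almost all of the variance of the target, while an R² near 0 means it does no better than predicting the mean, which matches the reading of the RBFNN and GRNN results above.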

TABLE 1 regression model prediction results

FIG. 13 plots, for the five models predicting the O1 atomic point charge, the absolute-error threshold against the proportion of data that meets that threshold. The graph shows that 70% of the data predicted by the ARDGPR model have an absolute error of less than 0.0013, 70% of the data predicted by the Bagging model have an absolute error of less than 0.0112, 70% of the data predicted by the GPR model have an absolute error of less than 0.0251, 70% of the data predicted by the GRNN model have an absolute error of less than 0.0375, and 70% of the data predicted by the RBFNN model have an absolute error of less than 0.0381. The X-axis value at which a curve reaches 100% on the ordinate is the value below which all prediction errors fall. The graph shows that this value is smallest for the Bagging model, followed by the RBFNN model; in combination with Table 1, it is easy to see that this happens because the maximum absolute errors of the Bagging and RBFNN models are smaller than those of the other three models. Over the curve as a whole, the absolute prediction error of the ARDGPR model is smaller than that of the other four models, which indicates that the ARDGPR model predicts better than the other four models.
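The curve described above is an empirical cumulative distribution of absolute errors. A minimal sketch of the two readings taken from it follows; the function names and the sample error values are illustrative assumptions, not values from the experiments.

```python
def fraction_within(abs_errors, threshold):
    """Proportion of data whose absolute error does not exceed the threshold."""
    return sum(1 for e in abs_errors if e <= threshold) / len(abs_errors)


def error_at_percent(abs_errors, percent):
    """Smallest observed absolute error that covers at least `percent` % of the data.

    `percent` is an integer in 1..100; integer arithmetic avoids rounding surprises.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(abs_errors)
    k = (percent * len(ordered) + 99) // 100  # ceil(percent * n / 100)
    return ordered[k - 1]


# Illustrative absolute errors only.
errs = [0.05, 0.01, 0.03, 0.02, 0.04, 0.10, 0.07, 0.06, 0.09, 0.08]
threshold_70 = error_at_percent(errs, 70)
threshold_100 = error_at_percent(errs, 100)
```

Reading the curve at 100% returns the maximum absolute error, which is why a model with one large outlier can look worse at that point even when most of its errors are small.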

Analysis of the experimental results shows that the Bagging model and the ARDGPR model both perform well on the test set. The ARDGPR model performs best: its RMSE is improved by 52.97% relative to the GPR model and by 10.57% relative to the Bagging model. Searching for the Bagging parameters iteratively is very time-consuming. If the Bagging model were applied to the remaining 24 components, the optimal parameters of 24 models would have to be searched by the same method to ensure accuracy, and the computation time required would be enormous, far exceeding the time needed to compute the quantities directly with the software. In contrast, because the input data are unchanged, the total time needed to build ARDGPR models for the remaining 24 components is much less than the tuning time of a single Bagging model.

The third step verifies the prediction performance of the ARDGPR model against the existing atomic high-order multipole moments.

FIG. 14 shows, for the 25 components M1 to M25 of the O1 atom predicted by the ARDGPR model, the absolute-error threshold plotted against the proportion of data meeting that threshold. For ease of observation, the 25 components are not drawn on one graph; instead, six sub-graphs are drawn in sequence, with the same abscissa range in all of them. The M1 to M4 sub-graph shows that, for the first four components predicted by the ARDGPR model, 80% of the data have an absolute error of less than 0.008. For the M5 to M8 components, 80% of the data have an absolute error of less than 0.003; for M9 to M12, less than 0.01; for M13 to M16, less than 0.0108; for M17 to M20, less than 0.014; and for M21 to M25, less than 0.0105. Thus, across all 25 components, 80% of the data predicted by the ARDGPR model have an absolute error of less than 0.014. The curves essentially overlap, which shows that the ARDGPR model performs stably: the prediction error on the test set does not fluctuate greatly between components, and the prediction performance is good.

Finally, the values of the point charge, dipole moment, quadrupole moment, octupole moment and hexadecapole moment are calculated by formula from the predicted 25 components, the absolute errors are calculated, and the results are sorted in ascending order to give Table 2. A total of 1553 pentose molecular structures were tested. The ARDGPR model predicted the 25 components of the target atom O1; the point charge, dipole moment, quadrupole moment, octupole moment and hexadecapole moment of the O1 atom were calculated from these results; and finally their errors with respect to the true values of the O1 high-order multipole moments were calculated. Because there are 1553 rows of data, Table 2 lists only the first five, middle five and last five rows, together with the mean absolute errors of the point charge, dipole moment, quadrupole moment, octupole moment and hexadecapole moment.
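The selection of rows for Table 2 can be sketched as follows. This is an illustration under stated assumptions: the text does not say exactly which rows count as the "middle five", so the sketch centers a window of five rows on the sorted list, and the input values are synthetic placeholders rather than the measured errors.

```python
def summarize_sorted_errors(abs_errors, k=5):
    """Sort absolute errors ascending and return the first, middle and last k plus the mean."""
    if len(abs_errors) < k:
        raise ValueError("need at least k error values")
    ordered = sorted(abs_errors)
    n = len(ordered)
    mid_start = (n - k) // 2  # assumed definition of the "middle" window
    return {
        "first": ordered[:k],
        "middle": ordered[mid_start:mid_start + k],
        "last": ordered[-k:],
        "mean": sum(ordered) / n,
    }


# Synthetic placeholder errors for 1553 structures, supplied in descending order.
placeholder_errors = [i / 1000 for i in range(1552, -1, -1)]
summary = summarize_sorted_errors(placeholder_errors)
```

Reporting the two extremes together with the center of the sorted list gives a compact view of the best case, typical case and worst case without printing every row.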

TABLE 2 Absolute errors of the atomic high-order multipole moments


FIG. 15 shows, for the five multipole moments of the O1 atom, the absolute-error threshold plotted against the proportion of data meeting that threshold. The graph shows that 80% of the point charge errors are less than 0.0030, 80% of the dipole moment errors are less than 0.0070, 80% of the quadrupole moment errors are less than 0.0029, 80% of the octupole moment errors are less than 0.0100, and 80% of the hexadecapole moment errors are less than 0.0130. The absolute errors of the five multipole moments are close to one another overall, the prediction performance is good, and the ARDGPR model is stable overall with no large deviations, so the model meets the requirements for predicting atomic high-order multipole moments.

Claims

1. A calculation method for predicting atomic high-order multipole moments in an RNA molecular force field based on an ARDGPR model, characterized by comprising the following steps:

selecting RNA molecules with different structures and sizes from a protein database, cutting the RNA molecules into different small molecule fragments, and obtaining the molecular structure information of the small molecule fragments and the spatial position information of each atom in the small molecule fragments by means of Discovery Studio software; converting the coordinate position of each atom in the small molecule fragments from a global coordinate system to an atomic local coordinate system through ALF coordinate conversion; optimizing the structures of the different small molecule fragments by using the quantum mechanical calculation software GAUSSIAN09, and then calculating the high-order multipole moments of the target atoms to be predicted by using AIMALL software;

selecting the atomic coordinate positions of part of the small molecule fragments and training an ARDGPR model with the atomic high-order multipole moments, so as to obtain the atomic high-order multipole moments predicted by the ARDGPR model; and verifying the prediction result of the ARDGPR model by taking the remaining small molecules as a test set;

selecting the remaining small molecule fragments from the data set, and verifying the prediction result of the ARDGPR model by means of the high-order multipole moments of each atom in the small molecule fragments.

2. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to claim 1, wherein the method of optimizing the structures of the different small molecule fragments by using the quantum mechanical calculation software GAUSSIAN09 and then obtaining the high-order multipole moments of the target atoms to be predicted by using AIMALL software is as follows:

according to the obtained molecular structure information of the small molecule fragments, inputting the saturated small fragment molecules into Gaussian09 software and performing the calculation at the B3LYP/apc-1 level of theory, and finally inputting the result into AIMALL software to integrate the electron cloud distribution of the atoms, thereby obtaining the high-order multipole moments of the target atoms.

3. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to claim 1, wherein the method of selecting the atomic coordinate positions of part of the small molecule fragments and training the ARDGPR model with the atomic high-order multipole moments is as follows:

converting the coordinate position of each atom in the small molecule fragments from a global coordinate system to an atomic local coordinate system through ALF, taking the high-order multipole moments of the atoms in the small molecule fragments as the output of the ARDGPR model, and taking the ALF coordinate information of the small molecule fragments as the input of the ARDGPR model; and randomly selecting part of the data as training set data for training the ARDGPR model by using a Sobol sequence sampling method of size 100.

4. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to claim 1, wherein selecting the remaining small molecule fragments from the data set and verifying the prediction result of the ARDGPR model by means of the high-order multipole moments of each atom in the small molecule fragments comprises the following steps:

selecting the remaining small molecule fragments, obtaining the high-order multipole moments of the atoms in each small molecule fragment through integral calculation with AIMALL software, and recording them as parameter A; recording the corresponding high-order multipole moments predicted by the ARDGPR model as parameter B; and taking parameter A as the true value and parameter B as the predicted value, and verifying the prediction result by analyzing the errors.

5. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to claim 1, wherein the remaining small molecule fragments are selected from the data set and the prediction result of the ARDGPR model is verified by calculating the interatomic electrostatic interactions, comprising the following steps:

selecting the remaining small molecule fragments, and calculating the interatomic electrostatic interaction energy C from the true value A by using the quantum mechanical calculation software GAUSSIAN09;

calculating the interatomic electrostatic interaction energy D by program from the parameter B predicted by the ARDGPR model;

6. The method of claim 1, wherein the remaining small molecule fragments are selected from the data set and the ARDGPR model is compared with four machine learning models, Bagging, RBFNN, GRNN and GPR, trained using the same training set, by evaluating their performance on the test set, comprising the following steps:

training the four machine learning models Bagging, RBFNN, GRNN and GPR using the same training data set as was used to construct the ARDGPR model;

selecting the remaining small molecule fragments as a test set, and recording the high-order multipole moments of the atoms in the remaining small molecule fragments predicted by the four machine learning models Bagging, RBFNN, GRNN and GPR as parameter E;

7. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to any one of claims 1 to 6, wherein the small molecule fragments comprise: phosphoric acid molecules, pentose molecules, four base molecules, phosphoric acid-pentose molecules, four pentose-one-base molecules, nucleotide molecules, and two base-paired nucleotide molecules.

8. The calculation method for predicting atomic high-order multipole moments in RNA based on the ARDGPR model according to any one of claims 1 to 6, wherein the high-order multipole moments comprise a point charge, a dipole moment, a quadrupole moment, an octupole moment and a hexadecapole moment.