CN107742061A - Protein-protein interaction prediction method, system and device - Google Patents

Protein-protein interaction prediction method, system and device

Info

Publication number
CN107742061A
CN107742061A CN201710848068.9A
Authority
CN
China
Prior art keywords
protein
data
prediction
dimensional image
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710848068.9A
Other languages
Chinese (zh)
Other versions
CN107742061B (en)
Inventor
Zou Xiaoyong (邹小勇)
Wang Yang (王洋)
Li Zhanchao (李占潮)
Dai Zong (戴宗)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201710848068.9A priority Critical patent/CN107742061B/en
Publication of CN107742061A publication Critical patent/CN107742061A/en
Application granted granted Critical
Publication of CN107742061B publication Critical patent/CN107742061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a protein-protein interaction prediction method, system and device. The method obtains the multi-dimensional image data corresponding to the proteins to be predicted, inputs the obtained multi-dimensional image data into a convolutional neural network for processing, and outputs the prediction result. The system comprises a data module for obtaining the multi-dimensional image data corresponding to the proteins to be predicted, and a processing module for inputting the obtained multi-dimensional image data into a convolutional neural network for processing and outputting the prediction result. The device comprises a memory providing storage space and a processor executing a program of the protein-protein interaction prediction method. The invention converts protein sequence data into multi-dimensional image data and processes it with a convolutional neural network, and provides convolutional neural network parameters suited to the characteristics of protein data as well as a way to extend the dimensionality of the input data, so that more accurate predictions can be made. The invention is applicable to the field of protein-protein interaction prediction.

Description

Protein-protein interaction prediction method, system and device
Technical field
The present invention relates to big data processing technology, and in particular to a protein-protein interaction prediction method, system and device.
Background technology
Protein interactions are the basis of life activities: physiological changes within a cell are realized through the regulation of proteins. The various functions and phenomena of life are rarely realized by a single protein; they arise from interactions between proteins, between proteins and nucleic acids, and between proteins and other small-molecule compounds. Studying the interactions between proteins is therefore of great significance for life-science research. Research institutions worldwide have carried out extensive studies of protein interaction relationships, progressing from the theoretical analysis of the early days to a variety of experimental techniques, and a large amount of experimental data has accumulated. How to mine relevant information from this enormous volume of biological data has become a hot research direction in bioinformatics. In existing studies based on protein sequence features and structural properties, the methods impose strict requirements on the data format, and data preprocessing requires a large amount of manual annotation, which is unfavorable for large-scale data analysis and for exploratory prediction on unknown data.
Summary of the invention
To solve the above technical problems, a first object of the present invention is to provide a protein-protein interaction prediction method, a second object is to provide a protein-protein interaction prediction system, and a third object is to provide a protein-protein interaction prediction device.
The first technical solution adopted by the present invention is:
A protein-protein interaction prediction method, comprising the following steps:
obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
Further, before the step of inputting the obtained multi-dimensional image data into the convolutional neural network for processing so as to output the prediction result, a step of establishing the convolutional neural network model is also provided, which specifically comprises:
obtaining the multi-dimensional image data and interaction data values corresponding to proteins in a protein interaction database, building input-data positive samples from the obtained multi-dimensional image data, and building output-data positive samples from the obtained interaction data values;
obtaining the multi-dimensional image data and interaction data values corresponding to proteins outside the protein interaction database, building input-data negative samples from the obtained multi-dimensional image data, and building output-data negative samples from the obtained interaction data values;
selecting from the input-data positive samples and input-data negative samples to build a training input data set and a test input data set, respectively; selecting from the output-data positive samples and output-data negative samples to build a training output data set and a test output data set, respectively;
training the convolutional neural network with the training input data set and the training output data set, and testing the convolutional neural network with the test input data set and the test output data set;
taking the convolutional neural network obtained after training and testing as the required convolutional neural network.
Further, the multi-dimensional image data is obtained by any of the following: a separate-coding conversion method, a staggered-coding conversion method, a centrosymmetric-coding conversion method, or an adjacent-coding conversion method.
Further, the formulas used by the separate-coding conversion method are:
F1=pix (ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)
Further, the formulas used by the staggered-coding conversion method are:
F1=pix (ik1*2).reshape(length,width,dimension)
F2=pix (ik2*2+1).reshape(length,width,dimension)。
Further, the formulas used by the centrosymmetric-coding conversion method are:
F1=pix (1799-ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)。
Further, the formulas used by the adjacent-coding conversion method are:
F1=pix (1799-length+ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)。
The quantities in the above formulas have the following meanings: F1 is the multi-dimensional image data obtained after converting the first protein, F2 is the multi-dimensional image data obtained after converting the second protein, pix() is the function that maps a protein sequence position to an image pixel position, ik1 is a sequence position of the first protein, ik2 is a sequence position of the second protein, reshape() is the function that converts one-dimensional sequence data into matrix data, length is the total number of rows of the converted multi-dimensional image data, width is the total number of columns of the converted multi-dimensional image data, and dimension is the number of data channels.
Further, the convolutional neural network comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a first pooling layer, a second pooling layer, a first fully connected layer and a second fully connected layer; the output of the first convolutional layer is connected, in sequence, through the second convolutional layer, the first pooling layer, the third convolutional layer, the fourth convolutional layer, the second pooling layer and the first fully connected layer to the input of the second fully connected layer.
The second technical solution adopted by the present invention is:
A protein-protein interaction prediction system, comprising:
a data module for obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
a processing module for inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
The third technical solution adopted by the present invention is:
A protein-protein interaction prediction device, comprising:
a memory for storing at least one program, and
a processor for loading the at least one program and performing the following steps:
obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
A first beneficial effect of the present invention is: the linear data of a protein sequence is converted into multi-dimensional image data, i.e. a composite matrix of higher dimension, which is then input into a convolutional neural network for processing. This overcomes the limitations of heavy data preprocessing and insufficient supporting information, is well suited to protein data, and has a wide range of applications. The invention also discloses the training method of the convolutional neural network and the methods for obtaining the multi-dimensional image data; the multi-dimensional image data can contain more than the protein sequence itself, and data such as pathways, active sites, binding sites and kinetic parameters can also be converted into multi-dimensional image data as needed, so that the data fed into the convolutional neural network is richer, increasing both the accuracy and the scope of the analysis. The invention also discloses the structure of the convolutional neural network, which improves the performance of the convolutional neural network model in handling protein data.
A second beneficial effect of the present invention is: a data module is used to obtain the multi-dimensional image data corresponding to the proteins to be predicted, and a processing module is used to input the obtained multi-dimensional image data into a convolutional neural network for processing and to output the prediction result, so that the protein interaction prediction method of the present invention can be implemented quickly.
A third beneficial effect of the present invention is: a device composed of a memory and a processor executes the protein interaction prediction method of the present invention, so that the method can be implemented quickly.
Brief description of the drawings
Fig. 1 is a flowchart of the protein interaction prediction method of the present invention;
Fig. 2 is a structural diagram of the protein interaction prediction system of the present invention;
Fig. 3 is a structural diagram of the protein interaction prediction device of the present invention;
Fig. 4 is a detailed flowchart of the step of establishing the convolutional neural network;
Fig. 5 shows the conversion effect of separate coding;
Fig. 6 shows the conversion effect of staggered coding;
Fig. 7 shows the conversion effect of centrosymmetric coding;
Fig. 8 shows the conversion effect of adjacent coding;
Fig. 9 shows the effect of the number of convolution kernels on the training loss function value;
Fig. 10 shows the effect of the number of convolution kernels on the test loss function value;
Fig. 11 shows the effect of the number of convolution kernels on the training accuracy;
Fig. 12 shows the effect of the number of convolution kernels on the test accuracy;
Fig. 13 shows the effect of the number of training iterations on model performance;
Fig. 14 shows the accuracy of the convolution model in predicting interaction strength.
Detailed description of the embodiments
A protein-protein interaction prediction method disclosed by the present invention, as shown in Fig. 1, comprises the following steps:
obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result. The output prediction result is the predicted protein interaction data value between the two proteins to be predicted.
The invention also discloses a protein-protein interaction prediction system, which comprises a data module for obtaining the multi-dimensional image data corresponding to the proteins to be predicted, and a processing module for inputting the obtained multi-dimensional image data into a convolutional neural network for processing so as to output a prediction result. Its structure is shown in Fig. 2. The system can implement the protein-protein interaction prediction method of the present invention.
The invention also discloses a protein-protein interaction prediction device, which comprises a memory for storing data and programs, in which at least one program is stored, and a processor for executing the program stored in the memory, the program executed by the processor comprising the following steps:
obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
The structure of the device is shown in Fig. 3. The device can implement the protein interaction prediction method of the present invention.
In order to explain the protein-protein interaction prediction method of the present invention more clearly, it is further described below with reference to specific embodiments. All of the following embodiments can be implemented with the protein-protein interaction prediction system and the protein-protein interaction prediction device of the present invention.
As a further preferred embodiment, before the step of inputting the obtained multi-dimensional image data into the convolutional neural network for processing so as to output the prediction result, a step of establishing the convolutional neural network is also provided. As shown in Fig. 4, the step of establishing the convolutional neural network specifically comprises:
obtaining the multi-dimensional image data and interaction data values corresponding to proteins in a protein interaction database, building input-data positive samples from the obtained multi-dimensional image data, and building output-data positive samples from the obtained interaction data values;
obtaining the multi-dimensional image data and interaction data values corresponding to proteins outside the protein interaction database, building input-data negative samples from the obtained multi-dimensional image data, and building output-data negative samples from the obtained interaction data values;
selecting from the input-data positive samples and input-data negative samples to build a training input data set and a test input data set, respectively; selecting from the output-data positive samples and output-data negative samples to build a training output data set and a test output data set, respectively;
training the convolutional neural network with the training input data set and the training output data set, and testing the convolutional neural network with the test input data set and the test output data set;
taking the convolutional neural network obtained after training and testing as the required convolutional neural network.
As a further preferred embodiment, the protein interaction database used in the step of establishing the convolutional neural network is the HIPPIE database; version 2.0 of the HIPPIE database gives better results. The step of establishing the convolutional neural network is further described below taking HIPPIE database version 2.0 as an example.
HIPPIE database version 2.0 contains 287,357 protein interaction records involving 16,835 proteins. To build the input-data positive samples and output-data positive samples, and in accordance with the data scale required in the convolutional network calculation, the 16,835 proteins were first screened: proteins whose sequence length is shorter than 200 or longer than 1,800 residues were marked as invalid and deleted, leaving 16,212 proteins participating in interactions. The 287,357 protein interaction records were screened at the same time, deleting the interaction relationships that involve a protein marked as invalid and retaining 246,726 interaction records. The multi-dimensional image data corresponding to the proteins in these 246,726 records were used to build the input-data positive samples, and the corresponding interaction data values were used to build the output-data positive samples.
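For illustration only, this screening step can be sketched in Python as follows; the data structures (a dictionary of sequences and a list of interaction triples) are assumptions made for the sketch, not the patent's actual implementation.

    # Sketch: keep proteins with sequence length in [200, 1800] and drop
    # interaction records that involve a protein marked as invalid.
    def screen_data(sequences, interactions, min_len=200, max_len=1800):
        """sequences: {protein_id: amino-acid string};
        interactions: [(id_a, id_b, score), ...] as read from HIPPIE 2.0."""
        valid = {pid for pid, seq in sequences.items()
                 if min_len <= len(seq) <= max_len}
        kept = [(a, b, s) for a, b, s in interactions
                if a in valid and b in valid]
        return valid, kept

    # Expected scale: 16,835 proteins / 287,357 records in,
    # 16,212 proteins / 246,726 records out.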
As a further preferred embodiment, the negative samples are built from data outside the protein interaction database, i.e. data not contained in the protein interaction database. Preferably, the input-data negative samples and output-data negative samples are built as follows: 95% of the input-data negative samples are the multi-dimensional image data corresponding to proteins in interaction relationships not contained in the HIPPIE database, and the remaining 5% are multi-dimensional image data formed by randomly changing 1/3 of the amino acids in the protein sequences underlying the input-data positive samples. The output-data negative samples are the interaction data values corresponding to the input-data negative samples.
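A minimal sketch of the random-mutation step used for the latter 5% of negative samples (the amino-acid alphabet and the helper name are assumptions for the sketch):

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)

    def mutate_one_third(sequence, rng=random):
        """Randomly replace 1/3 of the residues of a positive-sample sequence
        to derive a negative-sample sequence, as described above."""
        seq = list(sequence)
        n_mut = len(seq) // 3
        for pos in rng.sample(range(len(seq)), n_mut):
            seq[pos] = rng.choice(AMINO_ACIDS)
        return "".join(seq)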
Then, a portion is selected from all input-data positive samples and a portion from all input-data negative samples to build the training input data set; preferably, the numbers of selected input-data positive and negative samples are equal. A further portion is selected from the input-data positive samples and the input-data negative samples to build the test input data set; preferably, the numbers of selected positive and negative samples are again equal.
Similarly, a portion is selected from all output-data positive samples and a portion from all output-data negative samples to build the training output data set; preferably, the numbers of selected output-data positive and negative samples are equal. A further portion is selected from the output-data positive samples and the output-data negative samples to build the test output data set; preferably, the numbers of selected positive and negative samples are again equal.
The training input data set and training output data set obtained above serve as the known inputs and outputs of the convolutional neural network and are used to train it.
The test input data set and test output data set obtained above serve as the known inputs and outputs of the convolutional neural network and are used to test it, thereby verifying the quality of the convolutional neural network.
As a further preferred embodiment, since the multi-dimensional image data corresponding to a protein is needed when training and testing the convolutional neural network as well as when actually using it to predict protein interactions, the invention provides methods for converting protein sequence data into the corresponding multi-dimensional image data. The data format of a protein sequence is [n, 1] image data, which is essentially the same as the [k, h] image data input to a convolutional neural network. The multi-dimensional image data used during training, testing and processing can be obtained by any of the following: the separate-coding conversion method, the staggered-coding conversion method, the centrosymmetric-coding conversion method, or the adjacent-coding conversion method.
As a further preferred embodiment, the formulas used by the separate-coding conversion method are:
F1=pix (ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)
As a further preferred embodiment, the formulas used by the staggered-coding conversion method are:
F1=pix (ik1*2).reshape(length,width,dimension)
F2=pix (ik2*2+1).reshape(length,width,dimension)。
As a further preferred embodiment, the formulas used by the centrosymmetric-coding conversion method are:
F1=pix (1799-ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)。
As a further preferred embodiment, the formulas used by the adjacent-coding conversion method are:
F1=pix (1799-length+ik1).reshape(length,width,dimension)
F2=pix (1800+ik2).reshape(length,width,dimension)。
The quantities in each of the above formulas have the following meanings: F1 is the multi-dimensional image data obtained after converting the first protein, F2 is the multi-dimensional image data obtained after converting the second protein, pix() is the function that maps a protein sequence position to an image pixel position, ik1 is a sequence position of the first protein, ik2 is a sequence position of the second protein, reshape() is the function that converts one-dimensional sequence data into matrix data, length is the total number of rows of the converted multi-dimensional image data, width is the total number of columns of the converted multi-dimensional image data, and dimension is the number of data channels.
For example, when training the convolutional neural network, protein p1, protein p2 and their corresponding interaction data value are known. Any of the above coding methods can then be used, taking protein p1 as the first protein and protein p2 as the second protein, to obtain the multi-dimensional image data corresponding to p1 and p2 for training the convolutional neural network.
Likewise, when using the convolutional neural network to predict the interaction between protein p3 and protein p4, any of the above coding methods can be used, taking protein p3 as the first protein and protein p4 as the second protein, to obtain the multi-dimensional image data corresponding to p3 and p4, which is then input into the convolutional neural network for prediction.
As a further preferred embodiment, the number of data channels of the multi-dimensional image data is 3, comprising a first data channel, a second data channel and a third data channel: the first data channel stores the coding of the amino acids, the second data channel stores the composition ratio of the amino acids, and the third data channel stores the dipeptide group number of the amino acids.
For example, for a protein sequence, each amino acid in the sequence is mapped into three dimensions using the following rules, the three data channels being as follows (a code sketch is given after Table 1):
The first data channel stores the coding of the amino acid. Conversion method: each of the 20 common amino acids is assigned a value between 20 and 252 in increments of 12, coded in the order of the English abbreviations; for example, alanine (A) is 20, cysteine (C) is 32, aspartic acid (D) is 44, and so on up to tyrosine (Y) at 252;
The second data channel stores the composition ratio of each amino acid. Conversion method: for a given protein sequence, the composition ratio of each of the 20 amino acids in that sequence is calculated. For example, the P85B_HUMAN protein sequence has a length of 780 and contains 27 serine (S) residues, so the corresponding value is 27 ÷ 780 × 255 = 8.82, i.e. 8 after rounding down;
The third data channel stores the number of the amino acid dipeptide group. A dipeptide group is the combination of one amino acid (position i) with the next (position i+1); the dipeptide group of the last amino acid uses the residue itself. The dipeptide value is obtained by multiplying the sequential code of the first residue (the 20 amino acids are coded 1 to 20 by abbreviation letter) by that of the second residue, dividing by 400 and multiplying by the normalization coefficient 255; for example, the dipeptide value of methionine (M) followed by serine (S) is 11 × 14 ÷ 400 × 255 = 98. Since the convolutional neural network requires normalized input data, multiplying the protein data by 255 and rounding in this way satisfies the computational requirement. The coding results of the three data channels are shown in Table 1.
Table 1  Coding results of the three data channels
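For illustration, the three channels can be sketched in Python as follows; the amino-acid ordering and the rounding behaviour are assumptions based on the description above, not the patent's exact implementation.

    AA = "ACDEFGHIKLMNPQRSTVWY"                          # assumed one-letter ordering
    CODE = {aa: 20 + 12 * i for i, aa in enumerate(AA)}  # channel 1: 20, 32, 44, ...
    ORDER = {aa: i + 1 for i, aa in enumerate(AA)}       # 1..20 for the dipeptide code

    def encode_channels(seq):
        """Return a (code, composition, dipeptide) triple for each residue."""
        comp = {aa: int(seq.count(aa) / len(seq) * 255) for aa in AA}    # channel 2
        triples = []
        for i, aa in enumerate(seq):
            nxt = seq[i + 1] if i + 1 < len(seq) else aa     # last residue pairs with itself
            dipep = int(ORDER[aa] * ORDER[nxt] / 400 * 255)  # channel 3
            triples.append((CODE[aa], comp[aa], dipep))
        return triples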
Preferably, the number of rows and the number of columns of the image are both set to 60. The expressions and conversion effects of the four methods are then as follows (a code sketch of these layouts is given after the four sets of expressions).
Separate-coding conversion method: the expressions are as follows and the conversion effect is shown in Fig. 5.
F1=pix (ik1).reshape(60,60,3)
F2=pix (1800+ik2).reshape(60,60,3)
Staggered-coding conversion method: the expressions are as follows and the conversion effect is shown in Fig. 6.
F1=pix (ik1*2).reshape(60,60,3)
F2=pix (ik2*2+1).reshape(60,60,3)
Centrosymmetric-coding conversion method: the expressions are as follows and the conversion effect is shown in Fig. 7.
F1=pix (1799-ik1).reshape(60,60,3)
F2=pix (1800+ik2).reshape(60,60,3)
Adjacent-coding conversion method: the expressions are as follows and the conversion effect is shown in Fig. 8.
F1=pix (1799-length+ik1).reshape(60,60,3)
F2=pix (1800+ik2).reshape(60,60,3)
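As a sketch, one reading of these formulas places the per-residue channel values of both proteins into a shared 3,600-pixel buffer and reshapes it into a single 60 × 60 × 3 image; the function and variable names are assumptions, and encode_channels refers to the earlier sketch.

    import numpy as np

    def to_image(chan1, chan2, method="centrosymmetric"):
        """chan1/chan2: per-residue channel triples of the first and second protein.
        Returns a 60 x 60 x 3 array; unused pixels remain zero (background)."""
        pix = np.zeros((3600, 3), dtype=np.uint8)
        for ik1, values in enumerate(chan1):
            if method == "separate":
                pos = ik1
            elif method == "staggered":
                pos = ik1 * 2
            elif method == "centrosymmetric":
                pos = 1799 - ik1
            else:                                  # "adjacent"
                pos = 1799 - len(chan1) + ik1
            pix[pos] = values
        for ik2, values in enumerate(chan2):
            pos = ik2 * 2 + 1 if method == "staggered" else 1800 + ik2
            pix[pos] = values
        return pix.reshape(60, 60, 3)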
As a further preferred embodiment, considering the local-perception characteristic of convolutional neural networks and in order to avoid the data contamination caused by data lying too close together, better results can be obtained with the third method, centrosymmetric coding. The reasons are: (1) the edges of the picture can be grouped with the background colour, preventing the boundary blurring that border-processing operations would otherwise introduce; (2) except for the first 60 positions of the two interacting protein sequences, every other part of a sequence is adjacent only to different fragments of the same sequence, so no spurious features arise from juxtaposing different sequences, which increases the reliability of the model; (3) the two sequences are joined seamlessly, so high-level features of the overall protein interaction can be mined, improving the discrimination performance of the model.
As a further preferred embodiment, each of the above data channels may also store other protein data, and data channels may be added according to the needs of the actual calculation. For example, on the basis of the above embodiment, in which the first data channel stores the amino acid coding, the second stores the amino acid composition ratios and the third stores the dipeptide group numbers, a fourth and a fifth data channel may be added to store pathways and active sites respectively, and a sixth and a seventh data channel may further be added to store binding sites and kinetic parameters respectively. Increasing the dimensionality of the data fed into the convolutional neural network increases both the accuracy and the scope of the analysis.
As a further preferred embodiment, the training and test data sets used by this method comprise more than 560,000 records, tens of times more than the training-set sizes of conventional convolutional neural networks (20,000 to 50,000). With the parameter settings of the above embodiment, each sample has 3,600 values per dimension (a 60 × 60 data matrix) and 3 dimensions in total. Given the data volume and the need for timely calculation, this method uses a sequential structure with 4 convolutional layers. Preferably, the convolutional neural network comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a first pooling layer, a second pooling layer, a first fully connected layer and a second fully connected layer; the output of the first convolutional layer is connected, in sequence, through the second convolutional layer, the first pooling layer, the third convolutional layer, the fourth convolutional layer, the second pooling layer and the first fully connected layer to the input of the second fully connected layer. That is, the second convolutional layer receives the output of the first convolutional layer, the first pooling layer receives the output of the second convolutional layer, the third convolutional layer receives the output of the first pooling layer, the fourth convolutional layer receives the output of the third convolutional layer, the second pooling layer receives the output of the fourth convolutional layer, the first fully connected layer receives the output of the second pooling layer, and the second fully connected layer receives the output of the first fully connected layer.
As a further preferred embodiment, the pooling layers perform down-sampling operations. The step parameter is set to 2, so each convolution operation reduces the size of the sequence data matrix by 2. The first and second convolutional layers are each built with 60 convolution kernels of scale (3, 3); the first convolutional layer has 1,680 parameters, the second has 32,460, the activation function of both is relu, and the dropout value is 0.5. The first convolutional layer receives the 60 × 60 matrix input to the convolutional neural network, i.e. the multi-dimensional image data, and outputs a 58 × 58 matrix; the second convolutional layer receives this 58 × 58 matrix and outputs a 56 × 56 matrix. After the first and second convolutional layers, a first pooling layer applies max pooling to the spatial signal, with the pool scale set to (2, 2), i.e. a down-sampling factor of 2 for the protein data in both the vertical and horizontal directions, so that after the first round of convolution and pooling the original data scale is halved; data_format is set to channels_last, corresponding to the TensorFlow data structure, and the first pooling layer outputs a 26 × 26 matrix. The third and fourth convolutional layers are each built with 60 convolution kernels of scale (3, 3); the third convolutional layer has 64,920 parameters, the fourth has 129,720, and the activation function of both is relu. The third convolutional layer receives the 26 × 26 matrix and outputs a 22 × 22 matrix; the fourth convolutional layer receives the 22 × 22 matrix and outputs an 18 × 18 matrix; the second pooling layer receives the 18 × 18 matrix and outputs a 9 × 9 matrix. After the output is flattened into a one-dimensional vector of 120 dimensions, the first fully connected layer receives this output; it has 8,786,432 parameters, a dropout value of 0.25 and a relu activation function. The first fully connected layer outputs a vector of length 512, and finally the second fully connected layer receives the vector from the previous fully connected layer and calculates the final result.
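A minimal Keras sketch of this sequential structure (layer sizes follow the text; the final output layer and activation are assumptions, and the exact parameter counts and intermediate matrix sizes may differ from those reported above):

    from tensorflow.keras import Sequential, layers

    def build_model(kernels_12=60, kernels_34=60):
        """Four convolutional layers, two max-pooling layers and two fully
        connected layers, as described in the embodiment above."""
        return Sequential([
            layers.Conv2D(kernels_12, (3, 3), activation="relu",
                          input_shape=(60, 60, 3)),           # 60 x 60 x 3 image in
            layers.Conv2D(kernels_12, (3, 3), activation="relu"),
            layers.Dropout(0.5),
            layers.MaxPooling2D(pool_size=(2, 2), data_format="channels_last"),
            layers.Conv2D(kernels_34, (3, 3), activation="relu"),
            layers.Conv2D(kernels_34, (3, 3), activation="relu"),
            layers.Dropout(0.5),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dense(512, activation="relu"),              # first fully connected layer
            layers.Dropout(0.25),
            layers.Dense(1, activation="sigmoid"),             # second fully connected layer
        ])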
The more convolution kernels there are, the more protein-data features can be extracted, but the time cost rises exponentially. It is therefore necessary to reduce the time cost while still extracting enough features. Four groups of experiments were used to study the influence of different numbers of convolution kernels on model performance over 12 rounds of iterative training; the experimental data are shown in Table 2.
Table 2  Effect of different numbers of convolution kernels on model performance
As can be seen from Table 2, when the number of kernels in convolutional layers 1 and 2 is increased from 100 to 600 and the number in layers 3 and 4 from 200 to 1,200, the accuracy of the first training round increases from 70.88% to 79.00%, and the final accuracy after 12 rounds of training increases from 80.32% to 88.47%; the accuracy is thus effectively improved. The model performance over the 12 rounds of iterative training in the four groups of experiments is shown in Fig. 9 to Fig. 12.
As can be seen from Fig. 9, with more convolution kernels the slope of the loss-function curve keeps increasing, which effectively reduces the loss function value on the training set. Fig. 10 shows that the loss function value on the test set also decreases as the number of training iterations increases. Fig. 11 shows that increasing the number of convolution kernels effectively increases the accuracy on the training set. Fig. 12 shows the accuracy on the test set: as the number of convolution kernels grows, the accuracy fluctuates up and down with the number of training iterations, similarly to the situation in Fig. 10. The main reason is that the Adam optimizer randomly selects the descent direction; given that the dropout rate used is 0.5 (50% of the model parameters are dropped in each training step to prevent over-fitting), this fluctuation is normal. Curves a, b, c and d in Fig. 9 to Fig. 12 represent the results of the first, second, third and fourth groups of experiments, respectively.
When more convolution kernels are selected, the number of model parameters increases, and the time cost also increases considerably. The parameters of the four models of different scales are shown in Table 3.
Table 3  Accuracies of the comparative experiments and the corresponding model parameters
In the four groups of experiments in Table 3, convolutional layers 1 and 2 are set with 100, 200, 400 and 600 convolution kernels respectively, and layers 3 and 4 with 200, 400, 800 and 1,200 convolution kernels respectively. The statistics show that the ratio of the average number of model parameters per convolution kernel across the four experiments is 1 : 1.25 : 1.81 : 2.40. Setting more model parameters (the total number of parameters increases from 2,320,000 to 33,430,000) raises the model accuracy from 80.31% to 89.93%, but the ratio of the average time costs is 1 : 2.73 : 9.25 : 20.1. Comparing the accuracies, the accuracy of the third experiment differs from that of the fourth by only 2.17%, yet the fourth experiment takes on average 2.17 times as long as the third. The preferred embodiment is therefore to set 400 convolution kernels in each of convolutional layers 1 and 2, and 800 convolution kernels in each of layers 3 and 4.
As a further preferred embodiment, the initialization method of the convolutional neural network is uniform-distribution initialization. The initialization method is the weight-initialization function set for each layer of the model. With 20,000 protein interaction records used to train the model, 4,000 interaction records used to test performance, a positive-to-negative sample ratio of 1:1 and 5 rounds of iterative training, the influence of different initialization methods on model performance was studied, including uniform-distribution initialization, all-zero initialization, all-one initialization, fixed-value initialization, normal-distribution initialization, random-uniform-distribution initialization, truncated-Gaussian initialization, random-orthogonal-matrix initialization and identity-matrix initialization. The results are shown in Table 4.
Table 4  Influence of different initialization methods on model performance
As can be seen from Table 4, the accuracy of uniform-distribution initialization (79.50%) is close to that of fixed-value initialization (80.97%), but its loss function value (0.3952) is clearly lower than that of fixed-value initialization (0.4431). Considering both the loss function and the accuracy on the test set, uniform-distribution initialization is therefore the preferred initialization method.
As a further preferred embodiment, the activation function of the convolutional neural network is the LeakyReLU function. LeakyReLU is an improved relu function, defined as:
f(x) = α * x,  if x < 0
f(x) = x,      if x ≥ 0
When the activation function is not activated, LeakyReLU still produces a non-zero output and therefore a small gradient, which avoids the vanishing gradients that relu activation may cause for parts of a sequence and ensures model stability.
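In Keras terms this corresponds to replacing the built-in relu activation with a LeakyReLU layer, sketched below; the slope value alpha is an assumed example, not taken from the patent.

    from tensorflow.keras import layers

    conv = layers.Conv2D(60, (3, 3))        # convolution without a built-in activation
    leaky = layers.LeakyReLU(alpha=0.3)     # small slope alpha for x < 0 (assumed value)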
As a further preferred embodiment, the optimizer of the convolutional neural network is the Adam algorithm. With a training set of 60,000 protein interaction records, a test set of 10,000, a positive-to-negative sample ratio of 1:1, 5 rounds of iteration in total, the convolution kernels arranged as convolutional layer (100 kernels) - convolutional layer (100 kernels) - pooling layer - convolutional layer (200 kernels) - convolutional layer (200 kernels) - pooling layer, and kernel sizes of 4 × 4 and 5 × 5 respectively, the optimizer performance of seven algorithms, namely RMSprop, Adam, stochastic gradient descent, Adagrad, Adadelta, Adamax and Nadam, was evaluated. The parameters of the Adam algorithm are: the initial learning rate lr is set to 0.001, the beta_1 and beta_2 parameters are set to 0.9 and 0.999 respectively, and an epsilon parameter is provided to prevent division by zero. The experimental results are shown in Table 5.
Table 5  Influence of different optimizers on model performance
From the results in Table 5, the test accuracy of the Adam algorithm (81.30%) is close to that of the Nadam algorithm (82.05%), but its loss function value (0.3338) is smaller than Nadam's (0.3587). Since the loss value is an important indicator of model performance, the Adam algorithm, with the smaller loss function value, is the preferred optimizer when the difference in accuracy is small.
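A sketch of compiling the model with the Adam parameters listed above, reusing the build_model sketch from earlier; the loss and metric choices and the epsilon value are assumptions.

    from tensorflow.keras.optimizers import Adam

    model = build_model(kernels_12=400, kernels_34=800)   # preferred kernel counts
    model.compile(
        optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )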
As a further preferred embodiment, 80,000 records of positive and negative samples are used to form the training set and 30,000 records to form the test set; the convolutional neural network uses uniform-distribution initialization, the LeakyReLU activation function and the Adam optimizer, with 400 convolution kernels in each of convolutional layers 1 and 2 and 800 in each of layers 3 and 4, and is trained for 150 rounds in total. Model performance is evaluated with the following indices: accuracy (Acc), the accuracy of the model on the test data samples; sensitivity (Sen), also called the true positive rate (TPR) or recall, the proportion of positive samples that are correctly predicted; specificity (Spe), also called the true negative rate (TNR), the proportion of negative samples that are correctly identified; precision (Pre), the proportion of correctly predicted samples among those predicted as positive; false positive rate (FPR), the proportion of actually negative samples that are wrongly classified as positive (false positives); diagnostic odds ratio (DOR), an index reflecting the authenticity of the model, obtained as the ratio of the positive likelihood ratio (LR+) to the negative likelihood ratio (LR-), which comprehensively reflects the prediction of the test data and is highly stable; and the Matthews correlation coefficient (Mcc), reflecting the predictive ability and performance of the model under the given sample proportion. These indices are calculated as follows:
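In their standard forms (assumed here to match the definitions used in the patent), these indices are:

    Acc = (TP + TN) / (TP + TN + FP + FN)
    Sen = TPR = TP / (TP + FN)
    Spe = TNR = TN / (TN + FP)
    Pre = TP / (TP + FP)
    FPR = FP / (FP + TN)
    DOR = LR+ / LR- = (TP × TN) / (FP × FN)
    Mcc = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))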
In these formulas, TP is the number of positive samples correctly classified as positive; TN is the number of negative samples correctly classified as negative; FN is the number of positive samples wrongly classified as negative; and FP is the number of negative samples wrongly classified as positive.
The experimental results are shown in Fig. 13; curves a, b, c and d in Fig. 13 represent the training accuracy, the test accuracy, the training loss function value and the test loss function value, respectively.
As curve a in Fig. 13 shows, the training accuracy of the convolution model grows steadily; after 150 rounds of training the accuracy reaches 97.22% and still maintains an increasing trend (the slope at the end of the curve is 0.0048). The training loss function value of the model gradually decreases and converges after 129 rounds of iteration, as shown by curve c in Fig. 13, so the training process of the model is stable and effective. On the test set, the accuracy after 150 rounds of training is 91.45%, as shown by curve b in Fig. 13, indicating that the model does not over-fit and that the results are reliable. However, the loss function value on the test set fluctuates over a relatively large range (-0.095 to +0.114), as shown by curve d in Fig. 13; this is caused by the randomly selected descent direction of the Adam method and by the model dropout value being set to 0.5, and is a normal situation for convolutional neural networks.
According to the experimental results on the 30,000 test records, the accuracy (Acc) of this model is 89.93%, the sensitivity (Sen) is 89.26%, the specificity (Spe) is 90.43%, the precision (Pre) is 89.80%, and the Matthews correlation coefficient (Mcc) is 0.7968. Of the 15,000 positive samples, 13,591 are correctly classified, giving a positive prediction rate of 90.61%, while 1,409 are misclassified, giving a false positive rate of 9.57%. Of the 15,000 negative samples, 13,312 are correctly classified, giving a negative prediction rate of 88.75%, while 1,688 are misclassified, giving a false negative rate of 11.04%. The likelihood ratio for classifying positive and negative samples is 92.02, with a positive likelihood ratio of 8.05 and a negative likelihood ratio of 0.11; the missed-detection rate is 9.57% and the false-alarm rate is 11.25%.
Compared with conventional methods, the present invention has the advantages of transfer learning and of being able to train on more data to increase model accuracy; the results compared with other methods are shown in Table 6. For example, Hishigaki et al. (https://doi.org/10.1016/j.sbi.2004.05.003) reported accuracies for yeast protein interactions of 72.7% for subcellular localization, 63.6% for cellular function and 52.7% for physiological function; the DNdisorder method and its improved version (BMC Bioinformatics, 2013, 14(1), 1-10) achieved an accuracy of 74%. Moreover, compared with other work that processes protein-related data with convolutional neural networks, the present invention still performs well. The AUCpreD method of Wang et al., which uses convolutional neural networks for protein-level prediction (Bioinformatics, 2016, 32(17), i672-i679), achieved 76% accuracy on the CASP9 and CASP10 data sets; Yeeleng et al. used convolutional neural networks for protein-ligand interaction prediction (Bioinformatics btx264, DOI: https://doi.org/10.1093/bioinformatics/btx264) and obtained accuracies of 77.8% and 73.5%. Compared with the methods in the above literature, the 89.93% accuracy of the method of the present invention has a clear advantage.
Table 6  Results of this method compared with other methods
* indicates that the value is not reported in the cited reference
The protein interaction prediction method of the present invention was used to predict and analyse 2,000 protein interaction relationships not recorded in HIPPIE. Based on the constructed model, the interaction results are expressed as loss function values; the 10 protein pairs with the smallest values are shown in Table 7.
Table 7  Top 10 results with the smallest loss function values
For example, the proteins AL1A1_HUMAN and MERL_HUMAN are predicted as the most likely to interact. Among all 246,726 protein interactions, the AL1A1_HUMAN protein participates in 2,556 interactions in total, so the training parameters for AL1A1_HUMAN are relatively comprehensive and accurate. The protein MERL_HUMAN also appears frequently, occurring 1,349 times in the 246,726 positive samples, and 215 proteins in the positive samples have interaction relationships with both AL1A1_HUMAN and MERL_HUMAN, so predicting an interaction relationship between these two proteins is reliable.
In addition to predicting potential protein interactions, the protein interaction prediction method of the present invention can also be used to predict protein binding strength. The interaction strength between proteins is represented by the score value of the HIPPIE data set, which lies between 0.00 and 1.00. The method of the present invention encodes the score value as a vector of length 8: a first element of "0" indicates that the two proteins do not interact and a first element of "1" indicates that they do, while the remaining 7 elements encode the 101 values from 0.00 to 1.00. With this coding, the two-class problem of protein interaction prediction is converted into an eight-way classification problem on the interaction strength. The experimental results are shown in Fig. 14.
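One plausible reading of this encoding, sketched in Python; the binary interpretation of the last seven elements is an assumption, since the text above only states that they encode the 101 values from 0.00 to 1.00.

    def encode_score(score):
        """Encode a HIPPIE score in [0.00, 1.00] as an 8-element vector: element 0
        flags interaction (1) or no interaction (0); the remaining seven elements
        are assumed to be the binary digits of round(score * 100)."""
        interacts = 1 if score > 0 else 0
        level = round(score * 100)                           # one of the 101 values 0..100
        bits = [(level >> i) & 1 for i in range(6, -1, -1)]  # 7 bits, most significant first
        return [interacts] + bits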
As can be seen from Fig. 14, there are errors of varying degrees between the strength values predicted by the method of the present invention and the actual strength values. For a predicted value value_predict and an actual value value_reality, the difference |value_predict - value_reality| corresponds to the calculation error. When the error tolerance is set to the initial value ±0.02, the accuracy of the model is only 0.24%; when the tolerance is set to ±0.03, the accuracy rises to 27.75%; when it is set to ±0.12, the accuracy reaches 57.96%; and when it is set to ±0.26, the accuracy reaches 99.97%. The research shows that this method can judge and predict the interaction strength between proteins within a certain error range. The prediction accuracy of the convolutional neural network for protein interaction strength can therefore be further improved by increasing the number of convolution kernels and model parameters.
The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and such equivalent variations or substitutions are all included within the scope defined by the claims of this application.

Claims (10)

  1. A protein-protein interaction prediction method, characterised by comprising the following steps:
    obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
    inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
  2. The protein-protein interaction prediction method according to claim 1, characterised in that, before the step of inputting the obtained multi-dimensional image data into the convolutional neural network for processing so as to output the prediction result, a step of establishing the convolutional neural network is also provided, which specifically comprises:
    obtaining the multi-dimensional image data and interaction data values corresponding to proteins in a protein interaction database, building input-data positive samples from the obtained multi-dimensional image data, and building output-data positive samples from the obtained interaction data values;
    obtaining the multi-dimensional image data and interaction data values corresponding to proteins outside the protein interaction database, building input-data negative samples from the obtained multi-dimensional image data, and building output-data negative samples from the obtained interaction data values;
    selecting from the input-data positive samples and input-data negative samples to build a training input data set and a test input data set, respectively; selecting from the output-data positive samples and output-data negative samples to build a training output data set and a test output data set, respectively;
    training the convolutional neural network with the training input data set and the training output data set, and testing the convolutional neural network with the test input data set and the test output data set;
    taking the convolutional neural network obtained after training and testing as the required convolutional neural network.
  3. The protein-protein interaction prediction method according to claim 1 or 2, characterised in that the multi-dimensional image data is obtained by any of the following: a separate-coding conversion method, a staggered-coding conversion method, a centrosymmetric-coding conversion method, or an adjacent-coding conversion method.
  4. The protein-protein interaction prediction method according to claim 3, characterised in that the formulas used by the separate-coding conversion method are:
    F1=pix (ik1).reshape(length,width,dimension)
    F2=pix (1800+ik2).reshape(length,width,dimension)
    in the formulas, F1 is the multi-dimensional image data obtained after converting the first protein, F2 is the multi-dimensional image data obtained after converting the second protein, pix() is the function that maps a protein sequence position to an image pixel position, ik1 is a sequence position of the first protein, ik2 is a sequence position of the second protein, reshape() is the function that converts one-dimensional sequence data into matrix data, length is the total number of rows of the converted multi-dimensional image data, width is the total number of columns of the converted multi-dimensional image data, and dimension is the number of data channels.
  5. The protein-protein interaction prediction method according to claim 3, characterised in that the formulas used by the staggered-coding conversion method are:
    F1=pix (ik1*2).reshape(length,width,dimension)
    F2=pix (ik2*2+1).reshape(length,width,dimension)。
  6. The protein-protein interaction prediction method according to claim 3, characterised in that the formulas used by the centrosymmetric-coding conversion method are:
    F1=pix (1799-ik1).reshape(length,width,dimension)
    F2=pix (1800+ik2).reshape(length,width,dimension)。
  7. The protein-protein interaction prediction method according to claim 3, characterised in that the formulas used by the adjacent-coding conversion method are:
    F1=pix (1799-length+ik1).reshape(length,width,dimension)
    F2=pix (1800+ik2).reshape(length,width,dimension)。
  8. The protein-protein interaction prediction method according to claim 3, characterised in that the convolutional neural network comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a first pooling layer, a second pooling layer, a first fully connected layer and a second fully connected layer; the output of the first convolutional layer is connected, in sequence, through the second convolutional layer, the first pooling layer, the third convolutional layer, the fourth convolutional layer, the second pooling layer and the first fully connected layer to the input of the second fully connected layer.
  9. A protein-protein interaction prediction system, characterised by comprising:
    a data module for obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
    a processing module for inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
  10. A protein-protein interaction prediction device, characterised by comprising:
    a memory for storing at least one program, and
    a processor for loading the at least one program and performing the following steps:
    obtaining the multi-dimensional image data corresponding to the proteins to be predicted;
    inputting the obtained multi-dimensional image data into a convolutional neural network for processing, so as to output a prediction result.
CN201710848068.9A 2017-09-19 2017-09-19 Protein interaction prediction method, system and device Active CN107742061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710848068.9A CN107742061B (en) 2017-09-19 2017-09-19 Protein interaction prediction method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710848068.9A CN107742061B (en) 2017-09-19 2017-09-19 Protein interaction prediction method, system and device

Publications (2)

Publication Number Publication Date
CN107742061A true CN107742061A (en) 2018-02-27
CN107742061B CN107742061B (en) 2021-06-01

Family

ID=61236117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710848068.9A Active CN107742061B (en) 2017-09-19 2017-09-19 Protein interaction prediction method, system and device

Country Status (1)

Country Link
CN (1) CN107742061B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086569A (en) * 2018-09-18 2018-12-25 武汉深佰生物科技有限公司 The prediction technique in protein interaction direction and regulation relationship
CN109409442A (en) * 2018-11-21 2019-03-01 电子科技大学 Convolutional neural networks model selection method in transfer learning
CN109740560A (en) * 2019-01-11 2019-05-10 济南浪潮高新科技投资发展有限公司 Human cellular protein automatic identifying method and system based on convolutional neural networks
CN109785902A (en) * 2019-02-20 2019-05-21 成都分迪科技有限公司 A kind of prediction technique of ubiquitination degradation target protein
CN110910964A (en) * 2019-11-08 2020-03-24 深圳先进技术研究院 Intermolecular binding activity prediction method and device
CN110931078A (en) * 2019-12-05 2020-03-27 武汉深佰生物科技有限公司 Artificial intelligence-based protein interaction group prediction service system
CN111192631A (en) * 2020-01-02 2020-05-22 中国科学院计算技术研究所 Method and system for constructing model for predicting protein-RNA interaction binding site
CN111210869A (en) * 2020-01-08 2020-05-29 中山大学 Protein cryoelectron microscope structure analysis model training method and analysis method
WO2020158609A1 (en) * 2019-01-31 2020-08-06 国立大学法人東京工業大学 Three-dimensional structure determination device, three-dimensional structure determination method, discriminator learning device for three-dimensional structure, discriminator learning method for three-dimensional structure, and program
CN111613273A (en) * 2020-04-10 2020-09-01 安徽省农业科学院畜牧兽医研究所 Model training method, protein interaction prediction method, device and medium
CN111916148A (en) * 2020-08-13 2020-11-10 中国计量大学 Method for predicting protein interaction
CN112185458A (en) * 2020-10-23 2021-01-05 深圳晶泰科技有限公司 Method for predicting protein and ligand molecule binding free energy based on convolutional neural network
CN113412519A (en) * 2019-02-11 2021-09-17 旗舰开拓创新六世公司 Machine learning-guided polypeptide analysis
WO2022078170A1 (en) * 2020-10-16 2022-04-21 腾讯科技(深圳)有限公司 Methods for determining interaction information and for training prediction model, an apparatus, and medium
CN112185458B (en) * 2020-10-23 2024-04-26 深圳晶泰科技有限公司 Method for predicting binding free energy of protein and ligand molecule based on convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404632A (en) * 2014-09-15 2016-03-16 深港产学研基地 Deep neural network based biomedical text serialization labeling system and method
CN105930686A (en) * 2016-07-05 2016-09-07 四川大学 Secondary protein structureprediction method based on deep neural network
KR20160149623A (en) * 2015-06-18 2016-12-28 재단법인 전통천연물기반 유전자동의보감 사업단 Apparatus and method predicting pharmacodynamic drug-drug interactions through signaling propagation interference on protein-protein interaction networks
CN106372456A (en) * 2016-08-26 2017-02-01 浙江工业大学 Deep learning Residue2vec-based protein structure prediction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404632A (en) * 2014-09-15 2016-03-16 深港产学研基地 Deep neural network based biomedical text serialization labeling system and method
KR20160149623A (en) * 2015-06-18 2016-12-28 재단법인 전통천연물기반 유전자동의보감 사업단 Apparatus and method predicting pharmacodynamic drug-drug interactions through signaling propagation interference on protein-protein interaction networks
CN105930686A (en) * 2016-07-05 2016-09-07 四川大学 Secondary protein structureprediction method based on deep neural network
CN106372456A (en) * 2016-08-26 2017-02-01 浙江工业大学 Deep learning Residue2vec-based protein structure prediction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANLIN SUN et al.: "Sequence-based prediction of protein protein interaction using a deep-learning algorithm", BMC BIOINFORMATICS *
曹成远 (CAO Chengyuan): "Prediction of protein residue interactions based on deep learning", China Master's Theses Full-text Database, Basic Sciences *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086569A (en) * 2018-09-18 2018-12-25 武汉深佰生物科技有限公司 The prediction technique in protein interaction direction and regulation relationship
CN109086569B (en) * 2018-09-18 2020-04-07 武汉深佰生物科技有限公司 Method for predicting interaction direction and regulation relation of protein
CN109409442A (en) * 2018-11-21 2019-03-01 电子科技大学 Convolutional neural networks model selection method in transfer learning
CN109740560A (en) * 2019-01-11 2019-05-10 济南浪潮高新科技投资发展有限公司 Human cellular protein automatic identifying method and system based on convolutional neural networks
CN109740560B (en) * 2019-01-11 2023-04-18 山东浪潮科学研究院有限公司 Automatic human body cell protein identification method and system based on convolutional neural network
WO2020158609A1 (en) * 2019-01-31 2020-08-06 国立大学法人東京工業大学 Three-dimensional structure determination device, three-dimensional structure determination method, discriminator learning device for three-dimensional structure, discriminator learning method for three-dimensional structure, and program
JP2020123189A (en) * 2019-01-31 2020-08-13 国立大学法人東京工業大学 Stereostructure determining device, stereostructure determining method, stereostructure discriminator learning device, stereostructure discriminator learning method, and program
JP7168979B2 (en) 2019-01-31 2022-11-10 国立大学法人東京工業大学 3D structure determination device, 3D structure determination method, 3D structure discriminator learning device, 3D structure discriminator learning method and program
CN113412519A (en) * 2019-02-11 2021-09-17 旗舰开拓创新六世公司 Machine learning-guided polypeptide analysis
CN109785902A (en) * 2019-02-20 2019-05-21 成都分迪科技有限公司 A kind of prediction technique of ubiquitination degradation target protein
CN109785902B (en) * 2019-02-20 2023-08-29 成都分迪科技有限公司 Prediction method of ubiquitination degradation target protein
CN110910964A (en) * 2019-11-08 2020-03-24 深圳先进技术研究院 Intermolecular binding activity prediction method and device
CN110931078A (en) * 2019-12-05 2020-03-27 武汉深佰生物科技有限公司 Artificial intelligence-based protein interaction group prediction service system
CN111192631A (en) * 2020-01-02 2020-05-22 中国科学院计算技术研究所 Method and system for constructing model for predicting protein-RNA interaction binding site
CN111210869A (en) * 2020-01-08 2020-05-29 中山大学 Protein cryoelectron microscope structure analysis model training method and analysis method
CN111210869B (en) * 2020-01-08 2023-06-20 中山大学 Protein cryo-electron microscope structure analysis model training method and analysis method
CN111613273A (en) * 2020-04-10 2020-09-01 安徽省农业科学院畜牧兽医研究所 Model training method, protein interaction prediction method, device and medium
CN111916148A (en) * 2020-08-13 2020-11-10 中国计量大学 Method for predicting protein interaction
CN111916148B (en) * 2020-08-13 2023-01-31 中国计量大学 Method for predicting protein interaction
WO2022078170A1 (en) * 2020-10-16 2022-04-21 腾讯科技(深圳)有限公司 Methods for determining interaction information and for training prediction model, an apparatus, and medium
CN112185458A (en) * 2020-10-23 2021-01-05 深圳晶泰科技有限公司 Method for predicting protein and ligand molecule binding free energy based on convolutional neural network
CN112185458B (en) * 2020-10-23 2024-04-26 深圳晶泰科技有限公司 Method for predicting binding free energy of protein and ligand molecule based on convolutional neural network

Also Published As

Publication number Publication date
CN107742061B (en) 2021-06-01

Similar Documents

Publication Publication Date Title
CN107742061A (en) 2018-02-27 A kind of prediction of protein-protein interaction methods, systems and devices
CN106874688B (en) 2019-02-12 Intelligent lead compound discovery method based on convolutional neural networks
CN111612790A (en) Medical image segmentation method based on T-shaped attention structure
CN105956150B (en) 2019-03-08 A kind of method and device for generating a user's hair style and dressing collocation suggestions
CN109614973A (en) 2019-04-12 Rice seedling and seedling-stage weed image semantic segmentation method, system, equipment and medium
CN111832227B (en) Shale gas saturation determination method, device and equipment based on deep learning
CN110659207A (en) Heterogeneous cross-project software defect prediction method based on nuclear spectrum mapping migration integration
CN111709318B (en) High-resolution remote sensing image classification method based on generation countermeasure network
CN113191390B (en) Image classification model construction method, image classification method and storage medium
CN112259223B (en) Patient-level tumor intelligent diagnosis method based on full-field digital section
CN111638249A (en) Water content measuring method based on deep learning and application of water content measuring method in oil well exploitation
CN111127490A (en) Medical image segmentation method based on cyclic residual U-Net network
CN112036249A (en) Method, system, medium and terminal for end-to-end pedestrian detection and attribute identification
CN109448842B (en) 2021-09-28 Determination method, apparatus and electronic equipment for human intestinal dysbiosis
CN111325134A (en) Remote sensing image change detection method based on cross-layer connection convolutional neural network
CN110188592A (en) 2019-08-30 A kind of urine formed element cell image classification model construction method and classification method
CN112633301A (en) 2021-04-09 Traditional Chinese medicine tongue image greasy feature classification method based on deep metric learning
CN114529794B (en) Infrared and visible light image fusion method, system and medium
CN116541767A (en) Multi-element greenhouse environment parameter prediction method and system based on graphic neural network
CN105760872A (en) Identification method and system based on robust image feature extraction
Fang et al. Application of genetic algorithm (GA) trained artificial neural network to identify tomatoes with physiological diseases
CN113257357B (en) Protein residue contact map prediction method
CN109840544B (en) 2023-03-24 Hyperspectral image multi-endmember spectral mixture analysis method and device
CN116151319A (en) Method and device for searching neural network integration model and electronic equipment
CN113160886A (en) Cell type prediction system based on single cell Hi-C data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant