CN108171010A

CN108171010A - Protein complex detection method and device based on semi-supervised network embedding model

Info

Publication number: CN108171010A
Application number: CN201711250342.9A
Authority: CN
Inventors: 朱佳; 黄昌勤
Original assignee: Guangzhou Van Ping Electronic Technology Co Ltd; South China Normal University
Current assignee: Guangdong SUCHUANG Data Technology Co.,Ltd.
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2018-06-15
Anticipated expiration: 2037-12-01
Also published as: CN108171010B

Abstract

The invention discloses a protein complex detection method and device based on a semi-supervised network embedding model. The method includes: obtaining the adjacency matrix of a protein-protein interaction network; performing embedding processing on the adjacency matrix to obtain a dimension-reduced matrix; and processing the dimension-reduced matrix with a clustering algorithm to obtain a protein complex detection result. The device includes a memory for storing at least one program and a processor for loading the at least one program to perform the protein complex detection method based on the semi-supervised network embedding model. By applying a dimension transformation to the adjacency matrix of the protein interaction network before handing it to a clustering algorithm, the invention improves the clustering results. The method and device are widely applicable in the technical field of protein complex identification.

Description

Protein complex detection method and device based on semi-supervised network embedding model

Technical field

The present invention relates to the technical field of protein complex identification, and in particular to a protein complex detection method and device based on a semi-supervised network embedding model.

Background art

A protein complex is a complicated graph structure formed by protein-protein interactions (PPI), and it plays a vital role in biochemical processes and pharmaceutical technology. Correctly identifying the protein complexes in a PPI network is therefore extremely useful for the biomedical field. However, because of the tremendous growth of PPI data and the bottleneck imposed by experimental methods, only a small number of protein complexes have been identified experimentally.

To overcome the technical limitations of experimental methods in protein complex detection, computational methods are used. A PPI network can be regarded as an undirected, unweighted graph in which proteins are the vertices and their interactions are the edges. Each protein complex consists of two or more proteins that appear as a densely connected subgraph, which means that graph clustering methods can be used to find them.

Recently, network embedding has been studied extensively and has been shown to further improve the performance of many graph clustering methods. Network embedding learns low-dimensional representations of the vertices of a network so as to capture and preserve the network structure. However, most existing network embedding methods rely heavily on the features of each vertex in the network, which makes them unsuitable for PPI networks: in a PPI network, no metadata other than the protein name is associated with each vertex. In other words, existing network embedding methods cannot fully capture the structure of a PPI network, because there is not enough data to calculate its first-order proximity and second-order proximity.

Summary of the invention

To solve the above technical problem, the first object of the present invention is to provide a protein complex detection method based on a semi-supervised network embedding model, and the second object is to provide a protein complex detection device based on a semi-supervised network embedding model.

The first technical solution adopted by the present invention is:

A protein complex detection method based on a semi-supervised network embedding model, including the following steps:

obtaining the adjacency matrix of the protein interaction network;

performing embedding processing on the adjacency matrix to obtain a dimension-reduced matrix;

processing the dimension-reduced matrix with a clustering algorithm to obtain a protein complex detection result.

Further, the step of performing embedding processing on the adjacency matrix to obtain a dimension-reduced matrix specifically includes:

calculating the first-order proximity between every pair of vertices in the protein interaction network, so as to obtain the local structure information of the protein interaction network;

calculating the second-order proximity between every pair of vertices in the protein interaction network, so as to obtain the global structure information of the protein interaction network;

saving the local structure information and the global structure information into the adjacency matrix, so as to obtain the dimension-reduced matrix.

Further, the step of calculating the first-order proximity between every pair of vertices in the protein interaction network, so as to obtain the local structure information of the protein interaction network, specifically includes:

selecting the preferred neighbor set of each vertex in the protein interaction network using a neighbor selection algorithm;

assigning feature information to each vertex according to its preferred neighbor set, so as to build a feature information matrix;

calculating, from the feature information matrix, the first-order proximity between every pair of vertices in the protein interaction network;

taking the first-order proximity between every pair of vertices in the protein interaction network as the required local structure information of the protein interaction network.

Further, the step of calculating the second-order proximity between every pair of vertices in the protein interaction network, so as to obtain the global structure information of the protein interaction network, specifically includes:

inputting the adjacency matrix and the feature information matrix into a graph convolutional network for processing, so as to output the second-order proximity between every pair of vertices in the protein interaction network;

taking the second-order proximity between every pair of vertices in the protein interaction network as the required global structure information of the protein interaction network.

Further, the step of selecting the preferred neighbor set of each vertex in the protein interaction network using a neighbor selection algorithm specifically includes:

processing the protein interaction network with the DeepWalk algorithm, so as to obtain the DeepWalk vector of each vertex;

selecting a vertex in the protein interaction network as the target vertex;

calculating, from the DeepWalk vectors of the target vertex and of all neighbors of the target vertex, the Euclidean distance between the target vertex and each of its neighbors;

calculating the arithmetic mean of the Euclidean distances between the target vertex and each of its neighbors;

taking the set of all neighbors whose Euclidean distance to the target vertex is greater than the arithmetic mean as the preferred neighbor set of the target vertex;

returning to the step of selecting a vertex in the protein interaction network as the target vertex, until the preferred neighbor set of every vertex in the protein interaction network has been selected.

Further, an optimization step follows the step of calculating the second-order proximity between every pair of vertices in the protein interaction network so as to obtain the global structure information of the protein interaction network. The optimization step includes:

calculating a graph Laplacian regularization loss function from the first-order proximity and the second-order proximity between every pair of vertices in the protein interaction network;

dynamically adjusting the order of the feature information matrix until the graph Laplacian regularization loss function is minimized;

taking the first-order proximity and the second-order proximity that correspond to the minimum of the graph Laplacian regularization loss function as the required local structure information and global structure information of the protein interaction network, respectively.

Further, the graph Laplacian regularization loss function is calculated as follows:

L = L_first + λ L_second

In the formula, L is the graph Laplacian regularization loss function, L_first is the supervised loss of the first-order proximity, L_second is the supervised loss of the second-order proximity, and λ is the balance factor between L_first and L_second.

Further, the supervised loss of the first-order proximity is calculated as follows:

In the formula, v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the matrix built from the DeepWalk vectors of v_i, and y_j is the matrix built from the DeepWalk vectors of v_j;

The supervised loss of the second-order proximity is calculated as follows:

In the formula, L₀ is the number of convolutional layers of the graph convolutional network, and H⁽⁰⁾ = N × D.

α and β are adjusted dynamically so that Z is equal to 0, or as close to 0 as possible, in the following system of equations:

In the formulas, the deviation variables are the negative deviation variable and the positive deviation variable of the first goal, and the negative deviation variable and the positive deviation variable of the second goal; X is the feature information matrix, D is the number of columns of X, P is the highest percentage of the singular values of X, α is a matrix whose number of columns equals the maximum value that D can take, and β equals the minimum value that D can take.

The second technical solution adopted by the present invention is: a protein complex detection device based on a semi-supervised network embedding model, including:

a memory, for storing at least one program;

a processor, for loading the at least one program to perform the protein complex detection method based on a semi-supervised network embedding model described in the first technical solution.

The beneficial effects of the invention are as follows. By performing embedding and dimension transformation on the protein interaction network, the protein complex detection method and device of the invention can improve the efficiency of existing clustering algorithms when they cluster the protein interaction network and can improve the clustering results, so that the protein complex detection result is more accurate. In addition, the invention can assign features to each vertex of the protein interaction network and can capture both the local structure and the global structure of the network. The invention therefore does not require the vertices of the protein interaction network to have features of their own, which overcomes the technical defect of applying a clustering algorithm directly to a protein interaction network whose vertices have no features. The invention is stable, and its prediction results outperform other protein complex detection methods on every evaluation metric.

Brief description of the drawings

Fig. 1 is a flow chart of the protein complex detection method of the present invention;

Fig. 2 is a detailed flow chart of step S2;

Fig. 3 is a detailed flow chart of step S21;

Fig. 4 is a detailed flow chart of step S211;

Fig. 5 shows the comparison results on the Krogan dataset;

Fig. 6 shows the comparison results on the DIP dataset;

Fig. 7 shows the comparison results on the BioGRID dataset;

Fig. 8 is a structural diagram of the protein complex detection device of the present invention.

Detailed description of the embodiments

Embodiment 1

The protein complex detection method based on a semi-supervised network embedding model disclosed by the present invention, as shown in Fig. 1, includes the following steps:

S1. obtaining the adjacency matrix of the protein interaction network;

S2. performing embedding processing on the adjacency matrix to obtain a dimension-reduced matrix;

S3. processing the dimension-reduced matrix with a clustering algorithm to obtain a protein complex detection result.

Existing protein complex detection methods represent the protein interaction network as an undirected graph G=(V, E), in which the proteins are the vertices V, the interactions are the edges E, and the edges carry no weights. Protein interaction networks can be obtained from existing datasets such as Krogan, DIP and BioGRID. From graph theory, a protein interaction network corresponds to an adjacency matrix. Processing the adjacency matrix with a clustering algorithm such as COACH or K-means yields the protein complex detection result, that is, an output indicating which proteins belong to the same cluster, i.e. the same complex. The method of the present invention instead performs embedding processing on the adjacency matrix to obtain a dimension-reduced matrix derived from the adjacency matrix by dimension transformation, and then applies a well-known clustering algorithm to the dimension-reduced matrix to detect protein complexes, which can improve the running efficiency of the clustering algorithm. Because the invention detects protein complexes on the network corresponding to the protein interactions, i.e. on a graph in the mathematical sense, the embodiments do not distinguish between the concepts of protein interaction, PPI, protein interaction network and the graph corresponding to the protein interaction network, unless otherwise stated.
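As a rough illustration of step S1, the sketch below builds the adjacency matrix of an undirected, unweighted graph from a list of interacting protein pairs. The function name `build_adjacency` and the toy protein identifiers are assumptions made for this example; they are not taken from the Krogan, DIP or BioGRID datasets.

```python
def build_adjacency(edges):
    """Return (proteins, matrix) for an undirected, unweighted PPI edge list."""
    proteins = sorted({p for edge in edges for p in edge})
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected graph: keep the matrix symmetric
    return proteins, matrix


# Hypothetical toy interactions, for illustration only.
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
proteins, adjacency = build_adjacency(edges)
```

Row i of `adjacency` lists which proteins interact with `proteins[i]`; every nonzero entry is 1 because the edges carry no weights.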

As a further preferred embodiment, the step of performing embedding processing on the adjacency matrix to obtain a dimension-reduced matrix, i.e. step S2, as shown in Fig. 2, specifically includes:

S21. calculating the first-order proximity between every pair of vertices in the protein interaction network, so as to obtain the local structure information of the protein interaction network;

S22. calculating the second-order proximity between every pair of vertices in the protein interaction network, so as to obtain the global structure information of the protein interaction network;

S23. saving the local structure information and the global structure information into the adjacency matrix, so as to obtain the dimension-reduced matrix.

Here, first-order proximity describes the pairwise similarity between vertices. For any pair of vertices v_i and v_j in the protein interaction network, if there is an edge between v_i and v_j, then v_i and v_j have a positive first-order proximity; otherwise, the first-order proximity between v_i and v_j is 0. First-order proximity reflects the local structure of the protein interaction network.

Second-order proximity describes the pairwise similarity between the neighborhood structures of vertices. Let N_i and N_j denote the neighborhoods of v_i and v_j; the second-order proximity is then determined by the similarity of N_i and N_j. If two vertices share many common neighbors, the second-order proximity between them is high. Second-order proximity has been shown to be a good measure of the similarity of a pair of vertices even when no edge connects them, so it can greatly enrich the relationships between vertices. Second-order proximity reflects the global structure of the protein interaction network.

The concepts of first-order proximity and second-order proximity were first proposed in the LINE model. If u is a vertex of graph G=(V, E), the first-order proximity between u and all other vertices of G can be expressed as N_u = {s_{u,1}, s_{u,2}, …, s_{u,|V|}}, where s_{i,j} is the weight of the edge between vertex i and vertex j. If no edge connects vertex i and vertex j, then s_{i,j}=0; if an edge connects them and G is not a weighted graph, then s_{i,j}=1; and if G is a weighted graph, then s_{i,j}>0. Similarly, the first-order proximity between vertex v and all other vertices of G can be expressed as N_v = {s_{v,1}, s_{v,2}, …, s_{v,|V|}}. In this way, the first-order proximity between every vertex of G and the other vertices can be calculated. The second-order proximity, taking vertices v and u as an example, is then obtained by calculating the similarity between N_u and N_v. Calculating first-order and second-order proximity therefore requires the weight of each edge of the graph first. In a PPI network, however, the vertices have nothing to distinguish them other than their protein names; that is, each vertex lacks the features needed to assign a weight to each edge.
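The following sketch illustrates these definitions on an unweighted graph. The first-order vector N_u is simply row u of the adjacency matrix. The text only says that second-order proximity is obtained from "the similarity between N_u and N_v", so the use of cosine similarity here is an assumption, as are the function names.

```python
import math


def first_order_vector(adjacency, u):
    """N_u: the edge weights s_{u,1..|V|} between vertex u and every vertex."""
    return list(adjacency[u])


def second_order_proximity(adjacency, u, v):
    """Cosine similarity between N_u and N_v (assumed similarity measure)."""
    n_u = first_order_vector(adjacency, u)
    n_v = first_order_vector(adjacency, v)
    dot = sum(a * b for a, b in zip(n_u, n_v))
    norm_u = math.sqrt(sum(a * a for a in n_u))
    norm_v = math.sqrt(sum(b * b for b in n_v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an isolated vertex shares no neighborhood with anyone
    return dot / (norm_u * norm_v)


# Toy unweighted graph: edges 0-1, 0-2, 1-2 and 2-3.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
```

Vertices 0 and 3 are not connected, so their first-order proximity is 0, yet `second_order_proximity(adjacency, 0, 3)` is positive because both are neighbors of vertex 2.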

Because the invention performs protein complex detection on the network corresponding to the protein interactions, that is, on the protein interaction network as a whole, the embodiments do not distinguish, unless otherwise stated, between "the first-order proximity between every pair of vertices in the protein interaction network", "the first-order proximity of the protein interaction network" and "first-order proximity", nor between "the second-order proximity between every pair of vertices in the protein interaction network", "the second-order proximity of the protein interaction network" and "second-order proximity".

Once the first-order proximity and second-order proximity have been obtained, they can be combined with the adjacency matrix; that is, the local structure information corresponding to the first-order proximity and the global structure information corresponding to the second-order proximity are saved into the adjacency matrix, so as to obtain the dimension-reduced matrix. Combining first-order and second-order proximity with an adjacency matrix belongs to the prior art and is not repeated here.

Because each vertex in the protein interaction network has no feature other than its protein name, calculating the first-order proximity of the protein interaction network, i.e. the first-order proximity between every pair of vertices, requires assigning a set of features to each vertex. Given the definition of a protein complex, the important neighbors of each vertex can be used as its features, because these neighbors have a higher probability of being combined with it in a protein complex. Important neighbors are the subset of a vertex's neighbors that is screened out by a selection algorithm.

As a further preferred embodiment, the step of calculating the first-order proximity between every pair of vertices in the protein interaction network so as to obtain the local structure information of the protein interaction network, i.e. step S21, as shown in Fig. 3, specifically includes:

S211. selecting the preferred neighbor set of each vertex in the protein interaction network using a neighbor selection algorithm;

S212. assigning feature information to each vertex according to its preferred neighbor set, so as to build the feature information matrix;

S213. calculating, from the feature information matrix, the first-order proximity between every pair of vertices in the protein interaction network.

Every vertex in the protein interaction network has a preferred neighbor set, although the preferred neighbor set of some vertices may be empty. For a vertex in the protein interaction network, the preferred neighbor set is the set of qualifying neighbors screened from all its neighbors. The preferred neighbor set is used to assign feature information to the vertex: if the preferred neighbor set of vertex v_i contains vertices x, y and z, then x, y and z are the features assigned to v_i. Once every vertex has been assigned features in this way, there is a basis for calculating edge weights, which are then used to calculate the first-order proximity.

Since every vertex has assigned feature information, the feature information matrix (feature matrix) of the protein interaction network can be obtained. It is a matrix of order N × D, where N is the total number of vertices in the protein interaction network and D is the number of features of each vertex. Because the preferred neighbor sets differ from vertex to vertex, the features of each vertex differ, and so does the number of features of each vertex.

For example, in a protein interaction network with N vertices, the maximum number of features a vertex can have is N, so the maximum order of the feature information matrix of this network is N × N. If a vertex has fewer than N features, its row in the feature information matrix has fewer than N columns and can be padded to N columns by a filling algorithm; the preferred approach is to pad the row so that its rightmost elements are zero. When the feature information matrix is used, it is sometimes necessary to reduce its size, keeping the number of rows unchanged and reducing the number of columns. D can then be treated as a variable: its maximum value can be set to the feature count of the vertex with the most features in the network, or directly to N, and its minimum value can be set to the feature count of the vertex with the fewest features. For example, when the maximum value of D is set to N, the N × D feature information matrix can be reduced to order N × (D-1), N × (D-2), and so on. Preferably, the matrix is reduced by deleting its rightmost columns and keeping the leftmost ones.
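The padding and column reduction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `build_feature_matrix` and `shrink`, the choice to list each vertex's features in sorted order, and the use of neighbor labels as matrix entries are all assumptions.

```python
def build_feature_matrix(preferred, order, width=None, pad=0):
    """Build an N x D matrix whose row i lists the preferred neighbors of order[i].

    Rows with fewer than D features are padded on the right with `pad`;
    rows with more than D features lose their rightmost entries.
    """
    if width is None:
        width = len(order)  # N is used as the initial value of D
    rows = []
    for vertex in order:
        features = sorted(preferred.get(vertex, ()))[:width]
        rows.append(features + [pad] * (width - len(features)))
    return rows


def shrink(matrix, width):
    """Reduce D by deleting the rightmost columns and keeping the leftmost."""
    return [row[:width] for row in matrix]


# Hypothetical preferred neighbor sets; vertices without an entry have none.
preferred = {"v1": {"x", "y", "z"}, "v2": {"x"}, "v3": set()}
order = ["v1", "v2", "v3", "x", "y", "z"]
X = build_feature_matrix(preferred, order)
X_small = shrink(X, 2)
```

Here `X` has order 6 × 6 and `X_small` has order 6 × 2: the row count N stays fixed while D shrinks from the right.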

From the feature information matrix, the first-order proximity between every pair of vertices in the protein interaction network can be calculated. There are many ways to calculate first-order proximity from the feature information matrix; cosine similarity is preferred. Since this belongs to the prior art, it is not repeated here.
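One possible reading of this step is sketched below. It assumes that each row of the feature information matrix has already been turned into a numeric vector, and that the cosine similarity of two rows is used as the edge weight only for vertex pairs that are actually connected, with unconnected pairs keeping a first-order proximity of 0. Both assumptions, and the function names, are ours rather than the source's.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_order_matrix(features, adjacency):
    """Weight each existing edge by the cosine similarity of its endpoints."""
    n = len(features)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if adjacency[i][j]:
                s[i][j] = s[j][i] = cosine(features[i], features[j])
    return s


# Toy numeric feature rows and a path graph 0-1-2 (illustrative values).
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = first_order_matrix(features, adjacency)
```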

As a further preferred embodiment, the step of calculating the second-order proximity between every pair of vertices in the protein interaction network, so as to obtain the global structure information of the protein interaction network, is specified as follows.

Second-order proximity represents the similarity of the neighborhood structures of a pair of vertices. To model second-order proximity, the neighborhood of each vertex must therefore be modeled first. For a graph G=(V, E) with n vertices, the corresponding adjacency matrix M contains n rows, i.e. m₁, m₂, …, m_n. For row m_i = {m_{i,j}}, j = 1, …, n, m_{i,j}>0 if and only if v_i and v_j are connected by an edge.

m_i describes the neighborhood structure of vertex v_i, and M provides the neighborhood structure of every vertex. A GCN can therefore be designed on the basis of an autoencoder to preserve the second-order proximity of G.

A graph convolutional network (GCN) based on an autoencoder can use latent variables to learn interpretable latent representations of undirected, unweighted graphs, which makes it well suited to protein interaction networks. The features of each vertex form part of the input data of the GCN; after encoding by L convolutional layers, the representation learned from the original graph is obtained. For the decoding part, a simple inner-product decoder can be used. The protein interaction network is an undirected, unweighted graph G=(V, E) with N=|V| vertices. The adjacency matrix A of G and the feature information matrix X of order N × D are taken as input. Using random latent variables Z_i, an output matrix Z of order N × F is obtained, where F is the number of output features and D is the number of features of each vertex. The required second-order proximity of the protein interaction network, i.e. the second-order proximity between every pair of vertices, can then be obtained from the output of the GCN. Since obtaining second-order proximity from the output of a GCN belongs to the prior art, it is not repeated here.
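For the decoding part, a minimal sketch of an inner-product decoder is given below. It scores every vertex pair by the inner product of their rows in Z and squashes the score with a logistic sigmoid, as is common in graph autoencoders; the sigmoid and the toy values of Z are assumptions, since the text only names the decoder type.

```python
import math


def inner_product_decoder(z):
    """Score every vertex pair as sigmoid(z_i . z_j)."""
    n = len(z)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(z[i], z[j]))
            scores[i][j] = 1.0 / (1.0 + math.exp(-dot))
    return scores


# Toy N x F output matrix with N = 3 vertices and F = 2 output features.
Z = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0]]
scores = inner_product_decoder(Z)
```

Vertices whose rows in Z point in similar directions receive a score close to 1, and vertices with opposed rows receive a score close to 0.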

Since the features of each vertex are generated from its selected neighbors, the number of features differs from vertex to vertex. N is therefore set as the initial value of D, and when the feature information matrix X is built, the entries for features that a vertex does not have are set to 0. Each network layer of the graph convolutional network can then be written as the following nonlinear function:

$H^{(l+1)} = f(H^{(l)}, A)$,

where $H^{(0)} = X$ and $H^{(L)} = Z$.

The propagation rule is as follows:

$f(H^{(l)}, A) = \mathrm{relu}(A H^{(l)} W^{(l)})$,

where $W^{(l)}$ is the weight matrix of the $l$-th network layer and relu is the activation function. Note that multiplying by $A$ only sums the features of all neighbors of a vertex and does not include the vertex itself. It is therefore necessary to add the identity matrix $I$ to $A$. The propagation rule then becomes:

$f(H^{(l)}, A) = \mathrm{relu}(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)})$,

where $\hat{A} = A + I$ and $\hat{D}$ is the diagonal vertex degree matrix of $\hat{A}$. If $L = 3$, the graph convolutional network has three convolutional layers that reconstruct the structure of $A$ to obtain $Z$. Assuming that each layer of the network retains half of the features of the preceding layer, $Z$ has order $N \times D/8$ after three layers.
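A single propagation step of this kind can be sketched in plain Python as follows. The sketch adds the identity matrix to A, normalizes symmetrically by the vertex degrees of the result (the usual GCN renormalization, assumed here), multiplies by the layer input and a weight matrix, and applies relu. The function names and the toy weights are illustrative assumptions; a real model would learn the weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adjacency, h, w):
    """One layer: relu(D^-1/2 (A + I) D^-1/2 H W), D being the degrees of A + I."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    # Every degree is at least 1 because of the added self-loop.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    a_norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j]
               for j in range(n)] for i in range(n)]
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, value) for value in row] for row in z]


# Two connected vertices, identity features, hand-picked toy weights.
A = [[0, 1], [1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, -1.0], [2.0, 0.0]]
H1 = gcn_layer(A, H0, W0)
```

A weight matrix with D rows and D/2 columns halves the number of features per layer, which is how three stacked layers would reduce D columns to D/8.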

As a further preferred embodiment, the neighbor selection algorithm, i.e. step S211, as shown in Fig. 4, is specifically:

S2111. processing the protein interaction network with the DeepWalk algorithm, so as to obtain the DeepWalk vector of each vertex;

S2112. selecting a vertex in the protein interaction network as the target vertex;

S2113. calculating, from the DeepWalk vectors of the target vertex and of all neighbors of the target vertex, the Euclidean distance between the target vertex and each of its neighbors;

S2114. taking the set of all neighbors whose Euclidean distance to the target vertex is greater than the arithmetic mean of these distances as the preferred neighbor set of the target vertex;

S2115. returning to the step of selecting a vertex in the protein interaction network as the target vertex, until the preferred neighbor set of every vertex in the protein interaction network has been selected.

DeepWalk is a method for learning latent representations of nodes. It encodes the social relations of nodes in a continuous vector space, and is an extension of language modeling and unsupervised learning from word sequences to graphs. The method treats sequences of truncated random walks as sentences to learn from. It is scalable and parallelizable, and can be used for network classification and anomaly detection. DeepWalk has been successfully validated in social network and graph analysis. By modeling a stream of short random walks, it learns latent representations that encode the vertices in a low-dimensional continuous vector space.
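The walk-generation half of DeepWalk can be sketched as below. The sketch only produces the truncated random walks that are treated as sentences; training a skip-gram language model on these walks to obtain the vertex vectors is not shown. The function name, the walk length, the number of walks per vertex and the toy graph are assumptions made for illustration.

```python
import random


def truncated_random_walks(neighbors, walk_length=5, walks_per_vertex=2, seed=0):
    """Generate truncated random walks that start from every vertex."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        starts = list(neighbors)
        rng.shuffle(starts)  # visit the start vertices in random order
        for start in starts:
            walk = [start]
            # Stop at the length limit or at a vertex with no neighbors.
            while len(walk) < walk_length and neighbors[walk[-1]]:
                walk.append(rng.choice(neighbors[walk[-1]]))
            walks.append(walk)
    return walks


# Hypothetical toy network given as adjacency lists.
neighbors = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3"],
}
walks = truncated_random_walks(neighbors)
```

Each walk plays the role of a sentence and each vertex the role of a word when the walks are passed to the language model.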

Processing the protein interaction network with DeepWalk yields a 64-dimensional vector for each vertex in the network; from the 64-dimensional vectors of any two vertices, the Euclidean distance between the two vertices can be calculated. In the present application, the 64-dimensional vector obtained for a vertex after DeepWalk processing is called the DeepWalk vector of that vertex. A vertex in the protein interaction network is selected as the target vertex, the Euclidean distance between the target vertex and each of its neighbors is calculated, and the arithmetic mean of these distances is then computed, i.e. the sum of the Euclidean distances between the target vertex and all its neighbors divided by the number of its neighbors. The Euclidean distance between the target vertex and each neighbor is then compared with the arithmetic mean: a neighbor whose Euclidean distance is greater than the arithmetic mean is included in the preferred neighbor set, and any other neighbor is excluded from it. In this way, the qualifying neighbors of a given vertex of the protein interaction network are screened out to form its preferred neighbor set.

The above procedure is repeated. After the preferred neighbor set of a target vertex has been established in step S2114, the method returns to step S2112, selects another vertex whose preferred neighbor set has not yet been established as the new target vertex, and continues from step S2112, until the qualifying neighbors of every vertex in the protein interaction network have been screened out in this way to form the corresponding preferred neighbor sets. Once the preferred neighbor sets are available, feature assignment and the other operations disclosed above can be carried out.
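Steps S2112 to S2115 can be sketched as a single loop over the vertices. The sketch assumes the DeepWalk vectors are already available as a mapping from vertex to coordinates; the function name and the two-dimensional toy vectors (standing in for the 64-dimensional ones) are illustrative assumptions. The strict "greater than the mean" comparison follows the text as written.

```python
import math


def preferred_neighbor_sets(neighbors, vectors):
    """For each target vertex, keep the neighbors farther away than the mean."""
    preferred = {}
    for target, adjacent in neighbors.items():
        if not adjacent:
            preferred[target] = set()  # no neighbors, so nothing to select
            continue
        distance = {v: math.dist(vectors[target], vectors[v]) for v in adjacent}
        mean = sum(distance.values()) / len(distance)
        preferred[target] = {v for v, d in distance.items() if d > mean}
    return preferred


# Toy 2-dimensional stand-ins for the 64-dimensional DeepWalk vectors.
vectors = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (3.0, 4.0)}
neighbors = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
preferred = preferred_neighbor_sets(neighbors, vectors)
```

For vertex A the distances are 1, 2 and 5 with a mean of 8/3, so only D is kept. A vertex with a single neighbor ends up with an empty preferred neighbor set, because one distance can never exceed its own mean.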

With this neighbor selection algorithm, the meaning of the feature information matrix becomes more definite: it has N rows and D columns, where N is the total number of vertices in the protein interaction network and D is the number of features of each vertex. Since the DeepWalk algorithm maps each vertex to a 64-dimensional vector, each element of the feature information matrix is in essence a 64-dimensional vector.

As a further preferred embodiment, an optimization step follows the step of calculating the second-order proximity between every pair of vertices in the protein interaction network so as to obtain the global structure information of the protein interaction network. The optimization step is described below.

Because N is set as the initial value of D when the feature information matrix is built, the order of the feature information matrix is not necessarily the most reasonable, and the first-order and second-order proximity of the protein interaction network obtained from it are not necessarily optimal, so the dimension-reduced matrix finally passed to the clustering algorithm is not optimal either. To obtain the optimal dimension-reduced matrix, the order of the feature information matrix is adjusted dynamically, and the first-order and second-order proximity of the protein interaction network change accordingly. When the graph Laplacian regularization loss function calculated from the first-order and second-order proximity reaches its minimum, the corresponding combination of first-order and second-order proximity is optimal. This optimal combination is used as the required local structure information and global structure information of the protein interaction network, respectively, from which the dimension-reduced matrix is then obtained.

As a further preferred embodiment, the graph Laplacian regularization loss function is calculated as follows:

$$L = L_{first} + \lambda L_{second}$$

In the formula, L is the graph Laplacian regularization loss function, L_first is the supervised loss of the first-order estimate, L_second is the supervised loss of the second-order estimate, and λ is the balance factor between L_first and L_second. λ is a parameter whose value can be chosen when the algorithm is actually run.

As a further preferred embodiment, the supervised loss of the first-order estimate is calculated as follows:

In the formula, v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the matrix built from the Deepwalk vectors of v_i, and y_j is the matrix built from the Deepwalk vectors of v_j. Preferably, y_i is constructed using the Deepwalk vectors of v_i and of all preferred neighbors of v_i as its elements; y_j is constructed in the same way. Because the number of neighbors may differ from vertex to vertex, y_i and y_j may have different orders, so the smaller matrix is padded with zero elements to make the two matrices the same size for the calculation. The following padding method is preferred: if the order of y_i is smaller than that of y_j, y_i is padded with zero elements into a new matrix that has the same order as y_j, with y_i occupying the upper-left corner of the new matrix.
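The zero-padding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `pad_to_top_left` and `pad_pair` and the toy 3-dimensional vectors (the embodiment uses 64-dimensional Deepwalk vectors) are introduced here for illustration only.

```python
import numpy as np

def pad_to_top_left(small, shape):
    """Embed `small` in the upper-left corner of a zero matrix of `shape`."""
    rows, cols = small.shape
    if rows > shape[0] or cols > shape[1]:
        raise ValueError("target shape is smaller than the matrix")
    padded = np.zeros(shape, dtype=small.dtype)
    padded[:rows, :cols] = small
    return padded

def pad_pair(y_i, y_j):
    """Zero-pad both matrices to their common (max rows, max cols) shape."""
    shape = (max(y_i.shape[0], y_j.shape[0]),
             max(y_i.shape[1], y_j.shape[1]))
    return pad_to_top_left(y_i, shape), pad_to_top_left(y_j, shape)

# Toy data (assumption): y_i stacks 2 vectors, y_j stacks 3 vectors.
y_i = np.arange(6, dtype=float).reshape(2, 3)
y_j = np.ones((3, 3))
p_i, p_j = pad_pair(y_i, y_j)
```

After padding, `p_i` and `p_j` have the same shape, so an element-wise difference between them is well defined; the added rows of `p_i` contain only zeros.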

In the formula, L₀ is the number of convolutional layers of the graph convolutional neural network and H^(0) = N × D. Here too, zero-element padding is used so that H^(l+1) and H^(l) have the same order.

With the above method, the combination of first-order and second-order estimates at which the graph Laplacian regularization loss function L reaches its minimum is optimal.
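A forward pass through a graph convolutional network with several layers can be sketched as below. The propagation rule used here, H^(l+1) = ReLU(Â H^(l) W^(l)) with Â the symmetrically normalized adjacency matrix with self-loops, is the widely used rule of Kipf and Welling and is an assumption; it is not confirmed to be the rule of this embodiment. The weight matrices are random and untrained, the names `normalized_adjacency` and `gcn_forward` are hypothetical, and square weight matrices are used as a simplification so that every H^(l) keeps the same order without padding.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2) for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, weights):
    """Apply one graph convolution per weight matrix, starting from H^(0) = X."""
    A_hat = normalized_adjacency(A)
    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)     # ReLU activation
    return H

# Toy path graph with 3 vertices and 4 features per vertex (assumption).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))
weights = [rng.normal(size=(4, 4)) for _ in range(3)]  # three layers
H3 = gcn_forward(A, X, weights)
```

With three weight matrices the loop corresponds to a three-layer network; the output `H3` has one row per vertex, like the input feature matrix.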

In the formulas, the four deviation variables are, in order, the negative deviation variable of the first goal, the positive deviation variable of the first goal, the negative deviation variable of the second goal, and the positive deviation variable of the second goal; X is the feature information matrix, D is the number of columns of X, P is the maximum percentage of the singular values of X, α is a matrix whose number of columns equals the maximum value that D can take, and β equals the minimum value that D can take.

The first-order and second-order estimates calculated from the feature information matrix for which Z equals 0, or is as close to 0 as possible, are taken respectively as the local structure information and the global structure information of the protein interaction network to be obtained.

The above method is another implementation of the optimization step. Mathematically, optimizing by minimizing the graph Laplacian regularization loss function is in fact a matrix dimensionality reduction problem, so as a preferred embodiment the traditional singular value decomposition (SVD) can be used to reduce the dimension of the matrix. By the SVD theorem, the N × D feature information matrix X can be written as U × S × V*. Here U is an N × N orthogonal matrix; S is an N × D diagonal matrix whose diagonal entries are the singular values of X; and V* is the conjugate transpose of a D × D orthogonal matrix V. If the smallest singular values, up to some maximum percentage P of them, are set to 0, an approximate matrix X' of X is obtained. The value of D is thereby reduced; however, because the reconstruction error of X → X' must be minimized, the value of 1-P must be maximized. After the SVD factors have been multiplied out, X' = (1-P)X, where X is an N × D matrix, so the problem of optimizing by minimizing the graph Laplacian regularization loss function can be converted into a goal programming problem, as shown in the following system of equations:
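The truncation step, in which a fraction P of the smallest singular values is set to 0 to obtain the approximation X', can be sketched with NumPy as follows. The function name `truncate_svd`, the random toy matrix and the choice P = 0.5 are assumptions made for illustration; the sketch covers only the SVD step, not the goal programming formulation.

```python
import numpy as np

def truncate_svd(X, p):
    """Zero the smallest fraction `p` of singular values and rebuild X'."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    kept = len(s) - int(np.floor(p * len(s)))  # number of singular values kept
    s_kept = s.copy()
    s_kept[kept:] = 0.0                        # s is sorted in descending order
    return (U * s_kept) @ Vh, kept

# Toy feature information matrix with N = 6 rows and D = 4 columns (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X_approx, kept = truncate_svd(X, p=0.5)

# Frobenius-norm reconstruction error of X -> X'.
error = np.linalg.norm(X - X_approx)
```

The reconstruction error grows as more singular values are discarded, which is why a larger 1-P gives an approximation closer to X.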

Dynamically adjusting α means the following. α is preferably taken initially as an N × N matrix, that is, the feature information matrix itself. Adjusting α means gradually reducing its order: for example, the rightmost column is deleted to give an N × (N-1) matrix, which is substituted into the system of equations and the calculation is repeated; in the next step the rightmost column is deleted again to give an N × (N-2) matrix, which is again substituted into the system of equations, and so on.

In this system of equations, the positive and negative deviation variables are given equal importance, which means that the weight of every deviation variable is 1. Clearly, a Pareto optimal solution is obtained when Z equals 0. In some cases, however, Z cannot be exactly 0, and the Z sought is then the value within its range that is as close to 0 as possible. α and β are therefore updated repeatedly until a combination of α and β is found that makes Z equal or close to 0. The feature information matrix corresponding to this combination of α and β is optimal, and the first-order and second-order estimates calculated from it make the reduced-dimension matrix optimal, which improves the clustering result.
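The search over progressively narrower matrices can be sketched as a simple loop. The system of equations that defines Z is not reproduced in the sketch, so `deviation` is a hypothetical callable standing in for it, and the toy objective at the end is an assumption chosen only so that the example runs.

```python
def drop_rightmost_columns(matrix, min_cols=1):
    """Yield the matrix with 0, 1, 2, ... rightmost columns removed."""
    width = len(matrix[0])
    for cols in range(width, min_cols - 1, -1):
        yield [row[:cols] for row in matrix]

def search_best_width(matrix, deviation, min_cols=1):
    """Return (|Z|, columns kept) for the candidate whose Z is closest to 0.

    `deviation` is a hypothetical stand-in for the goal programming
    objective Z described in the text.
    """
    best = None
    for candidate in drop_rightmost_columns(matrix, min_cols):
        z = abs(deviation(candidate))
        if best is None or z < best[0]:
            best = (z, len(candidate[0]))
        if z == 0:
            break  # Z = 0 cannot be improved on
    return best

# Toy objective (assumption for illustration): Z = 0 when 2 columns remain.
toy = [[1, 2, 3, 4], [5, 6, 7, 8]]
result = search_best_width(toy, lambda m: len(m[0]) - 2)
```

The loop stops early when Z reaches 0 and otherwise returns the candidate whose Z is closest to 0, matching the two cases distinguished above.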

Embodiment 2

In this embodiment, the protein complex detection method based on the semi-supervised network embedding model described in Embodiment 1 is tested on three PPI datasets in combination with existing clustering methods. Its results are compared with those of the same clustering methods applied conventionally and with state-of-the-art methods, to show the performance of the method of Embodiment 1. The experiments were run on a desktop computer with a dual-core i7 CPU at 4.00 GHz, 16 GB of memory, and a GTX 1070 graphics card. The entire computation for the three datasets can be completed within one day. In addition, because clustering PPI data is usually a one-off process in practice, the study does not focus on improving run time or analyzing time complexity; clustering quality matters more.

Three recent Saccharomyces cerevisiae PPI datasets are used: the Krogan dataset, the Dip dataset, and the Biogrid dataset. The Krogan and Dip datasets have been used to evaluate the performance of several clustering algorithms. As shown in Table 1, the Krogan and Dip datasets have similar average degree and density, whereas the Biogrid dataset has a higher average degree and density than either. Because PPI data can be represented by an undirected graph G = (V, E), the average degree can be calculated as $2|E| / |V|$ and the density as $2|E| / (|V|(|V|-1))$. The characteristics of the three PPI datasets are shown in Table 1.

PPI data have a high false-positive rate, estimated at about 50%. This noise interferes with clustering methods that detect protein complexes from PPI data. CYC2008 is therefore used as the reference dataset. CYC2008 provides a manually curated catalogue of 408 protein complexes in Saccharomyces cerevisiae, 90% more than another popular dataset, MIPS.
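The two graph statistics can be checked with a few lines of code. The sketch below uses the standard undirected-graph formulas, average degree 2|E|/|V| and density 2|E|/(|V|(|V|-1)), together with the vertex and edge counts listed in Table 1; the function and variable names are illustrative.

```python
def average_degree(num_vertices, num_edges):
    """Mean number of edges per vertex in an undirected graph."""
    return 2 * num_edges / num_vertices

def density(num_vertices, num_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

# (vertices, edges) as listed in Table 1.
datasets = {
    "Krogan": (5364, 61289),
    "Dip": (4972, 17836),
    "Biogrid": (6242, 255510),
}

stats = {
    name: (average_degree(v, e), density(v, e))
    for name, (v, e) in datasets.items()
}

for name, (deg, dens) in stats.items():
    print(f"{name}: average degree {deg:.2f}, density {dens:.4f}")
```

Up to rounding, the computed values agree with the average degree and density columns of Table 1.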

Table 1

Data set	Vertices	Edges	Average degree	Density
Krogan	5364	61289	22.85	0.0043
Dip	4972	17836	7.17	0.0014
Biogrid	6242	255510	81.87	0.013

The neighborhood affinity score is used to decide whether a protein complex detected by an algorithm matches a protein complex in CYC2008. Precision, recall, and the F-measure are then calculated from it to evaluate the performance of the algorithm. The neighborhood affinity score NA(p, b) is defined as follows:

Here, P = (Vp, Ep) is a predicted protein complex and B = (Vb, Eb) is a reference protein complex. Precision can then be calculated as follows:

where

Recall is calculated as follows:

where

The F-measure is the harmonic mean of precision and recall, calculated as follows:

$$F = \frac{2 \times precision \times recall}{precision + recall}$$

ω is a threshold that determines whether a predicted protein complex is confirmed as matching a protein complex in the reference dataset. Based on experiments, the neighborhood affinity score threshold is set to 0.25, a value at which the performance of the model can be distinguished from that of the other algorithms.
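The matching and scoring steps can be sketched as follows. The sketch assumes the commonly used definitions, which may differ in detail from the formulas of this embodiment: NA(p, b) = |Vp ∩ Vb|² / (|Vp| × |Vb|), precision as the share of predicted complexes that match at least one reference complex with NA ≥ ω, and recall as the share of reference complexes matched by at least one predicted complex. All function names and the toy complexes are illustrative.

```python
def neighborhood_affinity(predicted, reference):
    """Assumed definition: |Vp & Vb|^2 / (|Vp| * |Vb|) on protein sets."""
    shared = len(predicted & reference)
    return shared * shared / (len(predicted) * len(reference))

def precision_recall_f(predicted_set, reference_set, omega=0.25):
    """Precision, recall and F-measure for non-empty lists of complexes."""
    matched_pred = sum(
        any(neighborhood_affinity(p, b) >= omega for b in reference_set)
        for p in predicted_set
    )
    matched_ref = sum(
        any(neighborhood_affinity(p, b) >= omega for p in predicted_set)
        for b in reference_set
    )
    precision = matched_pred / len(predicted_set)
    recall = matched_ref / len(reference_set)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy complexes (assumption): proteins are represented by letters.
predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y"}]
reference = [{"A", "B", "C", "D"}, {"E", "F", "G", "H"}]
precision, recall, f_measure = precision_recall_f(predicted, reference)
```

In this toy case only the first predicted complex reaches the 0.25 threshold (its score against the first reference complex is 0.75), so one of three predictions and one of two references are matched.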

In addition, three further metrics are used to measure the quality of the protein complex clusters: fraction (Frac), maximum matching ratio (MMR), and geometric accuracy (Acc). Frac estimates the fraction of pairs of protein complexes whose overlap score θ is greater than 0.25. Frac(θ) is calculated as follows:

Here, A and B are two protein complexes.

Acc is the geometric mean of two other measures: clustering sensitivity (Sn) and clustering positive predictive value (PPV). Sn and PPV are calculated as follows:

Here, n is the number of reference protein complexes, m is the number of clustered (predicted) protein complexes, and element t_ij is the number of proteins found in both reference complex i and predicted complex j. Because Sn can be inflated by putting every protein into the same complex, and PPV can be maximized by putting every protein into a complex of its own, the two measures are combined by taking their geometric mean:

$$Acc = \sqrt{S_n \times PPV}$$

MMR treats the two sets of protein complexes as a bipartite graph in which the two sets of nodes represent the reference complexes and the predicted complexes respectively, and each edge joining a reference complex to a predicted complex is weighted by their overlap score. The overlap score between two protein complexes is calculated by the overlap-score equation. The value of MMR is the total weight of the subset of edges that has maximum weight, divided by the number of reference protein complexes.
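The geometric accuracy can be sketched as follows. The Sn and PPV formulas are assumed to be the commonly used clustering-wise definitions: Sn is the sum over reference complexes of the largest overlap t_ij divided by the total size of the reference complexes, and PPV is the sum over predicted complexes of the largest overlap divided by the sum of all t_ij. These definitions, the function names and the toy complexes are assumptions for illustration.

```python
import math

def contingency(reference, predicted):
    """t[i][j] = number of proteins shared by reference i and predicted j."""
    return [[len(r & p) for p in predicted] for r in reference]

def sn_ppv_acc(reference, predicted):
    """Return (Sn, PPV, Acc) under the assumed clustering-wise definitions."""
    t = contingency(reference, predicted)
    columns = list(zip(*t))
    sn = sum(max(row) for row in t) / sum(len(r) for r in reference)
    ppv = sum(max(col) for col in columns) / sum(sum(col) for col in columns)
    return sn, ppv, math.sqrt(sn * ppv)

# Toy complexes (assumption): proteins are represented by letters.
reference = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]
sn, ppv, acc = sn_ppv_acc(reference, predicted)
```

The sketch assumes that at least one protein is shared between the two sets of complexes; otherwise the PPV denominator is zero.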

Research has found that, to date, COACH is the most stable and most representative clustering algorithm for PPI networks. It is therefore used as the clustering method for evaluating the model. The performance of the model is compared with two state-of-the-art network embedding models, DeepWalk and SDNE. To assess the robustness of the model, two traditional clustering algorithms of different types, K-means and DBSCAN, are also selected for comparison. For COACH, the three key parameters of the algorithm, namely density, affinity, and closeness, are set to 0.7, 0.2, and 0.5 respectively; empirical analysis shows that these values are sufficient for stable computation with all of the network embedding algorithms. For K-means and DBSCAN, only the default settings are used.

Because SDNE also requires first-order estimates but was designed for social networks, three versions of SDNE are used: SDNE-NA, in which the vertices have no features; SDNE-ALL, in which each vertex uses all of its neighbors as features; and SDNE-SN, in which each vertex uses selected neighbors as features. SDNE-SN selects neighbors with the neighbor selection algorithm disclosed in Embodiment 1.

The test results for the Krogan, Dip, and Biogrid datasets are shown in Fig. 5, Fig. 6, and Fig. 7 respectively.

Judging from the results, the model outperforms the other models in precision, recall, and F-measure on all three datasets. For the high-density Biogrid dataset in particular, the F-measure achieved by the model is at least 90% higher than that of the second-ranked model. For the Dip dataset, the model achieves the highest F-measure, 0.528, which is about 20% higher than COACH alone, 9.5% higher than the second-ranked COACH+SDNE-SN, and 17% higher than COACH+DeepWalk. Similar results are found on the Krogan dataset. These results show that the model is better suited than the other models to complex networks with high density.

Moreover, SDNE-SN outperforms SDNE-NA and SDNE-ALL on all three datasets. Because SDNE-SN calculates its first-order estimates with the neighbor selection algorithm disclosed in Embodiment 1, this result indirectly confirms the effectiveness of the model.

The K-means and DBSCAN clustering algorithms both performed poorly in the tests. Whichever network embedding algorithm they are combined with, the experimental results are not good, which suggests that neither algorithm is suitable for PPI networks.

The clustering quality of the models is compared below. Based on the test results of the previous part, only three representative models are selected for comparison: COACH, COACH+DeepWalk, and COACH+SDNE-SN. Table 2 shows the number of protein complexes detected by the different models. According to the table, the model detects more protein complexes than the other models on all three datasets. With this quantity as a basis, improving the quality of the clustering becomes easier.

Table 2

Data set	COACH + method of the present invention	COACH	COACH+DeepWalk	COACH+SDNE-SN
Krogan	610	570	570	580
Dip	808	748	750	840
Biogrid	3470	3158	3160	3267

Table 3, Table 4, and Table 5 show the comparison of clustering quality for the Krogan, Dip, and Biogrid datasets respectively. Table 3 shows that the model achieves better clustering quality: its MMR and Frac are each about 38% higher than those of the second-ranked COACH+SDNE-SN, and its Acc is about 25% higher. The situation for the Dip dataset is broadly similar.

For the Biogrid dataset, the clustering quality of all models decreases because of the high density of the network. The model nevertheless still outperforms the others. For example, the Acc of the model reaches 0.69, about 25% higher than that of the second-ranked COACH+SDNE-SN.

Table 3

Metric	COACH + method of the present invention	COACH	COACH+DeepWalk	COACH+SDNE-SN
Frac	0.61	0.35	0.4	0.44
Acc	0.68	0.46	0.48	0.54
MMR	0.5	0.19	0.25	0.36

Table 4

Metric	COACH + method of the present invention	COACH	COACH+DeepWalk	COACH+SDNE-SN
Frac	0.81	0.61	0.62	0.64
Acc	0.68	0.58	0.6	0.63
MMR	0.75	0.36	0.4	0.48

Table 5

Metric	COACH + method of the present invention	COACH	COACH+DeepWalk	COACH+SDNE-SN
Frac	0.35	0.14	0.2	0.24
Acc	0.69	0.39	0.4	0.45
MMR	0.28	0.05	0.14	0.22

In contrast to other network embedding methods, an algorithm has been designed that selects key neighbors as the features of each vertex in order to calculate its first-order estimate. In addition, a three-layer GCN has been designed to learn the structure of the PPI network in depth and so preserve its second-order estimate.

Extensive experiments on various PPI networks show that the model is stable and outperforms other state-of-the-art models on every metric. In the future, it is planned to use recurrent neural networks to integrate data from the biomedical literature into PPI networks, in order to further improve the quality of protein complex detection.

Embodiment 3

The protein complex detection device of the present invention based on a semi-supervised network embedding model, as shown in Fig. 8, comprises:

a memory for storing at least one program;

a processor for loading the at least one program to execute the protein complex detection method based on a semi-supervised network embedding model described in Embodiments 1 and 2.

The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims

1. A protein complex detection method based on a semi-supervised network embedding model, characterized in that it comprises the following steps:

obtaining the adjacency matrix of the protein interaction network;

2. The protein complex detection method based on a semi-supervised network embedding model according to claim 1, characterized in that the step of performing embedding processing on the adjacency matrix so as to obtain a reduced-dimension matrix specifically comprises:

calculating the first-order estimate between every pair of points in the protein interaction network, so as to obtain the local structure information of the protein interaction network;

calculating the second-order estimate between every pair of points in the protein interaction network, so as to obtain the global structure information of the protein interaction network;

3. The protein complex detection method based on a semi-supervised network embedding model according to claim 2, characterized in that the step of calculating the first-order estimate between every pair of points in the protein interaction network, so as to obtain the local structure information of the protein interaction network, specifically comprises:

selecting the preferred neighbor set of each vertex in the protein interaction network using a neighbor selection algorithm;

assigning feature information to each vertex according to its preferred neighbor set, so as to establish a feature information matrix;

calculating, according to the feature information matrix, the first-order estimate between every pair of points in the protein interaction network;

taking the first-order estimates between every pair of points in the protein interaction network as the local structure information of the protein interaction network to be obtained.

4. The protein complex detection method based on a semi-supervised network embedding model according to claim 3, characterized in that the step of calculating the second-order estimate between every pair of points in the protein interaction network, so as to obtain the global structure information of the protein interaction network, specifically comprises:

inputting the adjacency matrix and the feature information matrix into a graph convolutional neural network for processing, so as to output the second-order estimate between every pair of points in the protein interaction network;

taking the second-order estimates between every pair of points in the protein interaction network as the global structure information of the protein interaction network to be obtained.

5. The protein complex detection method based on a semi-supervised network embedding model according to claim 3 or 4, characterized in that the step of selecting the preferred neighbor set of each vertex in the protein interaction network using a neighbor selection algorithm specifically comprises:

selecting a vertex in the protein interaction network as the target vertex;

calculating, from the Deepwalk vectors of the target vertex and of all neighbors of the target vertex, the Euclidean distance between the target vertex and each of its neighbors;

taking the set formed by all neighbors whose Euclidean distance to the target vertex is greater than the arithmetic mean as the preferred neighbor set of the target vertex;

returning to the step of selecting a vertex in the protein interaction network as the target vertex, until the preferred neighbor set of every vertex in the protein interaction network has been selected.
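The neighbor selection steps of this claim can be sketched as follows. The sketch follows the wording of the claim literally (a neighbor is kept when its distance is strictly greater than the arithmetic mean of the distances to all neighbors); the function names, the dictionary-based graph representation and the toy 2-dimensional vectors (the description uses 64-dimensional Deepwalk vectors) are assumptions for illustration.

```python
import math

def preferred_neighbors(vertex, neighbors, vectors):
    """Neighbors whose distance to `vertex` exceeds the mean neighbor distance."""
    if not neighbors:
        return set()
    distances = {
        n: math.dist(vectors[vertex], vectors[n]) for n in neighbors
    }
    mean_distance = sum(distances.values()) / len(distances)
    return {n for n, d in distances.items() if d > mean_distance}

def select_all(adjacency, vectors):
    """Apply the selection to every vertex of the network in turn."""
    return {
        vertex: preferred_neighbors(vertex, neighbors, vectors)
        for vertex, neighbors in adjacency.items()
    }

# Toy network (assumption): vertex "a" is linked to "b", "c" and "d".
vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 2.0), "d": (3.0, 4.0)}
adjacency = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
preferred = select_all(adjacency, vectors)
```

Under the strict comparison, a vertex with a single neighbor receives an empty preferred neighbor set, because that neighbor's distance equals the mean; how such vertices should be treated is not settled by this sketch.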

6. The protein complex detection method based on a semi-supervised network embedding model according to claim 4, characterized in that an optimization step is provided after the step of calculating the second-order estimate between every pair of points in the protein interaction network so as to obtain the global structure information of the protein interaction network, the optimization step comprising:

calculating the graph Laplacian regularization loss function according to the first-order and second-order estimates between every pair of points in the protein interaction network;

taking the first-order estimate and the second-order estimate that correspond to the minimum of the graph Laplacian regularization loss function as, respectively, the local structure information and the global structure information of the protein interaction network to be obtained.

7. The protein complex detection method based on a semi-supervised network embedding model according to claim 6, characterized in that the graph Laplacian regularization loss function is calculated as follows:

$$L = L_{first} + \lambda L_{second}$$

In the formula, L is the graph Laplacian regularization loss function, L_first is the supervised loss of the first-order estimate, L_second is the supervised loss of the second-order estimate, and λ is the balance factor between L_first and L_second.

8. The protein complex detection method based on a semi-supervised network embedding model according to claim 7, characterized in that the supervised loss of the first-order estimate is calculated as follows:

9. The protein complex detection method based on a semi-supervised network embedding model according to claim 4, characterized in that an optimization step is provided after the step of calculating the second-order estimate between every pair of points in the protein interaction network so as to obtain the global structure information of the protein interaction network, the optimization step comprising:

In the formulas, the four deviation variables are, in order, the negative deviation variable of the first goal, the positive deviation variable of the first goal, the negative deviation variable of the second goal, and the positive deviation variable of the second goal; X is the feature information matrix, D is the number of columns of X, P is the maximum percentage of the singular values of X, Z is the output obtained by inputting the adjacency matrix and the feature information matrix into the graph convolutional neural network for processing, α is a matrix whose number of columns equals the maximum value that D can take, and β equals the minimum value that D can take;

taking the first-order estimate and the second-order estimate calculated from the feature information matrix for which Z equals 0, or is as close to 0 as possible, as, respectively, the local structure information and the global structure information of the protein interaction network to be obtained.

10. A protein complex detection device based on a semi-supervised network embedding model, characterized in that it comprises:

a memory for storing at least one program;

a processor for loading the at least one program to execute the protein complex detection method based on a semi-supervised network embedding model according to any one of claims 1-9.