CN108171010B - Protein complex detection method and device based on semi-supervised network embedded model - Google Patents

Protein complex detection method and device based on semi-supervised network embedded model

Info

Publication number
CN108171010B
CN108171010B (application CN201711250342.9A)
Authority
CN
China
Prior art keywords
protein interaction
interaction network
vertex
network
matrix
Prior art date
Legal status
Active
Application number
CN201711250342.9A
Other languages
Chinese (zh)
Other versions
CN108171010A (en
Inventor
朱佳
黄昌勤
Current Assignee
Guangdong SUCHUANG Data Technology Co.,Ltd.
Original Assignee
Guangzhou Fanping Electronic Technology Co ltd
South China Normal University
Priority date
Filing date
Publication date
Application filed by Guangzhou Fanping Electronic Technology Co ltd, South China Normal University filed Critical Guangzhou Fanping Electronic Technology Co ltd
Priority to CN201711250342.9A priority Critical patent/CN108171010B/en
Publication of CN108171010A publication Critical patent/CN108171010A/en
Application granted granted Critical
Publication of CN108171010B publication Critical patent/CN108171010B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a protein complex detection method and device based on a semi-supervised network embedded model. The method comprises the steps of obtaining the adjacency matrix of a protein interaction network, embedding the adjacency matrix to obtain a dimension reduction matrix, and processing the dimension reduction matrix with a clustering algorithm to obtain a protein complex detection result. The device comprises a memory for storing at least one program and a processor for loading the at least one program to execute the method. According to the invention, the adjacency matrix corresponding to the protein interaction network undergoes dimension conversion before being processed by the clustering algorithm, so the clustering effect is improved. The method and device are widely applicable in the technical field of protein complex identification.

Description

Protein complex detection method and device based on semi-supervised network embedded model
Technical Field
The invention relates to the technical field of protein complex recognition, in particular to a protein complex detection method and a protein complex detection device based on a semi-supervised network embedded model.
Background
Protein complexes are complex graph structures formed by protein-protein interactions (PPIs), and play a crucial role in biochemical and pharmaceutical processes. Therefore, correctly identifying protein complexes in PPI networks is extremely useful for the biomedical field. However, with the tremendous growth of PPI data, coupled with the bottlenecks of experimental approaches, only a small number of protein complexes have been identified experimentally.
To overcome the technical limitations of experimental methods in protein complex detection, computational methods have been used. A PPI network can be seen as an undirected, unweighted graph in which proteins are vertices and their interactions are edges. Each protein complex is composed of two or more proteins and appears as a densely connected subgraph, meaning that complexes can be found by applying graph-based clustering methods.
Recently, network embedding has been extensively studied and shown to further improve the performance of many graph clustering methods. Network embedding learns low-dimensional representations of the vertices in a network that capture and preserve the network structure. However, most existing network embedding methods rely heavily on the features of each vertex, which makes them unsuitable for PPI networks: in a PPI network, no metadata is associated with a vertex other than the protein name. In other words, conventional network embedding methods cannot fully capture the PPI network structure because there is not enough data to calculate its first- and second-order estimates.
Disclosure of Invention
In order to solve the above-described problems, a first object of the present invention is to provide a method for detecting a protein complex based on a semi-supervised network-embedded model, and a second object of the present invention is to provide a device for detecting a protein complex based on a semi-supervised network-embedded model.
The first technical scheme adopted by the invention is as follows:
the protein complex detection method based on the semi-supervised network embedding model comprises the following steps:
obtaining an adjacency matrix of a protein interaction network;
embedding the adjacency matrix to obtain a dimension reduction matrix;
and processing the dimensionality reduction matrix by using a clustering algorithm to obtain a protein complex detection result.
Further, the step of performing embedding processing on the adjacency matrix to obtain a dimension reduction matrix specifically includes:
calculating first-order estimation between any two points in the protein interaction network so as to obtain local structure information of the protein interaction network;
calculating second-order estimation between any two points in the protein interaction network so as to obtain the overall structure information of the protein interaction network;
and storing the local structure information and the overall structure information into an adjacency matrix, thereby obtaining a dimension reduction matrix.
Further, the step of calculating a first-order estimate between any two points in the protein interaction network to obtain local structural information of the protein interaction network specifically includes:
selecting a preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm;
respectively endowing each vertex with characteristic information according to the preferred neighbor point set of each vertex, thereby establishing a characteristic information matrix;
calculating first-order estimation between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimation between any two points in the protein interaction network as the local structural information of the protein interaction network required to be acquired.
Further, the step of calculating a second-order estimate between any two points in the protein interaction network to obtain overall structural information of the protein interaction network specifically includes:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
Further, the step of selecting a preferred neighbor set for each vertex in the protein interaction network by using a neighbor selection algorithm specifically includes:
processing the protein interaction network by using a Deepwalk algorithm so as to obtain a Deepwalk vector of each vertex;
selecting one vertex in the protein interaction network as an object vertex;
respectively calculating the Euclidean distance between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
taking a set consisting of all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as a preferred adjacent point set of the vertex of the object;
returning to the step of executing one vertex in the selected protein interaction network as an object vertex until a preferred set of neighbors for each vertex in the protein interaction network is selected.
Further, after the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network, an optimization step is provided, wherein the optimization step comprises:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
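This optimization loop (adjust the order of the characteristic information matrix until the combined loss is minimized) can be sketched as follows; `loss_fn` and the `demo` loss are illustrative stand-ins, not the patent's actual graph Laplacian regularization losses.

```python
# Sketch: sweep the column count D of the characteristic information matrix
# and keep the value that minimizes L = L_first + lambda * L_second.
# `loss_fn` is a stand-in that returns (L_first, L_second) for a given D.

def best_order(d_min, d_max, loss_fn, lam):
    """Return the D in [d_min, d_max] minimizing L_first + lam * L_second."""
    best_d, best_loss = None, float("inf")
    for d in range(d_min, d_max + 1):
        l_first, l_second = loss_fn(d)
        total = l_first + lam * l_second
        if total < best_loss:
            best_d, best_loss = d, total
    return best_d, best_loss

def demo(d):
    # Toy stand-in loss, minimized at D = 4.
    return (d - 4) ** 2, abs(d - 4)
```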
Further, the graph Laplacian regularization term loss function is calculated as follows:
L = L_first + λ·L_second
where L is the graph Laplacian regularization term loss function, L_first is the supervised loss of the first-order estimate, L_second is the supervised loss of the second-order estimate, and λ is a balance factor between L_first and L_second.
Further, the supervised loss of the first-order estimate is calculated as follows:
(equation shown as an image in the original)
where v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j;
the second-order estimation monitored loss is calculated according to the following formula:
Figure BDA0001491602620000032
in the formula, L0Convolutional layer for graph convolutional neural networkNumber of layers H(0)=N×D,
Figure BDA0001491602620000033
Further, after the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network, an optimization step is provided, wherein the optimization step comprises:
dynamically adjusting α and β such that Z equals 0 or is as close to 0 as possible in the following system of equations:
(system of equations shown as an image in the original)
where the four deviation variables (shown as images in the original) are, respectively, the negative and positive deviation variables of the first target and the negative and positive deviation variables of the second target; X is the characteristic information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, α is equal to the maximum value that D can take, and β is equal to the minimum value that D can take.
The second technical scheme adopted by the invention is as follows: a semi-supervised network embedding model based protein complex detection apparatus, comprising:
a memory for storing at least one program;
a processor for loading the at least one program to execute the protein complex detection method based on the semi-supervised network embedding model according to the first technical aspect.
The invention has the following beneficial effects: by embedding the protein interaction network and performing dimension conversion, the protein complex detection method and device improve the efficiency with which conventional clustering algorithms operate on the protein interaction network, optimize the clustering effect, and make the protein complex detection result more accurate. Meanwhile, the invention assigns features to each vertex of the protein interaction network and captures both the local structure and the overall structure of the network, so it does not require each vertex to come with its own features, overcoming the technical defect that clustering algorithms cannot directly process a protein interaction network whose vertices have no features. The invention operates stably, and each prediction evaluation index is superior to other protein complex detection methods.
Drawings
FIG. 1 is a flowchart of a method for detecting a protein complex according to the present invention;
FIG. 2 is a detailed flowchart of step S2;
FIG. 3 is a detailed flowchart of step S21;
FIG. 4 is a detailed flowchart of step S211;
FIG. 5 is a comparison of Krogan datasets;
FIG. 6 is a comparison of the Dip datasets;
FIG. 7 is a comparison of the Biogrid data sets;
FIG. 8 is a schematic diagram of a protein complex detection apparatus according to the present invention.
Detailed Description
Example 1
The invention discloses a protein complex detection method based on a semi-supervised network embedding model, which comprises the following steps as shown in figure 1:
s1, acquiring an adjacency matrix of a protein interaction network;
s2, embedding the adjacent matrix to obtain a dimension reduction matrix;
and S3, processing the dimensionality reduction matrix by using a clustering algorithm to obtain a protein complex detection result.
In conventional protein complex detection methods, the protein interaction network is represented as an undirected graph G = (V, E): proteins are the vertices V of the graph, their interactions are the edges E, and the edges carry no weights. Protein interaction networks can be obtained from existing datasets such as Krogan, Dip, and Biogrid. From graph theory, a protein interaction network corresponds to an adjacency matrix; processing the adjacency matrix with a clustering algorithm such as COACH or K-means yields a protein complex detection result, i.e., the output shows which proteins belong to one class, namely one complex. The protein complex detection method based on the semi-supervised network embedding model embeds the adjacency matrix to obtain a dimension reduction matrix through dimension conversion, and then performs protein complex detection on the dimension reduction matrix with a known clustering algorithm, which improves the operational efficiency of the clustering algorithm. Since the invention performs protein complex detection on the interaction network corresponding to protein interactions, i.e., a mathematical graph, the examples do not distinguish between the concepts of protein interaction, PPI, protein interaction network, and the graph corresponding to the protein interaction network unless otherwise specified.
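The S1-S3 pipeline above can be sketched as follows; the toy edge list and the trivial connected-component grouping are illustrative stand-ins for a real PPI dataset and for a clustering algorithm such as COACH or K-means, not the patent's actual embedding step.

```python
# S1: build the adjacency matrix of an undirected, unweighted PPI graph.
def adjacency_matrix(n, edges):
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1
    return A

# Stand-in for S3: group vertices by connectivity (a trivial "clustering";
# a real implementation would cluster the dimension reduction matrix).
def group_by_connectivity(A):
    n, seen, groups = len(A), set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, group = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            group.append(v)
            stack.extend(j for j in range(n) if A[v][j])
        groups.append(sorted(group))
    return groups

edges = [(0, 1), (1, 2), (0, 2), (3, 4)]   # two densely connected groups
A = adjacency_matrix(5, edges)
```

Here `group_by_connectivity(A)` returns `[[0, 1, 2], [3, 4]]`, the two densely connected groups in the toy graph.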
Further as a preferred embodiment, the step of performing embedding processing on the adjacency matrix to obtain the dimension reduction matrix, that is, step S2, as shown in fig. 2, specifically includes:
s21, calculating first-order estimation between any two points in the protein interaction network to obtain local structure information of the protein interaction network;
s22, calculating second-order estimation between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network;
s23, storing the local structure information and the overall structure information into an adjacent matrix, thereby obtaining a dimension reduction matrix.
The first-order estimate (First-order proximity) describes pairwise similarity between vertices. For any pair of vertices v_i and v_j in the protein interaction network, if there is an edge between v_i and v_j, the first-order estimate between them is positive; otherwise, it is 0. The first-order estimate reflects the local structure of the protein interaction network.
The second-order estimate (Second-order proximity) describes pairwise similarity between vertex neighborhood structures. Suppose N_i and N_j denote the neighbor sets of v_i and v_j; the second-order estimate is then determined by the similarity of N_i and N_j. If two vertices share many common neighbors, the second-order estimate between them will be high. Second-order estimation has proven to be a good metric for defining the similarity of a pair of vertices even when they are not connected by an edge, so it greatly enriches the vertex relationships. The second-order estimate reflects the overall structure of the protein interaction network.
The concepts of first- and second-order estimates were first proposed in the LINE model. Assuming u is a vertex in G = (V, E), the first-order estimates between u and all other vertices in G can be expressed as N_u = {s_{u,1}, s_{u,2}, …, s_{u,|V|}}, where s_{i,j} represents the weight of the edge between vertex i and vertex j in G. If there is no edge between vertex i and vertex j, s_{i,j} = 0; if there is an edge and G is not a weighted graph, s_{i,j} = 1; if G is a weighted graph, s_{i,j} > 0. In the same way, the first-order estimates between vertex v and all other vertices in G can be expressed as N_v = {s_{v,1}, s_{v,2}, …, s_{v,|V|}}. From this, first-order estimates between all pairs of vertices in G can be calculated. Taking vertices u and v as an example, the second-order estimate can be obtained by calculating the similarity between N_u and N_v. It can be seen that calculating the first- and second-order estimates first requires the edge weights of the graph, but PPI networks are characterized by having no features to distinguish vertices other than the protein names, i.e., each vertex lacks the features needed to weight the edges.
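A minimal sketch of these definitions for an unweighted graph; cosine similarity of the neighbor vectors N_u and N_v is used here as one possible second-order measure (the patent itself derives second-order estimates via a GCN, so this is illustrative only).

```python
import math

# Rows of the adjacency matrix play the role of N_u = {s_u1, ..., s_u|V|}.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

def first_order(A, u, v):
    """Positive iff u and v share an edge; 0 or 1 in an unweighted graph."""
    return A[u][v]

def second_order(A, u, v):
    """Cosine similarity of N_u and N_v: high when many neighbors are shared."""
    nu, nv = A[u], A[v]
    dot = sum(a * b for a, b in zip(nu, nv))
    norm = math.sqrt(sum(a * a for a in nu) * sum(b * b for b in nv))
    return dot / norm if norm else 0.0
```

Vertices 0 and 1 are not connected (first-order estimate 0), yet they share all their neighbors, so their second-order estimate is 1.0, illustrating how second-order estimation relates vertices that have no edge between them.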
Because the invention uses the interaction network corresponding to protein interactions to detect protein complexes, i.e., it focuses on the protein interaction network as a whole, this embodiment does not distinguish between "the first-order estimate between any two points in the protein interaction network" and "the first-order estimate of the protein interaction network", and likewise does not distinguish between "the second-order estimate between any two points" and "the second-order estimate of the protein interaction network", unless otherwise specified.
After the first-order and second-order estimates are obtained, they can be combined with the adjacency matrix, i.e., the local structure information corresponding to the first-order estimates and the overall structure information corresponding to the second-order estimates are stored into the adjacency matrix, yielding the dimension reduction matrix. Since combining the first- and second-order estimates with the adjacency matrix is prior art, it is not described in detail here.
Because each vertex in the protein interaction network has no other features than the corresponding protein name, in order to compute a first order estimate of the protein interaction network, i.e., between any two vertices in the protein interaction network, it is necessary to assign a set of features to each vertex. In view of the definition of the protein complex, important neighbors of each vertex can be set as features because these neighbors have a higher probability to be combined together as a protein complex. The important neighbors are parts of neighbors screened from all neighbors of a vertex through a certain algorithm.
Further as a preferred embodiment, the step of calculating a first-order estimate between any two points in the protein interaction network to obtain the local structure information of the protein interaction network, i.e. step S21, as shown in fig. 3, specifically includes:
S211, selecting a preferred neighbor set for each vertex in the protein interaction network by using a neighbor selection algorithm;
S212, assigning characteristic information to each vertex according to its preferred neighbor set, thereby establishing a characteristic information matrix;
S213, calculating the first-order estimate between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimates between any two points in the protein interaction network as the local structure information of the protein interaction network to be acquired.
Each vertex in the protein interaction network has a preferred neighbor set, although the preferred neighbor set of certain vertices may be empty. For a vertex in the protein interaction network, its preferred neighbor set is the set of eligible neighbors screened from all its neighbors. Feature information is assigned to each vertex using its preferred neighbor set: if the preferred neighbor set of vertex v_i includes vertices x, y, and z, then x, y, and z are the features assigned to v_i. Each vertex is characterized in this way, and these features then serve as the basis for calculating the edge weights used to compute the first-order estimates.
Since each vertex has the assigned Feature information, a Feature information matrix (Feature matrix) of the protein interaction network is obtained, which is an N × D order matrix, where N is the total number of vertices of the protein interaction network and D is the number of features per vertex. Because the preferred set of neighbors for each vertex is different, i.e., the features of each vertex are different, the number of features per vertex is also different.
For example, in a protein interaction network with N vertices, the maximum number of features a vertex may have is N, so the maximum order of the corresponding feature information matrix is N×N. If the number of features of a vertex is less than N, the vertex's row in the feature information matrix has fewer than N columns and can be completed to N columns with a padding algorithm, preferably by zero-padding on the right. When using the feature information matrix, it is sometimes necessary to reduce its scale, i.e., keep the number of rows unchanged and reduce the number of columns. In this case D can be treated as a variable: the maximum value of D may be set to the feature count of the vertex with the most features (or directly to N), and the minimum value of D to the feature count of the vertex with the fewest features. For example, when the maximum value of D is N, the N×D feature information matrix may be reduced to order N×(D−1), N×(D−2), and so on; preferably, when reducing the order, the rightmost columns are deleted and the leftmost columns retained.
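The padding just described can be sketched as follows; the `preferred` mapping and the 1-based neighbor indices are illustrative (1-based so that 0 can serve as the padding value).

```python
def feature_matrix(preferred, n_cols):
    """Build an N x D feature matrix, zero-padded on the right.

    `preferred` maps vertex -> list of preferred-neighbor indices; rows with
    fewer than n_cols features are completed with zeros, and rows with more
    are truncated from the right (keeping the leftmost columns).
    """
    rows = []
    for v in sorted(preferred):
        feats = preferred[v][:n_cols]
        rows.append(feats + [0] * (n_cols - len(feats)))
    return rows

preferred = {0: [2, 3], 1: [3], 2: [1, 3, 4]}
X = feature_matrix(preferred, 3)
```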
According to the characteristic information matrix, a first-order estimation between any two points in the protein interaction network can be calculated. There are various methods for calculating the first order estimate based on the feature information matrix, and a cosine similarity calculation method may be preferably used, which is not described herein since it belongs to the prior art.
Further as a preferred embodiment, the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network specifically includes:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
The second-order estimate represents the degree of similarity of a pair of vertex neighborhood structures. Thus, to model the second-order estimate, the neighborhood of each vertex is modeled first. A graph G = (V, E) with n vertices corresponds to an adjacency matrix M consisting of n row vectors m_1, m_2, …, m_n. For row vector m_i, m_{i,j} > 0 if and only if v_i and v_j are connected by an edge.
m_i describes the neighborhood structure of vertex v_i, and M provides the neighborhood-structure information of every vertex. Therefore, a GCN can be designed based on an auto-encoder to preserve the second-order estimates of G.
An auto-encoder-based Graph Convolutional neural Network (GCN) applies latent variables and can learn interpretable latent representations of undirected, unweighted graphs, which is well suited to protein interaction networks. Using the features of each vertex as part of the GCN's input data, the representation learned from the original graph is obtained after encoding by the convolutional layers. For the decoding part, an inner product decoder can simply be used. A protein interaction network is an undirected, unweighted graph G = (V, E) with N = |V| vertices. The adjacency matrix A of G and the N×D feature information matrix X are taken as input. Using latent variables, an output matrix Z of order N×F can be obtained, where F is the number of output features and D is the number of features per vertex. The second-order estimates of the protein interaction network to be obtained, i.e., the second-order estimates between all pairs of vertices, can then be derived from the output of the GCN. Since the method of deriving the second-order estimates from the GCN output belongs to the prior art, it is not described here.
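The inner product decoder mentioned above can be sketched as follows; `decode_edge` is an assumed helper name, and the toy 2-dimensional embeddings are illustrative values, not the patent's.

```python
import math

def decode_edge(Z, i, j):
    """Inner product decoder: edge probability = sigmoid(z_i . z_j)."""
    dot = sum(a * b for a, b in zip(Z[i], Z[j]))
    return 1.0 / (1.0 + math.exp(-dot))

Z = [[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]]   # toy N x F output matrix
p_close = decode_edge(Z, 0, 1)   # similar embeddings -> probability near 1
p_far = decode_edge(Z, 0, 2)     # dissimilar embeddings -> probability near 0
```

The decoder is symmetric in i and j, so the reconstructed adjacency is symmetric, matching the undirected graph assumption.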
Since the features of each vertex are generated from its selected neighbors, the number of features differs from vertex to vertex. Therefore, N is set as the initial value of D; when building the feature information matrix X, if a vertex lacks a given feature, the corresponding value is set to 0. Each network layer of the graph convolutional neural network can then be written as the nonlinear function:
H^(l+1) = f(H^(l), A),
where H^(0) = X and the final layer's output H^(L) = Z.
The transmission rule is:
f(H^(l), A) = relu(A·H^(l)·W^(l)),
where W^(l) is the weight matrix of the l-th network layer and relu is the activation function. Note that multiplying by A aggregates the features of all of a vertex's neighbors but not of the vertex itself, so an identity matrix I needs to be added to A. The transmission rule then becomes:
f(H^(l), Â) = relu(D̂^(−1/2)·Â·D̂^(−1/2)·H^(l)·W^(l)),
where Â = A + I and D̂ is the diagonal degree matrix of Â.
Let L = 3, meaning the graph convolutional neural network uses three convolutional layers to reconstruct the structure of A and obtain Z. Given that each layer is set to retain half of the features of the previous layer, the output Z = H^(3) is obtained after the three layers.
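One propagation step under this transmission rule can be sketched in plain Python as follows, assuming the symmetrically normalized form with the identity matrix added to A; the shapes and values are toy examples, not from the patent.

```python
import math

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def gcn_layer(A, H, W):
    """relu(D^-1/2 (A + I) D^-1/2 H W) for one graph convolutional layer."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]                 # degrees of A + I
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, x) for x in row] for row in Z]  # relu

H1 = gcn_layer([[0, 1], [1, 0]],        # adjacency matrix A
               [[1.0], [2.0]],          # feature matrix H^(0)
               [[1.0]])                 # weight matrix W^(0)
```

Adding I before normalizing means each vertex's own features are mixed with its neighbors' at every layer, as the text explains.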
Further as a preferred embodiment, the neighboring point selection algorithm, that is, step S211, as shown in fig. 4, specifically includes:
s2111, processing the protein interaction network by using a Deepwalk algorithm to obtain a Deepwalk vector of each vertex;
s2112, selecting a vertex in the protein interaction network as an object vertex;
s2113, respectively calculating Euclidean distances between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
s2114, taking a set formed by all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as an optimal adjacent point set of the vertex of the object;
s2115, returning to the step of executing one vertex in the selected protein interaction network as an object vertex until a preferred neighbor set of each vertex in the protein interaction network is selected.
Deepwalk is a method for learning latent node representations: it encodes the social relations of nodes in a continuous vector space and extends language modeling and unsupervised learning from word sequences to graphs. The method treats truncated random-walk sequences as sentences. It is scalable and parallelizable, and can be used for network classification and outlier detection. The Deepwalk approach has been successfully validated in social networking and graph analysis; it learns the underlying representations by modeling a series of short random walks, encoding them in a low-dimensional continuous vector space.
After the protein interaction network is processed by Deepwalk, each vertex in the network corresponds to a 64-dimensional vector, and the Euclidean distance between any two vertices can be calculated from their 64-dimensional vectors. In this application, the 64-dimensional vector obtained for each vertex by the Deepwalk algorithm is called the Deepwalk vector of that vertex. A vertex in the protein interaction network, called the object vertex, is selected; the Euclidean distances between the object vertex and all of its neighbors are calculated, and the arithmetic mean of these distances is obtained, i.e., the sum of the Euclidean distances between the object vertex and all of its neighbors divided by the total number of neighbors. Then, the Euclidean distance between the object vertex and each of its neighbors is compared with the arithmetic mean: neighbors whose Euclidean distance is greater than the arithmetic mean are included in the preferred neighbor set, and the others are excluded. In this way, the eligible neighbors of a particular vertex of the protein interaction network are screened to form its preferred neighbor set.
The above procedure is applied repeatedly: after a preferred neighbor set has been constructed for one object vertex in step S2114, the process returns to step S2112, selects as the new object vertex another vertex whose preferred neighbor set has not yet been constructed, and continues from step S2112, until the eligible neighbors of every vertex in the protein interaction network have been screened into a corresponding preferred neighbor set. With the preferred neighbor sets available, operations such as feature assignment can be performed by the methods disclosed above.
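The neighbor selection loop described above can be sketched as follows. The 64-dimensional DeepWalk vectors are assumed to have been computed beforehand (a toy 1-dimensional stand-in is used here instead of real DeepWalk output), and the adjacency structure is given as a 0/1 matrix; the function names are illustrative.

```python
import numpy as np

def preferred_neighbors(vectors, adjacency, v):
    """Preferred neighbor set of vertex v (steps S2112-S2114): the
    neighbors whose Euclidean distance to v in DeepWalk space exceeds
    the arithmetic mean of all of v's neighbor distances."""
    neighbors = [u for u in range(len(adjacency)) if adjacency[v][u]]
    dists = {u: np.linalg.norm(vectors[v] - vectors[u]) for u in neighbors}
    mean_d = sum(dists.values()) / len(neighbors)
    return {u for u, d in dists.items() if d > mean_d}

def all_preferred_neighbors(vectors, adjacency):
    """Step S2115: repeat for every vertex of the network."""
    return {v: preferred_neighbors(vectors, adjacency, v)
            for v in range(len(adjacency))}
```

For vertex 0 with neighbor distances 1, 2 and 3, the mean is 2, so only the neighbor at distance 3 enters the preferred set.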
According to the above neighbor selection algorithm, the meaning of the feature information matrix becomes clearer: it has N rows and D columns, where N is the total number of vertices of the protein interaction network and D is the number of features per vertex. After the Deepwalk algorithm, each vertex corresponds to a 64-dimensional vector; therefore, each element of the feature information matrix is in essence a 64-dimensional vector.
Further as a preferred embodiment, the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network is followed by an optimization step, the optimization step comprising:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
Since N is set as the initial value of D when the feature information matrix is established, the order of the feature information matrix is not necessarily the most reasonable, and the first-order and second-order estimations of the protein interaction network obtained from it are likewise not necessarily optimal, so that the final dimension reduction matrix fed to the clustering algorithm is not optimal either. To obtain an optimal dimension reduction matrix, the order of the feature information matrix is adjusted dynamically, changing the first-order and second-order estimations of the protein interaction network; when the graph Laplacian regularization term loss function computed from them reaches its minimum, the corresponding combination of first-order and second-order estimation is optimal, and this combination is taken, respectively, as the local structure information and the overall structure information of the protein interaction network, from which the dimension reduction matrix is further obtained.
Further, as a preferred embodiment, the graph Laplacian regularization term loss function is calculated as follows:

L = L_first + λL_second

wherein L is the graph Laplacian regularization term loss function, L_first is the monitored loss of the first-order estimation, L_second is the monitored loss of the second-order estimation, and λ is a balance factor between L_first and L_second; λ is a parameter whose value can be chosen when the algorithm is actually run.
Further, as a preferred embodiment, the monitored loss of the first-order estimation is calculated according to the following formula:

L_first = Σ_{(v_i, v_j) ∈ E} ||y_i − y_j||²

where v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j. Preferably, y_i is constructed as a matrix whose elements are the Deepwalk vectors corresponding to v_i and all of the preferred neighbors of v_i; the matrix y_j is constructed in the same way. Since the number of neighbors may differ from vertex to vertex, the orders of y_i and y_j may differ; zero elements are used to fill the smaller matrix so that both matrices have the same size for the calculation. When filling a smaller matrix with zero elements, the following method is preferably used: if the order of y_i is smaller than that of y_j, y_i is padded with zero elements into a new matrix whose order equals that of y_j, with y_i placed at the upper left corner of the new matrix.
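The zero-padding and summed squared difference described above can be sketched as follows; the function names `pad_to` and `first_order_loss` are illustrative, and the per-edge term follows the formula ||y_i − y_j||².

```python
import numpy as np

def pad_to(m, shape):
    """Zero-pad matrix m into the upper-left corner of a `shape` matrix,
    as in the preferred filling method described above."""
    out = np.zeros(shape)
    out[:m.shape[0], :m.shape[1]] = m
    return out

def first_order_loss(edges, features):
    """L_first = sum over edges (i, j) of ||y_i - y_j||^2, where
    features[v] stacks the DeepWalk vectors of v and its preferred
    neighbors; the smaller matrix is zero-padded to a common size."""
    total = 0.0
    for i, j in edges:
        yi, yj = features[i], features[j]
        shape = (max(yi.shape[0], yj.shape[0]), max(yi.shape[1], yj.shape[1]))
        total += np.sum((pad_to(yi, shape) - pad_to(yj, shape)) ** 2)
    return total
```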
The monitored loss of the second-order estimation is calculated according to the following formula:

L_second = Σ_{l=0}^{L0−1} ||H^(l+1) − H^(l)||²

where L0 is the number of convolution layers of the graph convolutional neural network, H^(0) = N × D, and H^(l+1) is the output of the (l+1)-th convolution layer computed from H^(l).

Here, the method of filling with zero elements is likewise used, so that the sizes of H^(l+1) and H^(l) are the same.
With the above method, the corresponding combination of first-order estimation and second-order estimation is optimal when the graph Laplacian regularization term loss function L takes its minimum value.
Further as a preferred embodiment, the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network is followed by an optimization step, the optimization step comprising:
dynamically adjusting α and β so that, in the goal programming equation system shown in the figure, Z is equal to 0 or maximally close to 0;

in the formula, d1⁻ is a negative deviation variable of the first target, d1⁺ is a positive deviation variable of the first target, d2⁻ is a negative deviation variable of the second target, and d2⁺ is a positive deviation variable of the second target; X is the feature information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, α is a matrix whose number of columns equals the preferred maximum value of D, and β equals the preferred minimum value of D;
and respectively taking the first-order estimation and the second-order estimation which are calculated according to the corresponding characteristic information matrix when the Z is equal to 0 or is close to 0 to the maximum extent as the local structure information and the overall structure information of the protein interaction network required to be obtained.
The above method is another implementation of the optimization step. Mathematically, achieving optimization by finding the minimum of the graph Laplacian regularization term loss function is in fact a matrix dimensionality reduction problem, which, as a preferred embodiment, can be handled with the conventional singular value decomposition (SVD) method. According to the SVD theorem, the N×D feature information matrix X can be rewritten as U×S×V, where U is an orthogonal matrix of size N×N; S is a diagonal matrix of size N×D whose diagonal entries are the singular values of X; and V, of size D×D, is the conjugate transpose of the right singular matrix. If the smallest singular values within some highest percentage P are set to 0, an approximation matrix X′ of X is obtained. This ultimately reduces the value of D; however, because the reconstruction error of X → X′ needs to be minimized, the value of 1 − P must be maximized. Since the SVD multiplication gives, in effect, X′ = (1 − P)X with X an N×D matrix, the problem of finding the minimum of the graph Laplacian regularization term loss function can be converted into the goal programming problem over the deviation variables described in the optimization step above.
the dynamic adjustment of alpha means that alpha is initially preferably selected as an N × N matrix, namely the characteristic information matrix itself, and is adjusted, namely, the alpha is reduced step by step, for example, the rightmost column is deleted to be an N × (N-1) matrix, and then the matrix is substituted into an equation set for calculation; the rightmost column is deleted again to become a matrix of N (N-2), and then the matrix is substituted into the equation set for calculation, and the like.
In this equation system, the positive and negative deviation variables are given equal importance, i.e., each deviation variable has weight 1. Clearly, when Z equals 0, a Pareto optimal solution is obtained. In some cases, however, Z cannot be exactly 0; Z is then required to be as close to 0 as possible within its range. Therefore, α and β are updated continuously until a combination is found that makes Z equal or close to 0. The feature information matrix corresponding to this combination of α and β is optimal, and the first-order and second-order estimations calculated from the optimal feature information matrix optimize the dimension reduction matrix and thus the clustering effect.
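The SVD-based truncation described above can be sketched with NumPy. Interpreting "the highest percentage P of singular values" as zeroing the smallest P share of the singular values is an assumption here, and the function name is illustrative.

```python
import numpy as np

def truncate_singular_values(X, p):
    """Zero out the smallest share p of the singular values of X and
    rebuild the approximation X'. The truncation rule is an assumed
    reading of 'the highest percentage P of singular values'."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int(np.ceil(len(s) * (1 - p)))  # number of singular values kept
    s_trunc = s.copy()
    s_trunc[k:] = 0.0
    return (U * s_trunc) @ Vt  # scales the columns of U by s_trunc
```

Because a rank-2 matrix has a vanishing third singular value, truncating half of the singular values of such a matrix reproduces it almost exactly, while a harsher truncation reduces the rank.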
Example 2
In this example, based on three sets of PPI datasets, the protein complex detection method based on the semi-supervised network embedding model described in Example 1 was combined with existing clustering methods for experiments, and the experimental results were compared with those of applying the existing state-of-the-art clustering methods in the conventional way, in order to demonstrate the performance of the method described in Example 1. The experiments were run on a desktop computer configured with an i7 dual-core CPU at 4.00 GHz, 16 GB of memory and a GTX 1070 graphics card. The whole run over the three datasets can be completed within one day. Furthermore, since PPI data clustering is typically a one-off process in practice, runtime improvement and time complexity analysis are not a concern of this study; cluster quality matters more.
Three of the latest S. cerevisiae PPI datasets were used, namely the Krogan, Dip and Biogrid datasets. The Krogan and Dip datasets were used to evaluate the operation of several clustering algorithms. As shown in Table 1, the Krogan and Dip datasets have similar average degree and density, while the Biogrid dataset has a higher average degree and density than both. Since PPI data can be represented by an undirected graph G = (V, E), the average degree can be calculated as

degree_avg = 2|E| / |V|

and the density can be calculated as

density = 2|E| / (|V| × (|V| − 1))

The characteristics of the three PPI datasets are shown in Table 1.
PPI data have a high false positive rate, estimated at around 50%. The noise in the data interferes with clustering methods that detect protein complexes from PPI data. CYC2008 is therefore used as the reference dataset. CYC2008 provides a manually curated catalog of 408 protein complexes in Saccharomyces cerevisiae, 90% more than another popular dataset, MIPS.
TABLE 1

Data set   Vertices   Edges    Average degree   Density
Krogan     5364       61289    22.85            0.0043
Dip        4972       17836    7.17             0.0014
Biogrid    6242       255510   81.87            0.013
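The average degree and density formulas can be checked directly against the Table 1 values:

```python
def average_degree(num_vertices, num_edges):
    # For an undirected graph G = (V, E): 2|E| / |V|.
    return 2 * num_edges / num_vertices

def density(num_vertices, num_edges):
    # Fraction of possible edges present: 2|E| / (|V| (|V| - 1)).
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

# Reproducing the Krogan row of Table 1:
print(round(average_degree(5364, 61289), 2))  # 22.85
print(round(density(5364, 61289), 4))         # 0.0043
```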
The neighbor affinity score is used to judge whether a protein complex detected by an algorithm matches a protein complex in CYC2008, and it is then used to calculate precision, recall and F-measure to evaluate the performance of the algorithm. The neighbor affinity score NA(p, b) is defined as follows:

NA(p, b) = |V_p ∩ V_b|² / (|V_p| × |V_b|)

here, p = (V_p, E_p) is a predicted protein complex and b = (V_b, E_b) is a reference protein complex. The precision can then be calculated as follows:
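A minimal sketch of the neighbor affinity score on vertex sets (the edge sets are not needed for the formula); the function name is illustrative.

```python
def neighbor_affinity(vp, vb):
    """NA(p, b) = |Vp ∩ Vb|^2 / (|Vp| * |Vb|), computed on the vertex
    sets of a predicted complex p and a reference complex b."""
    inter = len(set(vp) & set(vb))
    return inter ** 2 / (len(vp) * len(vb))

# Identical complexes score 1; disjoint complexes score 0.
print(neighbor_affinity({"a", "b"}, {"a", "b"}))  # 1.0
```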
precision = N_cp / |P|

wherein P here denotes the set of predicted complexes, B the set of reference complexes, and N_cp = |{p | p ∈ P, ∃b ∈ B, NA(p, b) ≥ ω}| is the number of predicted complexes that match at least one reference complex.

The recall is calculated as follows:

recall = N_cb / |B|

wherein N_cb = |{b | b ∈ B, ∃p ∈ P, NA(p, b) ≥ ω}| is the number of reference complexes matched by at least one predicted complex.

The F-measure is the harmonic mean of precision and recall and is calculated as follows:

F = 2 × precision × recall / (precision + recall)

ω is a threshold indicating whether a predicted protein complex is confirmed as matching a protein complex in the reference dataset. In the experiments, the neighbor affinity score threshold was set to 0.25, which is sufficient to differentiate the performance of the model from that of the other algorithms.
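The precision, recall and F-measure computation can be sketched as follows, with ω defaulting to the 0.25 used in the experiments. `predicted` and `reference` are lists of vertex sets, and the function and helper names are illustrative.

```python
def precision_recall_f(predicted, reference, omega=0.25):
    """Precision, recall and F-measure based on the neighbor affinity
    score with threshold omega."""
    def na(vp, vb):
        inter = len(vp & vb)
        return inter ** 2 / (len(vp) * len(vb))

    ncp = sum(1 for p in predicted if any(na(p, b) >= omega for b in reference))
    ncb = sum(1 for b in reference if any(na(p, b) >= omega for p in predicted))
    precision = ncp / len(predicted)
    recall = ncb / len(reference)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```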
In addition, three metrics, namely the fraction score (Frac), the maximum matching ratio (MMR) and the geometric accuracy (Acc), were also used to measure the quality of protein complex clustering. Frac measures the fraction of complexes in one set matched by a complex in the other set with overlap score greater than θ = 0.25, and Frac(θ) is calculated as follows:

Frac(θ) = |{b | b ∈ B, ∃a ∈ A, OS(a, b) > θ}| / |B|

where A and B are two sets of protein complexes and OS(a, b) is the overlap score between complexes a and b.
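A sketch of Frac, under the assumption that it counts the reference complexes matched by at least one predicted complex with overlap score above θ; the function names are illustrative.

```python
def overlap(a, b):
    # Overlap score between two complexes given as vertex sets a and b.
    inter = len(a & b)
    return inter ** 2 / (len(a) * len(b))

def frac(predicted, reference, theta=0.25):
    """Fraction of reference complexes matched by some predicted
    complex with overlap score above theta."""
    matched = sum(1 for b in reference
                  if any(overlap(a, b) > theta for a in predicted))
    return matched / len(reference)
```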
Acc is the geometric mean of two other metrics, the clustering sensitivity (Sn) and the clustering positive predictive value (PPV). Sn and PPV are calculated as follows:

Sn = Σ_i max_j t_ij / Σ_i n_i

PPV = Σ_j max_i t_ij / Σ_j Σ_i t_ij

where n_i is the number of proteins in the i-th reference protein complex and the element t_ij represents the number of proteins shared by the i-th reference complex and the j-th clustered complex. Since Sn can be inflated by putting all proteins into one and the same complex, while PPV can be maximized by putting each protein into its own complex, the geometric mean of the two measures is used:

Acc = sqrt(Sn × PPV)
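Sn, PPV and Acc can be sketched from the contingency matrix t. Using row maxima over reference sizes for Sn and column maxima over column sums for PPV follows the standard definitions and is assumed to match the formulas above; the function name is illustrative.

```python
import math

def sn_ppv_acc(t, n_sizes):
    """Clustering sensitivity, positive predictive value and geometric
    accuracy from the contingency matrix t (t[i][j] = proteins shared by
    reference complex i and predicted complex j) and the reference
    complex sizes n_sizes[i]."""
    sn = sum(max(row) for row in t) / sum(n_sizes)
    col_sums = [sum(row[j] for row in t) for j in range(len(t[0]))]
    col_maxes = [max(row[j] for row in t) for j in range(len(t[0]))]
    ppv = sum(col_maxes) / sum(col_sums)
    return sn, ppv, math.sqrt(sn * ppv)
```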
MMR treats the two sets of protein complexes as a bipartite graph, in which the two sets of nodes represent the reference complexes and the predicted complexes, respectively, and the edge joining a reference complex and a predicted complex is weighted by their overlap score, calculated with the overlap formula

OS(A, B) = |V_A ∩ V_B|² / (|V_A| × |V_B|)

The value of MMR is the total weight of the maximum-weight subset of edges in which no two edges share a node (a maximum weight matching), divided by the number of reference protein complexes.
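A brute-force sketch of MMR for toy inputs; a real implementation would call a maximum-weight bipartite matching routine instead of enumerating assignments, and the function names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    # Overlap score between two complexes given as vertex sets.
    inter = len(a & b)
    return inter ** 2 / (len(a) * len(b))

def mmr(predicted, reference):
    """Maximum matching ratio: best total overlap weight over all
    injective assignments between the two complex sets, divided by the
    number of reference complexes."""
    n_ref = len(reference)
    best = 0.0
    if len(predicted) >= n_ref:
        for perm in permutations(range(len(predicted)), n_ref):
            w = sum(overlap(predicted[p], reference[i]) for i, p in enumerate(perm))
            best = max(best, w)
    else:
        for perm in permutations(range(n_ref), len(predicted)):
            w = sum(overlap(predicted[j], reference[i]) for j, i in enumerate(perm))
            best = max(best, w)
    return best / n_ref
```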
According to the research, COACH is so far the most stable and representative clustering algorithm for PPI interaction networks, and it is therefore used as the cluster analysis method for evaluating the model. Two state-of-the-art network vector models, Deepwalk and SDNE, were used to compare the performance of the models. To evaluate the robustness of the model, two traditional clustering algorithms of different types, K-means and DBSCAN, were selected for comparison. For COACH, its three key parameters, namely density, affinity and proximity, are set to 0.7, 0.2 and 0.5 respectively, which, as the experimental analysis shows, is sufficient for stable calculation with all network vector algorithms. For K-means and DBSCAN, only their default settings are used.
Because SDNE, which was originally designed for social networks, also requires a first-order estimation, three versions of SDNE are used: SDNE-NA, with no features per vertex; SDNE-ALL, with all neighbors as the features of each vertex; and SDNE-SN, with selected neighbors as the features of each vertex. SDNE-SN performs neighbor selection using the neighbor selection algorithm disclosed in Example 1.
The test results of the Krogan, Dip and Biogrid datasets are shown in fig. 5, 6 and 7, respectively.
From the results, the model was superior to the other models on all three datasets in precision, recall and F-measure. In particular, for the high-density Biogrid dataset, the model achieved an F-value at least 90% higher than the second-ranked model. For the Dip dataset, the model achieved the highest F-value of 0.528, approximately 20% higher than the COACH-only algorithm, 9.5% higher than the second-ranked COACH+SDNE-SN algorithm, and 17% higher than the COACH+Deepwalk algorithm. Similar results were found on the Krogan dataset. These results demonstrate that the model is better suited to complex, high-density networks than the other models.
Furthermore, SDNE-SN was found to outperform SDNE-NA and SDNE-ALL on all three datasets. Since SDNE-SN calculates its first-order estimation based on the neighbor selection algorithm disclosed in Example 1, this result indirectly demonstrates the effectiveness of the model.
As for the K-means and DBSCAN clustering algorithms, both performed poorly in the tests. Regardless of which network vector algorithm they were combined with, the experimental results were poor, which indicates that these two algorithms are not suitable for PPI interaction networks.
The clustering quality of each model is compared below. Based on the test results of the previous section, only three representative models were selected for comparison with the method of the invention, namely COACH, COACH+Deepwalk and COACH+SDNE-SN. Table 2 shows the number of protein complexes detected by the different models. It can be seen from the table that the model detects more protein complexes than the other models (with the exception of COACH+SDNE-SN on the Dip dataset); with this quantitative basis, it is easier to improve the quality of clustering.
TABLE 2

Data set   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Krogan     610                               570     570              580
Dip        808                               748     750              840
Biogrid    3470                              3158    3160             3267
Tables 3, 4 and 5 show the cluster quality comparisons for the Krogan, Dip and Biogrid datasets, respectively. As can be seen from Table 3, the model achieves better clustering quality, about 38% higher than the second-ranked COACH+SDNE-SN on both MMR and Frac and about 25% higher on the Acc term. The situation for the Dip dataset is roughly similar.
As for the Biogrid dataset, the clustering quality of all models decreases because of the high density of the network. However, the model is still superior to the others; for example, its Acc reaches 0.69, well above the second-ranked COACH+SDNE-SN.
TABLE 3

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.61                              0.35    0.4              0.44
Acc      0.68                              0.46    0.48             0.54
MMR      0.5                               0.19    0.25             0.36
TABLE 4

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.81                              0.61    0.62             0.64
Acc      0.68                              0.58    0.6              0.63
MMR      0.75                              0.36    0.4              0.48
TABLE 5

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.35                              0.14    0.2              0.24
Acc      0.69                              0.39    0.4              0.45
MMR      0.28                              0.05    0.14             0.22
Compared with other network vector methods, an algorithm is designed that selects key neighbors as the features of each vertex in order to calculate its first-order estimation. In addition, a three-layer GCN is designed to deeply learn the structure of the PPI interaction network so as to preserve its second-order estimation.
Extensive experiments on various PPI interaction networks show that the model is stable and outperforms other state-of-the-art models on all metrics. In future work, it is planned to use recurrent neural networks to integrate data from the biomedical literature into PPI interaction networks to further improve the quality of protein complex detection.
Example 3
The invention discloses a protein complex detection device based on a semi-supervised network embedded model, which comprises the following components as shown in figure 8:
a memory for storing at least one program;
a processor for loading the at least one program to perform the semi-supervised network embedding model based protein complex detection method of embodiments 1 and 2.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. The protein complex detection method based on the semi-supervised network embedding model is characterized by comprising the following steps of:
obtaining an adjacency matrix of a protein interaction network;
embedding the adjacency matrix to obtain a dimension reduction matrix;
processing the dimension reduction matrix by using a clustering algorithm to obtain a protein complex detection result;
the step of embedding the adjacency matrix to obtain the dimension reduction matrix specifically comprises:
calculating first-order estimation between any two points in the protein interaction network so as to obtain local structure information of the protein interaction network;
calculating second-order estimation between any two points in the protein interaction network so as to obtain overall structure information of the protein interaction network;
storing the local structure information and the overall structure information into the adjacency matrix, thereby obtaining the dimension reduction matrix;
the step of calculating a first-order estimate between any two points in the protein interaction network to obtain local structural information of the protein interaction network specifically comprises:
selecting a preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm;
respectively endowing each vertex with characteristic information according to the preferred neighbor point set of each vertex, thereby establishing a characteristic information matrix;
calculating first-order estimation between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimation between any two points in the protein interaction network as the local structural information of the protein interaction network required to be acquired.
2. The method for detecting protein complexes based on semi-supervised network embedding model as claimed in claim 1, wherein the step of calculating the second-order estimation between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network comprises:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
3. The semi-supervised network embedding model-based protein complex detection method according to claim 1 or 2, wherein the step of selecting the preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm specifically comprises:
processing the protein interaction network by using a Deepwalk algorithm so as to obtain a Deepwalk vector of each vertex;
selecting one vertex in the protein interaction network as an object vertex;
respectively calculating the Euclidean distance between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
taking a set consisting of all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as a preferred adjacent point set of the vertex of the object;
returning to the step of selecting one vertex in the protein interaction network as an object vertex, until a preferred neighbor set has been selected for each vertex in the protein interaction network.
4. The semi-supervised network embedding model-based protein complex detection method according to claim 2, wherein the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network is followed by an optimization step, and the optimization step comprises:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
5. The method for detecting protein complexes based on the semi-supervised network embedding model as recited in claim 4, wherein the graph Laplace regularization term loss function is calculated according to the following formula:
L = L_first + λL_second

wherein L is the graph Laplace regularization term loss function, L_first is the monitored loss of the first-order estimation, L_second is the monitored loss of the second-order estimation, and λ is a balance factor between L_first and L_second.
6. The method for detecting protein complexes based on semi-supervised network embedding model as claimed in claim 5, wherein the first order estimation is the monitored loss according to the following formula:
L_first = Σ_{(v_i, v_j) ∈ E} ||y_i − y_j||²

in the formula, v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j;

the second-order estimation monitored loss is calculated according to the following formula:

L_second = Σ_{l=0}^{L0−1} ||H^(l+1) − H^(l)||²

in the formula, L0 is the number of convolution layers of the graph convolutional neural network, H^(0) = N × D, and H^(l+1) is the output of the (l+1)-th convolution layer of the graph convolutional neural network computed from H^(l).
7. The semi-supervised network embedding model-based protein complex detection method according to claim 2, wherein the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network is followed by an optimization step, and the optimization step comprises:
dynamically adjusting α and β so that, in the goal programming equation system shown in the figure, Z is equal to 0 or maximally close to 0;

in the formula, d1⁻ is a negative deviation variable of the first target, d1⁺ is a positive deviation variable of the first target, d2⁻ is a negative deviation variable of the second target, and d2⁺ is a positive deviation variable of the second target; X is the characteristic information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, Z is the output result of inputting the adjacency matrix and the characteristic information matrix into the graph convolutional neural network for processing, α is a matrix whose number of columns equals the preferred maximum value of D, and β equals the preferred minimum value of D;
and respectively taking the first-order estimation and the second-order estimation which are calculated according to the corresponding characteristic information matrix when the Z is equal to 0 or is close to 0 to the maximum extent as the local structure information and the overall structure information of the protein interaction network required to be obtained.
8. A protein complex detection device based on a semi-supervised network embedding model is characterized by comprising:
a memory for storing at least one program;
a processor for loading the at least one program to perform the semi-supervised network embedding model based protein complex detection method of any one of claims 1 to 7.
CN201711250342.9A 2017-12-01 2017-12-01 Protein complex detection method and device based on semi-supervised network embedded model Active CN108171010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711250342.9A CN108171010B (en) 2017-12-01 2017-12-01 Protein complex detection method and device based on semi-supervised network embedded model


Publications (2)

Publication Number Publication Date
CN108171010A CN108171010A (en) 2018-06-15
CN108171010B true CN108171010B (en) 2021-09-14

Family

ID=62525063

Country Status (1)

Country Link
CN (1) CN108171010B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932402A (en) * 2018-06-27 2018-12-04 华中师范大学 A kind of protein complex recognizing method
CN110796133B (en) * 2018-08-01 2024-05-24 北京京东尚科信息技术有限公司 Text region identification method and device
CN109389151B (en) * 2018-08-30 2022-01-18 华南师范大学 Knowledge graph processing method and device based on semi-supervised embedded representation model
CN110942805A (en) * 2019-12-11 2020-03-31 云南大学 Insulator element prediction system based on semi-supervised deep learning
CN111860768B (en) * 2020-06-16 2023-06-09 中山大学 Method for enhancing point-edge interaction of graph neural network
CN112071362B (en) * 2020-08-03 2024-04-09 西安理工大学 Method for detecting protein complex fusing global and local topological structures

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013049398A2 (en) * 2011-09-28 2013-04-04 H. Lee Moffitt Cancer Center & Research Institute, Inc. Protein-protein interaction as biomarkers
CN103235900A (en) * 2013-03-28 2013-08-07 Sun Yat-sen University Weighted ensemble clustering method for mining protein complexes
CN105138866A (en) * 2015-08-12 2015-12-09 SYSU-CMU Shunde International Joint Research Institute Method for identifying protein functions based on protein-protein interaction network and network topological structure features
CN105930686A (en) * 2016-07-05 2016-09-07 Sichuan University Secondary protein structure prediction method based on deep neural network
CN106021988A (en) * 2016-05-26 2016-10-12 Henan University of Urban Construction Recognition method of protein complexes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003000849A2 (en) * 2001-06-21 2003-01-03 Bioinformatics Dna Codes, Llc Methods for representing sequence-dependent contextual information present in polymer sequences and uses thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LLE manifold embedding dimensionality reduction algorithm (in Chinese); 梦游--; https://blog.csdn.net/zhouguangfei0717/article/details/78604980; 2017-11-22; pp. 1-10 *
Protein-protein interaction network inference from multiple kernels with optimization based on random walk by linear programming; L. Huang, L. Liao and C. H. Wu; 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2015-11-01; pp. 201-207 *
Protein function prediction and optimization based on manifold learning (in Chinese); Liang Huadong; China Masters' Theses Full-text Database, Basic Sciences; 2017-08-15; p. A006-48 *
Network representation learning (DeepWalk, LINE, node2vec, SDNE) (in Chinese); u013527419; https://www.itdaan.com/blog/2017/07/24/ce511d9d6c68917c8a1afabbd66c17ae.html; 2017-07-24; pp. 1-6 *
Self-learning graph clustering for protein complex detection (in English); Zhu Jia, et al.; Control Theory & Applications; 2017-06-30; pp. 776-782 *

Also Published As

Publication number Publication date
CN108171010A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108171010B (en) Protein complex detection method and device based on semi-supervised network embedded model
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN113705772A (en) Model training method, device and equipment and readable storage medium
CN107070867B (en) Network flow abnormity rapid detection method based on multilayer locality sensitive hash table
WO2022105108A1 (en) Network data classification method, apparatus, and device, and readable storage medium
Mall et al. Representative subsets for big data learning using k-NN graphs
Guo et al. Machine learning based feature selection and knowledge reasoning for CBR system under big data
Ghanbari et al. Reconstruction of gene networks using prior knowledge
Zhu et al. Protein complexes detection based on semi-supervised network embedding model
Karrar The effect of using data pre-processing by imputations in handling missing values
CN115424660A (en) Method and device for predicting multi-source information relation by using prediction model
Wu et al. Broad fuzzy cognitive map systems for time series classification
CN115114484A (en) Abnormal event detection method and device, computer equipment and storage medium
EP4009239A1 (en) Method and apparatus with neural architecture search based on hardware performance
Krock et al. Modeling Massive Highly Multivariate Nonstationary Spatial Data with the Basis Graphical Lasso
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Tej et al. Determining optimal neural network architecture using regression methods
CN116467466A (en) Knowledge graph-based code recommendation method, device, equipment and medium
CN116383441A (en) Community detection method, device, computer equipment and storage medium
Xiao Using machine learning for exploratory data analysis and predictive models on large datasets
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
Chatur et al. Effectiveness evaluation of regression models for predictive data-mining
US20240111807A1 (en) Embedding and Analyzing Multivariate Information in Graph Structures
Rodrêguez-Fdez et al. A genetic fuzzy system for large-scale regression
KR102556235B1 (en) Method and apparatus for content based image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220323

Address after: 510000 5548, floor 5, No. 1023, Gaopu Road, Tianhe District, Guangzhou City, Guangdong Province

Patentee after: Guangdong SUCHUANG Data Technology Co.,Ltd.

Address before: 510631 School of computer science, South China Normal University, 55 Zhongshan Avenue West, Tianhe District, Guangzhou City, Guangdong Province

Patentee before: South China Normal University

Patentee before: Guangzhou Fanping Electronic Technology Co., Ltd