CN108171010B - Protein complex detection method and device based on semi-supervised network embedded model - Google Patents

Protein complex detection method and device based on semi-supervised network embedded model

Info

Publication number
CN108171010B
CN108171010B (application CN201711250342.9A)
Authority
CN
China
Prior art keywords
protein interaction
interaction network
vertex
network
matrix
Prior art date
Legal status
Active
Application number
CN201711250342.9A
Other languages
Chinese (zh)
Other versions
CN108171010A (en
Inventor
朱佳
黄昌勤
Current Assignee
Guangdong SUCHUANG Data Technology Co.,Ltd.
Original Assignee
Guangzhou Fanping Electronic Technology Co ltd
South China Normal University
Priority date
Filing date
Publication date
Application filed by Guangzhou Fanping Electronic Technology Co ltd, South China Normal University filed Critical Guangzhou Fanping Electronic Technology Co ltd
Priority to CN201711250342.9A priority Critical patent/CN108171010B/en
Publication of CN108171010A publication Critical patent/CN108171010A/en
Application granted granted Critical
Publication of CN108171010B publication Critical patent/CN108171010B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a protein complex detection method and device based on a semi-supervised network embedded model. The method comprises the steps of obtaining the adjacency matrix of a protein interaction network, embedding the adjacency matrix to obtain a dimension reduction matrix, and processing the dimension reduction matrix with a clustering algorithm to obtain a protein complex detection result. The device comprises a memory for storing at least one program and a processor for loading the at least one program to execute the method. According to the invention, the adjacency matrix corresponding to the protein interaction network undergoes dimension conversion before being processed by the clustering algorithm, so the clustering effect is improved. The method and device are widely applicable in the technical field of protein complex identification.

Description

Protein complex detection method and device based on semi-supervised network embedded model
Technical Field
The invention relates to the technical field of protein complex recognition, in particular to a protein complex detection method and a protein complex detection device based on a semi-supervised network embedded model.
Background
Protein complexes are complex graph structures formed by protein-protein interactions (PPIs), and play a crucial role in biochemical and pharmaceutical processes. Therefore, correctly identifying protein complexes in PPI networks is extremely useful for the biomedical field. However, with the tremendous growth of PPI data, coupled with the bottlenecks of experimental approaches, only a small number of protein complexes have been identified experimentally.
To overcome the technical limitations of experimental methods in protein complex detection, computational methods have been used. A PPI network can be seen as an undirected, unweighted graph in which proteins are vertices and their interactions are edges. Each protein complex is composed of two or more proteins and appears as a densely connected subgraph, meaning that complexes can be found by applying graph-based clustering methods.
Recently, network embedding has been extensively studied and shown to further improve the performance of many graph clustering methods. Network embedding learns low-dimensional representations of the vertices in a network that capture and preserve the network structure. However, most existing network embedding methods rely heavily on the features of each vertex, which makes them unsuitable for PPI networks: in a PPI network, no metadata is associated with a vertex other than the protein name. In other words, conventional network embedding methods cannot fully capture the PPI network structure because there is not enough data to calculate its first- and second-order estimates.
Disclosure of Invention
In order to solve the above-described problems, a first object of the present invention is to provide a method for detecting a protein complex based on a semi-supervised network-embedded model, and a second object of the present invention is to provide a device for detecting a protein complex based on a semi-supervised network-embedded model.
The first technical scheme adopted by the invention is as follows:
the protein complex detection method based on the semi-supervised network embedding model comprises the following steps:
obtaining an adjacency matrix of a protein interaction network;
embedding the adjacency matrix to obtain a dimension reduction matrix;
and processing the dimensionality reduction matrix by using a clustering algorithm to obtain a protein complex detection result.
Further, the step of performing embedding processing on the adjacency matrix to obtain a dimension reduction matrix specifically includes:
calculating first-order estimation between any two points in the protein interaction network so as to obtain local structure information of the protein interaction network;
calculating second-order estimation between any two points in the protein interaction network so as to obtain the overall structure information of the protein interaction network;
and storing the local structure information and the overall structure information into an adjacency matrix, thereby obtaining a dimension reduction matrix.
Further, the step of calculating a first-order estimate between any two points in the protein interaction network to obtain local structural information of the protein interaction network specifically includes:
selecting a preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm;
respectively endowing each vertex with characteristic information according to the preferred neighbor point set of each vertex, thereby establishing a characteristic information matrix;
calculating first-order estimation between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimation between any two points in the protein interaction network as the local structural information of the protein interaction network required to be acquired.
Further, the step of calculating a second-order estimate between any two points in the protein interaction network to obtain overall structural information of the protein interaction network specifically includes:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
Further, the step of selecting a preferred neighbor set for each vertex in the protein interaction network by using a neighbor selection algorithm specifically includes:
processing the protein interaction network by using a Deepwalk algorithm so as to obtain a Deepwalk vector of each vertex;
selecting one vertex in the protein interaction network as an object vertex;
respectively calculating the Euclidean distance between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
taking a set consisting of all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as a preferred adjacent point set of the vertex of the object;
returning to the step of executing one vertex in the selected protein interaction network as an object vertex until a preferred set of neighbors for each vertex in the protein interaction network is selected.
Further, after the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network, an optimization step is provided, wherein the optimization step comprises:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
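This optimization loop (adjust the order of the characteristic information matrix until the combined loss is minimized) can be sketched as follows; `loss_fn` and the `demo` loss are illustrative stand-ins, not the patent's actual graph Laplacian regularization losses.

```python
# Sketch: sweep the column count D of the characteristic information matrix
# and keep the value that minimizes L = L_first + lambda * L_second.
# `loss_fn` is a stand-in that returns (L_first, L_second) for a given D.

def best_order(d_min, d_max, loss_fn, lam):
    """Return the D in [d_min, d_max] minimizing L_first + lam * L_second."""
    best_d, best_loss = None, float("inf")
    for d in range(d_min, d_max + 1):
        l_first, l_second = loss_fn(d)
        total = l_first + lam * l_second
        if total < best_loss:
            best_d, best_loss = d, total
    return best_d, best_loss

def demo(d):
    # Toy stand-in loss, minimized at D = 4.
    return (d - 4) ** 2, abs(d - 4)
```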
Further, the graph Laplacian regularization term loss function is calculated as follows:
L = L_first + λ·L_second
where L is the graph Laplacian regularization term loss function, L_first is the supervised loss of the first-order estimate, L_second is the supervised loss of the second-order estimate, and λ is a balance factor between L_first and L_second.
Further, the supervised loss of the first-order estimate is calculated as follows:
(equation shown as an image in the original)
where v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j;
the second-order estimation monitored loss is calculated according to the following formula:
Figure BDA0001491602620000032
in the formula, L0Convolutional layer for graph convolutional neural networkNumber of layers H(0)=N×D,
Figure BDA0001491602620000033
Further, after the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network, an optimization step is provided, wherein the optimization step comprises:
dynamically adjusting α and β such that Z equals 0 or is as close to 0 as possible in the following system of equations:
(system of equations shown as an image in the original)
where the four deviation variables (shown as images in the original) are, respectively, the negative and positive deviation variables of the first target and the negative and positive deviation variables of the second target; X is the characteristic information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, α is equal to the maximum value that D can take, and β is equal to the minimum value that D can take.
The second technical scheme adopted by the invention is as follows: a semi-supervised network embedding model based protein complex detection apparatus, comprising:
a memory for storing at least one program;
a processor for loading the at least one program to execute the protein complex detection method based on the semi-supervised network embedding model according to the first technical aspect.
The invention has the following beneficial effects: by embedding the protein interaction network and performing dimension conversion, the protein complex detection method and device improve the efficiency with which conventional clustering algorithms operate on the protein interaction network, optimize the clustering effect, and make the protein complex detection result more accurate. Meanwhile, the invention assigns features to each vertex of the protein interaction network and captures both the local structure and the overall structure of the network, so it does not require each vertex to come with its own features, overcoming the technical defect that clustering algorithms cannot directly process a protein interaction network whose vertices have no features. The invention operates stably, and each prediction evaluation index is superior to other protein complex detection methods.
Drawings
FIG. 1 is a flowchart of a method for detecting a protein complex according to the present invention;
FIG. 2 is a detailed flowchart of step S2;
FIG. 3 is a detailed flowchart of step S21;
FIG. 4 is a detailed flowchart of step S211;
FIG. 5 is a comparison of Krogan datasets;
FIG. 6 is a comparison of the Dip datasets;
FIG. 7 is a comparison of the Biogrid data sets;
FIG. 8 is a schematic diagram of a protein complex detection apparatus according to the present invention.
Detailed Description
Example 1
The invention discloses a protein complex detection method based on a semi-supervised network embedding model, which comprises the following steps as shown in figure 1:
s1, acquiring an adjacency matrix of a protein interaction network;
s2, embedding the adjacent matrix to obtain a dimension reduction matrix;
and S3, processing the dimensionality reduction matrix by using a clustering algorithm to obtain a protein complex detection result.
In conventional protein complex detection methods, the protein interaction network is represented as an undirected graph G = (V, E): proteins are the vertices V of the graph, their interactions are the edges E, and the edges carry no weights. Protein interaction networks can be obtained from existing datasets such as Krogan, Dip, and Biogrid. From graph theory, a protein interaction network corresponds to an adjacency matrix; processing the adjacency matrix with a clustering algorithm such as COACH or K-means yields a protein complex detection result, i.e., the output shows which proteins belong to one class, namely one complex. The protein complex detection method based on the semi-supervised network embedding model embeds the adjacency matrix to obtain a dimension reduction matrix through dimension conversion, and then performs protein complex detection on the dimension reduction matrix with a known clustering algorithm, which improves the operational efficiency of the clustering algorithm. Since the invention performs protein complex detection on the interaction network corresponding to protein interactions, i.e., a mathematical graph, the examples do not distinguish between the concepts of protein interaction, PPI, protein interaction network, and the graph corresponding to the protein interaction network unless otherwise specified.
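The S1-S3 pipeline above can be sketched as follows; the toy edge list and the trivial connected-component grouping are illustrative stand-ins for a real PPI dataset and for a clustering algorithm such as COACH or K-means, not the patent's actual embedding step.

```python
# S1: build the adjacency matrix of an undirected, unweighted PPI graph.
def adjacency_matrix(n, edges):
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1
    return A

# Stand-in for S3: group vertices by connectivity (a trivial "clustering";
# a real implementation would cluster the dimension reduction matrix).
def group_by_connectivity(A):
    n, seen, groups = len(A), set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, group = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            group.append(v)
            stack.extend(j for j in range(n) if A[v][j])
        groups.append(sorted(group))
    return groups

edges = [(0, 1), (1, 2), (0, 2), (3, 4)]   # two densely connected groups
A = adjacency_matrix(5, edges)
```

Here `group_by_connectivity(A)` returns `[[0, 1, 2], [3, 4]]`, the two densely connected groups in the toy graph.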
Further as a preferred embodiment, the step of performing embedding processing on the adjacency matrix to obtain the dimension reduction matrix, that is, step S2, as shown in fig. 2, specifically includes:
s21, calculating first-order estimation between any two points in the protein interaction network to obtain local structure information of the protein interaction network;
s22, calculating second-order estimation between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network;
s23, storing the local structure information and the overall structure information into an adjacent matrix, thereby obtaining a dimension reduction matrix.
The first-order estimate (First-order proximity) describes pairwise similarity between vertices. For any pair of vertices v_i and v_j in the protein interaction network, if there is an edge between v_i and v_j, the first-order estimate between them is positive; otherwise, it is 0. The first-order estimate reflects the local structure of the protein interaction network.
The second-order estimate (Second-order proximity) describes pairwise similarity between vertex neighborhood structures. Suppose N_i and N_j denote the neighbor sets of v_i and v_j; the second-order estimate is then determined by the similarity of N_i and N_j. If two vertices share many common neighbors, the second-order estimate between them will be high. Second-order estimation has proven to be a good metric for defining the similarity of a pair of vertices even when they are not connected by an edge, so it greatly enriches the vertex relationships. The second-order estimate reflects the overall structure of the protein interaction network.
The concepts of first- and second-order estimates were first proposed in the LINE model. Assuming u is a vertex in G = (V, E), the first-order estimates between u and all other vertices in G can be expressed as N_u = {s_{u,1}, s_{u,2}, …, s_{u,|V|}}, where s_{i,j} represents the weight of the edge between vertex i and vertex j in G. If there is no edge between vertex i and vertex j, s_{i,j} = 0; if there is an edge and G is not a weighted graph, s_{i,j} = 1; if G is a weighted graph, s_{i,j} > 0. In the same way, the first-order estimates between vertex v and all other vertices in G can be expressed as N_v = {s_{v,1}, s_{v,2}, …, s_{v,|V|}}. From this, first-order estimates between all pairs of vertices in G can be calculated. Taking vertices u and v as an example, the second-order estimate can be obtained by calculating the similarity between N_u and N_v. It can be seen that calculating the first- and second-order estimates first requires the edge weights of the graph, but PPI networks are characterized by having no features to distinguish vertices other than the protein names, i.e., each vertex lacks the features needed to weight the edges.
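A minimal sketch of these definitions for an unweighted graph; cosine similarity of the neighbor vectors N_u and N_v is used here as one possible second-order measure (the patent itself derives second-order estimates via a GCN, so this is illustrative only).

```python
import math

# Rows of the adjacency matrix play the role of N_u = {s_u1, ..., s_u|V|}.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

def first_order(A, u, v):
    """Positive iff u and v share an edge; 0 or 1 in an unweighted graph."""
    return A[u][v]

def second_order(A, u, v):
    """Cosine similarity of N_u and N_v: high when many neighbors are shared."""
    nu, nv = A[u], A[v]
    dot = sum(a * b for a, b in zip(nu, nv))
    norm = math.sqrt(sum(a * a for a in nu) * sum(b * b for b in nv))
    return dot / norm if norm else 0.0
```

Vertices 0 and 1 are not connected (first-order estimate 0), yet they share all their neighbors, so their second-order estimate is 1.0, illustrating how second-order estimation relates vertices that have no edge between them.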
Because the invention uses the interaction network corresponding to protein interactions to detect protein complexes, i.e., it focuses on the protein interaction network as a whole, this embodiment does not distinguish between "the first-order estimate between any two points in the protein interaction network" and "the first-order estimate of the protein interaction network", and likewise does not distinguish between "the second-order estimate between any two points" and "the second-order estimate of the protein interaction network", unless otherwise specified.
After the first-order and second-order estimates are obtained, they can be combined with the adjacency matrix, i.e., the local structure information corresponding to the first-order estimates and the overall structure information corresponding to the second-order estimates are stored into the adjacency matrix, yielding the dimension reduction matrix. Since combining the first- and second-order estimates with the adjacency matrix is prior art, it is not described in detail here.
Because each vertex in the protein interaction network has no other features than the corresponding protein name, in order to compute a first order estimate of the protein interaction network, i.e., between any two vertices in the protein interaction network, it is necessary to assign a set of features to each vertex. In view of the definition of the protein complex, important neighbors of each vertex can be set as features because these neighbors have a higher probability to be combined together as a protein complex. The important neighbors are parts of neighbors screened from all neighbors of a vertex through a certain algorithm.
Further as a preferred embodiment, the step of calculating a first-order estimate between any two points in the protein interaction network to obtain the local structure information of the protein interaction network, i.e. step S21, as shown in fig. 3, specifically includes:
S211, selecting a preferred neighbor set for each vertex in the protein interaction network by using a neighbor selection algorithm;
S212, assigning characteristic information to each vertex according to its preferred neighbor set, thereby establishing a characteristic information matrix;
S213, calculating the first-order estimate between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimates between any two points in the protein interaction network as the local structure information of the protein interaction network to be acquired.
Each vertex in the protein interaction network has a preferred neighbor set, although the preferred neighbor set of certain vertices may be empty. For a vertex in the protein interaction network, its preferred neighbor set is the set of eligible neighbors screened from all its neighbors. Feature information is assigned to each vertex using its preferred neighbor set: if the preferred neighbor set of vertex v_i includes vertices x, y, and z, then x, y, and z are the features assigned to v_i. Each vertex is characterized in this way, and these features then serve as the basis for calculating the edge weights used to compute the first-order estimates.
Since each vertex has the assigned Feature information, a Feature information matrix (Feature matrix) of the protein interaction network is obtained, which is an N × D order matrix, where N is the total number of vertices of the protein interaction network and D is the number of features per vertex. Because the preferred set of neighbors for each vertex is different, i.e., the features of each vertex are different, the number of features per vertex is also different.
For example, in a protein interaction network with N vertices, the maximum number of features a vertex may have is N, so the maximum order of the corresponding feature information matrix is N×N. If the number of features of a vertex is less than N, the vertex's row in the feature information matrix has fewer than N columns and can be completed to N columns with a padding algorithm, preferably by zero-padding on the right. When using the feature information matrix, it is sometimes necessary to reduce its scale, i.e., keep the number of rows unchanged and reduce the number of columns. In this case D can be treated as a variable: the maximum value of D may be set to the feature count of the vertex with the most features (or directly to N), and the minimum value of D to the feature count of the vertex with the fewest features. For example, when the maximum value of D is N, the N×D feature information matrix may be reduced to order N×(D−1), N×(D−2), and so on; preferably, when reducing the order, the rightmost columns are deleted and the leftmost columns retained.
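The padding just described can be sketched as follows; the `preferred` mapping and the 1-based neighbor indices are illustrative (1-based so that 0 can serve as the padding value).

```python
def feature_matrix(preferred, n_cols):
    """Build an N x D feature matrix, zero-padded on the right.

    `preferred` maps vertex -> list of preferred-neighbor indices; rows with
    fewer than n_cols features are completed with zeros, and rows with more
    are truncated from the right (keeping the leftmost columns).
    """
    rows = []
    for v in sorted(preferred):
        feats = preferred[v][:n_cols]
        rows.append(feats + [0] * (n_cols - len(feats)))
    return rows

preferred = {0: [2, 3], 1: [3], 2: [1, 3, 4]}
X = feature_matrix(preferred, 3)
```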
According to the characteristic information matrix, a first-order estimation between any two points in the protein interaction network can be calculated. There are various methods for calculating the first order estimate based on the feature information matrix, and a cosine similarity calculation method may be preferably used, which is not described herein since it belongs to the prior art.
Further as a preferred embodiment, the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network specifically includes:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
The second-order estimate represents the degree of similarity of a pair of vertex neighborhood structures. Thus, to model the second-order estimate, the neighborhood of each vertex is modeled first. A graph G = (V, E) with n vertices corresponds to an adjacency matrix M consisting of n row vectors m_1, m_2, …, m_n. For row vector m_i, m_{i,j} > 0 if and only if v_i and v_j are connected by an edge.
m_i describes the neighborhood structure of vertex v_i, and M provides the neighborhood-structure information of every vertex. Therefore, a GCN can be designed based on an auto-encoder to preserve the second-order estimates of G.
An auto-encoder-based Graph Convolutional neural Network (GCN) applies latent variables and can learn interpretable latent representations of undirected, unweighted graphs, which is well suited to protein interaction networks. Using the features of each vertex as part of the GCN's input data, the representation learned from the original graph is obtained after encoding by the convolutional layers. For the decoding part, an inner product decoder can simply be used. A protein interaction network is an undirected, unweighted graph G = (V, E) with N = |V| vertices. The adjacency matrix A of G and the N×D feature information matrix X are taken as input. Using latent variables, an output matrix Z of order N×F can be obtained, where F is the number of output features and D is the number of features per vertex. The second-order estimates of the protein interaction network to be obtained, i.e., the second-order estimates between all pairs of vertices, can then be derived from the output of the GCN. Since the method of deriving the second-order estimates from the GCN output belongs to the prior art, it is not described here.
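The inner product decoder mentioned above can be sketched as follows; `decode_edge` is an assumed helper name, and the toy 2-dimensional embeddings are illustrative values, not the patent's.

```python
import math

def decode_edge(Z, i, j):
    """Inner product decoder: edge probability = sigmoid(z_i . z_j)."""
    dot = sum(a * b for a, b in zip(Z[i], Z[j]))
    return 1.0 / (1.0 + math.exp(-dot))

Z = [[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]]   # toy N x F output matrix
p_close = decode_edge(Z, 0, 1)   # similar embeddings -> probability near 1
p_far = decode_edge(Z, 0, 2)     # dissimilar embeddings -> probability near 0
```

The decoder is symmetric in i and j, so the reconstructed adjacency is symmetric, matching the undirected graph assumption.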
Since the features of each vertex are generated from its selected neighbors, the number of features differs from vertex to vertex. Therefore, N is set as the initial value of D; when building the feature information matrix X, if a vertex lacks a given feature, the corresponding value is set to 0. Each network layer of the graph convolutional neural network can then be written as the nonlinear function:
H^(l+1) = f(H^(l), A),
where H^(0) = X and the final layer's output H^(L) = Z.
The transmission rule is:
f(H^(l), A) = relu(A·H^(l)·W^(l)),
where W^(l) is the weight matrix of the l-th network layer and relu is the activation function. Note that multiplying by A aggregates the features of all of a vertex's neighbors but not of the vertex itself, so an identity matrix I needs to be added to A. The transmission rule then becomes:
f(H^(l), Â) = relu(D̂^(−1/2)·Â·D̂^(−1/2)·H^(l)·W^(l)),
where Â = A + I and D̂ is the diagonal degree matrix of Â.
Let L = 3, meaning the graph convolutional neural network uses three convolutional layers to reconstruct the structure of A and obtain Z. Given that each layer is set to retain half of the features of the previous layer, the output Z = H^(3) is obtained after the three layers.
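One propagation step under this transmission rule can be sketched in plain Python as follows, assuming the symmetrically normalized form with the identity matrix added to A; the shapes and values are toy examples, not from the patent.

```python
import math

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def gcn_layer(A, H, W):
    """relu(D^-1/2 (A + I) D^-1/2 H W) for one graph convolutional layer."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]                 # degrees of A + I
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, x) for x in row] for row in Z]  # relu

H1 = gcn_layer([[0, 1], [1, 0]],        # adjacency matrix A
               [[1.0], [2.0]],          # feature matrix H^(0)
               [[1.0]])                 # weight matrix W^(0)
```

Adding I before normalizing means each vertex's own features are mixed with its neighbors' at every layer, as the text explains.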
Further as a preferred embodiment, the neighboring point selection algorithm, that is, step S211, as shown in fig. 4, specifically includes:
s2111, processing the protein interaction network by using a Deepwalk algorithm to obtain a Deepwalk vector of each vertex;
s2112, selecting a vertex in the protein interaction network as an object vertex;
s2113, respectively calculating Euclidean distances between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
s2114, taking a set formed by all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as an optimal adjacent point set of the vertex of the object;
s2115, returning to the step of executing one vertex in the selected protein interaction network as an object vertex until a preferred neighbor set of each vertex in the protein interaction network is selected.
Deepwalk is a method for learning latent node representations: it encodes the social relations of nodes in a continuous vector space and extends language modeling and unsupervised learning from word sequences to graphs. The method treats truncated random-walk sequences as sentences. It is scalable and parallelizable, and can be used for network classification and outlier detection. The Deepwalk approach has been successfully validated in social networking and graph analysis; it learns the underlying representations by modeling a series of short random walks, encoding them in a low-dimensional continuous vector space.
After the protein interaction network is processed by Deepwalk, each vertex in the network corresponds to a 64-dimensional vector, and the Euclidean distance between any two vertices can be calculated from their 64-dimensional vectors. In this application, the 64-dimensional vector obtained for each vertex by the Deepwalk algorithm is called the Deepwalk vector of that vertex. A vertex in the protein interaction network, called the object vertex, is selected; the Euclidean distances between the object vertex and all of its neighbors are calculated, and the arithmetic mean of these distances is obtained, i.e., the sum of the Euclidean distances between the object vertex and all of its neighbors divided by the total number of neighbors. Then, the Euclidean distance between the object vertex and each of its neighbors is compared with the arithmetic mean: neighbors whose Euclidean distance is greater than the arithmetic mean are included in the preferred neighbor set, and the others are excluded. In this way, the eligible neighbors of a particular vertex of the protein interaction network are screened to form its preferred neighbor set.
The above procedure is applied repeatedly: after a preferred neighbor set has been constructed for one object vertex in step S2114, the process returns to step S2112, selects as the new object vertex another vertex whose preferred neighbor set has not yet been constructed, and continues from step S2112, until the eligible neighbors of every vertex in the protein interaction network have been screened into a corresponding preferred neighbor set. With the preferred neighbor sets available, operations such as feature assignment can be performed by the methods disclosed above.
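The neighbor selection loop described above can be sketched as follows. The 64-dimensional DeepWalk vectors are assumed to have been computed beforehand (a toy 1-dimensional stand-in is used here instead of real DeepWalk output), and the adjacency structure is given as a 0/1 matrix; the function names are illustrative.

```python
import numpy as np

def preferred_neighbors(vectors, adjacency, v):
    """Preferred neighbor set of vertex v (steps S2112-S2114): the
    neighbors whose Euclidean distance to v in DeepWalk space exceeds
    the arithmetic mean of all of v's neighbor distances."""
    neighbors = [u for u in range(len(adjacency)) if adjacency[v][u]]
    dists = {u: np.linalg.norm(vectors[v] - vectors[u]) for u in neighbors}
    mean_d = sum(dists.values()) / len(neighbors)
    return {u for u, d in dists.items() if d > mean_d}

def all_preferred_neighbors(vectors, adjacency):
    """Step S2115: repeat for every vertex of the network."""
    return {v: preferred_neighbors(vectors, adjacency, v)
            for v in range(len(adjacency))}
```

For vertex 0 with neighbor distances 1, 2 and 3, the mean is 2, so only the neighbor at distance 3 enters the preferred set.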
According to the above neighbor selection algorithm, the meaning of the feature information matrix becomes clearer: it has N rows and D columns, where N is the total number of vertices of the protein interaction network and D is the number of features per vertex. After the Deepwalk algorithm, each vertex corresponds to a 64-dimensional vector; therefore, each element of the feature information matrix is in essence a 64-dimensional vector.
Further as a preferred embodiment, the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network is followed by an optimization step, the optimization step comprising:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
Since N is set as the initial value of D when the feature information matrix is established, the order of the feature information matrix is not necessarily the most reasonable, and the first-order and second-order estimations of the protein interaction network obtained from it are likewise not necessarily optimal, so that the final dimension reduction matrix fed to the clustering algorithm is not optimal either. To obtain an optimal dimension reduction matrix, the order of the feature information matrix is adjusted dynamically, changing the first-order and second-order estimations of the protein interaction network; when the graph Laplacian regularization term loss function computed from them reaches its minimum, the corresponding combination of first-order and second-order estimation is optimal, and this combination is taken, respectively, as the local structure information and the overall structure information of the protein interaction network, from which the dimension reduction matrix is further obtained.
Further, as a preferred embodiment, the graph Laplacian regularization term loss function is calculated as follows:

L = L_first + λL_second

wherein L is the graph Laplacian regularization term loss function, L_first is the monitored loss of the first-order estimation, L_second is the monitored loss of the second-order estimation, and λ is a balance factor between L_first and L_second; λ is a parameter whose value can be chosen when the algorithm is actually run.
Further, as a preferred embodiment, the monitored loss of the first-order estimation is calculated according to the following formula:

L_first = Σ_{(v_i, v_j) ∈ E} ||y_i − y_j||²

where v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j. Preferably, y_i is constructed as a matrix whose elements are the Deepwalk vectors corresponding to v_i and all of the preferred neighbors of v_i; the matrix y_j is constructed in the same way. Since the number of neighbors may differ from vertex to vertex, the orders of y_i and y_j may differ; zero elements are used to fill the smaller matrix so that both matrices have the same size for the calculation. When filling a smaller matrix with zero elements, the following method is preferably used: if the order of y_i is smaller than that of y_j, y_i is padded with zero elements into a new matrix whose order equals that of y_j, with y_i placed at the upper left corner of the new matrix.
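The zero-padding and summed squared difference described above can be sketched as follows; the function names `pad_to` and `first_order_loss` are illustrative, and the per-edge term follows the formula ||y_i − y_j||².

```python
import numpy as np

def pad_to(m, shape):
    """Zero-pad matrix m into the upper-left corner of a `shape` matrix,
    as in the preferred filling method described above."""
    out = np.zeros(shape)
    out[:m.shape[0], :m.shape[1]] = m
    return out

def first_order_loss(edges, features):
    """L_first = sum over edges (i, j) of ||y_i - y_j||^2, where
    features[v] stacks the DeepWalk vectors of v and its preferred
    neighbors; the smaller matrix is zero-padded to a common size."""
    total = 0.0
    for i, j in edges:
        yi, yj = features[i], features[j]
        shape = (max(yi.shape[0], yj.shape[0]), max(yi.shape[1], yj.shape[1]))
        total += np.sum((pad_to(yi, shape) - pad_to(yj, shape)) ** 2)
    return total
```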
The monitored loss of the second-order estimation is calculated according to the following formula:

L_second = Σ_{l=0}^{L0−1} ||H^(l+1) − H^(l)||²

where L0 is the number of convolution layers of the graph convolutional neural network, H^(0) = N × D, and H^(l+1) is the output of the (l+1)-th convolution layer computed from H^(l).

Here, the method of filling with zero elements is likewise used, so that the sizes of H^(l+1) and H^(l) are the same.
With the above method, the corresponding combination of first-order estimation and second-order estimation is optimal when the graph Laplacian regularization term loss function L takes its minimum value.
Further as a preferred embodiment, the step of calculating a second order estimate between any two points in the protein interaction network to obtain the overall structural information of the protein interaction network is followed by an optimization step, the optimization step comprising:
dynamically adjusting α and β so that, in the goal programming equation system shown in the figure, Z is equal to 0 or maximally close to 0;

in the formula, d1⁻ is a negative deviation variable of the first target, d1⁺ is a positive deviation variable of the first target, d2⁻ is a negative deviation variable of the second target, and d2⁺ is a positive deviation variable of the second target; X is the feature information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, α is a matrix whose number of columns equals the preferred maximum value of D, and β equals the preferred minimum value of D;
and respectively taking the first-order estimation and the second-order estimation which are calculated according to the corresponding characteristic information matrix when the Z is equal to 0 or is close to 0 to the maximum extent as the local structure information and the overall structure information of the protein interaction network required to be obtained.
The above method is another implementation of the optimization step. Mathematically, achieving optimization by finding the minimum of the graph Laplacian regularization term loss function is in fact a matrix dimensionality reduction problem, which, as a preferred embodiment, can be handled with the conventional singular value decomposition (SVD) method. According to the SVD theorem, the N×D feature information matrix X can be rewritten as U×S×V, where U is an orthogonal matrix of size N×N; S is a diagonal matrix of size N×D whose diagonal entries are the singular values of X; and V, of size D×D, is the conjugate transpose of the right singular matrix. If the smallest singular values within some highest percentage P are set to 0, an approximation matrix X′ of X is obtained. This ultimately reduces the value of D; however, because the reconstruction error of X → X′ needs to be minimized, the value of 1 − P must be maximized. Since the SVD multiplication gives, in effect, X′ = (1 − P)X with X an N×D matrix, the problem of finding the minimum of the graph Laplacian regularization term loss function can be converted into the goal programming problem over the deviation variables described in the optimization step above.
the dynamic adjustment of alpha means that alpha is initially preferably selected as an N × N matrix, namely the characteristic information matrix itself, and is adjusted, namely, the alpha is reduced step by step, for example, the rightmost column is deleted to be an N × (N-1) matrix, and then the matrix is substituted into an equation set for calculation; the rightmost column is deleted again to become a matrix of N (N-2), and then the matrix is substituted into the equation set for calculation, and the like.
In this equation system, the positive and negative deviation variables are given equal importance, i.e., each deviation variable has weight 1. Clearly, when Z equals 0, a Pareto optimal solution is obtained. In some cases, however, Z cannot be exactly 0; Z is then required to be as close to 0 as possible within its range. Therefore, α and β are updated continuously until a combination is found that makes Z equal or close to 0. The feature information matrix corresponding to this combination of α and β is optimal, and the first-order and second-order estimations calculated from the optimal feature information matrix optimize the dimension reduction matrix and thus the clustering effect.
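The SVD-based truncation described above can be sketched with NumPy. Interpreting "the highest percentage P of singular values" as zeroing the smallest P share of the singular values is an assumption here, and the function name is illustrative.

```python
import numpy as np

def truncate_singular_values(X, p):
    """Zero out the smallest share p of the singular values of X and
    rebuild the approximation X'. The truncation rule is an assumed
    reading of 'the highest percentage P of singular values'."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int(np.ceil(len(s) * (1 - p)))  # number of singular values kept
    s_trunc = s.copy()
    s_trunc[k:] = 0.0
    return (U * s_trunc) @ Vt  # scales the columns of U by s_trunc
```

Because a rank-2 matrix has a vanishing third singular value, truncating half of the singular values of such a matrix reproduces it almost exactly, while a harsher truncation reduces the rank.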
Example 2
In this example, based on three sets of PPI datasets, the protein complex detection method based on the semi-supervised network embedding model described in Example 1 was combined with existing clustering methods for experiments, and the experimental results were compared with those of applying the existing state-of-the-art clustering methods in the conventional way, in order to demonstrate the performance of the method described in Example 1. The experiments were run on a desktop computer configured with an i7 dual-core CPU at 4.00 GHz, 16 GB of memory and a GTX 1070 graphics card. The whole run over the three datasets can be completed within one day. Furthermore, since PPI data clustering is typically a one-off process in practice, runtime improvement and time complexity analysis are not a concern of this study; cluster quality matters more.
Three of the latest S. cerevisiae PPI datasets were used, namely the Krogan, Dip and Biogrid datasets. The Krogan and Dip datasets were used to evaluate the operation of several clustering algorithms. As shown in Table 1, the Krogan and Dip datasets have similar average degree and density, while the Biogrid dataset has a higher average degree and density than both. Since PPI data can be represented by an undirected graph G = (V, E), the average degree can be calculated as

degree_avg = 2|E| / |V|

and the density can be calculated as

density = 2|E| / (|V| × (|V| − 1))

The characteristics of the three PPI datasets are shown in Table 1.
PPI data have a high false positive rate, estimated at around 50%. The noise in the data interferes with clustering methods that detect protein complexes from PPI data. CYC2008 is therefore used as the reference dataset. CYC2008 provides a manually curated catalog of 408 protein complexes in Saccharomyces cerevisiae, 90% more than another popular dataset, MIPS.
TABLE 1

Data set   Vertices   Edges    Average degree   Density
Krogan     5364       61289    22.85            0.0043
Dip        4972       17836    7.17             0.0014
Biogrid    6242       255510   81.87            0.013
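The average degree and density formulas can be checked directly against the Table 1 values:

```python
def average_degree(num_vertices, num_edges):
    # For an undirected graph G = (V, E): 2|E| / |V|.
    return 2 * num_edges / num_vertices

def density(num_vertices, num_edges):
    # Fraction of possible edges present: 2|E| / (|V| (|V| - 1)).
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

# Reproducing the Krogan row of Table 1:
print(round(average_degree(5364, 61289), 2))  # 22.85
print(round(density(5364, 61289), 4))         # 0.0043
```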
The neighbor affinity score is used to judge whether a protein complex detected by an algorithm matches a protein complex in CYC2008, and it is then used to calculate precision, recall and F-measure to evaluate the performance of the algorithm. The neighbor affinity score NA(p, b) is defined as follows:

NA(p, b) = |V_p ∩ V_b|² / (|V_p| × |V_b|)

here, p = (V_p, E_p) is a predicted protein complex and b = (V_b, E_b) is a reference protein complex. The precision can then be calculated as follows:
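A minimal sketch of the neighbor affinity score on vertex sets (the edge sets are not needed for the formula); the function name is illustrative.

```python
def neighbor_affinity(vp, vb):
    """NA(p, b) = |Vp ∩ Vb|^2 / (|Vp| * |Vb|), computed on the vertex
    sets of a predicted complex p and a reference complex b."""
    inter = len(set(vp) & set(vb))
    return inter ** 2 / (len(vp) * len(vb))

# Identical complexes score 1; disjoint complexes score 0.
print(neighbor_affinity({"a", "b"}, {"a", "b"}))  # 1.0
```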
precision = N_cp / |P|

wherein P here denotes the set of predicted complexes, B the set of reference complexes, and N_cp = |{p | p ∈ P, ∃b ∈ B, NA(p, b) ≥ ω}| is the number of predicted complexes that match at least one reference complex.

The recall is calculated as follows:

recall = N_cb / |B|

wherein N_cb = |{b | b ∈ B, ∃p ∈ P, NA(p, b) ≥ ω}| is the number of reference complexes matched by at least one predicted complex.

The F-measure is the harmonic mean of precision and recall and is calculated as follows:

F = 2 × precision × recall / (precision + recall)

ω is a threshold indicating whether a predicted protein complex is confirmed as matching a protein complex in the reference dataset. In the experiments, the neighbor affinity score threshold was set to 0.25, which is sufficient to differentiate the performance of the model from that of the other algorithms.
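The precision, recall and F-measure computation can be sketched as follows, with ω defaulting to the 0.25 used in the experiments. `predicted` and `reference` are lists of vertex sets, and the function and helper names are illustrative.

```python
def precision_recall_f(predicted, reference, omega=0.25):
    """Precision, recall and F-measure based on the neighbor affinity
    score with threshold omega."""
    def na(vp, vb):
        inter = len(vp & vb)
        return inter ** 2 / (len(vp) * len(vb))

    ncp = sum(1 for p in predicted if any(na(p, b) >= omega for b in reference))
    ncb = sum(1 for b in reference if any(na(p, b) >= omega for p in predicted))
    precision = ncp / len(predicted)
    recall = ncb / len(reference)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```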
In addition, three metrics, namely the fraction score (Frac), the maximum matching ratio (MMR) and the geometric accuracy (Acc), were also used to measure the quality of protein complex clustering. Frac measures the fraction of complexes in one set matched by a complex in the other set with overlap score greater than θ = 0.25, and Frac(θ) is calculated as follows:

Frac(θ) = |{b | b ∈ B, ∃a ∈ A, OS(a, b) > θ}| / |B|

where A and B are two sets of protein complexes and OS(a, b) is the overlap score between complexes a and b.
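A sketch of Frac, under the assumption that it counts the reference complexes matched by at least one predicted complex with overlap score above θ; the function names are illustrative.

```python
def overlap(a, b):
    # Overlap score between two complexes given as vertex sets a and b.
    inter = len(a & b)
    return inter ** 2 / (len(a) * len(b))

def frac(predicted, reference, theta=0.25):
    """Fraction of reference complexes matched by some predicted
    complex with overlap score above theta."""
    matched = sum(1 for b in reference
                  if any(overlap(a, b) > theta for a in predicted))
    return matched / len(reference)
```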
Acc is the geometric mean of two other metrics, the clustering sensitivity (Sn) and the clustering positive predictive value (PPV). Sn and PPV are calculated as follows:

Sn = Σ_i max_j t_ij / Σ_i n_i

PPV = Σ_j max_i t_ij / Σ_j Σ_i t_ij

where n_i is the number of proteins in the i-th reference protein complex and the element t_ij represents the number of proteins shared by the i-th reference complex and the j-th clustered complex. Since Sn can be inflated by putting all proteins into one and the same complex, while PPV can be maximized by putting each protein into its own complex, the geometric mean of the two measures is used:

Acc = sqrt(Sn × PPV)
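Sn, PPV and Acc can be sketched from the contingency matrix t. Using row maxima over reference sizes for Sn and column maxima over column sums for PPV follows the standard definitions and is assumed to match the formulas above; the function name is illustrative.

```python
import math

def sn_ppv_acc(t, n_sizes):
    """Clustering sensitivity, positive predictive value and geometric
    accuracy from the contingency matrix t (t[i][j] = proteins shared by
    reference complex i and predicted complex j) and the reference
    complex sizes n_sizes[i]."""
    sn = sum(max(row) for row in t) / sum(n_sizes)
    col_sums = [sum(row[j] for row in t) for j in range(len(t[0]))]
    col_maxes = [max(row[j] for row in t) for j in range(len(t[0]))]
    ppv = sum(col_maxes) / sum(col_sums)
    return sn, ppv, math.sqrt(sn * ppv)
```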
MMR treats the two sets of protein complexes as a bipartite graph, in which the two sets of nodes represent the reference complexes and the predicted complexes, respectively, and the edge joining a reference complex and a predicted complex is weighted by their overlap score, calculated with the overlap formula

OS(A, B) = |V_A ∩ V_B|² / (|V_A| × |V_B|)

The value of MMR is the total weight of the maximum-weight subset of edges in which no two edges share a node (a maximum weight matching), divided by the number of reference protein complexes.
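A brute-force sketch of MMR for toy inputs; a real implementation would call a maximum-weight bipartite matching routine instead of enumerating assignments, and the function names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    # Overlap score between two complexes given as vertex sets.
    inter = len(a & b)
    return inter ** 2 / (len(a) * len(b))

def mmr(predicted, reference):
    """Maximum matching ratio: best total overlap weight over all
    injective assignments between the two complex sets, divided by the
    number of reference complexes."""
    n_ref = len(reference)
    best = 0.0
    if len(predicted) >= n_ref:
        for perm in permutations(range(len(predicted)), n_ref):
            w = sum(overlap(predicted[p], reference[i]) for i, p in enumerate(perm))
            best = max(best, w)
    else:
        for perm in permutations(range(n_ref), len(predicted)):
            w = sum(overlap(predicted[j], reference[i]) for j, i in enumerate(perm))
            best = max(best, w)
    return best / n_ref
```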
According to the research, COACH is so far the most stable and representative clustering algorithm for PPI interaction networks, and it is therefore used as the cluster analysis method for evaluating the model. Two state-of-the-art network vector models, Deepwalk and SDNE, were used to compare the performance of the models. To evaluate the robustness of the model, two traditional clustering algorithms of different types, K-means and DBSCAN, were selected for comparison. For COACH, its three key parameters, namely density, affinity and proximity, are set to 0.7, 0.2 and 0.5 respectively, which, as the experimental analysis shows, is sufficient for stable calculation with all network vector algorithms. For K-means and DBSCAN, only their default settings are used.
Because SDNE, which was originally designed for social networks, also requires a first-order estimation, three versions of SDNE are used: SDNE-NA, with no features per vertex; SDNE-ALL, with all neighbors as the features of each vertex; and SDNE-SN, with selected neighbors as the features of each vertex. SDNE-SN performs neighbor selection using the neighbor selection algorithm disclosed in Example 1.
The test results of the Krogan, Dip and Biogrid datasets are shown in fig. 5, 6 and 7, respectively.
From the results, the model was superior to the other models on all three datasets in precision, recall and F-measure. In particular, for the high-density Biogrid dataset, the model achieved an F-value at least 90% higher than the second-ranked model. For the Dip dataset, the model achieved the highest F-value of 0.528, approximately 20% higher than the COACH-only algorithm, 9.5% higher than the second-ranked COACH+SDNE-SN algorithm, and 17% higher than the COACH+Deepwalk algorithm. Similar results were found on the Krogan dataset. These results demonstrate that the model is better suited to complex, high-density networks than the other models.
Furthermore, SDNE-SN was found to outperform SDNE-NA and SDNE-ALL on all three datasets. Since SDNE-SN calculates its first-order estimation based on the neighbor selection algorithm disclosed in Example 1, this result indirectly demonstrates the effectiveness of the model.
As for the K-means and DBSCAN clustering algorithms, both performed poorly in the tests. Regardless of which network vector algorithm they were combined with, the experimental results were poor, which indicates that these two algorithms are not suitable for PPI interaction networks.
The clustering quality of each model is compared below. Based on the test results of the previous section, only three representative models were selected for comparison with the method of the invention, namely COACH, COACH+Deepwalk and COACH+SDNE-SN. Table 2 shows the number of protein complexes detected by the different models. It can be seen from the table that the model detects more protein complexes than the other models (with the exception of COACH+SDNE-SN on the Dip dataset); with this quantitative basis, it is easier to improve the quality of clustering.
TABLE 2

Data set   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Krogan     610                               570     570              580
Dip        808                               748     750              840
Biogrid    3470                              3158    3160             3267
Tables 3, 4 and 5 show the cluster quality comparisons for the Krogan, Dip and Biogrid datasets, respectively. As can be seen from Table 3, the model achieves better clustering quality, about 38% higher than the second-ranked COACH+SDNE-SN on both MMR and Frac and about 25% higher on the Acc term. The situation for the Dip dataset is roughly similar.
As for the Biogrid dataset, the clustering quality of all models decreases because of the high density of the network. However, the model is still superior to the others; for example, its Acc reaches 0.69, well above the second-ranked COACH+SDNE-SN.
TABLE 3

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.61                              0.35    0.4              0.44
Acc      0.68                              0.46    0.48             0.54
MMR      0.5                               0.19    0.25             0.36
TABLE 4

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.81                              0.61    0.62             0.64
Acc      0.68                              0.58    0.6              0.63
MMR      0.75                              0.36    0.4              0.48
TABLE 5

Metric   COACH + method of the invention   COACH   COACH+Deepwalk   COACH+SDNE-SN
Frac     0.35                              0.14    0.2              0.24
Acc      0.69                              0.39    0.4              0.45
MMR      0.28                              0.05    0.14             0.22
Compared with other network vector methods, an algorithm is designed that selects key neighbors as the features of each vertex in order to calculate its first-order estimation. In addition, a three-layer GCN is designed to deeply learn the structure of the PPI interaction network so as to preserve its second-order estimation.
Extensive experiments on various PPI interaction networks show that the model is stable and outperforms other state-of-the-art models on all metrics. In future work, it is planned to use recurrent neural networks to integrate data from the biomedical literature into PPI interaction networks to further improve the quality of protein complex detection.
Example 3
The invention discloses a protein complex detection device based on a semi-supervised network embedded model, which comprises the following components as shown in figure 8:
a memory for storing at least one program;
a processor for loading the at least one program to perform the semi-supervised network embedding model based protein complex detection method of embodiments 1 and 2.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. The protein complex detection method based on the semi-supervised network embedding model is characterized by comprising the following steps of:
obtaining an adjacency matrix of a protein interaction network;
embedding the adjacency matrix to obtain a dimension reduction matrix;
processing the dimension reduction matrix by using a clustering algorithm to obtain a protein complex detection result;
the step of embedding the adjacency matrix to obtain the dimension reduction matrix specifically comprises:
calculating first-order estimation between any two points in the protein interaction network so as to obtain local structure information of the protein interaction network;
calculating second-order estimation between any two points in the protein interaction network so as to obtain overall structure information of the protein interaction network;
storing the local structure information and the overall structure information into the adjacency matrix, thereby obtaining the dimension reduction matrix;
the step of calculating a first-order estimate between any two points in the protein interaction network to obtain local structural information of the protein interaction network specifically comprises:
selecting a preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm;
respectively endowing each vertex with characteristic information according to the preferred neighbor point set of each vertex, thereby establishing a characteristic information matrix;
calculating first-order estimation between any two points in the protein interaction network according to the characteristic information matrix;
and taking the first-order estimation between any two points in the protein interaction network as the local structural information of the protein interaction network required to be acquired.
2. The method for detecting protein complexes based on semi-supervised network embedding model as claimed in claim 1, wherein the step of calculating the second-order estimation between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network comprises:
inputting the adjacency matrix and the characteristic information matrix into a graph convolution neural network for processing, thereby outputting second-order estimation between any two points in the protein interaction network;
and taking the second-order estimation between any two points in the protein interaction network as the overall structural information of the protein interaction network required to be acquired.
3. The semi-supervised network embedding model-based protein complex detection method according to claim 1 or 2, wherein the step of selecting the preferred neighbor set of each vertex in the protein interaction network by using a neighbor selection algorithm specifically comprises:
processing the protein interaction network by using a Deepwalk algorithm so as to obtain a Deepwalk vector of each vertex;
selecting one vertex in the protein interaction network as an object vertex;
respectively calculating the Euclidean distance between the object vertex and each adjacent point according to the Deepwalk vectors of the object vertex and all adjacent points of the object vertex;
calculating the arithmetic mean of the Euclidean distances between the vertex of the object and each adjacent point of the vertex of the object;
taking a set consisting of all adjacent points of which the Euclidean distance from the vertex of the object is greater than the arithmetic mean as a preferred adjacent point set of the vertex of the object;
returning to the step of selecting one vertex in the protein interaction network as an object vertex, until a preferred neighbor set has been selected for each vertex in the protein interaction network.
4. The semi-supervised network embedding model-based protein complex detection method according to claim 2, wherein the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network is followed by an optimization step, and the optimization step comprises:
calculating a graph Laplace regularization term loss function according to first-order estimation and second-order estimation between any two points in the protein interaction network;
dynamically adjusting the order number of the characteristic information matrix until the Laplace regularization term loss function of the graph is minimized;
and respectively taking the first-order estimation and the second-order estimation which correspond to the minimum graph Laplace regularization term loss function as the local structure information and the overall structure information of the protein interaction network which need to be obtained.
5. The method for detecting protein complexes based on the semi-supervised network embedding model as recited in claim 4, wherein the graph Laplace regularization term loss function is calculated according to the following formula:
L = L_first + λL_second

wherein L is the graph Laplace regularization term loss function, L_first is the monitored loss of the first-order estimation, L_second is the monitored loss of the second-order estimation, and λ is a balance factor between L_first and L_second.
6. The method for detecting protein complexes based on semi-supervised network embedding model as claimed in claim 5, wherein the first order estimation is the monitored loss according to the following formula:
L_first = Σ_{(v_i, v_j) ∈ E} ||y_i − y_j||²

in the formula, v_i and v_j are a pair of vertices connected by an edge in the protein interaction network, y_i is the Deepwalk vector of v_i, and y_j is the Deepwalk vector of v_j;

the second-order estimation monitored loss is calculated according to the following formula:

L_second = Σ_{l=0}^{L0−1} ||H^(l+1) − H^(l)||²

in the formula, L0 is the number of convolution layers of the graph convolutional neural network, H^(0) = N × D, and H^(l+1) is the output of the (l+1)-th convolution layer of the graph convolutional neural network computed from H^(l).
7. The semi-supervised network embedding model-based protein complex detection method according to claim 2, wherein the step of calculating a second-order estimate between any two points in the protein interaction network to obtain the overall structure information of the protein interaction network is followed by an optimization step, and the optimization step comprises:
dynamically adjusting α and β so that, in the goal programming equation system shown in the figure, Z is equal to 0 or maximally close to 0;

in the formula, d1⁻ is a negative deviation variable of the first target, d1⁺ is a positive deviation variable of the first target, d2⁻ is a negative deviation variable of the second target, and d2⁺ is a positive deviation variable of the second target; X is the characteristic information matrix, D is the number of columns of X, P is the highest percentage of singular values of X, Z is the output result of inputting the adjacency matrix and the characteristic information matrix into the graph convolutional neural network for processing, α is a matrix whose number of columns equals the preferred maximum value of D, and β equals the preferred minimum value of D;
and respectively taking the first-order estimation and the second-order estimation which are calculated according to the corresponding characteristic information matrix when the Z is equal to 0 or is close to 0 to the maximum extent as the local structure information and the overall structure information of the protein interaction network required to be obtained.
8. A protein complex detection device based on a semi-supervised network embedding model is characterized by comprising:
a memory for storing at least one program;
a processor for loading the at least one program to perform the semi-supervised network embedding model based protein complex detection method of any one of claims 1 to 7.
CN201711250342.9A 2017-12-01 2017-12-01 Protein complex detection method and device based on semi-supervised network embedded model Active CN108171010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711250342.9A CN108171010B (en) 2017-12-01 2017-12-01 Protein complex detection method and device based on semi-supervised network embedded model


Publications (2)

Publication Number Publication Date
CN108171010A CN108171010A (en) 2018-06-15
CN108171010B true CN108171010B (en) 2021-09-14

Family

ID=62525063

Country Status (1)

Country Link
CN (1) CN108171010B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932402A (en) * 2018-06-27 2018-12-04 华中师范大学 A kind of protein complex recognizing method
CN110796133B (en) * 2018-08-01 2024-05-24 北京京东尚科信息技术有限公司 Text region identification method and device
CN109389151B (en) * 2018-08-30 2022-01-18 华南师范大学 Knowledge graph processing method and device based on semi-supervised embedded representation model
CN110942805A (en) * 2019-12-11 2020-03-31 云南大学 Insulator element prediction system based on semi-supervised deep learning
CN111860768B (en) * 2020-06-16 2023-06-09 中山大学 Method for enhancing point-edge interaction of graph neural network
CN112071362B (en) * 2020-08-03 2024-04-09 西安理工大学 Method for detecting protein complex fusing global and local topological structures

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013049398A2 (en) * 2011-09-28 2013-04-04 H. Lee Moffitt Cancer Center & Research Institute, Inc. Protein-protein interaction as biomarkers
CN103235900A (en) * 2013-03-28 2013-08-07 Sun Yat-sen University Weighted ensemble clustering method for mining protein complexes
CN105138866A (en) * 2015-08-12 2015-12-09 SYSU-CMU Shunde International Joint Research Institute Method for identifying protein functions based on protein-protein interaction network and network topological structure features
CN105930686A (en) * 2016-07-05 2016-09-07 Sichuan University Secondary protein structure prediction method based on deep neural network
CN106021988A (en) * 2016-05-26 2016-10-12 Henan University of Urban Construction Recognition method of protein complexes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003000849A2 (en) * 2001-06-21 2003-01-03 Bioinformatics Dna Codes, Llc Methods for representing sequence-dependent contextual information present in polymer sequences and uses thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LLE manifold embedding dimensionality reduction algorithm (in Chinese); 梦游--; https://blog.csdn.net/zhouguangfei0717/article/details/78604980; 2017-11-22; pp. 1-10 *
Protein-protein interaction network inference from multiple kernels with optimization based on random walk by linear programming; L. Huang, L. Liao and C. H. Wu; 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2015-11-01; pp. 201-207 *
Protein function prediction and optimization based on manifold learning (in Chinese); Liang Huadong; China Masters' Theses Full-text Database, Basic Sciences; 2017-08-15; p. A006-48 *
Network representation learning (DeepWalk, LINE, node2vec, SDNE) (in Chinese); u013527419; https://www.itdaan.com/blog/2017/07/24/ce511d9d6c68917c8a1afabbd66c17ae.html; 2017-07-24; pp. 1-6 *
Self-learning graph clustering for protein complex detection (in English); Zhu Jia, et al.; Control Theory & Applications; 2017-06-30; pp. 776-782 *

Also Published As

Publication number Publication date
CN108171010A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108171010B (en) Protein complex detection method and device based on semi-supervised network embedded model
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN113705772A (en) Model training method, device and equipment and readable storage medium
CN107070867B (en) Network flow abnormity rapid detection method based on multilayer locality sensitive hash table
WO2022105108A1 (en) Network data classification method, apparatus, and device, and readable storage medium
Mall et al. Representative subsets for big data learning using k-NN graphs
Guo et al. Machine learning based feature selection and knowledge reasoning for CBR system under big data
Ghanbari et al. Reconstruction of gene networks using prior knowledge
Zhu et al. Protein complexes detection based on semi-supervised network embedding model
Karrar The effect of using data pre-processing by imputations in handling missing values
CN115424660A (en) Method and device for predicting multi-source information relation by using prediction model
Wu et al. Broad fuzzy cognitive map systems for time series classification
CN115114484A (en) Abnormal event detection method and device, computer equipment and storage medium
EP4009239A1 (en) Method and apparatus with neural architecture search based on hardware performance
Krock et al. Modeling Massive Highly Multivariate Nonstationary Spatial Data with the Basis Graphical Lasso
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Tej et al. Determining optimal neural network architecture using regression methods
CN116467466A (en) Knowledge graph-based code recommendation method, device, equipment and medium
CN116383441A (en) Community detection method, device, computer equipment and storage medium
Xiao Using machine learning for exploratory data analysis and predictive models on large datasets
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
Chatur et al. Effectiveness evaluation of regression models for predictive data-mining
US20240111807A1 (en) Embedding and Analyzing Multivariate Information in Graph Structures
Rodrêguez-Fdez et al. A genetic fuzzy system for large-scale regression
KR102556235B1 (en) Method and apparatus for content based image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220323

Address after: 510000 5548, floor 5, No. 1023, Gaopu Road, Tianhe District, Guangzhou City, Guangdong Province

Patentee after: Guangdong SUCHUANG Data Technology Co.,Ltd.

Address before: 510631 School of computer science, South China Normal University, 55 Zhongshan Avenue West, Tianhe District, Guangzhou City, Guangdong Province

Patentee before: South China Normal University

Patentee before: Guangzhou Fanping Electronic Technology Co., Ltd