CN113837252B - Clustering processing method and device - Google Patents
- Publication number
- CN113837252B CN202111072320.4A
- Authority
- CN
- China
- Prior art keywords
- nodes
- label
- order
- node
- propagation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiments of this specification provide a clustering method and device. According to the method, a network graph constructed from data related to the analyzed subjects is first acquired; the network graph comprises nodes and edges, the nodes include the analyzed subjects, and the edge weights indicate the correlation between nodes. A first-order label propagation probability matrix is then obtained from the network graph, and an N-order label propagation probability matrix is obtained from the first-order matrix, where N is a positive integer greater than or equal to 2 and the N-order matrix contains the probabilities of propagating labels between nodes in an N-order or lower manner. Label propagation between nodes is then performed using the N-order label propagation probability matrix until a preset convergence condition is met. Finally, the clustering result of the analyzed subjects is determined from the label propagation result.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of artificial intelligence, and in particular, to a clustering method and apparatus.
Background
In various application scenarios, a clustering method is often used to cluster analysis objects, and then the clustering result is further analyzed. For example, consumer data is clustered, and consumer patterns or habits of various consumer groups are analyzed to provide targeted services thereto. As another example, user network behavior is analyzed and clustered to perform risk identification. However, with the advent of the big data age, the volume of data gradually increases, and there is a need to provide a clustering method with simple algorithm and fast convergence speed.
Disclosure of Invention
In view of this, one or more embodiments of the present specification describe a clustering method that is simple in algorithm and fast in convergence speed.
According to a first aspect, there is provided a cluster processing method, including:
acquiring a network graph constructed by using related data of an analyzed subject, wherein the network graph comprises nodes and edges, the nodes comprise the analyzed subject, and the weights of the edges indicate the correlation between the nodes;
obtaining a first-order tag propagation probability matrix by using the network graph;
obtaining an N-order tag propagation probability matrix based on the first-order tag propagation probability matrix, wherein N is a positive integer greater than or equal to 2, and the N-order tag propagation probability matrix comprises the probabilities of propagating tags between nodes in an N-order or lower manner;
performing label propagation among nodes by using the N-order label propagation probability matrix until a preset convergence condition is met;
and determining the clustering result of the analyzed subject by using the label propagation result.
In one embodiment, the node further comprises other objects in relation to the subject being analyzed.
In another embodiment, deriving a first-order tag propagation probability matrix using the network map includes:
determining first-order label propagation probability among nodes by using the weights of the edges between the nodes and the first-order neighbor nodes in the network graph and the maximum value and the minimum value of the weights in all the edges;
the first-order tag propagation probability matrix is obtained from the first-order tag propagation probabilities between the nodes.
In one embodiment, the determining the first-order label propagation probability between the nodes by using the weights of the edges between the nodes and the first-order neighbor nodes in the network graph and the maximum value and the minimum value of the weights in all the edges includes:
if the node i and the node j have no edges in the network diagram, determining that the first-order label propagation probability between the node i and the node j is 0;
if node i and node j have an edge in the network graph, determining the first-order label propagation probability $p_{ij}$ between node i and node j as: [formula], wherein $w_{ij}$ represents the weight of the edge between node i and node j, and $w_{\max}$ and $w_{\min}$ represent the maximum and minimum weights, respectively, over all edges of the network graph.
In another embodiment, deriving the N-order tag propagation probability matrix based on the first-order tag propagation probability matrix comprises:
determining the label propagation probability of each path passing between nodes in a network diagram by using the first-order label propagation probability between the nodes, wherein the paths comprise all paths within N hops;
determining N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes;
and obtaining the N-order tag propagation probability matrix according to the N-order tag propagation probabilities among the nodes.
In one embodiment, the N takes 2;
the second-order tag propagation probability $q_{ij}$ between node i and node j is:
$$q_{ij} = \frac{\alpha \sum_{k=1}^{n} p_{ik}\, p_{kj} + (1 - \alpha)\, p_{ij}}{\max_{u, v \in V} \left[ \alpha \sum_{k=1}^{n} p_{uk}\, p_{kv} + (1 - \alpha)\, p_{uv} \right]}$$
wherein V is the node set of the network graph, n is the number of nodes in the network graph, $\alpha$ is a preset distance attenuation coefficient, and $p_{ij}$ is the first-order tag propagation probability between node i and node j.
In another embodiment, using the N-th order tag propagation probability matrix, performing tag propagation between nodes includes:
constructing a label matrix by using labels of all nodes in the network graph;
in each round of propagation, updating the tag matrix by using the product of the N-order tag propagation probability matrix and the tag matrix;
the propagation is repeatedly performed until a preset convergence condition is reached.
In one embodiment, constructing a label matrix using labels of nodes in the network graph includes: constructing a label matrix by using first labels of all nodes in the network diagram;
determining a clustering result of the analyzed subject using the label propagation result includes: dividing nodes with the same first label into a class of clusters; and taking a second label of the analyzed main body with the highest proportion in the same cluster as the label of the cluster.
In another embodiment, constructing a label matrix using labels of nodes in the network graph includes: constructing a label matrix by using second labels and unlabeled labels of all nodes in the network diagram;
after updating the tag matrix in each round of propagation, further comprising resetting the updated tag matrix with a second tag corresponding to the node initially labeled with the second tag;
determining a clustering result of the analyzed subject using the label propagation result includes: nodes with the same second label are divided into a class of clusters, and the second label of the same class of clusters is used as the label of the class of clusters.
According to a second aspect, there is provided a cluster processing apparatus comprising:
a graph acquisition unit configured to acquire a network graph constructed using related data of an analyzed subject, the network graph including nodes including the analyzed subject and edges, weights of the edges indicating correlations between the nodes;
the first matrix acquisition unit is configured to obtain a first-order tag propagation probability matrix by utilizing the network graph;
a second matrix acquisition unit configured to obtain an N-order tag propagation probability matrix based on the first-order tag propagation probability matrix, where N is a positive integer greater than or equal to 2, the N-order tag propagation probability matrix including the probabilities of propagating tags between nodes in an N-order or lower manner;
the label propagation unit is configured to propagate labels among nodes by using the N-order label propagation probability matrix until a preset convergence condition is met;
and a result determining unit configured to determine a clustering result of the analyzed subject using the result of the tag propagation.
In one embodiment, the node further comprises other objects in relation to the subject being analyzed.
In another embodiment, the first matrix obtaining unit is specifically configured to determine a first-order tag propagation probability between the nodes by using weights of edges between the nodes and first-order neighbor nodes in the network graph, and maximum values and minimum values of weights in all edges; the first-order tag propagation probability matrix is obtained from the first-order tag propagation probabilities between the nodes.
In one embodiment, the first matrix acquisition unit is specifically configured to:
if the node i and the node j have no edges in the network diagram, determining that the first-order label propagation probability between the node i and the node j is 0;
if node i and node j have an edge in the network graph, determining the first-order label propagation probability $p_{ij}$ between node i and node j as: [formula], wherein $w_{ij}$ represents the weight of the edge between node i and node j, and $w_{\max}$ and $w_{\min}$ represent the maximum and minimum weights, respectively, over all edges of the network graph.
In another embodiment, the second matrix acquisition unit is specifically configured to:
determining the label propagation probability of each path passing between nodes in a network diagram by using the first-order label propagation probability between the nodes, wherein the paths comprise all paths within N hops;
determining N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes;
and obtaining the N-order tag propagation probability matrix according to the N-order tag propagation probabilities among the nodes.
In one embodiment, the N takes 2;
the second-order tag propagation probability $q_{ij}$ between node i and node j is:
$$q_{ij} = \frac{\alpha \sum_{k=1}^{n} p_{ik}\, p_{kj} + (1 - \alpha)\, p_{ij}}{\max_{u, v \in V} \left[ \alpha \sum_{k=1}^{n} p_{uk}\, p_{kv} + (1 - \alpha)\, p_{uv} \right]}$$
wherein V is the node set of the network graph, n is the number of nodes in the network graph, $\alpha$ is a preset distance attenuation coefficient, and $p_{ij}$ is the first-order tag propagation probability between node i and node j.
In another embodiment, the tag propagation unit is specifically configured to: constructing a label matrix by using labels of all nodes in the network graph; in each round of propagation, updating the tag matrix by using the product of the N-order tag propagation probability matrix and the tag matrix; the propagation is repeatedly performed until a preset convergence condition is reached.
In one embodiment, the label propagation unit is specifically configured to construct a label matrix by using the first labels of the nodes in the network graph;
the result determining unit is specifically configured to divide nodes with the same first label into a class of clusters; and taking a second label of the analyzed main body with the highest proportion in the same cluster as the label of the cluster.
In another embodiment, the label propagation unit is specifically configured to construct a label matrix by using the second labels and the unlabeled labels of the nodes in the network graph;
the label propagation unit is further configured to reset the updated label matrix by using a second label corresponding to the node initially marked with the second label after the label matrix is updated in each round of propagation;
the result determining unit is specifically configured to divide the nodes with the same second label into a class of clusters, and take the second label of the same class of clusters as the label of the class of clusters.
According to a third aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first aspect.
According to the method and device provided by the embodiments of this specification, clustering is based on label propagation, so the computational complexity is low. The label propagation algorithm is also improved: a label propagation probability matrix of second order or higher is determined and label propagation is performed with it, which improves the convergence speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a flow diagram of a cluster processing method according to one embodiment;
FIG. 2 illustrates a node connection relationship diagram in accordance with one embodiment;
FIG. 3 illustrates a block diagram of a cluster processing device, according to one embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
The label propagation algorithm (LPA, Label Propagation Algorithm) is a graph-based semi-supervised learning method. It has attracted increasing attention and is widely applied in classification scenarios because it is simple and easy to implement, with short execution time, low complexity and good classification results. During label propagation, the traditional label propagation algorithm considers only the labels of a node's first-order neighbors and the connecting edges: it converts the edge weights into probabilities in some way and decides according to those probabilities whether to propagate a label to a first-order neighbor. However, when the number of nodes is very large, the traditional label propagation algorithm converges very slowly or may fail to converge. In view of this, the present specification improves the label propagation algorithm to increase the convergence speed.
Specific implementations of the above concepts are described below.
FIG. 1 illustrates a flow diagram of a cluster processing method according to one embodiment. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 1, the method includes:
step 101, obtaining a network graph constructed by using related data of an analyzed subject, wherein the network graph comprises nodes and edges, the nodes comprise the analyzed subject, and the weight of the edges indicates the correlation between the nodes.
Step 103, obtaining a first-order tag propagation probability matrix by using the network diagram.
Step 105, obtaining an N-order tag propagation probability matrix based on the first-order tag propagation probability matrix, where N is a positive integer greater than or equal to 2, and the N-order tag propagation probability matrix includes the probabilities of propagating tags between nodes in an N-order or lower manner.
Step 107, performing label propagation between nodes by using an N-order label propagation probability matrix until a preset convergence condition is met; and determining the clustering result of the analyzed subject by using the label propagation result.
It can be seen that, in the method shown in FIG. 1, clustering is based on tag propagation, so the computational complexity is low. The tag propagation algorithm is also improved: a tag propagation probability matrix of second order or higher is determined and tag propagation is performed with it, which improves the convergence speed.
In addition, because the clustering method considers neighbor nodes of order 2 or higher during label propagation, it is less likely to become trapped in a local optimum.
The manner in which the individual steps shown in fig. 1 are performed is described below.
The above step 101, i.e. "acquiring a network map constructed using relevant data of the subject under analysis", will be described in detail first with reference to the embodiments.
The analyzed subject is determined according to the actual application scenario, and the network graph is constructed using data related to the analyzed subject. For example, in a risk identification scenario for users, the analyzed subject is the user, and the network graph may be constructed using user-related data. As another example, in a scenario in which commodities are clustered, the analyzed subject is the commodity, and the network graph may be constructed using commodity-related data.
The network graph constructed in the present specification may be a homogeneous network graph, i.e. the nodes in the network are all of the same type. For example, nodes are all users, edges between the nodes indicate that there is an association between the users, and weights of the edges are used for indicating the correlation between the users, and the higher the weights are, the greater the correlation is. The correlation may be embodied by transaction behavior, address book relationships, etc. between users.
In addition, the network graph constructed in the present specification may be a heterogeneous network graph, that is, there is more than one type of node or more than one type of edge in the network. For example, the nodes include users, user devices and networks (e.g., wifi identifiers). If a user uses a certain user device, an edge exists between the two corresponding nodes, and the weight of the edge can be determined according to the number of uses or the frequency of use. If a user uses a certain wifi network, an edge exists between the two corresponding nodes, and the weight of the edge can likewise be determined according to the number of uses or the frequency of use. If users are associated with each other, an edge exists between the two corresponding nodes, and the weight of the edge can reflect the transaction behavior, address book relationships, etc. between the users.
Taking a heterogeneous network as an example, the node set of the analyzed subjects is denoted $V_1$, where $n_1$ is the number of analyzed-subject nodes. Edges exist between nodes in $V_1$; the set of these edges is denoted $E_1$ and the corresponding set of weights is denoted $W_1$, where $m_1$ is the number of edges between analyzed subjects. The set of nodes other than the analyzed subjects is denoted $V_2$, where $n_2$ is the number of such nodes. The nodes in $V_2$ have various relationships with the nodes corresponding to the analyzed subjects, which form edges between the two node sets; the set of these edges is denoted $E_2$ and the corresponding set of weights is denoted $W_2$. In this case, $G(V, E, W)$ forms a heterogeneous network graph, where $V = (V_1, V_2)$, $W = (W_1, W_2)$ and $E = (E_1, E_2)$.
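The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the weights, and the helper `build_graph` are hypothetical, and the graph is assumed to be undirected, which the text does not state explicitly.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted graph from (node_a, node_b, weight) triples.

    ASSUMPTION: edges are undirected, so each weight is stored in both directions.
    """
    adjacency = defaultdict(dict)
    for a, b, weight in edges:
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return dict(adjacency)


# Hypothetical data: E1 links analyzed subjects (users) to each other,
# E2 links users to other objects (a device and a wifi identifier).
E1 = [("user:1", "user:2", 5.0)]
E2 = [("user:1", "device:A", 3.0), ("user:2", "wifi:X", 12.0)]

graph = build_graph(E1 + E2)   # E = (E1, E2)
nodes = sorted(graph)          # V = (V1, V2)
```

Keeping E1 and E2 as separate lists mirrors the notation in the text; they are merged only when the full graph G is needed.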
The step 103, i.e. "obtaining a first-order tag propagation probability matrix using a network map" is described in detail below with reference to the embodiments.
The first-order tag propagation probability matrix describes the probability of tag propagation between any two nodes in the network graph. Taking the heterogeneous network graph $G(V, E, W)$ described above as an example, the number of nodes is $n_1 + n_2$, so the first-order tag propagation probability matrix is an $(n_1 + n_2) \times (n_1 + n_2)$ matrix. Each value in the matrix, e.g. $p_{ij}$, represents the first-order label propagation probability between node i and node j, that is, the probability that a label propagates between node i and node j over a single edge.
Specifically, the first-order label propagation probability between nodes can be determined using the weights of the edges between each node and its first-order neighbor nodes in the network graph, together with the maximum and minimum weights over all edges; the first-order label propagation probability matrix is then obtained from the first-order label propagation probabilities between the nodes.
A preferred embodiment is provided herein, where node i and node j have no edges in the network graph (the edges referred to in the examples herein refer to edges directly connected between two nodes), that is, node i and node j are not first-order neighbors, and there is no edge directly connected between two nodes, then the first-order label propagation probability between node i and node j is determined to be 0.
If node i and node j have an edge in the network graph, the first-order label propagation probability $p_{ij}$ between node i and node j is determined as:
[formula (1)]
where $w_{ij}$ represents the weight of the edge between node i and node j, and $w_{\max}$ and $w_{\min}$ represent the maximum and minimum weights, respectively, over all edges of the network graph. Taking the heterogeneous network graph mentioned above as an example:
$$w_{\max} = \max_i w_i, \quad i = 1, 2, \dots, m_1 + m_2 \qquad (2)$$
$$w_{\min} = \min_i w_i, \quad i = 1, 2, \dots, m_1 + m_2 \qquad (3)$$
It should be noted that formula (1) above is only a preferred example; modifications or substitutions based on this concept or on formula (1) all fall within the scope of the disclosure.
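Formula (1) itself is not reproduced in this text, so the following Python sketch has to assume a form for it. It uses min-max scaling, $p_{ij} = (w_{ij} - w_{\min}) / (w_{\max} - w_{\min})$, chosen only because it depends on exactly the three quantities the text names; the actual formula may differ. The function name `first_order_matrix`, the 0-based node indices, and the symmetric treatment of edges are likewise assumptions.

```python
def first_order_matrix(n, weighted_edges):
    """Return an n x n first-order propagation matrix P as nested lists.

    weighted_edges: iterable of (i, j, w_ij) with 0-based node indices.
    ASSUMPTION: formula (1) is taken to be min-max scaling,
    p_ij = (w_ij - w_min) / (w_max - w_min); the source does not show it.
    """
    edges = list(weighted_edges)
    weights = [w for _, _, w in edges]
    w_max, w_min = max(weights), min(weights)
    span = w_max - w_min
    P = [[0.0] * n for _ in range(n)]   # node pairs without an edge stay at 0
    for i, j, w in edges:
        # If every edge has the same weight, fall back to probability 1.0.
        p = (w - w_min) / span if span > 0 else 1.0
        P[i][j] = p
        P[j][i] = p                     # assumed symmetric (undirected edges)
    return P


# Hypothetical weights on a triangle of three nodes.
P = first_order_matrix(3, [(0, 1, 2.0), (1, 2, 6.0), (0, 2, 4.0)])
```

Under this assumed form the lightest edge receives probability 0, the same value as a missing edge, which is one reason the real formula (1) may well be defined differently.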
The above step 105, i.e. "deriving an N-order tag propagation probability matrix based on a first-order tag propagation probability matrix", is described in detail below with reference to the embodiments.
In this specification, N is a positive integer greater than or equal to 2, and the N-order tag propagation probability matrix includes the probabilities of propagating tags between nodes in an N-order or lower manner. Here, N-order propagation means that a tag propagates over a path of N hops. Each value in the N-order tag propagation probability matrix is an N-order tag propagation probability between nodes; preferably, N is 2. For example, the second-order tag propagation probability $q_{ij}$ between node i and node j represents the total probability of propagating labels between node i and node j over one-hop paths (i.e., a single edge) and two-hop paths (i.e., two consecutive edges).
When the N-order label propagation probability matrix is obtained, the first-order label propagation probability among the nodes can be utilized to determine the label propagation probability of each path passing through among the nodes in the network diagram, wherein the paths comprise all paths within N hops. And then determining the N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes. And finally obtaining an N-order label propagation probability matrix according to the N-order label propagation probabilities among the nodes.
Taking N = 2 as an example, i.e. a second-order tag propagation probability matrix, the second-order tag propagation probability $q_{ij}$ between node i and node j may be determined using the following formula:
$$q_{ij} = \frac{\alpha \sum_{k=1}^{n} p_{ik}\, p_{kj} + (1 - \alpha)\, p_{ij}}{\max_{u, v \in V} \left[ \alpha \sum_{k=1}^{n} p_{uk}\, p_{kv} + (1 - \alpha)\, p_{uv} \right]} \qquad (4)$$
where V is the node set of the network graph, and n is the number of nodes in the network graph, e.g. $n_1 + n_2$ in the heterogeneous network mentioned above. $\alpha$ is a preset distance attenuation coefficient, for which an empirical or experimental value can be used.
In formula (4), $p_{ij}$ is the first-order label propagation probability between node i and node j, that is, the probability of label propagation between node i and node j over the directly connecting edge (a one-hop path).
The term $\sum_{k=1}^{n} p_{ik}\, p_{kj}$ is the sum of the probabilities of label propagation between node i and node j over two consecutive edges (two-hop paths).
To facilitate understanding, an example is given. Assume that node 1 and node 2 are connected in the relationship graph as shown in FIG. 2. To determine the second-order tag propagation probability $q_{12}$ between node 1 and node 2, the propagation probability of the one-hop path between node 1 and node 2, i.e. $p_{12}$, must be determined. The propagation probabilities of the two-hop paths between node 1 and node 2 must also be determined; in FIG. 2 these are the propagation probability of the path "node 1 - node 4 - node 2" and that of the path "node 1 - node 3 - node 2".
The propagation probability of the path "node 1 - node 4 - node 2" is $p_{14} p_{42}$, and the propagation probability of the path "node 1 - node 3 - node 2" is $p_{13} p_{32}$.
The numerator of formula (4) is then $\alpha (p_{14} p_{42} + p_{13} p_{32}) + (1 - \alpha) p_{12}$.
The same quantity, combining the two-hop and one-hop label propagation probabilities, can be calculated in a similar way for every pair of nodes in the relationship graph, and the maximum value is taken as the denominator of formula (4).
As can be seen from this example, the second-order tag propagation probability matrix determined in the manner provided in the embodiments of the present disclosure incorporates not only a one-hop path between nodes, i.e., a direct relationship between nodes, but also a two-hop path between nodes, i.e., an indirect relationship between nodes.
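The worked example can be checked numerically with a short Python sketch. It follows the description of formula (4) given in the example: the numerator combines the two-hop sum (weighted by alpha) with the one-hop probability (weighted by 1 - alpha), and the denominator is the largest such value over all node pairs. The probabilities of 0.5, the value alpha = 0.5, the function name `second_order_matrix`, and the choice to include diagonal entries when taking the maximum are all assumptions made for illustration.

```python
def second_order_matrix(P, alpha):
    """Combine one-hop and two-hop probabilities, then scale by the largest value.

    raw[i][j] = alpha * sum_k P[i][k] * P[k][j] + (1 - alpha) * P[i][j]
    ASSUMPTION: the maximum in the denominator is taken over every (i, j),
    including i == j.
    """
    n = len(P)
    raw = [
        [
            alpha * sum(P[i][k] * P[k][j] for k in range(n)) + (1 - alpha) * P[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]
    peak = max(max(row) for row in raw)
    if peak == 0:
        return raw  # graph without edges: nothing to scale
    return [[value / peak for value in row] for row in raw]


# Hypothetical graph mirroring the paths named in the example:
# index 0 = node 1, 1 = node 2, 2 = node 3, 3 = node 4.
P = [[0.0] * 4 for _ in range(4)]
for i, j in [(0, 1), (0, 2), (2, 1), (0, 3), (3, 1)]:
    P[i][j] = P[j][i] = 0.5

Q = second_order_matrix(P, alpha=0.5)
```

With these numbers, nodes 3 and 4 share no edge, yet their entry in Q is positive because they are linked through two-hop paths, which is the indirect relationship the text refers to.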
Step 107 above, i.e. "performing label propagation between nodes using the N-order label propagation probability matrix until a preset convergence condition is met, and determining the clustering result of the analyzed subject using the label propagation result", is described in detail below with reference to the embodiments.
In the step, a label matrix can be constructed by utilizing labels of all nodes in the network diagram; in each round of propagation, updating the tag matrix by using the product of the N-order tag propagation probability matrix and the tag matrix; the propagation is repeatedly performed until a preset convergence condition is reached.
In the above process, updating the tag matrix with the product of the N-order tag propagation probability matrix and the tag matrix is itself the tag propagation. That is, each node propagates its labels to neighboring nodes of order N or lower with the corresponding probability in the label propagation probability matrix. Because labels reach multi-hop neighbors within a single round, this propagation approach accelerates the convergence of tag propagation.
As one of the possible ways, a label matrix may be constructed using the first labels of the nodes in the network map, which is assumed to be denoted as F. The first label employed therein may be a node identification, number, etc., typically a node specific label. Taking the second-order label propagation as an example, in each round of propagation, the label matrix F is updated with the product of the second-order label propagation probability matrix Q and the label matrix F, that is, QF is used as the updated F. In the propagation process, if the correlation degree between two nodes is higher, the tags are easier to propagate, and each node propagates the tags to the first-order neighbors and the second-order neighbors with a certain probability, so that the convergence speed of tag propagation is increased.
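The update F ← QF can be sketched as follows. The text does not say how a node's current label is read out of F, so this sketch assumes it is the column with the largest score; the toy matrix Q is written by hand (two groups of nodes with no propagation between them) rather than derived from formula (4), and the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hard_labels(F):
    """ASSUMPTION: a node's current label is the column with the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in F]


def propagate(Q, max_rounds=100):
    """Start from one unique first label per node and repeat F <- Q F."""
    n = len(Q)
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    labels = hard_labels(F)
    for _ in range(max_rounds):
        F = matmul(Q, F)              # one round of propagation
        new_labels = hard_labels(F)
        if new_labels == labels:      # no label changed in this round
            break
        labels = new_labels
    return labels


# Hand-written toy matrix: nodes 0-2 form one group, nodes 3-4 another.
Q = [
    [1.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.25, 0.25, 0.0, 0.0],
    [0.5, 0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.25],
]
labels = propagate(Q)
```

In this toy case every node in the first group ends with the first label of node 0 and every node in the second group with that of node 3, so nodes sharing a first label form one cluster.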
The terms "first" and "second" in "first label" and "second label" in this specification do not imply any limitation on number, order, size, etc.; they are used only to distinguish the two kinds of label by name.
The propagation process is repeated until a preset convergence condition is reached. The convergence condition may be, for example, that labels stop propagating or that the cluster modularity no longer increases.
Labels are understood to have stopped propagating when the labels of all nodes no longer change, or when the labels of at least a certain proportion of nodes (e.g. 99.9%) no longer change.
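The proportion-based stopping rule can be checked with a small helper. The sketch below is illustrative only; the function name `labels_converged` and the parameter `min_stable_ratio` are hypothetical, with the default of 0.999 taken from the 99.9% example in the text.

```python
def labels_converged(previous, current, min_stable_ratio=0.999):
    """Return True when enough node labels are unchanged between two rounds.

    previous and current are equal-length lists of per-node labels.
    min_stable_ratio=1.0 corresponds to "all labels no longer change".
    """
    if len(previous) != len(current):
        raise ValueError("label lists must have the same length")
    if not previous:
        return True  # nothing left to propagate
    unchanged = sum(1 for a, b in zip(previous, current) if a == b)
    return unchanged / len(previous) >= min_stable_ratio
```

A looser threshold trades a small amount of label churn for fewer propagation rounds, which may matter on very large graphs.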
Modularity is a commonly used measure of the strength of community structure in a network; the communities correspond to the clusters in this specification. The closer the value is to 1, the stronger the cluster structure and the better the partition quality. Modularity can be computed from the number of edges between the resulting clusters and the number of edges within each cluster; for example, the modularity M may be expressed as:
$$M = \sum_i \left( e_{ii} - a_i^2 \right)$$
where $e_{ii}$ represents the ratio of the number of edges between nodes within cluster i to the total number of edges, $a_i = \sum_j e_{ij}$, and $e_{ij}$ represents the ratio of the number of edges between cluster i and cluster j to the total number of edges.
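As a rough illustration of how modularity could be evaluated for a candidate clustering, the sketch below assumes the standard Newman form M = Σ_i (e_ii − a_i²), which is consistent with the definitions of e_ii and a_i given above. The function name `modularity`, the unweighted edge-list input, and the convention of splitting each inter-cluster edge equally between its two clusters (so that the a_i sum to 1) are assumptions made for the example.

```python
def modularity(edges, cluster_of):
    """Compute M = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges:      list of (u, v) node pairs, one entry per edge.
    cluster_of: dict mapping each node to its cluster id.
    """
    total = len(edges)
    if total == 0:
        return 0.0
    e_within = {}  # e_ii: fraction of edges with both ends in cluster i
    a = {}         # a_i: sum over j of e_ij
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            e_within[cu] = e_within.get(cu, 0.0) + 1.0 / total
            a[cu] = a.get(cu, 0.0) + 1.0 / total
        else:
            # Assumption: an inter-cluster edge counts half toward each side.
            a[cu] = a.get(cu, 0.0) + 0.5 / total
            a[cv] = a.get(cv, 0.0) + 0.5 / total
    return sum(e_within.get(c, 0.0) - a[c] ** 2 for c in a)
```

Under this convention, putting every node in a single cluster gives M = 0, so a positive value indicates more intra-cluster edges than a structureless split would produce.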
After the algorithm converges, nodes with the same first label are grouped into one cluster. That is, nodes within a cluster share the same first label, and nodes in different clusters have different first labels.
After clustering, the clusters can be labeled using second labels, which are labels additionally annotated on some of the nodes. Taking a risk identification scenario as an example, the second label may indicate whether a user has been marked as a risky user, or the risk category marked for that user. In this scenario, the second label that accounts for the highest proportion of analyzed subjects in a cluster can be used as the label of that cluster. For example, if a high proportion of the users in a cluster are marked as high risk, the cluster may be marked as high risk, and the users in it are considered high-risk users.
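The cluster-labeling rule described above amounts to a majority vote over the annotated nodes of each cluster. The following sketch shows one way to express it; the function name `label_clusters`, the dictionary-based inputs, and the decision to count only annotated nodes (leaving clusters with no annotated node unlabeled) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_of, second_label):
    """Pick, for each cluster, the most frequent second label among its nodes.

    cluster_of:   dict node -> cluster id (result of clustering).
    second_label: dict node -> second label, only for annotated nodes.
    Clusters without any annotated node are absent from the result.
    """
    counts = defaultdict(Counter)
    for node, cluster in cluster_of.items():
        label = second_label.get(node)
        if label is not None:
            counts[cluster][label] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in counts.items()}
```

In a risk identification setting, a cluster whose annotated users are mostly marked high risk would be labeled high risk, and that label would then be extended to its unannotated members.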
As another implementation, label propagation and clustering can be performed directly with the second labels annotated on some of the nodes. Because second labels are usually annotated in advance, their categories are known. Assume there are C categories of second label and L nodes annotated with a second label. The label matrix of the annotated nodes can then be represented as an L×C matrix, denoted YL. The i-th row of YL is the label indicator vector of the i-th node: if the second label of the i-th node belongs to the j-th category, the j-th element of that row is 1 and the other elements are 0. Similarly, the U nodes without a second label are represented as a U×C matrix, denoted YU, whose initial values may be set randomly. The label matrix F of the network graph is then the combination of YL and YU.
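The construction of F from YL and YU might look like the sketch below. It is illustrative only: the function name `build_label_matrix`, the use of node indices to interleave annotated and unannotated rows in a single matrix, and the seeded random initialisation are assumptions, since the text only says that YU may be initialised randomly.

```python
import random


def build_label_matrix(num_nodes, num_classes, known, seed=0):
    """Build the label matrix F as a list of rows, one row per node.

    known: dict node index -> class index for nodes with a second label.
    Annotated nodes get a one-hot indicator row (a row of YL); the others
    get random initial values (a row of YU).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    f = []
    for i in range(num_nodes):
        if i in known:
            row = [0.0] * num_classes
            row[known[i]] = 1.0
        else:
            row = [rng.random() for _ in range(num_classes)]
        f.append(row)
    return f
```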
In each round of propagation, the label matrix F is first updated with the product of the N-order label propagation probability matrix Q and the label matrix F; that is, QF becomes the updated F. During propagation, the higher the correlation between two nodes, the more easily labels propagate between them, and each node propagates its label to its first-order, second-order, …, N-order neighbors with a certain probability, which speeds up the convergence of label propagation.
Furthermore, a node annotated with a second label already has a definite label, which should not drift during propagation. Therefore, after QF is taken as the updated F in each round, the updated label matrix F is reset using the second labels of the originally annotated nodes; that is, the entries of F at the corresponding positions are reset according to YL.
The above propagation process (each round comprising updating the label matrix and resetting the label matrix) is repeated until a preset convergence condition is reached. The convergence condition may be, for example, that labels stop propagating or that the cluster modularity no longer increases.
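The update-then-reset loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `propagate_with_reset`, the stopping test based on the largest change in F (rather than on modularity), and the final argmax used to read off each node's category are choices made for the example only.

```python
def propagate_with_reset(q, f, known, max_rounds=100, tol=1e-9):
    """Iterate F <- Q F, resetting annotated rows after every update.

    q:     n x n propagation probability matrix (assumed given).
    f:     n x c initial label matrix (list of rows).
    known: dict node index -> class index for annotated nodes.
    Returns the class index with the largest entry for every node.
    """
    n, c = len(f), len(f[0])
    for _ in range(max_rounds):
        # Update step: QF becomes the new F.
        new_f = [[sum(q[i][k] * f[k][j] for k in range(n)) for j in range(c)]
                 for i in range(n)]
        # Reset step: annotated nodes keep their original one-hot rows.
        for i, cls in known.items():
            new_f[i] = [1.0 if j == cls else 0.0 for j in range(c)]
        delta = max(abs(new_f[i][j] - f[i][j])
                    for i in range(n) for j in range(c))
        f = new_f
        if delta < tol:  # assumption: stop when F has effectively settled
            break
    return [row.index(max(row)) for row in f]
```

On a four-node chain where only the two end nodes are annotated with different categories, each interior node takes the category of the end node it is more strongly tied to.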
After the algorithm converges, nodes with the same second label are grouped into one cluster. That is, nodes within a cluster share the same second label, and nodes in different clusters have different second labels. The second label shared by a cluster is used as the label of that cluster. Nodes that were not originally annotated with a second label also acquire one through propagation.
As the above embodiments show, the label propagation algorithm involves only linear computation and has low computational complexity; adopting the N-order label propagation probability matrix improves the convergence speed and makes the algorithm less prone to getting trapped in a local optimum. The method still performs well on network graphs with nodes and edges at the billion scale or above.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing is a detailed description of the methods provided by the present disclosure, and the apparatus provided by the present disclosure is described in detail below with reference to examples.
FIG. 3 illustrates a block diagram of a cluster processing apparatus according to one embodiment. As illustrated in FIG. 3, the apparatus 300 may include a graph acquisition unit 301, a first matrix acquisition unit 302, a second matrix acquisition unit 303, a label propagation unit 304, and a result determination unit 305. The main functions of the constituent units are as follows:
The graph acquisition unit 301 is configured to acquire a network graph constructed using related data of an analyzed subject, the network graph including nodes and edges, where the nodes include the analyzed subject and the weights of the edges indicate the correlations between the nodes.
The first matrix acquisition unit 302 is configured to obtain a first-order label propagation probability matrix using the network graph.
The second matrix acquisition unit 303 is configured to obtain an N-order label propagation probability matrix based on the first-order label propagation probability matrix, where N is a positive integer greater than or equal to 2, and the N-order label propagation probability matrix includes the probabilities of propagating labels between nodes that are neighbors of order N or lower.
The label propagation unit 304 is configured to perform label propagation between nodes using the N-order label propagation probability matrix until a preset convergence condition is satisfied.
The result determination unit 305 is configured to determine a clustering result of the analyzed subject using the result of the label propagation.
The network graph acquired by the graph acquisition unit 301 may be a homogeneous network graph, in which all nodes are of the same type. It may also be a heterogeneous network graph, in which there is more than one type of node or more than one type of edge. In a heterogeneous network graph, the nodes also include other objects that have a relationship with the analyzed subject.
In one implementation, the first matrix acquisition unit 302 may be specifically configured to determine the first-order label propagation probability between nodes using the weights of the edges between each node and its first-order neighbor nodes in the network graph, together with the maximum and minimum edge weights over all edges, and to obtain the first-order label propagation probability matrix from the first-order label propagation probabilities between the nodes.
For example, if node i and node j have no edge between them in the network graph, the first matrix acquisition unit 302 determines that the first-order label propagation probability between node i and node j is 0.
If node i and node j have an edge between them in the network graph, the first matrix acquisition unit 302 may determine the first-order label propagation probability $p_{ij}$ between node i and node j as: [equation not reproduced], where $w_{ij}$ represents the weight of the edge between node i and node j, and $w_{\max}$ and $w_{\min}$ represent the maximum and minimum edge weights over all edges of the network graph, respectively.
In one implementation, the second matrix acquisition unit 303 may be specifically configured to perform:
determining the label propagation probability of each path passing between nodes in a network diagram by using the first-order label propagation probability between the nodes, wherein the paths comprise all paths within N hops;
determining N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes;
and obtaining an N-order label propagation probability matrix according to the N-order label propagation probabilities among the nodes.
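The exact expression for the N-order probability is not reproduced in this text, so the following sketch for N = 2 is only one plausible reading of the three steps above, not the patented formula. It assumes that the direct path from i to j contributes p_ij, that each two-hop path i→k→j contributes α·p_ik·p_kj with α a distance attenuation coefficient, that path contributions are summed, and that every score is divided by the largest score over all node pairs. The function name `second_order_matrix` and all of these combination rules are hypothetical.

```python
def second_order_matrix(p, alpha=0.5):
    """Sketch of a second-order propagation matrix built from first-order p.

    p:     n x n first-order propagation probabilities (0 where no edge).
    alpha: assumed attenuation applied to two-hop paths.
    """
    n = len(p)
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Assumed score of all two-hop paths i -> k -> j.
            two_hop = sum(p[i][k] * p[k][j]
                          for k in range(n) if k != i and k != j)
            raw[i][j] = p[i][j] + alpha * two_hop
    # Normalise by the maximum path score over all node pairs.
    peak = max(max(row) for row in raw)
    if peak == 0.0:
        return raw
    return [[value / peak for value in row] for row in raw]
```

The point of the illustration is that two nodes with no direct edge can still receive a non-zero propagation probability through a shared neighbor, which is what lets labels travel two hops in a single round.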
Preferably, N is 2. The second-order label propagation probability $q_{ij}$ between node i and node j is: [equation not reproduced]
where V is the node set of the network graph, n is the number of nodes in the network graph, α is a preset distance attenuation coefficient, and $p_{ij}$ is the first-order label propagation probability between node i and node j.
The label propagation unit 304 may be specifically configured to: construct a label matrix from the labels of the nodes in the network graph; in each round of propagation, update the label matrix with the product of the N-order label propagation probability matrix and the label matrix; and repeat the propagation until a preset convergence condition is reached.
In one implementation, the label propagation unit 304 is specifically configured to construct the label matrix from the first labels of the nodes in the network graph.
The result determination unit 305 is specifically configured to group nodes having the same first label into one cluster, and to take the second label that accounts for the highest proportion of analyzed subjects in a cluster as the label of that cluster.
In another implementation, the label propagation unit 304 is specifically configured to construct the label matrix from the second labels of the annotated nodes in the network graph together with entries for the nodes that have no second label.
The label propagation unit 304 is further configured, after updating the label matrix in each round of propagation, to reset the updated label matrix using the second labels of the nodes originally annotated with a second label.
The result determination unit 305 is specifically configured to group nodes having the same second label into one cluster, and to take the second label shared by a cluster as the label of that cluster.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with FIG. 1.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with FIG. 1.
As technology has developed, the meaning of computer-readable storage media has broadened, and computer programs are no longer distributed only on tangible media; they may also be downloaded directly from a network, for example. Any combination of one or more computer-readable storage media may be employed. The computer-readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The processors described above may include one or more single-core processors or multi-core processors. The processor may comprise any combination of general-purpose processors or special-purpose processors (e.g., image processors, application processors, baseband processors, etc.).
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiment descriptions may be consulted.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments have been provided to illustrate the general principles of the present invention in further detail and are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements, etc. made on the basis of the teachings of the invention shall fall within its scope.
Claims (9)
1. A clustering processing method, comprising:
acquiring a network graph constructed by using related data of an analyzed subject, wherein the network graph comprises nodes and edges, the nodes comprise the analyzed subject, and the weights of the edges indicate the correlation between the nodes;
the network graph comprises a homogeneous network graph, in which the nodes are all of the same type and are all users; an edge between nodes indicates that a correlation exists between the corresponding users, and the weight of the edge indicates the degree of that correlation, a higher weight indicating a greater correlation; the correlation is reflected by the transaction behavior and address-book relationships between the users;
obtaining a first-order tag propagation probability matrix by using the network graph;
obtaining an N-order tag propagation probability matrix based on the first-order tag propagation probability matrix, wherein N is a positive integer greater than or equal to 2, and the N-order tag propagation probability matrix comprises the probabilities of propagating tags between nodes that are neighbors of order N or lower;
performing label propagation among nodes by using the N-order label propagation probability matrix until a preset convergence condition is met;
determining a clustering result of the analyzed subject by using the label propagation result;
the obtaining the N-order tag propagation probability matrix based on the first-order tag propagation probability matrix comprises:
determining the label propagation probability of each path passing between nodes in a network diagram by using the first-order label propagation probability between the nodes, wherein the paths comprise all paths within N hops;
determining N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes;
obtaining an N-order tag propagation probability matrix according to the N-order tag propagation probabilities among the nodes;
wherein N is 2;
the second-order tag propagation probability $q_{ij}$ between node i and node j is: [equation not reproduced]
wherein V is the node set of the network graph, n is the number of nodes in the network graph, α is a preset distance attenuation coefficient, and $p_{ij}$ is the first-order tag propagation probability between node i and node j.
2. The method of claim 1, wherein the node further comprises other objects in a relationship with the subject being analyzed.
3. The method of claim 1, wherein obtaining a first-order tag propagation probability matrix by using the network graph comprises:
determining first-order label propagation probability among nodes by using the weights of the edges between the nodes and the first-order neighbor nodes in the network graph and the maximum value and the minimum value of the weights in all the edges;
the first-order tag propagation probability matrix is obtained from the first-order tag propagation probabilities between the nodes.
4. The method of claim 3, wherein determining the first-order label propagation probability between nodes using weights of edges between nodes and first-order neighbor nodes thereof in the network graph, maximum values and minimum values of weights in all edges comprises:
if the node i and the node j have no edges in the network diagram, determining that the first-order label propagation probability between the node i and the node j is 0;
if the node i and the node j have an edge between them in the network graph, determining the first-order label propagation probability $p_{ij}$ between the node i and the node j as: [equation not reproduced], wherein $w_{ij}$ represents the weight of the edge between node i and node j, and $w_{\max}$ and $w_{\min}$ represent the maximum and minimum edge weights over all edges of the network graph, respectively.
5. The method of claim 1, wherein using the N-th order tag propagation probability matrix for tag propagation between nodes comprises:
constructing a label matrix by using labels of all nodes in the network graph;
in each round of propagation, updating the tag matrix by using the product of the N-order tag propagation probability matrix and the tag matrix;
the propagation is repeatedly performed until a preset convergence condition is reached.
6. The method of claim 5, wherein constructing a label matrix using labels of nodes in the network graph comprises: constructing a label matrix by using first labels of all nodes in the network diagram;
determining a clustering result of the analyzed subject using the label propagation result includes: grouping nodes with the same first label into one cluster; and taking the second label that accounts for the highest proportion of analyzed subjects in a cluster as the label of that cluster.
7. The method of claim 5, wherein constructing a label matrix using labels of nodes in the network graph comprises: constructing a label matrix from the second labels of the annotated nodes in the network graph together with entries for the nodes that have no second label;
after updating the tag matrix in each round of propagation, further comprising resetting the updated tag matrix with a second tag corresponding to the node initially labeled with the second tag;
determining a clustering result of the analyzed subject using the label propagation result includes: nodes with the same second label are divided into a class of clusters, and the second label of the same class of clusters is used as the label of the class of clusters.
8. A cluster processing apparatus comprising:
a graph acquisition unit configured to acquire a network graph constructed using related data of an analyzed subject, the network graph including nodes and edges, the nodes including the analyzed subject and the weights of the edges indicating correlations between the nodes; the network graph comprises a homogeneous network graph, in which the nodes are all of the same type and are all users; an edge between nodes indicates that a correlation exists between the corresponding users, and the weight of the edge indicates the degree of that correlation, a higher weight indicating a greater correlation; the correlation is reflected by the transaction behavior and address-book relationships between the users;
the first matrix acquisition unit is configured to obtain a first-order tag propagation probability matrix by utilizing the network graph;
a second matrix acquisition unit configured to obtain an N-order tag propagation probability matrix based on the first-order tag propagation probability matrix, where N is a positive integer greater than or equal to 2, the N-order tag propagation probability matrix including the probabilities of propagating tags between nodes that are neighbors of order N or lower;
the label propagation unit is configured to propagate labels among nodes by using the N-order label propagation probability matrix until a preset convergence condition is met;
a result determination unit configured to determine a clustering result of the analyzed subject using the result of the tag propagation;
wherein the second matrix acquisition unit is configured to:
determining the label propagation probability of each path passing between nodes in a network diagram by using the first-order label propagation probability between the nodes, wherein the paths comprise all paths within N hops;
determining N-order label propagation probability among the nodes by using the label propagation probability of each path among the nodes and the maximum value of the label propagation probability of each path among all the nodes;
obtaining an N-order label propagation probability matrix according to the N-order label propagation probabilities among the nodes;
wherein N is 2, and the second-order label propagation probability $q_{ij}$ between node i and node j is: [equation not reproduced]
wherein V is the node set of the network graph, n is the number of nodes in the network graph, α is a preset distance attenuation coefficient, and $p_{ij}$ is the first-order label propagation probability between node i and node j.
9. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111072320.4A CN113837252B (en) | 2021-09-14 | 2021-09-14 | Clustering processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113837252A CN113837252A (en) | 2021-12-24 |
CN113837252B CN113837252B (en) | 2024-03-26 |
Family
ID=78959192
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||