CN114880482A - Graph embedding-based relation graph key personnel analysis method and system

Info

Publication number: CN114880482A
Application number: CN202210451803.3A
Authority: CN
Inventors: 张暐; 郭峰; 陈瀚平; 曹瑞雪; 陈栩琪
Original assignee: GRG Banking Equipment Co Ltd
Current assignee: GRG Banking Equipment Co Ltd
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2022-08-09
Also published as: WO2023207013A1

Abstract

The invention discloses a graph embedding-based method and system for analyzing key personnel in a relationship graph. The method comprises the following steps: constructing a person relationship graph based on social media data; analyzing each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node; generating key node seeds of the person relationship graph according to preset relevant indexes; and analyzing the key node seeds with a clustering algorithm according to the embedding vector of each node to identify key personnel nodes. The invention makes full use of the topological properties of the relationship graph and is learnable: it does not require manually set parameter values or manually specified calculation rules for modularity gain, and thus eliminates the adverse effect of unreasonable manual rules on the result. In addition, because the whole graph is computed, the isomorphism and heterogeneity of the nodes are taken into account together, and the resulting key personnel analysis is more accurate.

Description

Graph embedding-based relation graph key personnel analysis method and system

Technical Field

The invention relates to the technical field of knowledge graph analysis, and in particular to a graph embedding-based method and system for analyzing key personnel in a relationship graph.

Background

A personnel relationship graph is a knowledge graph built around the social, kinship, and emotional relationships between person entities. According to the six degrees of separation theory, any two strangers can be connected through at most five friends. To some extent, all people in the world can be linked through a network of personal relationships. Because of the complexity of the real world, the number of people and relationship types involved in constructing a relationship graph keeps growing. Within the many subgraphs of a relationship graph, often only one or a few persons play the main role. In public opinion analysis, administrative management, risk control, and recommendation systems in particular, mining key personnel is decisive for the business and has become an important technique in knowledge graph analysis and application.

Few learning-based methods exist for mining key persons in a relationship graph; most approaches still rely on manual qualitative judgment or simple static numerical calculation. For example, Chinese patent CN113032607A discloses a key personnel analysis method that includes: acquiring a member relationship graph; acquiring initial member weights; acquiring member interaction information; calculating and updating the member weights based on the interaction information and the initial weights; and, if the sum of the differences between two successive weights of each node person after updating is less than a preset weight threshold, extracting the node person with the largest updated weight as the target node person. This scheme has the following defects: 1) The node information, the values of the interaction information, and the node weight update method in the relationship graph are all set by manual rules and are not learnable. 2) When nodes and relationships are added or deleted, or the service is migrated across domains, the corresponding service rules must be supplied through manual intervention, so the scheme is not extensible. 3) The weight update of a node person uses only local structure information and personnel information and cannot exploit the global topology, so high accuracy is not achieved. These problems make key personnel analysis on relationship graphs unintelligent and severely limit its application.

As another example, Chinese patent CN112269922A discloses a method for discovering key persons in community public opinion based on network representation learning. The method includes: inputting a social network relationship graph into a community structure and structural hole node discovery model to obtain a community partition set and structural hole nodes; inputting the social network relationship graph and the community partition set into a network embedding model that incorporates social influence and community structure to obtain the social influence of the nodes in the community network graph and the network embedding vectors of the nodes; and performing visual analysis based on the structural hole nodes, the social influence, and the network embedding vectors to obtain the key public opinion persons. This solution still has the following disadvantage: from the direct and indirect modularity gain of the relationship graph through to the target matrix from which the network embedding vectors are obtained by eigenvalue decomposition, the vectorization method is given by rules throughout, and therefore still amounts to manual feature selection rather than adaptive learning. The method depends heavily on the rule definitions for direct and indirect modularity gain; if those rules fail to reflect the network structure, the method is strongly affected and the accuracy of key person discovery drops.

Disclosure of Invention

In view of the above technical problems, the present invention aims to provide a graph embedding-based method and system for analyzing key personnel in a relationship graph, which solve the problems that existing methods for mining key persons are not extensible, or have low accuracy because the weight update of node persons uses only local structure information and personnel information, or have low accuracy because they depend on rules for direct and indirect modularity gain.

The invention adopts the following technical scheme:

A graph embedding-based relationship graph key personnel analysis method comprises the following steps:

constructing a person relationship graph based on social media data;

analyzing each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node;

generating key node seeds of the person relationship graph according to preset relevant indexes;

and analyzing the key node seeds with a clustering algorithm according to the embedding vector of each node, and identifying key personnel nodes.

Optionally, constructing the person relationship graph based on social media data includes:

mining person entities and relationships from news data spanning the whole period of the triggering public opinion event to generate the person relationship graph.

Optionally, mining person entities and relationships from news data spanning the whole period of the triggering public opinion event to generate the person relationship graph includes:

using crawler technology to filter, by keyword, the news reports and social media posts published on a network platform within a specified public opinion period, so as to obtain the news report text and social media content related to the public opinion event in that period, together with the interaction relationships among entities; and generating the corresponding person relationship graph using text structuring technology.

Optionally, analyzing each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node includes:

for each node, sampling neighboring nodes by a random walk to obtain a neighboring node set; and training on the neighboring node set with a skip-gram model, in which each neighboring node is used to predict the current node so that the probability of the current node is maximized, and the neighboring nodes in the set are trained in turn to obtain the embedding vector of each node.

Optionally, generating key node seeds of the person relationship graph according to preset relevant indexes includes:

generating the adjacency matrix of the graph according to the preset relevant indexes, and performing eigendecomposition on the adjacency matrix to obtain eigenvalues and eigenvectors;

and obtaining the eigenvector corresponding to the largest eigenvalue, wherein the centrality of the i-th node is the i-th element of the eigenvector corresponding to the largest eigenvalue, and generating the key node seeds according to the centrality of each node.

Optionally, analyzing the key node seeds with a clustering algorithm according to the embedding vector of each node to identify key personnel nodes includes:

classifying the embedding vectors with a clustering algorithm according to the key node seeds to obtain a plurality of cluster categories;

calculating the cluster center of each cluster category $c_i$, taking the calculated cluster center as the updated cluster center, and taking the updated cluster center as a key personnel node.

Optionally, classifying the embedding vectors with a clustering algorithm to obtain a plurality of cluster categories includes:

taking the key node seeds as the initial cluster centers, calculating the distance from each embedding vector to each initial cluster center, obtaining the initial cluster center nearest to each embedding vector, and assigning each node to the cluster category of its nearest initial cluster center.

A graph embedding-based relationship graph key personnel analysis system comprises:

a graph construction unit, configured to construct a person relationship graph based on social media data;

a graph analysis unit, configured to analyze each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node;

a key node seed generation unit, configured to generate key node seeds of the person relationship graph according to preset relevant indexes;

and an identification unit, configured to analyze the key node seeds with a clustering algorithm according to the embedding vectors of the nodes and to identify key personnel nodes.

An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the graph-embedding-based relationship graph key personnel analysis method.

A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the graph embedding-based relationship graph key personnel analysis method.

Compared with the prior art, the invention has the beneficial effects that:

The person relationship graph is constructed based on social media data, and each node in it is analyzed with a graph embedding algorithm to obtain an embedding vector for each node. This makes full use of the topological properties of the relationship graph while remaining learnable: the network embedding representation and node vectorization are determined by random walk control and the corresponding machine learning method respectively, so no parameter values or modularity gain calculation rules need to be set manually, which eliminates the adverse effect of unreasonable manual rules on the result. Moreover, because the person relationship graph is constructed from social media data and the method relies only on the network topology, the network can be retrained quickly when nodes and relationships are added or deleted or when the service is migrated across domains, without additional knowledge injection. Key node seeds of the person relationship graph are generated according to preset relevant indexes, and the key node seeds are analyzed with a clustering algorithm according to the embedding vectors of the nodes to identify key personnel nodes; since the whole graph is computed in this process, the isomorphism and heterogeneity of the nodes are taken into account together, and the resulting key personnel analysis is more accurate.

Furthermore, neighboring nodes are sampled by a random walk to obtain a neighboring node set, each neighboring node is used to predict the current node so that the probability of the current node is maximized, and the neighboring nodes in the set are trained in turn to obtain the embedding vector of each node. Because the analysis uses a random walk-based graph embedding method, no parameter values or modularity gain calculation rules need to be set manually, which further improves the accuracy of identifying key personnel nodes.

Drawings

Fig. 1 is a schematic flowchart of a graph embedding-based relationship graph key personnel analysis method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of random walk sampling of neighboring nodes according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a graph embedding-based relationship graph key personnel analysis system according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments or technical features described below may be combined arbitrarily to form new embodiments.

Example one:

The technical terms used in the present invention are explained as follows:

Graph embedding (also called network embedding) is the process of mapping graph data (usually a high-dimensional dense matrix) to low-dimensional dense vectors; it addresses the problem that graph data is difficult to feed efficiently into machine learning algorithms.

An adjacency matrix is a matrix representing the adjacency relationships between vertices. Its logical structure has two parts, the sets V and E, where V is the set of vertices and E is the set of edges. A one-dimensional array stores the data of all vertices in the graph, and a two-dimensional array, called the adjacency matrix, stores the data of the relationships (edges or arcs) between vertices.
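As an illustration of the two-array layout just described, the following is a minimal sketch in Python. The vertex names, the edge list, and the helper `build_adjacency_matrix` are hypothetical and chosen only for the example; an undirected, unweighted graph is assumed.

```python
# Hypothetical toy data, for illustration only.
vertices = ["A", "B", "C", "D"]                 # one-dimensional array of vertex data
edges = [("A", "B"), ("A", "C"), ("C", "D")]    # undirected edges


def build_adjacency_matrix(vertices, edges):
    """Return a symmetric 0/1 adjacency matrix for an undirected graph."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: mirror the entry
    return matrix


adjacency = build_adjacency_matrix(vertices, edges)
```

Row i of `adjacency` lists which vertices are adjacent to `vertices[i]`; a weighted graph would store edge weights instead of 1.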

Centrality is a measure of the importance of a node in a network. Centrality may be defined for a single node or for a group of nodes. Eigenvector centrality combines the centrality of a node with the centrality of its neighbors.

The embedding vector of a node is the vector representation of a vertex in the network, obtained from the connection relationships in the network structure; it serves as a basic feature for tasks such as clustering and classification.

Referring to Fig. 1, Fig. 1 shows a graph embedding-based relationship graph key personnel analysis method, which includes the following steps:

Step S1, constructing a person relationship graph based on social media data;

Specifically, constructing the person relationship graph based on social media data includes:

In specific implementation, crawler technology may be used to filter, by keyword, the news reports and social media posts published on a network platform within a specified public opinion period, so as to obtain the news report text and social media content related to the public opinion event in that period, together with the interaction relationships among entities; text structuring technology is then used to generate the corresponding person relationship graph.

In specific implementation, the person relationship graph may be constructed using techniques such as knowledge triple extraction, generation of knowledge graphs that evolve dynamically over time, development relationship mining, and transfer learning based on domain knowledge.
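A minimal sketch of the last stage of step S1, assuming that entity and relation extraction has already produced (person, relation, person) triples. The triples, the function name `build_person_graph`, and the choice of edge weight (number of extracted interactions between two persons) are assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict


def build_person_graph(triples):
    """Build an undirected weighted graph where graph[u][v] counts interactions."""
    graph = defaultdict(dict)
    for head, _relation, tail in triples:
        if head == tail:
            continue  # ignore self-references
        graph[head][tail] = graph[head].get(tail, 0) + 1
        graph[tail][head] = graph[tail].get(head, 0) + 1
    return dict(graph)


# Hypothetical triples standing in for the output of text structuring.
triples = [
    ("person_a", "comments_on", "person_b"),
    ("person_b", "replies_to", "person_a"),
    ("person_a", "mentions", "person_c"),
]
person_graph = build_person_graph(triples)
```

The nested-dictionary form keeps the edge weight next to each neighbor, which is convenient for the weighted random walk in step S2.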

Step S2, analyzing each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node;

Optionally, step S2 includes:

For each node, neighboring nodes are sampled by a random walk to obtain a neighboring node set. Specifically, referring to Fig. 2, Fig. 2 is a schematic diagram of random walk sampling of neighboring nodes according to an embodiment of the present invention. Given the current vertex v, the probability of walking to vertex x is:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where $c_i$ denotes the i-th node of the walk; $\pi_{vx}$ is the unnormalized transition probability between vertices v and x, namely the probability that a random walk that has reached node v through node t then walks to node x; and Z is a normalization constant.

Specifically, to control the direction of the random walk so that it expresses the desired preference, assume that the current random walk has reached node v through node t; the unnormalized transition probability $\pi_{vx}$ of walking to x then satisfies:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot \omega_{vx}$$

where $\omega_{vx}$ is the weight of the edge (v, x), p is the return parameter, q is the distance parameter, and $d_{tx}$ is the shortest path distance between nodes t and x. The coefficient $\alpha_{pq}(t, x)$ satisfies:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

If q > 1, the random walk tends to visit nodes close to the previous node; if q < 1, it tends to visit nodes far from the previous node.
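The biased walk can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes the node2vec-style coefficient (1/p for $d_{tx} = 0$, 1 for $d_{tx} = 1$, 1/q for $d_{tx} = 2$), a graph stored as nested dictionaries of edge weights, and a first step drawn from the edge weights alone because no previous node exists yet. The names `search_bias`, `transition_probabilities`, `biased_walk`, and `toy_graph` are hypothetical.

```python
import random


def search_bias(p, q, d_tx):
    """Coefficient alpha_pq(t, x) for shortest-path distance d_tx in {0, 1, 2}."""
    if d_tx == 0:
        return 1.0 / p  # stepping back to the previous node t
    if d_tx == 1:
        return 1.0      # x is also a neighbor of t
    return 1.0 / q      # x moves away from t


def transition_probabilities(graph, t, v, p, q):
    """Normalized probability of stepping from v to each neighbor x, having come from t."""
    unnormalized = {}
    for x, weight in graph[v].items():
        if x == t:
            d_tx = 0
        elif x in graph[t]:
            d_tx = 1
        else:
            d_tx = 2
        unnormalized[x] = search_bias(p, q, d_tx) * weight  # pi_vx
    z = sum(unnormalized.values())                          # normalization constant Z
    return {x: pi_vx / z for x, pi_vx in unnormalized.items()}


def biased_walk(graph, start, length, p, q, rng):
    """Sample one walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        if not graph[v]:
            break  # dead end
        if len(walk) == 1:
            candidates = dict(graph[v])  # assumption: first step uses edge weights only
        else:
            candidates = transition_probabilities(graph, walk[-2], v, p, q)
        nodes = list(candidates)
        weights = [candidates[x] for x in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: x1 is a common neighbor of t and v, x2 is two hops from t.
toy_graph = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}
probs = transition_probabilities(toy_graph, "t", "v", p=2.0, q=0.5)
walk = biased_walk(toy_graph, "t", 5, p=2.0, q=0.5, rng=random.Random(0))
```

With p = 2 and q = 0.5 the unnormalized values for t, x1, and x2 are 0.5, 1, and 2, so the walk at v prefers x2, the node farther from t, which matches the q < 1 behavior described in the text.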

In implementation, the invention uses a random walk-based vectorization method. It differs both from the non-vectorized method of Chinese patent CN113032607A, which updates interaction information values and node weights, and from the rule-based modularity gain method of Chinese patent CN112269922A; it is learnable and adaptive.

Then, the neighboring node set is trained with a skip-gram model: each neighboring node is used to predict the current node so that the probability of the current node is maximized, and the neighboring nodes in the set are trained in turn to obtain the embedding vector of each node.
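Before a skip-gram model can be trained, the sampled walks have to be turned into (current node, neighboring node) training pairs. The sketch below shows one common way to do this with a sliding window; the window size and the function name `skipgram_pairs` are assumptions, and the model training itself is not shown.

```python
def skipgram_pairs(walks, window):
    """Return (current_node, neighbor_node) pairs within `window` steps on each walk."""
    pairs = []
    for walk in walks:
        for i, current in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((current, walk[j]))
    return pairs


# Hypothetical walk, for illustration only.
walks = [["a", "b", "c"]]
pairs = skipgram_pairs(walks, window=1)
```

Each pair is one training example; whether the model predicts the neighbor from the current node or the current node from its neighbors only changes which element of the pair is treated as the input.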

In implementation, a person relationship graph is constructed based on social media data and each node in it is analyzed with a graph embedding algorithm. For example, person entities and relationships are mined from news data spanning the whole period of the triggering public opinion event to generate the person relationship graph, and the graph is analyzed with a random walk-based graph embedding machine learning method to obtain node vectors. The whole graph is vectorized directly, so feature information is captured more comprehensively; and because the whole graph is computed, the isomorphism and heterogeneity of the nodes are taken into account together, making the key personnel analysis more accurate.

In specific implementation, when predicting the current node from its neighboring nodes, word2vec variants such as CBOW and training optimizations based on negative sampling or a Huffman tree may additionally be used to help predict the current node.

Specifically, the neighboring node set of the current node u is obtained and denoted $N_S(u)$; each neighboring node is trained with the skip-gram model, and the neighboring nodes are used to predict the current node so that the probability of the current node is maximized.

Then, each adjacent node is trained in sequence to obtain an embedded vector.
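The maximization described above can be made concrete under an assumption. As a hedged sketch only, assuming the standard node2vec form of the skip-gram objective (the source does not confirm this exact form), with $f(u)$ denoting the embedding vector of node u and V the node set, the objective could be written as:

$$\max_f \sum_{u \in V} \log \Pr\big(N_S(u) \mid f(u)\big), \qquad \Pr\big(n_i \mid f(u)\big) = \frac{\exp\big(f(n_i) \cdot f(u)\big)}{\sum_{v \in V} \exp\big(f(v) \cdot f(u)\big)}$$

Under the usual conditional independence assumption, $\Pr(N_S(u) \mid f(u)) = \prod_{n_i \in N_S(u)} \Pr(n_i \mid f(u))$, so the objective reduces to a sum of log-softmax terms over (current node, neighboring node) pairs. In practice the denominator is usually approximated, for example by negative sampling.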

Step S3, generating key node seeds of the person relationship graph according to preset relevant indexes;

Optionally, step S3 includes:

Specifically, the adjacency matrix A of the graph may be generated according to relevant indexes such as network density, reachability, clustering coefficient, and centrality measures. Eigendecomposition is performed on the adjacency matrix, that is, $Ax = \lambda x$, to obtain the eigenvalues and eigenvectors. The centrality of the i-th node is then the i-th element of the eigenvector corresponding to the largest eigenvalue.

In specific implementation, small-sample labeling methods such as manual labeling, pre-trained model labeling, and remote unsupervised small-sample labeling may be applied first; the centrality may further include importance measures such as degree centrality, betweenness centrality, and closeness centrality.
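The eigenvector centrality computation in step S3 can be sketched without a linear algebra library by using power iteration in place of a full eigendecomposition. This is a swapped-in technique chosen for brevity: it yields only the eigenvector of the largest eigenvalue, which is all the seed selection needs. Iterating on A + I instead of A is an assumption made here to avoid oscillation on bipartite graphs; it leaves the eigenvectors unchanged. The function names, the toy matrix, and the rule of taking the k most central nodes as seeds are hypothetical.

```python
import math


def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Approximate the principal eigenvector of a symmetric adjacency matrix."""
    n = len(adjacency)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One multiplication by (A + I); the identity shift keeps the eigenvectors of A.
        y = [x[i] + sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in y))
        y = [value / norm for value in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


def top_k_seeds(centrality, k):
    """Indices of the k nodes with the highest centrality (assumed seed rule)."""
    return sorted(range(len(centrality)), key=lambda i: centrality[i], reverse=True)[:k]


# Hypothetical graph: node 0 is linked to 1, 2, 3; nodes 1 and 2 are also linked.
toy_adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(toy_adjacency)
seeds = top_k_seeds(centrality, k=1)
```

The i-th entry of `centrality` is the centrality of the i-th node; here node 0 is the most central and node 3, which has a single link, the least.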

Step S4, analyzing the key node seeds with a clustering algorithm according to the embedding vectors of the nodes, and identifying key personnel nodes.

Optionally, step S4 specifically includes:

taking the key node seeds as the initial cluster centers $\alpha_1, \alpha_2, \ldots, \alpha_k$, which form the initial cluster center set $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$;

classifying the embedding vectors with a clustering algorithm to obtain a plurality of cluster categories; and calculating the cluster center of each cluster category $c_i$, the calculated cluster center being taken as a key personnel node.

In implementation, classification is performed directly on the vectors produced by graph embedding and does not rely on strong assumptions; this differs from the community structure and social influence assumptions of Chinese patent CN112269922A and gives the method generality.

Classifying the embedding vectors with a clustering algorithm includes:

computing the distance from each embedding vector $x_i$ to each initial cluster center, obtaining the initial cluster center $\alpha_i$ nearest to each embedding vector, and assigning each node to the cluster category $c_i$ of its nearest initial cluster center $\alpha_i$, where $1 \le i \le k$ and i and k are natural numbers;

Specifically, the cluster center is calculated as follows:

$$\alpha_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

where $|c_i|$ is the number of nodes in cluster category $c_i$. The assignment and cluster center calculation are iterated until a termination condition is reached, and the category containing a key node seed is taken as the key node category.
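Step S4 can be sketched as a seeded k-means loop. The sketch assumes Euclidean distance, termination when assignments stop changing (or after `max_iter` rounds), and, because a cluster center is a vector rather than a node, reporting the node nearest to each final center as the key personnel node. These choices, the function names, and the toy vectors are assumptions for illustration, not details given in the source.

```python
import math


def seeded_kmeans(vectors, seed_indices, max_iter=100):
    """Cluster vectors, using the seed nodes' vectors as the initial cluster centers."""
    centers = [list(vectors[i]) for i in seed_indices]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector joins the cluster of its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            for x in vectors
        ]
        if new_labels == labels:
            break  # assumed termination condition: assignments are stable
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [vectors[i] for i, label in enumerate(labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def nearest_nodes(vectors, centers):
    """Index of the node whose vector is closest to each center (assumed mapping)."""
    return [
        min(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))
        for center in centers
    ]


# Hypothetical 2-D embedding vectors; nodes 1 and 4 are the key node seeds.
embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = seeded_kmeans(embeddings, seed_indices=[1, 4])
key_nodes = nearest_nodes(embeddings, centers)
```

In this toy run the seeds are not the most central members of their clusters, so the updated centers move and the reported key nodes (0 and 3) differ from the initial seeds (1 and 4).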

In this embodiment, a machine learning method is used to analyze the vectorized nodes and identify key nodes; specifically, the algorithm used to identify the key personnel nodes may also be a supervised or semi-supervised machine learning classification algorithm.

Referring to fig. 3, fig. 3 shows a graph-embedding-based relationship graph key personnel analysis system according to the present invention, which includes:

Example three:

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 100, which implements the graph embedding-based relationship graph key personnel analysis method of the present invention, can be described with reference to the schematic diagram shown in Fig. 4.

As shown in Fig. 4, the electronic device 100 includes one or more processors 102, one or more memory devices 104, and the like, which are interconnected via a bus system and/or another type of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 4 are only exemplary and not limiting; the electronic device may have only some of the components shown in Fig. 4 and may also have other components and structures not shown in Fig. 4, as needed.

The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.

The memory device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the functions of the embodiments of the application described below (as implemented by the processor) and/or other desired functions. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.

The invention also provides a computer storage medium on which a computer program is stored. If the method of the invention is implemented in the form of software functional units and sold or used as a stand-alone product, it may be stored in such a computer storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, and the like. It should be noted that the content covered by the computer storage medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer storage media do not include electrical carrier signals and telecommunications signals.

Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims

1. A graph embedding-based relationship graph key personnel analysis method, characterized by comprising the following steps:

constructing a person relationship graph based on social media data;

2. The graph embedding-based relationship graph key personnel analysis method according to claim 1, wherein the constructing of the person relationship graph based on social media data comprises:

3. The graph embedding-based relationship graph key personnel analysis method according to claim 2, wherein the mining of person entities and relationships from news data spanning the whole period of the triggering public opinion event to generate the person relationship graph comprises:

4. The graph embedding-based relationship graph key personnel analysis method according to claim 1, wherein the analyzing of each node in the person relationship graph with a graph embedding algorithm to obtain an embedding vector for each node comprises:

5. The graph embedding-based relationship graph key personnel analysis method according to claim 1, wherein the generating of key node seeds of the person relationship graph according to preset relevant indexes comprises:

6. The graph embedding-based relationship graph key personnel analysis method according to claim 1, wherein the analyzing of the key node seeds with a clustering algorithm according to the embedding vector of each node to identify key personnel nodes comprises:

7. The graph embedding-based relationship graph key personnel analysis method according to claim 6, wherein the classifying of the embedding vectors with a clustering algorithm to obtain a plurality of cluster categories comprises:

8. A graph embedding-based relationship graph key personnel analysis system, characterized by comprising:

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the graph embedding-based relationship graph key personnel analysis method of any one of claims 1-7.

10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the graph-embedding based relationship graph key personnel analysis method of any one of claims 1-7.