WO2023207013A1

WO2023207013A1 - Graph embedding-based relational graph key personnel analysis method and system

Info

Publication number: WO2023207013A1
Application number: PCT/CN2022/129009
Authority: WO
Inventors: 张暐; 郭峰; 陈瀚平; 曹瑞雪; 陈栩琪
Original assignee: 广州广电运通金融电子股份有限公司
Priority date: 2022-04-26
Filing date: 2022-11-01
Publication date: 2023-11-02
Also published as: CN114880482A

Abstract

Disclosed in the present disclosure is a graph embedding-based relational graph key personnel analysis method and system. The method comprises the following steps: constructing a character relationship graph on the basis of social media data; analyzing each node in the character relationship graph by means of a graph embedding algorithm to obtain an embedding vector of each node; generating key node seeds of the character relationship graph according to a pre-correlation index; and according to the embedding vector of each node, analyzing the key node seeds by means of a clustering algorithm to identify a key personnel node. In the present disclosure, topological properties of the relationship graph are fully utilized, learnability is achieved, and there is no need to manually set a parameter value or specify a calculation rule for degree gain, so that an adverse effect of unreasonable setting of a manual rule is eliminated; meanwhile, the whole graph is calculated, and the isomorphism and heterogeneity of the nodes are integrated, so that the obtained analysis result of key personnel is more accurate.

Description

A graph embedding-based key personnel analysis method and system for relationship graphs

This disclosure claims priority to the Chinese patent application filed with the China Patent Office on April 26, 2022, with application number 202210451803.3 and the invention title "A graph embedding-based relationship graph key personnel analysis method and system", the entire content of which is incorporated by reference into this disclosure.

Technical field

The present disclosure relates to the technical field of knowledge graph analysis, and specifically relates to a method and system for analyzing key personnel of a relationship graph based on graph embedding.

Background

The personnel relationship graph is a knowledge graph built around "personnel" entities and the social, kinship, and emotional relationships between people. According to the "six degrees of separation" theory, in interpersonal communication any two strangers can establish a connection through at most five friends. To some extent, everyone in the world is connected through personal networks. Because of the complexity of the real world, more and more types of characters and relationships are involved in constructing a relationship graph. Within the sub-graphs of a relationship graph, there are often only one or a few characters who play a major role. Especially in public opinion analysis, administrative management, risk control and recommendation systems, the identification of key personnel plays a decisive role in the business and has become an important technology for knowledge graph analysis and application.

There are few learning-based methods for key person mining on relationship graphs, and existing methods still rely on manual qualitative judgment or simple static numerical calculation. For example, Chinese patent CN 113032607 A discloses a key personnel analysis method. The method includes: "obtaining the member relationship graph, obtaining member initialization weights, obtaining member interaction information, calculating and updating the member weights based on the interaction information and the initial weights, and, when the sum of the differences between two adjacent weights of each node person after the update is less than the preset weight threshold, extracting the node person with the largest updated weight as the target node person." This solution has the following shortcomings: 1) The node information, interaction information values, and node weight update methods in the relationship graph are all set by manual rules and are not learnable. 2) When nodes and relationships are added or deleted, or when the method is migrated across business domains, manual intervention is required to provide the corresponding business rules, so the method is not scalable. 3) The weight update of node personnel contains only local structural information and personnel information, fails to exploit the global topological structure, and does not achieve high accuracy. These problems prevent key personnel analysis of relationship graphs from becoming intelligent and seriously limit its application.

As another example, Chinese patent CN 112269922 A discloses a method for discovering key figures in community public opinion based on network representation learning. The method includes "inputting the social network relationship graph into a community structure and structural hole node discovery model to obtain the community division set and structural hole nodes; inputting the social network relationship graph and the community division set into a network embedding model containing social influence and community structure to obtain the social influence of the nodes in the community network graph and the network embedding representation vectors of the nodes; and performing visual analysis based on the structural hole nodes, social influence and network embedding representation vectors to obtain key figures of public opinion." This solution still has the following shortcomings: 1) From the direct modularity gain and indirect modularity gain of the relationship graph through to the target matrix from which the network embedding vectors are obtained by eigenvalue decomposition, the vectorization throughout the process is given by rules, which is still manual feature selection rather than adaptive learning. This method relies heavily on the rule definitions of direct and indirect modularity gain. If the rule definitions cannot reflect the network structure, the method is greatly affected, reducing the accuracy of key person discovery.

Summary of the invention

(1) Technical problems to be solved

The technical problem to be solved by this disclosure is the low accuracy of traditional key person mining methods.

(2) Technical solutions

In view of the above technical problems, the purpose of this disclosure is to provide a graph embedding-based relationship graph key personnel analysis method and system, to solve the problems that traditional key personnel mining methods are not scalable, that weight updates of node personnel containing only local structural information and personnel information lead to low accuracy, or that reliance on rules for direct and indirect modularity gain leads to low accuracy.

This disclosure adopts the following technical solutions:

A graph embedding-based key personnel analysis method for relationship graphs, including the following steps:

Build a character relationship graph based on social media data;

Use a graph embedding algorithm to analyze each node in the character relationship graph and obtain the embedding vector of each node;

Generate key node seeds of the character relationship graph based on preset relevant indicators;

According to the embedding vector of each node, a clustering algorithm is used to analyze the key node seeds and identify key personnel nodes.

Optionally, building a person relationship graph based on social media data includes:

Mining character entities and relationships from news data that triggers the entire cycle of public opinion events, and generating a character relationship graph.

Optionally, mining character entities and relationships from news data that triggers the entire cycle of public opinion events and generating a character relationship graph includes:

Use crawler technology to filter, by keyword, the news reports and social dynamic data published on the network platform during the specified public opinion period, and obtain the text and social dynamic content related to the public opinion event in the news reports during that period, as well as the interactive relationships between entities; then use text structuring technology to generate the corresponding character relationship graph.

Optionally, the graph embedding algorithm is used to analyze each node in the character relationship graph to obtain the embedding vector of each node, including:

For each node, a random walk method is used to obtain neighboring nodes and a set of neighboring nodes is obtained; the skip-gram model is used to train the neighboring node set, and each neighboring node is used to predict the current node so that the probability of the current node being present is maximized. Each neighboring node in the neighboring node set is trained to obtain the embedding vector of each node.

Optionally, generating key node seeds of the character relationship graph based on preset relevant indicators includes:

Generate a graph adjacency matrix according to the preset relevant indicators, perform eigendecomposition on the adjacency matrix, and obtain eigenvalues and eigenvectors;

Obtain the eigenvector corresponding to the largest eigenvalue among the eigenvalues of each node, where the centrality of the i-th node is the i-th element in the eigenvector corresponding to the largest eigenvalue, and generate key node seeds based on the centrality of each node.

Optionally, according to the key node seeds, a clustering algorithm is used to analyze the embedding vectors of each node and identify key personnel nodes, including:

According to the key node seeds, a clustering algorithm is used to classify each embedding vector to obtain several clustering categories;

Calculate the clustering center of each clustering category c_i, use the calculated clustering center as the updated clustering center, and use the updated clustering center as the key personnel node.

Optionally, the clustering algorithm is used to classify each embedding vector to obtain several clustering categories, including:

Use the key node seeds as the initial clustering centers, calculate the distance from each embedding vector to each initial clustering center, find the initial clustering center nearest to each embedding vector, and assign each node to the clustering category of its nearest initial clustering center.

A graph embedding-based key personnel analysis system for relationship graphs, including:

A graph construction unit is used to construct a character relationship graph based on social media data;

The graph analysis unit is used to analyze each node in the character relationship graph using a graph embedding algorithm to obtain the embedding vector of each node;

A key node seed generation unit, used to generate key node seeds of the character relationship graph based on preset relevant indicators;

An identification unit is configured to use a clustering algorithm to analyze the key node seeds according to the embedding vector of each node, and identify key personnel nodes.

An electronic device includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the graph embedding-based relationship graph key personnel analysis method.

A computer storage medium has a computer program stored thereon. When the computer program is executed by a processor, the computer program implements the graph embedding-based relationship graph key personnel analysis method.

(3) Beneficial effects

Compared with the existing technology, the beneficial effects of the present disclosure are:

This disclosure builds a character relationship graph based on social media data and uses a graph embedding algorithm to analyze each node in the character relationship graph to obtain the embedding vector of each node, making full use of the topological properties of the relationship graph. The method is also learnable: the network embedding representation and node vectorization are determined by random walk control and the corresponding machine learning method, respectively. There is no need to manually set parameter values or specify calculation rules for degree gain, which eliminates the adverse effects of unreasonable manual rule settings on the results. In addition, because the character relationship graph is built from social media data and the method relies only on the network topology, the network can be trained quickly without additional knowledge injection when nodes and relationships are added or deleted or when the method is migrated across business domains. Key node seeds of the character relationship graph are generated based on preset relevant indicators, and, according to the embedding vector of each node, a clustering algorithm analyzes the key node seeds to identify key personnel nodes. Because the entire graph is computed when identifying key personnel nodes, the isomorphism and heterogeneity of nodes are both taken into account, making the key personnel analysis results more accurate.

Further, a random walk method is used to sample neighboring nodes and obtain a neighboring node set. Each neighboring node is used to predict the current node so that the probability of the current node appearing is maximized, and each neighboring node in the neighboring node set is trained in sequence to obtain the embedding vector of each node. Because the graph is analyzed with a random walk-based graph embedding method, there is no need to manually set parameter values or specify calculation rules for degree gain, which further improves the accuracy of identifying key personnel nodes.

Description of the drawings

Figure 1 is a schematic flow chart of a key personnel analysis method for a relationship graph based on graph embedding provided by an embodiment of the present disclosure;

Figure 2 is a schematic diagram of random walk sampling of neighbor nodes provided by an embodiment of the present disclosure;

Figure 3 is a schematic diagram of a graph embedding-based relationship graph key personnel analysis system provided by an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Below, the present disclosure will be further described with reference to the accompanying drawings and specific implementation modes. It should be noted that, provided there is no conflict, the various embodiments or technical features described below can be arbitrarily combined to form new embodiments.

Embodiment one:

The following is an explanation of the professional terms used in this disclosure:

Graph Embedding (also called Network Embedding) is a process of mapping graph data (usually high-dimensional dense matrices) into low-dimensional dense vectors. It can well solve the problem of graph data being difficult to efficiently input into machine learning algorithms.

Adjacency Matrix is a matrix that represents the adjacent relationship between vertices. The logical structure of the adjacency matrix is divided into two parts: V and E sets, where V is the vertex and E is the edge. Therefore, a one-dimensional array is used to store all vertex data in the graph; a two-dimensional array is used to store data on the relationships (edges or arcs) between vertices. This two-dimensional array is called an adjacency matrix.
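
As a minimal sketch of the storage scheme described above, the following Python snippet keeps the vertex data in a one-dimensional list and the relationships in a two-dimensional list. The vertex names and edges are hypothetical, and the graph is assumed to be undirected and unweighted.

```python
# Hypothetical toy data: the names and edges are illustrative only.
vertices = ["A", "B", "C", "D"]                 # one-dimensional array of vertex data
edges = [("A", "B"), ("A", "C"), ("C", "D")]    # undirected relationships between vertices


def build_adjacency(vertices, edges):
    """Return a symmetric 0/1 adjacency matrix as a list of lists."""
    index = {name: i for i, name in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected graph: mirror the entry
    return matrix


adjacency = build_adjacency(vertices, edges)
for row in adjacency:
    print(row)
```

For a directed graph (arcs instead of edges), the mirrored assignment would be dropped so that the matrix is no longer forced to be symmetric.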

Centrality is used to measure the importance of a node in the network. Centrality can be defined for a single node or for a group of multiple nodes. Eigenvector centrality combines the centrality of a node's neighbors as the centrality of the node.

The embedding vector of a node refers to the vector representation of a vertex in the network obtained through the connection relationships in the network structure, and is used as a basic feature in tasks such as clustering and classification.

Please refer to Figure 1, which illustrates a graph embedding-based relationship graph key personnel analysis method of the present disclosure, which includes the following steps:

Step S1: Construct a character relationship graph based on social media data;

Specifically, the construction of a character relationship graph based on social media data includes:

During specific implementation, crawler technology can be used to filter, by keyword, the news reports and social dynamic data published on the network platform during the specified public opinion period, and to obtain the text and social dynamic content related to the public opinion event in the news reports during that period, as well as the interactive relationships between entities. Text structuring technology is then used to generate the corresponding character relationship graph.

In specific implementation, the character relationship graph can also be constructed through knowledge triple extraction technology, knowledge graph generation technology that dynamically evolves over time, development relationship mining technology, and transfer learning technology based on domain knowledge.

Step S2: Use a graph embedding algorithm to analyze each node in the character relationship graph to obtain the embedding vector of each node;

Optionally, step S2 includes:

For each node, a random walk method is used to sample neighboring nodes and obtain a set of neighboring nodes. Specifically, please refer to Figure 2, which is a schematic diagram of random walk sampling of neighboring nodes provided by an embodiment of the present disclosure. Given the current vertex v, the probability of walking to vertex x is:

$$P(x \mid v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

where E is the edge set of the graph, $\pi_{vx}$ is the unnormalized transition probability between vertices v and x, that is, the probability that a random walk which passed through node t to reach node v then walks to node x, and Z is a normalizing constant;

Specifically, to control the direction of the random walk and express a search preference, assume that the random walk has passed through node t and is now at node v. The unnormalized probability $\pi_{vx}$ of walking to x then satisfies:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot \omega_{vx}$$

where $\omega_{vx}$ is the weight of the edge between v and x, p is the return parameter, q is the away parameter, and $d_{tx}$ is the shortest path distance between nodes t and x. The coefficient $\alpha_{pq}(t, x)$ satisfies:

$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$$

If q > 1, the random walk tends to visit nodes close to the previous node t; if q < 1, the random walk tends to visit nodes far away from the previous node t.
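
The biased walk described above can be sketched in Python as follows. This is an illustrative sketch rather than the disclosure's implementation: it assumes the bias coefficient takes the values 1/p, 1 and 1/q for d_tx = 0, 1 and 2, that the graph is stored as a dictionary of weighted neighbors, and that the first step (which has no previous node) is sampled in proportion to the edge weights. The toy graph, the function names and the values of p and q are hypothetical.

```python
import random


def alpha(p, q, d_tx):
    """Search bias for a shortest-path distance d_tx between t and x."""
    if d_tx == 0:
        return 1.0 / p   # x is t itself: a return step
    if d_tx == 1:
        return 1.0       # x is also a neighbor of t
    return 1.0 / q       # d_tx == 2: x leads away from t


def transition_probs(graph, t, v, p, q):
    """Normalized probabilities of stepping from v to each neighbor x, having come from t."""
    unnormalized = {}
    for x, w_vx in graph[v].items():
        if x == t:
            d_tx = 0
        elif x in graph[t]:
            d_tx = 1
        else:
            d_tx = 2
        unnormalized[x] = alpha(p, q, d_tx) * w_vx   # pi_vx
    z = sum(unnormalized.values())                   # normalizing constant Z
    return {x: pi / z for x, pi in unnormalized.items()}


def biased_walk(graph, start, length, p, q, rng):
    """Sample one walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = list(graph[v])
        if not neighbors:
            break
        if len(walk) == 1:
            weights = [graph[v][x] for x in neighbors]   # no previous node yet
        else:
            probs = transition_probs(graph, walk[-2], v, p, q)
            weights = [probs[x] for x in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: node -> {neighbor: edge weight}.
graph = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}
probs = transition_probs(graph, "t", "v", p=2.0, q=0.5)
walk = biased_walk(graph, "t", 6, p=2.0, q=0.5, rng=random.Random(0))
print(probs)
print(walk)
```

With p = 2 and q = 0.5 the unnormalized weights from v are 0.5 (back to t), 1 (to x1) and 2 (to x2), so the walk prefers the node farthest from t, which matches the q < 1 case in the text.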

In the above implementation process, the present disclosure uses a random walk-based vectorization method, which differs from the non-vectorized method of updating interaction information values and node weights in Chinese patent CN 113032607 A, and also differs from the rule-based modularity gain method of Chinese patent CN 112269922 A; it is learnable and adaptive.

Then, the skip-gram model is used to train the neighboring node set, and each neighboring node is used to predict the current node so that the probability of the current node appearing is maximized. Each neighboring node in the neighboring node set is trained in sequence to obtain the embedding vector of each node.

In the above implementation process, a character relationship graph is constructed based on social media data, and a graph embedding algorithm is used to analyze each node in the character relationship graph. For example, character entities and relationships are mined from news data that triggers the entire cycle of public opinion events to generate a character relationship graph; a random walk-based graph embedding machine learning method then analyzes the graph to obtain node vectors, directly vectorizing the entire graph and obtaining more comprehensive feature information. Because the entire graph is computed, the isomorphism and heterogeneity of nodes are both taken into account, making the key personnel analysis results more accurate.

In specific implementation, in the technical process of using neighboring nodes to predict the current node, derivative methods of word2vec such as CBOW, and training optimization methods based on negative sampling or Huffman trees can also be used to help predict the current node.

Specifically, the set of neighboring nodes of the current node is obtained, recorded as N _s (u). First, the skip-gram model is used to train each neighboring node, and the neighboring nodes are used to predict the current node, so that the probability of the current node appearing is maximized. The maximum probability is

Then each neighboring node is trained in sequence to obtain the embedding vector.
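
The training step can be illustrated with a small pure-Python sketch. It is written under several assumptions that are not taken from the disclosure: the walks are hypothetical, each neighboring node within a fixed window is used as the input and the current node as the prediction target, and a full softmax over all nodes is used instead of negative sampling or a Huffman tree. The function name and hyperparameters are illustrative.

```python
import math
import random


def train_embeddings(walks, dim=8, window=2, epochs=30, lr=0.05, seed=0):
    """Learn one vector per node from (neighbor -> current node) pairs."""
    rng = random.Random(seed)
    nodes = sorted({node for walk in walks for node in walk})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    # Input vectors become the embeddings; output vectors score the predicted node.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    w_out = [[0.0] * dim for _ in range(n)]

    for _ in range(epochs):
        for walk in walks:
            for pos, current in enumerate(walk):
                lo = max(0, pos - window)
                hi = min(len(walk), pos + window + 1)
                for k in range(lo, hi):
                    if k == pos:
                        continue
                    src, target = index[walk[k]], index[current]
                    # Softmax over all nodes: P(node | neighbor).
                    scores = [sum(a * b for a, b in zip(w_in[src], w_out[j]))
                              for j in range(n)]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    total = sum(exps)
                    grad_src = [0.0] * dim
                    for j in range(n):
                        err = exps[j] / total - (1.0 if j == target else 0.0)
                        for d in range(dim):
                            grad_src[d] += err * w_out[j][d]       # uses the old output vector
                            w_out[j][d] -= lr * err * w_in[src][d]
                    for d in range(dim):
                        w_in[src][d] -= lr * grad_src[d]
    return {node: w_in[index[node]] for node in nodes}


# Hypothetical walks over two loosely separated groups of nodes.
walks = [
    ["a", "b", "c", "a", "b"],
    ["c", "b", "a", "c", "b"],
    ["d", "e", "f", "d", "e"],
    ["f", "e", "d", "f", "e"],
]
embeddings = train_embeddings(walks)
print({node: [round(v, 2) for v in vec[:3]] for node, vec in embeddings.items()})
```

Each gradient step lowers the negative log-probability of the current node given one neighbor, so repeating it over all neighbor pairs is one way to push the probability of the current node toward its maximum.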

Step S3: Generate key node seeds of the character relationship graph according to preset relevant indicators;

Optionally, step S3 includes:

Specifically, the graph adjacency matrix A can be generated based on relevant indicators such as network density, reachability, clustering coefficient and centrality measure, and eigendecomposition is performed on the adjacency matrix, that is, Ax = λx. After the eigenvalues and eigenvectors are obtained, the eigenvector corresponding to the largest eigenvalue is taken, and the centrality of the i-th node equals the i-th element of that eigenvector.
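
A minimal sketch of the seed generation step is given below. It assumes a symmetric 0/1 adjacency matrix and approximates the eigenvector of the largest eigenvalue by power iteration rather than a full eigendecomposition; the iteration is applied to A + I, which has the same eigenvectors as A but avoids oscillation on bipartite graphs. Selecting the k most central nodes as seeds, the example graph and the function names are assumptions made for illustration.

```python
def eigenvector_centrality(adjacency, iterations=500, tol=1e-12):
    """Approximate the principal eigenvector of a symmetric adjacency matrix."""
    n = len(adjacency)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + I): x_i + sum_j A_ij * x_j.
        nxt = [x[i] + sum(adjacency[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        nxt = [v / norm for v in nxt]
        converged = max(abs(a - b) for a, b in zip(nxt, x)) < tol
        x = nxt
        if converged:
            break
    return x


def top_k_seeds(centrality, k):
    """Indices of the k nodes with the highest centrality."""
    order = sorted(range(len(centrality)), key=lambda i: centrality[i], reverse=True)
    return order[:k]


# Hypothetical graph: node 0 is linked to 1, 2 and 3; node 3 is also linked to 4.
adjacency = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
centrality = eigenvector_centrality(adjacency)
seeds = top_k_seeds(centrality, 2)
print([round(c, 4) for c in centrality])
print(seeds)
```

In this example node 0 has the highest centrality and node 3 the second highest, so they would be chosen as the two key node seeds.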

In specific implementation, annotation can be performed first, using small-sample annotation methods such as manual annotation, pre-trained model annotation, and remote unsupervised annotation. The centrality can also include other importance metrics such as degree centrality, betweenness centrality and closeness centrality.

Step S4: According to the embedding vector of each node, use a clustering algorithm to analyze the key node seeds and identify key personnel nodes.

Optionally, step S4 specifically includes:

The key node seeds are used as initial clustering centers, denoted α₁, α₂, ..., α_k, which constitute the initial clustering center set α = {α₁, α₂, ..., α_k};

A clustering algorithm is used to classify each embedding vector to obtain several clustering categories; the clustering center of each clustering category c _i is calculated, and the calculated clustering center is used as a key personnel node.

In the above implementation process, the present disclosure directly classifies the vectors produced by graph embedding without relying on strong assumptions. This differs from the community structure and social influence assumptions of Chinese patent CN 112269922 A and gives the method universal applicability.

The steps of using a clustering algorithm to classify each embedding vector include:

Calculate the distance between each embedding vector x and each initial cluster center, find the initial cluster center α_i nearest to each embedding vector, and assign the corresponding node to the clustering category c_i to which that center α_i belongs, where 1 ≤ i ≤ k, and i and k are both natural numbers;

Specifically, the cluster center is calculated as:

$$\alpha_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

where $|c_i|$ is the number of nodes in clustering category $c_i$ and x ranges over the embedding vectors assigned to $c_i$. The assignment and cluster center calculation are iterated repeatedly until a termination condition is reached. The category in which a key node seed is located is regarded as a key node category.
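
The clustering step can be sketched as a seeded k-means loop in Python. The sketch assumes squared Euclidean distance, stops when the assignments no longer change, and, because a cluster center is a vector rather than a node, reports the node whose embedding lies nearest to each final center as the key personnel node. That last choice, the example vectors and the function names are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda c: squared_distance(point, centres[c]))


def seeded_kmeans(vectors, seed_ids, max_iter=100):
    """K-means whose initial centres are the embeddings of the key node seeds."""
    centres = [list(vectors[s]) for s in seed_ids]
    labels = []
    for _ in range(max_iter):
        new_labels = [nearest(v, centres) for v in vectors]
        if new_labels == labels:          # termination: assignments are stable
            break
        labels = new_labels
        for c in range(len(centres)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                   # mean of the vectors in category c
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


def key_nodes(vectors, centres):
    """Assumed read-out: the node nearest to each final centre."""
    return [min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], c))
            for c in centres]


# Hypothetical 2-D embeddings; nodes 0 and 3 are the key node seeds.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centres = seeded_kmeans(vectors, seed_ids=[0, 3])
key = key_nodes(vectors, centres)
print(labels, centres, key)
```

Seeding the centers with high-centrality nodes removes the random initialization of ordinary k-means, so the result is deterministic for a given set of embeddings and seeds.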

In this embodiment, a machine learning method is used to analyze vectorized nodes and identify key nodes. Specifically, the algorithm used to identify key personnel nodes can use supervised and semi-supervised machine learning classification algorithms.

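Please refer to Figure 3, which shows a graph embedding-based relationship graph key personnel analysis system of the present disclosure, including: a graph construction unit, used to construct a character relationship graph based on social media data; a graph analysis unit, used to analyze each node in the character relationship graph using a graph embedding algorithm to obtain the embedding vector of each node; a key node seed generation unit, used to generate key node seeds of the character relationship graph; and an identification unit, used to analyze the key node seeds with a clustering algorithm according to the embedding vector of each node and identify key personnel nodes.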

Embodiment three:

FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. The schematic diagram shown in FIG. 4 can be used to describe the electronic device 100 for implementing the graph embedding-based relationship graph key personnel analysis method of the embodiments of the present disclosure.

As shown in the schematic structural diagram of FIG. 4, the electronic device 100 includes one or more processors 102 and one or more storage devices 104. These components are interconnected through a bus system and/or other forms of connection mechanisms (not shown). It should be noted that the components and structure of the electronic device 100 shown in FIG. 4 are only exemplary and not restrictive. As needed, the electronic device may have only some of the components shown in FIG. 4, or may have other components and structures not shown in FIG. 4.

The processor 102 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.

The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and may be executed by the processor 102 to implement the functions (implemented by the processor) described in the embodiments of the present disclosure and/or other desired functions. Various application programs and various data, such as data used and/or generated by the application programs, may also be stored in the computer-readable storage medium.

The present disclosure also provides a computer storage medium with a computer program stored thereon. If the method of the present disclosure is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in the computer storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be completed by instructing relevant hardware through a computer program. The computer program can be stored in a computer storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, and the like. It should be noted that the content contained in the computer storage medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer storage medium does not include electrical carrier signals and telecommunications signals.

For those skilled in the art, various other corresponding changes and deformations can be made based on the technical solutions and concepts described above, and all of these changes and deformations should fall within the protection scope of the claims of the present disclosure.

Industrial applicability

The graph embedding-based key personnel analysis method of the relationship graph provided by this disclosure constructs a character relationship graph based on social media data, and uses a graph embedding algorithm to analyze each node in the character relationship graph to obtain the embedding vector of each node. It makes full use of the topological properties of the relationship graph, uses a clustering algorithm to analyze key node seeds according to the embedding vector of each node, identifies key personnel nodes, and calculates the entire graph, integrating the isomorphism and heterogeneity of nodes. This makes the key personnel analysis results more accurate and has strong industrial practicability.

Claims

A graph embedding-based key personnel analysis method for relationship graphs, which is characterized by including the following steps:

Build a character relationship graph based on social media data;

Use a graph embedding algorithm to analyze each node in the character relationship graph and obtain the embedding vector of each node;

Generate key node seeds of the character relationship graph based on preset relevant indicators;

According to the embedding vector of each node, a clustering algorithm is used to analyze the key node seeds and identify key personnel nodes.
The key personnel analysis method of the relationship graph based on graph embedding according to claim 1, characterized in that the construction of the person relationship graph based on social media data includes:

Mining character entities and relationships from news data that triggers the entire cycle of public opinion events, and generating a character relationship graph.
The key personnel analysis method of a relationship graph based on graph embedding according to claim 2, characterized in that, mining character entities and relationships from news data that triggers the entire cycle of public opinion events to generate a character relationship graph includes:

Use crawler technology to filter news reports and social dynamic data published during the specified public opinion period through keywords on the network platform, and obtain the text and social dynamic content related to the public opinion event in the news reports during the public opinion period, as well as the interactive relationship between entities. Use text structuring technology to generate the corresponding character relationship map.
The key personnel analysis method of the relationship graph based on graph embedding according to claim 1, characterized in that using the graph embedding algorithm to analyze each node in the character relationship graph to obtain the embedding vector of each node includes:

For each node, a random walk method is used to obtain neighboring nodes and a set of neighboring nodes is obtained; the skip-gram model is used to train the neighboring node set, and each neighboring node is used to predict the current node so that the probability of the current node being present is maximized. Each neighboring node in the neighboring node set is trained to obtain the embedding vector of each node.
The key personnel analysis method of the relationship graph based on graph embedding according to claim 1, characterized in that generating key node seeds of the character relationship graph based on preset relevant indicators includes:

Generate a graph adjacency matrix according to the preset relevant indicators, perform eigendecomposition on the adjacency matrix, and obtain eigenvalues and eigenvectors;

Obtain the eigenvector corresponding to the largest eigenvalue among the eigenvalues of each node, where the centrality of the i-th node is the i-th element in the eigenvector corresponding to the largest eigenvalue, and generate key node seeds based on the centrality of each node.
The key personnel analysis method of a relationship graph based on graph embedding according to claim 1, characterized in that using a clustering algorithm to analyze the key node seeds according to the embedding vector of each node and identify key personnel nodes includes:

According to the key node seeds, a clustering algorithm is used to classify each embedding vector to obtain several clustering categories;

Calculate the clustering center of each clustering category c_i, use the calculated clustering center as the updated clustering center, and use the updated clustering center as the key personnel node.
The key personnel analysis method of relationship graph based on graph embedding according to claim 6, characterized in that the clustering algorithm is used to classify each embedding vector to obtain several clustering categories, including:

Use the key node seeds as the initial clustering centers, calculate the distance from each embedding vector to each initial clustering center, find the initial clustering center nearest to each embedding vector, and assign each node to the clustering category of its nearest initial clustering center.
A graph embedding-based relationship graph key personnel analysis system, which is characterized by including:

A graph construction unit is used to construct a character relationship graph based on social media data;

The graph analysis unit is used to analyze each node in the character relationship graph using a graph embedding algorithm to obtain the embedding vector of each node;

A key node seed generation unit, used to generate key node seeds of the character relationship graph based on preset relevant indicators;

An identification unit is configured to use a clustering algorithm to analyze the key node seeds according to the embedding vector of each node, and identify key personnel nodes.
An electronic device, characterized in that it includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the graph embedding-based relationship graph key personnel analysis method described in any one of claims 1-7.
A computer storage medium with a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the graph embedding-based relationship graph key personnel analysis method described in any one of claims 1-7.