CN116756391A - Unbalanced graph node neural network classification method based on graph data enhancement - Google Patents
- Publication number
- CN116756391A (application CN202310683638.9A)
- Authority
- CN
- China
- Prior art keywords
- node
- graph
- sampled
- vector
- node set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
Abstract
The invention discloses an unbalanced graph node neural network classification method based on graph data enhancement, aimed at the node imbalance problem in current graph data classification. First, a double-layer graph convolutional network is used to learn node embeddings; then minority-class samples are generated, and links are added for the synthetic nodes through a link generation model, increasing the structural information of the synthetic nodes. The method provided by the invention has good robustness, markedly improved performance indices, and strong generalization capability, and can be widely applied in the field of unbalanced graph node classification.
Description
Technical Field
The invention relates to the field of graph neural network classification, in particular to an unbalanced graph node neural network classification method based on graph data enhancement.
Background
In the field of machine learning, a common task is classification, i.e. assigning a label to a sample. In some problems, such as image recognition and natural language processing, samples are modeled in the form of tables, images, time sequences, and the like; in the real world, however, many tasks involve data from irregular domains such as social networks, biological networks, and brain neural connections. Such data can generally be represented in the form of a graph, and classification of the nodes in the graph is a very important application: for example, classifying documents in a citation network helps researchers search for literature, and classifying users in a social network helps recommendation systems make more accurate recommendations.
The data involved in many classification tasks can be represented in the form of graphs, where vertices represent samples and edges represent relationships between vertices (edges may also be called links); such graph-based representations of data are collectively referred to as graph data. Because of the asymmetry between different portions of much graph data, many datasets generated by graph-based systems naturally exhibit highly skewed sample class distributions: classes with a greater number of samples are referred to as majority classes and classes with fewer samples as minority classes. Graph data classification algorithms trained on such data tend to identify minority-class samples as majority-class, resulting in poor classification accuracy.
Graph data, however, is non-Euclidean: the node count, the connection patterns, and the number of neighbors per node all vary. Many effective methods have been developed for classification tasks on Euclidean data such as speech sequences, image pixels, and video frames, but because of the great difference between Euclidean and non-Euclidean data, these methods are difficult to migrate directly to graph data. In addition, the class imbalance phenomenon is ubiquitous in graph data. In the real world, differences in sample properties and sample occurrence frequencies leave the number of samples per class unbalanced, and the majority classes often far outnumber the minority classes; this causes traditional machine learning algorithms to identify minority-class samples as majority-class, a difficulty known in the machine learning field as the class imbalance problem. The class imbalance problem also appears in graph data: because of the asymmetry of different parts of graph data, many datasets generated by graph-based systems naturally show highly skewed distributions. For example, the dataset CiteSeer, a citation network containing reference relations between papers and a typical piece of graph data, has 21.1% of its papers in the majority class while the minority class accounts for only 7.9%.
Existing methods generally assume that the distribution of input classes is close to or perfectly balanced, and they provide a class-balanced training set to ensure that feature learning across the classes is balanced, thereby sidestepping the class imbalance problem entirely. For the class imbalance problem, how to account for both majority-class and minority-class samples in the graph node classification task is the key to researching and realizing a node classification method for class-imbalanced graph data.
Disclosure of Invention
Aiming at the technical defects in the background art, the invention provides an unbalanced graph node neural network classification method based on graph data enhancement, which solves the above technical problems and meets practical requirements; the specific technical scheme is as follows:
an unbalanced graph node neural network classification method based on graph data enhancement comprises the following steps:
step 1: inputting collected basic data to construct a form of graph data, and obtaining a first graph data vector;
step 2: inputting the first graph data vector into a convolution layer based on a spectrum domain to obtain an original node embedded vector in the first graph data vector;
step 3: embedding the original node into a vector input node synthesis model, generating a new synthesis node set, and calculating a new synthesis node characteristic vector;
step 4: combining the new synthesized node set with the original node set to obtain a data enhancement node set, inputting the data enhancement node set into a link generation model to obtain a link feature vector of the data enhancement node set; obtaining a second graph data vector according to the data enhancement node set feature vector and the data enhancement node set link feature vector;
step 5: inputting the second graph data vector into a graph neural network classifier to obtain a classification result, calculating a classification loss according to the classification result, and optimizing the graph neural network classifier according to the classification loss to obtain the graph neural network classifier of the data classification model, wherein the graph neural network classifier comprises at least one layer of graph neural network and one layer of multi-classification model; data to be classified is then input into the graph neural network classifier to obtain a prediction for the data to be classified.
as an improvement of the above scheme, the step 2 specifically includes:
step 2.1: inputting first graph data into a first graph convolution layer of the spectrum domain-based convolution layer to obtain a first graph convolution vector;
step 2.2: inputting the first graph convolution vector into an activation function to obtain an activation vector;
step 2.3: and inputting the activation vector into a second graph convolution layer to obtain an original node embedded vector of the first graph data.
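Steps 2.1–2.3 describe a standard two-layer spectral-domain graph convolution. A minimal sketch of that forward pass is below; the symmetric normalization, the toy graph, and all function and variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def gcn_embed(A, X, W1, W2):
    """Two-layer graph-convolution forward pass: Z = Ahat @ relu(Ahat @ X @ W1) @ W2,
    where Ahat is the symmetrically normalized adjacency matrix with self-loops."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # spectral-domain normalization
    H = np.maximum(A_hat @ X @ W1, 0.0)        # first graph convolution layer + ReLU activation
    return A_hat @ H @ W2                      # second graph convolution layer -> node embeddings

# toy graph: three nodes in a path, 2-dim features, 2-dim embeddings
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)[:, :2]
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 2))
Z = gcn_embed(A, X, W1, W2)
```

Each row of `Z` is one original node's embedded vector, ready to be fed to the node synthesis model of step 3.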
As an improvement of the above scheme, the step 3 specifically includes:
step 3.1: inputting the original nodes of the first graph data into a sampled probability calculation layer to obtain the sampled probability of each original node in the first graph data;
step 3.2: inputting the original node embedded vectors and the sampled probability of each original node into a sampling layer to obtain a sampled node set;
step 3.3: and inputting the sampled nodes in the sampled node set into the synthesis layer to obtain a new synthesized node set.
As an improvement of the above solution, the specific process of obtaining the sampled probability of each original node in the step 3.1 is:
acquiring the category of each node in the first graph data;
obtaining a first node set and a second node set according to the category of each node in the first graph data and the number of nodes of each category in the first graph data, and obtaining the sampling probability of each node in the first node set according to a first preset rule;
obtaining sampling probability of each node in the second node set according to a second preset rule according to the number of nodes in the first graph data and the number of nodes of each category in the second node set;
and obtaining the sampled probability of each original node in the first graph data according to the sampled probability of each node in the first node set and the sampled probability of each node in the second node set.
As an improvement of the above scheme, the step 3.2 specifically includes:
obtaining a pre-sampling node set according to the original nodes and the sampled probability of each original node;
and obtaining a sampled node set according to the link number of each pre-sampling node among the pre-sampling nodes, wherein the sampled node set comprises a preset number of sampled nodes.
As an improvement of the above scheme, the step 3.3 specifically includes:
executing a first preset operation on each sampled node in the sampled node set to obtain a new synthesized node set, wherein the new synthesized node set comprises one new node set per sampled node, and the first preset operation is as follows: acquiring a neighbor node set of the sampled node;
and obtaining a new node set corresponding to the sampled node according to the distance between the sampled node and each node in the neighbor node set.
As an improvement of the above scheme, the step 4 specifically includes:
executing preset operation on each sampled node to obtain a link feature vector, wherein the link feature vector comprises a first link vector, a second link vector and a third link vector, and the preset operation specifically comprises:
adding links to each node in the new synthesized node set corresponding to the sampled node and the sampled node to obtain the first link vector;
adding links to each node in a neighbor node set of the sampled node and each node in a new synthesized node set corresponding to the sampled node to obtain a second link vector;
and in response to the existence of the intersection between the neighbor node set of the sampled node and the sampled node set, adding a link to each node in the new synthesized node set corresponding to the sampled node and each node in the new node set corresponding to each sampled node in the intersection to obtain the third link vector.
As an improvement of the above scheme, the step 5 specifically includes:
step 5.1: inputting the second graph data vector into at least one layer of graph neural network to obtain a second node embedded vector of the second graph data vector;
step 5.2: and inputting the second node embedded vector into the multi-classification model to obtain the classification result.
An electronic device comprising a memory and a processor, wherein: the memory is used for storing a computer program; and the processor is used for executing the computer program to implement the data classification method described above.
A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a data classification method as described above.
The invention has the beneficial effects that:
1. Compared with traditional class-imbalanced learning methods, the method is based on a graph neural network structure and is better suited to graph data with a non-Euclidean structure. Traditional class-imbalanced learning methods ignore the abundant structural information in graph data, whereas the graph-data-enhanced graph neural network of the invention attaches great importance to structural information: the class-imbalanced node classification graph neural network model uses a graph reconstruction component so that the node embeddings learned by the model carry the structural information of the graph data, and the link generation model in the graph-data-enhanced graph neural network model establishes appropriate links for the synthetic nodes, thereby promoting the learning of the classifier.
2. The soft link strategy guides the model by optimizing the KL divergence between the multivariate Gaussian distributions obeyed by the edge embeddings of the synthesized nodes and of the original nodes, so that the graph neural model learns correct node embeddings and generates appropriate synthesized node embeddings while keeping structural information similar to that of the original graph data, thereby solving the problem of adjacency matrix dimension change caused by differing node counts.
Drawings
Fig. 1 is a flowchart of a method for classifying an unbalanced graph node neural network based on graph data enhancement according to the present invention.
Fig. 2 is a technical framework diagram of a classification method of an unbalanced graph node neural network based on graph data enhancement.
Fig. 3 is a schematic diagram of a node synthesis model related to a classification method of an unbalanced graph node neural network based on graph data enhancement.
Fig. 4 is a schematic diagram of a link generation model of an unbalanced graph node neural network classification method based on graph data enhancement according to the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings and examples, but the embodiments of the present invention are not limited to the following examples; the relevant essentials in the art should be regarded as known and understood by those skilled in the art.
An unbalanced graph node neural network classification method based on graph data enhancement comprises the following steps:
step 1: inputting collected basic data to construct a form of graph data, and obtaining a first graph data vector;
step 2: inputting the first graph data vector into a convolution layer based on a spectrum domain to obtain an original node embedded vector in the first graph data vector;
step 3: embedding the original node into a vector input node synthesis model, generating a new synthesis node set, and calculating a new synthesis node characteristic vector;
step 4: combining the new synthesized node set with the original node set to obtain a data enhancement node set, inputting the data enhancement node set into a link generation model to obtain a link feature vector of the data enhancement node set; obtaining a second graph data vector according to the data enhancement node set feature vector and the data enhancement node set link feature vector;
step 5: inputting the second graph data vector into a graph neural network classifier to obtain a classification result, calculating a classification loss according to the classification result, and optimizing the graph neural network classifier according to the classification loss to obtain the graph neural network classifier of the data classification model, wherein the graph neural network classifier comprises at least one layer of graph neural network and one layer of multi-classification model; data to be classified is then input into the graph neural network classifier to obtain a prediction for the data to be classified.
further, in the above scheme, the step 2 specifically includes:
step 2.1: inputting first graph data into a first graph convolution layer of the spectrum domain-based convolution layer to obtain a first graph convolution vector;
step 2.2: inputting the first graph convolution vector into an activation function to obtain an activation vector;
step 2.3: and inputting the activation vector into a second graph convolution layer to obtain an original node embedded vector of the first graph data.
Further, in the above scheme, the step 3 specifically includes:
step 3.1: inputting the original nodes of the first graph data into a sampled probability calculation layer to obtain the sampled probability of each original node in the first graph data;
step 3.2: inputting the original node embedded vectors and the sampled probability of each original node into a sampling layer to obtain a sampled node set;
step 3.3: and inputting the sampled nodes in the sampled node set into the synthesis layer to obtain a new synthesized node set.
Further, in the above scheme, the specific process of obtaining the sampled probability of each original node in the step 3.1 is:
acquiring the category of each node in the first graph data;
obtaining a first node set and a second node set according to the category of each node in the first graph data and the number of nodes of each category in the first graph data, and obtaining the sampling probability of each node in the first node set according to a first preset rule;
the nodes in the first graph data comprise a plurality of categories, the first node set is composed of the categories with the largest number of nodes, the second node set is composed of other categories except the categories with the largest number of nodes, when a certain node belongs to the category with the largest number of nodes, the categories are classified into the first node set, and when a certain node belongs to the category except the category with the largest number of nodes, the categories are classified into the second node set.
The first preset rule is as follows: when a certain node belongs to the category with the largest number of nodes (namely belongs to the first node set), the sampling probability of the node is 0;
the second preset rule is as follows: when a node does not belong to the category with the largest number of nodes (namely, belongs to the second node set), the sampling probability of the node is: by usingRepresenting the number of labeled nodes, C Δ Representing the class of node, |C Δ The I represents C Δ Node number of class, ++>And |C Δ The ratio of the values is the sampling probability of the node.
Obtaining sampling probability of each node in the second node set according to a second preset rule according to the number of nodes in the first graph data and the number of nodes of each category in the second node set;
and obtaining the sampled probability of each original node in the first graph data according to the sampled probability of each node in the first node set and the sampled probability of each node in the second node set.
Further, in the above scheme, the step 3.2 specifically includes:
obtaining a pre-sampling node set according to the original nodes and the sampled probability of each original node;
obtaining a sampled node set according to the link number of each pre-sampling node among the pre-sampling nodes, wherein the sampled node set comprises a preset number of sampled nodes;
specifically, the sampled node is according to |E i Magnitude order of |selects |E i Gamma with the greatest i 1 Individual node, |E i I indicates the number of links of a node, γ 1 Is a parameter used to control the number of sampled nodes.
Further, in the above scheme, the step 3.3 specifically includes:
executing a first preset operation on each sampled node in the sampled node set to obtain a new synthesized node set, wherein the new synthesized node set comprises one new node set per sampled node, and the first preset operation is as follows: acquiring a neighbor node set of the sampled node;
obtaining a new node set corresponding to the sampled node according to the distance between the sampled node and each node in the neighbor node set;
the method comprises the following steps: finding a set of sampled nodesEach node v of (a) i Neighbor node set of the same class +.>The new synthetic node set->Each node v of (a) j ,v j Feature embedding vector o j By node v i Feature embedding vector o i And node v i Neighbor node v k Feature embedding vector o k Is obtained by solving the tie by the distance accumulation of (2).
Further, in the above scheme, the step 4 specifically includes:
executing preset operation on each sampled node to obtain a link feature vector, wherein the link feature vector comprises a first link vector, a second link vector and a third link vector, and the preset operation specifically comprises:
adding links to each node in the new synthesized node set corresponding to the sampled node and the sampled node to obtain the first link vector;
adding links to each node in a neighbor node set of the sampled node and each node in a new synthesized node set corresponding to the sampled node to obtain a second link vector;
and in response to the existence of the intersection between the neighbor node set of the sampled node and the sampled node set, adding a link to each node in the new synthesized node set corresponding to the sampled node and each node in the new node set corresponding to each sampled node in the intersection to obtain the third link vector.
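The three link vectors of step 4 reduce to plain set logic over node identifiers. A hypothetical sketch follows; the dict-based inputs, the helper name, and the toy graph are assumptions for illustration.

```python
def generate_links(sampled_nodes, neighbors, synth):
    """sampled_nodes: list of sampled node ids.
    neighbors[v]: set of neighbor ids of sampled node v.
    synth[v]: list of synthesized node ids derived from v.
    Returns the three groups of links described in step 4."""
    first, second, third = [], [], []
    sampled_set = set(sampled_nodes)
    for v in sampled_nodes:
        for s in synth[v]:
            first.append((v, s))                 # sampled node <-> its synthesized nodes
            for u in neighbors[v]:
                second.append((u, s))            # neighbors of v <-> synthesized nodes of v
        for u in neighbors[v] & sampled_set:     # neighbors that were themselves sampled
            for s in synth[v]:
                for t in synth[u]:
                    third.append((s, t))         # synthesized <-> synthesized links
    return first, second, third

# toy example: nodes 0 and 1 were sampled; 10 and 11 are their synthetic nodes
neighbors = {0: {1, 2}, 1: {0}}
synth = {0: [10], 1: [11]}
first, second, third = generate_links([0, 1], neighbors, synth)
```

The third group is only non-empty when a sampled node's neighborhood intersects the sampled set, matching the "in response to the existence of the intersection" condition above.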
Further, in the above scheme, the step 5 specifically includes:
step 5.1: inputting the second graph data vector into at least one layer of graph neural network to obtain a second node embedded vector of the second graph data vector;
step 5.2: and inputting the second node embedded vector into the multi-classification model to obtain the classification result.
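Steps 5.1–5.2 end with a one-layer multi-classification model applied to the second node embedded vectors. A minimal softmax sketch is below; the layer shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def classify(Z, W, b):
    """One-layer multi-class model on top of node embeddings: softmax(Z @ W + b)."""
    logits = Z @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    probs = e / e.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)                      # class probabilities and predictions

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))                 # second node embedded vectors (5 nodes, 8 dims)
W, b = rng.normal(size=(8, 3)), np.zeros(3) # 3-class multi-classification model
probs, preds = classify(Z, W, b)
```

The classification loss of step 5 would then be the cross-entropy between `probs` and the node labels.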
An electronic device comprising a memory and a processor, wherein: the memory is used for storing a computer program; the processor is used for executing the computer program.
A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a data classification method as described above.
In a specific embodiment of the invention:
a graph dataset with unbalanced node class distribution may be represented asWherein, the liquid crystal display device comprises a liquid crystal display device, is a +.>A set of individual nodes, and->Is a set of m= |e|edges and can be represented as an n x n adjacency matrix a if the node pair (v i ,v j ) Is present, then A ij =1, otherwise a ij =0, note that when i e [1, n],A ii Is set to 0, X is used to represent the characteristics of the original node, and furthermore,/is set to 0>There are multiple category labels, denoted as c= { C 1 ,C 2 ,...,C K },/>Divided into k= |c| groups, the nodes of each group (i.e. each class) have the same label +|>The class label information of the middle node is denoted Y.
In the node synthesis model, the feature vector of a new node is calculated by sampling nodes and picking out the set of neighbor nodes of the same category as the selected node. The node synthesis model samples nodes according to a sampled probability; the sampling process is divided into two steps. First a batch of nodes is selected, where the sampled probability $\rho_i$ of node $v_i$ is calculated as:

$\rho_i = \begin{cases} 0, & Y_i \in C_{max} \\ n_L / |C_{\Delta}|, & \text{otherwise} \end{cases}$
where $C_{max}$ represents the category with the largest number of nodes, $Y_i \in C_{max}$ indicates that node $v_i$ belongs to that category, $n_L$ represents the number of labeled nodes, $C_{\Delta}$ represents the category of node $v_i$, and $|C_{\Delta}|$ represents the number of nodes of class $C_{\Delta}$. After calculating $\rho_i$, the values are scaled so that the sampled probabilities of all nodes sum to 1, and a batch of nodes is sampled according to $\rho_i$. From this batch, the $\gamma_1$ nodes with the largest $|E_i|$ are selected in order of magnitude of $|E_i|$, where $|E_i|$ represents the number of edges of node $v_i$ and $\gamma_1$ is a parameter for controlling the number of sampled nodes. The sampling process thus considers both the node categories and the edge counts of the nodes, so that minority-class nodes with abundant structural information are easier to sample. The sampled node set is denoted $\mathcal{V}_s$.
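The sampling procedure — zero probability for majority-class nodes, probability proportional to the labeled-node count over the class size otherwise, normalization to sum to 1, then keeping the $\gamma_1$ highest-degree nodes — can be sketched as follows. This is a reconstruction of the garbled formula in the text; the function names and toy labels are assumptions.

```python
import numpy as np

def sampled_probabilities(labels, majority_class, n_labeled):
    """rho_i = 0 for majority-class nodes, else n_L / |C_delta|, then normalized to sum to 1."""
    labels = np.asarray(labels)
    counts = {c: int((labels == c).sum()) for c in set(labels.tolist())}
    rho = np.array([0.0 if y == majority_class else n_labeled / counts[y] for y in labels])
    return rho / rho.sum()

def select_sampled(batch, degrees, gamma1):
    """Keep the gamma1 nodes of the batch with the largest link count |E_i|."""
    return sorted(batch, key=lambda v: degrees[v], reverse=True)[:gamma1]

labels = [0, 0, 0, 0, 1, 1, 2]          # class 0 is the majority class
rho = sampled_probabilities(labels, majority_class=0, n_labeled=len(labels))
degrees = {4: 3, 5: 1, 6: 2}            # |E_i| for the sampled batch
chosen = select_sampled([4, 5, 6], degrees, gamma1=2)
```

Note how the single node of class 2 receives twice the probability of each class-1 node, so rarer classes are sampled more readily.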
Based on the sampled node set $\mathcal{V}_s$, a set of new nodes $\mathcal{V}_{syn}$ will be generated. For each node $v_i$ in $\mathcal{V}_s$, $\gamma_2$ new synthesized nodes based on $v_i$ are generated according to the parameter $\gamma_2$, so the node synthesis model generates the synthesized node set $\mathcal{V}_{syn}$. At the same time, to account for the retention of structural information and its inclusion in the node features, the same-class neighbor node set $\mathcal{N}_i$ of each node $v_i$ in $\mathcal{V}_s$ must first be found, defined as follows:

$\mathcal{N}_i = \{\, v_u \mid A_{iu} = 1 \wedge Y_u = Y_i \,\}$
where $A_{iu}$ is the value of the component in row $i$, column $u$ of the adjacency matrix $A$, and $Y_u$ is the category label of node $v_u$;
the meaning of Λ is "and", i.e. the formulas on both sides of Λ are to be established simultaneously.
For each node $v_j$ of the synthesized node set $\mathcal{V}_{syn}$, the feature embedding vector $o_j$ of $v_j$ is calculated as follows:

$o_j = o_i + \frac{1}{|\mathcal{N}_i|} \sum_{v_k \in \mathcal{N}_i} \delta_k \,(o_k - o_i)$

where $|\mathcal{N}_i|$ represents the number of nodes in the same-class neighbor set $\mathcal{N}_i$, and $\delta_k$ is a random number with a value in $(0, 1)$. Relative to traditional interpolation, which generates a new node directly from two nodes of the same class, the process of generating a new node here comprehensively considers the features of all neighbor nodes of the same class as the node, so the generated node is similar to node $v_i$. This way of incorporating neighbor nodes supplements structural information in the feature embedding of the new node; finally, $\gamma_1 \times \gamma_2$ synthesized nodes are generated.
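The interpolation that produces each synthesized embedding can be sketched in a few lines. This is a reconstruction of the averaging over same-class neighbors described above; the function name and toy vectors are assumptions.

```python
import numpy as np

def synthesize(o_i, neighbor_embeds, rng):
    """Synthesize one node embedding from o_i and its same-class neighbors:
    o_j = o_i + mean_k( delta_k * (o_k - o_i) ), with delta_k uniform in (0, 1)."""
    deltas = rng.uniform(0.0, 1.0, size=len(neighbor_embeds))
    offsets = [d * (o_k - o_i) for d, o_k in zip(deltas, neighbor_embeds)]
    return o_i + np.mean(offsets, axis=0)

rng = np.random.default_rng(1)
o_i = np.array([0.0, 0.0])                                  # sampled node embedding
neighbors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # same-class neighbor embeddings
o_j = synthesize(o_i, neighbors, rng)
```

Because every `delta_k` lies in (0, 1), the synthesized point stays inside the convex region spanned by the node and its neighbors, so it plausibly belongs to the same minority class.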
As for the link generation model: frequent links often exist between graph data samples, so after the synthesized node set $\mathcal{V}_{syn}$ has been generated, links for the synthetic nodes must next be generated to enrich the information of the synthetic nodes.
Specifically, after the processing of the node synthesis model module, the synthesized node set $\mathcal{V}_{syn}$ can be obtained. Incorporating these synthetic nodes into the node set $\mathcal{V}$ yields the data-sample-augmented node set $\mathcal{V}'$. First, the graph structure over $\mathcal{V}'$ is reconstructed with an inner-product decoder:

$\hat{A}_{ij} = \sigma(o_i^{T} \cdot o_j)$
wherein Â is the reconstructed adjacency matrix of the graph; o_i^T is the transpose of o_i; o_i^T · o_j denotes the dot product of the two vectors; and o_i and o_j are the embedding vectors of nodes v_i and v_j, respectively. The closer the value of Â_ij is to 1, the more likely a link exists between nodes v_i and v_j; conversely, the smaller the value, the less likely a link exists. Note that because the synthetic node set has been added, the dimension of the reconstructed adjacency matrix Â is larger than that of the original adjacency matrix A.
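A minimal sketch of the inner-product decoder described above. A sigmoid activation mapping each dot product into (0, 1) is assumed, since the patent states only that values close to 1 indicate a likely link:

```python
import numpy as np

def reconstruct_adjacency(O):
    """Inner-product decoder: A_hat[i, j] = sigmoid(o_i . o_j).
    With the synthetic node embeddings appended as extra rows of O,
    A_hat has a larger dimension than the original adjacency matrix A."""
    logits = O @ O.T
    return 1.0 / (1.0 + np.exp(-logits))
```

Node pairs with similar embeddings (large dot product) receive scores near 1, so the synthetic nodes inherit links toward the region of the embedding space they were interpolated from.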
In the soft connection strategy, a link distribution alignment loss is proposed so that the model predicts the links of the synthetic nodes more accurately. First, using a fully connected layer, the feature vector of a link is obtained by mapping the embedding vectors of the two nodes that have a link relation; this process is expressed as:
wherein ⊕ denotes the concatenation operation, which splices the embedding vector o_u and the embedding vector o_v; FC denotes a fully connected layer, which uses the mapping weight W_e to map the spliced vector to the embedding vector of the link between node v_u and node v_v.
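The mapping just described can be sketched as a single fully connected layer over the concatenated node embeddings. The bias term is omitted and the shape of W_e (d × 2m, where m is the node-embedding dimension) is an assumption:

```python
import numpy as np

def link_embedding(o_u, o_v, W_e):
    """Concatenate the two node embeddings (length 2m) and map them with
    the weight W_e (shape d x 2m) to a d-dimensional link embedding."""
    return W_e @ np.concatenate([o_u, o_v])
```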
The set of link embeddings of the original node set is expressed as:
wherein each link embedding is a d-dimensional vector, and e_ij denotes a link between nodes in the original node set. The set of link embeddings of the synthetic node set is expressed as:
wherein the links e_km include links between synthetic nodes and original nodes as well as links between synthetic nodes. Let the embeddings of e_ij and e_km both obey d-dimensional multivariate Gaussian distributions, i.e., e_ij ~ N(μ_ij, Σ_ij) and e_km ~ N(μ_km, Σ_km); their probability density functions are expressed as follows:
wherein μ_ij, Σ_ij and μ_km, Σ_km are the respective mean vectors and covariance matrices. These parameters can be approximately estimated from the samples of e_ij and e_km as:
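The approximation of the Gaussian parameters from link-embedding samples amounts to taking the empirical mean and covariance; a sketch, in which the sample-covariance normalization (dividing by n − 1) is an assumption:

```python
import numpy as np

def fit_gaussian(E):
    """Estimate the mean vector and covariance matrix of a set of
    d-dimensional link embeddings E (one sample per row)."""
    mu = E.mean(axis=0)
    Sigma = np.cov(E, rowvar=False)  # columns are the d variables
    return mu, Sigma
```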
Equations (1) and (2) can be expressed as the product of d independent Gaussian distributions:
Finally, a KL-divergence-based optimization reduces the difference between the two distributions:
where tr (A) is the trace of matrix A and its value is equal to the sum of the elements on A's main diagonal.
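The standard closed form of the KL divergence between two d-variate Gaussians, which contains the trace term just described, can be sketched as follows. The direction of the divergence used in the patent's loss is not reproduced here and is assumed:

```python
import numpy as np

def gaussian_kl(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) ) in closed form:
    0.5 * [ tr(Sigma1^-1 Sigma0) + (mu1 - mu0)^T Sigma1^-1 (mu1 - mu0)
            - d + ln(det Sigma1 / det Sigma0) ]."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ Sigma0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma0)))
```

The divergence is zero when the two distributions coincide, so minimizing it pulls the synthetic-link distribution toward the original-link distribution.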
After the embeddings of the synthetic nodes are obtained, the synthetic nodes and the original nodes form a new node set, and the links of the synthetic nodes are determined through graph reconstruction. This solves the problem of the adjacency matrix changing dimension as the number of nodes changes, and the soft link strategy guides the model by optimizing the KL divergence between the multivariate Gaussian distributions obeyed by the edge embeddings of the synthetic nodes and of the original nodes.
Finally, the enhanced graph data node embeddings are obtained by merging the node embeddings of the original graph data with the synthetic node embeddings. The labeled node set and the synthetic node set are combined into an augmented labeled node set, and the label set, node set, adjacency matrix, and links are updated accordingly. The enhanced graph data now has a labeled node set in which the numbers of samples of the different classes are balanced, and an unbiased GNN classifier can be trained on it. Specifically, the invention employs a two-layer GCN as the classifier and uses a softmax layer for node classification. Note that the graph neural network classifier is replaceable; the invention selects GCN for computational convenience, and if computational resources are sufficient, another suitable graph neural network model may be chosen. The classifier is defined as:
wherein H represents the node embeddings learned by the two GCN layers, W_3 and W_4 are weight matrices, and P is the class probability distribution matrix of the nodes. The classifier is trained by optimizing the classification loss, defined as:
wherein the category label information of the labeled nodes is expressed as a one-hot matrix, and K is the number of categories. The node set referred to here is constructed by randomly sampling nodes from the entire graph data node set.
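The two-layer GCN classifier and its cross-entropy loss can be sketched as follows. Â is assumed to be the (normalized) reconstructed adjacency matrix, following the standard GCN propagation rule; all names are illustrative:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def gcn_classify(A_hat, X, W3, W4):
    """P = softmax( A_hat . ReLU(A_hat . X . W3) . W4 ): two GCN layers
    followed by a softmax layer, as in the classifier definition above."""
    H = np.maximum(A_hat @ X @ W3, 0.0)
    return softmax(A_hat @ H @ W4)

def cross_entropy(P, Y):
    """Classification loss: mean negative log-likelihood of the one-hot
    labels Y under the predicted class probabilities P."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
```

Because the class counts in the augmented labeled set are balanced, this plain cross-entropy loss no longer biases the classifier toward the majority classes.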
The invention is implemented on the PyTorch platform, optimized with the ADAM optimizer, and trained on an NVIDIA GeForce GTX 2080Ti GPU. The mapping dimensions ξ and τ of the first and second GCN convolution layers are 500 and 100, respectively; the mapping dimensions of the classifier layers are 50 and |C|, where |C| denotes the number of categories of the graph data. The initial learning rate is 0.002, the L2-regularization weight decay coefficient is 0.03, and dropout is 0.03. Furthermore, the parameters γ_1 and γ_2 are set to 500 and 2 on the Cora dataset, 500 and 2 on the CiteSeer dataset, and 800 and 2 on the PubMed dataset. The maximum number of training rounds is 1000, the early-stopping threshold is 200, and the loss-function adjustment parameter λ is set to 0.1.
In another embodiment of the present invention, the data processing method based on graph data enhancement may be applied to the field of legal data classification to solve the problem of uneven case type distribution.
Step 1: the collected basic data is constructed into the form of graph data to obtain the first graph data.
For example, for judgment documents of a certain type of crime, the case elements and the judgment result of each document are constructed in the form of graph data.
Step 2: and inputting the first graph data into a convolution layer based on a spectrum domain to obtain a node embedded vector of the first graph data.
Step 3: the node embedding vector is input into a node synthesis model to obtain a synthesized node vector, and a synthesized node feature vector is calculated;
step 4: inputting the synthesized node vector into a link generation model to obtain a link feature vector; obtaining a second graph data vector according to the synthesized node feature vector and the link feature vector;
Step 5: inputting the second graph data vector into a graph-neural-network-based classification model to obtain a classification result, calculating the classification loss according to the classification result, and optimizing the graph-neural-network-based classification model according to the classification loss to obtain a data classification model, wherein the graph-neural-network-based classification model comprises at least one layer of graph neural network and one layer of multi-classification model;
step 6: inputting the data to be classified into the data classification model to obtain the prediction classification of the data to be classified.
For example: and under an application scene, inputting the case data to be classified into the data classification model to obtain the prediction classification of the case to be classified.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention; such modifications are also intended to fall within the scope of the present invention.
Claims (10)
1. A classification method of unbalanced graph node neural network based on graph data enhancement is characterized in that,
step 1: inputting collected basic data to construct a form of graph data, and obtaining a first graph data vector;
step 2: inputting the first graph data vector into a convolution layer based on a spectrum domain to obtain an original node embedded vector in the first graph data vector;
step 3: embedding the original node into a vector input node synthesis model, generating a new synthesis node set, and calculating a new synthesis node characteristic vector;
step 4: combining the new synthesized node set with the original node set to obtain a data enhancement node set, inputting the data enhancement node set into a link generation model to obtain a link feature vector of the data enhancement node set; obtaining a second graph data vector according to the data enhancement node set feature vector and the data enhancement node set link feature vector;
step 5: inputting the second graph data vector into a graph neural network classifier to obtain a classification result, calculating a classification loss according to the classification result, and optimizing the graph neural network classifier according to the classification loss to obtain the graph neural network classifier of a data classification model, wherein the graph neural network classifier comprises at least one layer of graph neural network and one layer of multi-classification model; and inputting data to be classified into the graph neural network classifier to obtain the prediction for the data to be classified.
2. The method for classifying the unbalanced graph node neural network based on the graph data enhancement according to claim 1, wherein the step 2 is specifically:
step 2.1: inputting first graph data into a first graph convolution layer of the spectrum domain-based convolution layer to obtain a first graph convolution vector;
step 2.2: inputting the first graph convolution vector into an activation function to obtain an activation vector;
step 2.3: and inputting the activation vector into a second graph convolution layer to obtain an original node embedded vector of the first graph data.
3. The method for classifying the unbalanced graph node neural network based on the graph data enhancement according to claim 1, wherein the step 3 is specifically:
step 3.1: inputting the original nodes of the first graph data into a sampled probability calculation layer to obtain the sampled probability of each original node in the first graph data;
step 3.2: inputting the original node embedded vector and the sampled probability of each original node into a sampling layer to obtain a sampled node set;
step 3.3: and inputting the sampled nodes in the sampled node set into the synthesis layer to obtain a new synthesized node set.
4. The method for classifying the unbalanced graph node neural network based on the graph data enhancement according to claim 3, wherein the specific process of obtaining the sampled probability of each original node in step 3.1 is as follows:
acquiring the category of each node in the first graph data;
obtaining a first node set and a second node set according to the category of each node in the first graph data and the number of nodes of each category in the first graph data, and obtaining the sampling probability of each node in the first node set according to a first preset rule;
obtaining sampling probability of each node in the second node set according to a second preset rule according to the number of nodes in the first graph data and the number of nodes of each category in the second node set;
and obtaining the sampled probability of each original node in the first graph data according to the sampled probability of each node in the first node set and the sampled probability of each node in the second node set.
5. The method for classifying an unbalanced graph node neural network based on graph data enhancement according to claim 3, wherein the step 3.2 specifically comprises:
obtaining a pre-sampling node set according to the original nodes and the sampled probability of each original node;
and obtaining a sampled node set according to the link number of each pre-sampling node in the pre-sampling nodes, wherein the sampled node set comprises a plurality of sampled nodes with preset numbers.
6. The method for classifying an unbalanced graph node neural network based on graph data enhancement according to claim 3, wherein the step 3.3 specifically comprises:
executing a first preset operation on each sampled node in the sampled node set to obtain a new synthesized node set, wherein the new synthesized node set comprises a plurality of new node sets consistent with the number of the sampled nodes, and the first preset operation is as follows: acquiring a neighbor node set of a sampled node;
and obtaining a new node set corresponding to the sampled node according to the distance between the sampled node and each node in the neighbor node set.
7. The method for classifying an unbalanced graph node neural network based on graph data enhancement according to claim 1, wherein the step 4 is specifically:
executing preset operation on each sampled node to obtain a link feature vector, wherein the link feature vector comprises a first link vector, a second link vector and a third link vector, and the preset operation specifically comprises:
adding links to each node in the new synthesized node set corresponding to the sampled node and the sampled node to obtain the first link vector;
adding links to each node in a neighbor node set of the sampled node and each node in a new synthesized node set corresponding to the sampled node to obtain a second link vector;
and in response to the existence of the intersection between the neighbor node set of the sampled node and the sampled node set, adding a link to each node in the new synthesized node set corresponding to the sampled node and each node in the new node set corresponding to each sampled node in the intersection to obtain the third link vector.
8. The method for classifying the unbalanced graph node neural network based on the graph data enhancement according to claim 1, wherein the step 5 is specifically:
step 5.1: inputting the second graph data vector into at least one layer of graph neural network to obtain a second node embedded vector of the second graph data vector;
step 5.2: and inputting the second node embedded vector into the multi-classification model to obtain the classification result.
9. An electronic device comprising a memory and a processor, wherein: the memory is used for storing a computer program; the processor for executing the computer program to implement the data classification method according to any one of claims 1 to 8.
10. A computer-readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the data classification method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310683638.9A CN116756391A (en) | 2023-06-09 | 2023-06-09 | Unbalanced graph node neural network classification method based on graph data enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116756391A true CN116756391A (en) | 2023-09-15 |
Family
ID=87958225
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310683638.9A Pending CN116756391A (en) | 2023-06-09 | 2023-06-09 | Unbalanced graph node neural network classification method based on graph data enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116756391A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116936108A (en) * | 2023-09-19 | 2023-10-24 | 之江实验室 | Unbalanced data-oriented disease prediction system |
CN116936108B (en) * | 2023-09-19 | 2024-01-02 | 之江实验室 | Unbalanced data-oriented disease prediction system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||