CN115631057A

CN115631057A - Social user classification method and system based on graph neural network

Info

Publication number: CN115631057A
Application number: CN202211307295.8A
Authority: CN
Inventors: 张维玉; 郭新超
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2023-01-20

Abstract

The invention discloses a social user classification method, system, electronic device and computer-readable storage medium based on a graph neural network, and belongs to the technical field of social user classification. The method comprises: obtaining node representations from input original graph-structured data constructed from social user data, and performing an oversampling operation based on the node representations; generating synthetic nodes for the minority-class nodes in the data; obtaining adjacency information for the synthetic nodes; assigning pseudo labels to the synthetic nodes; and combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph, on which classification is performed. The method addresses the imbalanced classification problem and improves the accuracy of social user classification, overcoming the low accuracy and high computational cost that class imbalance causes in prior-art social user classification.

Description

Social user classification method and system based on graph neural network

Technical Field

The present application relates to the field of social user classification technologies, and in particular, to a social user classification method and system based on a graph neural network.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

In recent years, with the development of Graph Neural Networks (GNNs), graph representation learning has improved greatly and is widely applied to classification tasks, but existing work still focuses mainly on learning from class-balanced data.

Node classification is an important research topic in graph representation learning, and GNNs have achieved state-of-the-art node classification performance. However, existing GNNs assume that the samples of different classes are balanced, whereas in many real scenarios some classes have far fewer instances than others. In this case, a GNN classifier trained directly on the data under-represents samples from those minority classes, which results in sub-optimal performance.

However, in the real world, the number of samples in different classes may be unbalanced, i.e., some classes may have far more samples than others. In fake-user detection, for example, most users of video websites and social networking sites are real users and only a small fraction are bot users (fake users), so the imbalance is particularly pronounced and gives rise to an imbalanced classification problem.

Semi-supervised classification trains a classifier using a small amount of labeled data within a large amount of data in order to complete the final classification task. Because only limited labeled data is available when classifying social users, the semi-supervised setting further aggravates the problem: the minority classes, which already have few samples, have even fewer labeled samples.

In the field of machine learning, the imbalanced classification problem has been widely studied, and existing approaches fall into three categories: data-level methods, algorithm-level methods and hybrid methods. Data-level methods make the class distribution more balanced through oversampling or undersampling: oversampling balances the data set by adding minority-class samples, and undersampling balances it by removing majority-class samples. Undersampling may make classification more efficient, but because it discards useful information in the majority classes, it can distort the decision boundary and lead to a poor classifier. In contrast, oversampling preserves more information by copying existing samples or synthesizing new ones. Copying (also known as random oversampling) randomly duplicates some minority-class samples, so it usually produces a smaller minority-class decision region, which may result in overfitting. Algorithm-level methods typically introduce different misclassification penalties or prior probabilities for different classes. Hybrid methods combine the two. However, applying these methods directly to graphs may give sub-optimal results: relationships are the key information to be mined in graph-structured data, and the under-representation of minority-class samples affects not only the embedding quality but also the exchange of information between adjacent nodes. Previous algorithms fail to solve this problem because they assume that the samples are independent of one another.
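The copying strategy described above can be illustrated with a minimal sketch. The function name `random_oversample` and the toy "real"/"fake" data are hypothetical and serve only to show that duplication adds no new points to the minority region.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))  # exact copy, no new information
            out_labels.append(cls)
    return out_samples, out_labels

# Toy data (hypothetical): 6 "real" users and 2 "fake" users.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.1]]
y = ["real"] * 6 + ["fake"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Every added sample is an exact copy of an existing minority sample, which is why this strategy tends to narrow the minority decision region; synthesis by interpolation instead creates new points between existing ones.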

While existing methods have demonstrated their success in imbalanced data learning, the following problems remain:

(1) In an oversampling strategy, synthesis provides a wider decision region than duplication, but incurs a heavy computational cost;

(2) Hybrid strategies rebalance the data set with a wide decision region, but ensemble learning strategies require a large computational cost in training, especially when combined with oversampling strategies;

(3) In real scenarios such as social user classification, some classes may have far fewer instances than others. In this case, a GNN classifier trained directly on the data under-represents samples from those minority classes, which results in sub-optimal performance.

Disclosure of Invention

To address the deficiencies of the prior art, the present application provides a graph neural network based social user classification method, system, electronic device and computer-readable storage medium that generate synthetic minority-class nodes by interpolation in an expressive embedding space obtained by a GNN-based feature extractor, and predict the links of the synthetic nodes using an edge generator, thereby balancing the minority-class nodes against the other nodes and facilitating node classification by the GNN.

In a first aspect, the application provides a social user classification method based on a graph neural network;

a social user classification method based on a graph neural network comprises the following steps:

obtaining node representations from input original graph-structured data constructed from social user data, and performing an oversampling operation based on the node representations;

generating synthetic nodes for the minority-class nodes in the data;

obtaining adjacency information for the synthetic nodes;

assigning pseudo labels to the synthetic nodes;

and combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph and performing classification.

Further, the specific steps of performing the oversampling operation to generate the synthetic nodes based on the node representations are:

obtaining, for the minority-class nodes in the data, the corresponding node representations through feature extraction;

and generating synthetic nodes according to the attribute information and the topology information of the minority-class nodes.

Further, the specific steps of assigning pseudo labels to the synthetic nodes are:

obtaining the degree of influence of the neighborhood label information on the predicted label according to the weight matrix and the neighborhood label information;

and obtaining the predicted label according to the original label and the degree of influence of the neighborhood label information on the predicted label.

Further, the specific steps of combining the synthetic nodes, the adjacency information and the real nodes to construct the node-balanced graph include:

concatenating the embeddings of the real nodes with the embeddings of the synthetic nodes to obtain an augmented node representation set;

and adding the synthetic nodes to the labeled node set to obtain an augmented labeled set.

Further, before classification, the network is trained through an objective function to obtain a neural network classification model.

Further, the objective function is:

where η_node is the cross-entropy loss function, η_edge is the loss function for training the edge generator, η_p is the objective function of adaptive label propagation, λ is a hyperparameter, and θ, φ,

are the parameters of the feature extractor, edge generator and node classifier, respectively.

Furthermore, two layers of GraphSage are adopted as the backbone model structure.

In a second aspect, the present application provides a social user classification system based on a graph neural network;

a graph neural network-based social user classification system, comprising:

a feature extractor, a node generator, an edge generator, a label propagator and a GNN classifier;

the feature extractor is used for obtaining original graph-structured data constructed from social user data and obtaining node representations from the original graph-structured data;

the node generator is used for generating, based on the node representations, synthetic nodes for the minority-class nodes in the data;

the edge generator is used for obtaining adjacency information for the synthetic nodes;

the label propagator is used for assigning pseudo labels to the synthetic nodes;

the GNN classifier is used for combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph and performing classification.

In a third aspect, the present application provides an electronic device;

an electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for social user classification based on a graph neural network.

In a fourth aspect, the present application provides a computer-readable storage medium;

a computer readable storage medium for storing computer instructions, which when executed by a processor, perform the steps of the above-mentioned method for social user classification based on graph neural network.

Compared with the prior art, the beneficial effects of the present application are:

1. the method extends existing imbalanced learning techniques for independent and identically distributed data to the imbalanced node classification task, adopts the most stable and effective synthetic minority oversampling algorithm (SMOTE), provides relationship information for the newly synthesized samples, performs classification on class-balanced data, and thereby improves classification accuracy;

2. to handle the minority-class nodes, the method uses the GNN feature extractor to generate embeddings, uses the node generator to generate synthetic minority-class nodes in the latent space, then uses the edge generator to add connections to the new nodes to obtain a class-balanced augmented graph, and finally classifies the nodes with the GNN classifier.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

Fig. 1 is a schematic flowchart of a social user classification method based on a graph neural network according to an embodiment of the present application;

Fig. 2 is a schematic diagram of the framework provided in an embodiment of the present application.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise, and furthermore, it should be understood that the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

In the prior art, in social user classification, the fake-user class is far smaller than the real-user class. When prediction and classification are performed with a graph neural network, existing methods for the imbalanced classification problem have a high computational cost, and direct training under-represents samples from the minority classes, which leads to sub-optimal performance and reduced classification accuracy; therefore, the present application provides a social user classification method based on a graph neural network.

Next, the social user classification method based on a graph neural network disclosed in this embodiment is described in detail with reference to Figs. 1 and 2.

The embodiment provides a social user classification method based on a graph neural network, which comprises the following steps:

S1, obtaining node representations from the input original graph-structured data constructed from social user data, and performing an oversampling operation based on the node representations; the specific steps are as follows:

extracting, by a feature extractor, node representations from the input original graph-structured data constructed from social user data; in this embodiment, a first-layer GraphSage block is used to compute the node representations:

where F represents the input node attribute matrix, F[v,:] represents the attributes of node v, A[:,v] represents the v-th column of the adjacency matrix, the output is the embedding of node v, W¹ is a weight parameter, and the activation function is, for example, ReLU.
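As an illustration of the first-layer representation, the following is a minimal sketch, not the formula of the application itself. It assumes one common GraphSage form: the embedding of node v is ReLU applied to W¹ times the concatenation of F[v,:] with the sum of the attributes of v's neighbors, selected by column v of A. The sum aggregator, the function name `graphsage_layer` and the hand-picked weights are assumptions.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def graphsage_layer(F, A, W):
    """F: n x d attribute rows; A: n x n adjacency; W: k x 2d weights.
    Returns n embeddings of length k (assumed sum aggregator)."""
    n, d = len(F), len(F[0])
    H = []
    for v in range(n):
        # Aggregate the attributes of v's neighbours using column v of A.
        agg = [sum(A[u][v] * F[u][j] for u in range(n)) for j in range(d)]
        z = F[v] + agg  # list concatenation: own attributes, then the aggregate
        H.append([relu(sum(w * x for w, x in zip(row, z))) for row in W])
    return H

# Toy graph: path 0 - 1 - 2 with 2-dimensional attributes.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W1 = [[1.0, 0.0, 0.0, 1.0],    # hypothetical weights, chosen by hand
      [0.0, -1.0, 1.0, 0.0]]
H1 = graphsage_layer(F, A, W1)
print(H1)
```

Each row of `H1` depends on both the node's own attributes and those of its neighbors, which is the property the later interpolation step relies on.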

S2, after the representation of each node has been obtained in the embedding space constructed by the feature extractor, generating synthetic nodes for the minority-class nodes in the data; in this embodiment, the minority-class nodes are labeled nodes in the data. Specifically, the SMOTE algorithm is adopted, which improves on plain oversampling by replacing duplication with interpolation; its basic idea is to interpolate, in the embedding space, between samples of the target minority class and their nearest neighbors of the same class. The specific process is as follows:

Let v be a labeled minority-class node with label Y_v. First, find the labeled node of the same class that is closest to v in the embedding space, i.e.,

where nn(v) refers to the nearest neighbor of v in the same class, measured in the embedding space using the Euclidean metric.

Using the nearest neighbor, the synthetic node is generated as:

where δ is a random variable that follows a uniform distribution over the range [0, 1].

Because v and nn(v) belong to the same class and are very close to each other, the resulting synthetic node should also belong to the same class; in this way, a labeled synthetic node is obtained.
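The nearest-neighbor search and interpolation described above can be sketched as follows. This is a minimal sketch under the standard SMOTE assumption that the synthetic embedding is (1 − δ) times the embedding of v plus δ times the embedding of nn(v); the function names and the toy embeddings are hypothetical.

```python
import math
import random

def nearest_same_class(H, labels, v):
    """Index of the labeled node closest to v (Euclidean) that shares v's label."""
    candidates = [u for u in range(len(H)) if u != v and labels[u] == labels[v]]
    return min(candidates, key=lambda u: math.dist(H[u], H[v]))

def smote_node(H, labels, v, rng):
    """Interpolate between v and its same-class nearest neighbour."""
    u = nearest_same_class(H, labels, v)
    delta = rng.random()  # uniform draw; Python samples from [0, 1)
    h_new = [(1 - delta) * a + delta * b for a, b in zip(H[v], H[u])]
    return h_new, labels[v]  # the synthetic node inherits v's label

# Toy embeddings (hypothetical); class 1 is the minority class.
H = [[0.0, 0.0], [1.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.5, 5.0]]
labels = [0, 0, 0, 1, 1]
rng = random.Random(0)
h_syn, y_syn = smote_node(H, labels, 3, rng)
```

Because the new point lies on the segment between two nodes of the same class, assigning it that class label is a reasonable default.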

S3, obtaining adjacency information for the synthetic nodes; synthetic nodes have now been generated to balance the class distribution, but these nodes are not linked to the original graph G and are therefore isolated from it. First, the edge generator is trained on the real nodes and the existing edges; then the adjacency information of the synthetic nodes is predicted by the edge generator, and the generated synthetic nodes and edges are added to the initial adjacency matrix. In this way, the node representations are used to reconstruct the adjacency matrix, which provides good link prediction for the synthetic nodes.

To keep the model simple and make analysis easier, the edge generator is implemented using a weighted inner product, as follows:

where E_{v,u} represents the predicted relation information between nodes v and u, and S represents a parameter matrix that captures the interaction between nodes.

The edge generator is trained with a loss function, which is:

where E represents the predicted connections between the nodes in V.
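A weighted inner product edge generator can be sketched as below. The exact formulas are not reproduced in this text, so the sketch makes three assumptions: the score is a sigmoid of the bilinear form of the two embeddings with S, the training loss is the squared reconstruction error against the existing adjacency matrix, and a fixed threshold of 0.5 decides which predicted edges are kept. All function names and parameter values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_score(h_v, h_u, S):
    """Weighted inner product h_v^T S h_u squashed to (0, 1)."""
    Sh_u = [sum(s * x for s, x in zip(row, h_u)) for row in S]
    return sigmoid(sum(a * b for a, b in zip(h_v, Sh_u)))

def edge_loss(H, A, S):
    """Squared reconstruction error between predicted and existing edges."""
    n = len(H)
    return sum((edge_score(H[v], H[u], S) - A[v][u]) ** 2
               for v in range(n) for u in range(n))

def predict_edges(h_new, H, S, threshold=0.5):
    """Binary adjacency row linking a synthetic node to the existing nodes."""
    return [1 if edge_score(h_new, h, S) > threshold else 0 for h in H]

H = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]   # toy adjacency with self-loops
S = [[4.0, 0.0], [0.0, 4.0]]           # hypothetical, hand-picked parameters
row = predict_edges([0.95, 0.05], H, S)
```

In training, S would be fitted by minimizing `edge_loss` on the real nodes; here it is fixed by hand only to show that a synthetic node close to nodes 0 and 1 is linked to them and not to node 2.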

S4, assigning pseudo labels to the synthetic nodes; specifically, the goal of label propagation is to find a prediction matrix that is consistent with the label matrix Y_L. The specific formula is as follows:

where Y^(0) = Y, K represents the number of power iteration steps, the result is the predicted label matrix, and T denotes the transition matrix, which can be set to the normalized adjacency matrix. After the labels have been propagated K times, the predicted labels incorporate the neighborhood label information within a K-hop distance. To this end, an adaptive label propagation algorithm is designed, whose formula can be expressed as:

where γ_ik represents the degree of influence of the k-hop neighborhood information on the predicted label; γ_ik can be expressed as:

where the attention vector and the weight matrix W are the parameters of the operator, and ReLU is the activation function. The adaptive label propagation operator treats the attention vector and the weight matrix as learnable parameters and adjusts the propagation strategy for each node, so that the resulting smoothed labels capture the rich structural information in the input graph.
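The propagation and the per-node weighting can be sketched as follows. The formulas are not reproduced in this text, so the sketch assumes that each step multiplies the current labels by T, that γ_ik is a softmax over the hops k of the attention vector applied to ReLU(W times the k-hop label of node i), and that the predicted label is the γ-weighted sum of the hop-wise labels. The names `propagate`, `adaptive_labels`, `a` and the parameter values are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_normalize(A):
    """Transition matrix T: each row of A divided by its row sum."""
    return [[a / sum(row) for a in row] for row in A]

def propagate(T, Y, K):
    """Return [Y^(0), ..., Y^(K)] under the assumed rule Y^(k+1) = T Y^(k)."""
    steps = [Y]
    for _ in range(K):
        steps.append(matmul(T, steps[-1]))
    return steps

def adaptive_labels(steps, W, a):
    """Mix the hop-wise labels of each node with attention weights gamma_ik."""
    n, K1, C = len(steps[0]), len(steps), len(steps[0][0])
    out = []
    for i in range(n):
        scores = []
        for k in range(K1):
            hidden = [max(0.0, sum(w * y for w, y in zip(row, steps[k][i]))) for row in W]
            scores.append(sum(x * h for x, h in zip(a, hidden)))
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        gamma = [e / sum(exps) for e in exps]  # softmax over hops k
        out.append([sum(gamma[k] * steps[k][i][c] for k in range(K1)) for c in range(C)])
    return out

A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]      # path graph with self-loops
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]   # node 1 is unlabeled
steps = propagate(row_normalize(A), Y, K=2)
Y_hat = adaptive_labels(steps, W=[[1.0, 0.0], [0.0, 1.0]], a=[1.0, 1.0])
```

In this toy graph the unlabeled middle node receives equal evidence for both classes, while the end nodes keep a preference for their own label.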

The objective function of adaptive label propagation is as follows:

where the prediction for node v_i is compared with its original label y_i, and l(·) is the cross-entropy loss.
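One form consistent with the description above is sketched here. The formula itself is not reproduced in this text, so the averaging over the labeled set V_L, the symbol \hat{y}_i for the prediction and the class index c are assumptions; η_p follows the naming used for the objective elsewhere in this document.

```latex
\eta_p = \frac{1}{|V_L|} \sum_{v_i \in V_L} l\left(\hat{y}_i, y_i\right),
\qquad
l\left(\hat{y}_i, y_i\right) = -\sum_{c} y_{i,c} \log \hat{y}_{i,c}
```

Minimizing this quantity with respect to the attention vector and the weight matrix pushes the propagated labels of labeled nodes toward their original labels.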

S5, combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph and performing classification.

The augmented node representation set is obtained by concatenating H¹ (the embeddings of the real nodes) with the embeddings of the synthetic nodes, and the augmented labeled set is obtained by adding the synthetic nodes to V_L; an augmented graph with a labeled node set is thereby obtained.
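The construction of the augmented graph can be sketched as plain bookkeeping on lists. This is a minimal sketch; the function name `augment_graph`, its argument layout and the choice to add no edges among synthetic nodes are assumptions.

```python
def augment_graph(H, A, labels, labeled_idx, syn_H, syn_labels, syn_edges):
    """Append synthetic nodes to the embeddings, adjacency, labels and labeled set.

    syn_edges[j][u] is 1 if synthetic node j is linked to real node u.
    """
    n, m = len(H), len(syn_H)
    H_aug = H + syn_H  # concatenate real and synthetic embeddings
    # Extend each real row with its links to the synthetic nodes.
    A_aug = [list(A[v]) + [syn_edges[j][v] for j in range(m)] for v in range(n)]
    for j in range(m):
        # Synthetic-to-real links, then no links among synthetic nodes (assumed).
        A_aug.append(list(syn_edges[j]) + [0] * m)
    labels_aug = labels + syn_labels
    labeled_aug = labeled_idx + [n + j for j in range(m)]  # synthetic nodes are labeled
    return H_aug, A_aug, labels_aug, labeled_aug

H = [[0.0], [1.0], [5.0]]
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
labels = [0, 0, 1]
labeled_idx = [0, 2]
H_aug, A_aug, labels_aug, labeled_aug = augment_graph(
    H, A, labels, labeled_idx, syn_H=[[5.2]], syn_labels=[1], syn_edges=[[0, 0, 1]])
```

After this step the minority class has two labeled nodes instead of one, and the synthetic node is connected to the graph rather than isolated.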

Specifically, a second GraphSage block is used, followed by a linear layer, to classify the nodes of the augmented graph, as follows:

where H² represents the node representation matrix of the second GraphSage block, and W represents a weight parameter. P_v is the probability distribution over class labels for node v. The classifier module is optimized using the cross-entropy loss:

During testing, the predicted class of node v, Y_v', is set to the most probable class.
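The final classification stage can be sketched as follows, taking the second-block representations as given. This is a minimal sketch that assumes a plain linear layer followed by a softmax, a mean cross-entropy over the labeled nodes, and an argmax at test time; the function names and the toy values are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def classify(H2, W):
    """Linear layer followed by softmax: one probability row P_v per node."""
    return [softmax([sum(w * h for w, h in zip(row, h_v)) for row in W]) for h_v in H2]

def cross_entropy(P, labels, labeled_idx):
    """Mean negative log-likelihood over the labeled nodes."""
    return -sum(math.log(P[v][labels[v]]) for v in labeled_idx) / len(labeled_idx)

def predict(P):
    """Predicted class of each node: the most probable class."""
    return [max(range(len(p)), key=p.__getitem__) for p in P]

H2 = [[2.0, 0.0], [0.0, 3.0], [1.5, 0.5]]   # toy second-block representations
W = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical weights: class c reads feature c
P = classify(H2, W)
loss = cross_entropy(P, labels=[0, 1, 0], labeled_idx=[0, 1])
print(predict(P))
```

The loss is computed only on labeled nodes, which after augmentation include the synthetic minority nodes, while `predict` applies to every node in the graph.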

Further, before social user classification, the network is trained with an objective function to obtain a graph neural network classification model; the optimal model is obtained when the objective function is minimized. The specific training steps are the same as the steps of the method above, and the objective function is as follows:

Example two

The embodiment discloses a social user classification system based on a graph neural network, which comprises a feature extractor, a node generator, an edge generator, a label propagator and a GNN classifier;

the feature extractor is used for obtaining original graph-structured data constructed from social user data and obtaining node representations from the original graph-structured data; the feature extractor can be implemented using any type of GNN. In this embodiment, GraphSage is chosen as the backbone model structure because it effectively learns various local topologies and generalizes well to new structures. The message passing and aggregation process is as follows:

The node generator is used for generating, based on the node representations, synthetic nodes for the minority-class nodes in the data; specifically, after the representation of each node has been obtained in the embedding space constructed by the feature extractor, synthetic nodes are generated for the minority-class nodes in the data. The SMOTE algorithm is adopted, which improves on plain oversampling by replacing duplication with interpolation; its basic idea is to interpolate, in the embedding space, between samples of the target minority class and their nearest neighbors of the same class. The specific process is as follows:

Let v be a labeled minority-class node with label Y_v. First, find the labeled node of the same class that is closest to v in the embedding space, i.e.,

Using the nearest neighbor, the synthetic node is generated as:

Because v and its nearest neighbor belong to the same class and are very close to each other, the resulting synthetic node should also belong to the same class; in this way, a labeled synthetic node is obtained.

The edge generator is used for obtaining adjacency information for the synthetic nodes; specifically, the generator is trained on the real nodes and the existing edges and is used to predict the neighbor information of the synthetic nodes. The new nodes and edges are added to the initial adjacency matrix and serve as input to the GNN-based classifier.

where E_{v,u} represents the predicted relation information between nodes v and u, and S represents a parameter matrix that captures the interaction between nodes.

The edge generator is trained with a loss function, which is:

where E represents the predicted connections between the nodes in V.

The label propagator is used for assigning pseudo labels to the synthetic nodes; the goal of label propagation is to find a prediction matrix that is consistent with the label matrix Y_L. The specific formula is as follows:

where Y^(0) = Y and K represents the number of power iteration steps,

where the attention vector and the weight matrix W are the parameters of the operator, and ReLU is the activation function. The adaptive label propagation operator treats the attention vector and the weight matrix as learnable parameters and adjusts the propagation strategy for each node, so that the resulting smoothed labels capture the rich structural information in the input graph.

The GNN classifier is used for combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph and performing classification; a second GraphSage block is used, followed by a linear layer, to classify the nodes of the augmented graph, as follows:

where H² represents the node representation matrix of the second GraphSage block and W represents a weight parameter. P_v is the probability distribution over class labels for node v. The classifier module is optimized using the cross-entropy loss:

It should be noted that the above feature extractor, node generator, edge generator, label propagator and GNN classifier correspond to the steps in Example one; the modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Example one. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system, for example as a set of computer-executable instructions.

Example three

The third embodiment of the invention provides electronic equipment, which comprises a memory, a processor and computer instructions stored on the memory and run on the processor, wherein when the computer instructions are run by the processor, the steps of the social user classification method based on the graph neural network are completed.

Example four

The fourth embodiment of the present invention provides a computer-readable storage medium, configured to store computer instructions, where the computer instructions, when executed by a processor, perform the steps of the social user classification method based on a graph neural network.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In the foregoing embodiments, the description of each embodiment has an emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions in other embodiments.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A social user classification method based on a graph neural network is characterized by comprising the following steps:

generating synthetic nodes for the minority-class nodes in the data;

assigning pseudo labels to the synthetic nodes;

and combining the synthetic nodes, the adjacency information and the real nodes to construct a node-balanced graph for classification.

2. The social user classification method based on a graph neural network as claimed in claim 1, wherein the specific steps of performing the oversampling operation to generate the synthetic nodes based on the node representations comprise:

obtaining, for the minority-class nodes in the data, the corresponding node representations through feature extraction;

3. The social user classification method based on a graph neural network as claimed in claim 1, wherein the specific steps of assigning pseudo labels to the synthetic nodes comprise:

obtaining the degree of influence of the neighborhood label information on the predicted label according to the weight matrix and the neighborhood label information;

4. The social user classification method based on a graph neural network as claimed in claim 1, wherein the specific steps of combining the synthetic nodes, the adjacency information and the real nodes to construct the node-balanced graph comprise:

and adding the synthetic nodes to the labeled node set to obtain an augmented labeled set.

5. The method of claim 1, wherein prior to the classification, the network is trained using an objective function to obtain a graph neural network classification model.

6. The method of claim 5, wherein the objective function is:

where η_node is the cross-entropy loss function, η_edge is the loss function for training the edge generator, η_p is the objective function of adaptive label propagation, λ is a hyperparameter, and θ, φ,

respectively, parameters of the feature extractor, the edge generator, and the node classifier.

7. The method for classifying social users based on a graph neural network as claimed in claim 5, wherein two layers of GraphSage are used as a backbone model structure.

8. A social user classification system based on a graph neural network is characterized by comprising a feature extractor, a node generator, an edge generator, a label propagator and a GNN classifier;

9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the steps of any of claims 1-7.

10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of any one of claims 1 to 7.