CN116304311A

CN116304311A - Online social network spam comment user detection method

Info

Publication number: CN116304311A
Application number: CN202310148077.2A
Authority: CN
Inventors: 杨泽; 戴维迪; 邵明来; 李天鹏
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2023-06-23

Abstract

The invention relates to a method for detecting spam comment users in online social networks, comprising the following steps. Step one, graph construction and preprocessing: a graph structure is established with online social platform users as nodes and the interaction relationships among users as edges, and an adjacency matrix is constructed; part of the data is manually labeled, giving the numbers and labels of the labeled nodes, where 1 represents a spam comment sender and 0 represents a normal user; and a confidence vector is established. Step two, graph neural network construction: the graph neural network comprises two layers, and the output dimension of the last layer is 2; the 1st dimension represents the confidence with which the network judges the node to be a spam comment sender, and the 2nd dimension represents the confidence with which it judges the node to be a normal user; the network obtains a node's features by aggregating the features of its neighbors, takes the class information of the neighbors into account when extracting node features, and applies different feature aggregation strategies to neighbors of different classes. Step three, iterative optimization.

Description

Online social network spam comment user detection method

Technical Field

The invention belongs to the field of data mining and relates to an anomaly detection method based on a graph neural network. The method models an online social network as a graph, treats spam comment users as abnormal nodes in that network, uses a graph neural network that fuses edge information to extract the features of the users, and feeds those features into a classifier for semi-supervised anomaly detection.

Background

With the development of the internet, more and more online platforms such as Weibo, Dianping, and Douban have emerged. As their user bases grow, meaningless and malicious comments on these platforms are increasing. In addition, some unscrupulous merchants hire dedicated fake-review accounts to post positive reviews under their products in order to inflate ratings. These malicious users seriously undermine the credibility of online platforms. Manual inspection alone consumes a great deal of manpower, so the demand for automated detection of spam comment users is growing.

By modeling users as nodes and the various interactions between users as edges, an online social network can be built for an online social platform, so that graph-based algorithms can be used to detect spam comment senders. Because of the complexity of graph data, the camouflage of spam comment senders, and other factors, detecting them still faces many challenges.

Graph neural networks are widely applied to graph learning tasks because of their excellent performance in graph feature extraction. FdGars [1] is a method that detects spam comment senders using a graph neural network. However, to avoid being detected, spam comment senders may engage in camouflage, such as establishing normal interactions with a large number of normal users, or disguising their user attributes and posted comments to resemble those of normal users. The structure of the graph neural network therefore needs to be optimized for detecting spam comment senders in the presence of camouflage.

[1] Wang J, Wen R, Wu C, et al. FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System[C]. In Companion of The 2019 World Wide Web Conference, 2019: 310–316.

Disclosure of Invention

The main purpose of the invention is to provide a method for detecting spam comment users in an online social network, so that spam comment senders in the social network are detected more accurately. The technical solution is as follows:

a method for detecting users of online social network spam comments comprises the following steps:

step one, graph construction and preprocessing

(1) Establishing a graph structure by taking online social platform users as nodes and the interaction relationship among the users as edges, and constructing an adjacency matrix;

(2) Digitizing the user attributes, constructing an attribute matrix, wherein each row of the attribute matrix represents the attribute of the corresponding user;

(3) Manually labeling part of the data, giving the numbers and labels of the labeled nodes, where 1 represents a spam comment sender and 0 represents a normal user, and dividing the labeled nodes into a training set and a test set;

(4) Establishing a confidence vector $B \in \{0,1\}^{N}$, where N is the number of nodes; a value of 0 at position i indicates that node i is more likely to be a normal user, and a value of 1 at position i indicates that node i is more likely to be a spam comment sender; the confidence vector is initialized so that the positions of training-set nodes labeled 0 are set to 0 in B, the positions of training-set nodes labeled 1 are set to 1 in B, and all remaining positions are set to 0;

step two, constructing a graph neural network

The graph neural network comprises two layers, and the output dimension of the last layer is 2; the 1st dimension represents the confidence with which the network judges the node to be a spam comment sender, and the 2nd dimension represents the confidence with which it judges the node to be a normal user; the network obtains a node's features by aggregating the features of its neighbors, takes the class information of the neighbors into account when extracting node features, and applies different feature aggregation strategies to neighbors of different classes;

each layer of the graph neural network comprises the following process:

(1) Using a fully connected layer to reduce the dimension of the feature $h_u$ of user u, obtaining the reduced user feature $z_u$:

$z_u = W_t h_u$

where $W_t \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix of the fully connected layer, $d_{in}$ is the input dimension of the layer, and $d_{out}$ is the output dimension of the layer;

(2) Taking node v as the central node, for each neighbor u of node v under relation r, calculating the importance coefficient of u from its relationship with the central node, using a trainable weight vector;

(3) Judging, according to the confidence vector B, whether each neighbor of node v belongs to the same class as node v; the same-class (homogeneous) neighbors of node v under relation r are put into one set, and the different-class (heterogeneous) neighbors are put into another set;

(4) Normalizing the importance coefficients of the two types of neighbors separately to obtain the attention scores used for aggregation; for a neighbor u of node v under relation r, if node u is of the same class as node v, its attention score is obtained by normalizing its importance coefficient over the set of homogeneous neighbors of node v, using the natural exponential function exp and a nonlinear activation function σ; similarly, if the neighbor is heterogeneous with respect to node v, its attention score is obtained by the same normalization over the set of heterogeneous neighbors of node v;

(5) According to the attention scores obtained in the previous step, calculating separately the embedding of the homogeneous neighbors and the embedding of the heterogeneous neighbors of the central node v under relation r;

(6) For each node v, obtaining a k-nearest-neighbor graph formed by its k nearest neighbor nodes, based on the Euclidean distance between its feature $z_v$ and the features of the other nodes;

(7) Performing an aggregation operation over the k-nearest-neighbor graph to obtain the k-nearest-neighbor embedding $h_{knn,v}$ of each node v, where K is the number of neighbors selected by the k-nearest-neighbor search, the aggregation uses a weight matrix, and it runs over the k-nearest-neighbor set of node v;

(8) For each node v, fusing its homogeneous-neighbor embedding, its heterogeneous-neighbor embedding, and its k-nearest-neighbor embedding $h_{knn,v}$ to obtain the comprehensive embedding of node v under relation r;

(9) Introducing a multi-head attention mechanism: steps (1) to (8) are repeated H times, and the H results are concatenated to obtain the multi-head attention feature of node v under relation r;

(10) Using concatenation and a linear transformation to integrate the features of node v under the multiple relations into $h'_v$;
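For orientation only, one common way to write steps (1) to (5) in symbols is sketched below. The symbols $e^{r}_{uv}$, $a_r$, $\alpha^{r}_{uv}$, $\mathcal{N}^{r}_{s}(v)$ (homogeneous neighbors), $\mathcal{N}^{r}_{d}(v)$ (heterogeneous neighbors), $h^{r}_{s,v}$, and $h^{r}_{d,v}$ are assumed notation, and the concatenation inside the importance coefficient follows standard graph attention networks rather than anything confirmed by this text.

```latex
\begin{aligned}
z_u &= W_t h_u, \qquad W_t \in \mathbb{R}^{d_{out} \times d_{in}} \\
e^{r}_{uv} &= a_r^{\top} \left[ z_v \,\Vert\, z_u \right] \\
\alpha^{r}_{uv} &= \frac{\exp\left(\sigma\left(e^{r}_{uv}\right)\right)}
                       {\sum_{k \in \mathcal{N}^{r}_{s}(v)} \exp\left(\sigma\left(e^{r}_{kv}\right)\right)},
                  \qquad u \in \mathcal{N}^{r}_{s}(v) \\
h^{r}_{s,v} &= \sum_{u \in \mathcal{N}^{r}_{s}(v)} \alpha^{r}_{uv}\, z_u, \qquad
h^{r}_{d,v} = \sum_{u \in \mathcal{N}^{r}_{d}(v)} \alpha^{r}_{uv}\, z_u
\end{aligned}
```

For a heterogeneous neighbor, the same normalization would run over $\mathcal{N}^{r}_{d}(v)$ instead of $\mathcal{N}^{r}_{s}(v)$, so the two groups never compete for attention mass.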

After two layers of the graph neural network are stacked, the output $h_{v,out}$ of the last layer is obtained, which is a two-dimensional vector; the 1st dimension represents the confidence that the node is a spam comment sender, and the 2nd dimension represents the confidence that the node is a normal user; applying the softmax operation to $h_{v,out}$ gives the probabilities that the node is an abnormal node and a normal node; when the 2nd-dimension value of $h_{v,out}$ is larger than the 1st-dimension value, the node is judged to be a normal node; when the 1st-dimension value is larger than the 2nd-dimension value, the node is judged to be an abnormal node;

step three, iterative optimization

(1) Inputting the whole graph into the graph neural network to obtain the output $h_{out} \in \mathbb{R}^{N \times 2}$, which is the vertical concatenation of all $h_{v,out}$;

(2) Undersampling the training labels to obtain the set of nodes participating in the loss calculation, so that the numbers of normal and abnormal nodes participating in the loss calculation are similar, which avoids the influence of the imbalance between labels 0 and 1;

(3) Calculating the cross-entropy loss over this node set, where $y_v$ denotes the label of node v;

(4) Updating the confidence vector B according to the model output $h_{out}$: positions in B whose rows in $h_{out}$ have a 1st-dimension value greater than the 2nd-dimension value are set to 1, and the rest are set to 0;

(5) Executing a gradient descent algorithm according to the loss;

(6) Stopping training when the loss converges;
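As a hedged illustration of the loss in step three, a standard binary cross-entropy over the undersampled node set could take the form below. The symbols $\mathcal{L}$, $\mathcal{V}_{s}$ (the sampled node set), and $p_{v}$ are assumed notation; the exact formula used by the invention is not confirmed by this text.

```latex
\begin{aligned}
p_v &= \operatorname{softmax}\left(h_{v,out}\right) = \left(p_{v,1},\, p_{v,2}\right) \\
\mathcal{L} &= -\sum_{v \in \mathcal{V}_{s}} \Big[\, y_v \log p_{v,1} + \left(1 - y_v\right) \log p_{v,2} \,\Big]
\end{aligned}
```

Here $p_{v,1}$ is the probability assigned to the spam comment sender class (1st dimension) and $p_{v,2}$ the probability assigned to the normal user class (2nd dimension), matching the label convention in which $y_v = 1$ marks a spam comment sender.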

step four, outputting the unlabeled user category

(1) Obtaining the model output $h_{out}$ and taking out the rows corresponding to the unlabeled nodes;

(2) If the 1st-dimension value of the row corresponding to node i in $h_{out}$ is larger than the 2nd-dimension value, node i is a spam comment sender; otherwise, it is a normal user.

The invention first models users as nodes and the interaction relationships among users as edges to build a graph structure, while a small number of spam comment senders are manually labeled. A graph neural network is then constructed, consisting mainly of three parts: neighborhood feature extraction, global feature extraction, and feature fusion. The network finally outputs a two-dimensional vector, whose first dimension can be regarded as the probability that the user is a spam comment sender and whose second dimension can be regarded as the probability that the user is a normal user. Iterative optimization is then carried out with a gradient descent algorithm: in each iteration, the loss of the network is calculated from the label information using cross entropy, and the parameters of the network are updated by gradient descent according to the loss. Finally, after the loss converges, the output of the network is taken as the detection result. The invention has the following characteristics: little manually labeled information is required, and the ability to detect disguised spam comment senders is high.
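The global feature extraction part relies on a k-nearest-neighbor graph built from Euclidean distances between reduced node features. The following is a minimal sketch of that construction in plain Python; the function name `knn_graph` and the toy feature values are hypothetical, and a brute-force search is used for clarity rather than efficiency.

```python
import math


def knn_graph(features, k):
    """For each node, return the indices of its k nearest other nodes (Euclidean distance)."""
    neighbors = []
    for v, z_v in enumerate(features):
        # Distance from node v to every other node; ties are broken by node index.
        dists = [(math.dist(z_v, z_u), u) for u, z_u in enumerate(features) if u != v]
        dists.sort()
        neighbors.append([u for _, u in dists[:k]])
    return neighbors


# Hypothetical 2-D reduced features z_v for four users.
Z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
print(knn_graph(Z, 1))  # [[1], [0], [3], [2]]
```

Because the neighbors are chosen by feature similarity rather than by interaction edges, a disguised account that connects mostly to normal users can still be linked to accounts with similar features.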

Drawings

FIG. 1 is a flow chart of the steps performed.

Detailed Description

Users interact in various ways on an online social platform. For example, on Dianping, users can interact by commenting on each other, and also by commenting on the same item. The invention therefore mainly addresses the detection of spam comment senders on a multi-relation graph. A multi-relation graph consists of a node set, a set of node attributes, and, for each relation $r \in \{1, 2, \dots, R\}$, an edge set in which an edge between node $v_i$ and node $v_j$ indicates that the two nodes are connected under relation r. The specific steps of the invention are as follows:

1) Graph construction and preprocessing

First, a graph structure and an adjacency matrix are built with users as nodes and the interaction relationships among users as edges.

Second, the user attributes are digitized to construct an attribute matrix, in which each row represents the attributes of the corresponding user.

Third, 3% of the data is manually labeled, giving the numbers and labels of the labeled nodes; 1 represents a spam comment sender and 0 represents a normal user.

Fourth, the manually labeled nodes are divided into a training set and a test set at a ratio of 7:3.

Fifth, a confidence vector $B \in \{0,1\}^{N}$ is established, where N is the number of nodes; a value of 0 at position i indicates that node i is more likely to be a normal user, and a value of 1 at position i indicates that node i is more likely to be a spam comment sender. The confidence vector is initialized so that the positions of training-set nodes labeled 0 are set to 0 in B and the positions of training-set nodes labeled 1 are set to 1 in B. The remaining positions are all set to 0.
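As a minimal sketch of this preprocessing stage, the following Python builds an adjacency matrix from an edge list and initializes the confidence vector B from the training labels. The function names, the toy edge list, and the treatment of interactions as undirected are illustrative assumptions and are not taken from the patent.

```python
def build_adjacency(num_nodes, edges):
    """Return a dense 0/1 adjacency matrix; edges are treated as undirected (an assumption)."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def init_confidence(num_nodes, train_labels):
    """train_labels maps node index -> 0 (normal user) or 1 (spam comment sender).

    Labeled training nodes copy their label into B; every other position starts at 0.
    """
    b = [0] * num_nodes
    for node, label in train_labels.items():
        b[node] = label
    return b


# Toy data (hypothetical): 5 users, 4 interactions, 2 labeled training nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adjacency = build_adjacency(5, edges)
B = init_confidence(5, {1: 1, 3: 0})
print(B)  # [0, 1, 0, 0, 0]
```

For a multi-relation graph, one such adjacency matrix would be kept per relation r.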

2) Graph neural network construction

The graph neural network comprises two layers, and the output dimension of the last layer is 2. The 1st dimension represents the confidence with which the network judges the node to be a spam comment sender, and the 2nd dimension represents the confidence with which it judges the node to be a normal user.

Conventional spam comment sender detection often ignores the camouflage of spam comment senders. A graph neural network obtains the features of a node by aggregating the features of its neighbors. If a spam comment sender interacts with a large number of normal users, it may therefore end up with features similar to those of normal users after passing through the network. For this reason, the method takes the class information of the neighbors into account when extracting node features, and applies different feature aggregation strategies to neighbors of different classes.
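The class-aware treatment of neighbors can be illustrated with a short sketch that partitions a node's neighbors by comparing entries of the confidence vector B. The function name `split_neighbors` and the example values are hypothetical.

```python
def split_neighbors(v, neighbors, b):
    """Partition the neighbors of v into (same_class, different_class) using B."""
    same = [u for u in neighbors if b[u] == b[v]]
    diff = [u for u in neighbors if b[u] != b[v]]
    return same, diff


# Hypothetical example: node 0 is currently believed to be a spam comment sender.
B = [1, 0, 0, 1, 0]
same, diff = split_neighbors(0, [1, 2, 3, 4], B)
print(same, diff)  # [3] [1, 2, 4]
```

Keeping the two groups apart means that a large number of normal-user neighbors cannot drown out the few same-class neighbors of a disguised account when attention scores are normalized per group.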

Each layer of the graph neural network comprises the following procedure:

Algorithm 1: Graph neural network fusing edge types

First step, a fully connected layer is used to reduce the dimension of the feature $h_u$ of user u, obtaining the reduced user feature $z_u$. The specific formula is as follows:

$z_u = W_t h_u$

where $W_t \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix of the fully connected layer, $d_{in}$ is the input dimension of the layer, and $d_{out}$ is the output dimension of the layer.

Second step, node v is taken as the central node, and for each neighbor u of node v under relation r, the importance coefficient of u is calculated from its relationship with the central node, using a trainable weight vector.

Third step, whether each neighbor of node v belongs to the same class as node v is judged according to the confidence vector B. The same-class (homogeneous) neighbors of node v under relation r are put into one set, and the different-class (heterogeneous) neighbors are put into another set.

Fourth step, the importance coefficients of the two types of neighbors are normalized separately to obtain the attention scores used for aggregation. For a neighbor u of node v under relation r, if node u is of the same class as node v, its attention score is obtained by normalizing its importance coefficient over the set of homogeneous neighbors of node v, using the natural exponential function exp and an arbitrary nonlinear activation function σ. Similarly, if the neighbor is heterogeneous with respect to node v, its attention score is obtained by the same normalization over the set of heterogeneous neighbors of node v.

Fifth step, according to the attention scores calculated in the previous step, the embedding of the homogeneous neighbors and the embedding of the heterogeneous neighbors of the central node v under relation r are calculated separately.

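Sixth step, for each node v, a k-nearest-neighbor graph formed by its k nearest neighbor nodes is obtained, based on the Euclidean distance between its feature $z_v$ and the features of the other nodes.

Seventh step, an aggregation operation is performed over the k-nearest-neighbor graph to obtain the k-nearest-neighbor embedding $h_{knn,v}$ of each node v. Here K is the number of neighbors selected by the k-nearest-neighbor search and is generally set to 2; the aggregation uses a weight matrix and runs over the k-nearest-neighbor set of node v.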

Eighth step, for each node v, its homogeneous-neighbor embedding, heterogeneous-neighbor embedding, and k-nearest-neighbor embedding $h_{knn,v}$ are fused to obtain the comprehensive embedding of node v under relation r. The fusion uses a linear transformation matrix that integrates the embedding of the node itself, the homogeneous-neighbor embedding, the heterogeneous-neighbor embedding, and the k-nearest-neighbor embedding into a vector of dimension $d_{out}$; $\Vert$ denotes the concatenation operation.

Ninth step, a multi-head attention mechanism is introduced: the first to eighth steps are repeated H times, and the H results are concatenated to obtain the multi-head attention feature of node v under relation r. The recommended value of H is 4.

Tenth step, the features of node v under the multiple relations are integrated into $h'_v$ directly by concatenation followed by a linear transformation. The linear transformation uses a weight matrix that reduces the $R d_{out}$-dimensional concatenated node feature to $d_{out}$ dimensions.
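The per-layer procedure can be sketched for a single attention head and a single relation as follows. This is only an illustration under stated assumptions: the importance coefficient is taken to be a dot product between a weight vector `a` and the concatenation of the two reduced features, σ is taken to be LeakyReLU, the k-nearest-neighbor embeddings are passed in precomputed as `knn_embed`, and all function and parameter names are hypothetical. Multi-head repetition and multi-relation integration are omitted.

```python
import math


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_embed(v, group, z, a):
    """Softmax-normalize importance coefficients within one neighbor group, then aggregate.

    Assumed coefficient form: e_uv = a . [z_v || z_u]. Returns zeros for an empty group.
    """
    dim = len(z[v])
    if not group:
        return [0.0] * dim
    # z[v] + z[u] is list concatenation, i.e. [z_v || z_u].
    scores = [math.exp(leaky_relu(sum(ai * xi for ai, xi in zip(a, z[v] + z[u]))))
              for u in group]
    total = sum(scores)
    out = [0.0] * dim
    for s, u in zip(scores, group):
        for d in range(dim):
            out[d] += (s / total) * z[u][d]
    return out


def layer_single_relation(h, neighbors, b, w_t, a, w_fuse, knn_embed):
    """One head, one relation: project, split neighbors by B, attend per group, fuse."""
    z = [matvec(w_t, h_u) for h_u in h]
    fused = []
    for v in range(len(h)):
        same = [u for u in neighbors[v] if b[u] == b[v]]
        diff = [u for u in neighbors[v] if b[u] != b[v]]
        h_same = attention_embed(v, same, z, a)
        h_diff = attention_embed(v, diff, z, a)
        # Concatenate self, homogeneous, heterogeneous and k-NN embeddings, then project.
        fused.append(matvec(w_fuse, z[v] + h_same + h_diff + knn_embed[v]))
    return fused


# Toy inputs (hypothetical): 3 nodes, d_in = d_out = 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0], [0]]
b = [1, 0, 1]
w_t = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 0.0, 0.0]
w_fuse = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
knn_embed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
out = layer_single_relation(h, neighbors, b, w_t, a, w_fuse, knn_embed)
```

With the identity projection and a fusion matrix that selects only the node's own reduced feature, `out` reproduces `h`, which makes the data flow easy to check by hand before trained weights are substituted.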

The above is the operation flow of one layer of the graph neural network. After two such layers are stacked, the output $h_{v,out}$ of the last layer is a two-dimensional vector. The 1st dimension represents the confidence that the node is a spam comment sender, and the 2nd dimension represents the confidence that the node is a normal user. Applying the softmax operation to $h_{v,out}$ gives the probabilities that the node is an abnormal node and a normal node. When the 2nd-dimension value of $h_{v,out}$ is larger than the 1st-dimension value, the node is judged to be a normal node; when the 1st-dimension value is larger than the 2nd-dimension value, the node is judged to be an abnormal node.
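A minimal sketch of the softmax operation and the resulting decision is given below. It takes the 1st dimension as the spam comment sender confidence and the 2nd as the normal user confidence, as stated for the output vector; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_v_out):
    """Return 1 (spam comment sender) if the 1st dimension exceeds the 2nd, else 0."""
    p_spam, p_normal = softmax(h_v_out)
    return 1 if p_spam > p_normal else 0


print(classify([2.0, 0.5]))   # 1
print(classify([-1.0, 0.3]))  # 0
```

Since softmax preserves the ordering of its inputs, comparing the raw output values gives the same decision as comparing the probabilities.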

3) Iterative optimization

First step, the whole graph is input into the graph neural network to obtain the output $h_{out} \in \mathbb{R}^{N \times 2}$, which is the vertical concatenation of all $h_{v,out}$.

Second step, the training labels are undersampled to obtain the set of nodes participating in the loss calculation, so that the numbers of normal and abnormal nodes participating in the loss calculation are similar, which avoids the influence of the imbalance between labels 0 and 1.

Third step, the cross-entropy loss is calculated over this node set, where $y_v$ denotes the label of node v.

Fourth step, the confidence vector B is updated according to the model output $h_{out}$: positions in B whose rows in $h_{out}$ have a 1st-dimension value greater than the 2nd-dimension value are set to 1, and the rest are set to 0.

Fifth step, a gradient descent algorithm is executed according to the loss.

Sixth step, training is stopped when the loss converges.
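Two of the steps in this loop, label undersampling and the confidence vector update, can be sketched as follows. The text only requires that the two classes be of similar size; sampling the majority class down to exactly the minority size, the fixed random seed, and the function names are assumptions made for illustration.

```python
import random


def undersample(train_labels, seed=0):
    """Keep every minority-class node and an equal-sized random sample of the majority class."""
    pos = [v for v, y in train_labels.items() if y == 1]
    neg = [v for v, y in train_labels.items() if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    return sorted(minority + rng.sample(majority, len(minority)))


def update_confidence(h_out):
    """Set B[i] = 1 where the 1st dimension of row i exceeds the 2nd, else 0."""
    return [1 if row[0] > row[1] else 0 for row in h_out]


# Hypothetical training labels: 2 spam comment senders, 4 normal users.
labels = {0: 1, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0}
sampled = undersample(labels)
print(len(sampled))  # 4
print(update_confidence([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]))  # [1, 0, 0]
```

Refreshing B in every iteration lets the homogeneous and heterogeneous neighbor sets follow the model's current predictions for unlabeled nodes.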

4) Label-free user class output

First, the model output $h_{out}$ is obtained, and the rows corresponding to the unlabeled nodes are taken out.

Second, if the 1st-dimension value of the row corresponding to node i in $h_{out}$ is larger than the 2nd-dimension value, node i is a spam comment sender; otherwise, it is a normal user.

Third, if metrics such as the accuracy of the model are required, they are calculated from the labels of the test set and the corresponding outputs.
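As an illustration of the metric calculation on the test set, the sketch below computes recall on the spam comment sender class and overall accuracy from true and predicted labels. The function name and the sample labels are hypothetical.

```python
def recall_and_accuracy(y_true, y_pred):
    """Recall on the spam comment sender class (label 1) and overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(y_true) if y_true else 0.0
    return recall, accuracy


print(recall_and_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # (0.5, 0.75)
```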

The method is applicable to spam comment sender detection on a variety of online platforms, and it can effectively detect spam comment senders that disguise themselves. On review data for Amazon musical instrument products, the invention builds a graph structure with users as nodes, the interaction relationships among users as edges, and user attributes as node features, and outputs a candidate list of spam comment senders after iterative training with the graph neural network. The recall can reach 90%.

Claims

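1. A method for detecting spam comment users in an online social network, comprising the following steps:

step one, graph construction and preprocessing

(1) Establishing a graph structure with online social platform users as nodes and the interaction relationships among users as edges, and constructing an adjacency matrix;

(2) Digitizing the user attributes and constructing an attribute matrix, in which each row represents the attributes of the corresponding user;

(3) Manually labeling part of the data, giving the numbers and labels of the labeled nodes, where 1 represents a spam comment sender and 0 represents a normal user, and dividing the labeled nodes into a training set and a test set;

(4) Establishing a confidence vector $B \in \{0,1\}^{N}$, where N is the number of nodes; a value of 0 at position i indicates that node i is more likely to be a normal user, and a value of 1 at position i indicates that node i is more likely to be a spam comment sender; the confidence vector is initialized so that the positions of training-set nodes labeled 0 are set to 0 in B, the positions of training-set nodes labeled 1 are set to 1 in B, and all remaining positions are set to 0;

step two, constructing a graph neural network

The graph neural network comprises two layers, and the output dimension of the last layer is 2; the 1st dimension represents the confidence with which the network judges the node to be a spam comment sender, and the 2nd dimension represents the confidence with which it judges the node to be a normal user; the network obtains a node's features by aggregating the features of its neighbors, takes the class information of the neighbors into account when extracting node features, and applies different feature aggregation strategies to neighbors of different classes;

each layer of the graph neural network comprises the following process:

(1) Using a fully connected layer to reduce the dimension of the feature $h_u$ of user u, obtaining the reduced user feature $z_u$:

$z_u = W_t h_u$

where $W_t \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix of the fully connected layer, $d_{in}$ is the input dimension of the layer, and $d_{out}$ is the output dimension of the layer;

(2) Taking node v as the central node, for each neighbor u of node v under relation r, calculating the importance coefficient of u from its relationship with the central node, using a trainable weight vector;

(3) Judging, according to the confidence vector B, whether each neighbor of node v belongs to the same class as node v; the homogeneous neighbors of node v under relation r are put into one set, and the heterogeneous neighbors are put into another set;

(4) Normalizing the importance coefficients of the two types of neighbors separately to obtain the attention scores used for aggregation; for a neighbor u of node v under relation r, if node u is of the same class as node v, its attention score is obtained by normalizing its importance coefficient over the set of homogeneous neighbors of node v, using the natural exponential function exp and a nonlinear activation function σ; similarly, if the neighbor is heterogeneous with respect to node v, its attention score is obtained by the same normalization over the set of heterogeneous neighbors of node v;

(5) According to the attention scores obtained in the previous step, calculating separately the embedding of the homogeneous neighbors and the embedding of the heterogeneous neighbors of the central node v under relation r;

(6) For each node v, obtaining a k-nearest-neighbor graph formed by its k nearest neighbor nodes, based on the Euclidean distance between its feature $z_v$ and the features of the other nodes;

(7) Performing an aggregation operation over the k-nearest-neighbor graph to obtain the k-nearest-neighbor embedding $h_{knn,v}$ of each node v, where K is the number of neighbors selected by the k-nearest-neighbor search, the aggregation uses a weight matrix, and it runs over the k-nearest-neighbor set of node v;

(8) For each node v, fusing its homogeneous-neighbor embedding, its heterogeneous-neighbor embedding, and its k-nearest-neighbor embedding $h_{knn,v}$ to obtain the comprehensive embedding of node v under relation r;

(9) Introducing a multi-head attention mechanism: steps (1) to (8) are repeated H times, and the H results are concatenated to obtain the multi-head attention feature of node v under relation r;

(10) Using concatenation and a linear transformation to integrate the features of node v under the multiple relations into $h'_v$;

After two layers of the graph neural network are stacked, the output $h_{v,out}$ of the last layer is obtained, which is a two-dimensional vector; the 1st dimension represents the confidence that the node is a spam comment sender, and the 2nd dimension represents the confidence that the node is a normal user;

step three, iterative optimization

(1) Inputting the whole graph into the graph neural network to obtain the output $h_{out} \in \mathbb{R}^{N \times 2}$, which is the vertical concatenation of all $h_{v,out}$;

(2) Undersampling the training labels to obtain the set of nodes participating in the loss calculation, so that the numbers of normal and abnormal nodes participating in the loss calculation are similar;

(3) Calculating the loss over this node set, where $y_v$ denotes the label of node v;

(4) Updating the confidence vector B according to the model output $h_{out}$: positions in B whose rows in $h_{out}$ have a 1st-dimension value greater than the 2nd-dimension value are set to 1, and the rest are set to 0;

(5) Executing a gradient descent algorithm according to the loss;

(6) Stopping training when the loss converges;

step four, outputting the unlabeled user category

(1) Obtaining the model output $h_{out}$ and taking out the rows corresponding to the unlabeled nodes;

(2) If the 1st-dimension value of the row corresponding to node i in $h_{out}$ is larger than the 2nd-dimension value, node i is a spam comment sender; otherwise, it is a normal user.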