CN113392334A - False comment detection method in cold start environment

Info

Publication number: CN113392334A
Application number: CN202110733235.1A
Authority: CN
Inventors: 向凌云; 郭国庆; 游卉擎; 刘宇航; 夏卓群
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-14
Anticipated expiration: 2041-06-29
Also published as: CN113392334B

Abstract

A false comment detection method in a cold start environment comprises the following steps: step (1) feature extraction; step (2) constructing a heterogeneous graph; step (3) shared feature learning based on graph convolution; and step (4) feature fusion and classification. The method can accurately identify false comments in a cold start environment.

Description

False comment detection method in cold start environment

Technical Field

The invention relates to the field of computer information processing, in particular to a false comment detection method in a cold start environment.

Background

The richer the behavior information a user leaves on a social networking site, the more effective traditional behavior feature analysis methods are. In a cold start environment, however, a new user has published only one comment, so it is difficult to extract effective behavior features from it. Text features, meanwhile, have been shown to perform poorly when detecting false comments on commercial websites. The main difficulty of false comment detection in a cold start environment is therefore the lack of an activity history for the new user, which leaves the prior art without an effective means of detection.

Therefore, the invention provides a false comment detection method in a cold start environment.

Disclosure of Invention

To achieve the purpose of the invention, the following technical solution is adopted:

A false comment detection method in a cold start environment comprises the following steps:

step (1) feature extraction;

step (2) constructing a heterogeneous graph;

step (3) shared feature learning based on graph convolution;

step (4) feature fusion and classification.

In the above false comment detection method in a cold start environment:

the feature extraction in step (1) comprises: extracting the behavior features of user entities, product entities and comment entities, extracting the text features of comments based on a CNN, and representing each user, product and comment by a feature vector;

the construction of the heterogeneous graph in step (2) comprises: constructing a heterogeneous graph with user entities and product entities as nodes and with the published comments and received comments as edges;

the shared feature learning based on graph convolution in step (3) comprises: for each comment published by a cold-start user, learning user-based shared behavior features and product-based shared behavior features using a graph convolutional neural network;

the feature fusion and classification in step (4) comprises: fusing the original behavior features and text features of the comment with the learned shared behavior features to generate a new feature vector for the cold-start comment, and using the new feature vectors to build a classifier that identifies false comments.

In the above false comment detection method in a cold start environment, the feature extraction in step (1) comprises:

for all user comments, extracting the behavior features of all users and the behavior features of all products, and using them as the feature values of the user nodes and product nodes respectively.

For all user entities u and product entities p, the behavior feature values are:

BF_u＝{uMNR,uPR,uNR,uERD,uavgRD,uBST} (1)

BF_p＝{pMNR,pPR,pNR,pavgRD,pERD} (2)

where BF_u is the behavior feature set of a user entity and BF_p is the behavior feature set of a product entity;

in addition, for each comment, behavior features based on the comment entity are extracted:

BF_r＝{Rank,RD,EXT,DEV,ISR} (3)

These are combined with the behavior features of the user who published the comment and the behavior features of the product it concerns to form the complete behavior feature vector q(r) of comment r:

q(r)＝[o₁,o₂,…,o_j,…,o₁₆] (4)

A text feature extraction model is pre-trained to obtain the text features of each comment; its classification layer uses the softmax activation function:

class_Te＝softmax(W_Te·Te(r)+b_Te) (5)

where Te(r) is the text feature vector obtained by convolving the text of comment r, W_Te is a learnable weight matrix, b_Te is the bias, and class_Te indicates whether the comment is classified as a true or a false comment;

after the text feature extraction model is trained, the Te(r) produced by the CNN-based text feature extraction model is used as the text feature vector of each comment r.

In the above false comment detection method in a cold start environment, the construction of the heterogeneous graph in step (2) comprises: each relation of the heterogeneous graph is represented by a triplet (source node type, edge type, target node type); the heterogeneous graph constructed in step (2) includes two relations: (user, comment, product) and (product, commented, user); a node of the user type is represented by the behavior features BF_u of the corresponding user, and a node of the product type is represented by the behavior features BF_p of the corresponding product; a relation is denoted by s and is one of two types, comment and commented.

In the above false comment detection method in a cold start environment, the shared feature learning based on graph convolution in step (3) comprises:

after the heterogeneous graph is constructed, for each edge in the graph, a two-layer graph convolutional neural network is used to extract the behavior features shared by old users with the new user; the convolution process is shown in formula (6):

h_dst^(l+1)＝AGG_{s: s_dst＝dst} f_s(h_{s_src}^l, h_{s_dst}^l) (6)

where f_s is the convolution module for each relation s, AGG is the aggregation function, h_{s_src}^l represents the features of the source node in relation s, and h_{s_dst}^l represents the features of the target node in relation s. At initialization, if the node type is user, the initial feature value h is the user feature BF_u of the node; if the node type is product, the initial feature value h is the product feature BF_p of the node. l+1 denotes the current iteration, l denotes the previous iteration, and the initial value of l is 0;

the convolution module f_s is given by:

h_i^(l+1)＝σ(b^l+Σ_{j∈N(i)} (1/c_ji)·h_j^l·W^l) (7)

where N(i) is the neighbor set of node i, j is an element of N(i), and c_ji is the product of the square roots of the node degrees, i.e. c_ji＝√|N(j)|·√|N(i)|; h_j^l is the feature value of node j after l iterations, W^l is the learnable weight, b^l is the bias, and σ is the activation function;

through the convolution operation on the heterogeneous graph, the hidden feature value h_src of the source node of each edge and the hidden feature value h_dst of the target node of each edge are obtained.

In the above false comment detection method in a cold start environment, the feature fusion and classification in step (4) comprises: in the feature fusion and classification stage, the original text features, the behavior features, the source-node shared features and the target-node shared features of each edge in the heterogeneous graph are spliced into a feature vector F(r).

A fully connected layer with a softmax activation function then processes F(r) to obtain the final classification result y:

y＝softmax(W_F·F(r)+b_F) (9)

where W_F is a learnable parameter matrix, b_F is the bias, and y has dimension 2, its two components representing the probabilities that the current edge is a false comment and a true comment, respectively.

Drawings

FIG. 1 is a block diagram of a false comment detection method in a cold start environment;

FIG. 2 is a schematic diagram of a graph convolution network-based shared feature learning process;

FIG. 3 is a schematic diagram of a text feature extraction model.

Detailed Description

Embodiments of the present invention are described in detail below with reference to FIGS. 1-3.

As shown in FIG. 1, the false comment detection method in a cold start environment includes the following steps:

Step (1): feature extraction. For the user comments, the behavior features of the user entities, product entities and comment entities are extracted, and the text features of the comments are extracted based on a CNN (convolutional neural network), so that each user, product and comment is represented by a feature vector.

Step (2): constructing a heterogeneous graph. A heterogeneous graph is constructed with user entities and product entities as nodes and with the published comments and received comments as edges. Because the feature vectors obtained by feature extraction are independent for each entity, constructing a heterogeneous graph preserves the association information among the entities.

Step (3): shared feature learning based on graph convolution. For each comment published by a cold-start user, a graph convolutional neural network learns user-based shared behavior features and product-based shared behavior features. These supplement the behavior information that the cold-start user lacks and thereby improve the detection of false comments under cold start.

Step (4): feature fusion and classification. The original behavior features and text features of the comment published by the cold-start user are fused with the two kinds of learned shared features to generate a new feature vector for the cold-start comment. The new feature vectors are used to build a classifier that identifies false comments.

Specifically:

Step 1, feature extraction:

The behavior features of the user entity, the product entity and the comment entity in step 1 are described in Tables 1 to 3, respectively:

TABLE 1 Behavior features of user entities

TABLE 2 Behavior features of product entities

TABLE 3 Behavior features of comment entities

BF_u＝{uMNR,uPR,uNR,uERD,uavgRD,uBST} (10)

BF_p＝{pMNR,pPR,pNR,pavgRD,pERD} (11)

where BF_u is the behavior feature set of a user entity and BF_p is the behavior feature set of a product entity; the meaning of each feature value is given in Table 1 and Table 2.

Furthermore, for each comment, 5 behavior features based on the comment entity are extracted according to Table 3:

BF_r＝{Rank,RD,EXT,DEV,ISR} (12)

These are combined with the 6 user-based behavior features of the user who published the comment and the 5 product-based behavior features of the corresponding product to form the complete 16-dimensional behavior feature vector q(r) of comment r:

q(r)＝[o₁,o₂,…,o_j,…,o₁₆] (13)
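As a minimal sketch of how q(r) could be assembled, the Python snippet below concatenates the three feature groups. The ordering (comment, user, product), the function name `behavior_vector` and all numeric values are assumptions made for illustration; the description only fixes the feature names and the total length of 16.

```python
# Feature names follow the BF_u, BF_p and BF_r sets; the ordering inside q(r) is an assumption.
USER_KEYS = ["uMNR", "uPR", "uNR", "uERD", "uavgRD", "uBST"]
PRODUCT_KEYS = ["pMNR", "pPR", "pNR", "pavgRD", "pERD"]
COMMENT_KEYS = ["Rank", "RD", "EXT", "DEV", "ISR"]


def behavior_vector(bf_r, bf_u, bf_p):
    """Concatenate comment, user and product behavior features into q(r)."""
    return ([bf_r[k] for k in COMMENT_KEYS]
            + [bf_u[k] for k in USER_KEYS]
            + [bf_p[k] for k in PRODUCT_KEYS])


# Hypothetical, made-up feature values for one comment.
bf_r = dict(zip(COMMENT_KEYS, [1.0, 0.2, 0.0, 1.0, 1.0]))
bf_u = dict(zip(USER_KEYS, [0.1, 0.5, 0.5, 0.3, 0.4, 0.0]))
bf_p = dict(zip(PRODUCT_KEYS, [0.2, 0.7, 0.3, 0.5, 0.1]))
q_r = behavior_vector(bf_r, bf_u, bf_p)
print(len(q_r))
```

Keeping the key lists explicit makes the 5 + 6 + 5 composition of the vector visible and guards against silently dropping a feature.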

Then, a CNN-based text feature extraction model is pre-trained on false comment texts and normal comment texts and is used to obtain the text features of each comment. The model structure is shown in FIG. 3.

In FIG. 3, feature maps 1, 2 and 3 are the hidden layers produced by convolution kernels with window heights of 3, 4 and 5, respectively. The classification layer uses the softmax activation function:

class_Te＝softmax(W_Te·Te(r)+b_Te) (14)

where Te(r) is the text feature vector obtained by convolving the text of comment r, W_Te is a learnable weight matrix, b_Te is the bias, and the value of class_Te indicates whether the comment is classified as true or false.

The convolution operation represents the comment text as a feature vector Te. Through training, the text feature extraction model makes Te capture, as far as possible, whether the comment text is genuine, so this feature vector is extracted as the text feature vector of the comment.
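The convolution over windows of 3, 4 and 5 tokens followed by max pooling can be illustrated with the pure-Python sketch below. It is not the trained model: the function names, the toy embeddings, the all-ones kernels and the use of ReLU before pooling are assumptions, and how the pooled values are mapped to the final vector Te(r) is not reproduced here.

```python
def conv_max_pool(embeddings, kernel):
    """Slide one kernel over the token embeddings and max-pool over positions.

    embeddings: list of token vectors (one list of floats per token)
    kernel:     list of `height` rows, each as long as a token vector
    ReLU is applied before pooling (an assumption, not stated in the source).
    """
    height = len(kernel)
    best = 0.0  # ReLU output is never negative, so 0.0 is a safe floor
    for start in range(len(embeddings) - height + 1):
        score = sum(kernel[k][d] * embeddings[start + k][d]
                    for k in range(height)
                    for d in range(len(kernel[k])))
        best = max(best, score)
    return best


def pooled_features(embeddings, kernels):
    """Return one pooled value per kernel; kernels may have heights 3, 4 and 5."""
    return [conv_max_pool(embeddings, kernel) for kernel in kernels]


# Toy input: 5 tokens with 2-dimensional embeddings (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
kernels = [[[1.0, 1.0]] * h for h in (3, 4, 5)]  # one all-ones kernel per height
features = pooled_features(tokens, kernels)
```

Max pooling over positions yields one value per kernel regardless of comment length, which is what allows comments of different lengths to share one fixed-size representation.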

For the pre-trained CNN-based text feature extraction model, the parameters are set as follows: the number of convolution kernels is 60, the text feature length is 10, max pooling is used, the learning rate is 0.00001, and the number of epochs (training iterations) is 100. The model uses cross-entropy loss with the class weights of normal comments and false comments set to 1:10, which compensates for the imbalance between normal and false comments. The model with the highest F1 value during training is saved as the final feature extraction model.
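The effect of the 1:10 class weighting can be seen in the following sketch of a weighted cross-entropy loss. The class indices (0 for normal, 1 for false), the function name and the normalization by the sum of weights are assumptions for illustration; the description only states the weight ratio.

```python
import math

# Class indices are an assumption: 0 = normal comment, 1 = false comment.
CLASS_WEIGHTS = {0: 1.0, 1: 10.0}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -w[y] * log(p[y]) over the samples."""
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
        weight_sum += weights[y]
    return total / weight_sum


# The same confident mistake costs more on a false comment than on a normal one.
missed_false = weighted_cross_entropy([[0.9, 0.1], [0.9, 0.1]], [0, 1])
missed_normal = weighted_cross_entropy([[0.1, 0.9], [0.1, 0.9]], [0, 1])
print(missed_false > missed_normal)
```

Because false comments are the minority class, the larger weight keeps the model from minimizing the loss by predicting every comment as normal.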

After the CNN-based text feature extraction model is trained, the Te(r) it produces for each comment r is used as the text feature vector of that comment; its length is the text feature length given in the parameter settings.

Step 2, constructing a heterogeneous graph:

To extract shared features from the old users associated with a new user, and thereby compensate for the new user's missing behavior information, a heterogeneous graph is constructed with users and products as nodes once the behavior features of each user and each product have been extracted.

Each relation of the heterogeneous graph can be represented by a triplet (source node type, edge type, target node type). The heterogeneous graph constructed in step 2 includes two relations: (user, comment, product) and (product, commented, user). A node of the user type is represented by the behavior features BF_u of the corresponding user, a node of the product type is represented by the behavior features BF_p of the corresponding product, and an edge is represented by the comment or by the behavior features of the comment. A relation is denoted by s and is one of two types, comment and commented.
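A minimal sketch of such a graph, using only plain dictionaries, is given below. The adjacency layout (target node mapped to its source neighbors), the function name `build_hetero_graph` and the sample identifiers are assumptions chosen for illustration, not a prescribed data structure.

```python
from collections import defaultdict

USER_COMMENT_PRODUCT = ("user", "comment", "product")
PRODUCT_COMMENTED_USER = ("product", "commented", "user")


def build_hetero_graph(comments):
    """Build both relations from (user_id, product_id) pairs, one pair per comment.

    Each relation maps a target node to the list of its source neighbors.
    """
    graph = {USER_COMMENT_PRODUCT: defaultdict(list),
             PRODUCT_COMMENTED_USER: defaultdict(list)}
    for user, product in comments:
        graph[USER_COMMENT_PRODUCT][product].append(user)
        graph[PRODUCT_COMMENTED_USER][user].append(product)
    return graph


# Hypothetical comments: u1 and u2 reviewed p1, and u2 also reviewed p2.
graph = build_hetero_graph([("u1", "p1"), ("u2", "p1"), ("u2", "p2")])
print(dict(graph[USER_COMMENT_PRODUCT]))
```

Every comment contributes one edge to each relation, so a new user with a single comment is still connected, through the product, to the old users who reviewed the same product.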

Step 3, shared feature learning based on GCN (graph convolutional network):

After the heterogeneous graph is constructed, for each edge in the graph, a two-layer graph convolutional neural network is used to extract the behavior features shared by old users with the new user. The feature matrix is the matrix formed by the feature values of all nodes in the heterogeneous graph. Graph convolution on the heterogeneous graph is defined as:

h_dst^(l+1)＝AGG_{s: s_dst＝dst} f_s(h_{s_src}^l, h_{s_dst}^l) (15)

where f_s is the convolution module for each relation s, AGG is the aggregation function, h_{s_src}^l represents the features of the source node in relation s, and h_{s_dst}^l represents the features of the target node in relation s. At initialization, if the node type is user, the initial feature value h is the user feature BF_u of the node; if the node type is product, the initial feature value h is the product feature BF_p of the node. l+1 denotes the current iteration, l denotes the previous iteration, and the initial value of l is 0.

The aggregation function AGG used in the present invention is sum.

The convolution module f_s is given by:

h_i^(l+1)＝σ(b^l+Σ_{j∈N(i)} (1/c_ji)·h_j^l·W^l) (16)

where N(i) is the neighbor set of node i, j is an element of N(i), c_ji＝√|N(j)|·√|N(i)| is the product of the square roots of the node degrees, h_j^l is the feature value of node j after l iterations, W^l is the learnable weight, b^l is the bias, and σ is the activation function; ReLU is used in the present invention.

When the graph is constructed, the feature vector given by formula (10) or formula (11) is assigned, according to node type, as the initial feature value h_i^0 of each node i. Node i convolves all of its neighbor nodes through the process described by formula (16), and the resulting feature vectors are aggregated using formula (15). Iterating this process lets each node learn its hidden feature value h.
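One relation-specific convolution with degree normalization, sum over neighbors and ReLU can be sketched in pure Python as follows. The function name `relation_conv`, the dictionary-based inputs and the toy numbers are assumptions for illustration; in this two-relation graph each node type receives messages through exactly one relation, so the sum aggregation over relations reduces to the output of that relation.

```python
import math


def relation_conv(neighbors, src_degree, h_src, weight, bias):
    """One graph convolution for a single relation, followed by ReLU.

    neighbors:  dict mapping target node i to its source neighbors N(i)
    src_degree: dict mapping source node j to its degree |N(j)|
    h_src:      dict mapping source node j to its feature list h_j
    weight:     in_dim x out_dim matrix W as a list of rows
    bias:       list of out_dim bias values b
    """
    out = {}
    for i, nbrs in neighbors.items():
        acc = list(bias)
        for j in nbrs:
            # c_ji is the product of the square roots of the two node degrees.
            c_ji = math.sqrt(src_degree[j]) * math.sqrt(len(nbrs))
            for k in range(len(bias)):
                message = sum(h_src[j][d] * weight[d][k] for d in range(len(weight)))
                acc[k] += message / c_ji
        out[i] = [max(0.0, v) for v in acc]  # ReLU
    return out


# Toy case: product p1 has one reviewer u1, who is assumed to have 4 comments overall.
h_p = relation_conv({"p1": ["u1"]}, {"u1": 4}, {"u1": [2.0, 4.0]},
                    [[1.0, 0.0], [0.0, -1.0]], [0.5, 0.0])
```

A two-layer network would apply such a convolution twice, alternating between the (user, comment, product) and (product, commented, user) relations, so that after two steps a new user's hidden feature also reflects the old users who reviewed the same product.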

Through the convolution operation on the heterogeneous graph, the hidden feature value h_src of the source node and the hidden feature value h_dst of the target node are obtained for each edge. These two hidden features are treated as the shared features of the source node and the target node, and the two feature vectors are used to supplement the behavior information missing from each edge. Depending on the relation the edge represents, h_src and h_dst are either user shared behavior features or product shared behavior features: when an edge represents a (user, comment, product) relation, h_src is the user shared behavior feature and h_dst is the product shared behavior feature; when an edge represents a (product, commented, user) relation, h_src is the product shared behavior feature and h_dst is the user shared behavior feature.

Step 4, feature fusion and classification:

In the feature fusion and classification stage, the original text features, the behavior features, the source-node shared features and the target-node shared features of each edge (i.e. each comment) in the heterogeneous graph are spliced into a feature vector F(r), which is then processed by a fully connected layer with a softmax activation function to obtain the final classification result y:

y＝softmax(W_F·F(r)+b_F) (18)

where W_F is a learnable parameter matrix, b_F is the bias, and y has dimension 2, its two components representing the probabilities that the current edge (i.e. the comment to be detected) is a false comment and a true comment, respectively.
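The splicing and the softmax layer can be sketched as below. The splice order, the function names, the tiny vector sizes and the weight values are assumptions for illustration; a trained model would learn W_F and b_F.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def classify(te, q, h_src, h_dst, W, b):
    """Splice the four feature groups into F(r) and apply softmax(W·F(r)+b).

    W has 2 rows (one per class), each as long as the spliced vector.
    """
    f = te + q + h_src + h_dst  # list concatenation = feature splicing
    logits = [sum(w * x for w, x in zip(row, f)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


# Toy vectors of length 2 each (made-up numbers), so F(r) has length 8.
te, q, h_src, h_dst = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]
y_uniform = classify(te, q, h_src, h_dst, [[0.0] * 8, [0.0] * 8], [0.0, 0.0])
y_skewed = classify(te, q, h_src, h_dst, [[1.0] * 8, [0.0] * 8], [0.0, 0.0])
```

Following the description, the first component of y is read as the probability of a false comment and the second as the probability of a true comment.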

Experimental results and analysis

To demonstrate the effectiveness of the proposed method, the proposed model was compared with 7 baseline methods, briefly described as follows:

(1) LF: traditional bigram features are used as the text features of a comment.

(2) Supervised-CNN: a convolutional neural network is trained using only the labeled comments; the semantic information it extracts serves as the text features of a comment, and false comments are identified from this semantic information alone.

(3) LF + BF: a comment is represented by splicing its text features with the behavior features of the comment entity, and false comment detection uses the spliced features. The text features are bigram features; the behavior features include the comment text length, the score, the absolute deviation rate of the score, and the maximum cosine similarity between the comment and the other comments on the same product.

(4) BF_EditSim + LF: a representation-learning-based method associates the new user with old users, the behavior features of the most similar old user are used as the behavior features of the new user, and these are spliced with bigram features to form the feature representation of the cold-start comment, which is used to detect whether the comment is genuine.

(5) BF_W2Vsim + W2V: the word vector of each word in a comment is first obtained with the word2vec model, and the mean of these vectors is taken as the text features of the comment. The existing comment most similar to the cold-start comment is then found using the cosine similarity between their text features. Finally, the behavior features of the most similar comment and the text features of the cold-start comment form its feature representation, and the comment is classified from this combined feature vector.

(6) RE: the behavior features of the user are constructed with a TransE model, the text features are extracted with a CNN, and a constraint is used to preserve the sentiment polarity of the text.

(7) RE + RRE + PRE: this model extends the RE model; the comment representation obtained by the RE model, the comment score and the product comment score are spliced to form the final comment representation.

To verify the effectiveness of the method, the hotel comment data in the Yelp dataset were selected for the experiments. The Yelp dataset is a publicly available commercial website dataset that offers a good balance between commercial authenticity and ground truth, and it has therefore been widely used in prior work. The first labeled comments published by new users after January 1, 2012 form the test set, and the first comments published by users before January 1, 2012 form the training set used to learn the GCN-based shared feature extraction model. In addition, to train the global text feature representation model, all labeled comment data from before January 1, 2012 are extracted separately and used to train the CNN-based text feature extraction model.
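The date-based split can be sketched as follows. The record layout (dictionaries with `user`, `date` and `label` keys), the function name and the handling of a comment dated exactly on the cutoff are assumptions for illustration; no real Yelp records are reproduced.

```python
from datetime import date

CUTOFF = date(2012, 1, 1)


def split_first_comments(comments):
    """Keep each user's first comment, then split the first comments by the cutoff date.

    Comments dated exactly on the cutoff go to the test set (an assumption).
    """
    first = {}
    for c in sorted(comments, key=lambda c: c["date"]):
        first.setdefault(c["user"], c)  # keeps only the earliest comment per user
    train = [c for c in first.values() if c["date"] < CUTOFF]
    test = [c for c in first.values() if c["date"] >= CUTOFF]
    return train, test


# Made-up records: user a is an old user, user b first appears after the cutoff.
records = [
    {"user": "a", "date": date(2011, 5, 1), "label": 0},
    {"user": "a", "date": date(2012, 3, 1), "label": 0},
    {"user": "b", "date": date(2012, 2, 1), "label": 1},
    {"user": "b", "date": date(2012, 6, 1), "label": 1},
]
train, test = split_first_comments(records)
```

Taking only first comments on both sides keeps the training and test conditions aligned: every sample is a comment from a user who has no earlier activity at the time of posting.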

TABLE 4 Comparative experimental results of different methods in a cold start environment

The experimental results are shown in Table 4. The proposed method outperforms the comparison methods on all evaluation metrics. In particular, its recall is about 10% higher than that of the other methods, which shows that it identifies a larger share of the false comments. Furthermore, analysis of Table 4 leads to the following conclusions:

1) In a cold start environment, text features still perform poorly. The bigram-based LF method has the lowest accuracy of all the compared methods, while the Supervised-CNN method, which relies on CNN text features, has the lowest F1 value. This indicates that false comments cannot be identified effectively from the comment text alone.

2) Combining behavior features improves detection in a cold start environment to some extent. The results of the LF + BF model show that combining behavior features with text features improves the accuracy of false comment detection under cold start. However, the recall and F1 of model 3 actually decrease, from which it can be concluded that relying only on the behavior features of the comment itself under cold start causes more false comments to be identified as normal comments.

3) Directly replacing the behavior features of the comment under test with those of a similar comment works poorly under cold start. Models 4 and 5 detect false comments by substituting features, based on similarity between users and between texts respectively. The experimental results show that neither user similarity nor text similarity yields a clear improvement in accuracy, and some metrics (such as the F1 value and recall of model 4) are even lower than those of the methods that use only text features.

4) Extracting associations from existing comments to construct behavior features for the cold-start comment, and combining them with its original behavior features, achieves better results. Model 8, the proposed method, extracts the behavior features of associated users through the heterogeneous graph and combines them with the comment's own original behavior features; it achieves the best experimental results, with all metrics substantially improved over the other methods.

5) The shared features learned by graph convolution effectively address the missing behavior information of cold-start users and improve the accuracy of false comment detection in a cold start environment. The model proposed here outperforms the comparison methods on all evaluation metrics.
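For reference, the recall and F1 values discussed in the conclusions above follow the standard definitions, sketched below with false comments as the positive class. The function name, the label encoding (1 for false comments) and the sample labels are assumptions for illustration and do not reproduce any reported result.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive (false comment) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels: one false comment found, one missed, one normal comment flagged.
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

A drop in recall, as noted for model 3, corresponds to more false comments ending up in the `fn` count, that is, being passed as normal.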

The method represents the associations among users, products and comments as a graph and learns shared behavior features through graph convolution to supplement the behavior features that cold-start users lack. It fuses the text features and behavior features of a comment with the shared behavior features of the entities associated with it to detect false comments. This effectively addresses the poor detection of false comments caused by the lack of user behavior information in a cold start environment.

Claims

1. A false comment detection method in a cold start environment, characterized by comprising the following steps:

step (1) feature extraction;

step (2) constructing a heterogeneous graph;

step (3) shared feature learning based on graph convolution;

step (4) feature fusion and classification.

2. The method of claim 1, wherein:

the feature extraction in step (1) comprises: extracting the behavior features of user entities, product entities and comment entities, extracting the text features of comments based on a CNN, and representing each user, product and comment by a feature vector.