CN114742564A - False reviewer group detection method fusing complex relationships - Google Patents
- Publication number: CN114742564A
- Application number: CN202210449853.8A
- Authority
- CN
- China
- Prior art keywords
- node
- model
- false
- training
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/018—Certifying business or products
- G06Q30/0185—Product, service or business identity fraud
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention belongs to the field of artificial intelligence and provides a false reviewer group detection method fusing complex relationships, used to detect false reviewer groups on online trading platforms. The method comprises three stages: node representation updating, model training, and false reviewer group detection. Applied to a real data set, the trained model can identify false reviewers and clearly separate false reviewer groups from normal reviewers. Starting from the complex relation characteristics of the nodes, the method makes full use of the valuable relation information among reviewers and integrates the embedding process with the clustering and detection process to obtain a target-guided false reviewer group detection model, overcoming the poor generality and low detection accuracy of existing group detection methods.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a false reviewer group detection method fusing complex relationships.
Background
The rapid popularity of online review systems has made reviews an important basis for purchase decisions: more and more people check the reviews on a platform before buying a product and evaluate the product after buying it. These reviews provide customers with useful information and first-hand product experience, so the quality of online reviews is particularly important; false reviews that misrepresent a product can damage its reputation and mislead buyers.
Most existing false review detection techniques rely on big data and artificial intelligence. Traditional techniques classify reviewers using manually engineered features, capturing behavioural features, linguistic features in the reviews, and relationship features between users extracted from constructed graphs. Past research has focused mainly on detecting individual false reviewers, yet a false reviewer group often does more harm to an online review system, and finding such groups is difficult: each review in the group may look like a normal individual review, so individual-level detection techniques hardly work. Moreover, the relationships between false reviewers are hard to establish, although exactly these complex relationships would let a model capture the connections between reviewers within a group and thereby assist group detection.
Current false reviewer group detection methods fall into the following categories:
Detection methods based on clustering algorithms. These methods typically learn node embeddings with algorithms such as graph neural networks, cluster the nodes with a clustering algorithm, and finally detect false reviewer groups from the clusters. Common clustering algorithms include the partition-based KMeans and the density-based DBSCAN.
(1) The KMeans clustering algorithm divides all points in the sample space into K groups, usually measuring similarity with the Euclidean distance. Its main flow is: randomly place K centroids, one per cluster; compute the distance of every point to each centroid and assign each data point to its nearest centroid, forming clusters; then, in an iterative process, recompute the position of each of the K centroids as the mean of its cluster.
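The K-Means flow described above (random centroids, nearest-centroid assignment, centroid recomputation) can be sketched in a few lines of NumPy; the function name and defaults are illustrative, not from the patent:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means: place K centroids, assign each point to the nearest
    one (Euclidean distance), recompute centroids, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Randomly place K centroids by sampling K distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid; assign to the nearest.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

On two well-separated blobs this converges to the expected partition within a few iterations.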
(2) The DBSCAN clustering algorithm first determines the type of each point: every point in the data set is either a core point or a boundary point. A data point is a core point if at least M points lie within a specified radius R of it; it is a boundary point if fewer than M data points lie in its neighbourhood but it is reachable from a core point, i.e. it lies within distance R of one. Neighbouring core points are connected and placed in the same cluster, and boundary points are assigned to the cluster of their core point.
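A minimal sketch of the core-point/boundary-point logic above, assuming Euclidean distance and labelling points reachable from no core point as outliers (-1); names and defaults are illustrative:

```python
import numpy as np

def dbscan(X, R=1.0, M=3):
    """Minimal DBSCAN: mark core points (>= M neighbours within radius R),
    grow clusters from core points, leave unreachable points as outliers (-1)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neigh = [np.where(d[i] <= R)[0] for i in range(n)]
    core = np.array([len(neigh[i]) >= M for i in range(n)])
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a cluster from core point i over density-reachable points.
        labels[i] = cid
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid            # core or boundary point joins
                if core[j]:                # only core points expand further
                    queue.extend(neigh[j])
        cid += 1
    return labels, core
```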
Graph-based detection methods. Starting from subgraphs, these methods judge the suspiciousness of a group from node or subgraph attributes, thereby realising the whole detection pipeline. Some methods aggregate relationships from differences in graph topology, time, and ratings, and use joint probabilities to detect false reviewer groups; they ignore the structural characteristics of the nodes and do not consider the complex relations among them. Other methods target several main characteristics of a group, such as synchronicity, mildness, and dispersion, and detect anomalous groups by computing specific indices. These methods lack generality in practice: specific indices must be designed for each network or data set for the group detection task to succeed, and when transferred, detection precision drops sharply. In addition, such methods consider only features within the group and still ignore the complex relationships between reviewers.
Disclosure of Invention
In existing false reviewer group detection methods, the embedding process is separated from the subsequent clustering and detection process, so training lacks target guidance; if the learned representations are unsuitable for detection, the resulting group detection is poor. In addition, complex relationships in the review network are ignored, so the valuable relation information among reviewers within a group cannot be exploited.
To address these problems in the prior art, the invention provides a false reviewer group detection method fusing complex relationships for online selling platforms. Guided by the detection target, the method starts from the complex relation characteristics of the nodes, learns complex-relation-aware node representations from these characteristics, and reconstructs the topological information of the graph with an autoencoder. To integrate the embedding process with the clustering and detection process, the method trains the model in a self-supervised manner and uses the clustering and detection results to guide model optimisation.
To achieve this, the invention adopts the following technical scheme: a false reviewer group detection method fusing complex relationships, which updates the representations of review nodes in the review network with an attention-based graph neural network; designs a graph reconstruction loss and a self-supervised distribution loss for model training; and applies the resulting optimal model to detect and identify false reviewer groups in the review network. The specific steps are as follows:
First, update the node representations to obtain a reconstructed graph. The model extracts the adjacency matrix and attribute matrix of the review network and derives a complex relation matrix from the adjacency matrix. The attention encoder then fuses the complex relations into the message-passing process, effectively encoding the network's high-order structural information and node attribute information, and updates the node representations. An attention-based graph neural network serves as the encoder; the initial node features serve as the initial node embeddings, and the complex relations of the nodes are fused into the attention-based graph neural network so that the node characterisations express both high-order structural features and attribute features;
1.1) Compute node similarity. To simplify computation and reduce model parameters, attention is restricted to the first-order neighbours of the central node. The formula is:
c_ij = a(Wh_i, Wh_j)  (1)
where c_ij represents the importance of node j to node i; W is a weight matrix; h_i and h_j are the feature vectors of node i and node j, respectively; and a is a function that computes node similarity;
1.2) Compute the complex relation matrix. The review network has complex structural relationships, and the complex relations among its nodes contain valuable information. A complex relation matrix is obtained by considering the high-order neighbours of each node:
M = (B + B^2 + … + B^t) / t  (2)
where B is the transition matrix: B_ij = 1/d_i when an edge exists between node i and node j, with d_i the degree of node i, and B_ij = 0 when no edge exists; M is the complex relation matrix, and M_ij is the complex relation of node i and node j at order t;
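Equation (2) can be computed directly from the adjacency matrix. The following NumPy sketch (function name hypothetical) builds the row-normalised transition matrix B and averages its first t powers:

```python
import numpy as np

def complex_relation_matrix(A, t=3):
    """Build M = (B + B^2 + ... + B^t) / t from adjacency matrix A,
    where B is the row-normalised transition matrix B_ij = A_ij / d_i."""
    d = A.sum(axis=1, keepdims=True)            # node degrees d_i
    B = np.divide(A, d, out=np.zeros_like(A, dtype=float), where=d > 0)
    M = np.zeros_like(B)
    P = np.eye(len(A))
    for _ in range(t):
        P = P @ B                               # B^1, B^2, ..., B^t
        M += P
    return M / t
```

Since every power of B is row-stochastic, each row of M sums to 1 for nodes with at least one edge.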
1.3) Fuse the complex relations. A single-layer feedforward neural network is used for the computation, and the complex relation matrix M is fused into the attention-based graph neural network: specifically, the complex relation matrix is multiplied by the node similarity, so that computing the similarity between nodes accounts not only for the similarity of their representations but also for the influence of their complex relation. LeakyReLU is chosen as the activation function to add nonlinearity and strengthen the model's feature expression capability. After fusing the complex relations, the importance of node j to node i is rewritten as:

c_ij = LeakyReLU(M_ij · a(Wh_i, Wh_j))  (3)
1.4) Update the node representations. The softmax function normalises the importance of the neighbour nodes so that the importance of the first-order neighbours to the central node is distributed in [0, 1], and the neighbours' features are aggregated to update the node representation:

α_ij = softmax_j(c_ij) = exp(c_ij) / Σ_{k∈N_i} exp(c_ik)  (4)

h_i^(l+1) = σ(Σ_{j∈N_i} α_ij W h_j^(l))  (5)

In equation (4), α_ij is the normalised attention coefficient and N_i is the first-order neighbour set of node i. In equation (5), h_j^(l) is the representation of neighbour j of node i at layer l, and h_i^(l+1) is the representation of node i at layer l+1. The final node representation is obtained through multi-layer aggregation;
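Steps 1.1)-1.4) can be sketched as a single NumPy layer. The scoring function a (here a^T[Wh_i || Wh_j], as in standard graph attention networks) and the exact order in which M and LeakyReLU are applied are our assumptions — the patent gives only the prose description — so treat this as an illustration rather than the patented formula:

```python
import numpy as np

def softmax_masked(c, mask):
    """Row-wise softmax restricted to entries where mask is True (Eq. (4))."""
    c = np.where(mask, c, -np.inf)
    e = np.exp(c - c.max(axis=1, keepdims=True))
    e = np.where(mask, e, 0.0)
    return e / e.sum(axis=1, keepdims=True)

def attention_layer(H, A, M, W, a):
    """One attention layer fusing the complex-relation matrix M.
    H: node features, A: adjacency, M: complex relations, W: weights,
    a: attention vector of length 2 * output_dim (assumed scoring form)."""
    Wh = H @ W
    n, d_out = Wh.shape
    # c_ij = a(Wh_i, Wh_j): single-layer feed-forward score a^T [Wh_i || Wh_j].
    c = (Wh @ a[:d_out])[:, None] + (Wh @ a[d_out:])[None, :]
    # Fuse complex relation by element-wise product, then LeakyReLU (Eq. (3)).
    c = c * M
    c = np.where(c > 0, c, 0.2 * c)
    mask = (A + np.eye(n)) > 0          # attend over first-order neighbours + self
    alpha = softmax_masked(c, mask)     # Eq. (4): normalised attention in [0, 1]
    return np.tanh(alpha @ Wh)          # Eq. (5): aggregate neighbour features
```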
Second, train the model. The model first computes a loss from the topological information reconstructed by the decoder: the difference between the original and reconstructed adjacency matrices gives the first part of the loss. The second part is obtained through self-supervised training: the model determines core points in the review network with the DBSCAN clustering algorithm, computes the distances between all nodes and the core points, and uses the KL divergence as the second loss. The final loss function combines these two losses to train the model jointly. After the loss is computed, the model parameters are updated with a gradient descent algorithm, completing training.
Design a graph reconstruction loss function and a self-supervised distribution loss function, update the parameters of the attention-based graph neural network model, and complete training. The specific steps are:
2.1) Compute the graph reconstruction loss function. The decoder reconstructs the graph's topological information, and the difference between the reconstructed and original adjacency matrices gives the reconstruction loss. The formula is:

Â = σ(H H^T)  (6)

where Â is the reconstructed adjacency matrix, H is the updated node characterisation matrix, and σ is the activation function;
During training, cross entropy is adopted as the loss function:

ℓ(y, ŷ) = −[y log ŷ + (1 − y) log(1 − ŷ)]  (7)

where y is the value of an element in the adjacency matrix and ŷ is the corresponding element in the reconstructed adjacency matrix. This part of training requires minimising the reconstruction loss, defined as:

L_r = Σ_{i,j} ℓ(A_ij, Â_ij)  (8)
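A minimal sketch of step 2.1): decode Â = σ(HH^T) with a sigmoid and score it against the input adjacency matrix with element-wise cross entropy. The mean reduction (rather than a sum) is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_reconstruction_loss(H, A, eps=1e-10):
    """Eq. (6): A_hat = sigmoid(H H^T); Eqs. (7)-(8): element-wise
    cross entropy between A and A_hat, averaged over all entries."""
    A_hat = sigmoid(H @ H.T)
    ce = -(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))
    return ce.mean()
```

Embeddings aligned with the graph's block structure should score lower than uninformative embeddings, which is what the loss is meant to drive.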
2.2) Compute the self-supervised distribution loss function. One challenge of false review detection is training without label guidance; the model therefore trains in a self-supervised manner, optimising the node embeddings with pseudo-labels. The nodes are first clustered; the model uses the K-Means algorithm:

argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ||x − μ_i||²  (9)

where μ_i is the mean of all nodes in cluster S_i and k is the number of clusters.
After all candidate groups are obtained, the DBSCAN clustering algorithm determines the core points in the review network, and the distance distribution between each node and the core points is computed;
During training, the distribution of the data must be continuously learned to distinguish normal from abnormal nodes. p_iu denotes the pseudo-label computed by the model, and q_iu denotes the distance distribution between the node features and the core points detected by DBSCAN. q_iu is defined as:

q_iu = (1 + ||z_i − u_u||²)^(−1) / Σ_k (1 + ||z_i − u_k||²)^(−1)  (10)

where u_u is the characterisation of a core point detected by DBSCAN, z_i is the characterisation of the current node, and u_k is the characterisation of the core point of the k-th class. The formula measures the distance between a node's characterisation and a core point's characterisation: if a node is close enough to a core point, it is taken to belong to that group and treated as a normal node; if a node is far from the core points, it is treated as an outlier, i.e. as belonging to a false review group. The node label is obtained by:
S_i = argmax_u q_iu  (11)
The KL divergence serves as the loss function measuring the difference between the node-to-core-point distance distribution and the pseudo-labels;
the KL divergence mainly measures the difference between the probability distribution Q and the reference probability distribution P. Unlike the label obtained in equation (11), the target distribution piuConsidered as a true label, is calculated by Q in the training process, piuThe P distribution is relied on and updated according to the phase, and the P distribution is regarded as an automatic supervision label in the phase. The main function of the target distribution is to supervise the learning of the model and guide the updating of the distribution Q. The formula for P is as follows:
in the formula, qikRepresenting the distance distribution between the features of all nodes and the core point of the kth class. The loss function for the self-supervised optimization embedding is as follows:
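The self-supervised pieces of step 2.2) — the soft assignment q, the sharpened target P, the KL loss, and the argmax labels of equation (11) — can be sketched as follows. The Student-t form of q and the DEC-style target distribution are our reading of the (image-only) equations, not a verbatim reproduction:

```python
import numpy as np

def soft_assignment(Z, U):
    """q_iu: similarity between node embeddings Z and core-point
    characterisations U (assumed Student-t kernel, Eq. (10))."""
    d2 = ((Z[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """p_iu: sharpened self-supervised targets computed from Q (Eq. (12))."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-10):
    """L_c = KL(P || Q) (Eq. (13))."""
    return (p * np.log((p + eps) / (q + eps))).sum()

def node_labels(q):
    """S_i = argmax_u q_iu (Eq. (11))."""
    return q.argmax(axis=1)
```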
2.3) Compute the joint loss function. The joint loss function is:

L = L_r + βL_c  (14)

where L_r is the graph reconstruction loss function, L_c is the self-supervised distribution loss function, and β controls the weight between the two losses;
2.4) Model training: set the initial parameters of the attention-based graph neural network model and iterate the training process under the joint loss function to obtain the optimal parameters of the attention-based graph neural network model;
Third, detect false reviewer groups. The attention-based graph neural network model obtained in the second step is applied to the real review network, and the detection results are saved.
The graph reconstruction loss function adopts cross entropy; the node clustering uses the KMeans clustering algorithm.
The specific method for the model training in 2.4) is as follows:
Set the initial parameters of the attention-based graph neural network model, including the number of aggregation layers, the node embedding dimension, the number of clusters for the KMeans clustering algorithm, the number of training iterations, and so on;
Adjust the parameters continuously during training, determining the optimal parameters from the decrease of the joint loss function during training or from the model's final detection results;
the method specifically comprises the following steps: inputting the comment network and the adjacency matrix of the network into a model, operating and training the model, recording the detection performance of the model after the training, repeatedly training for many times under the same set of hyper-parameters, and taking the average value of the detection precision as the final result detection precision; after model training under a group of parameters is completed, parameters in the model are adjusted according to a control variable method, one parameter of the model is adjusted according to the direction of increasing the average precision, and other parameters are kept unchanged; and repeatedly adjusting parameters, reserving a group of parameter settings for enabling the average discrimination precision of the model to reach the highest, and finishing the model training.
The beneficial effects of the invention are: the method can identify false reviewers and clearly separate false reviewer groups from normal reviewers. Starting from the complex relation characteristics of the nodes, it makes full use of the valuable relation information among reviewers and integrates the embedding process with the clustering and detection process to obtain a target-guided false reviewer group detection model, overcoming the poor generality and low detection accuracy of existing group detection methods.
Drawings
FIG. 1 is a basic framework diagram of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a graph of recall rate changes during training in accordance with an embodiment of the present invention;
FIG. 4 is a graph of the variation of the loss function during training according to an embodiment of the present invention;
FIG. 5 is a visualization diagram of the population detection result according to an embodiment of the present invention.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
A false reviewer group detection method fusing complex relationships comprises three stages: node representation updating; model training; false reviewer group detection.
In the first step, the node representations are updated. In this stage, an attention-based graph neural network serves as the encoder; the initial node features serve as the initial node embeddings, and the complex relations of the nodes are fused into the attention-based graph neural network so that the node characterisations can express both high-order structural features and attribute features.
1.1) Compute node similarity. To simplify computation and reduce model parameters, attention is restricted to the one-hop neighbours of the central node. The formula is:
c_ij = a(Wh_i, Wh_j)  (1)
where c_ij represents the importance of node j to node i; W is a weight matrix; h_i and h_j are the feature vectors of node i and node j, respectively; and a is a function for computing node similarity;
1.2) Compute the complex relation matrix. The review network has complex structural relationships, and the complex relations among its nodes contain valuable information. By considering the high-order neighbours of a node, its complex relation matrix is obtained:
M = (B + B^2 + … + B^t) / t  (2)
where B is the transition matrix: B_ij = 1/d_i if an edge exists between node i and node j, with d_i the degree of node i, and B_ij = 0 otherwise. M is the complex relation matrix, and M_ij is the complex relation of node i and node j at order t.
1.3) Fuse the complex relations. Specifically, a single-layer feedforward neural network is chosen for the computation, and the complex relation matrix M is fused with the graph attention network: the complex relation matrix is multiplied by the inter-node similarity, reflecting that computing the similarity between nodes must consider not only the similarity of their representations but also the influence of their complex relation. Finally, LeakyReLU is chosen as the activation function to add nonlinearity and strengthen the model's feature expression capability. After fusing the complex relation, the importance of node j to node i is rewritten as:

c_ij = LeakyReLU(M_ij · a(Wh_i, Wh_j))  (3)
1.4) Update the node representations. So that the importance of the neighbour nodes to the central node is distributed in [0, 1], the neighbours' importance is normalised with the softmax function, and the neighbours' features are aggregated to update the node representation:

α_ij = softmax_j(c_ij) = exp(c_ij) / Σ_{k∈N_i} exp(c_ik)  (4)

h_i^(l+1) = σ(Σ_{j∈N_i} α_ij W h_j^(l))  (5)

In equation (4), α_ij is the normalised attention coefficient and N_i is the first-order neighbour set of node i. In equation (5), h_j^(l) is the representation of neighbour j of node i at layer l, and h_i^(l+1) is the representation of node i at layer l+1. The final node characterisation is obtained through multi-layer aggregation.
In the second step, the model is trained. First a loss function is designed; after the loss is computed with it, the model parameters are updated to complete training. The model first reconstructs the original network with a decoder and computes the adjacency-matrix difference loss between the original and reconstructed networks. Because the nodes in the false reviewer group detection task have no labels, the embeddings are optimised in a self-supervised manner: the DBSCAN clustering algorithm generates core points in the review network, the distance between the core points and the other nodes is computed, and the KL divergence then measures the difference between the pseudo-labels and the learned embedding distribution. After the loss computation is completed, the gradient descent algorithm updates the model parameters, completing training.
2.1) Compute the graph reconstruction loss. The original graph is reconstructed by inner product:

Â = σ(H H^T)  (6)

where H is the matrix of learned node embeddings and Â is the reconstructed adjacency matrix, which should be as similar as possible to the input adjacency matrix. During training, cross entropy is adopted as the loss function:

ℓ(y, ŷ) = −[y log ŷ + (1 − y) log(1 − ŷ)]  (7)

where y is the value of an element in the adjacency matrix and ŷ is the corresponding element in the reconstructed adjacency matrix. This part of training requires minimising the reconstruction loss, defined as:

L_r = Σ_{i,j} ℓ(A_ij, Â_ij)  (8)
2.2) Compute the distribution loss. One challenge of false review detection is training without label guidance. The model therefore works in a self-supervised manner, using pseudo-labels to optimise the node embedded representations. Because the nodes in the graph are initially independent, all nodes are first clustered during training; the model uses the K-Means algorithm:

argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ||x − μ_i||²  (9)

where μ_i is the mean of all nodes in cluster S_i and k is the number of clusters. After all candidate groups are obtained, the DBSCAN algorithm detects abnormal groups: it first distinguishes the core and boundary points in the graph, takes the detected core points as the core points for the training model, and computes the distance between the characterisations of the other nodes and those of the core points. During training, the distribution of the data must be continuously learned to distinguish normal from abnormal nodes. p_iu denotes the pseudo-label computed by the model, and q_iu denotes the distance distribution between the node features and the core points detected by DBSCAN. q_iu is defined as:

q_iu = (1 + ||z_i − u_u||²)^(−1) / Σ_k (1 + ||z_i − u_k||²)^(−1)  (10)

where u_u is the characterisation of a core point detected by DBSCAN. The formula measures the distance between a node's characterisation and a core point's: a node close enough to a core point is taken to belong to that group and treated as normal; a node far from the core points is treated as an outlier, i.e. as belonging to a false review group. The node label is obtained by:
s_i = argmax_u q_iu # (11)
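A small sketch of the distance distribution and label assignment above, assuming a Student-t kernel for q_iu; the toy embeddings and core points below are hypothetical stand-ins for what DBSCAN would return:

```python
def soft_assignment(H, centers):
    """Distance distribution q: similarity of each node embedding to each core point,
    using a Student-t kernel 1 / (1 + squared distance), row-normalized."""
    Q = []
    for h in H:
        sims = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(h, c)))
                for c in centers]
        s = sum(sims)
        Q.append([v / s for v in sims])
    return Q

H = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]   # toy node characterizations
centers = [[0.0, 0.0], [5.0, 5.0]]          # hypothetical DBSCAN core points
Q = soft_assignment(H, centers)
# Eq. (11): each node's label is its closest (highest-q) core point
labels = [max(range(len(q)), key=q.__getitem__) for q in Q]
```

The third node is far from the first core point, so under Eq. (11) it falls to the second group.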
The model uses KL divergence to measure the difference between the pseudo-labels and the learned distribution; KL divergence measures how a probability distribution Q differs from a reference probability distribution P. Unlike the labels obtained from equation (11), the target distribution p_iu is treated as the ground-truth label: it is computed from Q during training and updated stage by stage, serving as the self-supervision label of each stage. The main function of the target distribution is to supervise the learning of the model and guide the update of the distribution Q. The formula for P is as follows:

p_iu = (q_iu^2 / f_u) / Σ_u' (q_iu'^2 / f_u'),  with f_u = Σ_i q_iu # (12)
The loss function for the self-supervised optimization of the embedding is as follows:

L_c = KL(P‖Q) = Σ_i Σ_u p_iu·log(p_iu / q_iu) # (13)
2.3) Calculate the joint loss function. The overall loss function of the model consists of the graph reconstruction loss function and the self-supervised distribution loss function; the final loss function is:

L = α·L_r + β·L_c # (14)

where L_r is the reconstruction loss, L_c the distribution loss, and α and β control the weights of the two losses.
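The target distribution, the KL distribution loss, and the joint loss can be sketched as follows (pure Python; the soft-assignment matrix Q, the stand-in reconstruction loss value, and the weights alpha, beta are illustrative only):

```python
import math

def target_distribution(Q):
    """Sharpen Q into the self-supervising target P by squaring each assignment
    and normalizing by the soft cluster frequency (a DEC-style assumption)."""
    k = len(Q[0])
    f = [sum(row[u] for row in Q) for u in range(k)]  # soft cluster frequencies
    P = []
    for row in Q:
        w = [row[u] ** 2 / f[u] for u in range(k)]
        s = sum(w)
        P.append([v / s for v in w])
    return P

def kl_divergence(P, Q):
    """KL(P || Q): the self-supervised distribution loss L_c."""
    return sum(p * math.log(p / q)
               for prow, qrow in zip(P, Q)
               for p, q in zip(prow, qrow) if p > 0)

Q = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]  # illustrative soft assignments
P = target_distribution(Q)
L_c = kl_divergence(P, Q)

L_r = 0.35                    # stand-in value for the reconstruction loss
alpha, beta = 1.0, 0.5        # illustrative loss weights
L = alpha * L_r + beta * L_c  # joint loss, Eq. (14)
```

Squaring before renormalizing pushes each row of P toward its dominant cluster, which is what lets P supervise Q.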
2.4) Model training. Training proceeds as follows: set the initial hyper-parameters, including the number of aggregation layers of the graph attention network, the node-embedding dimension, the number of clusters for the KMeans clustering algorithm, and the number of training iterations.
During training, the hyper-parameters are tuned manually so that the detection performance of the model is optimal. In general, they are determined from the descent of the loss function during training or from the model's final detection results. After the hyper-parameters are set, the review network, its adjacency matrix, and related information are input to the model; the model is run and trained, and its detection performance after training is recorded. This process is repeated several times under the same set of hyper-parameters, and the average detection precision is taken as the final result. After training under one set of hyper-parameters, the hyper-parameters are adjusted by the control-variable method: one hyper-parameter is changed in the direction that increases the average precision while the others are held fixed. This adjustment is repeated, the set of hyper-parameters that maximizes the average detection precision is retained, and model training is complete.
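The control-variable tuning loop described above might be sketched as follows; here run_model is a hypothetical stand-in for one full train-and-evaluate cycle, and the synthetic score inside it exists only so the example runs:

```python
import random

def run_model(params, n_repeats=3):
    """Stand-in for training the model and measuring detection precision,
    averaged over n_repeats runs (synthetic score for illustration only)."""
    random.seed(0)
    score = -abs(params["k"] - 5) - abs(params["dim"] - 32) / 16
    return sum(score + random.uniform(-0.01, 0.01) for _ in range(n_repeats)) / n_repeats

def coordinate_search(params, grid):
    """Control-variable method: vary one hyper-parameter at a time,
    keep the value that raises the average precision, hold the rest fixed."""
    best = run_model(params)
    for name, candidates in grid.items():
        for v in candidates:
            trial = dict(params, **{name: v})
            s = run_model(trial)
            if s > best:
                best, params = s, trial
    return params, best

params, best = coordinate_search({"k": 3, "dim": 16},
                                 {"k": [3, 5, 7], "dim": [16, 32, 64]})
```

With the synthetic score above, the search settles on k = 5 clusters and a 32-dimensional embedding.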
And thirdly, detecting the false-comment groups. The real review network is detected with the model and hyper-parameters trained in the previous step, and the model's detection results on the review network are saved.
Table 1 Algorithm running process
In conjunction with the scheme of the present invention, the experimental analysis is performed as follows:
The invention verifies its detection effect on false-comment groups using an Amazon dataset processed by researchers; the basic statistics of the dataset are shown in Table 2. In the table, relationship type U-P-U indicates that two users have reviewed at least one common product; U-S-U indicates that two reviewers gave the same rating within one week; U-V-U indicates that two reviewers' comments are similar. The experiment is performed on four datasets: one for each of the three relationships above and one combining all three, namely the Amazon_p, Amazon_s, Amazon_v, and Amazon datasets.
Table 2 Basic statistics of the false-comment datasets used in the experiment
The experimental analysis of the false-reviewer-group detection method fusing complex relationships is divided into two parts: first, the method is compared with existing false-comment-group detection methods, with recall as the evaluation index, to verify its superiority; second, the training process and detection results are visualized, so that the rationality of the model design and the effectiveness of the detection can be analyzed more intuitively.
(1) Test result comparison experiment
Several false-comment-group detection methods proposed by researchers are compared with the present method. Graph-developer uses a graph-based approach to find target items and, on this basis, detects groups of false reviewers, solving the group detection problem through a 2-hop graph. Collueage uses a Markov random field to detect colluding false reviewers and false-comment activity. DeFrauder detects candidate fraud groups from the product-review graph combined with behavioral signals, maps the candidate groups into an embedding space, assigns each group a score, and finally determines the false-reviewer groups according to the scores. In addition to these comparison methods, to verify the effectiveness of the modules of the present method, two decoupled detection methods are added to the experiment: GCN+KMeans+DBSCAN and GAT+KMeans+DBSCAN. The first embeds the initial dataset with a GCN, the second with a GAT; after the embeddings are obtained, they are detected with both the KMeans clustering method and the DBSCAN method.
The experimental results of the present method and the comparison methods are shown in Table 3. Vertical comparison of the results shows that the present method clearly outperforms the other methods, with a large improvement in detection performance. GAT+KMeans+DBSCAN outperforms GCN+KMeans+DBSCAN, demonstrating the effectiveness of using GAT as the graph encoder: unlike GCN, GAT aggregates neighbor features according to the similarity between the central node and its neighbors, so the characterization of a normal node does not absorb large amounts of information from false nodes. Horizontal comparison shows that the present method obtains the best results under each of the three individual relationships as well as under all relationships combined, indicating that fusing the KMeans clustering algorithm into the deep-learning model and iteratively updating the core points during training yields more accurate detection results.
Table 3 Detection results
(2) Visualization of the training process and detection results
The visualization experiments aim to show the rationality of the method's design by analyzing the loss and the change in recall during training, and to show the effectiveness of the detection results intuitively by visualizing them.
Fig. 3 shows the change in recall during training. Overall, the recall of the detection results improves continuously as training proceeds, verifying the rationality of the model design.
The change of the loss function during training is shown in Fig. 4. Analyzing Figs. 3 and 4 together, the recall rises continuously as the loss function decreases, which indicates that while the method keeps updating the representation-learning result, that result also remains suitable for false-reviewer-group detection. Conversely, this shows that the designed loss function feeds the loss back to the model well and supervises its learning, solving the problem that a representation-learning result may be unsuited to the detection method.
Fig. 5 shows the clustering result of the model on the Amazon dataset; it can be seen that the method performs well on the false-reviewer-group detection problem. The black entities, concentrated at the lower left, represent the false-comment group; the gray entities, concentrated at the upper right, represent normal comment nodes.
The above embodiment only expresses an implementation of the present invention and should not be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make various variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.
Claims (3)
1. A false reviewer group detection method fusing complex relationships, characterized in that the method uses an attention-mechanism-based graph neural network to update the characterizations of comment nodes in a review network; designs a graph reconstruction loss and a self-supervised distribution loss for model training to obtain an optimal model; and applies the optimal model to false-reviewer-group detection to identify the false reviewer groups in the review network; the specific steps are as follows:
firstly, updating the node representations to obtain a reconstructed graph: an attention-mechanism-based graph neural network is used as the encoder; the initial features of the nodes are taken as the initial node embeddings, and the complex relationships between nodes are fused into the attention-based graph neural network so that the node characterizations express both high-order structural features and attribute features;
1.1) calculating node similarity, restricted to the first-order neighbor nodes of the central node, with the calculation formula:
c_ij = a(W·h_i, W·h_j) # (1)
in the formula, c_ij represents the importance of node j to node i, and W represents a weight matrix; h_i and h_j represent the feature vectors of node i and node j, respectively; a represents the function used to compute node similarity;
1.2) calculating a complex relation matrix; obtaining a complex relation matrix of the node by considering a high-order neighbor node of the node:
M = (B + B^2 + … + B^t) / t # (2)
where B is the transition matrix: when an edge exists between node i and node j, B_ij = 1/d_i, with d_i the degree of node i; when there is no edge between node i and node j, B_ij = 0; the matrix M is the complex relation matrix, and M_ij is the complex relation between node i and node j under order t;
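As an illustrative sketch of the complex relation matrix of step 1.2), computed on a hypothetical 3-node path graph (pure Python, no external libraries):

```python
def matmul(X, Y):
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def complex_relation_matrix(A, t):
    """Eq. (2): M = (B + B^2 + ... + B^t) / t, where B is the row-normalized
    transition matrix with B[i][j] = 1/d_i for each edge (i, j)."""
    n = len(A)
    deg = [sum(row) for row in A]
    B = [[A[i][j] / deg[i] if deg[i] else 0.0 for j in range(n)] for i in range(n)]
    M = [[0.0] * n for _ in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # B^0 = identity
    for _ in range(t):
        P = matmul(P, B)  # successively B^1, B^2, ..., B^t
        for i in range(n):
            for j in range(n):
                M[i][j] += P[i][j] / t
    return M

# Path graph 0-1-2: order t=2 captures the 2-hop relation between nodes 0 and 2
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
M = complex_relation_matrix(A, t=2)
```

Since B is row-stochastic, every power of B is row-stochastic, so each row of M also sums to 1; the nonzero entry M[0][2] shows the high-order relation that the plain adjacency matrix misses.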
1.3) fusing the complex relationship: the complex relation matrix M is fused with the attention-based graph neural network through a single-layer feedforward neural network, specifically by multiplying the complex relation matrix with the node similarity; with LeakyReLU selected as the activation function, the importance of node j to node i after fusing the complex relationship is rewritten as:

c_ij = M_ij · LeakyReLU( a(W·h_i, W·h_j) ) # (3)
1.4) updating the node representation: the softmax function normalizes the importance of the neighbor nodes so that the importance of each first-order neighbor to the central node lies in [0, 1], and the neighbor features are aggregated to update the node representation:

α_ij = exp(c_ij) / Σ_{k∈N_i} exp(c_ik) # (4)

h_i^(l+1) = σ( Σ_{j∈N_i} α_ij·W·h_j^(l) ) # (5)
in formula (4), α_ij represents the normalized attention coefficient, and N_i represents the first-order neighbor set of node i;
in formula (5), h_j^(l) is the representation of neighbor node j of node i at layer l, and h_i^(l+1) is the representation of node i at layer l+1; the final representation of a node is obtained by multi-layer aggregation;
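The fused attention update of steps 1.3)-1.4) can be sketched as below; for brevity the similarity function a(·,·) is taken as a dot product and the weight matrix W as the identity, both of which are learned parameters in the actual model, and all inputs are toy values:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(H, A, M):
    """One attention layer: fused importance, softmax over first-order
    neighbors (Eq. 4), then neighbor aggregation (Eq. 5, with tanh as sigma)."""
    n = len(H)
    H_new = []
    for i in range(n):
        nbrs = [j for j in range(n) if A[i][j]]
        # importance of each neighbor, scaled by the complex relation M_ij
        e = [M[i][j] * leaky_relu(sum(a * b for a, b in zip(H[i], H[j])))
             for j in nbrs]
        mx = max(e)
        w = [math.exp(v - mx) for v in e]   # numerically stable softmax
        s = sum(w)
        alpha = [v / s for v in w]
        H_new.append([math.tanh(sum(a * H[j][d] for a, j in zip(alpha, nbrs)))
                      for d in range(len(H[0]))])
    return H_new

A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                   # toy triangle graph
M = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]  # toy relation matrix
H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]                # toy node features
H1 = gat_layer(H, A, M)
```

Stacking several such layers gives the multi-layer aggregation of the final node representation.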
secondly, training the model: designing a graph reconstruction loss function and a self-supervised distribution loss function, updating the parameters of the attention-mechanism-based graph neural network model, and completing training; specifically:
2.1) calculating the graph reconstruction loss function: the encoder's output is used to reconstruct the graph's topological information, and the difference between the adjacency matrices gives the reconstruction loss between the reconstructed graph and the original graph; the formula is:

Â = σ(H·H^T) # (6)

where Â is the reconstructed adjacency matrix, H is the updated node characterization matrix, and σ is the activation function;
2.2) calculating the self-supervised distribution loss function: a self-supervised training mode is adopted, using pseudo-labels to optimize the node embedding representation; the nodes are clustered with a clustering algorithm, the core points in the review network are determined with the DBSCAN clustering algorithm, and the distance distribution between each node and the core points is calculated; KL divergence is used as the loss function to measure the difference between this distance distribution and the pseudo-labels;
2.3) calculating a joint loss function; the joint loss function expression is:
L = α·L_r + β·L_c # (7)
in the formula, L_r is the graph reconstruction loss function, L_c is the self-supervised distribution loss function, and α and β are the weights of the two loss functions;
2.4) model training, setting initial parameters of the graph neural network model based on the attention mechanism, and iterating the training process based on the joint loss function to obtain the optimal parameters of the graph neural network model based on the attention mechanism;
thirdly, detecting a false comment group; and detecting the real comment network by adopting the attention-based graph neural network model obtained in the second step, and storing the detection result.
2. The method for detecting the false reviewer population fusing the complex relationships according to claim 1, wherein the graph reconstruction loss function adopts a cross entropy loss function; the clustering algorithm for clustering the nodes adopts a KMeans clustering algorithm.
3. The method for detecting the false reviewer population fusing the complex relationship according to claim 2, wherein the model training in 2.4) is as follows:
setting initial parameters of the graph neural network model based on the attention mechanism, wherein the initial parameters comprise the number of aggregation layers, node embedding dimensions, the number of clustering of a KMeans clustering algorithm and training iteration times of the graph neural network model based on the attention mechanism;
continuously adjusting parameters in the training process of the model, and determining optimal parameters according to the descending condition of the joint loss function or the final detection result of the model in the training process;
the method specifically comprises: inputting the review network and its adjacency matrix into the model, running and training the model, and recording the detection performance after training; repeating the training several times under the same set of hyper-parameters and taking the average detection precision as the final result; after training under one set of parameters is completed, adjusting the parameters by the control-variable method, changing one parameter in the direction that increases the average precision while the other parameters are held fixed; and repeating this adjustment, retaining the set of parameters that maximizes the average discrimination precision of the model, and completing the model training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210449853.8A CN114742564B (en) | 2022-04-27 | 2022-04-27 | False reviewer group detection method integrating complex relations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114742564A true CN114742564A (en) | 2022-07-12 |
CN114742564B CN114742564B (en) | 2024-09-17 |
Family
ID=82282704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210449853.8A Active CN114742564B (en) | 2022-04-27 | 2022-04-27 | False reviewer group detection method integrating complex relations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114742564B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116737934A (en) * | 2023-06-20 | 2023-09-12 | 合肥工业大学 | Naval false comment detection algorithm based on semi-supervised graph neural network |
CN116993433A (en) * | 2023-07-14 | 2023-11-03 | 重庆邮电大学 | Internet E-commerce abnormal user detection method based on big data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110580341A (en) * | 2019-09-19 | 2019-12-17 | 山东科技大学 | False comment detection method and system based on semi-supervised learning model |
US20210089579A1 (en) * | 2019-09-23 | 2021-03-25 | Arizona Board Of Regents On Behalf Of Arizona State University | Method and apparatus for collecting, detecting and visualizing fake news |
CN112597302A (en) * | 2020-12-18 | 2021-04-02 | 东北林业大学 | False comment detection method based on multi-dimensional comment representation |
CN112732921A (en) * | 2021-01-19 | 2021-04-30 | 福州大学 | False user comment detection method and system |
Non-Patent Citations (2)
Title |
---|
Yin Chunyong; Zhu Yuhang: "False comment detection model based on vertically integrated Tri-training", Journal of Computer Applications, no. 08, 10 August 2020 (2020-08-10) *
Zeng Zhiyuan; Lu Xiaoyong; Xu Shengjian; Chen Musheng: "False comment detection based on a deep learning model with multi-layer attention mechanism", Computer Applications and Software, no. 05, 12 May 2020 (2020-05-12) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Krishnaiah et al. | Survey of classification techniques in data mining | |
CN112184391B (en) | Training method of recommendation model, medium, electronic equipment and recommendation model | |
WO2020008919A1 (en) | Machine learning device and method | |
CN109389151B (en) | Knowledge graph processing method and device based on semi-supervised embedded representation model | |
CN114742564B (en) | False reviewer group detection method integrating complex relations | |
CN113807422B (en) | Weighted graph convolutional neural network scoring prediction model integrating multi-feature information | |
CN111931505A (en) | Cross-language entity alignment method based on subgraph embedding | |
CN111584010B (en) | Key protein identification method based on capsule neural network and ensemble learning | |
CN111667466B (en) | Multi-objective optimization feature selection method for multi-classification of strip steel surface quality defects | |
CN112308115A (en) | Multi-label image deep learning classification method and equipment | |
CN113269647A (en) | Graph-based transaction abnormity associated user detection method | |
CN112529683A (en) | Method and system for evaluating credit risk of customer based on CS-PNN | |
CN113407864A (en) | Group recommendation method based on mixed attention network | |
CN115840853A (en) | Course recommendation system based on knowledge graph and attention network | |
CN110109005B (en) | Analog circuit fault testing method based on sequential testing | |
CN114036298B (en) | Node classification method based on graph convolution neural network and word vector | |
CN113837266B (en) | Software defect prediction method based on feature extraction and Stacking ensemble learning | |
CN114997476A (en) | Commodity prediction method fusing commodity incidence relation | |
CN112905906B (en) | Recommendation method and system fusing local collaboration and feature intersection | |
CN117194771B (en) | Dynamic knowledge graph service recommendation method for graph model characterization learning | |
CN117408735A (en) | Client management method and system based on Internet of things | |
CN111221915B (en) | Online learning resource quality analysis method based on CWK-means | |
CN113392334B (en) | False comment detection method in cold start environment | |
CN115730248A (en) | Machine account detection method, system, equipment and storage medium | |
CN114820074A (en) | Target user group prediction model construction method based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||