CN116166806A - Fake movie review detection method based on graph attention neural network - Google Patents

Fake movie review detection method based on graph attention neural network

Info

Publication number
CN116166806A
Authority
CN
China
Prior art keywords
node
movie
comment
meta
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310255641.0A
Other languages
Chinese (zh)
Inventor
王海舟
杨菲
陈雅宁
金地
周罡
王文贤
陈兴蜀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202310255641.0A priority Critical patent/CN116166806A/en
Publication of CN116166806A publication Critical patent/CN116166806A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a fake movie review detection method based on a graph attention neural network. A web crawler is first built to collect data from a movie platform, and the comment texts are manually labeled to construct a data set. A heterogeneous graph is then constructed from the collected user, movie, and comment data; the nodes of the graph are identified and their features are extracted using the TF-IDF algorithm, normalization, and the ConSERT model. A graph attention network over the heterogeneous graph is constructed, comprising a node-level attention mechanism and a meta-path-level attention mechanism, and comment node embeddings containing semantic information are obtained for classification. The invention is the first to detect fake movie reviews by capturing node context semantic information with a heterogeneous graph; it performs better than traditional text-based classification methods and provides a method and line of thinking for future fake movie review detection.

Description

Fake movie review detection method based on graph attention neural network
Technical Field
The invention relates to the field of network security in computer science and technology, and in particular to a fake movie review detection method based on a graph attention neural network.
Background
With the rapid development of the economy, people's living standards have improved greatly, and their demands on quality of life have risen accordingly. This has greatly promoted the development of the entertainment industry in China. Watching movies is one of the most common forms of leisure in daily life, and in recent years, with the development of online ticketing and online review platforms, consumers can select movies and purchase tickets more conveniently.
At the same time, however, online ticketing brings not only convenience but also a number of drawbacks. The rapid development of the film industry in recent years has brought great economic benefits, but it has also intensified vicious competition in the film market, leading to the phenomenon of film producers soliciting fake positive reviews for their own films and malicious reviews against competing films. The abuse of fake movie reviews not only damages the credibility of a platform and thus its interests, but also harms the rights and interests of consumers; moreover, unchecked fake reviews may well lead to a situation in which bad money drives out good in the film industry, impacting the development of the industry as a whole.
At present there is considerable related work on fake review detection, but it focuses almost entirely on fake reviews on e-commerce platforms. Because movie reviews differ in character from traditional product reviews, classical fake review detection methods perform poorly on fake movie reviews.
Disclosure of Invention
In view of the above problems, the invention aims to provide a fake movie review detection method based on a graph attention neural network, which captures node context semantic information with a heterogeneous graph to detect fake movie reviews and achieves a better detection effect. The technical solution is as follows:
A fake movie review detection method based on a graph attention neural network comprises the following steps:
Step 1: data set construction
Design a targeted crawler to collect, for a given movie platform and time period, the basic information of movies of various genres, the related movie comments, and the basic information of the users who posted the comments; label the comment text data and construct a movie comment data set;
Step 2: feature extraction
Extract keywords from each movie's synopsis and generate a movie feature vector with the TF-IDF algorithm; normalize the user's level, number of movies watched, number of past comments, and similar data to obtain a user feature vector; extract sentence vectors from the comment texts with the ConSERT framework based on the BERT model to obtain comment feature vectors;
Step 3: detection model
Construct a detection model based on a graph attention neural network, concatenating the extracted movie, user, and comment feature vectors as the input of the model; use a node-level attention mechanism to learn the weights of the meta-path-based neighbors and aggregate them to obtain first-step node embeddings; then use a meta-path-level attention mechanism to distinguish the different meta-paths, learn their weights, and combine the first-step node embeddings by weighted summation to obtain the final node embeddings for the classification task, and finally output the detection result.
Further, step 1 specifically comprises:
Step 1.1: build a URL from the movie name, construct a Request object with the Requests library, request the resource from the server, and receive the movie-related information; then parse the returned web page with the BeautifulSoup library to obtain the movie ID corresponding to the movie name;
Step 1.2: using an API interface of the movie platform, construct a URL from the movie ID, request the resource with Requests, and receive a JSON file containing the detailed movie information and the movie comment information;
Step 1.3: obtain the JSON file of the user's homepage from the user ID information contained in the movie comment information, so as to obtain the user information;
Step 1.4: formulate a data labeling standard for fake movie reviews and label the extracted movie comment records.
Still further, the movie details include: the movie ID, movie score, director, movie score distribution, release time, genre, the number of users who want to watch the movie, and the number of users who have watched the movie; the movie comment information includes: the comment content, comment score, number of likes, number of replies, comment time, and commenting user's name; the user information includes: the user ID, user level, ticket purchase information, total number of the user's comments, total number of topics the user participates in, the number of movies the user wants to watch, and the number of movies the user has watched.
Furthermore, the extraction of the movie feature vector in step 2 specifically comprises:
Step 2.1.1: perform data preprocessing, comprising word segmentation, part-of-speech tagging, and stop-word removal, on a given movie synopsis text D to obtain n candidate keywords, i.e. $D = [t_1, t_2, \ldots, t_n]$;
Step 2.1.2: calculate the term frequency TF of word $t_i$ in text D;
Step 2.1.3: calculate the inverse document frequency of word $t_i$ over the whole corpus,
$\mathrm{IDF}(t_i) = \log\dfrac{D_n}{D_t + 1}$
where $D_t$ is the number of documents in the corpus in which word $t_i$ appears and $D_n$ is the total number of documents;
Step 2.1.4: calculate the TF-IDF value of word $t_i$, and then the TF-IDF values of all candidate keywords;
Step 2.1.5: sort the candidate keywords in descending order of their TF-IDF values and take the top N words as the keywords of the movie synopsis.
Furthermore, the extraction of the comment vectors in step 2 adopts the ConSERT framework based on the BERT model, which fine-tunes the BERT model and comprises:
generating different input samples for the embedding layer with a data enhancement module;
computing a sentence representation for each input comment text with a shared BERT encoder; during training, the sentence representation is obtained by average pooling of the last layer's token embeddings;
arranging a contrastive loss layer on top of the BERT encoder, which maximizes the agreement between a sentence representation and its corresponding augmented sentence while minimizing its similarity to the other sentence representations in the same batch;
for each input comment text x, the fine-tuned BERT model first passes it to the data enhancement module, which applies two transformations $T_1$ and $T_2$ to generate two versions of the token embeddings, $e_i = T_1(x)$ and $e_j = T_2(x)$, where $e_i, e_j \in \mathbb{R}^{L \times d}$, L is the sequence length and d is the hidden dimension; $e_i$ and $e_j$ are then fed into the shared BERT encoder, encoded by the multi-layer Transformer in BERT, and the encoded results are average-pooled to generate the sentence vectors $r_i$ and $r_j$.
Further, step 3 specifically comprises:
Step 3.1: take the heterogeneous graph formed by the movie nodes M, the user nodes U, and the comment nodes C as the input of the model; the sentence vectors obtained for the task target nodes C are denoted $\{h_1, h_2, \ldots, h_n\}$, where n is the total number of target nodes C; there are two meta-paths: $\Phi_1$ = C-U-C, meaning that two comments are posted by the same user, and $\Phi_2$ = C-M-C, meaning that two comments are posted on the same movie, i.e. by users watching the same movie, the inverse arrow in a meta-path denoting the reverse of the corresponding relation;
step 3.2: node embedding based on node's attention mechanism
Step 3.2.1: determining an attention value of a node level of a given pair of nodes (i, j) connected by a meta-path phi
Figure BDA0004129484790000034
This value also means the importance of node j to node i, which is calculated as:
Figure BDA0004129484790000035
wherein ,h'i and h'j Att is the characteristics of node i and node j, respectively node A deep neural network representing the attention of the execution node level; given meta-path Φ, att node Shared for all pairs of meta-path based nodes;
step 3.2.2: for the obtained attention value
Figure BDA0004129484790000036
And carrying out normalization operation to obtain the attention coefficient after normalization, wherein the attention coefficient is shown in the following formula:
Figure BDA0004129484790000037
wherein ,
Figure BDA0004129484790000041
for the attention coefficient after normalization, σ (·) is the activation function, ++>
Figure BDA0004129484790000042
Transpose of the node level attention vector for the meta-path Φ, ||represents the join, |is +.>
Figure BDA0004129484790000043
Neighbor nodes based on the element path phi are used as the node i;
step 3.2.2: embedding attention coefficients of nodes based on meta-path Φ of node i
Figure BDA0004129484790000044
Embedding the characteristic of the neighbor around the node to make a weighted summation, and then obtaining a characteristic representation corresponding to the node i through an activation function
Figure BDA0004129484790000045
The following formula is shown:
Figure BDA0004129484790000046
after all nodes obtain the characteristics, the characteristics Z corresponding to each node in the element path phi are obtained Φ The method comprises the steps of carrying out a first treatment on the surface of the Similarly, a set of meta-paths { Φ }, is given 01 ,...,Φ P P group node embeddings, denoted as
Figure BDA0004129484790000047
Step 3.3: node embedding based on attention mechanism of meta-path
Step 3.3.1: selecting all nodes under a meta-path phi, enabling each node to pass through a full connection layer, an activation function and multiplying a learnable parameter q to obtain the corresponding node i under the meta-path phiScalar of (2); the same operation is carried out on all nodes, and then weighted summation is carried out, and the weighted summation is divided by the number of the nodes to average, thus obtaining the importance of the element path phi
Figure BDA0004129484790000048
The following formula is shown:
Figure BDA0004129484790000049
wherein W is a weight matrix, b is a bias vector, q is a semantic-level attention vector, and V is the number of nodes;
step 3.3.2: normalizing by a softmax operation to obtain a meta-path phi i The weights of (2) are expressed as
Figure BDA00041294847900000410
The following formula is shown: />
Figure BDA00041294847900000411
Wherein P is the number of meta paths;
step 3.3.3: fusing the embedding of the nodes, and carrying out weighted summation on the nodes to obtain the final embedded Z i The following formula is shown:
Figure BDA00041294847900000412
step 3.4: embedding the obtained target task nodes into a full-connection layer to output classification results;
the loss function of the detection model based on the graph attention neural network is a minimized cross entropy loss function, and the loss function is shown in the following formula:
Figure BDA0004129484790000051
wherein ,yL is a sample node index set, Y l and Z l Is the label and embedding of the node, C is the parameter of the classifier, and L is the minimum cross entropy loss function value.
The beneficial effects of the invention are as follows: a web crawler is first built to collect and preprocess data from a movie platform, and the comment texts are then rigorously labeled by hand to construct a data set; a heterogeneous graph is constructed from the collected user, movie, and comment data, the key points being to identify the nodes of the graph and extract their features, for which the TF-IDF algorithm, normalization, and the ConSERT model are used; a graph attention network over the heterogeneous graph is constructed, comprising two levels of attention, a node-level attention mechanism and a meta-path-level attention mechanism, and comment node embeddings containing semantic information are finally obtained for classification. The invention is the first to detect fake movie reviews by capturing node context semantic information with a heterogeneous graph; experimental evaluation shows that the proposed model performs better than traditional text-based classification methods, and the invention provides a method and line of thinking for future fake movie review detection.
Drawings
FIG. 1 is an overall framework diagram of the fake movie review detection method based on a graph attention neural network according to the invention.
FIG. 2 is the crawler flow chart of the invention.
FIG. 3 is a diagram of the ConSERT framework used in the invention.
FIG. 4 is a diagram of the detection model based on the graph attention neural network according to the invention.
FIG. 5 is a comparison of the effects of different meta-paths.
FIG. 6 is a comparison of different sentence vector models.
FIG. 7 is a comparison of different detection models.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples.
As shown in fig. 1, the overall framework of the fake movie review detection method based on the graph attention neural network mainly comprises three parts: data set construction, feature extraction, and the detection model.
(1) Data set construction: the invention develops a web crawler to acquire data; based on a relevant movie platform, it collects the basic information of movies of various genres released in the last five years, together with the related movie comments and the basic information of the commenting users. The crawled comment texts are then labeled, and the constructed movie and comment data set provides the data support for the invention.
(2) Feature extraction: the core work of this part is to analyze and extract features and generate an initial feature vector for each node. Features are extracted separately for movies, users, and comments: keywords are extracted from each movie's synopsis and the movie feature vectors are generated with the TF-IDF algorithm; the user's level, number of movies watched, past comment data, and similar attributes are normalized to obtain the user feature vectors; sentence vectors are extracted from the comment texts with the ConSERT framework based on the BERT model to obtain the comment feature vectors. A heterogeneous graph is then built on top of the features of these three types of nodes as the input of the model.
(3) Detection model: in this part, the invention uses a heterogeneous graph neural network containing different types of nodes and edges to generate the node embeddings of the target task for classification; the feature vectors generated in the feature extraction module are concatenated as the input of the model. In addition, the invention introduces a node-level attention mechanism and a meta-path-level attention mechanism to fully exploit the various kinds of semantic information contained in the heterogeneous graph and obtain better node embedding representations for classification.
1. Data set construction
At present there is little research on fake movie review detection, so reliable fake movie review data sets are lacking. The invention develops a web crawler that collects content from a movie review platform in a targeted way according to a certain strategy, processes the collected data, and then labels it manually to construct the fake movie review data set; the collected data comprise the basic movie information, the related movie comments, and the basic information of the users who posted the comments.
(1) Data collection method
This embodiment adopts a targeted crawler to collect the various kinds of data; the specific flow is shown in fig. 2.
First, a URL is built from the movie name, a Request object is constructed with the Requests library, the resource is requested from the server, and the movie-related information is returned; the returned web page is then parsed with the BeautifulSoup library to obtain the movie ID corresponding to the movie name. Next, using an API interface of the Maoyan (cat eye) platform, a URL is constructed from the movie ID, the resource is requested with Requests, and a JSON file containing the movie information and the movie comment information is returned. Finally, the JSON file of the user's homepage is obtained in the same way from the user ID information contained in the movie comment information, so as to obtain the user information. Table 1 lists the API interfaces used in the crawling process.
Table 1. API interfaces used by the crawler (table reproduced as an image in the original)
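For illustration, a minimal sketch of this crawling flow is given below using the Requests and BeautifulSoup libraries; the host, endpoint paths, CSS selector, and JSON field names are hypothetical placeholders rather than the platform's actual interface.

```python
# Minimal crawler sketch; endpoint paths, query parameters, selectors, and field names
# below are hypothetical placeholders, not the real platform API.
import requests
from bs4 import BeautifulSoup

BASE = "https://example-movie-platform.com"   # placeholder host
HEADERS = {"User-Agent": "Mozilla/5.0"}

def search_movie_id(movie_name):
    """Step 1: request the search page for the movie name and parse out the movie ID."""
    resp = requests.get(f"{BASE}/search", params={"kw": movie_name}, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    first_hit = soup.select_one("a.movie-item")            # hypothetical selector
    return first_hit["href"].rstrip("/").split("/")[-1] if first_hit else None

def fetch_movie_and_comments(movie_id, offset=0):
    """Step 2: call the platform API with the movie ID and return the JSON payload."""
    url = f"{BASE}/api/movie/{movie_id}/comments"          # hypothetical endpoint
    resp = requests.get(url, params={"offset": offset}, headers=HEADERS, timeout=10)
    return resp.json()

def fetch_user(user_id):
    """Step 3: fetch the commenting user's homepage JSON from the user ID."""
    resp = requests.get(f"{BASE}/api/user/{user_id}", headers=HEADERS, timeout=10)
    return resp.json()

if __name__ == "__main__":
    movie_id = search_movie_id("流浪地球")
    if movie_id:
        payload = fetch_movie_and_comments(movie_id)
        for comment in payload.get("comments", []):        # field name is an assumption
            user = fetch_user(comment["userId"])
```

A production crawler would additionally need pagination of the comment endpoint, rate limiting, and error handling.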
(2) Data collection policy
In order to improve the accuracy of the data annotation and the generality of the data, this embodiment collects the detailed information, movie comments, and commenting-user information of movies of various genres shown on the Maoyan platform in the five years from 2017 to 2021.
The movie details include the movie ID, movie score, director, lead actors, movie score distribution, release time, genre, number of viewers, etc.; Table 2 lists the data fields obtained by the movie homepage crawler. The movie comment information includes the comment content, comment score, number of likes, number of replies, comment time, commenting user's name, etc.; Table 4 lists the data fields obtained by the movie comment crawler. The user information includes the user ID, user level, ticket purchase information, total number of the user's comments, total number of topics the user participates in, the number of movies the user wants to watch, the number of movies the user has watched, etc.; Table 3 lists the data fields obtained by the user information crawler.
In this embodiment, movies of various genres (40 genres including action, romance, horror, science fiction, etc.) released in the five years from 2017 to 2021 were collected, covering 2,352 movies and 734,130 comment records in total.
Table 2. Movie API fields (table reproduced as an image in the original)
Table 3. User API fields
Field | Description | Example
wishCount | Number of movies the user wants to watch | 5
viewedCount | Number of movies the user has watched | 17
cCount | Total number of the user's comments | 4
watchingCount | Number of topics the user participates in | 2
Table 4. Comment API fields (table reproduced as an image in the original)
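As an illustration of how the crawled records can be organized for the later feature extraction, the sketch below defines three record types; the field set only loosely mirrors the tables above and is not the platform's actual API schema.

```python
# Illustrative record structures for the crawled data; field names only loosely mirror
# the tables above and are not the platform's actual API schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Movie:
    movie_id: str
    score: float
    director: str
    lead_actors: List[str] = field(default_factory=list)
    score_distribution: dict = field(default_factory=dict)
    release_time: str = ""
    genre: str = ""
    wish_count: int = 0        # users who want to watch the movie
    viewed_count: int = 0      # users who have watched the movie

@dataclass
class Comment:
    comment_id: str
    movie_id: str
    user_id: str
    content: str
    score: float
    like_count: int = 0
    reply_count: int = 0
    comment_time: str = ""
    label: int = -1            # 1 = fake, 0 = real, -1 = unlabeled

@dataclass
class User:
    user_id: str
    level: int = 0
    ticket_info: str = ""
    comment_count: int = 0     # cCount
    topic_count: int = 0       # watchingCount
    wish_count: int = 0        # wishCount
    viewed_count: int = 0      # viewedCount
```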
(3) Data annotation
In this embodiment, referring to the 30 ways of spotting fraudulent online reviews ("30 Ways You Can Spot Fake Online Reviews") proposed by the Consumerist website and to the characteristics of fake reviews of the relevant movies, the following data labeling standards for fake movie reviews are formulated. The specific criteria are as follows:
1) The comment content is completely unrelated to the movie being reviewed;
2) The comment and the score do not match: the text is critical but the score is high, or the text is complimentary but the score is low;
3) The comment is exaggerated and excessively laudatory, filled with flowery adjectives and pure enthusiastic praise without mentioning any drawback;
4) The comment uses a formulaic sentence pattern, and similar comments appear in large numbers under other movies;
5) The comment gives an excessively high or low score simply because the reviewer likes or dislikes a certain movie star or character.
A comment is considered fake as long as it meets any of the above criteria. In total, 472 movies with 65,627 comment records were labeled, of which 20,114 are fake comment records; the detailed statistics are shown in Table 5.
Table 5. Dataset information
Movie year | Number of movies | Real comments | Fake comments | Total
2017 | 125 | 11,567 | 3,860 | 15,427
2018 | 63 | 6,508 | 3,446 | 9,954
2019 | 96 | 12,358 | 4,665 | 17,023
2020 | 106 | 8,496 | 4,135 | 12,631
2021 | 82 | 6,584 | 4,008 | 10,592
Total | 472 | 45,513 | 20,114 | 65,627
2. Feature extraction
(1) Movie feature
For the movie-based features, the invention selects the movie synopsis as the basic information for feature extraction: a user's comment normally revolves around one or several aspects of the movie's content, so the synopsis can effectively serve as the feature representation of the movie and help judge fake reviews. To better find the commonalities among different movies and remove noise, the method first extracts keywords from the synopsis to condense the text information, and then converts them into vectors to obtain the vector features.
The keyword extraction procedure is as follows:
1) Perform data preprocessing on the given text D, including word segmentation, part-of-speech tagging, and stop-word removal. The method uses a Chinese word segmentation tool and retains words of several parts of speech, such as nouns, other proper nouns, verbs, auxiliary verbs, nominal verbs, adjectives, and adverbs, finally obtaining n candidate keywords, i.e. $D = [t_1, t_2, \ldots, t_n]$. The stop-word list used is the Chinese stop-word list released by the Chinese natural language processing open platform of the Chinese Academy of Sciences, which contains 1,208 stop words;
2) Calculate the term frequency of word $t_i$ in text D;
3) Calculate the inverse document frequency of word $t_i$ over the whole corpus, $\mathrm{IDF}(t_i) = \log\dfrac{D_n}{D_t + 1}$, where $D_t$ is the number of documents in the corpus in which word $t_i$ appears and $D_n$ is the total number of documents;
4) Calculate the TF-IDF value of word $t_i$; repeat the above steps to obtain the TF-IDF values of all candidate keywords;
5) Sort the candidate keywords in descending order of their TF-IDF values and take the top N words as the text keywords.
Ten words are selected as the text keywords of each movie's synopsis, and TF-IDF is then used to obtain a TF-IDF weight matrix as the movie feature.
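A minimal sketch of this keyword and TF-IDF feature step is shown below, assuming the synopses have already been segmented into space-separated tokens; the tiny corpus and stop-word list are placeholders, and scikit-learn's TfidfVectorizer stands in for the step-by-step computation above.

```python
# Minimal TF-IDF keyword / movie-feature sketch; whitespace-joined tokens stand in for the
# output of a real Chinese word segmenter, and the stop-word list is a placeholder.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Pretend these synopses have already been segmented and joined with spaces.
synopses = [
    "太阳 即将 毁灭 人类 启动 流浪 地球 计划",
    "警察 卧底 黑帮 陷入 双重 身份 危机",
]
stop_words = ["的", "了", "即将"]                      # placeholder stop-word list

vectorizer = TfidfVectorizer(stop_words=stop_words, token_pattern=r"(?u)\S+")
tfidf = vectorizer.fit_transform(synopses)             # TF-IDF weight matrix used as movie features
vocab = np.array(vectorizer.get_feature_names_out())

# Top-N keywords per synopsis (N = 10 in the text; 3 here for the toy example).
N = 3
for row in tfidf.toarray():
    print(list(vocab[np.argsort(row)[::-1][:N]]))
```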
(2) User features
The user-based features are extracted from the user's personal homepage and posted comments, including the number of movies the user has watched, the number of comments posted, the number of topics participated in, the user level, the number of likes received by the user's comments, the comment scores, the time of each comment, and the time of the comment relative to the movie. These features help determine the authenticity of the movie reviews published by the user from the user's basic information and behavior.
Because the individual user feature indicators differ in nature, they exhibit different dimensions and magnitudes. When the differences between indicators are large, analyzing the raw values would exaggerate the influence of high-valued indicators and relatively weaken that of low-valued ones. Therefore, to ensure the credibility of the results, the raw indicators must be normalized. The normalization is given by formula (1):
$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (1)$
where x' is the normalized feature value and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the feature; for indicators consisting of several values, such as the user's movie scores, the normalized values are averaged.
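A minimal NumPy sketch of the normalization in formula (1), assuming the min-max form written above; the example feature matrix (user level, movies watched, comment count, likes received) is a placeholder.

```python
# Min-max normalization of the user feature columns (formula (1)); the example matrix
# (user level, movies watched, comment count, likes received) is a placeholder.
import numpy as np

user_features = np.array([
    [3, 17, 4, 120],
    [1,  2, 0,   5],
    [5, 60, 9, 800],
], dtype=float)

col_min = user_features.min(axis=0)
col_max = user_features.max(axis=0)
# Guard against constant columns to avoid division by zero.
normalized = (user_features - col_min) / np.where(col_max > col_min, col_max - col_min, 1.0)
print(normalized)
```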
(3) Comment features
In the invention, the extraction of comment features is essentially a process of learning sentence representations. Related research has found that the sentence representations produced by BERT suffer from collapse and do not work well on some downstream tasks, so the method uses ConSERT, which is based on BERT, to complete the sentence vector representation task. ConSERT fine-tunes BERT, effectively alleviating the collapse problem and making the sentence vectors more suitable for downstream tasks; the specific framework is shown in fig. 3.
There are three main components in the framework:
Data enhancement module: generates different input samples for the embedding layer.
Shared BERT encoder: computes a sentence representation for each input text. During training, the sentence representation is obtained by average pooling of the last layer's token embeddings.
Contrastive loss layer: on top of the BERT encoder, a contrastive loss layer maximizes the agreement between a sentence representation and its corresponding augmented sentence while minimizing its similarity to the other sentence representations in the same batch.
For each input text x, the model first passes it to the data enhancement module, which applies two transformations $T_1$ and $T_2$ to generate two versions of the token embeddings, $e_i = T_1(x)$ and $e_j = T_2(x)$, where $e_i, e_j \in \mathbb{R}^{L \times d}$, L is the sequence length and d is the hidden dimension. $e_i$ and $e_j$ are then fed into the shared BERT encoder and encoded by the multi-layer Transformer in BERT, as shown in fig. 3, and the encoded results are average-pooled to produce the sentence representations $r_i$ and $r_j$.
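A simplified sketch of ConSERT-style contrastive fine-tuning is shown below. It uses bert-base-chinese, average pooling of the last hidden layer, and an NT-Xent contrastive loss; for brevity the two "views" are produced by two dropout-enabled forward passes, whereas the original ConSERT perturbs the embedding layer with token shuffling, cutoff, or adversarial noise, so the augmentation here is an approximation rather than the paper's exact module.

```python
# Simplified ConSERT-style contrastive fine-tuning sketch; the dropout-based augmentation
# is a stand-in for ConSERT's embedding-level augmentations (see the note above).
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese").to(device)

def encode(texts):
    """Average-pool the last hidden layer to get one sentence vector per text (r_i)."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt").to(device)
    out = encoder(**batch).last_hidden_state                  # (B, L, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # (B, L, 1)
    return (out * mask).sum(1) / mask.sum(1)                  # mean over real tokens

def nt_xent(r1, r2, temperature=0.1):
    """Pull the two views of each sentence together, push apart from the rest of the batch."""
    z = F.normalize(torch.cat([r1, r2], dim=0), dim=1)        # (2B, d)
    sim = z @ z.t() / temperature
    sim = sim.masked_fill(torch.eye(z.size(0), dtype=torch.bool, device=z.device),
                          float("-inf"))                      # a view is not its own positive
    B = r1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)

optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
comments = ["这部电影剧情紧凑，演员表现出色。", "特效很差，完全不值票价。"]   # placeholder batch

encoder.train()
optimizer.zero_grad()
r_i, r_j = encode(comments), encode(comments)   # two dropout-perturbed views of the same batch
loss = nt_xent(r_i, r_j)
loss.backward()
optimizer.step()
```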
3. Detection model
The invention proposes a detection model based on a graph attention neural network, whose core idea is to use the heterogeneous graph to obtain task-specific node embeddings. First, a node-level attention mechanism is used to learn the weights of the meta-path-based neighbors and aggregate them to obtain first-step node embeddings. Then, a meta-path-level attention mechanism is used to distinguish the meta-paths and combine the first-step node embeddings by weighted summation to obtain the final node embeddings for the classification task. The input layer, the node-level attention mechanism, the meta-path-level attention mechanism, and the output layer are described in detail below; the model is shown in fig. 4.
(1) Input layer
The model input adopted by the method is a heterogeneous graph consisting of the movie nodes M, the user nodes U, and the comment nodes C. The information contained in these three kinds of nodes is the feature information produced by the preceding feature extraction, so the specific process is not repeated here. The sentence vectors obtained for the task target nodes C are denoted $\{h_1, h_2, \ldots, h_n\}$, where each $h_i$ is a 768-dimensional vector. The invention uses two meta-paths: $\Phi_1$ = C-U-C, which indicates that two comments are posted by the same user, and $\Phi_2$ = C-M-C, which indicates that two comments are posted on the same movie, i.e. by users watching the same movie.
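A minimal sketch of how the two meta-path adjacency matrices (C-U-C and C-M-C) can be built from comment-to-user and comment-to-movie edge lists with sparse matrix products; the toy index arrays are hypothetical.

```python
# Build meta-path based comment-comment adjacency matrices from bipartite edge lists;
# the toy comment->user and comment->movie index arrays are hypothetical.
import numpy as np
import scipy.sparse as sp

def metapath_adjacency(mid_idx, n_comments, n_mid):
    """mid_idx[i] = index of the user (or movie) attached to comment i.
    Returns a boolean comment-by-comment adjacency A = B @ B.T with self-loops."""
    rows = np.arange(n_comments)
    B = sp.csr_matrix((np.ones(n_comments), (rows, mid_idx)), shape=(n_comments, n_mid))
    A = ((B @ B.T) > 0).tolil()   # comments sharing the same user/movie become neighbours
    A.setdiag(True)               # each comment is its own neighbour
    return A.tocsr()

comment_user = np.array([0, 0, 1, 2, 2])    # which user wrote each of 5 comments
comment_movie = np.array([0, 1, 0, 1, 1])   # which movie each comment is about

A_cuc = metapath_adjacency(comment_user, n_comments=5, n_mid=3)    # meta-path C-U-C
A_cmc = metapath_adjacency(comment_movie, n_comments=5, n_mid=2)   # meta-path C-M-C
print(A_cuc.toarray().astype(int))
print(A_cmc.toarray().astype(int))
```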
(2) Node-based attention mechanism
The node-based attention mechanism is divided into three steps altogether.
1) First, compute the node-level attention value $e_{ij}^{\Phi}$ of a given pair of nodes (i, j) connected by the meta-path $\Phi$; this value measures the importance of node j to node i, and its calculation is shown in equation (2):
$e_{ij}^{\Phi} = \mathrm{att}_{node}\left(h'_i, h'_j; \Phi\right) \quad (2)$
where $h'_i$ and $h'_j$ are the features of node i and node j, respectively, and $\mathrm{att}_{node}$ denotes the deep neural network that performs node-level attention. Given the meta-path $\Phi$, $\mathrm{att}_{node}$ is shared by all meta-path-based node pairs, because similar connection patterns exist under one meta-path. The formula shows that, given the meta-path $\Phi$, the weight of the meta-path-based node pair (i, j) depends on their features.
2) Second, normalize the obtained attention values to get the normalized attention coefficients, as shown in equation (3):
$\alpha_{ij}^{\Phi} = \mathrm{softmax}_j\left(e_{ij}^{\Phi}\right) = \dfrac{\exp\left(\sigma\left(\mathbf{a}_{\Phi}^{T} \cdot \left[h'_i \,\|\, h'_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i^{\Phi}} \exp\left(\sigma\left(\mathbf{a}_{\Phi}^{T} \cdot \left[h'_i \,\|\, h'_k\right]\right)\right)} \quad (3)$
where $\alpha_{ij}^{\Phi}$ is the normalized attention coefficient, $\sigma(\cdot)$ is the activation function, $\mathbf{a}_{\Phi}^{T}$ is the transpose of the node-level attention vector for the meta-path $\Phi$, $\|$ denotes concatenation, and $\mathcal{N}_i^{\Phi}$ is the set of meta-path-$\Phi$-based neighbors of node i (including itself).
3) Third, the meta-path-$\Phi$-based embedding of node i is obtained by weighting the features of its surrounding neighbors with the node attention coefficients $\alpha_{ij}^{\Phi}$ and summing, followed by an activation function, giving the feature representation $z_i^{\Phi}$ corresponding to node i, as shown in equation (4):
$z_i^{\Phi} = \sigma\left(\sum_{j \in \mathcal{N}_i^{\Phi}} \alpha_{ij}^{\Phi} \cdot h'_j\right) \quad (4)$
Each node's embedding is aggregated from its neighbors. Since the attention weight $\alpha_{ij}^{\Phi}$ is generated for a single meta-path, it is semantic-specific and captures the semantic information under that meta-path. The corresponding features of the other nodes are obtained in the same way; after all nodes have obtained their features, the feature $Z_{\Phi}$ corresponding to each node under the meta-path $\Phi$ is obtained. Similarly, given a set of meta-paths $\{\Phi_0, \Phi_1, \ldots, \Phi_P\}$, P groups of node embeddings are obtained, denoted $\{Z_{\Phi_0}, Z_{\Phi_1}, \ldots, Z_{\Phi_P}\}$.
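A minimal PyTorch sketch of the node-level attention of equations (2)-(4) for a single meta-path, written as a dense GAT-style layer; the layer sizes and the LeakyReLU/ELU activations are assumptions, and a real implementation would use sparse neighborhoods instead of a dense N-by-N mask.

```python
# Dense GAT-style node-level attention for one meta-path (equations (2)-(4)); sizes and
# activation choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeLevelAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)                      # projection to h'
        self.attn = nn.Parameter(torch.empty(2 * out_dim).uniform_(-0.1, 0.1))  # a_Phi

    def forward(self, h, adj):
        """h: (N, in_dim) node features; adj: (N, N) 0/1 meta-path adjacency with self-loops."""
        hp = self.proj(h)                                                # h'
        N = hp.size(0)
        pairs = torch.cat([hp.unsqueeze(1).expand(N, N, -1),
                           hp.unsqueeze(0).expand(N, N, -1)], dim=-1)    # [h'_i || h'_j]
        e = F.leaky_relu(pairs @ self.attn)                              # e_ij, eq. (2)
        e = e.masked_fill(adj == 0, float("-inf"))                       # keep meta-path neighbours only
        alpha = torch.softmax(e, dim=1)                                  # alpha_ij, eq. (3)
        return F.elu(alpha @ hp)                                         # z_i, eq. (4)

# Toy usage: 5 comment nodes with 8-dimensional features under one meta-path.
h = torch.randn(5, 8)
adj = torch.eye(5)
adj[0, 1] = adj[1, 0] = 1.0
z = NodeLevelAttention(8, 16)(h, adj)
print(z.shape)   # torch.Size([5, 16])
```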
(3) Attention mechanism based on meta-path
In general, each node in the heterogeneous graph contains multiple types of semantic information, which can be revealed by the meta-paths. To handle the selection among the semantic information revealed by different meta-paths, the method uses a meta-path-level attention mechanism to automatically learn the importance of the semantics brought by different meta-paths and fuse them. The meta-path-level attention mechanism can also be divided into the following three steps.
1) First, take all nodes under a meta-path $\Phi$; pass each node through a fully connected layer and an activation function, then multiply by a learnable parameter q to obtain the scalar corresponding to node i under the meta-path $\Phi$. Perform the same operation on all nodes, sum the results, and finally divide by the number of nodes to average, giving the importance of the meta-path $\Phi$, denoted $w_{\Phi}$, as shown in equation (5):
$w_{\Phi} = \dfrac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \mathbf{q}^{T} \cdot \tanh\left(\mathbf{W} \cdot z_i^{\Phi} + \mathbf{b}\right) \quad (5)$
where W is the weight matrix, b is the bias vector, and q is the semantic-level attention vector.
2) Second, after obtaining the importance of each meta-path, normalize it with a softmax operation to obtain the weight of the meta-path $\Phi_i$, denoted $\beta_{\Phi_i}$, as shown in equation (6):
$\beta_{\Phi_i} = \dfrac{\exp\left(w_{\Phi_i}\right)}{\sum_{p=1}^{P} \exp\left(w_{\Phi_p}\right)} \quad (6)$
where P is the number of meta-paths. $\beta_{\Phi_i}$ can be interpreted as the contribution of the meta-path $\Phi_i$ to the specific task; clearly, the higher $\beta_{\Phi_i}$ is, the more important the meta-path $\Phi_i$.
3) Third, fuse the node embeddings by weighted summation to obtain the final embedding $Z_i$, as shown in equation (7):
$Z_i = \sum_{p=1}^{P} \beta_{\Phi_p} \cdot z_i^{\Phi_p} \quad (7)$
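A minimal PyTorch sketch of the meta-path-level attention of equations (5)-(7); the hidden size of the projection is an assumption.

```python
# Meta-path (semantic) level attention over the per-meta-path embeddings (equations (5)-(7)).
import torch
import torch.nn as nn

class MetaPathLevelAttention(nn.Module):
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())  # W z + b, then tanh
        self.q = nn.Linear(hidden_dim, 1, bias=False)                            # attention vector q

    def forward(self, z):
        """z: (P, N, d) node embeddings, one slice per meta-path."""
        w = self.q(self.project(z)).mean(dim=1)         # (P, 1): importance w_Phi, eq. (5)
        beta = torch.softmax(w, dim=0)                   # (P, 1): weights beta_Phi, eq. (6)
        return (beta.unsqueeze(-1) * z).sum(dim=0)       # (N, d): fused embeddings Z, eq. (7)

# Toy usage with the two meta-paths of the method (C-U-C and C-M-C).
z_cuc = torch.randn(5, 16)
z_cmc = torch.randn(5, 16)
fused = MetaPathLevelAttention(16)(torch.stack([z_cuc, z_cmc]))
print(fused.shape)   # torch.Size([5, 16])
```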
(4) Output layer
Finally, the obtained target-task node embeddings are fed into a fully connected layer to obtain the output for classification; the loss function of the model is the minimized cross-entropy loss, as shown in equation (8):
$L = -\sum_{l \in \mathcal{Y}_L} Y_l \ln\left(C \cdot Z_l\right) \quad (8)$
where $\mathcal{Y}_L$ is the set of labeled node indices, $Y_l$ and $Z_l$ are the label and the embedding of node l, C is the parameter of the classifier, and L is the cross-entropy loss value to be minimized.
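A minimal sketch of the output layer and the loss of equation (8): a fully connected classifier over the fused node embeddings, with the cross-entropy loss computed only on the labeled comment nodes; the sizes and index sets are placeholders.

```python
# Output layer and cross-entropy loss on the labeled nodes (equation (8)); all tensors are placeholders.
import torch
import torch.nn as nn

classifier = nn.Linear(16, 2)                  # C in equation (8): fake vs. real review
criterion = nn.CrossEntropyLoss()

fused = torch.randn(5, 16)                     # final node embeddings Z from the attention layers
labels = torch.tensor([1, 0, 0, 1, 0])         # 1 = fake comment, 0 = real comment
labeled_idx = torch.tensor([0, 1, 3])          # Y_L: only labeled nodes contribute to the loss

logits = classifier(fused)
loss = criterion(logits[labeled_idx], labels[labeled_idx])
loss.backward()
```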
4. Experiment verification
Three experiments were designed to evaluate the fake movie review detection effect of the proposed model. The data set is the fake movie review data set collected in this work, containing 20,114 fake reviews and 45,513 normal reviews. In the experiments, 80% of the data set is used as the training set, 10% as the validation set, and 10% as the test set. Each experiment is repeated 10 times and the average is taken as the final result.
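The split and repetition scheme can be sketched as follows; the label array, the stratified splitting, and the use of the F1 score as the averaged quantity are illustrative assumptions.

```python
# 80/10/10 split repeated 10 times with the scores averaged; the labels and the recorded
# score are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.random.randint(0, 2, size=65627)           # placeholder for the 65,627 comment labels
indices = np.arange(len(labels))

scores = []
for seed in range(10):                                  # 10 repetitions, averaged at the end
    train_idx, rest_idx = train_test_split(indices, test_size=0.2, random_state=seed,
                                           stratify=labels)
    val_idx, test_idx = train_test_split(rest_idx, test_size=0.5, random_state=seed,
                                         stratify=labels[rest_idx])
    # ... train the graph attention model on train_idx, tune on val_idx, evaluate on test_idx ...
    scores.append(0.0)                                  # placeholder for this run's test F1

print("mean F1 over 10 runs:", np.mean(scores))
```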
4.1. Evaluating the effect of a meta-path
This experiment analyzes the effect of the choice of meta-path combination on the performance of the proposed model, since a small number of high-quality meta-paths can already bring considerable performance. Both the single meta-paths and their combination are considered. Specifically, the method uses two meta-paths: MP1 = C-U-C, which indicates that two movie comments are posted by the same user, and MP2 = C-M-C, which indicates that two comments are posted on the same movie. To analyze the effect of the different combinations, the experiment compares each single meta-path with the combined meta-paths.
The experimental results are shown in Table 6 and fig. 5. As can be seen from Table 6, the individual meta-paths exhibit different performance, with MP1 > MP2; this means that different meta-paths represent different relations with different effects on the final node embedding, and the semantic information conveyed by the better-performing path has a greater impact on the model. In addition, although the performance gain is not very large, the combination MP1&MP2 performs better overall: a combination of meta-paths contains more semantic information, so the nodes capture more features and the classification task is completed better, which demonstrates the effectiveness of the meta-paths proposed by the invention.
Table 6. Effect of the meta-paths
Meta-path | Accuracy | Precision | Recall | F1
MP1 | 0.8130 | 0.8513 | 0.9167 | 0.8828
MP2 | 0.8045 | 0.8048 | 0.9843 | 0.8855
MP1&MP2 | 0.8340 | 0.8857 | 0.9000 | 0.8928
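For reference, the four reported metrics can be computed from test-set predictions as in the short sketch below; the prediction arrays are placeholders.

```python
# Computing the four reported metrics from placeholder predictions.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # 1 = fake review, 0 = real review
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions on the test set

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```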
4.2. Evaluating the effect of sentence vector models
Since the proposed model aggregates over nodes on the basis of the node features taken as input to obtain the final node embeddings for classification, the feature input of the sentence nodes that serve as the task target is particularly important. This experiment uses three other ways of obtaining sentence vectors, i.e. the initial embeddings of the sentence nodes, to demonstrate the validity of the sentence vector input chosen by the method.
The sentence vector methods selected for the experiment are: the BERT model, a fine-tuning-based multi-layer bidirectional Transformer encoder; the Doc2vec model, based on Word2vec, which trains a neural network to represent the concept of a whole document on the task of predicting a center word from the mean of the context word vectors and the vector of the whole document; the SBERT model, which modifies the pre-trained BERT and uses siamese and triplet network structures on top of conventional BERT; and the ConSERT model, which was proposed to solve the "collapse" phenomenon of BERT vectors so that the generated sentence vectors are more suitable for downstream tasks.
The sentence vector dimensions of the four methods mentioned above are shown in Table 7.
Table 7. Sentence vector dimensions
Method | Dimension
BERT | 768
Doc2Vec | 100
SBERT | 768
ConSERT | 768
The experimental results of the sentence vector comparison are shown in fig. 6. The comparison shows that the text feature vectors extracted by BERT, Doc2Vec, and SBERT are not as effective as those of ConSERT. BERT and SBERT perform worse than the ConSERT sentence vectors because BERT's representations suffer from the collapse problem: BERT's word representations are cone-shaped overall, so high-frequency words dominate the sentence representations, making the sentences extremely similar to one another and hurting the downstream task. The improved SBERT performs better than BERT overall but does not solve the underlying problem, so its performance is still poor. Doc2Vec adopts a topic-style extraction of the text, and its vector dimension is chosen by the user, so it is difficult to determine the optimal dimension and guarantee the effect, which leads to poor performance.
4.3. Evaluating the effect of the proposed detection model
To demonstrate that the proposed model has clear advantages in fake movie review detection, common detection models based on traditional machine learning and deep learning were carefully selected for comparison experiments, including DPCNN (Deep Pyramid Convolutional Neural Network), TextCNN (Text Convolutional Neural Network), TextRNN (Text Recurrent Neural Network), Att-TextRNN, and TextRCNN, comparing accuracy, precision, recall, F1 value, and other indicators.
The experimental results are shown in fig. 7 and Table 8. Although the proposed detection model does not reach the highest accuracy on the constructed fake movie review data set, its precision, recall, and F1 value are all better than those of the other comparison models, with the F1 value reaching 0.8928. Furthermore, the detection results of the models that incorporate an attention mechanism are better than those of the models that do not, since the attention mechanism helps the model find more effective features. Compared with traditional approaches that use the text alone, the proposed approach of combining the semantic information related to the comment nodes achieves better results in fake movie review detection. In addition, TextRNN and the other models that take the sentence vectors generated by ConSERT as input also show better results, demonstrating that the chosen sentence vector features have an advantage in capturing the characteristics of the text; and the graph structure integrated into the model also makes a certain contribution to the fake review detection results.
Table 8. Performance of the different detection models (table reproduced as an image in the original)
In summary, the proposed model benefits from the chosen sentence vector feature extraction method, the introduction of the attention mechanisms, and the combination of meta-paths, and achieves excellent results on the fake movie review detection problem.

Claims (6)

1. A fake movie review detection method based on a graph attention neural network, characterized by comprising the following steps:
step 1: data set construction
Design a targeted crawler to collect, for a given movie platform and time period, the basic information of movies of various genres, the related movie comments, and the basic information of the users who posted the comments; label the comment text data and construct a movie comment data set;
step 2: feature extraction
Extract keywords from each movie's synopsis and generate a movie feature vector with the TF-IDF algorithm; normalize the user's level, number of movies watched, number of past comments, and similar data to obtain a user feature vector; extract sentence vectors from the comment texts with the ConSERT framework based on the BERT model to obtain comment feature vectors;
step 3: detection model
Construct a detection model based on a graph attention neural network, concatenating the extracted movie, user, and comment feature vectors as the input of the model; use a node-level attention mechanism to learn the weights of the meta-path-based neighbors and aggregate them to obtain first-step node embeddings; then use a meta-path-level attention mechanism to distinguish the different meta-paths, learn their weights, and combine the first-step node embeddings by weighted summation to obtain the final node embeddings for the classification task, and finally output the detection result.
2. The fake movie review detection method based on a graph attention neural network according to claim 1, wherein step 1 specifically comprises:
Step 1.1: build a URL from the movie name, construct a Request object with the Requests library, request the resource from the server, and receive the movie-related information; then parse the returned web page with the BeautifulSoup library to obtain the movie ID corresponding to the movie name;
Step 1.2: using an API interface of the movie platform, construct a URL from the movie ID, request the resource with Requests, and receive a JSON file containing the detailed movie information and the movie comment information;
Step 1.3: obtain the JSON file of the user's homepage from the user ID information contained in the movie comment information, so as to obtain the user information;
Step 1.4: formulate a data labeling standard for fake movie reviews and label the extracted movie comment records.
3. The fake movie review detection method based on a graph attention neural network according to claim 2, wherein the movie details include: the movie ID, movie score, director, movie score distribution, release time, genre, the number of users who want to watch the movie, and the number of users who have watched the movie; the movie comment information includes: the comment content, comment score, number of likes, number of replies, comment time, and commenting user's name; the user information includes: the user ID, user level, ticket purchase information, total number of the user's comments, total number of topics the user participates in, the number of movies the user wants to watch, and the number of movies the user has watched.
4. The fake movie review detection method based on a graph attention neural network according to claim 1, wherein the extraction of the movie feature vector in step 2 specifically comprises:
Step 2.1.1: perform data preprocessing, comprising word segmentation, part-of-speech tagging, and stop-word removal, on a given movie synopsis text D to obtain n candidate keywords, i.e. $D = [t_1, t_2, \ldots, t_n]$;
Step 2.1.2: calculate the term frequency TF of word $t_i$ in text D;
Step 2.1.3: calculate the inverse document frequency of word $t_i$ over the whole corpus,
$\mathrm{IDF}(t_i) = \log\dfrac{D_n}{D_t + 1}$
where $D_t$ is the number of documents in the corpus in which word $t_i$ appears and $D_n$ is the total number of documents;
Step 2.1.4: calculate the TF-IDF value of word $t_i$, and then the TF-IDF values of all candidate keywords;
Step 2.1.5: sort the candidate keywords in descending order of their TF-IDF values and take the top N words as the keywords of the movie synopsis.
5. The fake movie review detection method based on a graph attention neural network according to claim 1, wherein the extraction of the comment vectors in step 2 adopts the ConSERT framework based on the BERT model, which fine-tunes the BERT model and comprises:
generating different input samples for the embedding layer with a data enhancement module;
computing a sentence representation for each input comment text with a shared BERT encoder; during training, the sentence representation is obtained by average pooling of the last layer's token embeddings;
arranging a contrastive loss layer on top of the BERT encoder, which maximizes the agreement between a sentence representation and its corresponding augmented sentence while minimizing its similarity to the other sentence representations in the same batch;
for each input comment text x, the fine-tuned BERT model first passes it to the data enhancement module, which applies two transformations $T_1$ and $T_2$ to generate two versions of the token embeddings, $e_i = T_1(x)$ and $e_j = T_2(x)$, where $e_i, e_j \in \mathbb{R}^{L \times d}$, L is the sequence length and d is the hidden dimension; $e_i$ and $e_j$ are then fed into the shared BERT encoder, encoded by the multi-layer Transformer in BERT, and the encoded results are average-pooled to generate the sentence vectors $r_i$ and $r_j$.
6. The fake movie review detection method based on a graph attention neural network according to claim 1, wherein step 3 specifically comprises:
Step 3.1: take the heterogeneous graph formed by the movie nodes M, the user nodes U, and the comment nodes C as the input of the model;
the sentence vectors obtained for the task target nodes C are denoted $\{h_1, h_2, \ldots, h_n\}$, where n is the total number of target nodes C; there are two meta-paths: $\Phi_1$ = C-U-C, meaning that two comments are posted by the same user, and $\Phi_2$ = C-M-C, meaning that two comments are posted on the same movie, i.e. by users watching the same movie, the inverse arrow in a meta-path denoting the reverse of the corresponding relation;
step 3.2: node embedding based on node's attention mechanism
Step 3.2.1: determining an attention value of a node level of a given pair of nodes (i, j) connected by a meta-path phi
Figure FDA0004129484760000034
This value also means the importance of node j to node i, which is calculated as:
Figure FDA0004129484760000035
wherein ,h'i and h'j Att is the characteristics of node i and node j, respectively node A deep neural network representing the attention of the execution node level; given meta-path Φ, att node Shared for all pairs of meta-path based nodes;
step 3.2.2: for the obtained attention value
Figure FDA0004129484760000036
And carrying out normalization operation to obtain the attention coefficient after normalization, wherein the attention coefficient is shown in the following formula:
Figure FDA0004129484760000037
wherein ,
Figure FDA0004129484760000038
for the attention coefficient after normalization, σ (·) is the activation function, ++>
Figure FDA0004129484760000039
For the transposition of the node level attention vector of the meta path Φ, ||represents the join, |j>
Figure FDA00041294847600000310
For node i, neighbor node based on meta path phi, h' k The characteristic of the neighbor node based on the meta-path phi is used as the node i; />
Step 3.2.2: embedding attention coefficients of nodes based on meta-path Φ of node i
Figure FDA00041294847600000311
Embedding the characteristic of the neighbor around the node to make a weighted summation, and then obtaining a characteristic representation +_corresponding to the node i through an activation function>
Figure FDA00041294847600000312
The following formula is shown:
Figure FDA00041294847600000313
after all nodes obtain the characteristics, the characteristics Z corresponding to each node in the element path phi are obtained Φ The method comprises the steps of carrying out a first treatment on the surface of the Similarly, a set of meta-paths { Φ }, is given 01 ,...,Φ P P group node embeddings, denoted as
Figure FDA00041294847600000314
Step 3.3: node embedding based on attention mechanism of meta-path
Step 3.3.1: selecting all nodes under a meta-path phi, passing each node through a full connection layer, an activation function, and multiplying a learnable parameter q to obtainThe scalar corresponding to this node i under the meta-path Φ; the same operation is carried out on all nodes, and then weighted summation is carried out, and the weighted summation is divided by the number of the nodes to average, thus obtaining the importance of the element path phi
Figure FDA0004129484760000041
The following formula is shown:
Figure FDA0004129484760000042
wherein W is a weight matrix, b is a bias vector, q is a semantic-level attention vector, and V is the number of nodes; step 3.3.2: normalizing by a softmax operation to obtain a meta-path phi i The weights of (2) are expressed as
Figure FDA0004129484760000043
The following formula is shown:
Figure FDA0004129484760000044
wherein P is the number of meta paths;
step 3.3.3: fusing the embedding of the nodes, and carrying out weighted summation on the nodes to obtain the final embedded Z i The following formula is shown:
Figure FDA0004129484760000045
step 3.4: embedding the obtained target task nodes into a full-connection layer to output classification results;
the loss function of the detection model based on the graph attention neural network is a minimized cross entropy loss function, and the loss function is shown in the following formula:
Figure FDA0004129484760000046
wherein ,yL Is a sample node index set, Y l and Zl Is the label and embedding of the node, C is the parameter of the classifier, and L is the minimum cross entropy loss function value.
CN202310255641.0A 2023-03-16 2023-03-16 Fake movie review detection method based on graph attention neural network Pending CN116166806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255641.0A CN116166806A (en) Fake movie review detection method based on graph attention neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310255641.0A CN116166806A (en) Fake movie review detection method based on graph attention neural network

Publications (1)

Publication Number Publication Date
CN116166806A true CN116166806A (en) 2023-05-26

Family

ID=86414752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255641.0A Pending CN116166806A (en) Fake movie review detection method based on graph attention neural network

Country Status (1)

Country Link
CN (1) CN116166806A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737934A (en) * 2023-06-20 2023-09-12 合肥工业大学 Naval false comment detection algorithm based on semi-supervised graph neural network
CN116737934B (en) * 2023-06-20 2024-03-22 合肥工业大学 Naval false comment detection algorithm based on semi-supervised graph neural network
CN117076812B (en) * 2023-10-13 2023-12-12 西安康奈网络科技有限公司 Intelligent monitoring management system of network information release and propagation platform
CN117557347A (en) * 2024-01-11 2024-02-13 北京华电电子商务科技有限公司 E-commerce platform user behavior management method
CN117557347B (en) * 2024-01-11 2024-04-12 北京华电电子商务科技有限公司 E-commerce platform user behavior management method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination