CN112989056B - False comment identification method and device based on aspect features - Google Patents

False comment identification method and device based on aspect features

Info

Publication number
CN112989056B
CN112989056B (application CN202110487429.8A; published as CN112989056A)
Authority
CN
China
Prior art keywords
words
comment
emotion
categories
comments
Prior art date
Legal status
Active
Application number
CN202110487429.8A
Other languages
Chinese (zh)
Other versions
CN112989056A (en)
Inventor
吕欣
蔡梦思
谭跃进
豆亚杰
谭索怡
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110487429.8A
Publication of CN112989056A
Application granted
Publication of CN112989056B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213: Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/289: Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06Q 30/0282: Commerce; marketing; rating or review of business operators or products


Abstract

One or more embodiments of the present specification provide a false comment identification method and apparatus based on aspect features, including: extracting aspect information from the comment to be identified, where the aspect information includes aspect words; classifying the aspect words to obtain the classified aspect words and the aspect categories to which they belong; determining aspect features according to the aspect information and the aspect categories; and inputting the aspect features into a pre-trained comment recognition model, which outputs a prediction of whether the comment to be identified is a false comment. The embodiments can accurately identify false comments.

Description

False comment identification method and device based on aspect features
Technical Field
One or more embodiments of the present disclosure relate to the technical field of artificial intelligence, and in particular, to a false comment identification method and apparatus based on aspect features.
Background
With the development of information technology, using various online platforms for shopping, traveling, ordering food, transportation and the like has greatly improved the convenience of daily life. Each product on an online platform carries a number of comments: users can decide whether to buy the product according to the comments, and merchants can improve product quality and service according to them. However, the many comments also contain false comments, which mislead the judgment of users and merchants and hinder healthy economic development. Accurately identifying false comments among the comments and screening them out is therefore a key problem that online platforms need to solve.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure are directed to a method and an apparatus for identifying a false comment based on an aspect feature, which can identify the false comment.
In view of the above, one or more embodiments of the present specification provide a method for identifying false comments based on aspect features, including:
extracting aspect information from the comments to be identified; the aspect information comprises aspect words;
classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
determining aspect features according to the aspect information and the aspect categories;
and inputting the aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
Optionally, the extracting aspect information from the comment to be identified includes:
extracting reviewers, review products, scores, review contents and review lengths from the reviews to be identified;
and extracting the aspect words and the corresponding emotion words from the comment content.
Optionally, the aspect words and the corresponding emotion words are extracted from the comment content by:
and extracting the aspect words and the corresponding emotion words from the comments to be recognized by utilizing a pre-trained Bi-LSTM classification model.
Optionally, the method for training the Bi-LSTM classification model is as follows:
providing a comment sample set, wherein the comments in the comment sample set comprise all the contained words and the labels corresponding to the words; the label of a word is one of an aspect word, an emotion word or another word;
and training a Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
Optionally, the classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong includes:
selecting, according to a preset popularity, part of the aspect words having the preset popularity from the extracted aspect words;
determining a predetermined number of aspect categories;
and dividing the selected aspect words into specific aspect categories according to the predetermined number of aspect categories by using a K-means clustering method.
Optionally, determining an aspect feature according to the aspect information and the aspect category includes:
counting the number of unique aspect words in the comment to be identified;
counting the proportion of the aspect categories in the comments to be identified;
counting the comment length of the related aspect category in the comment to be identified;
determining the emotion words corresponding to the aspect words in the comment to be recognized, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the scores in the comments to be identified and the emotion scores corresponding to the emotion words of all aspects of categories;
and counting the average value of the deviation of the emotion scores of the emotion words of the aspect categories in the comments and the emotion scores of the emotion words of the same aspect categories in all the comments.
Optionally, the number of unique aspect words NAW_r in the comment to be identified is counted as:

NAW_r = |UAW_r|  (1)

wherein UAW_r is the set of unique aspect words in comment r, and |UAW_r| is the number of aspect words contained in the unique aspect word set;

the proportion of aspect categories PAC_r in the comment to be identified is counted as:

PAC_r = |AC_r| / NA_p  (2)

wherein AC_r is the set of aspect categories in comment r, |AC_r| is the number of aspect categories contained in the set AC_r, and NA_p is the number of aspect categories of product p;

the comment length of the related aspect categories ARL_{r,ac} in the comment to be identified is counted as:

ARL_{r,ac} = Σ_{c ∈ C_{r,ac}} |c| / |r|  (3)

wherein the comment includes at least one comment clause, C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r;

the average deviation ARD_r between the score of the comment to be identified and the emotion scores corresponding to the emotion words of the aspect categories is determined as:

ARD_r = mean_{ac}( |s(o_{r,ac}) − g_r| / 4 )  (4)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is the emotion score corresponding to that emotion word, g_r is the score of comment r, and mean(·) denotes averaging over the emotion words of the aspect categories in the comment;

the average deviation ASD_r between the emotion scores of the emotion words of the aspect categories in the comment and the emotion scores of the emotion words of the same aspect categories in all comments is counted as:

ASD_r = mean_{ac}( |s(o_{r,ac}) − mean_{r' ∈ R_ac} s(o_{r',ac})| )  (5)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is its emotion score, R_ac is the set of comments involving aspect category ac, and mean_{r' ∈ R_ac} s(o_{r',ac}) is the average emotion score of the corresponding emotion words over the comments in R_ac.
Optionally, determining an aspect feature according to the aspect information and the aspect category includes:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
Optionally, the total number of unique aspect words TNAW_u issued by the reviewer is determined as:

TNAW_u = | ∪_{r ∈ R_u} AW_r |  (6)

wherein R_u is the set of all comments issued by reviewer u, AW_r is the set of aspect words in comment r of reviewer u, and |AW_r| is the number of aspect words in the set;

the average proportion of aspect categories APAC_u issued by the reviewer is determined as:

APAC_u = Σ_{r ∈ R_u} |AC_r| / Σ_{p ∈ P_u} NA_p  (7)

wherein AC_r is the aspect category set of comment r of reviewer u, |AC_r| is the number of aspect categories contained in AC_r, P_u is the set of products reviewed by reviewer u, and NA_p is the number of aspect categories of product p;

the total comment length AARL_u of the reviewer over all aspect categories is determined as:

AARL_u = Σ_{r ∈ R_u} Σ_{ac} Σ_{c ∈ C_{r,ac}} |c| / |r|  (8)

wherein C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r.
The embodiment of the present specification further provides a false comment identification device based on aspect characteristics, including:
the aspect extraction module is used for extracting aspect information from the comments to be identified; the aspect information comprises aspect words;
the aspect classification module is used for classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
a feature determination module for determining aspect features according to the aspect information and the aspect categories;
and the prediction module is used for inputting the aspect characteristics into a pre-trained comment identification model, and outputting a prediction result of whether the comment to be identified is a false comment or not by the comment identification model.
As can be seen from the above, in the false comment recognition method and apparatus based on aspect features provided in one or more embodiments of the present specification, the aspect information is extracted from the comment to be recognized, the aspect words are classified, the classified aspect words and the aspect categories to which the aspect words belong are obtained, the aspect features are determined according to the aspect information and the aspect categories, the aspect features are input into a pre-trained comment recognition model, and the comment recognition model outputs a prediction result of whether the comment to be recognized is a false comment. According to the method, whether the comment is a false comment or not is judged by analyzing the comment content about the product attribute in the comment, and the identification accuracy of the false comment can be improved.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a schematic flow chart of a method according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram of feature importance analysis on the hotel review set according to one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of feature importance analysis on the restaurant review set according to one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of aspect feature mean analysis on the hotel review set according to one or more embodiments of the present disclosure;
FIG. 5 is a schematic diagram of aspect feature mean analysis on the restaurant review set according to one or more embodiments of the present disclosure;
FIG. 6 is a schematic diagram of an apparatus according to one or more embodiments of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, an online platform needs to identify false comments among numerous comments and screen them out, so that users can select products according to genuine comments and merchants can improve product quality, service quality and the like according to genuine comments and product sales. Current false comment identification methods identify false comments based on text features, behavior features and the like; however, because false reviewers can forge text features and behavior features, the identification results are not accurate enough.
In implementing the present disclosure, the applicant found that when purchasing a product, a user reads the comments on the product attributes in detail, and these comments influence the selection of the product. Product attributes are referred to in prior research as product aspects. Since a false reviewer generally has not used the product, the aspect-related content of a false comment differs from that of a genuine comment, and aspect features are difficult for a false reviewer to forge; therefore, analyzing the aspect features of a product can effectively identify false comments.
In view of the above, the present specification provides a false comment identification method based on aspect features, which can improve the identification accuracy of false comments by extracting aspect information from comments, determining aspect features according to the aspect information, and identifying false comments through the aspect features.
Hereinafter, the technical means of the present disclosure will be described in further detail with reference to specific examples.
As shown in fig. 1, one or more embodiments of the present specification provide a method for identifying false comments based on aspect features, including:
s101: extracting aspect information from the comments to be identified; the aspect information comprises aspect words and corresponding emotion words, commentators, commentary products, scores, commentary content, commentary length and other information;
in this embodiment, first, the aspect information of the relevant product is extracted from the obtained comment to be identified. The aspect information comprises aspect words related to product attributes, emotion words expressing emotional tendency on product aspects, commentators who issue comments, commented products, scores on the products, comment contents, length of the comment contents and the like.
S102: classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
in this embodiment, after the aspect words are extracted, according to the popularity of the current aspect words, part of the aspect words generally accepted by the user are selected and classified, so as to obtain the classified aspect words and the aspect categories to which the aspect words belong.
In some modes, part of the aspect words generally accepted by users can be selected according to the popularity of the aspect words, and the selected aspect words are divided into at least two aspect categories by using a K-means clustering method. For example, from the reviews of a hotel, the extracted aspect words may include water, cheese, hamburgers, gym, television, swimming pool, attendants, luggage delivery, etc., and these aspect words can be classified into food (water, cheese, hamburgers), hardware facilities (gym, swimming pool, television), service (attendants, luggage delivery), and the like.
S103: determining aspect characteristics according to the aspect information and the aspect categories;
in some embodiments, determining eight kinds of aspect features according to the extracted aspect information, the classified aspect words, and the aspect categories to which the aspect words belong includes: the number of unique aspect words contained in the comment, the proportion of aspect categories contained in the comment, the length of the comment on related aspect categories in the comment, the average deviation between the score of the comment and the sentiment scores corresponding to the sentiment words on the aspect categories in the comment, the average of the sentiment scores of the sentiment words on the aspect categories in the comment and the sentiment scores of the sentiment words on the same aspect categories in all comments, the total number of unique aspect words published by the reviewer, the average proportion of aspect categories published by the reviewer, and the total number of the comment lengths of the reviewer on all aspect categories. According to the analysis of the above eight aspect features, whether the comment is a false comment or not can be predicted.
S104: and inputting the determined aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
In the embodiment, after the comment to be recognized is processed, the aspect characteristics of the comment to be recognized are obtained, the aspect characteristics are input into a comment recognition model obtained through pre-training, the comment recognition model predicts according to the input aspect characteristics, a prediction result is output, and whether the comment to be recognized is a false comment or not can be determined according to the prediction result.
The method for recognizing the false comment based on the aspect characteristics comprises the steps of extracting aspect information from the comment to be recognized, classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong, determining the aspect characteristics according to the aspect information, the classified aspect words and the aspect categories, inputting the aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is the false comment or not through the comment recognition model. According to the method, whether the comment to be identified is the false comment or not is judged by analyzing the comment content about the product attribute in the comment to be identified, and the identification accuracy of the false comment can be improved.
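For orientation, the four steps can be strung together as a compact skeleton. This is a minimal sketch in Python; every helper name (extract_aspect_terms, build_aspect_features) and every data field used here is a hypothetical placeholder for the components detailed in the following embodiments, not an interface defined by this disclosure.

```python
# Hypothetical end-to-end skeleton of S101-S104; the helper functions are
# placeholders for the components described in the following embodiments.
def identify_false_comment(comment, bilstm_tagger, aspect_to_category, recognition_model):
    # S101: extract aspect words and emotion words from the comment text
    aspect_words, emotion_words = extract_aspect_terms(comment.text, bilstm_tagger)
    # S102: map each extracted aspect word to its aspect category
    categories = [aspect_to_category[w] for w in aspect_words if w in aspect_to_category]
    # S103: build the eight aspect features from the aspect information and categories
    features = build_aspect_features(comment, aspect_words, emotion_words, categories)
    # S104: the pre-trained comment recognition model predicts false (1) or genuine (0)
    return recognition_model.predict([features])[0]
```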
In some embodiments, extracting facet information from the comments to be identified includes:
extracting reviewers, review products, scores, review contents and review lengths from the reviews to be identified;
and extracting the aspect words and the emotional words from the comment content.
The obtained comment information is generally structured data, from which the reviewer, the comment object (a specific product), the score, the specific comment content, the length of the comment content and the like can be conveniently extracted. The comment content itself is unstructured data, and the specific elements of the comment, such as the aspect words and the emotion words, are contained in the comment content and are not easy to extract directly.
In some embodiments, specific comment contents such as facet words and emotion words can be extracted from the comment to be recognized by using the Bi-LSTM classification model. The Bi-LSTM classification model is obtained by pre-training, and the method for training the Bi-LSTM classification model comprises the following steps:
providing a comment sample set, wherein comments in the comment sample set comprise all contained words and labels corresponding to the words; wherein, the label of the word is one of the aspect word, the emotion word or other words;
and training the Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
In this embodiment, in order to extract the aspect words and the emotion words from the comment content, the Bi-LSTM classification model is trained in advance, so that after the comment to be recognized is input into the Bi-LSTM classification model, the labels corresponding to each word to be recognized and reviewed can be output, and according to the labels of each output word, the words with the labels of the aspect words and the words with the labels of the emotion words are extracted from the comment to be recognized.
For the comments in the original comment sample set, each word can be labeled with a label from a preset label set in advance, after the label of each word is determined, word embedding is carried out by using a word2vec model, and the comment sample set comprising the words and the labels thereof is obtained. And training to obtain the Bi-LSTM classification model based on the comment sample set marked with the label.
In some approaches, the Bi-LSTM classification model includes an embedding layer, Bi-LSTM layers, a softmax layer and an output layer. The comment to be recognized is input into the Bi-LSTM classification model, and the model predicts the label of each word in the comment. First, the input comment to be recognized is represented as a vector X by the embedding layer; then, three Bi-LSTM layers are used to extract the context features of each word; a softmax activation function on top of the Bi-LSTM layers computes the probability p of each label in the label set {A, E, O} (A for aspect words, E for emotion words, O for other words); finally, the label of each word in the input comment to be recognized is predicted according to the output of the softmax layer.
For example, if the input comment to be recognized is "Service is good", the Bi-LSTM classification model will output the label A for "Service", the label E for "good", and the label O for "is". A tuple <service, good> containing the aspect word and the emotion word is then obtained from the comment to be recognized, where "service" is an aspect of the product and "good" is the reviewer's opinion of that product aspect (service).
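The tagging architecture just described can be sketched as follows. This is a minimal illustration in PyTorch (the disclosure does not prescribe a framework); the embedding size and other hyperparameters are assumptions chosen only to make the sketch runnable.

```python
# A sketch of the described sequence tagger: embedding layer -> three stacked
# bidirectional LSTM layers -> softmax over the label set {A, E, O}
# (A = aspect word, E = emotion word, O = other word).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100, num_labels=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                     # (batch, seq_len, 2*hidden_dim)
        return torch.log_softmax(self.classifier(h), dim=-1)
```

Trained with a token-level cross-entropy loss on comments labeled with {A, E, O}, the per-token argmax at inference yields the aspect-word/emotion-word tuples such as <service, good>.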
In some embodiments, classifying the extracted aspect words to obtain the classified aspect words and aspect categories to which the aspect words belong includes:
selecting partial aspect words with preset popularity from the extracted aspect words according to the preset popularity;
determining a predetermined number of facet classes;
and dividing the selected part of the aspect words into specific aspect categories according to a preset number of aspect categories by using a K-means clustering method.
Since the comment content submitted by reviewers is arbitrary, different aspect words in the comment content may actually correspond to the same product attribute; for example, different words referring to cheese denote substantially the same product with the same product attributes. Moreover, one comment by a reviewer may mention several specific attributes of a product that belong to essentially the same aspect category; for example, a review of a restaurant's food may mention pizza, french fries, coffee, etc., which can all be categorized into a food aspect category. Therefore, in order to simplify data processing and reduce the amount of data to be processed, aspect words with a certain popularity are selected according to the popularity of the product aspect words, the predetermined number of aspect categories to be divided is determined, and the selected aspect words in the comments are then divided into that predetermined number of aspect categories by using the K-means clustering method; each aspect word belongs to one aspect category, and each aspect category may contain one or more aspect words. In some modes, the K-means clustering method is a clustering algorithm based on feature similarity, and the aspect words belonging to the same aspect category are semantically similar.
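As a concrete illustration of this clustering step, the following sketch clusters word2vec vectors of the selected aspect words with scikit-learn's KMeans; the fixed number of categories (14 in the experiments below) and the use of Gensim word vectors are assumptions consistent with, but not mandated by, the text.

```python
# A minimal sketch: cluster popular aspect words into a fixed number of
# aspect categories by the similarity of their word2vec vectors.
import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_words(aspect_words, w2v_model, n_categories=14):
    words = [w for w in aspect_words if w in w2v_model.wv]
    vectors = np.array([w2v_model.wv[w] for w in words])
    labels = KMeans(n_clusters=n_categories, random_state=0).fit_predict(vectors)
    # Map each aspect word to the index of its aspect category.
    return dict(zip(words, labels))
```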
In some embodiments, determining aspect characteristics from the aspect information and the aspect categories comprises:
counting the number of unique aspect words in the comment to be identified;
counting the proportion of the aspect categories in the comments to be identified;
counting the comment length of the related aspect category in the comment to be identified;
determining the emotion words corresponding to the aspect words in the comment to be recognized, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the scores in the comments to be identified and the emotion scores corresponding to the emotion words of all aspects of categories;
counting the average value of the deviation of the emotion scores of the emotion words in the comment for the aspect category and the emotion scores of the emotion words in the same aspect category in all the comments;
in this embodiment, five kinds of aspect features centering on the comment are used as a criterion for judging whether the comment to be identified is a false comment.
In some embodiments, the number of unique aspect words in the comment to be identified (denoted as NAW) is determined by:

NAW_r = |UAW_r|  (1)

wherein UAW_r is the set of unique aspect words in comment r, and |UAW_r| is the number of aspect words contained in the unique aspect word set. A unique aspect word is an aspect word that appears in the comment without repetition.

The proportion of aspect categories in the comment to be identified (denoted as PAC) is counted by:

PAC_r = |AC_r| / NA_p  (2)

wherein AC_r is the set of aspect categories in comment r, |AC_r| is the number of aspect categories contained in the set AC_r, and NA_p is the number of aspect categories of product p.

The comment length of the related aspect categories in the comment to be identified (denoted as ARL) is determined by:

ARL_{r,ac} = Σ_{c ∈ C_{r,ac}} |c| / |r|  (3)

wherein the comment includes at least one comment clause, C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r.

The average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of the aspect categories (denoted as ARD) is determined by:

ARD_r = mean_{ac}( |s(o_{r,ac}) − g_r| / 4 )  (4)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is the emotion score corresponding to that emotion word, g_r is the score of comment r (ranging from 1 to 5), the denominator 4 is the maximum difference between the highest and lowest scores (5 − 1 = 4), and mean(·) denotes averaging over the emotion words of the aspect categories in the comment.

The average deviation between the emotion scores of the emotion words of the aspect categories in the comment and the emotion scores of the emotion words of the same aspect categories in all comments (denoted as ASD) is counted by:

ASD_r = mean_{ac}( |s(o_{r,ac}) − mean_{r' ∈ R_ac} s(o_{r',ac})| )  (5)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is its emotion score, R_ac is the set of comments involving aspect category ac, and mean_{r' ∈ R_ac} s(o_{r',ac}) is the average emotion score of the corresponding emotion words over the comments in R_ac.
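Putting equations (1)-(5) together, the five comment-centric features can be computed roughly as follows. The data layout (a comment with clauses, a rating from 1 to 5, and (aspect word, aspect category, emotion word) triples), the sentiment lexicon senti_score and the per-category averages avg_cat_score are illustrative assumptions rather than structures defined by the disclosure.

```python
# A minimal sketch of the comment-centric aspect features NAW, PAC, ARL, ARD, ASD.
from statistics import mean

def comment_features(comment, num_product_categories, senti_score, avg_cat_score):
    # comment.triples: list of (aspect_word, aspect_category, emotion_word)
    # comment.clauses: list of clauses, each a list of words; comment.rating: 1..5
    total_words = sum(len(clause) for clause in comment.clauses)
    categories = {cat for _, cat, _ in comment.triples}

    naw = len({aw for aw, _, _ in comment.triples})                       # eq. (1)
    pac = len(categories) / num_product_categories                        # eq. (2)
    arl = {cat: sum(len(clause) for clause in comment.clauses
                    if any(aw in clause for aw, c, _ in comment.triples if c == cat))
           / total_words
           for cat in categories}                                         # eq. (3)
    ard = mean(abs(senti_score[ew] - comment.rating) / 4
               for _, _, ew in comment.triples)                           # eq. (4)
    asd = mean(abs(senti_score[ew] - avg_cat_score[cat])
               for _, cat, ew in comment.triples)                         # eq. (5)
    return naw, pac, arl, ard, asd
```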
In some embodiments, determining the aspect feature from the aspect information and the aspect category further comprises:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
In this embodiment, three aspect features centering on the reviewer are used as a basis for judging whether the comment to be identified is a false comment.
In some embodiments, from all the reviews, the total number of unique aspect words posted by the reviewer (denoted as TNAW) is determined by:

TNAW_u = | ∪_{r ∈ R_u} AW_r |  (6)

wherein R_u is the set of all comments posted by reviewer u, AW_r is the set of aspect words in comment r of reviewer u, and |AW_r| is the number of aspect words in that set.

From all the reviews, the average proportion of aspect categories posted by the reviewer (denoted as APAC) is determined by:

APAC_u = Σ_{r ∈ R_u} |AC_r| / Σ_{p ∈ P_u} NA_p  (7)

wherein AC_r is the aspect category set of comment r of reviewer u, |AC_r| is the number of aspect categories contained in AC_r, P_u is the set of products reviewed by reviewer u, and NA_p is the number of aspect categories of product p.

From all the reviews, the total comment length of the reviewer over all aspect categories (denoted as AARL) is determined by:

AARL_u = Σ_{r ∈ R_u} Σ_{ac} Σ_{c ∈ C_{r,ac}} |c| / |r|  (8)

wherein C_{r,ac}, |c| and |r| are as defined in equation (3).
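The three reviewer-centric features of equations (6)-(8) can then be aggregated over all of a reviewer's comments, for example as below; the comment structure reuses the assumptions of the previous sketch, and num_categories_of is an assumed lookup from a product to its number of aspect categories.

```python
# A minimal sketch of the reviewer-centric aspect features TNAW, APAC, AARL.
def reviewer_features(comments, num_categories_of):
    unique_aspect_words = set()
    category_count, category_total = 0, 0
    aarl = 0.0
    for c in comments:                                    # all comments of one reviewer
        unique_aspect_words |= {aw for aw, _, _ in c.triples}
        category_count += len({cat for _, cat, _ in c.triples})
        category_total += num_categories_of[c.product]
        total_words = sum(len(clause) for clause in c.clauses)
        aarl += sum(len(clause) for clause in c.clauses
                    if any(aw in clause for aw, _, _ in c.triples)) / total_words
    tnaw = len(unique_aspect_words)                       # eq. (6)
    apac = category_count / category_total                # eq. (7)
    return tnaw, apac, aarl                               # aarl: eq. (8)
```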
in some embodiments, pre-trained comment recognition models are used to predict the aspect characteristics of the comment to be recognized. The comment identification model is realized based on an XGboost model, the XGboost model is a gradient enhancement decision tree model, and output is predicted according to a series of rules arranged in a tree structure. In some modes, the comment recognition model is realized by adopting a scimit-spare tool in Python, and the method comprises the steps of randomly selecting a plurality of comments from a real comment set and a false comment set, respectively forming a training set, a testing set and a verification set, training the model on the training set, adjusting model parameters according to results on the verification set, and finally calculating the accuracy of model prediction on the testing set by using the trained model.
The following describes the prediction effect achievable by the method according to the present embodiment with reference to experimental data.
The aspect-feature-based false comment identification method provided by this embodiment was tested using the YelpChi_Hotel review set (containing reviews of some hotels in Chicago) and the YelpChi_Res review set (containing reviews of some restaurants in Chicago); at the same time, false comment identification methods based on text features and on behavior features were tested, and the results of the three false comment identification methods were compared.
First, the review sets are preprocessed: the comment sentences are split into clauses according to punctuation marks such as ",", ".", ";", "?" and "!"; word embedding is then carried out with the word2vec tool in the Gensim software package to obtain a multidimensional vector representation of each comment sentence.
Based on the preprocessed review sets, more than 600,000 comments written by more than 5,000 reviewers on more than 200,000 hotels are selected as the hotel comment sample set, and more than 700,000 comments written by more than 30,000 reviewers on more than 200,000 restaurants are selected as the restaurant comment sample set; these sample sets are used to test the aspect-feature-based false comment identification method of this embodiment, the text-feature-based false comment identification method and the behavior-feature-based false comment identification method.
For the hotel review sample set, the Bi-LSTM hotel classification model is trained in advance. During training, the hotel comment sample set can be divided into a training set, a verification set and a test set according to a certain proportion (for example, 70%, 15% and 15%), the learning rate of the model is set to 0.01, and the number of hidden layers is set to 100. After training, a Bi-LSTM hotel classification model capable of extracting hotel-like aspect words and emotion words from the comments is obtained.
In the same way, for the restaurant review sample set, a Bi-LSTM restaurant classification model capable of extracting restaurant type aspect words and emotion words from the reviews is trained in advance.
All the false comments in the hotel comment sample set and an equal number of genuine comments are selected to construct a balanced data set containing genuine and false comments; the balanced data set is used to pre-train the hotel comment recognition model, and the hotel comment recognition model is then used to predict whether a hotel comment is a false comment. During training, the parameters of the XGBoost model are carefully tuned by grid search, and the final optimal parameters of the XGBoost model are set as follows: the number of decision trees is 500, the learning rate is 0.01, the maximum tree depth is 10, the minimum leaf-node weight is 1, the random sampling proportion is 0.95, and the gamma value is 0.4. To prevent overfitting and better evaluate the effectiveness of the model, five-fold cross-validation experiments are randomly repeated 12 times to reduce errors.
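Under the stated hyperparameters, the training and evaluation procedure can be sketched with the xgboost scikit-learn wrapper; the feature matrix X (the eight aspect features of the balanced data set) and the labels y (1 for false comments) are assumed to have been prepared already.

```python
# A minimal sketch of the comment recognition model with the parameters above,
# evaluated by repeating 5-fold cross-validation 12 times with different shuffles.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=500,      # number of decision trees
    learning_rate=0.01,
    max_depth=10,          # maximum tree depth
    min_child_weight=1,    # minimum leaf-node weight
    subsample=0.95,        # proportion of random sampling
    gamma=0.4,
)

# X, y: prepared aspect-feature matrix and labels (see lead-in above).
scores = [cross_val_score(clf, X, y,
                          cv=KFold(n_splits=5, shuffle=True, random_state=seed),
                          scoring='accuracy').mean()
          for seed in range(12)]
print('mean accuracy over 12 runs:', np.mean(scores))
```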
According to the same method, a restaurant comment identification model is trained in advance for the restaurant comment sample set, and the restaurant comment identification model is used for predicting the restaurant comment to obtain a prediction result of whether the comment is a false comment or not.
According to the test result, the precision of extracting the aspect words and the emotion words from the hotel comment sample set by using the Bi-LSTM hotel classification model reaches 0.84, the precision of extracting the aspect words and the emotion words from the restaurant comment sample set by using the Bi-LSTM restaurant classification model reaches 0.92, the number of the extracted aspect words is large, and the extraction precision of the aspect words is high.
For the hotel-class aspect words, the most popular 792 aspect words are selected and 14 aspect categories are determined, as shown in Table 1.

Table 1. The 14 aspect categories into which the hotel-class aspect words are divided [table provided as an image in the original publication].

For the restaurant-class aspect words, the most popular 914 aspect words are selected and 14 aspect categories are determined, as shown in Table 2.

Table 2. The 14 aspect categories into which the restaurant-class aspect words are divided [table provided as an image in the original publication].
As shown in Tables 1 and 2, the extracted aspect words are clustered into meaningful aspect categories based on the semantic relationships between the words. For example, the aspect words in Table 2 that are divided into aspect category #1 include server, staff, waiter, waitress, owner, chef, etc., so aspect category #1 is an employee aspect category.
The method of the present embodiment and the existing methods based on text features and behavior features are evaluated according to five indexes: accuracy (a), recall (r), precision (p), F value (F) and the area under the ROC curve (AUC). As shown in Table 3, model performance under seven different feature combinations is compared: behavior features only (B), text features only (T), aspect features only (A), behavior plus text features (B+T), behavior plus aspect features (B+A), text plus aspect features (T+A), and all three features (B+T+A).
Table 3. Model performance under different feature combinations [table provided as an image in the original publication].
According to Table 3, the false comment identification method based on aspect features outperforms the identification methods based on text features and behavior features on all indexes. Combining aspect features with behavior features, the accuracy on the two comment sets reaches 97.8% and 85.9%; using aspect features alone, the accuracy reaches 96.7% and 85.3%; using the combination of behavior features and text features, the accuracy is 70.4% and 67.9%. The performance of the aspect-feature-based model is therefore improved by about 20% compared with the models based on the other two kinds of features.
Feature importance provides a score for each feature, the higher the score, the higher the importance or relevance of the feature to false comment identification. As shown in fig. 2 and 3, the most important feature in the hotel review set is the facet feature AARL, the less important feature is the behavioral feature RR (whether it is a repeated review), the most important feature in the restaurant review set is the facet feature TNAW, the less important feature is the behavioral feature ISR (whether it is a unique review of a product by a user), and the text feature OW (the proportion of objective words in the review) and RFPP (the proportion of first-person pronouns in the review) and the like are less important in false review recognition. In the hotel review set, facet and behavioral features accounted for 59.0% and 29.6% of feature importance, respectively, while text features accounted for only 11.4% of feature importance. Similar results were found in restaurant review sets. These results show that the false comment identification method based on the aspect features provided by the embodiment can accurately and effectively identify the false comment, and if the aspect features are combined with the behavior features, the identification capability of the false comment can be improved, and the test results in table 3 are further verified.
In addition, independent t-test was also used to determine if there was a significant difference in the aspect characteristics NAW, PAC, ARL, ARD, ASD between the false and true reviews. As shown in fig. 4 and 5, the results indicate that in the hotel review set, PAC (mean = 0.35) and NAW (mean = 0.15) of the false reviews are significantly lower than PAC (mean = 0.40) and NAW (mean = 0.18) of the real reviews, which may indicate that reviews containing facet words or less categories of facets are likely to be false reviews. The ARD of a false comment (mean = 0.33) is much higher than the ARD of a real comment (mean = 0.28), which may indicate that a comment is likely to be a false comment if the score of the comment is much different from the score of the affective word of the facet word in question in the comment.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
As shown in fig. 6, an embodiment of the present specification further provides a false comment identification apparatus based on aspect features, including:
the aspect extraction module is used for extracting aspect information from the comment to be identified; the aspect information comprises aspect words and corresponding emotion words, the reviewer, the reviewed product, the score, the comment content, the comment length and other information;
the aspect classification module is used for classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
the characteristic determining module is used for determining aspect characteristics according to the aspect information and the aspect categories;
and the prediction module is used for inputting the determined aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (7)

1. A false comment identification method based on aspect features, characterized by comprising the following steps:
extracting aspect information from a comment to be identified, wherein the aspect information comprises the reviewer, the reviewed product, the score, the comment content and the comment length;
extracting aspect words and corresponding emotion words from the comment content;
classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
determining aspect features based on the aspect information and the aspect categories, including:
counting the number of unique aspect words in the comment to be identified, wherein the method comprises the following steps:
$UAW_r = |A_r|$  (1)
wherein $A_r$ is the unique set of aspect words in comment r, and $|A_r|$ is the number of aspect words contained in the unique aspect word set;
the method for counting the proportion of the aspect categories in the comments to be identified comprises the following steps:
$PAC_r = \dfrac{|AC_r|}{NA_p}$  (2)
wherein $AC_r$ is the set of aspect categories in comment r, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, and $NA_p$ is the number of aspect categories of product p;
the method for counting the comment length of the relevant aspect category in the comment to be identified comprises the following steps:
$RL_{r,ac} = \dfrac{|s_{r,ac}|}{|w_r|}$  (3)
wherein the comment comprises at least one comment clause, $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r;
determining the emotion words corresponding to the aspect words in the comment to be identified, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories, wherein the method comprises the following steps:
$RD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,rating_r - SS(o_{r,ac})\,\big|\,\big)$  (4)
wherein $o_{r,ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r, $SS(o_{r,ac})$ is the emotion score corresponding to the emotion word $o_{r,ac}$, $rating_r$ is the score of comment r, and $\mathrm{mean}(\cdot)$ denotes calculating the mean value;
counting the average value of the deviation between the emotion scores of the emotion words in the aspect categories in the comments and the emotion scores of the emotion words in the same aspect categories in all the comments, wherein the method comprises the following steps:
$AD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,SS(o_{r,ac}) - \overline{SS}_{ac}\,\big|\,\big)$  (5)
wherein $o_{r',ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r', $SS(o_{r',ac})$ is the emotion score of the emotion word $o_{r',ac}$, $R_{ac}$ is the set of comments whose aspect category is ac, and $\overline{SS}_{ac} = \mathrm{mean}_{r' \in R_{ac}}\, SS(o_{r',ac})$ is the average emotion score of the emotion words of aspect category ac over the comment set $R_{ac}$;
and inputting the aspect features into a pre-trained comment recognition model, the comment recognition model outputting a prediction result of whether the comment to be identified is a false comment.
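By way of illustration only, the following is a minimal Python sketch of how the comment-level aspect features of equations (1) to (5) might be computed once the aspect words, emotion words and aspect categories have been extracted; the data layout and all names used here (Clause, Comment, category_mean_score, and so on) are assumptions for the sketch, not part of the claimed method.

from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Clause:
    words: list                                   # all words of the clause
    aspects: dict = field(default_factory=dict)   # aspect word -> (aspect category, emotion score)

@dataclass
class Comment:
    rating: float                                 # the reviewer's score for the product
    clauses: list                                 # list of Clause objects

def comment_features(c: Comment, n_categories_of_product: int, category_mean_score: dict):
    """Sketch of equations (1)-(5); category_mean_score is assumed to hold, per aspect
    category, the average emotion score over all comments mentioning that category."""
    aspect_words, per_cat_scores, per_cat_words = set(), {}, {}
    total_words = sum(len(cl.words) for cl in c.clauses)
    for cl in c.clauses:
        for word, (cat, score) in cl.aspects.items():
            aspect_words.add(word)
            per_cat_scores.setdefault(cat, []).append(score)
            per_cat_words[cat] = per_cat_words.get(cat, 0) + len(cl.words)

    uaw = len(aspect_words)                                          # Eq. (1): unique aspect words
    pac = len(per_cat_scores) / n_categories_of_product              # Eq. (2): category proportion
    rl = {cat: n / total_words for cat, n in per_cat_words.items()}  # Eq. (3): per-category length
    rd = mean(abs(c.rating - mean(s)) for s in per_cat_scores.values())    # Eq. (4)
    ad = mean(abs(mean(s) - category_mean_score[cat])                      # Eq. (5)
              for cat, s in per_cat_scores.items())
    return uaw, pac, rl, rd, ad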
2. The method of claim 1, wherein the extracting the aspect words and the emotion words from the comment content comprises:
and extracting the aspect words and the corresponding emotion words from the comments to be recognized by utilizing a pre-trained Bi-LSTM classification model.
3. The method of claim 2, wherein the method of training the Bi-LSTM classification model is:
providing a comment sample set, wherein each comment in the comment sample set comprises all of its words and a label corresponding to each word; the label of a word is one of aspect word, emotion word or other word;
and training a Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
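One possible sketch, using PyTorch, of a Bi-LSTM classification model of the kind trained in claim 3, which assigns each word of a comment one of three labels (aspect word, emotion word, other word); the embedding size, hidden size and training loop shown here are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Labels every token of a comment as aspect word (0), emotion word (1) or other (2)."""
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128, num_labels: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.classifier(hidden)                            # (batch, seq_len, num_labels)

def train_tagger(model: BiLSTMTagger, batches, epochs: int = 5):
    """batches is assumed to yield (token_ids, per-word labels) tensor pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in batches:
            logits = model(token_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()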
4. The method of claim 1, wherein classifying the aspect words to obtain the classified aspect words and aspect categories to which the aspect words belong comprises:
selecting, from the extracted aspect words, a part of the aspect words whose popularity reaches a preset popularity;
determining a preset number of aspect categories;
and dividing the part of aspect words into specific aspect categories according to the preset number of aspect categories by using a K-means clustering method.
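A brief sketch, assuming pre-computed word embeddings and word popularity counts, of how the selection and K-means clustering of claim 4 could be carried out with scikit-learn; the popularity threshold and the number of categories are hypothetical defaults.

import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_words(word_vectors: dict, popularity: dict,
                         min_popularity: int = 5, n_categories: int = 10) -> dict:
    """Keep aspect words whose popularity reaches the preset threshold, then divide
    them into the preset number of aspect categories with K-means clustering."""
    words = [w for w, count in popularity.items() if count >= min_popularity]
    vectors = np.stack([word_vectors[w] for w in words])
    labels = KMeans(n_clusters=n_categories, n_init=10).fit_predict(vectors)
    return {word: int(category) for word, category in zip(words, labels)}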
5. The method of claim 1, wherein determining aspect characteristics from the aspect information and the aspect categories further comprises:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
6. The method of claim 5, wherein the method of determining the total number of unique aspect words published by a reviewer is:
$UAW_u = \sum_{r \in R_u} |A_{u,r}|$  (6)
wherein $R_u$ is the set of all comments posted by reviewer u, $A_{u,r}$ is the set of aspect words in comment r of reviewer u, and $|A_{u,r}|$ is the number of aspect words in the set $A_{u,r}$;
the method for determining the average proportion of the aspect categories issued by the reviewers comprises the following steps:
$PAC_u = \dfrac{\sum_{r \in R_u} |AC_r|}{\sum_{p \in P_u} NA_p}$  (7)
wherein $AC_r$ is the set of aspect categories in comment r of reviewer u, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, $P_u$ is the set of products reviewed by reviewer u, and $NA_p$ is the number of aspect categories contained in product p;
the method for determining the total number of comment lengths of the reviewers on all the facet categories is as follows:
$RL_{u,ac} = \sum_{r \in R_u} \dfrac{|s_{r,ac}|}{|w_r|}$  (8)
wherein $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r.
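A compact sketch of how the reviewer-level features of equations (6) to (8) might be aggregated from one reviewer's comments; the dictionary layout of each comment (keys such as "aspect_words", "categories", "product") is an assumption made only for this illustration.

from collections import defaultdict

def reviewer_features(comments: list, n_categories_per_product: dict):
    """Sketch of equations (6)-(8) for a single reviewer u, where `comments` plays the
    role of the set R_u of that reviewer's comments after aspect extraction."""
    uaw_u = sum(len(c["aspect_words"]) for c in comments)                      # Eq. (6)
    products = {c["product"] for c in comments}                                # the set P_u
    pac_u = (sum(len(c["categories"]) for c in comments) /
             sum(n_categories_per_product[p] for p in products))               # Eq. (7)
    rl_u = defaultdict(float)                                                  # Eq. (8)
    for c in comments:
        for category, clause_words in c["clause_words_per_category"].items():
            rl_u[category] += clause_words / c["total_words"]
    return uaw_u, pac_u, dict(rl_u)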
7. An apparatus for identifying false comments based on aspect features, comprising:
the aspect extraction module is used for extracting aspect information from the comment to be identified, the aspect information comprising the reviewer, the reviewed product, the score, the comment content and the comment length, and for extracting aspect words and corresponding emotion words from the comment content;
the aspect classification module is used for classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
a feature determination module for determining aspect features based on the aspect information and the aspect categories, comprising: counting the number of unique aspect words in the comment to be identified; counting the proportion of the aspect categories in the comment to be identified; counting the comment length of the relevant aspect categories in the comment to be identified; determining the emotion words corresponding to the aspect words in the comment to be identified, and determining the emotion scores corresponding to the emotion words; determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories; counting the average value of the deviation between the emotion scores of the emotion words of each aspect category in the comment and the emotion scores of the emotion words of the same aspect category in all comments; wherein,
the method for counting the number of the unique aspect words in the comment to be identified comprises the following steps:
$UAW_r = |A_r|$  (1)
wherein $A_r$ is the unique set of aspect words in comment r, and $|A_r|$ is the number of aspect words contained in the unique aspect word set;
the method for counting the proportion of the aspect categories in the comments to be identified comprises the following steps:
$PAC_r = \dfrac{|AC_r|}{NA_p}$  (2)
wherein $AC_r$ is the set of aspect categories in comment r, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, and $NA_p$ is the number of aspect categories of product p;
the method for counting the comment length of the relevant aspect category in the comment to be identified comprises the following steps:
$RL_{r,ac} = \dfrac{|s_{r,ac}|}{|w_r|}$  (3)
wherein the comment comprises at least one comment clause, $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r;
the method for determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories is as follows:
$RD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,rating_r - SS(o_{r,ac})\,\big|\,\big)$  (4)
wherein $o_{r,ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r, $SS(o_{r,ac})$ is the emotion score corresponding to the emotion word $o_{r,ac}$, $rating_r$ is the score of comment r, and $\mathrm{mean}(\cdot)$ denotes calculating the mean value;
the method for counting the average value of the deviation between the emotion scores of the emotion words in the aspect categories in the comments and the emotion scores of the emotion words in the same aspect categories in all the comments comprises the following steps:
$AD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,SS(o_{r,ac}) - \overline{SS}_{ac}\,\big|\,\big)$  (5)
wherein $o_{r',ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r', $SS(o_{r',ac})$ is the emotion score of the emotion word $o_{r',ac}$, $R_{ac}$ is the set of comments whose aspect category is ac, and $\overline{SS}_{ac} = \mathrm{mean}_{r' \in R_{ac}}\, SS(o_{r',ac})$ is the average emotion score of the emotion words of aspect category ac over the comment set $R_{ac}$;
and the prediction module is used for inputting the aspect features into a pre-trained comment identification model, the comment identification model outputting a prediction result of whether the comment to be identified is a false comment.
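Finally, a small sketch of the prediction module, in which the pre-trained comment identification model is assumed to be an ordinary binary classifier over the aspect-feature vectors; the choice of a random forest here is illustrative only and is not mandated by the claims.

from sklearn.ensemble import RandomForestClassifier

def build_recognition_model(train_features, train_labels):
    """train_features: aspect-feature vectors of labelled comments;
    train_labels: 1 for a false comment, 0 for a genuine comment (assumed encoding)."""
    model = RandomForestClassifier(n_estimators=200)
    model.fit(train_features, train_labels)
    return model

def predict_false_comment(model, aspect_feature_vector) -> bool:
    """Returns True if the comment represented by the feature vector is predicted to be false."""
    return bool(model.predict([aspect_feature_vector])[0])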
CN202110487429.8A 2021-04-30 2021-04-30 False comment identification method and device based on aspect features Active CN112989056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487429.8A CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487429.8A CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Publications (2)

Publication Number Publication Date
CN112989056A CN112989056A (en) 2021-06-18
CN112989056B true CN112989056B (en) 2021-07-30

Family

ID=76336946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487429.8A Active CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Country Status (1)

Country Link
CN (1) CN112989056B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374372B (en) * 2022-08-26 2023-04-07 广州工程技术职业学院 Method, device, equipment and storage medium for quickly identifying false information of network community
CN116385029B (en) * 2023-04-20 2024-01-30 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN111666480A (en) * 2020-06-10 2020-09-15 东北电力大学 False comment identification method based on rolling type collaborative training
CN112597302A (en) * 2020-12-18 2021-04-02 东北林业大学 False comment detection method based on multi-dimensional comment representation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874768B (en) * 2018-05-16 2019-04-16 山东科技大学 A kind of e-commerce falseness comment recognition methods based on theme emotion joint probability
US20200177529A1 (en) * 2018-11-29 2020-06-04 International Business Machines Corporation Contextually correlated live chat comments in a live stream with mobile notifications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN111666480A (en) * 2020-06-10 2020-09-15 东北电力大学 False comment identification method based on rolling type collaborative training
CN112597302A (en) * 2020-12-18 2021-04-02 东北林业大学 False comment detection method based on multi-dimensional comment representation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on false review identification based on semi-supervised learning algorithms; 任亚峰, 姬东鸿; Journal of Sichuan University (Engineering Science Edition); 2014-05-20; full text *
A survey of research on false review identification; 袁禄, 朱郑州; Computer Science; 2021-01-30; full text *

Also Published As

Publication number Publication date
CN112989056A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11238081B2 (en) Method, apparatus, and computer program product for classification and tagging of textual data
US11055557B2 (en) Automated extraction of product attributes from images
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
Alotaibi et al. Suggestion Mining from Opinionated Text of Big Social Media Data.
CN112989056B (en) False comment identification method and device based on aspect features
CN112667899A (en) Cold start recommendation method and device based on user interest migration and storage equipment
CN110347908B (en) Voice shopping method, device, medium and electronic equipment
CN112905739A (en) False comment detection model training method, detection method and electronic equipment
CN109598586A (en) A kind of recommended method based on attention model
CN107832338A (en) A kind of method and system for identifying core product word
CN106649686B (en) User interest grouping method and system based on the potential feature of multilayer
CN114792246B (en) Product typical feature mining method and system based on topic integrated clustering
CN107133811A (en) The recognition methods of targeted customer a kind of and device
Wei et al. Online education recommendation model based on user behavior data analysis
CN112463966B (en) False comment detection model training method, false comment detection model training method and false comment detection model training device
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN114138932A (en) Method, device and equipment for determining explanatory information and readable storage medium
CN109815391A (en) News data analysis method and device, electric terminal based on big data
CN112182126A (en) Model training method and device for determining matching degree, electronic equipment and readable storage medium
Laily et al. Mining Indonesia tourism's reviews to evaluate the services through multilabel classification and LDA
KR102684423B1 (en) Method and system for data searching
CN115357711A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
Jayawickrama et al. Seeking sinhala sentiment: Predicting facebook reactions of sinhala posts
Pramudya et al. Hotel Reviews Classification and Review-based Recommendation Model Construction using BERT and RoBERTa

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant