CN112989056B - False comment identification method and device based on aspect features - Google Patents

False comment identification method and device based on aspect features

Info

Publication number
CN112989056B
CN112989056B (application CN202110487429.8A; published as CN112989056A)
Authority
CN
China
Prior art keywords
words
comment
emotion
categories
comments
Prior art date
Legal status
Active
Application number
CN202110487429.8A
Other languages
Chinese (zh)
Other versions
CN112989056A (en)
Inventor
吕欣
蔡梦思
谭跃进
豆亚杰
谭索怡
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110487429.8A
Publication of CN112989056A
Application granted
Publication of CN112989056B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213: Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/289: Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06Q 30/0282: Commerce; marketing; rating or review of business operators or products


Abstract

One or more embodiments of the present specification provide a false comment identification method and apparatus based on aspect features, including: extracting aspect information from the comment to be identified, where the aspect information includes aspect words; classifying the aspect words to obtain the classified aspect words and the aspect categories to which they belong; determining aspect features according to the aspect information and the aspect categories; and inputting the aspect features into a pre-trained comment recognition model, which outputs a prediction of whether the comment to be identified is a false comment. The embodiments can accurately identify false comments.

Description

False comment identification method and device based on aspect features
Technical Field
One or more embodiments of the present disclosure relate to the technical field of artificial intelligence, and in particular, to a false comment identification method and apparatus based on aspect features.
Background
With the development of information technology, using various online platforms for shopping, traveling, ordering food, transportation and the like has greatly improved the convenience of daily life. Each product on an online platform carries a number of comments: users can decide whether to buy the product according to the comments, and merchants can improve product quality and service according to them. However, the many comments also contain false comments, which mislead the judgment of users and merchants and hinder healthy economic development. Accurately identifying false comments among the comments and screening them out is therefore a key problem that online platforms need to solve.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure are directed to a method and an apparatus for identifying a false comment based on an aspect feature, which can identify the false comment.
In view of the above, one or more embodiments of the present specification provide a method for identifying false comments based on aspect features, including:
extracting aspect information from the comments to be identified; the aspect information comprises aspect words;
classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
determining aspect features according to the aspect information and the aspect categories;
and inputting the aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
Optionally, the extracting aspect information from the comment to be identified includes:
extracting reviewers, review products, scores, review contents and review lengths from the reviews to be identified;
and extracting the aspect words and the corresponding emotion words from the comment content.
Optionally, the aspect words and the corresponding emotion words are extracted from the comment content by:
and extracting the aspect words and the corresponding emotion words from the comments to be recognized by utilizing a pre-trained Bi-LSTM classification model.
Optionally, the method for training the Bi-LSTM classification model is as follows:
providing a comment sample set, wherein the comments in the comment sample set comprise all the contained words and the labels corresponding to the words; the label of a word is one of an aspect word, an emotion word or another word;
and training a Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
Optionally, the classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong includes:
selecting, according to a preset popularity, part of the aspect words having the preset popularity from the extracted aspect words;
determining a predetermined number of aspect categories;
and dividing the selected aspect words into specific aspect categories according to the predetermined number of aspect categories by using a K-means clustering method.
Optionally, determining an aspect feature according to the aspect information and the aspect category includes:
counting the number of unique aspect words in the comment to be identified;
counting the proportion of the aspect categories in the comments to be identified;
counting the comment length of the related aspect category in the comment to be identified;
determining the emotion words corresponding to the aspect words in the comment to be recognized, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the scores in the comments to be identified and the emotion scores corresponding to the emotion words of all aspects of categories;
and counting the average value of the deviation of the emotion scores of the emotion words of the aspect categories in the comments and the emotion scores of the emotion words of the same aspect categories in all the comments.
Optionally, the number of unique aspect words NAW_r in the comment to be identified is counted as:

NAW_r = |UAW_r|  (1)

wherein UAW_r is the set of unique aspect words in comment r, and |UAW_r| is the number of aspect words contained in the unique aspect word set;

the proportion of aspect categories PAC_r in the comment to be identified is counted as:

PAC_r = |AC_r| / NA_p  (2)

wherein AC_r is the set of aspect categories in comment r, |AC_r| is the number of aspect categories contained in the set AC_r, and NA_p is the number of aspect categories of product p;

the comment length of the related aspect categories ARL_{r,ac} in the comment to be identified is counted as:

ARL_{r,ac} = Σ_{c ∈ C_{r,ac}} |c| / |r|  (3)

wherein the comment includes at least one comment clause, C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r;

the average deviation ARD_r between the score of the comment to be identified and the emotion scores corresponding to the emotion words of the aspect categories is determined as:

ARD_r = mean_{ac}( |s(o_{r,ac}) − g_r| / 4 )  (4)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is the emotion score corresponding to that emotion word, g_r is the score of comment r, and mean(·) denotes averaging over the emotion words of the aspect categories in the comment;

the average deviation ASD_r between the emotion scores of the emotion words of the aspect categories in the comment and the emotion scores of the emotion words of the same aspect categories in all comments is counted as:

ASD_r = mean_{ac}( |s(o_{r,ac}) − mean_{r' ∈ R_ac} s(o_{r',ac})| )  (5)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is its emotion score, R_ac is the set of comments involving aspect category ac, and mean_{r' ∈ R_ac} s(o_{r',ac}) is the average emotion score of the corresponding emotion words over the comments in R_ac.
Optionally, determining an aspect feature according to the aspect information and the aspect category includes:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
Optionally, the total number of unique aspect words TNAW_u issued by the reviewer is determined as:

TNAW_u = | ∪_{r ∈ R_u} AW_r |  (6)

wherein R_u is the set of all comments issued by reviewer u, AW_r is the set of aspect words in comment r of reviewer u, and |AW_r| is the number of aspect words in the set;

the average proportion of aspect categories APAC_u issued by the reviewer is determined as:

APAC_u = Σ_{r ∈ R_u} |AC_r| / Σ_{p ∈ P_u} NA_p  (7)

wherein AC_r is the aspect category set of comment r of reviewer u, |AC_r| is the number of aspect categories contained in AC_r, P_u is the set of products reviewed by reviewer u, and NA_p is the number of aspect categories of product p;

the total comment length AARL_u of the reviewer over all aspect categories is determined as:

AARL_u = Σ_{r ∈ R_u} Σ_{ac} Σ_{c ∈ C_{r,ac}} |c| / |r|  (8)

wherein C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r.
The embodiment of the present specification further provides a false comment identification device based on aspect characteristics, including:
the aspect extraction module is used for extracting aspect information from the comments to be identified; the aspect information comprises aspect words;
the aspect classification module is used for classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
a feature determination module for determining aspect features according to the aspect information and the aspect categories;
and the prediction module is used for inputting the aspect characteristics into a pre-trained comment identification model, and outputting a prediction result of whether the comment to be identified is a false comment or not by the comment identification model.
As can be seen from the above, in the false comment recognition method and apparatus based on aspect features provided in one or more embodiments of the present specification, the aspect information is extracted from the comment to be recognized, the aspect words are classified, the classified aspect words and the aspect categories to which the aspect words belong are obtained, the aspect features are determined according to the aspect information and the aspect categories, the aspect features are input into a pre-trained comment recognition model, and the comment recognition model outputs a prediction result of whether the comment to be recognized is a false comment. According to the method, whether the comment is a false comment or not is judged by analyzing the comment content about the product attribute in the comment, and the identification accuracy of the false comment can be improved.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a schematic flow chart of a method according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram of feature importance analysis on the hotel review set according to one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of feature importance analysis on the restaurant review set according to one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of aspect feature mean analysis on the hotel review set according to one or more embodiments of the present disclosure;
FIG. 5 is a schematic diagram of aspect feature mean analysis on the restaurant review set according to one or more embodiments of the present disclosure;
FIG. 6 is a schematic diagram of an apparatus according to one or more embodiments of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, an online platform needs to identify false comments among numerous comments and screen them out, so that users can select products according to genuine comments and merchants can improve product quality, service quality and the like according to genuine comments and product sales. Current false comment identification methods identify false comments based on text features, behavior features and the like; however, because false reviewers can forge text features and behavior features, the identification results are not accurate enough.
In implementing the present disclosure, the applicant found that when purchasing a product, a user reads the comments on the product attributes in detail, and these comments influence the selection of the product. Product attributes are referred to in prior research as product aspects. Since a false reviewer generally has not used the product, the aspect-related content of a false comment differs from that of a genuine comment, and aspect features are difficult for a false reviewer to forge; therefore, analyzing the aspect features of a product can effectively identify false comments.
In view of the above, the present specification provides a false comment identification method based on aspect features, which can improve the identification accuracy of false comments by extracting aspect information from comments, determining aspect features according to the aspect information, and identifying false comments through the aspect features.
Hereinafter, the technical means of the present disclosure will be described in further detail with reference to specific examples.
As shown in fig. 1, one or more embodiments of the present specification provide a method for identifying false comments based on aspect features, including:
s101: extracting aspect information from the comments to be identified; the aspect information comprises aspect words and corresponding emotion words, commentators, commentary products, scores, commentary content, commentary length and other information;
in this embodiment, first, the aspect information of the relevant product is extracted from the obtained comment to be identified. The aspect information comprises aspect words related to product attributes, emotion words expressing emotional tendency on product aspects, commentators who issue comments, commented products, scores on the products, comment contents, length of the comment contents and the like.
S102: classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
in this embodiment, after the aspect words are extracted, according to the popularity of the current aspect words, part of the aspect words generally accepted by the user are selected and classified, so as to obtain the classified aspect words and the aspect categories to which the aspect words belong.
In some modes, part of the aspect words generally accepted by users can be selected according to the popularity of the aspect words, and the selected aspect words are divided into at least two aspect categories by using a K-means clustering method. For example, from the reviews of a hotel, the extracted aspect words may include water, cheese, hamburgers, gym, television, swimming pool, attendants, luggage delivery, etc., and these aspect words can be classified into food (water, cheese, hamburgers), hardware facilities (gym, swimming pool, television), service (attendants, luggage delivery), and the like.
S103: determining aspect characteristics according to the aspect information and the aspect categories;
in some embodiments, determining eight kinds of aspect features according to the extracted aspect information, the classified aspect words, and the aspect categories to which the aspect words belong includes: the number of unique aspect words contained in the comment, the proportion of aspect categories contained in the comment, the length of the comment on related aspect categories in the comment, the average deviation between the score of the comment and the sentiment scores corresponding to the sentiment words on the aspect categories in the comment, the average of the sentiment scores of the sentiment words on the aspect categories in the comment and the sentiment scores of the sentiment words on the same aspect categories in all comments, the total number of unique aspect words published by the reviewer, the average proportion of aspect categories published by the reviewer, and the total number of the comment lengths of the reviewer on all aspect categories. According to the analysis of the above eight aspect features, whether the comment is a false comment or not can be predicted.
S104: and inputting the determined aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
In the embodiment, after the comment to be recognized is processed, the aspect characteristics of the comment to be recognized are obtained, the aspect characteristics are input into a comment recognition model obtained through pre-training, the comment recognition model predicts according to the input aspect characteristics, a prediction result is output, and whether the comment to be recognized is a false comment or not can be determined according to the prediction result.
The method for recognizing the false comment based on the aspect characteristics comprises the steps of extracting aspect information from the comment to be recognized, classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong, determining the aspect characteristics according to the aspect information, the classified aspect words and the aspect categories, inputting the aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is the false comment or not through the comment recognition model. According to the method, whether the comment to be identified is the false comment or not is judged by analyzing the comment content about the product attribute in the comment to be identified, and the identification accuracy of the false comment can be improved.
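For orientation, the four steps can be strung together as a compact skeleton. This is a minimal sketch in Python; every helper name (extract_aspect_terms, build_aspect_features) and every data field used here is a hypothetical placeholder for the components detailed in the following embodiments, not an interface defined by this disclosure.

```python
# Hypothetical end-to-end skeleton of S101-S104; the helper functions are
# placeholders for the components described in the following embodiments.
def identify_false_comment(comment, bilstm_tagger, aspect_to_category, recognition_model):
    # S101: extract aspect words and emotion words from the comment text
    aspect_words, emotion_words = extract_aspect_terms(comment.text, bilstm_tagger)
    # S102: map each extracted aspect word to its aspect category
    categories = [aspect_to_category[w] for w in aspect_words if w in aspect_to_category]
    # S103: build the eight aspect features from the aspect information and categories
    features = build_aspect_features(comment, aspect_words, emotion_words, categories)
    # S104: the pre-trained comment recognition model predicts false (1) or genuine (0)
    return recognition_model.predict([features])[0]
```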
In some embodiments, extracting facet information from the comments to be identified includes:
extracting reviewers, review products, scores, review contents and review lengths from the reviews to be identified;
and extracting the aspect words and the emotional words from the comment content.
The obtained comment information is generally structured data, from which the reviewer, the comment object (a specific product), the score, the specific comment content, the length of the comment content and the like can be conveniently extracted. The comment content itself is unstructured data, and the specific elements of the comment, such as the aspect words and the emotion words, are contained in the comment content and are not easy to extract directly.
In some embodiments, specific comment contents such as facet words and emotion words can be extracted from the comment to be recognized by using the Bi-LSTM classification model. The Bi-LSTM classification model is obtained by pre-training, and the method for training the Bi-LSTM classification model comprises the following steps:
providing a comment sample set, wherein comments in the comment sample set comprise all contained words and labels corresponding to the words; wherein, the label of the word is one of the aspect word, the emotion word or other words;
and training the Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
In this embodiment, in order to extract the aspect words and the emotion words from the comment content, the Bi-LSTM classification model is trained in advance, so that after the comment to be recognized is input into the Bi-LSTM classification model, the labels corresponding to each word to be recognized and reviewed can be output, and according to the labels of each output word, the words with the labels of the aspect words and the words with the labels of the emotion words are extracted from the comment to be recognized.
For the comments in the original comment sample set, each word can be labeled with a label from a preset label set in advance, after the label of each word is determined, word embedding is carried out by using a word2vec model, and the comment sample set comprising the words and the labels thereof is obtained. And training to obtain the Bi-LSTM classification model based on the comment sample set marked with the label.
In some approaches, the Bi-LSTM classification model includes an embedding layer, Bi-LSTM layers, a softmax layer and an output layer. The comment to be recognized is input into the Bi-LSTM classification model, and the model predicts the label of each word in the comment. First, the input comment to be recognized is represented as a vector X by the embedding layer; then, three Bi-LSTM layers are used to extract the context features of each word; a softmax activation function on top of the Bi-LSTM layers computes the probability p of each label in the label set {A, E, O} (A for aspect words, E for emotion words, O for other words); finally, the label of each word in the input comment to be recognized is predicted according to the output of the softmax layer.
For example, if the input comment to be recognized is "Service is good", the Bi-LSTM classification model will output the label A for "Service", the label E for "good", and the label O for "is". A tuple <service, good> containing the aspect word and the emotion word is then obtained from the comment to be recognized, where "service" is an aspect of the product and "good" is the reviewer's opinion of that product aspect (service).
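The tagging architecture just described can be sketched as follows. This is a minimal illustration in PyTorch (the disclosure does not prescribe a framework); the embedding size and other hyperparameters are assumptions chosen only to make the sketch runnable.

```python
# A sketch of the described sequence tagger: embedding layer -> three stacked
# bidirectional LSTM layers -> softmax over the label set {A, E, O}
# (A = aspect word, E = emotion word, O = other word).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100, num_labels=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                     # (batch, seq_len, 2*hidden_dim)
        return torch.log_softmax(self.classifier(h), dim=-1)
```

Trained with a token-level cross-entropy loss on comments labeled with {A, E, O}, the per-token argmax at inference yields the aspect-word/emotion-word tuples such as <service, good>.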
In some embodiments, classifying the extracted aspect words to obtain the classified aspect words and aspect categories to which the aspect words belong includes:
selecting partial aspect words with preset popularity from the extracted aspect words according to the preset popularity;
determining a predetermined number of facet classes;
and dividing the selected part of the aspect words into specific aspect categories according to a preset number of aspect categories by using a K-means clustering method.
Since the comment content submitted by reviewers is arbitrary, different aspect words in the comment content may actually correspond to the same product attribute; for example, different words referring to cheese denote substantially the same product with the same product attributes. Moreover, one comment by a reviewer may mention several specific attributes of a product that belong to essentially the same aspect category; for example, a review of a restaurant's food may mention pizza, french fries, coffee, etc., which can all be categorized into a food aspect category. Therefore, in order to simplify data processing and reduce the amount of data to be processed, aspect words with a certain popularity are selected according to the popularity of the product aspect words, the predetermined number of aspect categories to be divided is determined, and the selected aspect words in the comments are then divided into that predetermined number of aspect categories by using the K-means clustering method; each aspect word belongs to one aspect category, and each aspect category may contain one or more aspect words. In some modes, the K-means clustering method is a clustering algorithm based on feature similarity, and the aspect words belonging to the same aspect category are semantically similar.
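As a concrete illustration of this clustering step, the following sketch clusters word2vec vectors of the selected aspect words with scikit-learn's KMeans; the fixed number of categories (14 in the experiments below) and the use of Gensim word vectors are assumptions consistent with, but not mandated by, the text.

```python
# A minimal sketch: cluster popular aspect words into a fixed number of
# aspect categories by the similarity of their word2vec vectors.
import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_words(aspect_words, w2v_model, n_categories=14):
    words = [w for w in aspect_words if w in w2v_model.wv]
    vectors = np.array([w2v_model.wv[w] for w in words])
    labels = KMeans(n_clusters=n_categories, random_state=0).fit_predict(vectors)
    # Map each aspect word to the index of its aspect category.
    return dict(zip(words, labels))
```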
In some embodiments, determining aspect characteristics from the aspect information and the aspect categories comprises:
counting the number of unique aspect words in the comment to be identified;
counting the proportion of the aspect categories in the comments to be identified;
counting the comment length of the related aspect category in the comment to be identified;
determining the emotion words corresponding to the aspect words in the comment to be recognized, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the scores in the comments to be identified and the emotion scores corresponding to the emotion words of all aspects of categories;
counting the average value of the deviation of the emotion scores of the emotion words in the comment for the aspect category and the emotion scores of the emotion words in the same aspect category in all the comments;
in this embodiment, five kinds of aspect features centering on the comment are used as a criterion for judging whether the comment to be identified is a false comment.
In some embodiments, the number of unique aspect words in the comment to be identified (denoted as NAW) is determined by:

NAW_r = |UAW_r|  (1)

wherein UAW_r is the set of unique aspect words in comment r, and |UAW_r| is the number of aspect words contained in the unique aspect word set. A unique aspect word is an aspect word that appears in the comment without repetition.

The proportion of aspect categories in the comment to be identified (denoted as PAC) is counted by:

PAC_r = |AC_r| / NA_p  (2)

wherein AC_r is the set of aspect categories in comment r, |AC_r| is the number of aspect categories contained in the set AC_r, and NA_p is the number of aspect categories of product p.

The comment length of the related aspect categories in the comment to be identified (denoted as ARL) is determined by:

ARL_{r,ac} = Σ_{c ∈ C_{r,ac}} |c| / |r|  (3)

wherein the comment includes at least one comment clause, C_{r,ac} is the set of comment clauses in comment r in which an aspect word of aspect category ac is located, |c| is the number of words contained in comment clause c, and |r| is the number of words contained in comment r.

The average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of the aspect categories (denoted as ARD) is determined by:

ARD_r = mean_{ac}( |s(o_{r,ac}) − g_r| / 4 )  (4)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is the emotion score corresponding to that emotion word, g_r is the score of comment r (ranging from 1 to 5), the denominator 4 is the maximum difference between the highest and lowest scores (5 − 1 = 4), and mean(·) denotes averaging over the emotion words of the aspect categories in the comment.

The average deviation between the emotion scores of the emotion words of the aspect categories in the comment and the emotion scores of the emotion words of the same aspect categories in all comments (denoted as ASD) is counted by:

ASD_r = mean_{ac}( |s(o_{r,ac}) − mean_{r' ∈ R_ac} s(o_{r',ac})| )  (5)

wherein o_{r,ac} is the emotion word corresponding to an aspect word of aspect category ac in comment r, s(o_{r,ac}) is its emotion score, R_ac is the set of comments involving aspect category ac, and mean_{r' ∈ R_ac} s(o_{r',ac}) is the average emotion score of the corresponding emotion words over the comments in R_ac.
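Putting equations (1)-(5) together, the five comment-centric features can be computed roughly as follows. The data layout (a comment with clauses, a rating from 1 to 5, and (aspect word, aspect category, emotion word) triples), the sentiment lexicon senti_score and the per-category averages avg_cat_score are illustrative assumptions rather than structures defined by the disclosure.

```python
# A minimal sketch of the comment-centric aspect features NAW, PAC, ARL, ARD, ASD.
from statistics import mean

def comment_features(comment, num_product_categories, senti_score, avg_cat_score):
    # comment.triples: list of (aspect_word, aspect_category, emotion_word)
    # comment.clauses: list of clauses, each a list of words; comment.rating: 1..5
    total_words = sum(len(clause) for clause in comment.clauses)
    categories = {cat for _, cat, _ in comment.triples}

    naw = len({aw for aw, _, _ in comment.triples})                       # eq. (1)
    pac = len(categories) / num_product_categories                        # eq. (2)
    arl = {cat: sum(len(clause) for clause in comment.clauses
                    if any(aw in clause for aw, c, _ in comment.triples if c == cat))
           / total_words
           for cat in categories}                                         # eq. (3)
    ard = mean(abs(senti_score[ew] - comment.rating) / 4
               for _, _, ew in comment.triples)                           # eq. (4)
    asd = mean(abs(senti_score[ew] - avg_cat_score[cat])
               for _, cat, ew in comment.triples)                         # eq. (5)
    return naw, pac, arl, ard, asd
```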
In some embodiments, determining the aspect feature from the aspect information and the aspect category further comprises:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
In this embodiment, three aspect features centering on the reviewer are used as a basis for judging whether the comment to be identified is a false comment.
In some embodiments, from all the reviews, the total number of unique aspect words posted by the reviewer (denoted as TNAW) is determined by:

TNAW_u = | ∪_{r ∈ R_u} AW_r |  (6)

wherein R_u is the set of all comments posted by reviewer u, AW_r is the set of aspect words in comment r of reviewer u, and |AW_r| is the number of aspect words in that set.

From all the reviews, the average proportion of aspect categories posted by the reviewer (denoted as APAC) is determined by:

APAC_u = Σ_{r ∈ R_u} |AC_r| / Σ_{p ∈ P_u} NA_p  (7)

wherein AC_r is the aspect category set of comment r of reviewer u, |AC_r| is the number of aspect categories contained in AC_r, P_u is the set of products reviewed by reviewer u, and NA_p is the number of aspect categories of product p.

From all the reviews, the total comment length of the reviewer over all aspect categories (denoted as AARL) is determined by:

AARL_u = Σ_{r ∈ R_u} Σ_{ac} Σ_{c ∈ C_{r,ac}} |c| / |r|  (8)

wherein C_{r,ac}, |c| and |r| are as defined in equation (3).
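The three reviewer-centric features of equations (6)-(8) can then be aggregated over all of a reviewer's comments, for example as below; the comment structure reuses the assumptions of the previous sketch, and num_categories_of is an assumed lookup from a product to its number of aspect categories.

```python
# A minimal sketch of the reviewer-centric aspect features TNAW, APAC, AARL.
def reviewer_features(comments, num_categories_of):
    unique_aspect_words = set()
    category_count, category_total = 0, 0
    aarl = 0.0
    for c in comments:                                    # all comments of one reviewer
        unique_aspect_words |= {aw for aw, _, _ in c.triples}
        category_count += len({cat for _, cat, _ in c.triples})
        category_total += num_categories_of[c.product]
        total_words = sum(len(clause) for clause in c.clauses)
        aarl += sum(len(clause) for clause in c.clauses
                    if any(aw in clause for aw, _, _ in c.triples)) / total_words
    tnaw = len(unique_aspect_words)                       # eq. (6)
    apac = category_count / category_total                # eq. (7)
    return tnaw, apac, aarl                               # aarl: eq. (8)
```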
in some embodiments, pre-trained comment recognition models are used to predict the aspect characteristics of the comment to be recognized. The comment identification model is realized based on an XGboost model, the XGboost model is a gradient enhancement decision tree model, and output is predicted according to a series of rules arranged in a tree structure. In some modes, the comment recognition model is realized by adopting a scimit-spare tool in Python, and the method comprises the steps of randomly selecting a plurality of comments from a real comment set and a false comment set, respectively forming a training set, a testing set and a verification set, training the model on the training set, adjusting model parameters according to results on the verification set, and finally calculating the accuracy of model prediction on the testing set by using the trained model.
The following describes the prediction effect achievable by the method according to the present embodiment with reference to experimental data.
The aspect-feature-based false comment identification method provided by this embodiment was tested using the YelpChi_Hotel review set (containing reviews of some hotels in Chicago) and the YelpChi_Res review set (containing reviews of some restaurants in Chicago); at the same time, false comment identification methods based on text features and on behavior features were tested, and the results of the three false comment identification methods were compared.
First, the review sets are preprocessed: the comment sentences are split into clauses according to punctuation marks such as ",", ".", ";", "?" and "!"; word embedding is then carried out with the word2vec tool in the Gensim software package to obtain a multidimensional vector representation of each comment sentence.
Based on the preprocessed review sets, more than 600,000 comments written by more than 5,000 reviewers on more than 200,000 hotels are selected as the hotel comment sample set, and more than 700,000 comments written by more than 30,000 reviewers on more than 200,000 restaurants are selected as the restaurant comment sample set; these sample sets are used to test the aspect-feature-based false comment identification method of this embodiment, the text-feature-based false comment identification method and the behavior-feature-based false comment identification method.
For the hotel review sample set, the Bi-LSTM hotel classification model is trained in advance. During training, the hotel comment sample set can be divided into a training set, a verification set and a test set according to a certain proportion (for example, 70%, 15% and 15%), the learning rate of the model is set to 0.01, and the number of hidden layers is set to 100. After training, a Bi-LSTM hotel classification model capable of extracting hotel-like aspect words and emotion words from the comments is obtained.
In the same way, for the restaurant review sample set, a Bi-LSTM restaurant classification model capable of extracting restaurant type aspect words and emotion words from the reviews is trained in advance.
All the false comments in the hotel comment sample set and an equal number of genuine comments are selected to construct a balanced data set containing genuine and false comments; the balanced data set is used to pre-train the hotel comment recognition model, and the hotel comment recognition model is then used to predict whether a hotel comment is a false comment. During training, the parameters of the XGBoost model are carefully tuned by grid search, and the final optimal parameters of the XGBoost model are set as follows: the number of decision trees is 500, the learning rate is 0.01, the maximum tree depth is 10, the minimum leaf-node weight is 1, the random sampling proportion is 0.95, and the gamma value is 0.4. To prevent overfitting and better evaluate the effectiveness of the model, five-fold cross-validation experiments are randomly repeated 12 times to reduce errors.
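Under the stated hyperparameters, the training and evaluation procedure can be sketched with the xgboost scikit-learn wrapper; the feature matrix X (the eight aspect features of the balanced data set) and the labels y (1 for false comments) are assumed to have been prepared already.

```python
# A minimal sketch of the comment recognition model with the parameters above,
# evaluated by repeating 5-fold cross-validation 12 times with different shuffles.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=500,      # number of decision trees
    learning_rate=0.01,
    max_depth=10,          # maximum tree depth
    min_child_weight=1,    # minimum leaf-node weight
    subsample=0.95,        # proportion of random sampling
    gamma=0.4,
)

# X, y: prepared aspect-feature matrix and labels (see lead-in above).
scores = [cross_val_score(clf, X, y,
                          cv=KFold(n_splits=5, shuffle=True, random_state=seed),
                          scoring='accuracy').mean()
          for seed in range(12)]
print('mean accuracy over 12 runs:', np.mean(scores))
```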
According to the same method, a restaurant comment identification model is trained in advance for the restaurant comment sample set, and the restaurant comment identification model is used for predicting the restaurant comment to obtain a prediction result of whether the comment is a false comment or not.
According to the test result, the precision of extracting the aspect words and the emotion words from the hotel comment sample set by using the Bi-LSTM hotel classification model reaches 0.84, the precision of extracting the aspect words and the emotion words from the restaurant comment sample set by using the Bi-LSTM restaurant classification model reaches 0.92, the number of the extracted aspect words is large, and the extraction precision of the aspect words is high.
For the hotel-class aspect words, the most popular 792 aspect words are selected and 14 aspect categories are determined, as shown in Table 1.

Table 1. The 14 aspect categories into which the hotel-class aspect words are divided [table provided as an image in the original publication].

For the restaurant-class aspect words, the most popular 914 aspect words are selected and 14 aspect categories are determined, as shown in Table 2.

Table 2. The 14 aspect categories into which the restaurant-class aspect words are divided [table provided as an image in the original publication].
As shown in Tables 1 and 2, the extracted aspect words are clustered into meaningful aspect categories based on the semantic relationships between the words. For example, the aspect words in Table 2 that are divided into aspect category #1 include server, staff, waiter, waitress, owner, chef, etc., so aspect category #1 is an employee aspect category.
The method of the present embodiment and the existing methods based on text features and behavior features are evaluated according to five indexes: accuracy (a), recall (r), precision (p), F value (F) and the area under the ROC curve (AUC). As shown in Table 3, model performance under seven different feature combinations is compared: behavior features only (B), text features only (T), aspect features only (A), behavior plus text features (B+T), behavior plus aspect features (B+A), text plus aspect features (T+A), and all three features (B+T+A).
Table 3. Model performance under different feature combinations [table provided as an image in the original publication].
According to Table 3, the false comment identification method based on aspect features outperforms the identification methods based on text features and behavior features on all indexes. Combining aspect features with behavior features, the accuracy on the two comment sets reaches 97.8% and 85.9%; using aspect features alone, the accuracy reaches 96.7% and 85.3%; using the combination of behavior features and text features, the accuracy is 70.4% and 67.9%. The performance of the aspect-feature-based model is therefore improved by about 20% compared with the models based on the other two kinds of features.
Feature importance provides a score for each feature, the higher the score, the higher the importance or relevance of the feature to false comment identification. As shown in fig. 2 and 3, the most important feature in the hotel review set is the facet feature AARL, the less important feature is the behavioral feature RR (whether it is a repeated review), the most important feature in the restaurant review set is the facet feature TNAW, the less important feature is the behavioral feature ISR (whether it is a unique review of a product by a user), and the text feature OW (the proportion of objective words in the review) and RFPP (the proportion of first-person pronouns in the review) and the like are less important in false review recognition. In the hotel review set, facet and behavioral features accounted for 59.0% and 29.6% of feature importance, respectively, while text features accounted for only 11.4% of feature importance. Similar results were found in restaurant review sets. These results show that the false comment identification method based on the aspect features provided by the embodiment can accurately and effectively identify the false comment, and if the aspect features are combined with the behavior features, the identification capability of the false comment can be improved, and the test results in table 3 are further verified.
In addition, independent t-test was also used to determine if there was a significant difference in the aspect characteristics NAW, PAC, ARL, ARD, ASD between the false and true reviews. As shown in fig. 4 and 5, the results indicate that in the hotel review set, PAC (mean = 0.35) and NAW (mean = 0.15) of the false reviews are significantly lower than PAC (mean = 0.40) and NAW (mean = 0.18) of the real reviews, which may indicate that reviews containing facet words or less categories of facets are likely to be false reviews. The ARD of a false comment (mean = 0.33) is much higher than the ARD of a real comment (mean = 0.28), which may indicate that a comment is likely to be a false comment if the score of the comment is much different from the score of the affective word of the facet word in question in the comment.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
As shown in fig. 6, an embodiment of the present specification further provides a false comment identification apparatus based on aspect features, including:
the aspect extraction module is used for extracting aspect information from the comment to be identified; the aspect information comprises aspect words and corresponding emotion words, the reviewer, the reviewed product, the score, the comment content, the comment length and other information;
the aspect classification module is used for classifying the extracted aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
the characteristic determining module is used for determining aspect characteristics according to the aspect information and the aspect categories;
and the prediction module is used for inputting the determined aspect characteristics into a pre-trained comment recognition model, and outputting a prediction result of whether the comment to be recognized is a false comment or not by the comment recognition model.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (7)

1. A false comment identification method based on aspect features, characterized by comprising the following steps:
extracting aspect information from a comment to be identified, wherein the aspect information comprises the reviewer, the reviewed product, the score, the comment content and the comment length;
extracting aspect words and corresponding emotion words from the comment content;
classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
determining aspect features based on the aspect information and the aspect categories, including:
counting the number of unique aspect words in the comment to be identified, wherein the method comprises the following steps:
$UAW_r = |A_r|$  (1)
wherein $A_r$ is the unique set of aspect words in comment r, and $|A_r|$ is the number of aspect words contained in the unique aspect word set;
the method for counting the proportion of the aspect categories in the comments to be identified comprises the following steps:
$PAC_r = \dfrac{|AC_r|}{NA_p}$  (2)
wherein $AC_r$ is the set of aspect categories in comment r, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, and $NA_p$ is the number of aspect categories of product p;
the method for counting the comment length of the relevant aspect category in the comment to be identified comprises the following steps:
$RL_{r,ac} = \dfrac{|s_{r,ac}|}{|w_r|}$  (3)
wherein the comment comprises at least one comment clause, $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r;
determining the emotion words corresponding to the aspect words in the comment to be identified, and determining the emotion scores corresponding to the emotion words;
determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories, wherein the method comprises the following steps:
$RD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,rating_r - SS(o_{r,ac})\,\big|\,\big)$  (4)
wherein $o_{r,ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r, $SS(o_{r,ac})$ is the emotion score corresponding to the emotion word $o_{r,ac}$, $rating_r$ is the score of comment r, and $\mathrm{mean}(\cdot)$ denotes calculating the mean value;
counting the average value of the deviation between the emotion scores of the emotion words in the aspect categories in the comments and the emotion scores of the emotion words in the same aspect categories in all the comments, wherein the method comprises the following steps:
$AD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,SS(o_{r,ac}) - \overline{SS}_{ac}\,\big|\,\big)$  (5)
wherein $o_{r',ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r', $SS(o_{r',ac})$ is the emotion score of the emotion word $o_{r',ac}$, $R_{ac}$ is the set of comments whose aspect category is ac, and $\overline{SS}_{ac} = \mathrm{mean}_{r' \in R_{ac}}\, SS(o_{r',ac})$ is the average emotion score of the emotion words of aspect category ac over the comment set $R_{ac}$;
and inputting the aspect features into a pre-trained comment recognition model, the comment recognition model outputting a prediction result of whether the comment to be identified is a false comment.
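By way of illustration only, the following is a minimal Python sketch of how the comment-level aspect features of equations (1) to (5) might be computed once the aspect words, emotion words and aspect categories have been extracted; the data layout and all names used here (Clause, Comment, category_mean_score, and so on) are assumptions for the sketch, not part of the claimed method.

from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Clause:
    words: list                                   # all words of the clause
    aspects: dict = field(default_factory=dict)   # aspect word -> (aspect category, emotion score)

@dataclass
class Comment:
    rating: float                                 # the reviewer's score for the product
    clauses: list                                 # list of Clause objects

def comment_features(c: Comment, n_categories_of_product: int, category_mean_score: dict):
    """Sketch of equations (1)-(5); category_mean_score is assumed to hold, per aspect
    category, the average emotion score over all comments mentioning that category."""
    aspect_words, per_cat_scores, per_cat_words = set(), {}, {}
    total_words = sum(len(cl.words) for cl in c.clauses)
    for cl in c.clauses:
        for word, (cat, score) in cl.aspects.items():
            aspect_words.add(word)
            per_cat_scores.setdefault(cat, []).append(score)
            per_cat_words[cat] = per_cat_words.get(cat, 0) + len(cl.words)

    uaw = len(aspect_words)                                          # Eq. (1): unique aspect words
    pac = len(per_cat_scores) / n_categories_of_product              # Eq. (2): category proportion
    rl = {cat: n / total_words for cat, n in per_cat_words.items()}  # Eq. (3): per-category length
    rd = mean(abs(c.rating - mean(s)) for s in per_cat_scores.values())    # Eq. (4)
    ad = mean(abs(mean(s) - category_mean_score[cat])                      # Eq. (5)
              for cat, s in per_cat_scores.items())
    return uaw, pac, rl, rd, ad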
2. The method of claim 1, wherein the extracting the aspect words and the emotion words from the comment content comprises:
and extracting the aspect words and the corresponding emotion words from the comments to be recognized by utilizing a pre-trained Bi-LSTM classification model.
3. The method of claim 2, wherein the method of training the Bi-LSTM classification model is:
providing a comment sample set, wherein each comment in the comment sample set comprises all of its words and a label corresponding to each word; the label of a word is one of aspect word, emotion word or other word;
and training a Bi-LSTM classification model by using the comment sample set to obtain the Bi-LSTM classification model capable of outputting the label of each word in the comment.
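One possible sketch, using PyTorch, of a Bi-LSTM classification model of the kind trained in claim 3, which assigns each word of a comment one of three labels (aspect word, emotion word, other word); the embedding size, hidden size and training loop shown here are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Labels every token of a comment as aspect word (0), emotion word (1) or other (2)."""
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128, num_labels: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.classifier(hidden)                            # (batch, seq_len, num_labels)

def train_tagger(model: BiLSTMTagger, batches, epochs: int = 5):
    """batches is assumed to yield (token_ids, per-word labels) tensor pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in batches:
            logits = model(token_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()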
4. The method of claim 1, wherein classifying the aspect words to obtain the classified aspect words and aspect categories to which the aspect words belong comprises:
selecting, from the extracted aspect words, a part of the aspect words whose popularity reaches a preset popularity;
determining a preset number of aspect categories;
and dividing the part of aspect words into specific aspect categories according to the preset number of aspect categories by using a K-means clustering method.
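A brief sketch, assuming pre-computed word embeddings and word popularity counts, of how the selection and K-means clustering of claim 4 could be carried out with scikit-learn; the popularity threshold and the number of categories are hypothetical defaults.

import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_words(word_vectors: dict, popularity: dict,
                         min_popularity: int = 5, n_categories: int = 10) -> dict:
    """Keep aspect words whose popularity reaches the preset threshold, then divide
    them into the preset number of aspect categories with K-means clustering."""
    words = [w for w, count in popularity.items() if count >= min_popularity]
    vectors = np.stack([word_vectors[w] for w in words])
    labels = KMeans(n_clusters=n_categories, n_init=10).fit_predict(vectors)
    return {word: int(category) for word, category in zip(words, labels)}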
5. The method of claim 1, wherein determining aspect characteristics from the aspect information and the aspect categories further comprises:
counting all comments issued by a reviewer according to the reviewer of the comment to be identified;
determining the total number of the unique aspect words issued by the reviewer according to all the reviews;
determining the average proportion of the aspect categories issued by the reviewer according to all the reviews;
from all reviews, the total number of review lengths for the reviewer for all facet categories is determined.
6. The method of claim 5, wherein the method of determining the total number of unique aspect words published by a reviewer is:
$UAW_u = \sum_{r \in R_u} |A_{u,r}|$  (6)
wherein $R_u$ is the set of all comments posted by reviewer u, $A_{u,r}$ is the set of aspect words in comment r of reviewer u, and $|A_{u,r}|$ is the number of aspect words in the set $A_{u,r}$;
the method for determining the average proportion of the aspect categories issued by the reviewers comprises the following steps:
$PAC_u = \dfrac{\sum_{r \in R_u} |AC_r|}{\sum_{p \in P_u} NA_p}$  (7)
wherein $AC_r$ is the set of aspect categories in comment r of reviewer u, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, $P_u$ is the set of products reviewed by reviewer u, and $NA_p$ is the number of aspect categories contained in product p;
the method for determining the total number of comment lengths of the reviewers on all the facet categories is as follows:
$RL_{u,ac} = \sum_{r \in R_u} \dfrac{|s_{r,ac}|}{|w_r|}$  (8)
wherein $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r.
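A compact sketch of how the reviewer-level features of equations (6) to (8) might be aggregated from one reviewer's comments; the dictionary layout of each comment (keys such as "aspect_words", "categories", "product") is an assumption made only for this illustration.

from collections import defaultdict

def reviewer_features(comments: list, n_categories_per_product: dict):
    """Sketch of equations (6)-(8) for a single reviewer u, where `comments` plays the
    role of the set R_u of that reviewer's comments after aspect extraction."""
    uaw_u = sum(len(c["aspect_words"]) for c in comments)                      # Eq. (6)
    products = {c["product"] for c in comments}                                # the set P_u
    pac_u = (sum(len(c["categories"]) for c in comments) /
             sum(n_categories_per_product[p] for p in products))               # Eq. (7)
    rl_u = defaultdict(float)                                                  # Eq. (8)
    for c in comments:
        for category, clause_words in c["clause_words_per_category"].items():
            rl_u[category] += clause_words / c["total_words"]
    return uaw_u, pac_u, dict(rl_u)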
7. An apparatus for identifying false comments based on aspect features, comprising:
the aspect extraction module is used for extracting aspect information from the comment to be identified, the aspect information comprising the reviewer, the reviewed product, the score, the comment content and the comment length, and for extracting aspect words and corresponding emotion words from the comment content;
the aspect classification module is used for classifying the aspect words to obtain the classified aspect words and the aspect categories to which the aspect words belong;
a feature determination module for determining aspect features based on the aspect information and the aspect categories, comprising: counting the number of unique aspect words in the comment to be identified; counting the proportion of the aspect categories in the comment to be identified; counting the comment length of the relevant aspect categories in the comment to be identified; determining the emotion words corresponding to the aspect words in the comment to be identified, and determining the emotion scores corresponding to the emotion words; determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories; counting the average value of the deviation between the emotion scores of the emotion words of each aspect category in the comment and the emotion scores of the emotion words of the same aspect category in all comments; wherein,
the method for counting the number of the unique aspect words in the comment to be identified comprises the following steps:
$UAW_r = |A_r|$  (1)
wherein $A_r$ is the unique set of aspect words in comment r, and $|A_r|$ is the number of aspect words contained in the unique aspect word set;
the method for counting the proportion of the aspect categories in the comments to be identified comprises the following steps:
$PAC_r = \dfrac{|AC_r|}{NA_p}$  (2)
wherein $AC_r$ is the set of aspect categories in comment r, $|AC_r|$ is the number of aspect categories contained in the set $AC_r$, and $NA_p$ is the number of aspect categories of product p;
the method for counting the comment length of the relevant aspect category in the comment to be identified comprises the following steps:
$RL_{r,ac} = \dfrac{|s_{r,ac}|}{|w_r|}$  (3)
wherein the comment comprises at least one comment clause, $s_{r,ac}$ is the comment clause in which the aspect words belonging to aspect category ac in comment r are located, $|s_{r,ac}|$ is the number of all words contained in the comment clause $s_{r,ac}$, and $|w_r|$ is the number of all words contained in comment r;
the method for determining the average deviation between the score of the comment to be identified and the emotion scores corresponding to the emotion words of all aspect categories is as follows:
$RD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,rating_r - SS(o_{r,ac})\,\big|\,\big)$  (4)
wherein $o_{r,ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r, $SS(o_{r,ac})$ is the emotion score corresponding to the emotion word $o_{r,ac}$, $rating_r$ is the score of comment r, and $\mathrm{mean}(\cdot)$ denotes calculating the mean value;
the method for counting the average value of the deviation between the emotion scores of the emotion words in the aspect categories in the comments and the emotion scores of the emotion words in the same aspect categories in all the comments comprises the following steps:
$AD_r = \mathrm{mean}_{ac \in AC_r}\big(\,\big|\,SS(o_{r,ac}) - \overline{SS}_{ac}\,\big|\,\big)$  (5)
wherein $o_{r',ac}$ is the emotion word corresponding to the aspect words whose aspect category is ac in comment r', $SS(o_{r',ac})$ is the emotion score of the emotion word $o_{r',ac}$, $R_{ac}$ is the set of comments whose aspect category is ac, and $\overline{SS}_{ac} = \mathrm{mean}_{r' \in R_{ac}}\, SS(o_{r',ac})$ is the average emotion score of the emotion words of aspect category ac over the comment set $R_{ac}$;
and the prediction module is used for inputting the aspect features into a pre-trained comment identification model, the comment identification model outputting a prediction result of whether the comment to be identified is a false comment.
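Finally, a small sketch of the prediction module, in which the pre-trained comment identification model is assumed to be an ordinary binary classifier over the aspect-feature vectors; the choice of a random forest here is illustrative only and is not mandated by the claims.

from sklearn.ensemble import RandomForestClassifier

def build_recognition_model(train_features, train_labels):
    """train_features: aspect-feature vectors of labelled comments;
    train_labels: 1 for a false comment, 0 for a genuine comment (assumed encoding)."""
    model = RandomForestClassifier(n_estimators=200)
    model.fit(train_features, train_labels)
    return model

def predict_false_comment(model, aspect_feature_vector) -> bool:
    """Returns True if the comment represented by the feature vector is predicted to be false."""
    return bool(model.predict([aspect_feature_vector])[0])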
CN202110487429.8A 2021-04-30 2021-04-30 False comment identification method and device based on aspect features Active CN112989056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487429.8A CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487429.8A CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Publications (2)

Publication Number Publication Date
CN112989056A CN112989056A (en) 2021-06-18
CN112989056B true CN112989056B (en) 2021-07-30

Family

ID=76336946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487429.8A Active CN112989056B (en) 2021-04-30 2021-04-30 False comment identification method and device based on aspect features

Country Status (1)

Country Link
CN (1) CN112989056B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374372B (en) * 2022-08-26 2023-04-07 广州工程技术职业学院 Method, device, equipment and storage medium for quickly identifying false information of network community
CN116385029B (en) * 2023-04-20 2024-01-30 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN111666480A (en) * 2020-06-10 2020-09-15 东北电力大学 False comment identification method based on rolling type collaborative training
CN112597302A (en) * 2020-12-18 2021-04-02 东北林业大学 False comment detection method based on multi-dimensional comment representation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874768B (en) * 2018-05-16 2019-04-16 山东科技大学 A kind of e-commerce falseness comment recognition methods based on theme emotion joint probability
US20200177529A1 (en) * 2018-11-29 2020-06-04 International Business Machines Corporation Contextually correlated live chat comments in a live stream with mobile notifications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN111666480A (en) * 2020-06-10 2020-09-15 东北电力大学 False comment identification method based on rolling type collaborative training
CN112597302A (en) * 2020-12-18 2021-04-02 东北林业大学 False comment detection method based on multi-dimensional comment representation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on false review identification based on semi-supervised learning algorithms; 任亚峰, 姬东鸿; Journal of Sichuan University (Engineering Science Edition); 2014-05-20; full text *
A survey of research on false review identification; 袁禄, 朱郑州; Computer Science; 2021-01-30; full text *

Also Published As

Publication number Publication date
CN112989056A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11238081B2 (en) Method, apparatus, and computer program product for classification and tagging of textual data
US11055557B2 (en) Automated extraction of product attributes from images
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
Alotaibi et al. Suggestion Mining from Opinionated Text of Big Social Media Data.
CN112989056B (en) False comment identification method and device based on aspect features
CN112667899A (en) Cold start recommendation method and device based on user interest migration and storage equipment
CN110347908B (en) Voice shopping method, device, medium and electronic equipment
CN112905739A (en) False comment detection model training method, detection method and electronic equipment
CN109598586A (en) A kind of recommended method based on attention model
CN107832338A (en) A kind of method and system for identifying core product word
CN106649686B (en) User interest grouping method and system based on the potential feature of multilayer
CN114792246B (en) Product typical feature mining method and system based on topic integrated clustering
CN107133811A (en) The recognition methods of targeted customer a kind of and device
Wei et al. Online education recommendation model based on user behavior data analysis
CN112463966B (en) False comment detection model training method, false comment detection model training method and false comment detection model training device
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN114138932A (en) Method, device and equipment for determining explanatory information and readable storage medium
CN109815391A (en) News data analysis method and device, electric terminal based on big data
CN112182126A (en) Model training method and device for determining matching degree, electronic equipment and readable storage medium
Laily et al. Mining Indonesia tourism's reviews to evaluate the services through multilabel classification and LDA
KR102684423B1 (en) Method and system for data searching
CN115357711A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
Jayawickrama et al. Seeking sinhala sentiment: Predicting facebook reactions of sinhala posts
Pramudya et al. Hotel Reviews Classification and Review-based Recommendation Model Construction using BERT and RoBERTa

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant