CN112559685A - Automobile forum spam comment identification method - Google Patents


Info

Publication number
CN112559685A
Authority
CN
China
Prior art keywords
comment
weight
sample
spam
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011458869.2A
Other languages
Chinese (zh)
Inventor
王磊
赛影辉
王志超
肖飞
韦圣兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhu Automotive Prospective Technology Research Institute Co ltd
Chery Automobile Co Ltd
Original Assignee
Wuhu Automotive Prospective Technology Research Institute Co ltd
Chery Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhu Automotive Prospective Technology Research Institute Co ltd, Chery Automobile Co Ltd filed Critical Wuhu Automotive Prospective Technology Research Institute Co ltd
Priority to CN202011458869.2A
Publication of CN112559685A
Legal status: Pending (current)



Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F16/30 Information retrieval of unstructured textual data > G06F16/33 Querying > G06F16/3331 Query processing > G06F16/334 Query execution > G06F16/3344 Query execution using natural language analysis
    • G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/24 Classification techniques > G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/237 Lexical tools > G06F40/247 Thesauruses; Synonyms


Abstract

The invention relates to the technical field of product comment analysis and provides an automobile forum spam comment identification method comprising the following steps: S1, selecting samples and labeling them; S2, generating a strong classifier for each type of spam comment feature; and S3, forming a comment strong classifier based on the feature strong classifiers and identifying spam comments based on it. The method fully considers the influence of comment features such as subject words, advertising words, emotional words, forbidden words, similarity, comment repetition count and special symbols on the classification result, which improves the accuracy of spam comment identification.

Description

Automobile forum spam comment identification method
Technical Field
The invention relates to the technical field of product comment analysis and provides a method for identifying spam comments in automobile forums.
Background
At present, mainstream automobile portals and blog forums allow users to publish comments on a product or topic. As a result, false or purposeful rumor and advertisement information unrelated to the product or topic also appears; such content is called a spam comment. A spam comment is an untruthful comment that exaggerates or disparages a product, or a comment whose object is unrelated to the product itself (for example the brand, the merchant or other products), as well as advertisements, user questions and discussions of everyday matters. These spam comments reduce the accuracy of forum comment data and mislead both consumers and vehicle manufacturers.
Disclosure of Invention
The invention provides a method for identifying automobile forum spam comments, which aims to automatically filter the spam comments and improve the accuracy of forum comment data.
The invention is realized as follows: the method for identifying automobile forum spam comments comprises the following steps:
S1, selecting samples and labeling them;
S2, generating a strong classifier for each type of spam comment feature;
S3, forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
Further, the spam comment features include:
absence of subject words; presence of hyperlinks and advertising words; presence of forbidden words; a low frequency of emotional words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
Further, the method for forming a feature strong classifier specifically includes:
S21, defining the total number of iterations as Q, and initializing the sample weights;
S22, in the current (n+1)-th iteration, obtaining the weak classifier hq with the lowest error rate for feature rck;
S23, updating the weight of the weak classifier hq; after the error weights are normalized, the iteration counter is increased by 1 and S22 is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
Further, the weight of the weak classifier hq under feature rck is calculated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
Further, the error weight of a sample is calculated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
Further, after the error weights of the samples are updated, if the current iteration number n+1 is greater than a set number and there exists a sample whose error weight w(n+1)i is larger than a set threshold, the error weight of that sample is corrected with the following formula [formula not reproduced]: w*(n+1)i is the corrected error weight of sample i in the (n+1)-th iteration, w(n+1)i is the error weight of sample i in the (n+1)-th iteration before correction, and vm is the number of times sample i has been misclassified.
Further, the similarity between a comment and the standard product description is calculated with the following formula [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words. The comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value. In the basic similarity formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j.
Further, the error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
The method fully considers the influence of comment features such as subject words, advertising words, emotional words, forbidden words, similarity, comment repetition count and special symbols on the classification result, which improves the accuracy of spam comment identification.
Drawings
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided by the embodiment of the present invention.
Detailed Description
Preferred embodiments of the invention are described below in further detail with reference to the accompanying drawings.
Before the product comment features are extracted, the comments are preprocessed. The ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences is used to segment the standard product description and the comments into words, stop words irrelevant to the comment content are removed, and the remaining valid keywords are analyzed and processed.
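As a minimal illustration of this preprocessing step, the sketch below segments a comment and removes stop words. It uses the open-source jieba segmenter as a stand-in for ICTCLAS, and the stop-word list and example text are placeholders rather than material from the filing.

    # Minimal preprocessing sketch (assumption: jieba stands in for ICTCLAS,
    # which is the segmenter the filing names; the stop-word list is illustrative).
    import jieba

    STOP_WORDS = {"的", "了", "是", "我", "也", "就", "都", "和"}  # placeholder list

    def extract_keywords(text: str) -> list[str]:
        """Segment a comment and drop stop words, keeping the valid keywords."""
        tokens = jieba.lcut(text)
        return [t for t in tokens if t.strip() and t not in STOP_WORDS]

    if __name__ == "__main__":
        comment = "这款车的油耗很低，动力也不错"
        print(extract_keywords(comment))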
Product comment features are important indicators for screening comment validity: they should identify spam comments effectively, be representative, and cover all comments as far as possible. A product comment is described by the following features, whose values are extracted:
(1) Subject words
A product subject word is a core word that describes the product and is also a core word of product comments; it is usually a core noun related to the product. The ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences is used to extract the keywords of the standard product description and the core words of the comment to be evaluated, and the two are compared one by one. If the comment contains no subject word from the standard product description, its feature value is 0 and the comment is considered a spam comment; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
(2) Hyperlinks and advertising words
Product spam comments often contain hyperlinks and advertising words, which are typical expressions of advertising information, mostly product promotion, shop or website recommendation and company publicity.
Regarding hyperlinks, a URL generally contains a run of consecutive Latin characters such as "http://". If a comment contains a hyperlink, it is considered a possible spam comment: its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
Regarding advertising words, a dictionary of common advertising words (such as "QQ", "special price", "hot sale", "Taobao", "free shipping" and the like) is built by analyzing and summarizing current popular comments. Since advertising words often come with information such as product prices and QQ numbers, which usually appear as digits, a comment containing runs of digits mixed with Chinese characters is also considered to contain advertising words. Likewise, if the comment contains an advertising word, its feature value is 0 and the comment is preferentially excluded as spam; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
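A minimal sketch of these two rule-based checks follows; the regular expressions, the advertising-word dictionary and the digit-run heuristic are illustrative assumptions, not values taken from the filing.

    # Hyperlink and advertising-word feature (rc2): 0 = suspected spam, 1 = keep.
    # The dictionary and thresholds below are illustrative assumptions.
    import re

    AD_WORDS = {"QQ", "特价", "淘宝", "包邮"}          # placeholder advertising-word dictionary
    URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

    def hyperlink_ad_feature(comment: str) -> int:
        if URL_RE.search(comment):
            return 0                                   # contains a hyperlink
        if any(word in comment for word in AD_WORDS):
            return 0                                   # contains an advertising word
        digit_runs = re.findall(r"\d{5,}", comment)    # long digit runs: prices, QQ numbers
        if digit_runs and re.search(r"[\u4e00-\u9fff]", comment):
            return 0                                   # digits mixed with Chinese characters
        return 1                                       # temporarily keep as a valid comment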
(3) Forbidden words
Forbidden words are words used for malicious attacks, such as "tm", "sb", "rotten goods" and the like; a forbidden word dictionary is likewise established. The keywords of each comment are scanned, and if a forbidden word is found the comment is considered a spam comment: its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
(4) Emotional words
A product comment is an evaluation and discussion of a product's parameters and of the purchasing experience; through comments, people truly express their subjective opinions, attitudes, feelings and emotions. A comment therefore necessarily contains the reviewer's sentiment, and the fewer emotional words it contains, the more likely it is a spam comment. A product comment emotion word dictionary is likewise built through statistical analysis, and the frequency of emotional words in the comment is used as this feature value.
(5) Degree of similarity
Identifying spam comments is inseparable from measuring text similarity. Similarity refers to how much two objects have in common and is a common feature for judging whether a comment is spam. Cosine similarity is used here to measure the similarity between a comment and the standard product description [formula not reproduced]. Wis denotes the weight of the valid keyword i in the standard product description and Wic denotes the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
The weights are calculated as Wis = 1 + log(n × a + 1), where n is the number of times keyword i appears in the product description and a is a weight adjustment parameter that can be adjusted automatically: the program increases or decreases the weight coefficient depending on whether the staff are satisfied with the current weights.
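The sketch below computes a weighted cosine similarity between a comment and the standard description using these weights. Since the filing shows the cosine formula only as an image, the standard cosine form is assumed, and the parameter a is fixed to an illustrative value.

    # Weighted cosine similarity between comment keywords and the standard
    # product description (assumption: the standard cosine form; the filing's
    # formula image is not reproduced). `a` is an illustrative adjustment value.
    import math
    from collections import Counter

    def keyword_weights(tokens: list[str], a: float = 1.0) -> dict[str, float]:
        """W_i = 1 + log(n * a + 1), where n is the count of keyword i."""
        counts = Counter(tokens)
        return {w: 1.0 + math.log(n * a + 1.0) for w, n in counts.items()}

    def cosine_similarity(desc_tokens: list[str], comment_tokens: list[str]) -> float:
        ws = keyword_weights(desc_tokens)          # W_is: weights in the description
        wc = keyword_weights(comment_tokens)       # W_ic: weights in the comment
        shared = set(ws) & set(wc)
        dot = sum(ws[k] * wc[k] for k in shared)
        norm = math.sqrt(sum(v * v for v in ws.values())) * \
               math.sqrt(sum(v * v for v in wc.values()))
        return dot / norm if norm else 0.0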
(6) Number of review repetitions
Analysis of a large number of products shows that there is a class of spam comments that look like normal comments individually, but when the data is viewed as a whole, the same reviewer, or different reviewers, are found to have posted a large number of identical or similar comments on the same question; these are called repeated comments. Such comments may be spam posted by abnormal users to attack competitors or to boost a marketer's reputation, and the larger the repetition count, the more likely the comment is spam. To simplify the calculation, only identical comments longer than a certain length are counted, and repetition is judged by whether the keywords are the same. If the proportion of repeated keywords exceeds 80%, the comment is considered a spam comment, its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
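Below is a small sketch of this repetition check. Treating two comments as repeats when their keyword overlap exceeds 80% follows the description above, while the minimum-keyword cut-off and the allowed number of near-duplicates are illustrative assumptions.

    # Comment repetition feature (rc6): mark a comment as spam (0) when too many
    # other comments share more than 80% of its keywords. MIN_KEYWORDS and
    # MAX_REPEATS are illustrative assumptions.
    MIN_KEYWORDS = 10     # only compare comments with at least this many keywords
    MAX_REPEATS = 1       # allowed number of near-duplicates

    def keyword_overlap(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def repetition_feature(comment_kw: set[str], all_comments_kw: list[set[str]]) -> int:
        if len(comment_kw) < MIN_KEYWORDS:
            return 1
        repeats = sum(
            1 for other in all_comments_kw
            if other is not comment_kw and keyword_overlap(comment_kw, other) > 0.8
        )
        return 0 if repeats > MAX_REPEATS else 1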
(7) Special symbols
Punctuation in normal comments is generally regular: only a few commas, pause marks, semicolons, periods or exclamation marks appear, and at most the occasional reviewer writes a short run of question marks or exclamation marks to express a strong feeling, while symbols such as "#" and "&" are normally not used. To bypass existing filtering mechanisms, spam comments often insert strings of special symbols between obviously spam-like keywords to mask their spam characteristics, for example vertical bars inserted into an advertisement for a 4S store, or emoticon strings such as "^0^". If special symbols are found, the comment is considered a spam comment, its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
The comment set of a certain product is represented as follows:
comments = {C1((rc1, wc1), (rc2, wc2), ..., (rc7, wc7)), ..., Cn((rc1, wc1), (rc2, wc2), ..., (rc7, wc7))}
where C1, C2, ..., Cn are the different comments in the product comment set, rc1 ... rc7 are the feature values of the corresponding comment, and wc1 ... wc7 are the weights of those feature values. The features are: subject word feature rc1, hyperlink and advertising word feature rc2, forbidden word feature rc3, emotional word feature rc4, similarity feature rc5, comment repetition feature rc6, and special symbol feature rc7. Their values are:
rc1, rc2, rc3, rc6, rc7 ∈ {0, 1}
rc4 = number of emotional words / total number of keywords
rc5 = Similarity'(s, cj)
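As a sketch of how the seven feature values might be assembled per comment, the snippet below wires the checks together. It reuses extract_keywords(), hyperlink_ad_feature(), cosine_similarity() and repetition_feature() from the earlier sketches, and the remaining dictionaries and patterns are illustrative placeholders, not material from the filing.

    # Assemble the 7-dimensional feature vector (rc1..rc7) for one comment.
    # FORBIDDEN, EMOTION and SPECIAL are illustrative placeholders.
    import re
    from dataclasses import dataclass

    FORBIDDEN = {"tm", "sb"}                  # illustrative forbidden-word dictionary
    EMOTION = {"好", "差", "喜欢", "失望"}      # illustrative emotion-word dictionary
    SPECIAL = re.compile(r"[#&|^*~]{2,}")     # illustrative special-symbol pattern

    @dataclass
    class CommentFeatures:
        rc1: int    # subject word present (1) / absent (0)
        rc2: int    # no hyperlink or advertising word (1) / present (0)
        rc3: int    # no forbidden word (1) / present (0)
        rc4: float  # emotional word count / total keyword count
        rc5: float  # similarity to the standard product description
        rc6: int    # not a repeated comment (1) / repeated (0)
        rc7: int    # no special symbols (1) / present (0)

    def extract_features(comment, desc_tokens, all_comments_kw):
        tokens = extract_keywords(comment)                         # earlier sketch
        return CommentFeatures(
            rc1=1 if set(tokens) & set(desc_tokens) else 0,        # description keywords stand in for subject words
            rc2=hyperlink_ad_feature(comment),                     # earlier sketch
            rc3=0 if set(tokens) & FORBIDDEN else 1,
            rc4=sum(t in EMOTION for t in tokens) / max(len(tokens), 1),
            rc5=cosine_similarity(desc_tokens, tokens),            # earlier sketch
            rc6=repetition_feature(set(tokens), all_comments_kw),  # earlier sketch
            rc7=0 if SPECIAL.search(comment) else 1,
        )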
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided in an embodiment of the present invention, and the method specifically includes the following steps:
(1) Selecting samples
The invention is based on machine learning and trains strong classifiers for product comments with the AdaBoost algorithm. w samples (C1, y1), (C2, y2), ..., (Cw, yw), ym ∈ {0, 1}, are selected from a product comment set, where Cw is the w-th comment sample in the set and ym is the label value of the sample: 1 denotes a normal comment (positive sample) and 0 denotes a spam comment (negative sample).
(2) Obtaining a strong classifier for each spam comment feature rck; these are called feature strong classifiers. Since there are 7 types of spam comment features, k takes values 1 to 7, and the 7 feature strong classifiers are formed in the same way. In the embodiment of the present invention, the strong classifier for feature rck is obtained as follows:
21) Initializing the sample weights
The total number of iterations is defined as Q and the sample weights are initialized; assuming there are a positive samples, the initial error weight of each sample is 1/a.
22) For each feature, a number of weak classifiers are constructed; a weak classifier has relatively low classification precision for the corresponding feature, while the strong classifier built from them has relatively high precision. In the current (n+1)-th iteration, the weak classifier hq with the lowest error rate for feature rck is obtained.
In the embodiment of the present invention, the error rate of the q-th weak classifier under feature rck is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
23) Updating the weight of the weak classifier hq and the error weights of the samples; after the error weights are normalized, the iteration counter is increased by 1 and step 22) is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
During the iterations, if the error rates of the trained weak classifiers are all larger than 0.6 or equal to 0, the current weak classifier is deleted and the current comments are not reused.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight. As the iterations proceed, the weight of the weak classifier hq is updated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
In the embodiment of the present invention, the error weight of a sample is updated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
In the embodiment of the present invention, the strong classifier for each feature is expressed as follows [formula not reproduced]: hq(si) denotes the classification result of the weak classifier, where 0 means the class output by the classifier differs from the labeled class of the sample and 1 means it is the same, and h denotes the classifier.
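The filing presents its AdaBoost formulas only as images, so the sketch below falls back on the standard discrete AdaBoost expressions: the weighted error, the classifier weight mu = 0.5 * ln((1 - eps) / eps), and an exponential sample-weight update. It is a minimal illustration of the per-feature training loop described above, with decision stumps over a single feature value standing in for the weak classifiers; it is not the filing's exact method.

    # Minimal per-feature AdaBoost sketch. The formulas are the standard discrete
    # AdaBoost ones (an assumption: the filing's own formulas appear only as images).
    import math
    from typing import Callable, List, Tuple

    WeakClf = Callable[[float], int]   # maps a feature value to 0 (spam) or 1 (normal)

    def make_stumps(thresholds: List[float]) -> List[WeakClf]:
        """Decision stumps over a single feature value, used as weak classifiers."""
        return [lambda v, t=t: 1 if v >= t else 0 for t in thresholds]

    def train_feature_strong_classifier(
        values: List[float], labels: List[int], stumps: List[WeakClf], Q: int
    ) -> List[Tuple[float, WeakClf]]:
        m = len(values)
        w = [1.0 / m] * m                                  # initial sample error weights
        picked: List[Tuple[float, WeakClf]] = []
        for _ in range(Q):
            # weighted error of each candidate weak classifier
            errors = [sum(wi for wi, v, y in zip(w, values, labels) if h(v) != y)
                      for h in stumps]
            q = min(range(len(stumps)), key=lambda j: errors[j])
            eps = errors[q]
            if eps >= 0.6 or eps == 0:                     # discard degenerate rounds
                continue
            mu = 0.5 * math.log((1 - eps) / eps)           # classifier weight (assumed form)
            picked.append((mu, stumps[q]))
            # exponential sample-weight update and normalization (assumed form)
            w = [wi * math.exp(-mu if stumps[q](v) == y else mu)
                 for wi, v, y in zip(w, values, labels)]
            s = sum(w)
            w = [wi / s for wi in w]
        return picked

    def strong_predict(classifier: List[Tuple[float, WeakClf]], value: float) -> int:
        score = sum(mu * (1 if h(value) == 1 else -1) for mu, h in classifier)
        return 1 if score >= 0 else 0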
(3) Forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
In the embodiment of the present invention, the comment strong classifier is obtained as follows:
31) Defining the total number of iterations as Q; assuming there are a positive samples, the initial error weight of each sample is 1/a.
32) In the current (n+1)-th iteration, a group of weak classifiers h(s) is constructed based on the feature strong classifiers, and the weak classifier hc-min with the lowest error rate is obtained. The error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: hq(s) denotes the q-th weak classifier, εq denotes the error rate of the q-th weak classifier hq(s), m denotes the number of training samples in the product comment set, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
33) Updating the weight of the weak classifier hc-min and the error weights of the samples; after the error weights are normalized, the iteration counter is increased by 1 and step 32) is executed again until Q iterations have been performed, after which the comment strong classifier is output.
During the iterations, if the error rates of the trained weak classifiers are all larger than 0.6 or equal to 0, the current weak classifier is deleted and the current comments are not reused.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight. As the iterations proceed, the weight of the weak classifier hc-min is updated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hc-min, and εq denotes the error rate of the weak classifier hc-min.
In the embodiment of the present invention, the error weight of a sample is updated as follows [formula not reproduced]: hq(s) denotes the q-th weak classifier, w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier hq(s), and yi is the label value of the positive or negative sample.
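Under the same standard-AdaBoost assumption, a second boosting stage can be run over the outputs of the seven feature strong classifiers to obtain the comment strong classifier. The sketch below is again illustrative only; the function and variable names are not taken from the filing.

    # Second stage (sketch): combine the 7 feature strong classifiers into a
    # comment strong classifier with the same standard AdaBoost update (an
    # assumption; the filing's formulas appear only as images).
    import math

    def train_comment_strong_classifier(outputs, labels, Q):
        """
        outputs: m rows, each row holding the 7 binary outputs
                 (one per feature strong classifier) for one comment.
        labels:  m labels, 1 = normal comment, 0 = spam comment.
        Returns a list of (weight, feature_index) pairs.
        """
        m, k = len(outputs), len(outputs[0])
        w = [1.0 / m] * m
        picked = []
        for _ in range(Q):
            errors = [sum(wi for wi, row, y in zip(w, outputs, labels) if row[j] != y)
                      for j in range(k)]
            j = min(range(k), key=lambda idx: errors[idx])
            eps = errors[j]
            if eps >= 0.6 or eps == 0:
                continue
            mu = 0.5 * math.log((1 - eps) / eps)
            picked.append((mu, j))
            w = [wi * math.exp(-mu if row[j] == y else mu)
                 for wi, row, y in zip(w, outputs, labels)]
            s = sum(w)
            w = [wi / s for wi in w]
        return picked

    def classify_comment(model, row):
        score = sum(mu * (1 if row[j] == 1 else -1) for mu, j in model)
        return 1 if score >= 0 else 0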
In the embodiment of the invention, the AdaBoost algorithm works by changing the weight distribution of the samples: a different training set is formed in each round by updating the weight of every sample, so the training set of the next classifier is the data set re-weighted by the previous classifier. In each training round the weight of every sample is updated according to whether it was classified correctly and to the classification error rate of the previous round's weak classifier: the weights of correctly classified samples are decreased and the weights of misclassified samples are increased, so that misclassified samples are emphasized. However, if the weight update is not limited, some extreme samples, or samples that are inherently hard to classify, keep being misclassified as the iterations continue, and the weights of these emphasized "hard" samples grow exponentially at every update. The invention therefore limits the sample weights by adding the following correction after the cyclic update.
in the embodiment of the invention, after the error weight of the sample is updated every time, whether the following conditions exist is judged: the current iteration number n +1 is larger than the set number value,and there is an error weight w of the samplen+1,iIf the error weight is larger than the set threshold, the error weight of the sample is corrected by adopting the following formula, and the corrected error weight is wn+1,i
Figure BDA0002830490840000111
vmRepresenting the number of times a sample i is classified as erroneous, adding a logarithm such that the impact of the number of errors is reduced, log3 > 1, so when the number of errors v isiThe sample weight is slowly reduced when the sample weight is more than 3, so that the exponential increase of the sample weight is effectively inhibited.
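The correction formula itself appears only as an image in the filing; the snippet below implements one plausible reading, dividing the weight by the base-3 logarithm of the error count once that count exceeds 3. Both the trigger conditions and the exact form of the correction are assumptions.

    # One plausible reading of the sample-weight correction (assumption: the
    # filing's formula image is not reproduced). Weights of samples that keep
    # being misclassified are damped by log base 3 of their error count.
    import math

    def correct_weight(w_next: float, error_count: int,
                       iteration: int, min_iteration: int = 5,
                       threshold: float = 0.2) -> float:
        if iteration > min_iteration and w_next > threshold and error_count > 3:
            return w_next / math.log(error_count, 3)   # log3(v) > 1 when v > 3
        return w_next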
Similarity is one of the most important feature values of product comments, but existing similarity calculations cannot effectively detect synonyms, so two words with similar meanings are treated as completely different words and misjudgments result. This description therefore adds semantic information between words to the similarity comparison, such as synonym information, word-form similarity and position information. The improved similarity formula is as follows [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words. The comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value. In the basic formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
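Because the improved formula is likewise shown only as an image, the sketch below illustrates just the core idea of folding synonym information into the weighted cosine similarity: synonymous keywords are mapped to a shared canonical phrase before weights are computed. The synonym table and merging strategy are illustrative assumptions, and the sketch reuses cosine_similarity() from the earlier sketch.

    # Synonym-aware variant of the weighted cosine similarity (assumption: an
    # illustration of the idea only; the filing's improved formula is an image).
    SYNONYMS = {"省油": "油耗低", "费油": "油耗高"}   # placeholder synonym table

    def merge_synonyms(tokens: list[str]) -> list[str]:
        """Map each keyword to a canonical phrase so synonyms share one weight."""
        return [SYNONYMS.get(t, t) for t in tokens]

    def improved_similarity(desc_tokens: list[str], comment_tokens: list[str]) -> float:
        # Reuses cosine_similarity() from the earlier sketch.
        return cosine_similarity(merge_synonyms(desc_tokens), merge_synonyms(comment_tokens))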
It is clear that the specific implementation of the invention is not restricted to the above-described embodiments, but that various insubstantial modifications of the inventive process concept and technical solutions are within the scope of protection of the invention.

Claims (8)

1. A method for identifying automobile forum spam comments, characterized by comprising the following steps:
S1, selecting samples and labeling them;
S2, generating a strong classifier for each type of spam comment feature;
S3, forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
2. The automobile forum spam comment identification method as recited in claim 1, wherein the spam comment features include:
absence of subject words; presence of hyperlinks and advertising words; presence of forbidden words; a low frequency of emotional words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
3. The automobile forum spam comment identification method as claimed in claim 2, wherein the feature strong classifier is formed as follows:
S21, defining the total number of iterations as Q, and initializing the sample weights;
S22, in the current (n+1)-th iteration, obtaining the weak classifier hq with the lowest error rate for feature rck;
S23, updating the weight of the weak classifier hq; after the error weights are normalized, the iteration counter is increased by 1 and S22 is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
4. The automobile forum spam comment identification method as claimed in claim 3, wherein the weight of the weak classifier hq is calculated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
5. The automobile forum spam comment identification method as claimed in claim 4, wherein the error weight of a sample is calculated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
6. The automobile forum spam comment identification method as claimed in claim 3, wherein after the error weights of the samples are updated, if the current iteration number n+1 is greater than a set number and there exists a sample whose error weight w(n+1)i is larger than a set threshold, the error weight of that sample is corrected with the following formula [formula not reproduced]: w*(n+1)i is the corrected error weight of sample i in the (n+1)-th iteration, w(n+1)i is the error weight of sample i in the (n+1)-th iteration before correction, and vm is the number of times sample i has been misclassified.
7. The automobile forum spam comment identification method as claimed in claim 2, wherein the similarity between a comment and the standard product description is calculated with the following formula [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words; the comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value; in the basic formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j.
8. The automobile forum spam comment identification method as claimed in claim 3, wherein the error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
CN202011458869.2A 2020-12-11 2020-12-11 Automobile forum spam comment identification method Pending CN112559685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011458869.2A CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011458869.2A CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Publications (1)

Publication Number Publication Date
CN112559685A true CN112559685A (en) 2021-03-26

Family

ID=75062290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011458869.2A Pending CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Country Status (1)

Country Link
CN (1) CN112559685A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
CN106708868A (en) * 2015-11-16 2017-05-24 中国移动通信集团北京有限公司 Method and system for analyzing internet data
CN106844349A (en) * 2017-02-14 2017-06-13 广西师范大学 Comment spam recognition methods based on coorinated training
CN108153733A (en) * 2017-12-26 2018-06-12 北京小度信息科技有限公司 Comment on the sorting technique and device of quality
CN111582350A (en) * 2020-04-30 2020-08-25 上海电力大学 Filtering factor optimization AdaBoost method and system based on distance weighted LSSVM

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
SHUQIONG WU: "Parameterized AdaBoost: Introducing a Parameter to Speed Up the Training of Real AdaBoost", IEEE Signal Processing Letters *
REN KEQIANG et al.: "AdaBoost face detection algorithm based on fusion optimization of AFSA and PSO", Journal of Chinese Computer Systems *
LI ZHIXIN et al.: "Co-Training-based method for identifying spam comments on microblogs", Computer Engineering *
WANG JIAN: "Fault feature extraction methods for imbalanced samples", 29 February 2016 *
SHI LEI: "Research on e-commerce spam comment identification based on imbalanced data processing", China Master's Theses Full-text Database *
DENG BINGNA et al.: "A spam comment identification method applied to blogs", Journal of Zhengzhou University (Natural Science Edition) *
HUANG LING et al.: "AdaBoost-based method for identifying spam comments on microblogs", Journal of Computer Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023016267A1 (en) * 2021-08-12 2023-02-16 北京锐安科技有限公司 Spam comment identification method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN109325165B (en) Network public opinion analysis method, device and storage medium
CN111198995B (en) Malicious webpage identification method
TWI424325B (en) Systems and methods for organizing collective social intelligence information using an organic object data model
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN103309862B (en) Webpage type recognition method and system
CN109902179A (en) The method of screening electric business comment spam based on natural language processing
CN106506327B (en) Junk mail identification method and device
CN101477544A (en) Rubbish text recognition method and system
Mohanty et al. Resumate: A prototype to enhance recruitment process with NLP based resume parsing
CN109522412A (en) Text emotion analysis method, device and medium
CN108388554A (en) Text emotion identifying system based on collaborative filtering attention mechanism
CN110955750A (en) Combined identification method and device for comment area and emotion polarity, and electronic equipment
Ghosal et al. Sentiment analysis on (Bengali horoscope) corpus
Marasović et al. Multilingual modal sense classification using a convolutional neural network
CN109933648A (en) A kind of differentiating method and discriminating device of real user comment
CN110569495A (en) Emotional tendency classification method and device based on user comments and storage medium
Arya et al. News web page classification using url content and structure attributes
Sakai et al. Cause information extraction from financial articles concerning business performance
CN112463966B (en) False comment detection model training method, false comment detection model training method and false comment detection model training device
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN111523311B (en) Search intention recognition method and device
CN112559685A (en) Automobile forum spam comment identification method
Putra et al. Sentiment analysis on marketplace review using hybrid lexicon and svm method
CN112100385A (en) Single label text classification method, computing device and computer readable storage medium
Kae et al. Categorization of display ads using image and landing page features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210326