CN112559685A - Automobile forum spam comment identification method - Google Patents


Info

Publication number
CN112559685A
Authority
CN
China
Prior art keywords
comment
weight
sample
spam
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011458869.2A
Other languages
Chinese (zh)
Inventor
王磊
赛影辉
王志超
肖飞
韦圣兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhu Automotive Prospective Technology Research Institute Co ltd
Chery Automobile Co Ltd
Original Assignee
Wuhu Automotive Prospective Technology Research Institute Co ltd
Chery Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhu Automotive Prospective Technology Research Institute Co ltd, Chery Automobile Co Ltd filed Critical Wuhu Automotive Prospective Technology Research Institute Co ltd
Priority to CN202011458869.2A
Publication of CN112559685A
Legal status: Pending (current)



Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F16/30 Information retrieval of unstructured textual data > G06F16/33 Querying > G06F16/3331 Query processing > G06F16/334 Query execution > G06F16/3344 Query execution using natural language analysis
    • G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/24 Classification techniques > G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/237 Lexical tools > G06F40/247 Thesauruses; Synonyms


Abstract

The invention relates to the technical field of product comment analysis and provides an automobile forum spam comment identification method comprising the following steps: S1, selecting samples and labeling them; S2, generating a strong classifier for each type of spam comment feature; and S3, forming a comment strong classifier based on the feature strong classifiers and identifying spam comments based on it. The method fully considers the influence of comment features such as subject words, advertising words, emotional words, forbidden words, similarity, comment repetition count and special symbols on the classification result, which improves the accuracy of spam comment identification.

Description

Automobile forum spam comment identification method
Technical Field
The invention relates to the technical field of product comment analysis and provides a method for identifying spam comments in automobile forums.
Background
At present, mainstream automobile portals and blog forums allow users to publish comments on a product or topic. As a result, false or purposeful rumor and advertisement information unrelated to the product or topic also appears; such content is called a spam comment. A spam comment is an untruthful comment that exaggerates or disparages a product, or a comment whose object is unrelated to the product itself (for example the brand, the merchant or other products), as well as advertisements, user questions and discussions of everyday matters. These spam comments reduce the accuracy of forum comment data and mislead both consumers and vehicle manufacturers.
Disclosure of Invention
The invention provides a method for identifying automobile forum spam comments, which aims to automatically filter the spam comments and improve the accuracy of forum comment data.
The invention is realized as follows: the method for identifying automobile forum spam comments comprises the following steps:
S1, selecting samples and labeling them;
S2, generating a strong classifier for each type of spam comment feature;
S3, forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
Further, the spam comment features include:
absence of subject words; presence of hyperlinks and advertising words; presence of forbidden words; a low frequency of emotional words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
Further, the method for forming a feature strong classifier specifically includes:
S21, defining the total number of iterations as Q, and initializing the sample weights;
S22, in the current (n+1)-th iteration, obtaining the weak classifier hq with the lowest error rate for feature rck;
S23, updating the weight of the weak classifier hq; after the error weights are normalized, the iteration counter is increased by 1 and S22 is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
Further, the weight of the weak classifier hq under feature rck is calculated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
Further, the error weight of a sample is calculated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
Further, after the error weights of the samples are updated, if the current iteration number n+1 is greater than a set number and there exists a sample whose error weight w(n+1)i is larger than a set threshold, the error weight of that sample is corrected with the following formula [formula not reproduced]: w*(n+1)i is the corrected error weight of sample i in the (n+1)-th iteration, w(n+1)i is the error weight of sample i in the (n+1)-th iteration before correction, and vm is the number of times sample i has been misclassified.
Further, the similarity between a comment and the standard product description is calculated with the following formula [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words. The comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value. In the basic similarity formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j.
Further, the error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
The method fully considers the influence of comment features such as subject words, advertising words, emotional words, forbidden words, similarity, comment repetition count and special symbols on the classification result, which improves the accuracy of spam comment identification.
Drawings
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided by the embodiment of the present invention.
Detailed Description
Preferred embodiments of the invention are described below in further detail with reference to the accompanying drawings.
Before the product comment features are extracted, the comments are preprocessed. The ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences is used to segment the standard product description and the comments into words, stop words irrelevant to the comment content are removed, and the remaining valid keywords are analyzed and processed.
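As a minimal illustration of this preprocessing step, the sketch below segments a comment and removes stop words. It uses the open-source jieba segmenter as a stand-in for ICTCLAS, and the stop-word list and example text are placeholders rather than material from the filing.

    # Minimal preprocessing sketch (assumption: jieba stands in for ICTCLAS,
    # which is the segmenter the filing names; the stop-word list is illustrative).
    import jieba

    STOP_WORDS = {"的", "了", "是", "我", "也", "就", "都", "和"}  # placeholder list

    def extract_keywords(text: str) -> list[str]:
        """Segment a comment and drop stop words, keeping the valid keywords."""
        tokens = jieba.lcut(text)
        return [t for t in tokens if t.strip() and t not in STOP_WORDS]

    if __name__ == "__main__":
        comment = "这款车的油耗很低，动力也不错"
        print(extract_keywords(comment))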
Product comment features are important indicators for screening comment validity: they should identify spam comments effectively, be representative, and cover all comments as far as possible. A product comment is described by the following features, whose values are extracted:
(1) Subject words
A product subject word is a core word that describes the product and is also a core word of product comments; it is usually a core noun related to the product. The ICTCLAS word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences is used to extract the keywords of the standard product description and the core words of the comment to be evaluated, and the two are compared one by one. If the comment contains no subject word from the standard product description, its feature value is 0 and the comment is considered a spam comment; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
(2) Hyperlinks and advertising words
Product spam comments often contain hyperlinks and advertising words, which are typical expressions of advertising information, mostly product promotion, shop or website recommendation and company publicity.
Regarding hyperlinks, a URL generally contains a run of consecutive Latin characters such as "http://". If a comment contains a hyperlink, it is considered a possible spam comment: its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
Regarding advertising words, a dictionary of common advertising words (such as "QQ", "special price", "hot sale", "Taobao", "free shipping" and the like) is built by analyzing and summarizing current popular comments. Since advertising words often come with information such as product prices and QQ numbers, which usually appear as digits, a comment containing runs of digits mixed with Chinese characters is also considered to contain advertising words. Likewise, if the comment contains an advertising word, its feature value is 0 and the comment is preferentially excluded as spam; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
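A minimal sketch of these two rule-based checks follows; the regular expressions, the advertising-word dictionary and the digit-run heuristic are illustrative assumptions, not values taken from the filing.

    # Hyperlink and advertising-word feature (rc2): 0 = suspected spam, 1 = keep.
    # The dictionary and thresholds below are illustrative assumptions.
    import re

    AD_WORDS = {"QQ", "特价", "淘宝", "包邮"}          # placeholder advertising-word dictionary
    URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

    def hyperlink_ad_feature(comment: str) -> int:
        if URL_RE.search(comment):
            return 0                                   # contains a hyperlink
        if any(word in comment for word in AD_WORDS):
            return 0                                   # contains an advertising word
        digit_runs = re.findall(r"\d{5,}", comment)    # long digit runs: prices, QQ numbers
        if digit_runs and re.search(r"[\u4e00-\u9fff]", comment):
            return 0                                   # digits mixed with Chinese characters
        return 1                                       # temporarily keep as a valid comment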
(3) Forbidden words
Forbidden words are words used for malicious attacks, such as "tm", "sb", "rotten goods" and the like; a forbidden word dictionary is likewise established. The keywords of each comment are scanned, and if a forbidden word is found the comment is considered a spam comment: its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
(4) Emotional words
A product comment is an evaluation and discussion of a product's parameters and of the purchasing experience; through comments, people truly express their subjective opinions, attitudes, feelings and emotions. A comment therefore necessarily contains the reviewer's sentiment, and the fewer emotional words it contains, the more likely it is a spam comment. A product comment emotion word dictionary is likewise built through statistical analysis, and the frequency of emotional words in the comment is used as this feature value.
(5) Degree of similarity
Identifying spam comments is inseparable from measuring text similarity. Similarity refers to how much two objects have in common and is a common feature for judging whether a comment is spam. Cosine similarity is used here to measure the similarity between a comment and the standard product description [formula not reproduced]. Wis denotes the weight of the valid keyword i in the standard product description and Wic denotes the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
The weights are calculated as Wis = 1 + log(n × a + 1), where n is the number of times keyword i appears in the product description and a is a weight adjustment parameter that can be adjusted automatically: the program increases or decreases the weight coefficient depending on whether the staff are satisfied with the current weights.
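The sketch below computes a weighted cosine similarity between a comment and the standard description using these weights. Since the filing shows the cosine formula only as an image, the standard cosine form is assumed, and the parameter a is fixed to an illustrative value.

    # Weighted cosine similarity between comment keywords and the standard
    # product description (assumption: the standard cosine form; the filing's
    # formula image is not reproduced). `a` is an illustrative adjustment value.
    import math
    from collections import Counter

    def keyword_weights(tokens: list[str], a: float = 1.0) -> dict[str, float]:
        """W_i = 1 + log(n * a + 1), where n is the count of keyword i."""
        counts = Counter(tokens)
        return {w: 1.0 + math.log(n * a + 1.0) for w, n in counts.items()}

    def cosine_similarity(desc_tokens: list[str], comment_tokens: list[str]) -> float:
        ws = keyword_weights(desc_tokens)          # W_is: weights in the description
        wc = keyword_weights(comment_tokens)       # W_ic: weights in the comment
        shared = set(ws) & set(wc)
        dot = sum(ws[k] * wc[k] for k in shared)
        norm = math.sqrt(sum(v * v for v in ws.values())) * \
               math.sqrt(sum(v * v for v in wc.values()))
        return dot / norm if norm else 0.0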
(6) Number of review repetitions
Analysis of a large number of products shows that there is a class of spam comments that look like normal comments individually, but when the data is viewed as a whole, the same reviewer, or different reviewers, are found to have posted a large number of identical or similar comments on the same question; these are called repeated comments. Such comments may be spam posted by abnormal users to attack competitors or to boost a marketer's reputation, and the larger the repetition count, the more likely the comment is spam. To simplify the calculation, only identical comments longer than a certain length are counted, and repetition is judged by whether the keywords are the same. If the proportion of repeated keywords exceeds 80%, the comment is considered a spam comment, its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
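Below is a small sketch of this repetition check. Treating two comments as repeats when their keyword overlap exceeds 80% follows the description above, while the minimum-keyword cut-off and the allowed number of near-duplicates are illustrative assumptions.

    # Comment repetition feature (rc6): mark a comment as spam (0) when too many
    # other comments share more than 80% of its keywords. MIN_KEYWORDS and
    # MAX_REPEATS are illustrative assumptions.
    MIN_KEYWORDS = 10     # only compare comments with at least this many keywords
    MAX_REPEATS = 1       # allowed number of near-duplicates

    def keyword_overlap(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def repetition_feature(comment_kw: set[str], all_comments_kw: list[set[str]]) -> int:
        if len(comment_kw) < MIN_KEYWORDS:
            return 1
        repeats = sum(
            1 for other in all_comments_kw
            if other is not comment_kw and keyword_overlap(comment_kw, other) > 0.8
        )
        return 0 if repeats > MAX_REPEATS else 1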
(7) Special symbols
Punctuation in normal comments is generally regular: only a few commas, pause marks, semicolons, periods or exclamation marks appear, and at most the occasional reviewer writes a short run of question marks or exclamation marks to express a strong feeling, while symbols such as "#" and "&" are normally not used. To bypass existing filtering mechanisms, spam comments often insert strings of special symbols between obviously spam-like keywords to mask their spam characteristics, for example vertical bars inserted into an advertisement for a 4S store, or emoticon strings such as "^0^". If special symbols are found, the comment is considered a spam comment, its feature value is 0 and it is preferentially excluded; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
The comment set of a certain product is represented as follows:
comments = {C1((rc1, wc1), (rc2, wc2), ..., (rc7, wc7)), ..., Cn((rc1, wc1), (rc2, wc2), ..., (rc7, wc7))}
where C1, C2, ..., Cn are the different comments in the product comment set, rc1 ... rc7 are the feature values of the corresponding comment, and wc1 ... wc7 are the weights of those feature values. The features are: subject word feature rc1, hyperlink and advertising word feature rc2, forbidden word feature rc3, emotional word feature rc4, similarity feature rc5, comment repetition feature rc6, and special symbol feature rc7. Their values are:
rc1, rc2, rc3, rc6, rc7 ∈ {0, 1}
rc4 = number of emotional words / total number of keywords
rc5 = Similarity'(s, cj)
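As a sketch of how the seven feature values might be assembled per comment, the snippet below wires the checks together. It reuses extract_keywords(), hyperlink_ad_feature(), cosine_similarity() and repetition_feature() from the earlier sketches, and the remaining dictionaries and patterns are illustrative placeholders, not material from the filing.

    # Assemble the 7-dimensional feature vector (rc1..rc7) for one comment.
    # FORBIDDEN, EMOTION and SPECIAL are illustrative placeholders.
    import re
    from dataclasses import dataclass

    FORBIDDEN = {"tm", "sb"}                  # illustrative forbidden-word dictionary
    EMOTION = {"好", "差", "喜欢", "失望"}      # illustrative emotion-word dictionary
    SPECIAL = re.compile(r"[#&|^*~]{2,}")     # illustrative special-symbol pattern

    @dataclass
    class CommentFeatures:
        rc1: int    # subject word present (1) / absent (0)
        rc2: int    # no hyperlink or advertising word (1) / present (0)
        rc3: int    # no forbidden word (1) / present (0)
        rc4: float  # emotional word count / total keyword count
        rc5: float  # similarity to the standard product description
        rc6: int    # not a repeated comment (1) / repeated (0)
        rc7: int    # no special symbols (1) / present (0)

    def extract_features(comment, desc_tokens, all_comments_kw):
        tokens = extract_keywords(comment)                         # earlier sketch
        return CommentFeatures(
            rc1=1 if set(tokens) & set(desc_tokens) else 0,        # description keywords stand in for subject words
            rc2=hyperlink_ad_feature(comment),                     # earlier sketch
            rc3=0 if set(tokens) & FORBIDDEN else 1,
            rc4=sum(t in EMOTION for t in tokens) / max(len(tokens), 1),
            rc5=cosine_similarity(desc_tokens, tokens),            # earlier sketch
            rc6=repetition_feature(set(tokens), all_comments_kw),  # earlier sketch
            rc7=0 if SPECIAL.search(comment) else 1,
        )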
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided in an embodiment of the present invention, and the method specifically includes the following steps:
(1) Selecting samples
The invention is based on machine learning and trains strong classifiers for product comments with the AdaBoost algorithm. w samples (C1, y1), (C2, y2), ..., (Cw, yw), ym ∈ {0, 1}, are selected from a product comment set, where Cw is the w-th comment sample in the set and ym is the label value of the sample: 1 denotes a normal comment (positive sample) and 0 denotes a spam comment (negative sample).
(2) Obtaining a strong classifier for each spam comment feature rck; these are called feature strong classifiers. Since there are 7 types of spam comment features, k takes values 1 to 7, and the 7 feature strong classifiers are formed in the same way. In the embodiment of the present invention, the strong classifier for feature rck is obtained as follows:
21) Initializing the sample weights
The total number of iterations is defined as Q and the sample weights are initialized; assuming there are a positive samples, the initial error weight of each sample is 1/a.
22) For each feature, a number of weak classifiers are constructed; a weak classifier has relatively low classification precision for the corresponding feature, while the strong classifier built from them has relatively high precision. In the current (n+1)-th iteration, the weak classifier hq with the lowest error rate for feature rck is obtained.
In the embodiment of the present invention, the error rate of the q-th weak classifier under feature rck is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
23) Updating the weight of the weak classifier hq and the error weights of the samples; after the error weights are normalized, the iteration counter is increased by 1 and step 22) is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
During the iterations, if the error rates of the trained weak classifiers are all larger than 0.6 or equal to 0, the current weak classifier is deleted and the current comments are not reused.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight. As the iterations proceed, the weight of the weak classifier hq is updated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
In the embodiment of the present invention, the error weight of a sample is updated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
In the embodiment of the present invention, the strong classifier for each feature is expressed as follows [formula not reproduced]: hq(si) denotes the classification result of the weak classifier, where 0 means the class output by the classifier differs from the labeled class of the sample and 1 means it is the same, and h denotes the classifier.
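The filing presents its AdaBoost formulas only as images, so the sketch below falls back on the standard discrete AdaBoost expressions: the weighted error, the classifier weight mu = 0.5 * ln((1 - eps) / eps), and an exponential sample-weight update. It is a minimal illustration of the per-feature training loop described above, with decision stumps over a single feature value standing in for the weak classifiers; it is not the filing's exact method.

    # Minimal per-feature AdaBoost sketch. The formulas are the standard discrete
    # AdaBoost ones (an assumption: the filing's own formulas appear only as images).
    import math
    from typing import Callable, List, Tuple

    WeakClf = Callable[[float], int]   # maps a feature value to 0 (spam) or 1 (normal)

    def make_stumps(thresholds: List[float]) -> List[WeakClf]:
        """Decision stumps over a single feature value, used as weak classifiers."""
        return [lambda v, t=t: 1 if v >= t else 0 for t in thresholds]

    def train_feature_strong_classifier(
        values: List[float], labels: List[int], stumps: List[WeakClf], Q: int
    ) -> List[Tuple[float, WeakClf]]:
        m = len(values)
        w = [1.0 / m] * m                                  # initial sample error weights
        picked: List[Tuple[float, WeakClf]] = []
        for _ in range(Q):
            # weighted error of each candidate weak classifier
            errors = [sum(wi for wi, v, y in zip(w, values, labels) if h(v) != y)
                      for h in stumps]
            q = min(range(len(stumps)), key=lambda j: errors[j])
            eps = errors[q]
            if eps >= 0.6 or eps == 0:                     # discard degenerate rounds
                continue
            mu = 0.5 * math.log((1 - eps) / eps)           # classifier weight (assumed form)
            picked.append((mu, stumps[q]))
            # exponential sample-weight update and normalization (assumed form)
            w = [wi * math.exp(-mu if stumps[q](v) == y else mu)
                 for wi, v, y in zip(w, values, labels)]
            s = sum(w)
            w = [wi / s for wi in w]
        return picked

    def strong_predict(classifier: List[Tuple[float, WeakClf]], value: float) -> int:
        score = sum(mu * (1 if h(value) == 1 else -1) for mu, h in classifier)
        return 1 if score >= 0 else 0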
(3) Forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
In the embodiment of the present invention, the comment strong classifier is obtained as follows:
31) Defining the total number of iterations as Q; assuming there are a positive samples, the initial error weight of each sample is 1/a.
32) In the current (n+1)-th iteration, a group of weak classifiers h(s) is constructed based on the feature strong classifiers, and the weak classifier hc-min with the lowest error rate is obtained. The error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: hq(s) denotes the q-th weak classifier, εq denotes the error rate of the q-th weak classifier hq(s), m denotes the number of training samples in the product comment set, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
33) Updating the weight of the weak classifier hc-min and the error weights of the samples; after the error weights are normalized, the iteration counter is increased by 1 and step 32) is executed again until Q iterations have been performed, after which the comment strong classifier is output.
During the iterations, if the error rates of the trained weak classifiers are all larger than 0.6 or equal to 0, the current weak classifier is deleted and the current comments are not reused.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight. As the iterations proceed, the weight of the weak classifier hc-min is updated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hc-min, and εq denotes the error rate of the weak classifier hc-min.
In the embodiment of the present invention, the error weight of a sample is updated as follows [formula not reproduced]: hq(s) denotes the q-th weak classifier, w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier hq(s), and yi is the label value of the positive or negative sample.
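Under the same standard-AdaBoost assumption, a second boosting stage can be run over the outputs of the seven feature strong classifiers to obtain the comment strong classifier. The sketch below is again illustrative only; the function and variable names are not taken from the filing.

    # Second stage (sketch): combine the 7 feature strong classifiers into a
    # comment strong classifier with the same standard AdaBoost update (an
    # assumption; the filing's formulas appear only as images).
    import math

    def train_comment_strong_classifier(outputs, labels, Q):
        """
        outputs: m rows, each row holding the 7 binary outputs
                 (one per feature strong classifier) for one comment.
        labels:  m labels, 1 = normal comment, 0 = spam comment.
        Returns a list of (weight, feature_index) pairs.
        """
        m, k = len(outputs), len(outputs[0])
        w = [1.0 / m] * m
        picked = []
        for _ in range(Q):
            errors = [sum(wi for wi, row, y in zip(w, outputs, labels) if row[j] != y)
                      for j in range(k)]
            j = min(range(k), key=lambda idx: errors[idx])
            eps = errors[j]
            if eps >= 0.6 or eps == 0:
                continue
            mu = 0.5 * math.log((1 - eps) / eps)
            picked.append((mu, j))
            w = [wi * math.exp(-mu if row[j] == y else mu)
                 for wi, row, y in zip(w, outputs, labels)]
            s = sum(w)
            w = [wi / s for wi in w]
        return picked

    def classify_comment(model, row):
        score = sum(mu * (1 if row[j] == 1 else -1) for mu, j in model)
        return 1 if score >= 0 else 0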
In the embodiment of the invention, the AdaBoost algorithm works by changing the weight distribution of the samples: a different training set is formed in each round by updating the weight of every sample, so the training set of the next classifier is the data set re-weighted by the previous classifier. In each training round the weight of every sample is updated according to whether it was classified correctly and to the classification error rate of the previous round's weak classifier: the weights of correctly classified samples are decreased and the weights of misclassified samples are increased, so that misclassified samples are emphasized. However, if the weight update is not limited, some extreme samples, or samples that are inherently hard to classify, keep being misclassified as the iterations continue, and the weights of these emphasized "hard" samples grow exponentially at every update. The invention therefore limits the sample weights by adding the following correction after the cyclic update.
in the embodiment of the invention, after the error weight of the sample is updated every time, whether the following conditions exist is judged: the current iteration number n +1 is larger than the set number value,and there is an error weight w of the samplen+1,iIf the error weight is larger than the set threshold, the error weight of the sample is corrected by adopting the following formula, and the corrected error weight is wn+1,i
Figure BDA0002830490840000111
vmRepresenting the number of times a sample i is classified as erroneous, adding a logarithm such that the impact of the number of errors is reduced, log3 > 1, so when the number of errors v isiThe sample weight is slowly reduced when the sample weight is more than 3, so that the exponential increase of the sample weight is effectively inhibited.
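The correction formula itself appears only as an image in the filing; the snippet below implements one plausible reading, dividing the weight by the base-3 logarithm of the error count once that count exceeds 3. Both the trigger conditions and the exact form of the correction are assumptions.

    # One plausible reading of the sample-weight correction (assumption: the
    # filing's formula image is not reproduced). Weights of samples that keep
    # being misclassified are damped by log base 3 of their error count.
    import math

    def correct_weight(w_next: float, error_count: int,
                       iteration: int, min_iteration: int = 5,
                       threshold: float = 0.2) -> float:
        if iteration > min_iteration and w_next > threshold and error_count > 3:
            return w_next / math.log(error_count, 3)   # log3(v) > 1 when v > 3
        return w_next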
Similarity is one of the most important feature values of product comments, but existing similarity calculations cannot effectively detect synonyms, so two words with similar meanings are treated as completely different words and misjudgments result. This description therefore adds semantic information between words to the similarity comparison, such as synonym information, word-form similarity and position information. The improved similarity formula is as follows [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words. The comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value. In the basic formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
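Because the improved formula is likewise shown only as an image, the sketch below illustrates just the core idea of folding synonym information into the weighted cosine similarity: synonymous keywords are mapped to a shared canonical phrase before weights are computed. The synonym table and merging strategy are illustrative assumptions, and the sketch reuses cosine_similarity() from the earlier sketch.

    # Synonym-aware variant of the weighted cosine similarity (assumption: an
    # illustration of the idea only; the filing's improved formula is an image).
    SYNONYMS = {"省油": "油耗低", "费油": "油耗高"}   # placeholder synonym table

    def merge_synonyms(tokens: list[str]) -> list[str]:
        """Map each keyword to a canonical phrase so synonyms share one weight."""
        return [SYNONYMS.get(t, t) for t in tokens]

    def improved_similarity(desc_tokens: list[str], comment_tokens: list[str]) -> float:
        # Reuses cosine_similarity() from the earlier sketch.
        return cosine_similarity(merge_synonyms(desc_tokens), merge_synonyms(comment_tokens))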
It is clear that the specific implementation of the invention is not restricted to the above-described embodiments, but that various insubstantial modifications of the inventive process concept and technical solutions are within the scope of protection of the invention.

Claims (8)

1. A method for identifying automobile forum spam comments, characterized by comprising the following steps:
S1, selecting samples and labeling them;
S2, generating a strong classifier for each type of spam comment feature;
S3, forming a comment strong classifier based on the feature strong classifiers, and identifying spam comments based on the comment strong classifier.
2. The automobile forum spam comment identification method as recited in claim 1, wherein the spam comment features include:
absence of subject words; presence of hyperlinks and advertising words; presence of forbidden words; a low frequency of emotional words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
3. The automobile forum spam comment identification method as claimed in claim 2, wherein the feature strong classifier is formed as follows:
S21, defining the total number of iterations as Q, and initializing the sample weights;
S22, in the current (n+1)-th iteration, obtaining the weak classifier hq with the lowest error rate for feature rck;
S23, updating the weight of the weak classifier hq; after the error weights are normalized, the iteration counter is increased by 1 and S22 is executed again until Q iterations have been performed, after which the strong classifier for feature rck is output.
4. The automobile forum spam comment identification method as claimed in claim 3, wherein the weight of the weak classifier hq is calculated as follows [formula not reproduced]: μq denotes the updated weight of the weak classifier hq, and εq denotes the error rate of the weak classifier hq.
5. The automobile forum spam comment identification method as claimed in claim 4, wherein the error weight of a sample is calculated as follows [formula not reproduced]: w(n+1)i is the error weight of sample i in the (n+1)-th iteration, wni is the error weight of sample i in the n-th iteration, μq is the weight coefficient of the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and hq(si) is the q-th weak classifier under feature rck.
6. The automobile forum spam comment identification method as claimed in claim 3, wherein after the error weights of the samples are updated, if the current iteration number n+1 is greater than a set number and there exists a sample whose error weight w(n+1)i is larger than a set threshold, the error weight of that sample is corrected with the following formula [formula not reproduced]: w*(n+1)i is the corrected error weight of sample i in the (n+1)-th iteration, w(n+1)i is the error weight of sample i in the (n+1)-th iteration before correction, and vm is the number of times sample i has been misclassified.
7. The automobile forum spam comment identification method as claimed in claim 2, wherein the similarity between a comment and the standard product description is calculated with the following formula [formula not reproduced]: Wis' is the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; Wic' is the weight of phrase i in product comment j; Wis is the weight of keyword i in the standard product description; Wic is the weight of keyword i in product comment j; Sim(si, cij') is the similarity between si and cij'; si is the i-th sample; cij' is the j-th keyword of the i-th comment in the comment set; Same(s, cj) is the number of subject words contained in the comment; and len(s) is the length of the subject words; the comment length is normalized [formula not reproduced], and a smoothing factor of 0.5 is introduced to reduce its influence on the similarity value; in the basic formula [formula not reproduced], Wis is the weight of the valid keyword i in the standard product description and Wic is the weight of the valid keyword i in product comment j.
8. The automobile forum spam comment identification method as claimed in claim 3, wherein the error rate of the q-th weak classifier is calculated as follows [formula not reproduced]: εq denotes the error rate of the q-th weak classifier under feature rck, m denotes the number of training samples in the product comment set, hq(si) denotes the q-th weak classifier under feature rck, yi is the label value of the positive or negative sample, and wni is the error weight of the i-th sample in the n-th iteration.
CN202011458869.2A 2020-12-11 2020-12-11 Automobile forum spam comment identification method Pending CN112559685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011458869.2A CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011458869.2A CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Publications (1)

Publication Number Publication Date
CN112559685A true CN112559685A (en) 2021-03-26

Family

ID=75062290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011458869.2A Pending CN112559685A (en) 2020-12-11 2020-12-11 Automobile forum spam comment identification method

Country Status (1)

Country Link
CN (1) CN112559685A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
CN106708868A (en) * 2015-11-16 2017-05-24 中国移动通信集团北京有限公司 Method and system for analyzing internet data
CN106844349A (en) * 2017-02-14 2017-06-13 广西师范大学 Comment spam recognition methods based on coorinated training
CN108153733A (en) * 2017-12-26 2018-06-12 北京小度信息科技有限公司 Comment on the sorting technique and device of quality
CN111582350A (en) * 2020-04-30 2020-08-25 上海电力大学 Filtering factor optimization AdaBoost method and system based on distance weighted LSSVM

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
SHUQIONG WU: "Parameterized AdaBoost: Introducing a Parameter to Speed Up the Training of Real AdaBoost", IEEE Signal Processing Letters *
REN KEQIANG et al.: "AdaBoost face detection algorithm based on fusion optimization of AFSA and PSO", Journal of Chinese Computer Systems *
LI ZHIXIN et al.: "Co-Training-based method for identifying spam comments on microblogs", Computer Engineering *
WANG JIAN: "Fault feature extraction methods for imbalanced samples", 29 February 2016 *
SHI LEI: "Research on e-commerce spam comment identification based on imbalanced data processing", China Master's Theses Full-text Database *
DENG BINGNA et al.: "A spam comment identification method applied to blogs", Journal of Zhengzhou University (Natural Science Edition) *
HUANG LING et al.: "AdaBoost-based method for identifying spam comments on microblogs", Journal of Computer Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023016267A1 (en) * 2021-08-12 2023-02-16 北京锐安科技有限公司 Spam comment identification method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
CN109325165B (en) Network public opinion analysis method, device and storage medium
CN111198995B (en) Malicious webpage identification method
TWI424325B (en) Systems and methods for organizing collective social intelligence information using an organic object data model
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN103309862B (en) Webpage type recognition method and system
CN109902179A (en) The method of screening electric business comment spam based on natural language processing
CN106506327B (en) Junk mail identification method and device
CN101477544A (en) Rubbish text recognition method and system
Mohanty et al. Resumate: A prototype to enhance recruitment process with NLP based resume parsing
CN109522412A (en) Text emotion analysis method, device and medium
CN108388554A (en) Text emotion identifying system based on collaborative filtering attention mechanism
CN110955750A (en) Combined identification method and device for comment area and emotion polarity, and electronic equipment
Ghosal et al. Sentiment analysis on (Bengali horoscope) corpus
Marasović et al. Multilingual modal sense classification using a convolutional neural network
CN109933648A (en) A kind of differentiating method and discriminating device of real user comment
CN110569495A (en) Emotional tendency classification method and device based on user comments and storage medium
Arya et al. News web page classification using url content and structure attributes
Sakai et al. Cause information extraction from financial articles concerning business performance
CN112463966B (en) False comment detection model training method, false comment detection model training method and false comment detection model training device
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN111523311B (en) Search intention recognition method and device
CN112559685A (en) Automobile forum spam comment identification method
Putra et al. Sentiment analysis on marketplace review using hybrid lexicon and svm method
CN112100385A (en) Single label text classification method, computing device and computer readable storage medium
Kae et al. Categorization of display ads using image and landing page features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210326