CN112559685A - Automobile forum spam comment identification method - Google Patents
- Publication number
- CN112559685A (application number CN202011458869.2A)
- Authority
- CN
- China
- Prior art keywords
- comment
- weight
- sample
- spam
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Abstract
The invention relates to the technical field of product comments and provides an automobile forum spam comment identification method comprising the following steps: S1, selecting samples and labeling them; S2, generating a strong classifier for each spam comment feature; and S3, forming a comment strong classifier from the feature strong classifiers and identifying spam comments with it. The method fully considers the influence on the comment result of comment features such as subject words, advertisement words, emotion words, forbidden words, similarity, comment repetition count and special symbols, improving the accuracy of spam comment identification.
Description
Technical Field
The invention relates to the technical field of product comments, and provides an automobile forum spam comment identification method.
Background
At present, mainstream automobile portal websites and blog forums allow users to publish comments on a product or topic, so some false, purposeful rumor or advertisement information irrelevant to the product or topic appears; such content is called spam comments. A spam comment is an untruthful comment that exaggerates or disparages a product, or a comment whose object is unrelated to the product itself, such as the product brand, the merchant or other products; examples include advertisements, user questions and discussions of everyday matters. These spam comments degrade the accuracy of forum comment data and mislead both consumers and vehicle manufacturers.
Disclosure of Invention
The invention provides a method for identifying automobile forum spam comments, which aims to automatically filter the spam comments and improve the accuracy of forum comment data.
The invention is realized as follows: the method for identifying automobile forum spam comments specifically comprises the following steps:
s1, selecting a sample, and labeling the sample;
s2, respectively generating various strong classifiers for identifying the spam comment features;
and S3, forming a comment strong classifier based on the various feature strong classifiers, and identifying the spam comments based on the comment strong classifier.
Further, the spam comment features include:
absence of subject words; presence of hyperlinks and advertisement words; presence of forbidden words; a low frequency of emotion words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
Further, the method of forming a feature strong classifier specifically comprises:
S21, defining the total iteration number as Q and initializing the sample weights;
S22, in the current (n+1)-th iteration, acquiring the weak classifier h_q with the minimum error rate for feature r_ck;
S23, updating the weight of the weak classifier h_q; after the error weights are normalized, the iteration count is incremented by 1 and S22 is executed again until the iteration count reaches Q; the resulting output is the strong classifier for feature r_ck.
The updated weight of the weak classifier h_q is μ_q = (1/2)·ln((1 − ε_q)/ε_q), where μ_q denotes the updated weight of weak classifier h_q and ε_q denotes the error rate of weak classifier h_q.
Further, the error weight calculation formula of a sample is specifically:
w_(n+1)i = w_ni · exp(−μ_q · y_i · h_q(s_i))
where w_(n+1)i is the error weight of sample i in the (n+1)-th iteration, w_ni is the error weight of sample i in the n-th iteration, μ_q denotes the weight coefficient of the q-th weak classifier under feature r_ck, y_i denotes the label value of positive and negative samples, and h_q(s_i) denotes the q-th weak classifier under feature r_ck.
Further, after the error weight of a sample is updated, if the current iteration number n+1 is greater than the set number and there exists a sample whose error weight w_(n+1),i is greater than the set threshold, the error weight of the sample is corrected using the following formula:
w*_(n+1),i = w_(n+1),i / log₃ v_m
where w*_(n+1),i denotes the corrected error weight of sample i in the (n+1)-th iteration, w_(n+1),i denotes the error weight of sample i in the (n+1)-th iteration before correction, and v_m denotes the number of times sample i has been misclassified.
Further, the similarity between a comment and the product standard description is calculated using the improved similarity formula, in which:
W_is′ denotes the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; W_ic′ denotes the weight of phrase i in product comment j; W_is denotes the weight of keyword i in the standard product description; W_ic denotes the weight of keyword i in product comment j; Sim(s_i, c_ij′) denotes the similarity between s_i and c_ij′; s_i denotes the i-th sample; c_ij′ denotes the j-th keyword of the i-th comment in the comment set; Same(s, c_j) is the number of subject words contained in the product comment; len(s) is the length of the subject words, used to normalize the comment length; a smoothing factor of 0.5 is introduced to reduce the influence on the similarity value;
W_is denotes the weight of the valid keyword i in the standard product description, and W_ic denotes the weight of the valid keyword i in product comment j.
Further, the error rate of the q-th weak classifier is calculated as
ε_q = Σ_{i=1}^{m} w_ni · |h_q(s_i) − y_i|
where ε_q denotes the error rate of the q-th weak classifier under feature r_ck, m denotes the number of training samples in the product comment set, h_q(s_i) denotes the q-th weak classifier under feature r_ck, y_i denotes the label value of positive and negative samples, and w_ni is the error weight of the i-th sample in the n-th iteration.
The method fully considers the influence on the comment result of comment features such as subject words, advertisement words, emotion words, forbidden words, similarity, comment repetition count and special symbols, improving the accuracy of spam comment identification.
Drawings
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided by the embodiment of the present invention.
Detailed Description
The following description of preferred embodiments of the invention will be made in further detail with reference to the accompanying drawings.
Before the product comment features are extracted, the comments are first preprocessed. The word segmentation system ICTCLAS of the Institute of Computing Technology, Chinese Academy of Sciences is used to segment the product standard description and the comments; stop words irrelevant to the comment content are removed, and the remaining valid keywords are analysed and processed.
Product comment features are important indexes for screening comment validity and for effectively identifying spam comments; they should be representative and cover all comments as far as possible. The product comments are described by the following features, and their feature values are extracted:
(1) subject term
The product subject words are the core words describing the product and also the core words of product comments, generally core nouns related to the product. The word segmentation system ICTCLAS of the Institute of Computing Technology, Chinese Academy of Sciences is used to extract the keywords of the product standard description and the core words of the comment to be evaluated, which are then compared one by one. If the comment contains none of the subject words in the product standard description, i.e. the feature value is 0, the comment is considered spam; otherwise the feature value is 1 and the comment is temporarily kept as a valid comment.
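As a minimal sketch of this check (the function name and sample data are illustrative assumptions, and keyword sets are assumed to come from a prior word-segmentation step):

```python
def subject_word_feature(comment_keywords, subject_words):
    """Feature r_c1: 1 if the comment shares at least one subject word with
    the product's standard description, 0 (likely spam) otherwise."""
    return 1 if set(comment_keywords) & set(subject_words) else 0
```

For example, a comment mentioning "engine" scores 1 against description subject words {"engine", "gearbox"}, while a comment with no shared subject word scores 0.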
(2) Hyperlink and advertisement word
Product spam comments often contain hyperlinks and advertisement words, which are typical manifestations of advertising information, mostly product promotion, shop or website recommendation, company publicity and the like.
For hyperlink spam, a hyperlink generally appears as a web address, i.e. a run of consecutive English characters such as http://. If the comment contains a hyperlink, it is considered a possible spam comment, its feature value is 0 and it is preferentially excluded; otherwise its feature value is 1 and it is temporarily kept as a valid comment.
For advertisement words, a dictionary of common advertisement words is built by analysing and summarising popular comments at the present stage, e.g. QQ, special price, hot sale, Taobao, free shipping and the like. Considering that advertisement words often include information such as a product price or a QQ number, which generally appears as digits, the comment is considered to contain advertisement words if several digits mixed with Chinese characters are scanned. Likewise, if the comment contains an advertisement word, it is considered spam, its feature value is 0 and it is preferentially excluded; otherwise its feature value is 1 and it is temporarily kept as a valid comment.
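A sketch of the hyperlink-and-advertisement-word feature follows; the dictionary entries, regular expressions and digit-run threshold are illustrative assumptions, not taken from the patent:

```python
import re

# Illustrative ad-word dictionary; the patent builds its own from popular comments.
AD_WORDS = {"QQ", "special price", "free shipping", "Taobao"}

def hyperlink_ad_feature(comment):
    """Feature r_c2: 0 (likely spam) if the comment contains a hyperlink,
    an advertisement word, or a long digit run (price / QQ number); else 1."""
    if re.search(r"https?://|www\.", comment):        # hyperlink -> likely spam
        return 0
    if any(word in comment for word in AD_WORDS):     # ad-word hit -> likely spam
        return 0
    if re.search(r"\d{5,}", comment):                 # long digit run -> likely spam
        return 0
    return 1
```

A comment like "check http://shop.example.com now" scores 0, while an ordinary remark about the car scores 1.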
(3) Forbidden word
Forbidden words are abusive or maliciously attacking words, such as tm, sb, "rotten goods" and the like; a forbidden-word dictionary is likewise built. The keywords of each comment are scanned; if a forbidden word is found, the comment is considered spam, its feature value is 0 and it is preferentially excluded; otherwise its feature value is 1 and it is temporarily kept as a valid comment.
(4) Emotional words
Product comments are people's evaluations and discussions of a product's parameters and purchase experience, in which they truly express their subjective opinions, attitudes, feelings and emotions. A genuine comment therefore necessarily contains the reviewer's sentiment, and the fewer emotion words a comment contains, the more likely it is spam. Through statistical analysis and calculation, a product comment emotion-word dictionary is likewise built, and the emotion-word frequency in the comment is used as its feature value.
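The emotion-word frequency feature can be sketched as follows (the dictionary is an illustrative stand-in for the statistically built one described above):

```python
# Illustrative emotion-word dictionary; the patent derives one by statistical analysis.
EMOTION_WORDS = {"love", "hate", "great", "terrible", "comfortable"}

def emotion_feature(keywords):
    """Feature r_c4: emotion-word frequency = emotion words / total keywords."""
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in EMOTION_WORDS)
    return hits / len(keywords)
```

A comment whose keywords are half emotion words gets a feature value of 0.5; a comment with no keywords gets 0.0.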
(5) Degree of similarity
Spam comment identification is inseparable from text similarity measurement. Similarity refers to how much two objects have in common and is a common feature index for judging whether a comment is spam. Cosine similarity is used here to measure the similarity between a comment and the product standard description:
Similarity(s, c_j) = Σ_i (W_is · W_ic) / (√(Σ_i W_is²) · √(Σ_i W_ic²))
W_is denotes the weight of the valid keyword i in the standard product description, and W_ic denotes the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
Weight calculation formula: W_is = 1 + log(n × a + 1), where n is the number of times keyword i appears in the product description and a is a weight adjustment parameter that can be adjusted automatically by the machine: the program increases or decreases the weight coefficient depending on whether the staff is satisfied with the weight.
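The weight formula and the cosine similarity can be sketched directly (function names are illustrative; weights are kept in dictionaries mapping keyword to weight):

```python
import math

def keyword_weight(n, a=1.0):
    """W = 1 + log(n*a + 1): weight of a keyword occurring n times,
    with a the adjustable parameter from the text (natural log assumed)."""
    return 1.0 + math.log(n * a + 1)

def cosine_similarity(ws, wc):
    """Cosine similarity between description weights ws and comment weights wc."""
    keys = set(ws) | set(wc)
    dot = sum(ws.get(k, 0.0) * wc.get(k, 0.0) for k in keys)
    ns = math.sqrt(sum(v * v for v in ws.values()))
    nc = math.sqrt(sum(v * v for v in wc.values()))
    return dot / (ns * nc) if ns and nc else 0.0
```

Identical weight vectors give similarity 1.0; vectors with no shared keyword give 0.0, flagging the comment as likely spam.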
(6) Number of review repetitions
Analysis of a large number of products shows statistically that one class of spam comments looks like normal comments, but viewed as a whole, the same reviewer or different reviewers are found to have posted a large number of identical or similar comments on the same question; these are called repeated comments. Such comments may be spam posted by abnormal users to attack competitors or to boost a merchant's reputation, and the larger their number, the more likely they are spam. To simplify the calculation, only identical comments longer than a certain length are counted, and repetition is judged by whether the keywords are identical. If the keyword repetition exceeds 80%, the comment is considered spam, its feature value is 0 and it is preferentially excluded; otherwise its feature value is 1 and it is temporarily kept as a valid comment.
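A sketch of the repetition check, assuming keyword overlap against the shorter keyword set as the 80% criterion (the exact denominator is not specified in the source, so this is an assumption):

```python
def is_repeat(keywords_a, keywords_b, threshold=0.8):
    """Treat two comments as repeats when keyword overlap reaches the threshold."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap >= threshold

def repetition_feature(comment_keywords, corpus_keywords):
    """Feature r_c6: 0 (likely spam) if any corpus comment repeats this one, else 1."""
    repeats = sum(1 for other in corpus_keywords
                  if is_repeat(comment_keywords, other))
    return 0 if repeats > 0 else 1
```

In practice the comment itself would be excluded from the corpus it is compared against.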
(7) Special symbols
Punctuation in normal comments is generally regular: only a few commas, pause marks, semicolons, periods or exclamation marks are used, and only very occasional reviewers write a string of question marks or exclamation marks to express strong feelings; punctuation such as "#" and "&" is not normally used at all. To bypass existing filtering mechanisms, spam comments often insert strings of special punctuation between obviously spam-like keywords to mask their spam characteristics, for example a 4S-store promotion interleaved with symbols such as "|" and "^0^". If a special symbol is found, the comment is considered spam, its feature value is 0 and it is preferentially excluded; otherwise its feature value is 1 and it is temporarily kept as a valid comment.
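The special-symbol feature can be sketched with a regular expression; the particular symbol set and run length are illustrative assumptions based on the examples above:

```python
import re

# Unusual symbols, or runs of three or more ordinary punctuation marks,
# are treated as masking attempts (illustrative pattern).
SPECIAL_RUN = re.compile(r"[|#&^~]|[!?.,]{3,}")

def special_symbol_feature(comment):
    """Feature r_c7: 0 (likely spam) if a suspicious symbol appears, else 1."""
    return 0 if SPECIAL_RUN.search(comment) else 1
```

A normal sentence ending in a single period passes; "great car!!!!" or "visit | our | store" is flagged.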
The evaluation set of a certain product is as follows:
comments = {C_1((r_c1, w_c1), (r_c2, w_c2), …, (r_c7, w_c7)), …, C_n((r_c1, w_c1), (r_c2, w_c2), …, (r_c7, w_c7))}
where C_1, C_2, …, C_n denote the different comments in the product comment set, r_c1 … r_c7 denote the feature values of a comment, and w_c1 … w_c7 denote the weights of those feature values. The features are: subject-word feature r_c1, hyperlink-and-advertisement-word feature r_c2, forbidden-word feature r_c3, emotion-word feature r_c4, similarity feature r_c5, comment-repetition feature r_c6 and special-symbol feature r_c7. The feature values are taken as follows:
r_c1, r_c2, r_c3, r_c6, r_c7 ∈ {0, 1}
r_c4 = number of emotion words / total number of keywords
r_c5 = Similarity′(s, c_j)
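The comment-set element C_n can be sketched as a tuple of (feature value, feature weight) pairs; the helper name and sample values are illustrative:

```python
def comment_record(r_values, w_values):
    """Build C_n = ((r_c1, w_c1), ..., (r_c7, w_c7)) as in the comment-set
    definition: seven feature values paired with their weights."""
    assert len(r_values) == len(w_values) == 7
    return tuple(zip(r_values, w_values))
```

For example, a comment with binary features 1 except emotion frequency 0.25 and similarity 0.6 yields a 7-pair record.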
Fig. 1 is a flowchart of an automobile forum spam comment identification method provided in an embodiment of the present invention, and the method specifically includes the following steps:
(1) selecting a sample
The invention is based on a machine-learning method and trains a strong classifier for product comments with the AdaBoost algorithm. w samples (C_1, y_1), (C_2, y_2), …, (C_w, y_w), y_m ∈ {0, 1}, are selected from a product comment set, where C_w denotes the w-th comment sample and y_m is the label value of positive and negative samples: 1 denotes a normal comment, i.e. a positive sample, and 0 denotes a spam comment, i.e. a negative sample;
(2) For each spam comment feature r_ck, a strong classifier, called a feature strong classifier, is obtained. Since there are 7 spam comment features, k ranges from 1 to 7, and the 7 feature strong classifiers are formed in the same way. In the embodiment of the invention, the strong classifier for feature r_ck is obtained as follows:
21) initializing sample weights
The total iteration number is defined as Q and the sample weights are initialized; assuming there are a positive samples, the initial sample error weight is 1/a;
22) For each feature, a number of weak classifiers are constructed. A weak classifier has relatively low classification precision for the corresponding feature, while a strong classifier has relatively high precision; the weak classifier with the highest classification precision becomes the strong classifier under the corresponding feature. In the current (n+1)-th iteration, the weak classifier h_q with the minimum error rate for feature r_ck is obtained.
In the embodiment of the present invention, the error rate of the q-th weak classifier under feature r_ck is calculated as
ε_q = Σ_{i=1}^{m} w_ni · |h_q(s_i) − y_i|
where ε_q denotes the error rate of the q-th weak classifier under feature r_ck, m denotes the number of training samples in the product comment set, h_q(s_i) denotes the q-th weak classifier under feature r_ck, y_i denotes the label value of positive and negative samples, and w_ni is the error weight of the i-th sample in the n-th iteration.
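With labels and weak-classifier outputs both in {0, 1}, the weighted error rate sums the weights of the misclassified samples. A minimal sketch (function name illustrative):

```python
def weighted_error(weights, preds, labels):
    """Error rate of a weak classifier: sum_i w_i * |h(s_i) - y_i|,
    i.e. the total weight of misclassified samples (h, y in {0, 1})."""
    return sum(w * abs(h - y) for w, h, y in zip(weights, preds, labels))
```

With four equally weighted samples and one misclassification, the error rate is 0.25.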
23) The weight of the weak classifier h_q is updated; after the error weights are normalized, the iteration count is incremented by 1 and step 22) is executed again until the iteration count reaches Q, whereupon the strong classifier for feature r_ck is output;
During iteration, if the error rate of a trained weak classifier is greater than 0.6 or equal to 0, the current weak classifier is deleted and the current comment is no longer used.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight becomes as the iterations proceed. The weight update formula of the classifier h_q is
μ_q = (1/2)·ln((1 − ε_q)/ε_q)
where μ_q denotes the updated weight of weak classifier h_q and ε_q denotes the error rate of weak classifier h_q.
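The weight update μ_q = (1/2)·ln((1 − ε_q)/ε_q) (the standard AdaBoost form consistent with "smaller error, larger weight"; the exact constant is an assumption since the source formula is not reproduced) can be sketched as:

```python
import math

def classifier_weight(error_rate):
    """mu_q = 0.5 * ln((1 - eps) / eps): a classifier that is no better than
    chance (eps = 0.5) gets weight 0; lower error gives a larger vote."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)
```

Error rates of exactly 0 or 1 are excluded here, matching the iteration rule that deletes weak classifiers with error rate equal to 0 or above 0.6.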
In the embodiment of the present invention, the error weight update formula of a sample is specifically:
w_(n+1)i = w_ni · exp(−μ_q · y_i · h_q(s_i))
where w_(n+1)i is the error weight of sample i in the (n+1)-th iteration, w_ni is the error weight of sample i in the n-th iteration, μ_q denotes the weight coefficient of the q-th weak classifier under feature r_ck, y_i denotes the label value of positive and negative samples, and h_q(s_i) denotes the q-th weak classifier under feature r_ck.
In the embodiment of the present invention, the strong classifier for each feature is expressed as
H(s) = 1 if Σ_q μ_q · h_q(s) ≥ (1/2) Σ_q μ_q, and H(s) = 0 otherwise,
where h_q(s_i) denotes the result of classification by the q-th weak classifier: 0 means the classifier's output differs from the sample's labelled class, 1 means they are the same, and h denotes the classifier.
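A sketch of the weighted vote (the half-total-weight threshold is an assumption in the Viola-Jones style, since the source formula is not reproduced):

```python
def strong_classify(mus, weak_outputs):
    """Weighted vote of weak classifiers: output 1 when the weighted sum of
    h_q(s) in {0, 1} reaches half of the total vote weight, else 0."""
    vote = sum(mu * h for mu, h in zip(mus, weak_outputs))
    return 1 if vote >= 0.5 * sum(mus) else 0
```

With weights [1.0, 2.0, 0.5], two agreeing weak classifiers carrying weight 3.0 exceed the 1.75 threshold and the strong classifier outputs 1.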
(3) A comment strong classifier is formed on the basis of the feature strong classifiers, and spam comments are identified on the basis of the comment strong classifier.
In the embodiment of the present invention, the method for obtaining the comment strong classifier specifically includes:
31) The total iteration number is defined as Q; assuming there are a positive samples, the initial sample error weight is 1/a;
32) In the current (n+1)-th iteration, the weak classifier h_c-min with the minimum error rate is obtained from a group of weak classifiers h(s) constructed on the basis of the feature strong classifiers, where the error rate of the q-th weak classifier is
ε_q = Σ_{i=1}^{m} w_ni · |h_q(s_i) − y_i|
h_q(s) denotes the q-th weak classifier, ε_q denotes the error rate of the q-th weak classifier h_q(s), m denotes the number of training samples in the product comment set, y_i denotes the label value of positive and negative samples, and w_ni is the error weight of the i-th sample in the n-th iteration;
33) The weight of the weak classifier h_c-min is updated; after the error weights are normalized, the iteration count is incremented by 1 and step 32) is executed again until the iteration count reaches Q, whereupon the comment strong classifier is output;
During iteration, if the error rate of a trained weak classifier is greater than 0.6 or equal to 0, the current weak classifier is deleted and the current comment is no longer used.
In the embodiment of the invention, the smaller the error rate of a classifier, the larger its weight becomes as the iterations proceed. The weight update formula of the classifier h_c-min is
μ_q = (1/2)·ln((1 − ε_q)/ε_q)
where μ_q denotes the updated weight of weak classifier h_c-min and ε_q denotes the error rate of weak classifier h_c-min.
In the embodiment of the present invention, the error weight calculation formula of a sample is specifically:
w_(n+1)i = w_ni · exp(−μ_q · y_i · h_q(s_i))
where h_q(s) denotes the q-th weak classifier, w_(n+1)i is the error weight of sample i in the (n+1)-th iteration, w_ni is the error weight of sample i in the n-th iteration, μ_q denotes the weight coefficient of the q-th weak classifier h_q(s), and y_i denotes the label value of positive and negative samples.
In the embodiment of the invention, the AdaBoost algorithm works by changing the weight distribution of the samples: a different training set is formed in each round by updating the weight of each sample, and the training set of the next classifier is the data set produced by the previous classifier's weight updates. In each round the weight of every sample is updated according to whether it was classified correctly and according to the classification error rate of the previous round's weak classifier: the weights of correctly classified samples decrease and the weights of misclassified samples increase, so that misclassified samples are emphasised. However, if the weight update is not limited, some extreme samples, or samples that are inherently hard to classify, are misclassified in every round as the iterations proceed, and their weights grow exponentially. The invention therefore imposes a limit on the sample weights, applied after each cyclic update:
In the embodiment of the invention, after each update of the sample error weights, it is judged whether the current iteration number n+1 is greater than the set number and there exists a sample whose error weight w_(n+1),i is greater than the set threshold. If so, the error weight of the sample is corrected with the following formula, the corrected error weight being w*_(n+1),i:
w*_(n+1),i = w_(n+1),i / log₃ v_m
where v_m denotes the number of times sample i has been misclassified. The logarithm reduces the influence of the error count: since log₃ v_m > 1 when v_m > 3, the weight of a frequently misclassified sample is slowly reduced, effectively suppressing the exponential growth of sample weights.
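The damping can be sketched as follows; since the exact correction formula is not reproduced in the source, division by log₃ v_m for v_m > 3 is an assumption consistent with the "log₃ v_m > 1" remark:

```python
import math

def corrected_weight(w, v):
    """Damp the weight of a sample misclassified v times:
    w* = w / log_3(v) once v > 3 (so log_3(v) > 1), else w unchanged."""
    if v <= 3:
        return w
    return w / (math.log(v) / math.log(3))  # log base 3 via change of base
```

A sample misclassified 9 times has its weight halved (log₃ 9 = 2), while a sample misclassified only twice is untouched.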
Similarity is one of the most important feature values in product evaluation. Existing similarity calculations cannot effectively detect synonyms, so two words with similar meanings are treated as completely different words, causing misjudgment. Semantic information between words is therefore added to the similarity comparison, such as synonym information, word-form similarity and position information; the improved similarity formula is as follows:
W_is′ denotes the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; W_ic′ denotes the weight of phrase i in product comment j; W_is denotes the weight of keyword i in the standard product description; W_ic denotes the weight of keyword i in product comment j; Sim(s_i, c_ij′) denotes the similarity between s_i and c_ij′; s_i denotes the i-th sample; c_ij′ denotes the j-th keyword of the i-th comment in the comment set; Same(s, c_j) is the number of subject words contained in the product comment; len(s) is the length of the subject words, used to normalize the comment length; a smoothing factor of 0.5 is introduced to reduce the influence on the similarity value.
W_is denotes the weight of the valid keyword i in the standard product description, and W_ic denotes the weight of the valid keyword i in product comment j. The smaller the similarity, the more likely the comment is spam.
Obviously, the specific implementation of the invention is not restricted to the above embodiments; various insubstantial modifications made with the inventive concept and technical solutions of the invention all fall within the scope of protection of the invention.
Claims (8)
1. An automobile forum spam comment identification method, characterized by comprising the following steps:
s1, selecting a sample, and labeling the sample;
s2, respectively generating various strong classifiers for identifying the spam comment features;
and S3, forming a comment strong classifier based on the various feature strong classifiers, and identifying the spam comments based on the comment strong classifier.
2. The automobile forum spam comment identification method according to claim 1, wherein the spam comment features include:
absence of subject words; presence of hyperlinks and advertisement words; presence of forbidden words; a low frequency of emotion words; low similarity between the comment and the product description; a high comment repetition count; and presence of special symbols.
3. The automobile forum spam comment identification method according to claim 2, wherein the feature strong classifier is formed as follows:
S21, defining the total iteration number as Q and initializing the sample weights;
S22, in the current (n+1)-th iteration, acquiring the weak classifier h_q with the minimum error rate for feature r_ck.
5. The automobile forum spam comment identification method according to claim 4, wherein the error weight calculation formula of the sample is as follows:
w_(n+1)i is the error weight of sample i in the (n+1)-th iteration, w_ni is the error weight of sample i in the n-th iteration, μ_q denotes the weight coefficient of the q-th weak classifier under feature r_ck, y_i denotes the label value of positive and negative samples, and h_q(s_i) denotes the q-th weak classifier under feature r_ck.
6. The automobile forum spam comment identification method according to claim 3, wherein after the error weight of a sample is updated, if the current iteration number n+1 is greater than the set number and there exists a sample whose error weight w_(n+1),i is greater than the set threshold, the error weight of the sample is corrected using the following formula:
w*_(n+1),i denotes the corrected error weight of sample i in the (n+1)-th iteration, w_(n+1),i denotes the error weight of sample i in the (n+1)-th iteration before correction, and v_m denotes the number of times sample i has been misclassified.
7. The automobile forum spam comment identification method according to claim 2, wherein the similarity between a comment and the product standard description is calculated using the following formula:
W_is′ denotes the weight of phrase i in the standard product description, where phrase i consists of keyword i and its synonyms; W_ic′ denotes the weight of phrase i in product comment j; W_is denotes the weight of keyword i in the standard product description; W_ic denotes the weight of keyword i in product comment j; Sim(s_i, c_ij′) denotes the similarity between s_i and c_ij′; s_i denotes the i-th sample; c_ij′ denotes the j-th keyword of the i-th comment in the comment set; Same(s, c_j) is the number of subject words contained in the product comment; len(s) is the length of the subject words, used to normalize the comment length; and a smoothing factor of 0.5 is introduced to reduce the influence on the similarity value;
W_is denotes the weight of the valid keyword i in the standard product description, and W_ic denotes the weight of the valid keyword i in product comment j.
8. The automobile forum spam comment identification method according to claim 3, wherein the error rate of the q-th weak classifier is calculated with the following formula:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011458869.2A CN112559685A (en) | 2020-12-11 | 2020-12-11 | Automobile forum spam comment identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011458869.2A CN112559685A (en) | 2020-12-11 | 2020-12-11 | Automobile forum spam comment identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112559685A true CN112559685A (en) | 2021-03-26 |
Family
ID=75062290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011458869.2A Pending CN112559685A (en) | 2020-12-11 | 2020-12-11 | Automobile forum spam comment identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112559685A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023016267A1 (en) * | 2021-08-12 | 2023-02-16 | 北京锐安科技有限公司 | Spam comment identification method and apparatus, and device and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462409A (en) * | 2014-12-12 | 2015-03-25 | 重庆理工大学 | Cross-language emotional resource data identification method based on AdaBoost |
CN106708868A (en) * | 2015-11-16 | 2017-05-24 | 中国移动通信集团北京有限公司 | Method and system for analyzing internet data |
CN106844349A (en) * | 2017-02-14 | 2017-06-13 | 广西师范大学 | Comment spam recognition methods based on coorinated training |
CN108153733A (en) * | 2017-12-26 | 2018-06-12 | 北京小度信息科技有限公司 | Comment on the sorting technique and device of quality |
CN111582350A (en) * | 2020-04-30 | 2020-08-25 | 上海电力大学 | Filtering factor optimization AdaBoost method and system based on distance weighted LSSVM |
- 2020-12-11: CN application CN202011458869.2A filed; published as CN112559685A; status: Pending
Non-Patent Citations (7)
Title |
---|
SHUQIONG WU: "Parameterized AdaBoost: Introducing a Parameter to Speed Up the Training of Real AdaBoost", IEEE Signal Processing Letters * |
Ren Keqiang et al.: "AdaBoost face detection algorithm based on fused AFSA and PSO optimization", Journal of Chinese Computer Systems * |
Li Zhixin et al.: "Co-Training-based spam comment identification method for microblogs", Computer Engineering * |
Wang Jian: "Fault Feature Extraction Methods for Imbalanced Samples", 29 February 2016 * |
Shi Lei: "Research on e-commerce spam review identification based on imbalanced data processing", China Masters' Theses Full-text Database * |
Deng Bingna et al.: "A spam comment identification method for blogs", Journal of Zhengzhou University (Natural Science Edition) * |
Huang Ling et al.: "AdaBoost-based spam comment identification method for microblogs", Journal of Computer Applications * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109325165B (en) | Network public opinion analysis method, device and storage medium | |
CN111198995B (en) | Malicious webpage identification method | |
TWI424325B (en) | Systems and methods for organizing collective social intelligence information using an organic object data model | |
Chang et al. | Research on detection methods based on Doc2vec abnormal comments | |
CN103309862B (en) | Webpage type recognition method and system | |
CN109902179A (en) | The method of screening electric business comment spam based on natural language processing | |
CN106506327B (en) | Junk mail identification method and device | |
CN101477544A (en) | Rubbish text recognition method and system | |
Mohanty et al. | Resumate: A prototype to enhance recruitment process with NLP based resume parsing | |
CN109522412A (en) | Text emotion analysis method, device and medium | |
CN108388554A (en) | Text emotion identifying system based on collaborative filtering attention mechanism | |
CN110955750A (en) | Combined identification method and device for comment area and emotion polarity, and electronic equipment | |
Ghosal et al. | Sentiment analysis on (Bengali horoscope) corpus | |
Marasović et al. | Multilingual modal sense classification using a convolutional neural network | |
CN109933648A (en) | A kind of differentiating method and discriminating device of real user comment | |
CN110569495A (en) | Emotional tendency classification method and device based on user comments and storage medium | |
Arya et al. | News web page classification using url content and structure attributes | |
Sakai et al. | Cause information extraction from financial articles concerning business performance | |
CN112463966B (en) | False comment detection model training method, false comment detection model training method and false comment detection model training device | |
CN113220964B (en) | Viewpoint mining method based on short text in network message field | |
CN111523311B (en) | Search intention recognition method and device | |
CN112559685A (en) | Automobile forum spam comment identification method | |
Putra et al. | Sentiment analysis on marketplace review using hybrid lexicon and svm method | |
CN112100385A (en) | Single label text classification method, computing device and computer readable storage medium | |
Kae et al. | Categorization of display ads using image and landing page features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2021-03-26