CN107818173B - Vector space model-based Chinese false comment filtering method - Google Patents

Vector space model-based Chinese false comment filtering method

Info

Publication number
CN107818173B
CN107818173B (application CN201711129611.6A)
Authority
CN
China
Prior art keywords
comments
comment
neural network
false
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711129611.6A
Other languages
Chinese (zh)
Other versions
CN107818173A (en)
Inventor
刘珊
杨波
郑文锋
蔡礼高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201711129611.6A priority Critical patent/CN107818173B/en
Publication of CN107818173A publication Critical patent/CN107818173A/en
Application granted granted Critical
Publication of CN107818173B publication Critical patent/CN107818173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a vector space model-based Chinese false comment filtering method. The similarity between comments is judged with an improved vector space model, and highly similar comments are taken as one part of the false comments. Another part of the false comments is screened out by combining each comment's sentiment polarity with its user score. A set of genuine comment samples is also introduced, and a BP neural network is trained with the two types of samples. The trained network is then used to judge unlabeled comments.

Description

Vector space model-based Chinese false comment filtering method
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a Chinese false comment filtering method based on a vector space model.
Background
With the increasing maturity of internet technology, consumers' enthusiasm for online reviewing has grown, and a large amount of comment data is generated on the network. Users can draw on this comment information to support purchase decisions, but they are also troubled by uneven comment quality, information overload, and similar problems.
While the network brings a convenient experience to consumers, its lack of regional limits also brings drawbacks such as a missing basis for purchase decisions and product descriptions that do not match reality. More and more consumers therefore want to know previous buyers' evaluations of and attitudes toward a product before purchasing it, so as to make a reliable decision. However, with the rapid growth in the number of reviews and the wide variety of their content, it is increasingly difficult for users to obtain valuable evaluation information.
It is difficult to identify truly valuable information from massive numbers of comments by manual methods alone, and an automatic method is urgently needed to assist screening; the evaluation and screening of text content therefore has important research value.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a vector space model-based Chinese false comment filtering method, which identifies false comments on a film review website based on a BP neural network so as to provide users with a genuine consumption reference.
To achieve the above purpose, the invention provides a vector space model-based method for filtering Chinese false comments, characterized by comprising the following steps:
(1) Simulating website login and capturing comments;
(2) removing comments shorter than the set comment length L;
(3) segmenting the comments into words to obtain the sentence component structure;
(3.1) firstly establishing an interference word bank comprising connecting words, subjects, and objects; then calculating the proportion of interference words in each comment, comparing it with a preset proportion threshold, and rejecting comments whose proportion exceeds the threshold;
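The filters of steps (2) and (3.1) can be sketched in Python as below; the length threshold of 15 and ratio threshold of 50% follow the embodiment later in the text, but the interference lexicon itself is an illustrative stand-in, since the patent does not list its actual word bank:

```python
# Sketch of steps (2) and (3.1): length filtering plus interference-word
# ratio filtering. The lexicon below is a tiny illustrative stand-in for
# the patent's interference word bank (connectives, subjects, objects).

MIN_LENGTH = 15          # comment-length threshold L from the embodiment
RATIO_THRESHOLD = 0.50   # interference-word proportion threshold

INTERFERENCE_WORDS = {"的", "了", "我", "你", "他", "和", "是", "就", "都"}

def keep_comment(tokens):
    """Return True if the segmented comment survives both filters."""
    if len("".join(tokens)) < MIN_LENGTH:
        return False  # step (2): too short
    ratio = sum(t in INTERFERENCE_WORDS for t in tokens) / len(tokens)
    return ratio <= RATIO_THRESHOLD  # step (3.1): too many function words
```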
(3.2) performing word segmentation on the comments obtained in step (3.1) with the Chinese Academy of Sciences NLPIR Chinese word segmentation Java tool, deleting punctuation, encoding each segmented comment by part of speech, and building a comment structure code library; if an identical code already exists in the library, the comment template's feature value is incremented by 1, otherwise nothing is modified;
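Step (3.2) can be sketched as follows; since the NLPIR Java segmenter is not reproduced here, a toy part-of-speech lexicon stands in for its tagger (the tags and lexicon entries are hypothetical):

```python
# Sketch of step (3.2): map each segmented comment to a part-of-speech
# code string and count duplicate structures. POS_LEXICON is a toy
# stand-in for the NLPIR tagger ('n' noun, 'd' adverb, 'a' adjective,
# 'v' verb, 'x' unknown).
from collections import Counter

POS_LEXICON = {"电影": "n", "剧情": "n", "非常": "d", "很": "d",
               "精彩": "a", "好看": "a", "喜欢": "v", "推荐": "v"}

def structure_code(tokens):
    """Part-of-speech code string of a segmented comment."""
    return "".join(POS_LEXICON.get(t, "x") for t in tokens)

def template_counts(comments):
    """How many comments share each structure code: the template
    feature incremented when an identical code already exists."""
    return Counter(structure_code(c) for c in comments)
```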
(4) sorting the comments obtained in step (3) by the number of "useful" votes from users, selecting the top 5% as real comments, and marking them as positive samples;
(5) constructing an improved vector space model from the unmarked comments in step (4);
(5.1) carrying out word frequency TF and inverse word frequency IDF statistics on the unmarked comments in step (4):
TF = f/m, where the TF value lies between 0 and 1, f represents the number of occurrences of the current word in the current comment, and m represents the number of occurrences of the most frequent word in the current comment;
IDF = log(n / n_w)

where n represents the total number of comments in the entire corpus and n_w represents the number of comments containing the current word;
(5.2) constructing an improved version vector space model
sim(d_i, d_j) = (Σ_{k=1}^{N} w_ik · w_jk) / (√(Σ_{k=1}^{N} w_ik²) · √(Σ_{k=1}^{N} w_jk²))

wherein d_i and d_j respectively represent the ith and jth comments, N represents the total number of words, and w_ik represents the product of the word frequency TF and the inverse word frequency IDF of the kth word in the ith comment:

w_ik = TF_ik × IDF_k;
(5.3) calculating the similarity of any two comments with the improved vector space model, screening out identical or near-identical comments, and marking them as false comments, taken as negative example sample I;
(6) performing sentiment scoring on the unmarked comments from step (4) according to the BosonNLP sentiment dictionary data and the Hopkinson sentiment analysis word data, then judging sentiment polarity from the score: Score > 0 is judged positive and Score < 0 is judged negative;
marking as false comments those with positive sentiment polarity but a user score below the average judgment standard, or with negative sentiment polarity but a user score above the average judgment standard, taken as negative example sample II;
(7) sorting the users of the unmarked comments from step (4) in descending order of comment count, and marking all comments of the top 1% of users as false comments, taken as negative example sample III;
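Step (7) can be sketched as below; the patent does not state how the top-1% cutoff is rounded, so rounding up is assumed here so that at least one user is always selected:

```python
# Sketch of step (7): rank users by comment count and flag the comments
# of the top 1% most prolific users as false. Rounding up with ceil is
# an assumption; the patent gives no rounding rule.
from math import ceil
from collections import Counter

def flag_prolific_users(comment_authors, fraction=0.01):
    """Return the set of user ids in the top `fraction` by comment count."""
    counts = Counter(comment_authors)
    k = ceil(len(counts) * fraction)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:k])
```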
(8) forming positive example vectors from the positive samples of step (4) and negative example vectors from the negative samples of steps (5), (6), and (7); inputting the positive vectors into the BP neural network and iteratively modifying the inter-layer weights with forward and backward propagation so that the network outputs "1"; inputting the negative vectors and likewise iteratively modifying the weights so that the network outputs "0", thereby training the BP neural network;
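A minimal sketch of the BP network of steps (8) and (9), trained with forward and backward propagation so that positive vectors map toward "1" and negative vectors toward "0"; the layer sizes, learning rate, epoch count, and the absence of bias terms are illustrative simplifications, not values from the patent:

```python
# Sketch of step (8): one-hidden-layer BP network trained with plain
# gradient descent (sigmoid activations, squared error, no bias terms
# for brevity). All hyperparameters are illustrative.
import random
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

class BPNet:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(self, x):
        # forward propagation: input -> hidden -> single output unit
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x)))
                  for row in self.w1]
        self.o = sigmoid(sum(w * h for w, h in zip(self.w2, self.h)))
        return self.o

    def train(self, samples, lr=0.5, epochs=200):
        # backward propagation: output delta first, then hidden deltas
        # (computed with the old w2 before the weight update)
        for _ in range(epochs):
            for x, t in samples:
                o = self.forward(x)
                d_o = (o - t) * o * (1 - o)
                d_h = [d_o * w * h * (1 - h)
                       for w, h in zip(self.w2, self.h)]
                self.w2 = [w - lr * d_o * h
                           for w, h in zip(self.w2, self.h)]
                self.w1 = [[w - lr * dh * xi for w, xi in zip(row, x)]
                           for row, dh in zip(self.w1, d_h)]
```

As in step (9), a trained net's output can be thresholded at 0.5 to decide "1" (real) versus "0" (false).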
(9) inputting comments captured in real time into the trained BP neural network; if the network outputs "1" the comment is a real comment, and if it outputs "0" the comment is a false comment.
The objects of the invention are realized as follows:
The vector space model-based Chinese false comment filtering method of the invention judges the similarity between comments with an improved vector space model and takes highly similar comments as one part of the false comments. Another part of the false comments is screened out by combining each comment's sentiment polarity with its user score. A set of genuine comment samples is also introduced, and a BP neural network is trained with the two types of samples. The trained network is then used to judge unlabeled comments.
Meanwhile, the vector space model-based Chinese false comment filtering method also has the following beneficial effects:
(1) Positive and negative samples are combined to train the BP neural network, which improves the reliability of the training samples. The BP neural network is chosen because it handles both relatively large feature vectors and relatively large training sets, and is less limited than logistic regression or a support vector machine.
(2) The vectorization of the training samples integrates hidden influence factors such as the structure encoding, the vector space model, sentiment polarity, and comment time.
Drawings
FIG. 1 is a flow chart of a method for filtering false Chinese comments based on a vector space model according to the present invention;
Detailed Description
The following description of embodiments of the invention, with reference to the accompanying drawings, is provided so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of a method for filtering Chinese false comments based on a vector space model according to the present invention.
In this embodiment, as shown in FIG. 1, the vector space model-based Chinese false comment filtering method of the invention comprises the following steps:
S1, using Python to implement a simulated login to the website, and using regular expressions to capture each comment's publication time and text content, and the comment publisher's nickname, id, homepage address, and so on;
S2, removing comments shorter than the set comment length L; in this embodiment the threshold is set to 15, and comments shorter than 15 characters are rejected;
S3, segmenting the comments to obtain the sentence component structure;
S3.1, establishing an interference word bank containing function words such as connectives, subjects, and objects; calculating the proportion of interference words in each comment, comparing it with the preset proportion threshold of 50%, and rejecting comments whose proportion exceeds 50%;
S3.2, performing word segmentation on the comments obtained in step S3.1 with the Chinese Academy of Sciences NLPIR Chinese word segmentation Java tool, deleting punctuation, encoding each segmented comment by part of speech (nouns, verbs, adverbs, adjectives, and so on), and building a comment structure code library; if an identical code already exists in the library, the comment template's feature value is incremented by 1, otherwise nothing is modified;
The encoding process is illustrated by an example (presented as an image in the original patent), whose third line is the comment structure code;
S4, sorting the comments obtained in step S3 by the number of "useful" votes from users, selecting the top 5% as real comments, and marking them as positive samples;
s5, constructing an improved version vector space model by using the unmarked comments in the step S4
The vector space model (VSM) is the most commonly used similarity calculation model and is widely applied in natural language processing. The traditional vector space model follows the principle below:
Assume there are ten words in total, w_1, w_2, ..., w_10, and three comments, d_1, d_2, and d_3. The word frequency table obtained by statistics is shown in Table 1:
      w1   w2   w3   w4   w5   w6   w7   w8   w9   w10
d1     1    2    5    7    9
d2     3    4    6    8
d3    10   11   12   13   14   15

TABLE 1 (the assignment of counts to columns was lost in extraction)
The vector space formula commonly used is as follows:
sim(d_i, d_j) = (Σ_{k=1}^{N} a_ik · a_jk) / (√(Σ_{k=1}^{N} a_ik²) · √(Σ_{k=1}^{N} a_jk²))

wherein d_i and d_j respectively represent the ith and jth comments, N represents the total number of words, and a_ik represents the number of times the kth word appears in the ith comment.
Suppose the similarity of d_1 and d_2 is to be calculated; then:

sim(d_1, d_2) = (Σ_{k=1}^{10} a_1k · a_2k) / (√(Σ_{k=1}^{10} a_1k²) · √(Σ_{k=1}^{10} a_2k²)),

evaluated with the word counts of Table 1.
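The traditional similarity above can be sketched directly from raw count vectors; the counts in the test below are illustrative, since Table 1's cell alignment is ambiguous after extraction:

```python
# Sketch of the traditional vector-space similarity: cosine similarity
# of two raw word-count vectors a_ik, a_jk of equal length N.
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```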
the above formula is computationally intensive, and here a dimension reduction method is used to reduce the computational complexity. The adoption of the dimension reduction strategy can not only improve the efficiency, but also improve the precision. For example, the following two phrases:
1. this is my meal.
2. That is your meal.
If "this", "that", "your", "my", "is", and "of" are all treated as function words and removed, the similarity is 100%; if none is removed, the similarity may be only 60%, even though the two sentences express the same subject matter.
Using raw word counts directly also causes a problem when comparing a document with many words against one with few. For example, document I contains 10,000 words and word a appears 10 times in it, while document II contains 100 words and a appears 5 times. In the similarity calculation, a in document I then influences the final result more than a in document II. This is clearly unreasonable, since a accounts for only 0.1% of document I but 5% of document II.
To solve these problems, the two concepts of word frequency TF and inverse word frequency IDF are introduced. The specific method is as follows:
S5.1, performing word frequency TF and inverse word frequency IDF statistics on the unmarked comments from step S4:
TF = f/m, where the TF value lies between 0 and 1, f represents the number of occurrences of the current word in the current comment, and m represents the number of occurrences of the most frequent word in the current comment, which reduces errors caused by the unreasonable distribution of word frequencies within a comment;
IDF = log(n / n_w)

where n represents the total number of comments in the entire corpus and n_w represents the number of comments containing the current word, which reduces the similarity error caused by the uneven distribution of word frequencies across the corpus;
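The statistics of step S5.1 can be sketched as follows; a natural logarithm is assumed, since the patent does not specify the base:

```python
# Sketch of S5.1: TF divides each word's count f by the count m of the
# most frequent word in the comment; IDF is log(n / n_w), with n the
# corpus size and n_w the number of comments containing the word.
from math import log
from collections import Counter

def tf(comment_tokens):
    """{word: f/m} for one segmented comment."""
    counts = Counter(comment_tokens)
    m = max(counts.values())
    return {w: f / m for w, f in counts.items()}

def idf(corpus):
    """{word: log(n / n_w)} over a list of segmented comments."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: log(n / n_w) for w, n_w in df.items()}
```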
s5.2, constructing an improved version vector space model
sim(d_i, d_j) = (Σ_{k=1}^{N} w_ik · w_jk) / (√(Σ_{k=1}^{N} w_ik²) · √(Σ_{k=1}^{N} w_jk²))

wherein d_i and d_j respectively represent the ith and jth comments and w_ik represents the product of the word frequency TF and the inverse word frequency IDF of the kth word in the ith comment:

w_ik = TF_ik × IDF_k;
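The improved model of S5.2 and S5.3 then compares TF-IDF weight vectors instead of raw counts; the near-duplicate threshold of 0.9 below is an assumption, as the patent gives no concrete value:

```python
# Sketch of S5.2-S5.3: cosine similarity over {word: tf*idf} weight
# dicts, flagging comment pairs above a similarity threshold as the
# identical or near-identical comments of negative example sample I.
# The 0.9 threshold is an illustrative assumption.
from math import sqrt

def weighted_sim(wi, wj):
    """Cosine similarity of two {word: tf*idf} weight dicts."""
    shared = set(wi) & set(wj)
    dot = sum(wi[w] * wj[w] for w in shared)
    ni = sqrt(sum(v * v for v in wi.values()))
    nj = sqrt(sum(v * v for v in wj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def near_duplicates(weight_vectors, threshold=0.9):
    """Index pairs of comments flagged as identical or near-identical."""
    pairs = []
    for i in range(len(weight_vectors)):
        for j in range(i + 1, len(weight_vectors)):
            if weighted_sim(weight_vectors[i], weight_vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```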
s5.3, calculating the similarity of any two comments by using an improved version vector space model, screening out the same or similar comments, marking the same or similar comments as false comments, and marking the comments as a negative example sample I;
s6, according to BosonNLP emotion dictionary data and Hopkinson emotion analysis word data, carrying out emotion scoring on the comments which are not marked in the step S4, and then carrying out emotion polarity judgment according to emotion scores, wherein the judgment result is that Score >0 is positive, and Score <0 is negative;
Comparing the sentiment tendency with the user score: if the sentiment is positive but the score is below 3 stars (on a 5-star scale), i.e., positive sentiment polarity with a user score below the average judgment standard, or the sentiment is negative but the score is above 3 stars, i.e., negative sentiment polarity with a user score above the average judgment standard, the comment is marked as false and taken as negative example sample II;
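The mismatch rule of step S6 can be sketched as a small predicate; the 3-star midpoint on a 5-star scale follows the text above:

```python
# Sketch of step S6's rule: positive sentiment with a star rating below
# the 3-star midpoint, or negative sentiment above it, marks the comment
# as false (negative example sample II).

def is_mismatch(sentiment_score, star_rating, midpoint=3):
    """True when sentiment polarity contradicts the user's star rating."""
    if sentiment_score > 0 and star_rating < midpoint:
        return True
    if sentiment_score < 0 and star_rating > midpoint:
        return True
    return False
```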
s7, sorting the users in the step S4 according to the number of the comments of each user in a descending order, marking all the comments of the first 1% of users as false comments, and taking the false comments as a negative example sample III;
s8, respectively forming positive example vectors and negative example vectors by the positive example samples and the negative example samples obtained in the steps S4, S5, S6 and S7, wherein each comment forms one vector no matter whether the positive example sample or the negative example sample, then all the positive example vectors are input into the BP neural network, and through iteration, forward propagation and backward propagation are used for modifying the weight between each layer of the BP neural network, so that the BP neural network outputs '1'; inputting all negative example vectors into a BP neural network, modifying the weight between each layer of the BP neural network by using forward propagation and backward propagation through iteration, and enabling the BP neural network to output '0', thereby training the BP neural network;
s9, inputting the comments grabbed in real time into the trained BP neural network, wherein if the output of the BP neural network is '1', the comments are real comments; if the BP neural network output is "0", then the comment is a false comment.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims (1)

1. A Chinese false comment filtering method based on a vector space model is characterized by comprising the following steps
(1) Simulating website login and capturing comments;
(2) removing comments shorter than the set comment length L;
(3) dividing the comments into words to obtain a sentence component structure;
(3.1) firstly establishing an interference word bank comprising connecting words, subjects, and objects; then calculating the proportion of interference words in each comment, comparing it with a preset proportion threshold, and rejecting comments whose proportion exceeds the threshold;
(3.2) performing word segmentation on the comments obtained in step (3.1) with the Chinese Academy of Sciences NLPIR Chinese word segmentation Java tool, deleting punctuation, encoding each segmented comment by part of speech, and building a comment structure code library; if an identical code already exists in the library, the comment template's feature value is incremented by 1, otherwise nothing is modified;
(4) sorting the comments obtained in step (3) by the number of "useful" votes from users, selecting the top 5% as real comments, and marking them as positive samples;
(5) constructing an improved version vector space model by using the unmarked comments in the step (4);
(5.1) carrying out word frequency TF and inverse word frequency IDF statistics on the unmarked comments in the step (4);
TF = f/m, where the TF value lies between 0 and 1, f represents the number of occurrences of the current word in the current comment, and m represents the number of occurrences of the most frequent word in the current comment;
IDF = log(n / n_w)

where n represents the total number of comments in the entire corpus and n_w represents the number of comments containing the current word;
(5.2) constructing an improved version vector space model;
sim(d_i, d_j) = (Σ_{k=1}^{N} w_ik · w_jk) / (√(Σ_{k=1}^{N} w_ik²) · √(Σ_{k=1}^{N} w_jk²))

wherein d_i and d_j respectively represent the ith and jth comments, N represents the total number of words, and w_ik represents the product of the word frequency TF and the inverse word frequency IDF of the kth word in the ith comment:

w_ik = TF_ik × IDF_k;
(5.3) calculating the similarity of any two comments with the improved vector space model, screening out identical or near-identical comments, and marking them as false comments, taken as negative example sample I;
(6) performing sentiment scoring on the unmarked comments from step (4) according to the BosonNLP sentiment dictionary data and the Hopkinson sentiment analysis word data, then judging sentiment polarity from the score: Score > 0 is judged positive and Score < 0 is judged negative;
marking as false comments those with positive sentiment polarity but a user score below the average judgment standard, or with negative sentiment polarity but a user score above the average judgment standard, taken as negative example sample II;
(7) sorting the users of the unmarked comments from step (4) in descending order of comment count, and marking all comments of the top 1% of users as false comments, taken as negative example sample III;
(8) forming positive example vectors from the positive samples of step (4) and negative example vectors from the negative samples of steps (5), (6), and (7); inputting the positive vectors into the BP neural network and iteratively modifying the inter-layer weights with forward and backward propagation so that the network outputs "1"; inputting the negative vectors and likewise iteratively modifying the weights so that the network outputs "0", thereby training the BP neural network;
(9) inputting comments captured in real time into the trained BP neural network; if the network outputs "1" the comment is a real comment, and if it outputs "0" the comment is a false comment.
CN201711129611.6A 2017-11-15 2017-11-15 Vector space model-based Chinese false comment filtering method Active CN107818173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711129611.6A CN107818173B (en) 2017-11-15 2017-11-15 Vector space model-based Chinese false comment filtering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711129611.6A CN107818173B (en) 2017-11-15 2017-11-15 Vector space model-based Chinese false comment filtering method

Publications (2)

Publication Number Publication Date
CN107818173A CN107818173A (en) 2018-03-20
CN107818173B true CN107818173B (en) 2021-05-14

Family

ID=61609112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711129611.6A Active CN107818173B (en) 2017-11-15 2017-11-15 Vector space model-based Chinese false comment filtering method

Country Status (1)

Country Link
CN (1) CN107818173B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189922B (en) * 2018-08-07 2021-06-29 创新先进技术有限公司 Comment evaluation model training method and device
CN109670542A (en) * 2018-12-11 2019-04-23 田刚 A kind of false comment detection method based on comment external information
CN110941953B (en) * 2019-11-26 2023-08-01 华中师范大学 Automatic identification method and system for network false comments considering interpretability
CN116385029B (en) * 2023-04-20 2024-01-30 深圳市天下房仓科技有限公司 Hotel bill detection method, system, electronic equipment and storage medium


Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN102682120A (en) * 2012-05-15 2012-09-19 合一网络技术(北京)有限公司 Method,device and system for acquiring essential article commented on network
CN103745001A (en) * 2014-01-24 2014-04-23 福州大学 System for detecting reviewers of negative comments on products
CN107229608A (en) * 2016-03-23 2017-10-03 阿里巴巴集团控股有限公司 Comment spam recognition methods and device
CN106708966A (en) * 2016-11-29 2017-05-24 中国计量大学 Similarity calculation-based junk comment detection method
CN107025284A (en) * 2017-04-06 2017-08-08 中南大学 The recognition methods of network comment text emotion tendency and convolutional neural networks model

Non-Patent Citations (3)

Title
Opinion spam and analysis; Nitin Jindal et al.; In Proceedings of the International Conference on Web Search and Web Data Mining; 2008-12-31; pp. 219-230 *
Credibility analysis of virtual community comments from the perspective of text similarity; Xia Huosong et al.; Modern Information; 2011-09-15; pp. 33-36 *
Research on spam review identification methods for product reviews; Liu Lijia; China Masters' Theses Full-text Database; 2015-08-15; pp. 14-28 *

Also Published As

Publication number Publication date
CN107818173A (en) 2018-03-20

Similar Documents

Publication Publication Date Title
CN110175325B (en) Comment analysis method based on word vector and syntactic characteristics and visual interaction interface
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN106708966B (en) Junk comment detection method based on similarity calculation
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN106776713A (en) It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN108090099B (en) Text processing method and device
CN109101490B (en) Factual implicit emotion recognition method and system based on fusion feature representation
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN106446147A (en) Emotion analysis method based on structuring features
CN111538828A (en) Text emotion analysis method and device, computer device and readable storage medium
CN110705247A χ²-C-based text similarity calculation method
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
CN111462752A Client intention identification method based on attention mechanism, feature embedding and BiLSTM
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN110866087B (en) Entity-oriented text emotion analysis method based on topic model
CN107291686B (en) Method and system for identifying emotion identification
CN111191029B (en) AC construction method based on supervised learning and text classification
CN113159831A (en) Comment text sentiment analysis method based on improved capsule network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant