CN111428513A - False comment analysis method based on convolutional neural network - Google Patents

False comment analysis method based on convolutional neural network Download PDF

Info

Publication number
CN111428513A
CN111428513A CN202010393416.XA CN202010393416A CN111428513A CN 111428513 A CN111428513 A CN 111428513A CN 202010393416 A CN202010393416 A CN 202010393416A CN 111428513 A CN111428513 A CN 111428513A
Authority
CN
China
Prior art keywords
convolution
vector
layer
neural network
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010393416.XA
Other languages
Chinese (zh)
Inventor
宋丹
陆奎
吴杰胜
刘洋
戴旭凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202010393416.XA priority Critical patent/CN111428513A/en
Publication of CN111428513A publication Critical patent/CN111428513A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls

Abstract

The invention discloses a false comment analysis method based on a convolutional neural network, which combines a mainstream short-text feature mining model to identify false commodity reviews in the field of short-text processing. The method first selects a sample data set, taking 80% of the samples as a training set and 20% as a test set. The training set is input into a Word2Vec model to obtain the word vectors. After convolution computation and feature extraction, the model output achieves high precision, recall and F1-score, demonstrating the feasibility of applying convolutional neural network models to false comment identification in practice.

Description

False comment analysis method based on convolutional neural network
Technical Field
The invention relates to the field of machine learning, in particular to a false comment analysis method based on a convolutional neural network.
Background
With the development of the internet, online consumption has become part of daily life. People use comments as an important reference, and many consumption platforms provide user feedback mechanisms. However, driven by commercial interests, many merchants publish false comments; if consumers cannot judge correctly among the mass of comments, their consumption experience suffers, and the development of e-commerce platforms is also harmed.
Therefore, a method for accurately identifying false comments is urgently needed. Scholars have explored many methods, but the results are not ideal and the accuracy is low. Convolutional neural networks have good fault tolerance, parallel processing capability and self-learning capability; they can handle problems without clear background knowledge or explicit inference rules, and can accommodate sample data with large defects or distortions.
Disclosure of Invention
Aiming at problems such as the high cost and low accuracy of existing false comment detection methods, the invention provides a false comment analysis method based on a convolutional neural network. The invention adopts the following technical scheme to realize this purpose:
Step 1: collect comment documents and extract labels from the comments in them; randomly divide the collected comments into a training group and a test group in a ratio of 4:1, and test the identification performance of the network model on false comments using a cross-validation method.
Step 2: encode the sentence using the word vectors output by Word2Vec. The words in the comment are One-Hot encoded, i.e. each word is expressed as a vector of length L (L is the size of the word list), and the sentence is then expressed by these word vectors.
Embedding layer notation: y = w[x]
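For illustration only, the following is a minimal sketch of producing such word vectors with the gensim implementation of Word2Vec and performing the embedding lookup y = w[x]; gensim itself, the toy corpus and the 300-dimensional vector size are assumptions, not requirements of the invention.

    # Minimal sketch (assumptions: gensim is installed; the corpus is already tokenized).
    from gensim.models import Word2Vec
    import numpy as np

    tokenized_reviews = [
        ["this", "hotel", "was", "clean", "and", "quiet"],
        ["terrible", "service", "never", "again"],
    ]

    # Train Word2Vec on the review corpus (300-dimensional vectors assumed).
    w2v = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5,
                   min_count=1, workers=4)

    def encode(sentence):
        """Embedding lookup y = w[x]: map each known word to its vector."""
        return np.stack([w2v.wv[word] for word in sentence if word in w2v.wv])

    sentence_matrix = encode(["clean", "and", "quiet"])   # shape: (3, 300)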
Step 3: extract features of the sentence through convolution kernels of size h × k, where h is the number of words covered by one convolution (equivalent to the value of N in an N-gram model) and k is the dimensionality of the word vector. Because different convolution window sizes and different numbers of convolution kernels extract different semantic encodings of the text vector, several window sizes are tried: the number of convolution kernels is set to be consistent with the word vector dimension, and the convolution window sizes are set to different values. The resulting feature map has 1 column and (n − h + 1) rows, i.e. c = (c_1, c_2, ..., c_{n−h+1}), where c_i = f(w * x_{i:i+h−1} + b).
Here w is the convolution kernel of the corresponding size, b is the bias, * denotes the convolution operation, and f is the activation function.
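As an illustrative sketch of this convolution (numpy assumed; the random sentence matrix, the ReLU activation and the filter values below are placeholders):

    import numpy as np

    def conv_feature_map(X, w, b, f=lambda z: np.maximum(z, 0)):
        """Compute c_i = f(sum(w * X[i:i+h]) + b) over a sentence matrix X of shape
        (n, k) with a filter w of shape (h, k), giving a feature map of length n - h + 1."""
        n, _ = X.shape
        h = w.shape[0]
        return np.array([f(np.sum(w * X[i:i + h]) + b) for i in range(n - h + 1)])

    X = np.random.rand(10, 300)        # a 10-word sentence with 300-dim word vectors (assumed)
    w = np.random.rand(3, 300)         # convolution window covering h = 3 words
    c = conv_feature_map(X, w, b=0.1)  # feature map of length 10 - 3 + 1 = 8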
Step 4: the output of the previous layer is subjected to a pooling operation to reduce the number of parameters, specifically as follows:
First, features are extracted from the output of the convolutional layer: among the many feature values, a specific value (generally the strongest feature) is selected as the value retained by the pooling layer. The vector screened out by the pooling layer then undergoes a nonlinear transformation and is sent to a classifier for classification.
Pooling layer notation: max(C, axis=1)
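A minimal sketch of this max-over-time pooling (numpy assumed; C stacks the feature maps of step 3 row-wise, one row per convolution kernel):

    import numpy as np

    C = np.random.rand(300, 8)        # (num_kernels, n - h + 1), placeholder values
    pooled = np.max(C, axis=1)        # max(C, axis=1): keep the strongest feature per kernel
    print(pooled.shape)               # (300,)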
Step 5: the fully connected layer obtains the probability of each class using the softmax classifier, and the performance index results for false comment detection are finally obtained, specifically as follows:
softmax function formula:
P(y_i | x_i; w) = exp(w_{y_i} · x_i) / Σ_j exp(w_j · x_i)
Given the input x_i and the parameters w, this is the normalized probability assigned to the correct class label y_i. The softmax classifier assigns a larger probability to the correct class and smaller probabilities to the incorrect ones.
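A small numerical sketch of the softmax output (numpy assumed; the two scores and the class order [genuine, false] are illustrative assumptions):

    import numpy as np

    def softmax(scores):
        """Normalize fully connected layer scores into class probabilities."""
        e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
        return e / e.sum()

    probs = softmax(np.array([1.2, -0.4]))    # e.g. probabilities for [genuine, false]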
The invention has the following advantages:
1. The method can perform false comment analysis not only on Chinese text but can also analyze and classify English text;
2. By adopting cross-validation, the demand for data samples is reduced, more effective information is obtained from limited data, and over-fitting is reduced to a certain extent.
3. The word vectors output by Word2Vec are used as the embedding layer, which emphasizes the relevance between words and improves the accuracy of the detection result.
Drawings
FIG. 1 is a system flow chart of a convolutional neural network-based false comment analysis method
FIG. 2 is a schematic diagram of a structure based on a convolutional neural network model
Detailed Description
The invention is further illustrated by the following specific examples.
With reference to fig. 1 and fig. 2, a convolutional neural network-based false comment analysis method includes the following steps:
Step 1: collect commodity reviews and preprocess the data, filtering out invalid reviews (for example, sentences that are too short or duplicated), since invalid comments would affect the accuracy of the experimental results. Labels are added to the processed comment documents, and the comment samples are then randomly divided into a training group and a test group in a ratio of 4:1; a cross-validation method is used to test the identification performance of the network model on false comments. Specifically, the comment samples are randomly divided into five parts: one part is selected as the test group and the other four as the training group; after one round of testing, another part is selected as the test group and the remaining four as the training group, until all data samples have been traversed.
Step 2: encode the sentence using the word vectors output by Word2Vec. The words in the comment are One-Hot encoded, i.e. each word is expressed as a vector of length L (L is the size of the word list), and the sentence is then expressed by these word vectors.
For example, take the sentence "write code, change world".
The method is developed using word vectors; in general, the word vectors can be trained with a Google open-source word2vec program. Suppose the word vectors are "write" = (1.1, 2.1), "code" = (1.5, 2.9), "change" = (2.7, 3.1) and "world" = (2.9, 3.5). The word sequence "write code, change world" can then be rewritten as ((1.1, 2.1), (1.5, 2.9), (2.7, 3.1), (2.9, 3.5)): the original text sequence is a 4 × 1 vector, and the rewritten text can be represented as a 4 × 2 matrix.
In general, any text sequence can be represented as an m × d matrix, where m is the number of words in the sequence and d is the dimensionality of the word vectors.
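As a small illustration of this representation, using the example vectors quoted above (numpy assumed):

    import numpy as np

    word_vectors = {                      # the example 2-dimensional word vectors
        "write":  [1.1, 2.1],
        "code":   [1.5, 2.9],
        "change": [2.7, 3.1],
        "world":  [2.9, 3.5],
    }
    sentence = ["write", "code", "change", "world"]
    X = np.array([word_vectors[w] for w in sentence])   # shape (4, 2): m = 4 words, d = 2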
Step 3: the convolution operation first slides a window over the m × d input matrix. Assuming the window covers h = 2 words and slides with a stride of 1, each convolution is performed on a 2 × d window, and m − h + 1 = m − 1 results are obtained.
These results are then concatenated to form an (m − 1)-dimensional feature vector, where c_i = f(w * x_{i:i+h−1} + b).
Here x_{i:i+h−1} is the h × d sliding window matrix formed by rows i through i + h − 1 of the input matrix; w is a filter, also of size h × d; b is a bias parameter; f is a non-linear activation function; and c_i is a scalar.
Finally, different values of h are set in the convolutional layer to generate several different filters. The convolution operation of each filter is equivalent to extracting one feature, so different features can be extracted through filters with different sliding window sizes.
Step 4: max pooling is adopted. The largest value in the feature vector generated by each filter of the previous layer is selected as a feature value, and all these values form the feature vector of the pooling layer. This vector is finally sent to the fully connected layer, and a Softmax function is used to classify each comment vector to obtain the false comment result.
Step 5: compare the classification result obtained by the Softmax function with the data set label. If they are the same, the parameters are kept unchanged; if they differ, the parameters are adjusted and training is repeated until the loss function converges.
Example 1
This experiment verifies the invention using the gold-standard deceptive hotel review data set collected by Myle Ott et al., which is divided into two categories, positive and negative comments, each of which is further divided into true and false comments. Each of the four types contains an equal number of samples (400), giving a total of 1600 hotel reviews as data samples, and each data sample carries these two characteristic labels from the outset.
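Purely as an illustration of preparing such a corpus for the experiment below, the following sketch loads a four-category review collection, performs the 4:1 split and sets up 5-fold cross-validation; the directory names, label coding and scikit-learn usage are assumptions, not details given in the patent or in the Ott et al. corpus specification.

    import os
    import numpy as np
    from sklearn.model_selection import train_test_split, KFold

    CATEGORIES = {                       # assumed folder names; 1 = false comment, 0 = genuine
        "positive_truthful": 0, "positive_deceptive": 1,
        "negative_truthful": 0, "negative_deceptive": 1,
    }

    def load_reviews(root):
        texts, labels = [], []
        for folder, label in CATEGORIES.items():
            for name in sorted(os.listdir(os.path.join(root, folder))):
                with open(os.path.join(root, folder, name), encoding="utf-8") as fh:
                    texts.append(fh.read())
                    labels.append(label)
        return np.array(texts), np.array(labels)

    # texts, labels = load_reviews("review_corpus")            # 1600 reviews, 400 per category
    # X_train, X_test, y_train, y_test = train_test_split(     # 80% training / 20% test
    #     texts, labels, test_size=0.2, random_state=42)
    # for tr_idx, te_idx in KFold(n_splits=5, shuffle=True).split(texts):   # 5-fold CV
    #     ...  # train on texts[tr_idx], evaluate on texts[te_idx]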
A false comment analysis method based on a convolutional neural network comprises the following steps:
Step 1 is executed: first, the sample data set is loaded from the designated path. Because the number of training samples has a direct influence on the experimental result, the experiment randomly divides the data set, with 80% of the samples used as the training set to train the data model and 20% of the samples used as the test set to validate it.
Step 2 is executed: the sample data set is input into a Word2Vec model, so that each data sample is represented by word vectors. The processed word vectors are sent into the convolutional layer; convolution windows of different sizes are tried in the experiment, and the number of convolution kernels is finally set to 300, consistent with the word vector dimension.
(1) The convolution window sizes are set to 1 through 5, the number of convolution kernels for each window length is set to 300, and the stride is set to 1; each convolution kernel then performs a sliding convolution over the sentence vectors.
(2) After process (1) is repeated, 300 feature vectors are extracted for each convolution window length, and the feature vectors obtained with different convolution window lengths differ.
Step 3 is executed: in the pooling layer, the maximum value of the feature vector obtained for each convolution window length is taken. The multiple convolution kernels of the convolutional layer yield multiple feature maps, which the pooling layer reduces to multiple one-dimensional vectors; these are spliced together to form the input of the fully connected layer, where the output vectors of the previous layer are weighted and summed. The output layer uses dropout to prevent the model from over-fitting, and finally the class with the largest value among the output results is taken as the prediction result of the model.
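As an illustrative sketch of the network described in this example (TensorFlow/Keras is assumed; the vocabulary size, sequence length, dropout rate and optimizer are assumptions, while the window sizes 1-5, the 300 kernels per size, the max pooling, dropout and two-class softmax output follow the text above):

    from tensorflow.keras import layers, Model

    vocab_size, seq_len, embed_dim = 20000, 200, 300   # assumed sizes

    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)   # Word2Vec weights could be loaded here

    # One branch per convolution window length 1..5, 300 kernels each, stride 1.
    branches = []
    for h in range(1, 6):
        c = layers.Conv1D(filters=300, kernel_size=h, strides=1, activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(c))    # max-over-time pooling

    merged = layers.Concatenate()(branches)
    merged = layers.Dropout(0.5)(merged)                     # dropout against over-fitting
    outputs = layers.Dense(2, activation="softmax")(merged)  # genuine vs. false comment

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5)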
After completing steps 2-3 and fixing the model parameters, the model is verified using the test set; the results are shown in the table below.
Table 1. Test results (the precision, recall and F1-score values are presented as an image in the original publication and are not reproduced here)
The test results on the test set samples show high precision, recall and F1-score, sufficient for most detection tasks.
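For reference, precision, recall and F1-score of the kind reported in Table 1 can be computed, for example, with scikit-learn (an assumption; y_test and y_pred below are placeholders for the test labels and the model predictions):

    from sklearn.metrics import classification_report

    y_test = [0, 1, 1, 0, 1]          # placeholder ground-truth labels (1 = false comment)
    y_pred = [0, 1, 0, 0, 1]          # placeholder model predictions
    print(classification_report(y_test, y_pred,
                                target_names=["genuine", "false"],
                                digits=4))   # precision, recall, f1-score per class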
The above description is provided to help those skilled in the art understand the present invention; it is not intended to limit the scope of the invention, which is defined by the appended claims and their equivalents and may be applied, directly or indirectly, in other fields without departing from that scope.

Claims (2)

1. A false comment analysis method based on a convolutional neural network is characterized by comprising the following steps:
Step 1: collect comment documents and extract labels from the comments in them; randomly divide the collected comments into a training group and a test group in a ratio of 4:1, and test the identification performance of the network model on false comments using a cross-validation method.
Step 2: encode the sentences using the word vectors output by Word2Vec (in general, the word vectors can be trained with a Google open-source word2vec program). The words are One-Hot encoded, i.e. each word is represented as a vector of length L (L is the size of the word list), and the sentence is then expressed by these word vectors.
The embedding layer word vector output is: y = w[x]
Step 3: extract features of the sentence through convolution kernels of size h × k, where h is the number of words covered by one convolution (equivalent to the value of N in an N-gram model) and k is the dimensionality of the word vector. Because different convolution window sizes and different numbers of convolution kernels extract different semantic encodings of the text vector, several window sizes are tried: the number of convolution kernels is set to be consistent with the word vector dimension, and the convolution window sizes are set to different values. The feature map has 1 column and (n − h + 1) rows, i.e. c = (c_1, c_2, ..., c_{n−h+1}),
where c_i = f(w * x_{i:i+h−1} + b).
Here w is the convolution kernel of the corresponding size, b is the bias, * denotes the convolution operation, and f is the activation function.
Step 4: the output of the previous layer is subjected to a pooling operation to reduce the number of parameters, specifically as follows:
First, features are extracted from the output of the convolutional layer: among the many feature values, a specific value (generally the strongest feature) is selected as the value retained by the pooling layer. The vector screened out by the pooling layer then undergoes a nonlinear transformation and is sent to a classifier for classification.
The pooling layer output is: max(C, axis=1)
Step 5: the fully connected layer obtains the probability of each class using the softmax classifier, and the performance index results for false comment detection are finally obtained, specifically as follows:
softmax function formula:
P(y_i | x_i; w) = exp(w_{y_i} · x_i) / Σ_j exp(w_j · x_i)
Given the input x_i and the parameters w, this is the normalized probability assigned to the correct class label y_i. The softmax classifier assigns a larger probability to the correct class and smaller probabilities to the incorrect ones.
2. The convolutional neural network-based false comment analysis method of claim 1, characterized in that: when word vectors are obtained using a Word2Vec model, a three-layer neural network is used to construct a probabilistic language model that converts text into numeric vectors; the probability of the next word is predicted from the currently known first n−1 words, and the first n−1 word vectors are concatenated end to end to serve as the input layer.
CN202010393416.XA 2020-05-11 2020-05-11 False comment analysis method based on convolutional neural network Pending CN111428513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010393416.XA CN111428513A (en) 2020-05-11 2020-05-11 False comment analysis method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010393416.XA CN111428513A (en) 2020-05-11 2020-05-11 False comment analysis method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN111428513A true CN111428513A (en) 2020-07-17

Family

ID=71555131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010393416.XA Pending CN111428513A (en) 2020-05-11 2020-05-11 False comment analysis method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111428513A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307755A (en) * 2020-09-28 2021-02-02 天津大学 Multi-feature and deep learning-based spam comment identification method
CN112463966A (en) * 2020-12-08 2021-03-09 北京邮电大学 False comment detection model training method, detection method and device
CN112905739A (en) * 2021-02-05 2021-06-04 北京邮电大学 False comment detection model training method, detection method and electronic equipment
CN114492423A (en) * 2021-12-28 2022-05-13 广州大学 False comment detection method, system and medium based on feature fusion and screening

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345587A (en) * 2018-02-14 2018-07-31 广州大学 A kind of the authenticity detection method and system of comment
CN109670542A (en) * 2018-12-11 2019-04-23 田刚 A kind of false comment detection method based on comment external information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345587A (en) * 2018-02-14 2018-07-31 广州大学 A kind of the authenticity detection method and system of comment
CN109670542A (en) * 2018-12-11 2019-04-23 田刚 A kind of false comment detection method based on comment external information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李静 [Li Jing]: "基于卷积神经网络的虚假评论的识别" [Identification of false comments based on convolutional neural networks], 《软件》 [Software] *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307755A (en) * 2020-09-28 2021-02-02 天津大学 Multi-feature and deep learning-based spam comment identification method
CN112463966A (en) * 2020-12-08 2021-03-09 北京邮电大学 False comment detection model training method, detection method and device
CN112463966B (en) * 2020-12-08 2024-04-05 北京邮电大学 False comment detection model training method, false comment detection model training method and false comment detection model training device
CN112905739A (en) * 2021-02-05 2021-06-04 北京邮电大学 False comment detection model training method, detection method and electronic equipment
CN114492423A (en) * 2021-12-28 2022-05-13 广州大学 False comment detection method, system and medium based on feature fusion and screening
CN114492423B (en) * 2021-12-28 2022-10-18 广州大学 False comment detection method, system and medium based on feature fusion and screening

Similar Documents

Publication Publication Date Title
CN109543084B (en) Method for establishing detection model of hidden sensitive text facing network social media
CN111428513A (en) False comment analysis method based on convolutional neural network
CN112364638B (en) Personality identification method based on social text
WO2022116536A1 (en) Information service providing method and apparatus, electronic device, and storage medium
CN110309867B (en) Mixed gas identification method based on convolutional neural network
CN110516074B (en) Website theme classification method and device based on deep learning
CN109492230B (en) Method for extracting insurance contract key information based on interested text field convolutional neural network
CN112732916A (en) BERT-based multi-feature fusion fuzzy text classification model
CN111475622A (en) Text classification method, device, terminal and storage medium
CN111966825A (en) Power grid equipment defect text classification method based on machine learning
CN112966068A (en) Resume identification method and device based on webpage information
CN112667782A (en) Text classification method, device, equipment and storage medium
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
WO2023159756A1 (en) Price data processing method and apparatus, electronic device, and storage medium
CN115712740A (en) Method and system for multi-modal implication enhanced image text retrieval
CN113377844A (en) Dialogue type data fuzzy retrieval method and device facing large relational database
CN112347252A (en) Interpretability analysis method based on CNN text classification model
CN115759095A (en) Named entity recognition method and device for tobacco plant diseases and insect pests
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN111046934B (en) SWIFT message soft clause recognition method and device
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN113297376A (en) Legal case risk point identification method and system based on meta-learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200717

WD01 Invention patent application deemed withdrawn after publication