AU2020100710A4 - A method for sentiment analysis of film reviews based on deep learning and natural language processing - Google Patents
A method for sentiment analysis of film reviews based on deep learning and natural language processing Download PDFInfo
- Publication number
- AU2020100710A4
- Authority
- AU
- Australia
- Prior art keywords
- deep learning
- film reviews
- layer
- reviews
- film
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
- 238000012552 review Methods 0.000 title claims abstract description 52
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000004458 analytical method Methods 0.000 title claims abstract description 18
- 238000013135 deep learning Methods 0.000 title claims abstract description 14
- 238000003058 natural language processing Methods 0.000 title claims abstract description 10
- 238000013528 artificial neural network Methods 0.000 claims abstract description 32
- 238000011176 pooling Methods 0.000 claims abstract description 32
- 238000004364 calculation method Methods 0.000 claims abstract description 12
- 238000012360 testing method Methods 0.000 claims abstract description 12
- 238000013136 deep learning model Methods 0.000 claims abstract description 8
- 238000012549 training Methods 0.000 abstract description 35
- 230000008451 emotion Effects 0.000 abstract description 6
- 238000007781 pre-processing Methods 0.000 abstract description 2
- 238000001514 detection method Methods 0.000 abstract 1
- 239000010410 layer Substances 0.000 description 111
- 230000006870 function Effects 0.000 description 41
- 238000013527 convolutional neural network Methods 0.000 description 25
- 239000011159 matrix material Substances 0.000 description 18
- 230000008569 process Effects 0.000 description 18
- 239000013598 vector Substances 0.000 description 10
- 230000004913 activation Effects 0.000 description 9
- 238000012545 processing Methods 0.000 description 9
- 238000000605 extraction Methods 0.000 description 8
- 230000002996 emotional effect Effects 0.000 description 7
- 238000005457 optimization Methods 0.000 description 7
- 238000010801 machine learning Methods 0.000 description 6
- 210000002569 neuron Anatomy 0.000 description 6
- 230000008901 benefit Effects 0.000 description 5
- 230000008859 change Effects 0.000 description 4
- 238000013461 design Methods 0.000 description 3
- 238000013507 mapping Methods 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008909 emotion recognition Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 239000002356 single layer Substances 0.000 description 1
- 238000012706 support-vector machine Methods 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Image Analysis (AREA)
Abstract
A method for sentiment analysis of film reviews based on deep learning and natural language processing is disclosed. The method for analyzing emotions of film reviews by deep learning includes: getting film review text data and marking positive and negative emotions in the film reviews; preprocessing the film reviews by removing redundant information; vectorizing the film review text according to the bag-of-words model; splitting the vectorized film reviews into training sets and test sets; setting up the initial deep learning model of film review sentiment analysis, which connects and integrates four convolutional neural network layers, two pooling layers, and two fully connected layers; training the initial deep learning model on the training data set to generate the final deep learning model; and using the final deep learning model to detect the film review test set and output the detection results. The invention can accurately distinguish positive and negative emotions of film reviews, and the deep learning model has a simple structure and a small amount of calculation, thereby improving the speed of emotion analysis of film reviews.
[Figure 1: the data-cleaning pipeline from a raw review (remove HTML, keep letters only, lowercase and split, remove stop words, join the meaningful words) to a clean review. Figure 2: a bag-of-words example built from the sentences "John likes to watch movies. Mary likes too. John also likes to watch football game.", giving a count vector such as [1, 1, 1, 1, 0, 1, 1, 1, 0, 0].]
Description
DESCRIPTION
TITLE
A method for sentiment analysis of film reviews based on deep learning and natural language processing
FIELD OF THE INVENTION
The present invention relates to sentiment analysis of film reviews, and more particularly to sentiment analysis of film reviews based on convolutional neural networks and natural language processing, which can distinguish positive and negative attitudes of film reviews accurately and quickly.
BACKGROUND OF THE INVENTION
With the continuous development of the Internet, more and more people express their views online about commodities, films and hot events. These comments are very important and meaningful. For e-commerce platforms, analyzing the emotional tendency of customers' evaluations of specific products can help them understand customer preferences, and thus improve service and enhance customer satisfaction. For enterprises, mining online comments can help them understand consumers' evaluations of their products and therefore optimize their products. From the perspective of the government, comments on the Internet involve people's attitudes and views on hot issues and national policies, so the government can better respond to public opinions and thus make good changes.
Our project is mainly about film reviews, one of the hottest types of Internet comments. Film reviews have many characteristics, such as short length, random content, rich semantic diversity, and a large number of emoji. Reading movie reviews lets the audience gain a clear view and comprehensive understanding of the movie they may choose to watch. For cinemas, checking film reviews allows them to explore a movie's reputation and repercussions, so as to adjust the arrangement and publicity for the movie and maximize their benefits.
Since we mainly focus on the emotional tendency of the comment text, we use sentiment analysis. Sentiment analysis, also known as emotional tendency analysis, is a direction of natural language processing (NLP) which aims to analyze the positive or negative aspects of a text description.
There used to be two kinds of sentiment classification technology: emotion dictionaries and machine learning. The former mainly uses emotional words to judge the emotional tendency of the text, and therefore requires constructing an emotional dictionary manually. The latter (such as Naive Bayes, maximum entropy classification, and support vector machines) mainly uses machine learning algorithms to train a statistical language model, and uses the trained classifier to classify the emotion of the text. This method can take the contextual semantic information into account, and it is more accurate in judging the emotional tendency of the sentence as a whole. However, the common algorithms rely on manual extraction of features, which requires considerable expert experience.
Recently, deep learning techniques (such as recurrent neural networks and convolutional neural networks) for sentiment analysis have become very popular. Compared with approaches based on manually extracted features, these methods provide automatic feature extraction and better performance. Among various deep learning models evaluated on multiple sentiment classification datasets, researchers have found that convolutional neural networks achieve results equal to or better than other methods. Our project is mainly about a method for sentiment analysis of film reviews based on deep learning and natural language processing.
SUMMARY OF THE INVENTION
In order to solve the shortcomings and problems of the above methods, the present invention proposes a method for sentiment analysis of film reviews based on deep learning and natural language processing, which consists of several multi-layer convolutional neural networks and fully connected neural networks connected in series. This method can give full play to the advantages of the automatic feature learning of deep learning, and effectively solve the above-mentioned problems such as difficulty in extracting comment features and low accuracy of automatic real-time learning.
The technical scheme of this patent is as follows:
This is a method for sentiment analysis of film reviews based on deep learning and natural language processing, including Part 1: data acquisition; Part 2: data processing; Part 3: deep learning structure design; Part 4: model training and optimization; Part 5: real-time movie review sentiment recognition.
Part 1: data acquisition: we crawl relevant movie reviews for various types of movies from major video websites.
Part 2: data processing:
A. Remove HTML tags: use the BeautifulSoup library in bs4 to remove HTML tags.
B. Lower case: the lower() function of strings is used to convert text to lower case.
C. Remove stop words: in information retrieval, to save storage space and improve search efficiency, certain words are automatically filtered out before or after processing natural language data (or text); these are called stop words. We use the stopwords class from the nltk library to remove stop words.
D. Remove non-character data: use Python's re library to remove non-character data through regular expression string matching. The A to D process mentioned above is shown in Figure 1.
E. Establish a bag-of-words model: put all the words in one bag, regardless of their morphology and word order, so that each word is independent. Build a dictionary for mapping matches. Each sentence can then be represented by a vector whose subscripts match the subscripts of the mapping dictionary and whose values are the number of times each word appears in the sentence. The process of building a bag-of-words model is shown in Figure 2.
F. One-hot encoding: one-hot encoding, also known as one-bit effective encoding, mainly uses an N-bit status register to encode N states; each state has its own register bit, and only one bit is valid at any time. Its purpose is to transform categorical variables into a form that machine learning algorithms can easily use.
G. Train/test split: the train_test_split() function of the model_selection module in Python's sklearn library is used to split the data into a training set and a test set.
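A minimal sketch of steps E to G using scikit-learn; the example reviews, labels and the 30% test fraction are illustrative placeholders, not values fixed by the method:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Example cleaned reviews (step E input); labels: 1 = positive, 0 = negative.
reviews = ["john likes watch movies mary likes", "film boring waste time"]
labels = [1, 0]

# Step E: bag-of-words model; each review becomes a word-count vector.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews).toarray()

# Step F: one-hot encode the sentiment labels (column 0 = negative, column 1 = positive).
one_hot = [[1, 0] if y == 0 else [0, 1] for y in labels]

# Step G: split into a training set and a test set.
x_train, x_test, y_train, y_test = train_test_split(features, one_hot, test_size=0.3)
```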
Part 3: deep learning structure design:
Before the introduction of neural networks, the old-fashioned process was: raw data -> hand-crafted feature extraction -> algorithm -> results. After the introduction of convolutional neural networks, the process becomes: raw data -> convolutional network algorithm -> results. What convolution has to solve is automatic feature extraction.
It can be seen from Fig. 3 and Fig. 13 that the convolutional neural network is very similar in structure to the fully connected neural network.
The convolutional neural network is also organized through layers of nodes. Like a fully connected neural network, each node in a convolutional neural network is a neuron. In a fully-connected neural network, the nodes between every two adjacent layers are connected by edges, so the nodes in each fully-connected layer are generally organized into a column, which is convenient for displaying the connection structure. In a convolutional neural network, only some nodes between two adjacent layers are connected. In order to show the dimensions of each layer of neurons, the nodes of each convolutional layer are generally organized into a three-dimensional matrix.
In addition to the similar structure, the input, output and training process of a convolutional neural network are basically the same as those of a fully connected neural network. Taking image classification as an example, the input layer of the convolutional neural network is the original pixels of the image, and each node in the output layer represents the credibility of a different class. This is consistent with the input and output of a fully connected neural network. The loss function and the parameter optimization process are also applicable to convolutional neural networks. The process of training a convolutional neural network in TensorFlow is no different from training a fully connected neural network. The only difference between a convolutional neural network and a fully connected neural network is the connection between two adjacent layers in the neural network.
The biggest problem with using fully connected neural networks to process images is that there are too many parameters in the fully connected layer. In addition to slowing down the calculation, increasing parameters can easily lead to overfitting problems. Therefore, a more reasonable neural network structure is needed to effectively reduce the number of parameters in the neural network. Convolutional neural networks can achieve this goal.
Fully connected network: fully connected means that nodes within the same layer are not connected to each other, while the nodes of each layer are connected to all nodes of the previous layer and the next layer. The fully connected property allows each layer to be represented by a matrix, and operations from each layer to the next can be performed in parallel using matrix operations. Composition: input layer, activation function, fully connected layer.
Convolutional neural network: Composition: input layer, convolutional layer, Rectified Linear Units layer (ReLU layer), pooling layer, fully-connected layer.
The input layer size depends on the input data which is the input of the entire neural network.
Convolutional layer: As you can see from the name, the Convolutional layer is the most important part of a convolutional neural network. Each convolutional layer in a convolutional neural network consists of several convolutional units. The parameters of each convolutional unit are obtained through optimization of the back-propagation algorithm. The input of each node in the convolutional layer is only a small block of the previous layer of the neural network. The commonly used size of this small block is 3 x 3 or 5 x 5. The purpose of the convolution operation is to extract different features of the input. The first layer of convolutional layers can usually only extract some low-level features such as edges and lines. Higher-level networks can iteratively extract more complex features from low-level features. In general, the node matrix processed by the convolutional layer will become deeper.
To better understand this layer, first define a few symbols:
W: the size of the input unit, usually expressed by the width or height of the input unit
F: receptive field, which refers to the size of the area on the input image to which the pixels on the feature map output by the convolutional neural network are mapped
S: stride, which controls the distance between two adjacent hidden units at the same depth and the input area connected to them. If the stride is small, there will be a lot of overlap in the input areas of adjacent hidden units; the overlap is reduced if the stride is large.
P: zero-padding. We can change the overall size of the input unit by padding zeros around the input unit to control the size of the output unit.
K: depth, which controls the depth of the output unit, that is, the number of filters and the number of neurons connected to the same area.
Then we can use formula (1) to calculate how many hidden units there are in an output unit along one dimension:
(W - F + 2P) / S + 1    (1)
(W: the size of the input unit, F: receptive field, S: stride, P: zero-padding.) Then we introduce the weight-sharing principle. The so-called weight sharing is to take an input image and use one filter to scan this image. The numbers in the filter are called weights; sharing them can greatly reduce the number of weight parameters and simplify the network structure. The embodiment of this principle in convolutional neural networks is that, for an input matrix, only one filter is used for scanning, and there is no need to define different filters for each position of the matrix.
On the convolutional layer, if the input layer size is W1 * H1 * D1, four additional parameters need to be given: the number of filters (K), the size of the filter, that is, the receptive field (F), the stride (S), and the amount of zero padding (P). The output is a three-dimensional unit W2 * H2 * D2, where
W2 = (W1 - F + 2P) / S + 1    (2)
H2 = (H1 - F + 2P) / S + 1    (3)
D2 = K    (4)
For example:
Calculation process: Figures 4 and 5 describe the basic process of the calculation. In the first step, a sliding window of the same size as the filter is placed on the input matrix, and the part of the input matrix inside the sliding window is multiplied element-wise with the corresponding positions of the filter matrix. The second step is to sum the results produced by the three matrices and add the bias term; then the window slides by the stride and the above operation is repeated. Figures 6, 7, 10, and 11 correspond to convolutional layers 1-4.
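As a worked illustration of formula (1) (this helper is a sketch, not part of the claimed method), a 64-wide input with a 3 x 3 filter, stride 1 and one pixel of zero padding, which matches the layer sizes used later in the embodiment, yields an output width of 64:

```python
def conv_output_size(w, f, s, p):
    """Number of hidden units along one dimension: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# 64-wide input, 3x3 filter, stride 1, zero-padding 1 ("SAME" padding).
print(conv_output_size(w=64, f=3, s=1, p=1))  # -> 64
```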
Rectified Linear Units layer (ReLU layer): the activation function of the neurons in this layer uses the ReLU function.
An activation function is a function that runs on the neurons of an artificial neural network and is responsible for mapping the input of the neuron to the output. The desirable properties of an activation function are: (1) Non-linearity. A linear activation layer has no effect on a deep neural network, because its role would still be a series of linear transformations of the input. (2) Continuous differentiability. This is a requirement for gradient descent methods. (3) The range is preferably not saturated. When there is a saturated interval segment, if the optimization drives the system into this segment, the gradient is approximately 0 and the learning of the network will stop. (4) Monotonicity. When the activation function is monotonic, the error function of a single-layer neural network is convex, which is easy to optimize. (5) It is approximately linear at the origin, so when the weights are initialized to random values close to 0, the network can learn faster without adjusting the initial values of the network. The commonly used activation functions have only some of the above-mentioned properties, and none of them has all of them.
The most commonly used activation function is the ReLU function: f(x) = max(0, x)    (5)
Pooling layer: the pooling layer does not change the depth of the three-dimensional matrix, but it can reduce the size of the matrix. Features with large dimensions are usually obtained after the convolutional layer. The pooling layer divides the features into several regions and takes the maximum or average value to obtain new, smaller-dimensional features, which reduces the number of nodes in the final fully connected layer and thereby reduces the number of parameters in the entire neural network. By compressing the input feature maps, on the one hand, the feature maps are made smaller, simplifying the computational complexity of the network; on the other hand, feature compression is performed to extract the main features. There are two types of pooling operations: one is average pooling and the other is max pooling. For example:
Using a 2 * 2 filter with a stride of 2, max pooling finds the maximum value in each area and extracts the main features from the original feature map to obtain the result on the right, as shown in Figure 8.
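A small sketch of 2 * 2 max pooling with stride 2 using NumPy; the feature-map values are made up for illustration:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 7],
                        [8, 2, 0, 1],
                        [3, 9, 4, 6]])

# 2x2 max pooling with stride 2: take the maximum of each non-overlapping 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 7]
               #  [9 6]]
```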
In convolutional neural networks, we often encounter pooling operations, and the pooling layer often follows the convolutional layer. Through pooling, the feature vectors output by the convolutional layer are reduced, and the results are improved (less prone to overfitting). Figures 9 and 12 correspond to pooling layer 1 and pooling layer 2, respectively.
Fully connected layer: after multiple rounds of convolutional and pooling layers, the final classification result of the convolutional neural network is usually given by 1 or 2 fully connected layers. After several rounds of processing by the convolutional and pooling layers, it can be considered that the information has been abstracted into features with higher information content. We can regard the convolutional layers and pooling layers as a process of automatic feature extraction. After the feature extraction is completed, fully connected layers still need to be used to complete the classification task.
In general, the fully connected layer combines all local features into global features, which are used to calculate the score of each final category. Figure 14 shows the structure of the fully connected layers.
Part 4: model training and optimization: the algorithms and ideas used in this part are: the concept of batching, dropout, Adam, learning rate decay, the number of iterations, L2 regularization loss, and weight initialization.
Batching: in the process of model training, due to large data sets and other reasons, it is often impossible to read all the data at one time. To overcome this, the concept of batching was introduced: the data are trained or tested in batches to reduce memory usage and improve training speed. If the batch size is too small, the system performs I/O frequently, resulting in low training efficiency; if it is too large, the computer cannot load that much data into memory and the application throws exceptions.
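A minimal batching sketch (batch size 64, as used later in the embodiment); the array names and the commented-out training step are placeholders:

```python
def iterate_batches(samples, labels, batch_size=64):
    """Yield successive mini-batches instead of loading everything at once."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size], labels[start:start + batch_size]

# for x_batch, y_batch in iterate_batches(train_x, train_y):
#     train_step(x_batch, y_batch)   # hypothetical per-batch training step
```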
Dropout: in a machine learning model, if the model has too many parameters and too few training samples, the trained model is prone to overfitting. Overfitting problems are often encountered when training neural networks. Overfitting manifests as follows: the model has a small loss function on the training data and a high prediction accuracy, but on the test data the loss function is large and the prediction accuracy is low. Overfitting is a common problem in machine learning, and an overfitted model is almost useless. In order to solve the problem of overfitting, a model ensemble method is generally adopted, that is, multiple models are trained and combined. This takes a lot of time: not only does it take time to train multiple models, it also takes time to test them. In summary, when training deep neural networks, there are always two major disadvantages: (1) they overfit easily and (2) they are time-consuming. Dropout can effectively alleviate the occurrence of overfitting and, to a certain extent, has a regularization effect. Dropout means that during the training process of a deep learning network, some neural network units are temporarily discarded from the network with a certain probability, which is equivalent to finding a thinner network within the original network.
Adam: the Adam optimization algorithm is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data. The Adam algorithm differs from traditional stochastic gradient descent: stochastic gradient descent maintains a single learning rate (i.e., alpha) to update all weights, and the learning rate does not change during the training process, whereas Adam designs independent adaptive learning rates for different parameters by calculating the first and second moment estimates of the gradient, and combines the advantages of two extensions of stochastic gradient descent, AdaGrad and RMSProp.
Learning rate decay: if the learning rate is too large, the training speed will be improved, but the accuracy of the result will be insufficient, and it may also lead to a situation where the training oscillates and cannot converge. If the learning rate is too small, the accuracy will improve, but training is slow and takes more time. So we can use a decaying learning rate, also known as learning rate decay. Its role is to attenuate the value of the learning rate during the training process; after the training reaches a certain level, a small learning rate is used to improve the accuracy.
Selection of the number of iterations: increasing the number of iterations while other parameters are constant can usually improve the accuracy, but it reduces the training speed, so we need to choose a reasonable number of iterations.
L2 regularization loss: the process of training a machine learning model is to minimize the error between the model's output and the actually collected results. Therefore, we also have an indicator for measuring errors, called the loss function (l_d stands for the variation, or data error, and l_mc stands for the model complexity): min(l_d + l_mc)    (6)
After adding the L2 regularization term, the linear regression loss function becomes:
loss = ||ω||^2 + Σ_{n=1}^{N} (ω^T x_n - y_n)^2    (7)
(ω is the weight matrix, x is the input value, y is the actual value, and T denotes the matrix transpose.)
The purpose is to minimize both the error and the model complexity. The smaller the error, the better the model fits; the smaller the model complexity, the simpler the calculation and the stronger the generalization ability.
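A short numeric sketch of formula (7); the weights, inputs and target values below are made up purely for illustration:

```python
import numpy as np

w = np.array([0.5, -0.2])        # weight vector
x = np.array([[1.0, 2.0],        # inputs, one row per sample
              [3.0, 1.0]])
y = np.array([0.2, 1.0])         # actual values

l2_penalty = np.sum(w ** 2)                  # ||w||^2 term (model complexity)
squared_error = np.sum((x @ w - y) ** 2)     # sum of (w^T x_n - y_n)^2 (data error)
loss = l2_penalty + squared_error
print(loss)  # 0.39 for these numbers
```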
Weight initialization: we generally want the weights to be small values close to zero and at the same time to have some randomness, so we consider using a normal distribution with a mathematical expectation of 0 to select the initial weights (μ is the mathematical expectation and σ^2 is the variance):
f(x) = (1 / (√(2π) σ)) exp(-(x - μ)^2 / (2σ^2))    (8)
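A sketch of this initialization, drawing weights from a zero-mean normal distribution; the standard deviation 0.1 and the kernel shape are illustrative choices, not values fixed by the method:

```python
import numpy as np

def init_weights(shape, stddev=0.1):
    """Draw initial weights from N(0, stddev^2): small, random, centred on zero."""
    return np.random.normal(loc=0.0, scale=stddev, size=shape)

w_conv1 = init_weights((3, 3, 1, 32))  # e.g. 3x3 kernels, 1 input channel, 32 filters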
Part 5: real-time movie review emotion recognition: enter test comments into the trained model for recognition.
DESCRIPTION OF THE DRAWINGS
The appended drawings are only for the purpose of description and explanation but not for limitation, wherein:
Fig. 1 represents the process of data processing.
Fig. 2 gives an example of building a bag-of-words model and one-hot encoding.
Fig. 3 shows the structure of our entire neural network.
Fig. 4 gives an example of the calculation principle of the convolutional layer.
Fig. 5 illustrates how the convolutional layer moves.
Fig. 6 shows the working structure of convolutional layer 1.
Fig. 7 shows the working structure of convolutional layer 2.
Fig. 8 gives an example of how the max pooling layer computes.
Fig. 9 shows the working structure of max pooling layer 1.
Fig. 10 shows the working structure of convolutional layer 3.
Fig. 11 shows the working structure of convolutional layer 4.
Fig. 12 shows the working structure of max pooling layer 2.
Fig. 13 gives an example of a fully connected neural network.
Fig. 14 shows the working structure of the fully-connected layers.
Fig. 15 lists our data when debugging the network.
DESCRIPTION OF PREFERRED EMBODIMENT
In this part, we will describe the specific methods and details used in the implementation of this invention.
1. Data:
Before we started to use our movie review analysis system for processing, we collected some movie reviews online and processed the data. Our raw data set, which is saved as a tsv file, contains 25,000 reviews. The information type of each movie review is divided into 3 categories, which are id, sentiment, and review. The sentiment is composed of two kinds of labels representing positive and negative.
2. Preprocessing:
First, considering the grammar and sentence ambiguity of natural language, we needed to perform text analysis. A simple way was to extract the words that made sense from the text.
We first imported the “pandas” library and used the “read_csv” function to read the data set. Then, we use a function, in which we perform the preliminary processing of the raw reviews, to convert each raw review into a preprocessed movie review.
In this function, the input should be a single review that is a string. We imported the “BeautifulSoup” class from the “bs4” library. Then, we took advantage of “BeautifulSoup” and the “lxml” parser to parse the HTML tags of the review and used the “get_text” function to get the plain text inside. After that, we cleared non-character data from the review using regular expressions, replacing all non-character data with spaces. Later we split the remaining words in the review using the “split” function; the split function works by slicing a string at a specified delimiter, and here our separator is a space. At the same time, we lowercased all characters because it is easier to match. The review was now a list of many lowercase word strings. Next, “nltk” is the main toolkit for processing languages under Python, which helps with removing stop words, part-of-speech tagging, word segmentation and clause splitting. Considering that there were many invalid words in the list, we imported the “stopwords” corpus from “nltk.corpus” to judge whether the words make sense and are not stop words. We only preserve the valid ones, which not only improves training speed but also saves memory. Finally, we join the words back into one string separated by spaces. The result is returned as a string.
To handle all the reviews, we get the size of the reviews and call the function above for each review. The result is saved as a list of strings named clean_train_reviews.
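A sketch of the cleaning function described above, assuming bs4, lxml, pandas and nltk are installed and the nltk stopword corpus has been downloaded; the tsv file name is a hypothetical placeholder:

```python
import re

import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Hypothetical file name; the raw data set is a .tsv with id, sentiment and review columns.
train = pd.read_csv("movie_reviews.tsv", sep="\t")

def review_to_words(raw_review):
    """Convert one raw review (a string) into a cleaned, space-joined string."""
    text = BeautifulSoup(raw_review, "lxml").get_text()      # strip HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", text)             # keep letters only
    words = letters_only.lower().split()                      # lowercase and split
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]   # drop stop words
    return " ".join(meaningful_words)

clean_train_reviews = [review_to_words(r) for r in train["review"]]
```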
Afterward, we initialize the “CountVectorizer” object, which is scikit-learn's bag-of-words tool, named vectorizer. “CountVectorizer” belongs to the common feature numerical calculation classes and is a text feature extraction method. For each training text, it only considers how often each word appears in the training text. In this case, we set it to keep only the first 4096 non-repeating words that appear in the reviews. The number 4096 was chosen because it is convenient for reshaping the matrix later and for optimizing performance. Furthermore, “CountVectorizer” converts the words in the text into a term frequency matrix. It uses the “fit_transform” function to count the number of times each word appears. The “fit_transform” function does two things: first, it fits the model and learns the vocabulary; second, it transforms our training data into feature vectors. The input to the “fit_transform” function should be a list of strings. After calling this function, each review is converted into a 1 X 4096 numeric vector. Last, we transformed each 1 X 4096 row vector into a 64 X 64 matrix.
Considering that numpy arrays are easy to work with, we imported the “numpy” library and used the “toarray” function to convert the list
train_data_features to a numpy array. We used the “train_test_split” function from sklearn to divide the data set into a 70% training set and a 30% test set, both including samples and sentiment labels.
As for the labels, we simply took the values in the sentiment column into a list and converted them into one-hot encoding which can represent the probability of belonging to each category.
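A sketch of the vectorization, reshaping and splitting described above, continuing from the clean_train_reviews list and train DataFrame of the previous sketch; the column name "sentiment" and the 0/1 label coding are assumptions based on the data description:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Keep the 4096 most frequent words so each review becomes a 1 x 4096 count vector.
vectorizer = CountVectorizer(max_features=4096)
train_data_features = vectorizer.fit_transform(clean_train_reviews).toarray()

# Reshape each 1 x 4096 row vector into a 64 x 64 "image" with one channel.
samples = train_data_features.reshape(-1, 64, 64, 1).astype(np.float32)

# One-hot sentiment labels: [1, 0] for negative, [0, 1] for positive.
labels = np.eye(2)[train["sentiment"].values]

# 70% training set, 30% test set.
x_train, x_test, y_train, y_test = train_test_split(samples, labels, test_size=0.3)
```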
3. The architecture of the neural network
Compared to the traditional fully connected neural network, we used a better-designed convolutional neural network, which has fewer parameters due to its weight-sharing property and is therefore less likely to overfit. Moreover, a CNN is able to automatically extract complex features from the original data step by step, thereby eliminating the labor of artificially extracting features, and can significantly improve the accuracy of the model.
We trained a 6-layer network including 4 convolutional layers followed by two fully-connected layers. There is a pooling layer after every two convolutional layers. We consider a pooling layer to be part of a convolutional layer, and we can simply use a hyperparameter to control whether the pooling layer is defined.
The architecture of CNN is shown in Figure 3 and described in detail below.
a) Input layer
We trained the network using a batch of 64 data each time and each individual data has one channel. Therefore, the input data is a 4-dimensional tensor of shape (64, 64, 64, 1) which can be interpreted as (batch size, individual data length, individual data width, channel). As for the test sets, we simply changed the batch size to 500, hence, a tensor of size (500, 64, 64, 1) was used to test the generalization ability of our network each time. However, in the following pages, instead of using the batch size of data, we would consider an individual sample data as the input to explain the specific details of the network.
b) Convolutional layer and Max pooling layer
In these layers, a set of 3 x 3 convolutional kernels and 2 x 2 pooling kernels are used to perform the convolutional calculation and the max-pooling calculation, respectively, on the input data. Each individual input data item has a size of (64, 64, 1). The first convolutional layer, containing 32 kernels of size 3 x 3, yields a 3-dimensional tensor of size (64, 64, 32), which means that the output of the first convolutional layer contains 32 channels. Because we set the hyperparameter “padding” of the convolutional layer to the value SAME and the stride size of the convolutional kernel to (1, 1, 1, 1), the convolutional layer does not change the size of its input data. In addition, we applied the relu function as the activation function to the result of each convolutional layer in order to remove unrelated features. The second convolutional layer uses 32 kernels with a size of 3 x 3 and 32 channels to perform further feature extraction on the output of the first convolutional layer. After carrying out the convolution calculation twice, the tensor that has passed through the consecutive convolutional layers is used as the input to the first pooling layer. We used max-pooling, which is a technique for selecting the largest value as a representation of circumjacent values. A max-pooling kernel with 32 channels is applied to convert the output of the second convolutional layer to size 32 x 32 with 32 channels. Then, the data is transferred to the third convolutional layer, which has 32 kernels with a size of 3 x 3 and 32 channels, the same size as the fourth convolutional layer's. The third and fourth convolutional layers extract more complicated features that are hard for a human to understand but simple for a computer to use in training our neural network. Then, as before, the data flow is delivered to the second max-pooling layer and converted to a tensor of size 16 x 16 with 32 channels.
c) Fully-connected layer
After passing through the pooling layer, a flattening process is performed to convert the 3-dimensional feature map of shape (16, 16, 32) into a 1-dimensional row vector of length 16 x 16 x 32. Then, the vector is sequentially delivered to two fully-connected layers containing 128 nodes and 2 nodes respectively. The first layer computes the transformation a(W*x + b), where W is an 8192 x 128 weight matrix, b is the bias vector containing 128 values, x is the 1-dimensional row vector, and a is the rectified linear (relu) function. The second layer does the same transformation as the first layer; in this layer, however, W is a 128 x 2 weight matrix, b is a bias vector containing 2 values, x is the output of the previous fully-connected layer, and a is the softmax function, whose output is the probability corresponding to each category.
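One way to realize this architecture is sketched below with the Keras API of TensorFlow; the original implementation is only described in TensorFlow terms, so the exact API calls here are an assumption, while the layer sizes follow the description (four 3 x 3 convolutional layers with 32 kernels, a 2 x 2 max-pooling layer after every two convolutional layers, then fully-connected layers of 128 and 2 nodes):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    # Convolutional layers 1-2, then max pooling 1 (64x64x32 -> 32x32x32).
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    # Convolutional layers 3-4, then max pooling 2 (32x32x32 -> 16x16x32).
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    # Flatten (16 * 16 * 32 = 8192) and classify.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),  # drop 10% of activations, i.e. keep probability 0.9
    tf.keras.layers.Dense(2, activation="softmax"),
])
```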
4. Optimization
Through the calculation of the last fully-connected layer, we obtained the probabilities that a sample is judged to belong to the positive and negative classes. Afterward, we used the cross-entropy function to calculate the total loss of the model and reduced the loss using some techniques which are described in detail below.
a) mini-batch training
We used 64 training samples as a mini-batch for each training step, which makes the parameters update faster and is therefore conducive to the convergence of the loss function.
b) L2 regularization
In order to prevent overfitting, we applied L2 regularization to the loss function. When we optimize the loss function, we simultaneously decrease the value of the weights and biases in the network, which reduces the complexity of our network to a great extent.
c) learning rate decay
A learning rate that is too large will make the algorithm hover around the optimal solution without converging; on the contrary, a learning rate that is too small will make training extremely slow. Therefore, we used a technique named learning rate decay, which gradually decreases the learning rate as the number of iterations increases. During the initial phases, while our learning rate alpha is still large, we can still have relatively fast learning. But then, as alpha gets smaller, our learning steps become slower and smaller, hence our algorithm can end up oscillating in a tight region around the minimum of the loss function. Specifically, we set the initial learning rate to 0.001, the decay rate to 0.99, and the decay steps to 100, which means updating the learning rate to 99% of its previous value every 100 steps.
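A sketch of this schedule using the Keras ExponentialDecay class, which is one way to express it; the original may have used a different TensorFlow call, so the API choice is an assumption while the numbers match the description above:

```python
import tensorflow as tf

# Start at 0.001 and multiply by 0.99 every 100 steps.
learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=100,
    decay_rate=0.99,
    staircase=True)
```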
d) Dropout
Dropout is also a powerful technique that we used in the fully connected layers to suppress overfitting. Nodes from the fully-connected layers are randomly dropped during training, which
prevents nodes from co-adapting too much. Specifically, we set the keep probability of dropout to 0.9, meaning that we randomly set some values in the data flow to 0 and scale the rest up by 10/9.
e) Adam
Instead of using the traditional stochastic gradient descent algorithm, we used Adam to optimize the loss function. The Adam algorithm is able to dynamically adjust the learning rate of each parameter. The main advantage of Adam is that, after the bias correction, the learning rate for each iteration has a certain range, which makes the parameter updates relatively stable.
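Putting the optimization pieces together, a hedged sketch that assumes the Keras model, the learning-rate schedule and the data splits from the earlier sketches; the number of epochs is an arbitrary illustration, not a value given in the description:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss="categorical_crossentropy",   # cross-entropy over the two softmax outputs
    metrics=["accuracy"])

# Mini-batches of 64, evaluated on the held-out test split after each epoch.
model.fit(x_train, y_train, batch_size=64, epochs=10,
          validation_data=(x_test, y_test))
```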
Figure 15 shows the results under different parameters. We can see that our model eventually achieved an accuracy of 81.3%, which is a relatively impressive result.
Claims (2)
- What we claim is: 1. A method for sentiment analysis of film reviews based on deep learning and natural language processing, characterized in that: it uses a deep learning model consisting of four convolutional neural network layers, two pooling layers and two fully connected layers, trained with the training data set; the model uses the Adam optimizer to reduce the loss and to optimize the weights and biases of the model so as to improve its accuracy on the test set; and, to avoid overfitting, the method uses the L2 regularization loss method and the dropout method.
- 2. A method for sentiment analysis of film reviews based on deep learning and natural language processing, which uses the bag-of-words model to vectorize the film review text data in order to obtain more accurate prediction results with fewer calculations and a simpler deep learning model structure, since the bag-of-words model can accurately and efficiently describe the characteristics of long texts such as film reviews.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100710A AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100710A AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020100710A4 true AU2020100710A4 (en) | 2020-06-11 |
Family
ID=70976400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020100710A Ceased AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020100710A4 (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858945A (en) * | 2020-08-05 | 2020-10-30 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect level emotion classification method and system |
CN112329434A (en) * | 2020-11-26 | 2021-02-05 | 北京百度网讯科技有限公司 | Text information identification method and device, electronic equipment and storage medium |
CN112463959A (en) * | 2020-10-29 | 2021-03-09 | 中国人寿保险股份有限公司 | Service processing method based on uplink short message and related equipment |
CN112687374A (en) * | 2021-01-12 | 2021-04-20 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN113035334A (en) * | 2021-05-24 | 2021-06-25 | 四川大学 | Automatic delineation method and device for radiotherapy target area of nasal cavity NKT cell lymphoma |
CN113096640A (en) * | 2021-03-08 | 2021-07-09 | 北京达佳互联信息技术有限公司 | Voice synthesis method and device, electronic equipment and storage medium |
CN113158656A (en) * | 2020-12-25 | 2021-07-23 | 北京中科闻歌科技股份有限公司 | Ironic content identification method, ironic content identification device, electronic device, and storage medium |
CN113660515A (en) * | 2021-09-09 | 2021-11-16 | 深圳市易平方网络科技有限公司 | Hot spot data processing method, device, terminal and medium based on smart television |
CN113674846A (en) * | 2021-09-18 | 2021-11-19 | 浙江远图互联科技股份有限公司 | Hospital intelligent service public opinion monitoring platform based on LSTM network |
CN113779991A (en) * | 2021-09-18 | 2021-12-10 | 广州荔支网络技术有限公司 | Text emotion recognition method and device, computer equipment and storage medium |
CN114416969A (en) * | 2021-11-30 | 2022-04-29 | 西安交通大学 | LSTM-CNN online comment sentiment classification method and system based on background enhancement |
CN114428853A (en) * | 2021-12-15 | 2022-05-03 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114639139A (en) * | 2022-02-16 | 2022-06-17 | 南京邮电大学 | Emotional image description method and system based on reinforcement learning |
CN114757183A (en) * | 2022-04-11 | 2022-07-15 | 北京理工大学 | Cross-domain emotion classification method based on contrast alignment network |
CN114937182A (en) * | 2022-04-18 | 2022-08-23 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN115563987A (en) * | 2022-10-17 | 2023-01-03 | 北京中科智加科技有限公司 | Comment text analysis processing method |
CN115982473A (en) * | 2023-03-21 | 2023-04-18 | 环球数科集团有限公司 | AIGC-based public opinion analysis arrangement system |
CN116431816A (en) * | 2023-06-13 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116501879A (en) * | 2023-05-16 | 2023-07-28 | 重庆邮电大学 | APP software user comment demand classification method based on big data |
CN116528065A (en) * | 2023-06-30 | 2023-08-01 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
US11715108B2 (en) | 2021-07-19 | 2023-08-01 | Mastercard International Incorporated | Methods and systems for enhancing purchase experience via audio web-recording |
CN117076613A (en) * | 2023-10-13 | 2023-11-17 | 中国长江电力股份有限公司 | Electric digital data processing system based on Internet big data |
CN117124910A (en) * | 2023-09-20 | 2023-11-28 | 漳州建源电力工程有限公司 | Smart city charging pile node fault alarm system and method |
CN117573988A (en) * | 2023-10-17 | 2024-02-20 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117707501A (en) * | 2023-12-18 | 2024-03-15 | 广州擎勤网络科技有限公司 | Automatic code generation method and system based on AI and big data |
CN118227796A (en) * | 2024-05-23 | 2024-06-21 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
-
2020
- 2020-05-05 AU AU2020100710A patent/AU2020100710A4/en not_active Ceased
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858945A (en) * | 2020-08-05 | 2020-10-30 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect level emotion classification method and system |
CN111858945B (en) * | 2020-08-05 | 2024-04-23 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect emotion classification method and system |
CN112463959A (en) * | 2020-10-29 | 2021-03-09 | 中国人寿保险股份有限公司 | Service processing method based on uplink short message and related equipment |
CN112329434A (en) * | 2020-11-26 | 2021-02-05 | 北京百度网讯科技有限公司 | Text information identification method and device, electronic equipment and storage medium |
CN112329434B (en) * | 2020-11-26 | 2024-04-12 | 北京百度网讯科技有限公司 | Text information identification method, device, electronic equipment and storage medium |
CN113158656B (en) * | 2020-12-25 | 2024-05-14 | 北京中科闻歌科技股份有限公司 | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium |
CN113158656A (en) * | 2020-12-25 | 2021-07-23 | 北京中科闻歌科技股份有限公司 | Ironic content identification method, ironic content identification device, electronic device, and storage medium |
CN112687374A (en) * | 2021-01-12 | 2021-04-20 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN112687374B (en) * | 2021-01-12 | 2023-09-15 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN113096640A (en) * | 2021-03-08 | 2021-07-09 | 北京达佳互联信息技术有限公司 | Voice synthesis method and device, electronic equipment and storage medium |
CN113035334A (en) * | 2021-05-24 | 2021-06-25 | 四川大学 | Automatic delineation method and device for radiotherapy target area of nasal cavity NKT cell lymphoma |
US11715108B2 (en) | 2021-07-19 | 2023-08-01 | Mastercard International Incorporated | Methods and systems for enhancing purchase experience via audio web-recording |
CN113660515A (en) * | 2021-09-09 | 2021-11-16 | 深圳市易平方网络科技有限公司 | Hot spot data processing method, device, terminal and medium based on smart television |
CN113674846A (en) * | 2021-09-18 | 2021-11-19 | 浙江远图互联科技股份有限公司 | Hospital intelligent service public opinion monitoring platform based on LSTM network |
CN113779991A (en) * | 2021-09-18 | 2021-12-10 | 广州荔支网络技术有限公司 | Text emotion recognition method and device, computer equipment and storage medium |
CN114416969A (en) * | 2021-11-30 | 2022-04-29 | 西安交通大学 | LSTM-CNN online comment sentiment classification method and system based on background enhancement |
CN114416969B (en) * | 2021-11-30 | 2024-10-15 | 西安交通大学 | LSTM-CNN online comment emotion classification method and system based on background enhancement |
CN114428853B (en) * | 2021-12-15 | 2024-09-13 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114428853A (en) * | 2021-12-15 | 2022-05-03 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114639139A (en) * | 2022-02-16 | 2022-06-17 | 南京邮电大学 | Emotional image description method and system based on reinforcement learning |
CN114757183A (en) * | 2022-04-11 | 2022-07-15 | 北京理工大学 | Cross-domain emotion classification method based on contrast alignment network |
CN114757183B (en) * | 2022-04-11 | 2024-05-10 | 北京理工大学 | Cross-domain emotion classification method based on comparison alignment network |
CN114937182A (en) * | 2022-04-18 | 2022-08-23 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN114937182B (en) * | 2022-04-18 | 2024-04-09 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN115563987A (en) * | 2022-10-17 | 2023-01-03 | 北京中科智加科技有限公司 | Comment text analysis processing method |
CN115982473A (en) * | 2023-03-21 | 2023-04-18 | 环球数科集团有限公司 | AIGC-based public opinion analysis arrangement system |
CN115982473B (en) * | 2023-03-21 | 2023-06-23 | 环球数科集团有限公司 | Public opinion analysis arrangement system based on AIGC |
CN116501879B (en) * | 2023-05-16 | 2024-07-09 | 芽米科技(广州)有限公司 | APP software user comment demand classification method based on big data |
CN116501879A (en) * | 2023-05-16 | 2023-07-28 | 重庆邮电大学 | APP software user comment demand classification method based on big data |
CN116431816B (en) * | 2023-06-13 | 2023-09-19 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116431816A (en) * | 2023-06-13 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116528065A (en) * | 2023-06-30 | 2023-08-01 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
CN116528065B (en) * | 2023-06-30 | 2023-09-26 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
CN117124910A (en) * | 2023-09-20 | 2023-11-28 | 漳州建源电力工程有限公司 | Smart city charging pile node fault alarm system and method |
CN117076613A (en) * | 2023-10-13 | 2023-11-17 | 中国长江电力股份有限公司 | Electric digital data processing system based on Internet big data |
CN117573988B (en) * | 2023-10-17 | 2024-05-14 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117573988A (en) * | 2023-10-17 | 2024-02-20 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117707501A (en) * | 2023-12-18 | 2024-03-15 | 广州擎勤网络科技有限公司 | Automatic code generation method and system based on AI and big data |
CN118227796A (en) * | 2024-05-23 | 2024-06-21 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
CN118227796B (en) * | 2024-05-23 | 2024-07-19 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020100710A4 (en) | A method for sentiment analysis of film reviews based on deep learning and natural language processing | |
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN110609897B (en) | Multi-category Chinese text classification method integrating global and local features | |
CN107066553B (en) | Short text classification method based on convolutional neural network and random forest | |
CN110059181A (en) | Short text stamp methods, system, device towards extensive classification system | |
CN109743732B (en) | Junk short message distinguishing method based on improved CNN-LSTM | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
US11996116B2 (en) | Methods and systems for implementing on-device non-semantic representation fine-tuning for speech classification | |
CN110321805B (en) | Dynamic expression recognition method based on time sequence relation reasoning | |
CN112307714A (en) | Character style migration method based on double-stage deep network | |
CN111368088A (en) | Text emotion classification method based on deep learning | |
CN110580287A (en) | Emotion classification method based ON transfer learning and ON-LSTM | |
CN118277538B (en) | Legal intelligent question-answering method based on retrieval enhancement language model | |
Luo et al. | English text quality analysis based on recurrent neural network and semantic segmentation | |
Chen et al. | Deep neural networks for multi-class sentiment classification | |
CN113821635A (en) | Text abstract generation method and system for financial field | |
CN114462420A (en) | False news detection method based on feature fusion model | |
Li et al. | Learning policy scheduling for text augmentation | |
Jing et al. | News text classification and recommendation technology based on wide & deep-bert model | |
CN117688944A (en) | Chinese emotion analysis method and system based on multi-granularity convolution feature fusion | |
CN113779966A (en) | Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention | |
CN112364160A (en) | Patent text classification method combining ALBERT and BiGRU | |
CN113283530B (en) | Image classification system based on cascade characteristic blocks | |
CN113190733B (en) | Network event popularity prediction method and system based on multiple platforms | |
CN114548293A (en) | Video-text cross-modal retrieval method based on cross-granularity self-distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |