AU2020100710A4 - A method for sentiment analysis of film reviews based on deep learning and natural language processing - Google Patents
A method for sentiment analysis of film reviews based on deep learning and natural language processing Download PDFInfo
- Publication number
- AU2020100710A4
- Authority
- AU
- Australia
- Prior art keywords
- deep learning
- film reviews
- layer
- reviews
- film
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
- 238000012552 review Methods 0.000 title claims abstract description 52
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000004458 analytical method Methods 0.000 title claims abstract description 18
- 238000013135 deep learning Methods 0.000 title claims abstract description 14
- 238000003058 natural language processing Methods 0.000 title claims abstract description 10
- 238000013528 artificial neural network Methods 0.000 claims abstract description 32
- 238000011176 pooling Methods 0.000 claims abstract description 32
- 238000004364 calculation method Methods 0.000 claims abstract description 12
- 238000012360 testing method Methods 0.000 claims abstract description 12
- 238000013136 deep learning model Methods 0.000 claims abstract description 8
- 238000012549 training Methods 0.000 abstract description 35
- 230000008451 emotion Effects 0.000 abstract description 6
- 238000007781 pre-processing Methods 0.000 abstract description 2
- 238000001514 detection method Methods 0.000 abstract 1
- 239000010410 layer Substances 0.000 description 111
- 230000006870 function Effects 0.000 description 41
- 238000013527 convolutional neural network Methods 0.000 description 25
- 239000011159 matrix material Substances 0.000 description 18
- 230000008569 process Effects 0.000 description 18
- 239000013598 vector Substances 0.000 description 10
- 230000004913 activation Effects 0.000 description 9
- 238000012545 processing Methods 0.000 description 9
- 238000000605 extraction Methods 0.000 description 8
- 230000002996 emotional effect Effects 0.000 description 7
- 238000005457 optimization Methods 0.000 description 7
- 238000010801 machine learning Methods 0.000 description 6
- 210000002569 neuron Anatomy 0.000 description 6
- 230000008901 benefit Effects 0.000 description 5
- 230000008859 change Effects 0.000 description 4
- 238000013461 design Methods 0.000 description 3
- 238000013507 mapping Methods 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008909 emotion recognition Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 239000002356 single layer Substances 0.000 description 1
- 238000012706 support-vector machine Methods 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Image Analysis (AREA)
Abstract
A method for sentiment analysis of film reviews based on deep learning and natural language processing is disclosed. The method for analyzing emotions of film reviews by deep learning includes: getting film review text data and marking positive and negative emotions in the film reviews; preprocessing the film reviews by removing redundant information; vectorizing the film review text according to the bag-of-words model; splitting the vectorized film reviews into training sets and test sets; setting up the initial deep learning model of film review sentiment analysis, which connects and integrates four convolutional neural network layers, two pooling layers, and two fully connected layers; training the initial deep learning model on the training data set to generate the final deep learning model; and using the final deep learning model to detect the film review test set and output the detection results. The invention can accurately distinguish positive and negative emotions of film reviews, and the deep learning model has a simple structure and a small amount of calculation, thereby improving the speed of emotion analysis of film reviews.
[Figure 1: the data-cleaning pipeline from a raw review (remove HTML, keep letters only, lowercase and split, remove stop words, join the meaningful words) to a clean review. Figure 2: a bag-of-words example built from the sentences "John likes to watch movies. Mary likes too. John also likes to watch football game.", giving a count vector such as [1, 1, 1, 1, 0, 1, 1, 1, 0, 0].]
Description
DESCRIPTION
TITLE
A method for sentiment analysis of film reviews based on deep learning and natural language processing
FIELD OF THE INVENTION
The present invention relates to sentiment analysis of film reviews, and more particularly to sentiment analysis of film reviews based on convolutional neural networks and natural language processing, which can distinguish positive and negative attitudes of film reviews accurately and quickly.
BACKGROUND OF THE INVENTION
With the continuous development of the Internet, more and more people express their views online about commodities, films and hot events. These comments are very important and meaningful. For e-commerce platforms, analyzing the emotional tendency of customers' evaluations of specific products can help them understand customer preferences, and thus improve service and enhance customer satisfaction. For enterprises, mining online comments can help them understand consumers' evaluations of their products and therefore optimize their products. From the perspective of the government, comments on the Internet involve people's attitudes and views on hot issues and national policies, so the government can better respond to public opinions and thus make good changes.
Our project is mainly about film reviews, one of the hottest types of Internet comments. Film reviews have many characteristics, such as short length, random content, rich semantic diversity, and a large number of emoji. Reading movie reviews lets the audience gain a clear view and comprehensive understanding of the movie they may choose to watch. For cinemas, checking film reviews allows them to explore a movie's reputation and repercussions, so as to adjust the arrangement and publicity for the movie and maximize their benefits.
Since we mainly focus on the emotional tendency of the comment text, we use sentiment analysis. Sentiment analysis, also known as emotional tendency analysis, is a direction of natural language processing (NLP) which aims to analyze the positive or negative aspects of a text description.
There used to be two kinds of sentiment classification technology: emotion dictionaries and machine learning. The former mainly uses emotional words to judge the emotional tendency of the text, and therefore requires constructing an emotional dictionary manually. The latter (such as Naive Bayes, maximum entropy classification, and support vector machines) mainly uses machine learning algorithms to train a statistical language model, and uses the trained classifier to classify the emotion of the text. This method can take the contextual semantic information into account, and it is more accurate in judging the emotional tendency of the sentence as a whole. However, the common algorithms rely on manual extraction of features, which requires considerable expert experience.
Recently, deep learning techniques (such as recurrent neural networks and convolutional neural networks) for sentiment analysis have become very popular. Compared with approaches based on manually extracted features, these methods provide automatic feature extraction and better performance. Among various deep learning models evaluated on multiple sentiment classification datasets, researchers have found that convolutional neural networks achieve results equal to or better than other methods. Our project is mainly about a method for sentiment analysis of film reviews based on deep learning and natural language processing.
SUMMARY OF THE INVENTION
In order to solve the shortcomings and problems of the above methods, the present invention proposes a method for sentiment analysis of film reviews based on deep learning and natural language processing, which consists of several multi-layer convolutional neural networks and fully connected neural networks connected in series. This method can give full play to the advantages of the automatic feature learning of deep learning, and effectively solve the above-mentioned problems such as difficulty in extracting comment features and low accuracy of automatic real-time learning.
The technical scheme of this patent is as follows:
This is a method for sentiment analysis of film reviews based on deep learning and natural language processing, including Part 1: data acquisition; Part 2: data processing; Part 3: deep learning structure design; Part 4: model training and optimization; Part 5: real-time movie review sentiment recognition.
Part 1: data acquisition: we crawl relevant movie reviews for various types of movies from major video websites.
Part 2: data processing:
A. Remove HTML tags: use the BeautifulSoup library in bs4 to remove HTML tags.
B. Lower case: the lower() function of strings is used to convert text to lower case.
C. Remove stop words: in information retrieval, to save storage space and improve search efficiency, certain words are automatically filtered out before or after processing natural language data (or text); these are called stop words. We use the stopwords class from the nltk library to remove stop words.
D. Remove non-character data: use Python's re library to remove non-character data through regular expression string matching. The A to D process mentioned above is shown in Figure 1.
E. Establish a bag-of-words model: put all the words in one bag, regardless of their morphology and word order, so that each word is independent. Build a dictionary for mapping matches. Each sentence can then be represented by a vector whose subscripts match the subscripts of the mapping dictionary and whose values are the number of times each word appears in the sentence. The process of building a bag-of-words model is shown in Figure 2.
F. One-hot encoding: one-hot encoding, also known as one-bit effective encoding, mainly uses an N-bit status register to encode N states; each state has its own register bit, and only one bit is valid at any time. Its purpose is to transform categorical variables into a form that machine learning algorithms can easily use.
G. Train/test split: the train_test_split() function of the model_selection module in Python's sklearn library is used to split the data into a training set and a test set.
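A minimal sketch of steps E to G using scikit-learn; the example reviews, labels and the 30% test fraction are illustrative placeholders, not values fixed by the method:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Example cleaned reviews (step E input); labels: 1 = positive, 0 = negative.
reviews = ["john likes watch movies mary likes", "film boring waste time"]
labels = [1, 0]

# Step E: bag-of-words model; each review becomes a word-count vector.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews).toarray()

# Step F: one-hot encode the sentiment labels (column 0 = negative, column 1 = positive).
one_hot = [[1, 0] if y == 0 else [0, 1] for y in labels]

# Step G: split into a training set and a test set.
x_train, x_test, y_train, y_test = train_test_split(features, one_hot, test_size=0.3)
```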
Part 3: deep learning structure design:
Before the introduction of neural networks, the old-fashioned process was: raw data -> hand-crafted feature extraction -> algorithm -> results. After the introduction of convolutional neural networks, the process becomes: raw data -> convolutional network algorithm -> results. What convolution has to solve is automatic feature extraction.
It can be seen from Fig. 3 and Fig. 13 that the convolutional neural network is very similar in structure to the fully connected neural network.
The convolutional neural network is also organized through layers of nodes. Like a fully connected neural network, each node in a convolutional neural network is a neuron. In a fully-connected neural network, the nodes between every two adjacent layers are connected by edges, so the nodes in each fully-connected layer are generally organized into a column, which is convenient for displaying the connection structure. In a convolutional neural network, only some nodes between two adjacent layers are connected. In order to show the dimensions of each layer of neurons, the nodes of each convolutional layer are generally organized into a three-dimensional matrix.
In addition to the similar structure, the input, output and training process of a convolutional neural network are basically the same as those of a fully connected neural network. Taking image classification as an example, the input layer of the convolutional neural network is the original pixels of the image, and each node in the output layer represents the credibility of a different class. This is consistent with the input and output of a fully connected neural network. The loss function and the parameter optimization process are also applicable to convolutional neural networks. The process of training a convolutional neural network in TensorFlow is no different from training a fully connected neural network. The only difference between a convolutional neural network and a fully connected neural network is the connection between two adjacent layers in the neural network.
The biggest problem with using fully connected neural networks to process images is that there are too many parameters in the fully connected layer. In addition to slowing down the calculation, increasing parameters can easily lead to overfitting problems. Therefore, a more reasonable neural network structure is needed to effectively reduce the number of parameters in the neural network. Convolutional neural networks can achieve this goal.
Fully connected network: fully connected means that nodes within the same layer are not connected to each other, while the nodes of each layer are connected to all nodes of the previous layer and the next layer. The fully connected property allows each layer to be represented by a matrix, and operations from each layer to the next can be performed in parallel using matrix operations. Composition: input layer, activation function, fully connected layer.
Convolutional neural network: Composition: input layer, convolutional layer, Rectified Linear Units layer (ReLU layer), pooling layer, fully-connected layer.
The input layer size depends on the input data which is the input of the entire neural network.
Convolutional layer: As you can see from the name, the Convolutional layer is the most important part of a convolutional neural network. Each convolutional layer in a convolutional neural network consists of several convolutional units. The parameters of each convolutional unit are obtained through optimization of the back-propagation algorithm. The input of each node in the convolutional layer is only a small block of the previous layer of the neural network. The commonly used size of this small block is 3 x 3 or 5 x 5. The purpose of the convolution operation is to extract different features of the input. The first layer of convolutional layers can usually only extract some low-level features such as edges and lines. Higher-level networks can iteratively extract more complex features from low-level features. In general, the node matrix processed by the convolutional layer will become deeper.
To better understand this layer, first define a few symbols:
W: the size of the input unit, usually expressed by the width or height of the input unit
F: receptive field, which refers to the size of the area on the input image to which the pixels on the feature map output by the convolutional neural network are mapped
S: stride, which controls the distance between two adjacent hidden units at the same depth and the input area connected to them. If the stride is small, there will be a lot of overlap in the input areas of adjacent hidden units; the overlap is reduced if the stride is large.
P: zero-padding. We can change the overall size of the input unit by padding zeros around the input unit to control the size of the output unit.
K: depth, which controls the depth of the output unit, that is, the number of filters and the number of neurons connected to the same area.
Then we can use formula (1) to calculate how many hidden units there are in an output unit along one dimension:
(W - F + 2P) / S + 1    (1)
(W: the size of the input unit, F: receptive field, S: stride, P: zero-padding.) Then we introduce the weight-sharing principle. The so-called weight sharing is to take an input image and use one filter to scan this image. The numbers in the filter are called weights; sharing them can greatly reduce the number of weight parameters and simplify the network structure. The embodiment of this principle in convolutional neural networks is that, for an input matrix, only one filter is used for scanning, and there is no need to define different filters for each position of the matrix.
On the convolutional layer, if the input layer size is W1 * H1 * D1, four additional parameters need to be given: the number of filters (K), the size of the filter, that is, the receptive field (F), the stride (S), and the amount of zero padding (P). The output is a three-dimensional unit W2 * H2 * D2, where
W2 = (W1 - F + 2P) / S + 1    (2)
H2 = (H1 - F + 2P) / S + 1    (3)
D2 = K    (4)
For example:
Calculation process: Figures 4 and 5 describe the basic process of the calculation. In the first step, a sliding window of the same size as the filter is placed on the input matrix, and the part of the input matrix inside the sliding window is multiplied element-wise with the corresponding positions of the filter matrix. The second step is to sum the results produced by the three matrices and add the bias term; then the window slides by the stride and the above operation is repeated. Figures 6, 7, 10, and 11 correspond to convolutional layers 1-4.
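As a worked illustration of formula (1) (this helper is a sketch, not part of the claimed method), a 64-wide input with a 3 x 3 filter, stride 1 and one pixel of zero padding, which matches the layer sizes used later in the embodiment, yields an output width of 64:

```python
def conv_output_size(w, f, s, p):
    """Number of hidden units along one dimension: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# 64-wide input, 3x3 filter, stride 1, zero-padding 1 ("SAME" padding).
print(conv_output_size(w=64, f=3, s=1, p=1))  # -> 64
```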
Rectified Linear Units layer (ReLU layer): the activation function of the neurons in this layer uses the ReLU function.
An activation function is a function that runs on the neurons of an artificial neural network and is responsible for mapping the input of the neuron to the output. The desirable properties of an activation function are: (1) Non-linearity. A linear activation layer has no effect on a deep neural network, because its role would still be a series of linear transformations of the input. (2) Continuous differentiability. This is a requirement for gradient descent methods. (3) The range is preferably not saturated. When there is a saturated interval segment, if the optimization drives the system into this segment, the gradient is approximately 0 and the learning of the network will stop. (4) Monotonicity. When the activation function is monotonic, the error function of a single-layer neural network is convex, which is easy to optimize. (5) It is approximately linear at the origin, so when the weights are initialized to random values close to 0, the network can learn faster without adjusting the initial values of the network. The commonly used activation functions have only some of the above-mentioned properties, and none of them has all of them.
The most commonly used activation function is the ReLU function: f(x) = max(0, x)    (5)
Pooling layer: the pooling layer does not change the depth of the three-dimensional matrix, but it can reduce the size of the matrix. Features with large dimensions are usually obtained after the convolutional layer. The pooling layer divides the features into several regions and takes the maximum or average value to obtain new, smaller-dimensional features, which reduces the number of nodes in the final fully connected layer and thereby reduces the number of parameters in the entire neural network. By compressing the input feature maps, on the one hand, the feature maps are made smaller, simplifying the computational complexity of the network; on the other hand, feature compression is performed to extract the main features. There are two types of pooling operations: one is average pooling and the other is max pooling. For example:
Using a 2 * 2 filter with a stride of 2, max pooling finds the maximum value in each area and extracts the main features from the original feature map to obtain the result on the right, as shown in Figure 8.
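A small sketch of 2 * 2 max pooling with stride 2 using NumPy; the feature-map values are made up for illustration:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 7],
                        [8, 2, 0, 1],
                        [3, 9, 4, 6]])

# 2x2 max pooling with stride 2: take the maximum of each non-overlapping 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 7]
               #  [9 6]]
```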
In convolutional neural networks, we often encounter pooling operations, and the pooling layer often follows the convolutional layer. Through pooling, the feature vectors output by the convolutional layer are reduced, and the results are improved (less prone to overfitting). Figures 9 and 12 correspond to pooling layer 1 and pooling layer 2, respectively.
Fully connected layer: after multiple rounds of convolutional and pooling layers, the final classification result of the convolutional neural network is usually given by 1 or 2 fully connected layers. After several rounds of processing by the convolutional and pooling layers, it can be considered that the information has been abstracted into features with higher information content. We can regard the convolutional layers and pooling layers as a process of automatic feature extraction. After the feature extraction is completed, fully connected layers still need to be used to complete the classification task.
In general, the fully connected layer combines all local features into global features, which are used to calculate the score of each final category. Figure 14 shows the structure of the fully connected layers.
Part 4: model training and optimization: the algorithms and ideas used in this part are: the concept of batching, dropout, Adam, learning rate decay, the number of iterations, L2 regularization loss, and weight initialization.
Batching: in the process of model training, due to large data sets and other reasons, it is often impossible to read all the data at one time. To overcome this, the concept of batching was introduced: the data are trained or tested in batches to reduce memory usage and improve training speed. If the batch size is too small, the system performs I/O frequently, resulting in low training efficiency; if it is too large, the computer cannot load that much data into memory and the application throws exceptions.
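A minimal batching sketch (batch size 64, as used later in the embodiment); the array names and the commented-out training step are placeholders:

```python
def iterate_batches(samples, labels, batch_size=64):
    """Yield successive mini-batches instead of loading everything at once."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size], labels[start:start + batch_size]

# for x_batch, y_batch in iterate_batches(train_x, train_y):
#     train_step(x_batch, y_batch)   # hypothetical per-batch training step
```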
Dropout: in a machine learning model, if the model has too many parameters and too few training samples, the trained model is prone to overfitting. Overfitting problems are often encountered when training neural networks. Overfitting manifests as follows: the model has a small loss function on the training data and a high prediction accuracy, but on the test data the loss function is large and the prediction accuracy is low. Overfitting is a common problem in machine learning, and an overfitted model is almost useless. In order to solve the problem of overfitting, a model ensemble method is generally adopted, that is, multiple models are trained and combined. This takes a lot of time: not only does it take time to train multiple models, it also takes time to test them. In summary, when training deep neural networks, there are always two major disadvantages: (1) they overfit easily and (2) they are time-consuming. Dropout can effectively alleviate the occurrence of overfitting and, to a certain extent, has a regularization effect. Dropout means that during the training process of a deep learning network, some neural network units are temporarily discarded from the network with a certain probability, which is equivalent to finding a thinner network within the original network.
Adam: the Adam optimization algorithm is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data. The Adam algorithm differs from traditional stochastic gradient descent: stochastic gradient descent maintains a single learning rate (i.e., alpha) to update all weights, and the learning rate does not change during the training process, whereas Adam designs independent adaptive learning rates for different parameters by calculating the first and second moment estimates of the gradient, and combines the advantages of two extensions of stochastic gradient descent, AdaGrad and RMSProp.
Learning rate decay: if the learning rate is too large, the training speed will be improved, but the accuracy of the result will be insufficient, and it may also lead to a situation where the training oscillates and cannot converge. If the learning rate is too small, the accuracy will improve, but training is slow and takes more time. So we can use a decaying learning rate, also known as learning rate decay. Its role is to attenuate the value of the learning rate during the training process; after the training reaches a certain level, a small learning rate is used to improve the accuracy.
Selection of the number of iterations: increasing the number of iterations while other parameters are constant can usually improve the accuracy, but it reduces the training speed, so we need to choose a reasonable number of iterations.
L2 regularization loss: the process of training a machine learning model is to minimize the error between the model's output and the actually collected results. Therefore, we also have an indicator for measuring errors, called the loss function (l_d stands for the variation, or data error, and l_mc stands for the model complexity): min(l_d + l_mc)    (6)
After adding the L2 regularization term, the linear regression loss function becomes:
loss = ||ω||^2 + Σ_{n=1}^{N} (ω^T x_n - y_n)^2    (7)
(ω is the weight matrix, x is the input value, y is the actual value, and T denotes the matrix transpose.)
The purpose is to minimize both the error and the model complexity. The smaller the error, the better the model fits; the smaller the model complexity, the simpler the calculation and the stronger the generalization ability.
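A short numeric sketch of formula (7); the weights, inputs and target values below are made up purely for illustration:

```python
import numpy as np

w = np.array([0.5, -0.2])        # weight vector
x = np.array([[1.0, 2.0],        # inputs, one row per sample
              [3.0, 1.0]])
y = np.array([0.2, 1.0])         # actual values

l2_penalty = np.sum(w ** 2)                  # ||w||^2 term (model complexity)
squared_error = np.sum((x @ w - y) ** 2)     # sum of (w^T x_n - y_n)^2 (data error)
loss = l2_penalty + squared_error
print(loss)  # 0.39 for these numbers
```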
Weight initialization: we generally want the weights to be small values close to zero and at the same time to have some randomness, so we consider using a normal distribution with a mathematical expectation of 0 to select the initial weights (μ is the mathematical expectation and σ^2 is the variance):
f(x) = (1 / (√(2π) σ)) exp(-(x - μ)^2 / (2σ^2))    (8)
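A sketch of this initialization, drawing weights from a zero-mean normal distribution; the standard deviation 0.1 and the kernel shape are illustrative choices, not values fixed by the method:

```python
import numpy as np

def init_weights(shape, stddev=0.1):
    """Draw initial weights from N(0, stddev^2): small, random, centred on zero."""
    return np.random.normal(loc=0.0, scale=stddev, size=shape)

w_conv1 = init_weights((3, 3, 1, 32))  # e.g. 3x3 kernels, 1 input channel, 32 filters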
Part 5: real-time movie review emotion recognition: enter test comments into the trained model for recognition.
DESCRIPTION OF THE DRAWINGS
The appended drawings are only for the purpose of description and explanation but not for limitation, wherein:
Fig. 1 represents the process of data processing.
Fig. 2 gives an example of building a bag-of-words model and one-hot encoding.
Fig. 3 shows the structure of our entire neural network.
Fig. 4 gives an example of the calculation principle of the convolutional layer.
Fig. 5 illustrates how the convolutional layer moves.
Fig. 6 shows the working structure of convolutional layer 1.
Fig. 7 shows the working structure of convolutional layer 2.
Fig. 8 gives an example of how the max pooling layer computes.
Fig. 9 shows the working structure of max pooling layer 1.
Fig. 10 shows the working structure of convolutional layer 3.
Fig. 11 shows the working structure of convolutional layer 4.
Fig. 12 shows the working structure of max pooling layer 2.
Fig. 13 gives an example of a fully connected neural network.
Fig. 14 shows the working structure of the fully-connected layers.
Fig. 15 lists our data when debugging the network.
DESCRIPTION OF PREFERRED EMBODIMENT
In this part, we will describe the specific methods and details used in the implementation of this invention.
1. Data:
Before we started to use our movie review analysis system for processing, we collected some movie reviews online and processed the data. Our raw data set, which is saved as a tsv file, contains 25,000 reviews. The information type of each movie review is divided into 3 categories, which are id, sentiment, and review. The sentiment is composed of two kinds of labels representing positive and negative.
2. Preprocessing:
First, considering the grammar and sentence ambiguity of natural language, we needed to perform text analysis. A simple way was to extract the words that made sense from the text.
We first imported the “pandas” library and used the “read_csv” function to read the data set. Then, we use a function, in which we perform the preliminary processing of the raw reviews, to convert each raw review into a preprocessed movie review.
In this function, the input should be a single review that is a string. We imported the “BeautifulSoup” class from the “bs4” library. Then, we took advantage of “BeautifulSoup” and the “lxml” parser to parse the HTML tags of the review and used the “get_text” function to get the plain text inside. After that, we cleared non-character data from the review using regular expressions, replacing all non-character data with spaces. Later we split the remaining words in the review using the “split” function; the split function works by slicing a string at a specified delimiter, and here our separator is a space. At the same time, we lowercased all characters because it is easier to match. The review was now a list of many lowercase word strings. Next, “nltk” is the main toolkit for processing languages under Python, which helps with removing stop words, part-of-speech tagging, word segmentation and clause splitting. Considering that there were many invalid words in the list, we imported the “stopwords” corpus from “nltk.corpus” to judge whether the words make sense and are not stop words. We only preserve the valid ones, which not only improves training speed but also saves memory. Finally, we join the words back into one string separated by spaces. The result is returned as a string.
To handle all the reviews, we get the size of the reviews and call the function above for each review. The result is saved as a list of strings named clean_train_reviews.
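A sketch of the cleaning function described above, assuming bs4, lxml, pandas and nltk are installed and the nltk stopword corpus has been downloaded; the tsv file name is a hypothetical placeholder:

```python
import re

import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Hypothetical file name; the raw data set is a .tsv with id, sentiment and review columns.
train = pd.read_csv("movie_reviews.tsv", sep="\t")

def review_to_words(raw_review):
    """Convert one raw review (a string) into a cleaned, space-joined string."""
    text = BeautifulSoup(raw_review, "lxml").get_text()      # strip HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", text)             # keep letters only
    words = letters_only.lower().split()                      # lowercase and split
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]   # drop stop words
    return " ".join(meaningful_words)

clean_train_reviews = [review_to_words(r) for r in train["review"]]
```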
Afterward, we initialize the “CountVectorizer” object, which is scikit-learn's bag-of-words tool, named vectorizer. “CountVectorizer” belongs to the common feature numerical calculation classes and is a text feature extraction method. For each training text, it only considers how often each word appears in the training text. In this case, we set it to keep only the first 4096 non-repeating words that appear in the reviews. The number 4096 was chosen because it is convenient for reshaping the matrix later and for optimizing performance. Furthermore, “CountVectorizer” converts the words in the text into a term frequency matrix. It uses the “fit_transform” function to count the number of times each word appears. The “fit_transform” function does two things: first, it fits the model and learns the vocabulary; second, it transforms our training data into feature vectors. The input to the “fit_transform” function should be a list of strings. After calling this function, each review is converted into a 1 X 4096 numeric vector. Last, we transformed each 1 X 4096 row vector into a 64 X 64 matrix.
Considering that numpy arrays are easy to work with, we imported the “numpy” library and used the “toarray” function to convert the list
train_data_features to a numpy array. We used the “train_test_split” function from sklearn to divide the data set into a 70% training set and a 30% test set, both including samples and sentiment labels.
As for the labels, we simply took the values in the sentiment column into a list and converted them into one-hot encoding which can represent the probability of belonging to each category.
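A sketch of the vectorization, reshaping and splitting described above, continuing from the clean_train_reviews list and train DataFrame of the previous sketch; the column name "sentiment" and the 0/1 label coding are assumptions based on the data description:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Keep the 4096 most frequent words so each review becomes a 1 x 4096 count vector.
vectorizer = CountVectorizer(max_features=4096)
train_data_features = vectorizer.fit_transform(clean_train_reviews).toarray()

# Reshape each 1 x 4096 row vector into a 64 x 64 "image" with one channel.
samples = train_data_features.reshape(-1, 64, 64, 1).astype(np.float32)

# One-hot sentiment labels: [1, 0] for negative, [0, 1] for positive.
labels = np.eye(2)[train["sentiment"].values]

# 70% training set, 30% test set.
x_train, x_test, y_train, y_test = train_test_split(samples, labels, test_size=0.3)
```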
3. The architecture of the neural network
Compared to the traditional fully connected neural network, we used a better-designed convolutional neural network, which has fewer parameters due to its weight-sharing property and is therefore less likely to overfit. Moreover, a CNN is able to automatically extract complex features from the original data step by step, thereby eliminating the labor of artificially extracting features, and can significantly improve the accuracy of the model.
We trained a 6-layer network including 4 convolutional layers followed by two fully-connected layers. There is a pooling layer after every two convolutional layers. We consider a pooling layer to be part of a convolutional layer, and we can simply use a hyperparameter to control whether the pooling layer is defined.
The architecture of CNN is shown in Figure 3 and described in detail below.
a) Input layer
We trained the network using a batch of 64 data each time and each individual data has one channel. Therefore, the input data is a 4-dimensional tensor of shape (64, 64, 64, 1) which can be interpreted as (batch size, individual data length, individual data width, channel). As for the test sets, we simply changed the batch size to 500, hence, a tensor of size (500, 64, 64, 1) was used to test the generalization ability of our network each time. However, in the following pages, instead of using the batch size of data, we would consider an individual sample data as the input to explain the specific details of the network.
b) Convolutional layer and Max pooling layer
In these layers, a set of 3 x 3 convolutional kernels and 2 x 2 pooling kernels are used to perform the convolutional calculation and the max-pooling calculation, respectively, on the input data. Each individual input data item has a size of (64, 64, 1). The first convolutional layer, containing 32 kernels of size 3 x 3, yields a 3-dimensional tensor of size (64, 64, 32), which means that the output of the first convolutional layer contains 32 channels. Because we set the hyperparameter “padding” of the convolutional layer to the value SAME and the stride size of the convolutional kernel to (1, 1, 1, 1), the convolutional layer does not change the size of its input data. In addition, we applied the relu function as the activation function to the result of each convolutional layer in order to remove unrelated features. The second convolutional layer uses 32 kernels with a size of 3 x 3 and 32 channels to perform further feature extraction on the output of the first convolutional layer. After carrying out the convolution calculation twice, the tensor that has passed through the consecutive convolutional layers is used as the input to the first pooling layer. We used max-pooling, which is a technique for selecting the largest value as a representation of circumjacent values. A max-pooling kernel with 32 channels is applied to convert the output of the second convolutional layer to size 32 x 32 with 32 channels. Then, the data is transferred to the third convolutional layer, which has 32 kernels with a size of 3 x 3 and 32 channels, the same size as the fourth convolutional layer's. The third and fourth convolutional layers extract more complicated features that are hard for a human to understand but simple for a computer to use in training our neural network. Then, as before, the data flow is delivered to the second max-pooling layer and converted to a tensor of size 16 x 16 with 32 channels.
c) Fully-connected layer
After passing through the pooling layer, a flattening process is performed to convert the 3-dimensional feature map of shape (16, 16, 32) into a 1-dimensional row vector of length 16 x 16 x 32. Then, the vector is sequentially delivered to two fully-connected layers containing 128 nodes and 2 nodes respectively. The first layer computes the transformation a(W*x + b), where W is an 8192 x 128 weight matrix, b is the bias vector containing 128 values, x is the 1-dimensional row vector, and a is the rectified linear (relu) function. The second layer does the same transformation as the first layer; in this layer, however, W is a 128 x 2 weight matrix, b is a bias vector containing 2 values, x is the output of the previous fully-connected layer, and a is the softmax function, whose output is the probability corresponding to each category.
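One way to realize this architecture is sketched below with the Keras API of TensorFlow; the original implementation is only described in TensorFlow terms, so the exact API calls here are an assumption, while the layer sizes follow the description (four 3 x 3 convolutional layers with 32 kernels, a 2 x 2 max-pooling layer after every two convolutional layers, then fully-connected layers of 128 and 2 nodes):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    # Convolutional layers 1-2, then max pooling 1 (64x64x32 -> 32x32x32).
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    # Convolutional layers 3-4, then max pooling 2 (32x32x32 -> 16x16x32).
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    # Flatten (16 * 16 * 32 = 8192) and classify.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),  # drop 10% of activations, i.e. keep probability 0.9
    tf.keras.layers.Dense(2, activation="softmax"),
])
```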
4. Optimization
Through the calculation of the last fully-connected layer, we obtained the probabilities that a sample is judged to belong to the positive and negative classes. Afterward, we used the cross-entropy function to calculate the total loss of the model and reduced the loss using some techniques which are described in detail below.
a) mini-batch training
We used 64 training samples as a mini-batch for each training step, which makes the parameters update faster and is therefore conducive to the convergence of the loss function.
b) L2 regularization
In order to prevent overfitting, we applied L2 regularization to the loss function. When we optimize the loss function, we simultaneously decrease the value of the weights and biases in the network, which reduces the complexity of our network to a great extent.
c) learning rate decay
A learning rate that is too large will make the algorithm hover around the optimal solution without converging; on the contrary, a learning rate that is too small will make training extremely slow. Therefore, we used a technique named learning rate decay, which gradually decreases the learning rate as the number of iterations increases. During the initial phases, while our learning rate alpha is still large, we can still have relatively fast learning. But then, as alpha gets smaller, our learning steps become slower and smaller, hence our algorithm can end up oscillating in a tight region around the minimum of the loss function. Specifically, we set the initial learning rate to 0.001, the decay rate to 0.99, and the decay steps to 100, which means updating the learning rate to 99% of its previous value every 100 steps.
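A sketch of this schedule using the Keras ExponentialDecay class, which is one way to express it; the original may have used a different TensorFlow call, so the API choice is an assumption while the numbers match the description above:

```python
import tensorflow as tf

# Start at 0.001 and multiply by 0.99 every 100 steps.
learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=100,
    decay_rate=0.99,
    staircase=True)
```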
d) Dropout
Dropout is also a powerful technique that we used in the fully connected layers to suppress overfitting. Nodes from the fully-connected layers are randomly dropped during training, which
prevents nodes from co-adapting too much. Specifically, we set the keep probability of dropout to 0.9, meaning that we randomly set some values in the data flow to 0 and scale the rest up by 10/9.
e) Adam
Instead of using the traditional stochastic gradient descent algorithm, we used Adam to optimize the loss function. The Adam algorithm is able to dynamically adjust the learning rate of each parameter. The main advantage of Adam is that, after the bias correction, the learning rate for each iteration has a certain range, which makes the parameter updates relatively stable.
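Putting the optimization pieces together, a hedged sketch that assumes the Keras model, the learning-rate schedule and the data splits from the earlier sketches; the number of epochs is an arbitrary illustration, not a value given in the description:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss="categorical_crossentropy",   # cross-entropy over the two softmax outputs
    metrics=["accuracy"])

# Mini-batches of 64, evaluated on the held-out test split after each epoch.
model.fit(x_train, y_train, batch_size=64, epochs=10,
          validation_data=(x_test, y_test))
```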
Figure 15 shows the results under different parameters. We can see that our model eventually achieved an accuracy of 81.3%, which is a relatively impressive result.
Claims (2)
- What we claim is: 1. A method for sentiment analysis of film reviews based on deep learning and natural language processing, characterized in that: it uses a deep learning model consisting of four convolutional neural network layers, two pooling layers and two fully connected layers, trained with the training data set; the model uses the Adam optimizer to reduce the loss and to optimize the weights and biases of the model so as to improve its accuracy on the test set; and, to avoid overfitting, the method uses the L2 regularization loss method and the dropout method.
- 2. A method for sentiment analysis of film reviews based on deep learning and natural language processing, which uses the bag-of-words model to vectorize the film review text data in order to obtain more accurate prediction results with fewer calculations and a simpler deep learning model structure, since the bag-of-words model can accurately and efficiently describe the characteristics of long texts such as film reviews.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100710A AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100710A AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020100710A4 true AU2020100710A4 (en) | 2020-06-11 |
Family
ID=70976400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020100710A Ceased AU2020100710A4 (en) | 2020-05-05 | 2020-05-05 | A method for sentiment analysis of film reviews based on deep learning and natural language processing |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020100710A4 (en) |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858945A (en) * | 2020-08-05 | 2020-10-30 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect level emotion classification method and system |
CN112329434A (en) * | 2020-11-26 | 2021-02-05 | 北京百度网讯科技有限公司 | Text information identification method and device, electronic equipment and storage medium |
CN112463959A (en) * | 2020-10-29 | 2021-03-09 | 中国人寿保险股份有限公司 | Service processing method based on uplink short message and related equipment |
CN112687374A (en) * | 2021-01-12 | 2021-04-20 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN113035334A (en) * | 2021-05-24 | 2021-06-25 | 四川大学 | Automatic delineation method and device for radiotherapy target area of nasal cavity NKT cell lymphoma |
CN113096640A (en) * | 2021-03-08 | 2021-07-09 | 北京达佳互联信息技术有限公司 | Voice synthesis method and device, electronic equipment and storage medium |
CN113158656A (en) * | 2020-12-25 | 2021-07-23 | 北京中科闻歌科技股份有限公司 | Ironic content identification method, ironic content identification device, electronic device, and storage medium |
CN113660515A (en) * | 2021-09-09 | 2021-11-16 | 深圳市易平方网络科技有限公司 | Hot spot data processing method, device, terminal and medium based on smart television |
CN113674846A (en) * | 2021-09-18 | 2021-11-19 | 浙江远图互联科技股份有限公司 | Hospital intelligent service public opinion monitoring platform based on LSTM network |
CN113779991A (en) * | 2021-09-18 | 2021-12-10 | 广州荔支网络技术有限公司 | Text emotion recognition method and device, computer equipment and storage medium |
CN114416969A (en) * | 2021-11-30 | 2022-04-29 | 西安交通大学 | LSTM-CNN online comment sentiment classification method and system based on background enhancement |
CN114428853A (en) * | 2021-12-15 | 2022-05-03 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114639139A (en) * | 2022-02-16 | 2022-06-17 | 南京邮电大学 | Emotional image description method and system based on reinforcement learning |
CN114757183A (en) * | 2022-04-11 | 2022-07-15 | 北京理工大学 | Cross-domain emotion classification method based on contrast alignment network |
CN114937182A (en) * | 2022-04-18 | 2022-08-23 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN115563987A (en) * | 2022-10-17 | 2023-01-03 | 北京中科智加科技有限公司 | Comment text analysis processing method |
CN115982473A (en) * | 2023-03-21 | 2023-04-18 | 环球数科集团有限公司 | AIGC-based public opinion analysis arrangement system |
CN116431816A (en) * | 2023-06-13 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116501879A (en) * | 2023-05-16 | 2023-07-28 | 重庆邮电大学 | APP software user comment demand classification method based on big data |
CN116528065A (en) * | 2023-06-30 | 2023-08-01 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
US11715108B2 (en) | 2021-07-19 | 2023-08-01 | Mastercard International Incorporated | Methods and systems for enhancing purchase experience via audio web-recording |
CN117076613A (en) * | 2023-10-13 | 2023-11-17 | 中国长江电力股份有限公司 | Electric digital data processing system based on Internet big data |
CN117124910A (en) * | 2023-09-20 | 2023-11-28 | 漳州建源电力工程有限公司 | Smart city charging pile node fault alarm system and method |
CN117573988A (en) * | 2023-10-17 | 2024-02-20 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117707501A (en) * | 2023-12-18 | 2024-03-15 | 广州擎勤网络科技有限公司 | Automatic code generation method and system based on AI and big data |
CN118227796A (en) * | 2024-05-23 | 2024-06-21 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
-
2020
- 2020-05-05 AU AU2020100710A patent/AU2020100710A4/en not_active Ceased
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858945A (en) * | 2020-08-05 | 2020-10-30 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect level emotion classification method and system |
CN111858945B (en) * | 2020-08-05 | 2024-04-23 | 上海哈蜂信息科技有限公司 | Deep learning-based comment text aspect emotion classification method and system |
CN112463959A (en) * | 2020-10-29 | 2021-03-09 | 中国人寿保险股份有限公司 | Service processing method based on uplink short message and related equipment |
CN112329434A (en) * | 2020-11-26 | 2021-02-05 | 北京百度网讯科技有限公司 | Text information identification method and device, electronic equipment and storage medium |
CN112329434B (en) * | 2020-11-26 | 2024-04-12 | 北京百度网讯科技有限公司 | Text information identification method, device, electronic equipment and storage medium |
CN113158656B (en) * | 2020-12-25 | 2024-05-14 | 北京中科闻歌科技股份有限公司 | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium |
CN113158656A (en) * | 2020-12-25 | 2021-07-23 | 北京中科闻歌科技股份有限公司 | Ironic content identification method, ironic content identification device, electronic device, and storage medium |
CN112687374A (en) * | 2021-01-12 | 2021-04-20 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN112687374B (en) * | 2021-01-12 | 2023-09-15 | 湖南师范大学 | Psychological crisis early warning method based on text and image information joint calculation |
CN113096640A (en) * | 2021-03-08 | 2021-07-09 | 北京达佳互联信息技术有限公司 | Voice synthesis method and device, electronic equipment and storage medium |
CN113035334A (en) * | 2021-05-24 | 2021-06-25 | 四川大学 | Automatic delineation method and device for radiotherapy target area of nasal cavity NKT cell lymphoma |
US11715108B2 (en) | 2021-07-19 | 2023-08-01 | Mastercard International Incorporated | Methods and systems for enhancing purchase experience via audio web-recording |
CN113660515A (en) * | 2021-09-09 | 2021-11-16 | 深圳市易平方网络科技有限公司 | Hot spot data processing method, device, terminal and medium based on smart television |
CN113674846A (en) * | 2021-09-18 | 2021-11-19 | 浙江远图互联科技股份有限公司 | Hospital intelligent service public opinion monitoring platform based on LSTM network |
CN113779991A (en) * | 2021-09-18 | 2021-12-10 | 广州荔支网络技术有限公司 | Text emotion recognition method and device, computer equipment and storage medium |
CN114416969A (en) * | 2021-11-30 | 2022-04-29 | 西安交通大学 | LSTM-CNN online comment sentiment classification method and system based on background enhancement |
CN114416969B (en) * | 2021-11-30 | 2024-10-15 | 西安交通大学 | LSTM-CNN online comment emotion classification method and system based on background enhancement |
CN114428853B (en) * | 2021-12-15 | 2024-09-13 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114428853A (en) * | 2021-12-15 | 2022-05-03 | 哈尔滨理工大学 | Text classification method and system based on deep learning |
CN114639139A (en) * | 2022-02-16 | 2022-06-17 | 南京邮电大学 | Emotional image description method and system based on reinforcement learning |
CN114757183A (en) * | 2022-04-11 | 2022-07-15 | 北京理工大学 | Cross-domain emotion classification method based on contrast alignment network |
CN114757183B (en) * | 2022-04-11 | 2024-05-10 | 北京理工大学 | Cross-domain emotion classification method based on comparison alignment network |
CN114937182A (en) * | 2022-04-18 | 2022-08-23 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN114937182B (en) * | 2022-04-18 | 2024-04-09 | 江西师范大学 | Image emotion distribution prediction method based on emotion wheel and convolutional neural network |
CN115563987A (en) * | 2022-10-17 | 2023-01-03 | 北京中科智加科技有限公司 | Comment text analysis processing method |
CN115982473A (en) * | 2023-03-21 | 2023-04-18 | 环球数科集团有限公司 | AIGC-based public opinion analysis arrangement system |
CN115982473B (en) * | 2023-03-21 | 2023-06-23 | 环球数科集团有限公司 | Public opinion analysis arrangement system based on AIGC |
CN116501879B (en) * | 2023-05-16 | 2024-07-09 | 芽米科技(广州)有限公司 | APP software user comment demand classification method based on big data |
CN116501879A (en) * | 2023-05-16 | 2023-07-28 | 重庆邮电大学 | APP software user comment demand classification method based on big data |
CN116431816B (en) * | 2023-06-13 | 2023-09-19 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116431816A (en) * | 2023-06-13 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Document classification method, apparatus, device and computer readable storage medium |
CN116528065A (en) * | 2023-06-30 | 2023-08-01 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
CN116528065B (en) * | 2023-06-30 | 2023-09-26 | 深圳臻像科技有限公司 | Efficient virtual scene content light field acquisition and generation method |
CN117124910A (en) * | 2023-09-20 | 2023-11-28 | 漳州建源电力工程有限公司 | Smart city charging pile node fault alarm system and method |
CN117076613A (en) * | 2023-10-13 | 2023-11-17 | 中国长江电力股份有限公司 | Electric digital data processing system based on Internet big data |
CN117573988B (en) * | 2023-10-17 | 2024-05-14 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117573988A (en) * | 2023-10-17 | 2024-02-20 | 广东工业大学 | Offensive comment identification method based on multi-modal deep learning |
CN117707501A (en) * | 2023-12-18 | 2024-03-15 | 广州擎勤网络科技有限公司 | Automatic code generation method and system based on AI and big data |
CN118227796A (en) * | 2024-05-23 | 2024-06-21 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
CN118227796B (en) * | 2024-05-23 | 2024-07-19 | 国家计算机网络与信息安全管理中心 | Automatic classification and threshold optimization method and system for long text specific content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020100710A4 (en) | A method for sentiment analysis of film reviews based on deep learning and natural language processing | |
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN110609897B (en) | Multi-category Chinese text classification method integrating global and local features | |
CN107066553B (en) | Short text classification method based on convolutional neural network and random forest | |
CN110059181A (en) | Short text stamp methods, system, device towards extensive classification system | |
CN109743732B (en) | Junk short message distinguishing method based on improved CNN-LSTM | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
US11996116B2 (en) | Methods and systems for implementing on-device non-semantic representation fine-tuning for speech classification | |
CN110321805B (en) | Dynamic expression recognition method based on time sequence relation reasoning | |
CN112307714A (en) | Character style migration method based on double-stage deep network | |
CN111368088A (en) | Text emotion classification method based on deep learning | |
CN110580287A (en) | Emotion classification method based ON transfer learning and ON-LSTM | |
CN118277538B (en) | Legal intelligent question-answering method based on retrieval enhancement language model | |
Luo et al. | English text quality analysis based on recurrent neural network and semantic segmentation | |
Chen et al. | Deep neural networks for multi-class sentiment classification | |
CN113821635A (en) | Text abstract generation method and system for financial field | |
CN114462420A (en) | False news detection method based on feature fusion model | |
Li et al. | Learning policy scheduling for text augmentation | |
Jing et al. | News text classification and recommendation technology based on wide & deep-bert model | |
CN117688944A (en) | Chinese emotion analysis method and system based on multi-granularity convolution feature fusion | |
CN113779966A (en) | Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention | |
CN112364160A (en) | Patent text classification method combining ALBERT and BiGRU | |
CN113283530B (en) | Image classification system based on cascade characteristic blocks | |
CN113190733B (en) | Network event popularity prediction method and system based on multiple platforms | |
CN114548293A (en) | Video-text cross-modal retrieval method based on cross-granularity self-distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |