AU2019101147A4 - A sentimental analysis system for film review based on deep learning - Google Patents

A sentimental analysis system for film review based on deep learning

Info

Publication number
AU2019101147A4
Authority
AU
Australia
Prior art keywords
analysis system
data
deep learning
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101147A
Inventor
Haoran Han
Yilin Hao
Yisiyuan Huang
Yufei Meng
Zixing Shen
Keyao Wu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hao Yilin Miss
Meng Yufei Miss
Wu Keyao Miss
Original Assignee
Hao Yilin Miss
Meng Yufei Miss
Wu Keyao Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hao Yilin Miss, Meng Yufei Miss, Wu Keyao Miss filed Critical Hao Yilin Miss
Priority to AU2019101147A priority Critical patent/AU2019101147A4/en
Application granted granted Critical
Publication of AU2019101147A4 publication Critical patent/AU2019101147A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

This application introduces a sentimental analysis system for film criticism based on deep learning. The project contains four main processing sections: data preprocessing with one-hot encoding, bag-of-words feature construction, a convolutional neural network, and optimization. The first part, data processing, draws on some well-known models and theories from the area of information retrieval. In this sentimental analysis system, accuracy is the main criterion for measuring the degree of system optimization and the efficiency of target realization. Compared with other systems, our sentimental analysis system based on deep learning has many advantages, including a simple structure, high accuracy, and rapid encoding speed.

Description

This invention belongs to the field of information processing; it is a sentimental analysis system for film criticism based on deep learning.
BACKGROUND
It is widely acknowledged that, due to the rapid development of the Internet, a great many emerging social websites, well-known forums, and blog writers take advantage of users' sentimental comments, feelings, perspectives, and so on, producing a great deal of data about various events in society, products, brands, politics, and films. For films in particular, the feelings users express play a crucial role in attracting later viewers, shaping the films' public images, and informing their network service providers. For instance, Douban, a common website used to comment on books and films, contains a large number of users' positive and negative emotional reviews. Analyzing the sentimental tendencies of the comments on Douban lays a solid foundation for investors to make decisions, and can also be regarded as a means of helping creators improve the quality of their works. Consequently, since such decentralized and unstructured data need to be properly managed, emotional analysis has been attached great importance under this background.
With the explosive growth of this type of comment, the demand for sentimental analysis technology, a branch of natural language processing (NLP), is gradually increasing, as it can be employed to analyze and judge the emotional type of a text description so that machines can better comprehend the emotions and views expressed in the text. Nonetheless, due to both the complexity and the diversity of human languages, the application of sentimental analysis is considered a challenging task.
Previous research shows that basic machine learning techniques can accomplish some natural language processing tasks effectively, such as document subject classification. However, the same techniques cannot be applied directly to the field of emotional classification, since more effort is required to overcome the challenges emotional analysis faces and to deal with the diversity involved in it.
There are two main approaches applied to emotion analysis so far. The first is based on an emotion thesaurus: the emotional tendency of the text is calculated according to the constructed emotion thesaurus, quantifying the emotion of the text according to its semantics and dependencies, and the final classification effect depends on the integrity of the emotion thesaurus. Moreover, this particular method requires an excellent linguistic foundation: it is necessary to know when a sentence is usually expressed as positive or negative under different situations. Nevertheless, the emotions expressed by words are difficult to judge as absolutely positive or negative due to the complexity of modern language, so it is difficult to perfect the judgment of emotions by this means. The other common method, based on machine learning, is to select emotional words as feature words and then turn the text into a matrix. Among these methods, Logistic Regression, Naive Bayes (NB), and Support Vector Machine (SVM) are the most commonly used. The final classification effect always depends on the selection of the training text and the emotional labeling. Each of these approaches, however, has its own drawbacks. For example, in theory the Naive Bayes model has the minimum error compared with other classification methods; in fact, this is not always the case. The Naive Bayes model assumes that attributes are independent of each other, and this assumption is often not valid in practical applications. The classification effect is poor when the number of attributes is large or the correlation between attributes is strong; in other words, Naive Bayes performs best when attribute correlation is small. On this point, an emerging algorithm, semi-Naive Bayes, partly alleviates the correlation problem. In addition, a prior probability, which in many cases depends on hypotheses, needs to be known before using this model; but there are many kinds of hypothesis models, which makes a bad prediction effect more likely. Moreover, there is a certain error rate in classification decisions, and the model is sensitive to the expression form of the input data. At present, most of the relevant research uses emotional characteristics manually annotated with SVM or Naive Bayes to conduct emotional analysis on Weibo. However, as Weibo posts usually contain limited contextual information, it is challenging to conduct emotional analysis on them. Meanwhile, these methods require features to be extracted manually, which is almost impossible given the large sample size, so their applicability is limited.
Accordingly, we decided to adopt a convolutional neural network (CNN) and a fully connected neural network (FC) in our program of emotional analysis of film reviews. With its special structure of local weight sharing, the convolutional neural network has unique advantages in speech recognition and image processing, and its layout is closer to an actual biological neural network. In a CNN, weight sharing reduces the complexity of the network and further decreases the number of parameters. Moreover, a multi-dimensional input vector such as an image can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification. In the pooling layer, the max-pooling method can be used to compress features to achieve dimensionality reduction and facilitate the extraction of the main features. In this invention, we take advantage of 4 convolution layers, 2 max-pooling layers, and 2 fully connected layers, which not only makes the overall framework of the project relatively simple and fast, but also achieves the optimizations that the original design set out to make.
SUMMARY
Our patent is a sentimental analysis system for film review based on deep learning, and the system includes six major steps:
1) Firstly, the original film review data would be preprocessed, which includes eliminating html tags, deleting non-character information, and utilizing the nltk stopwords corpus in Python to cast off the stop words.
2) Then our system would transform the preprocessed data into the form of the Bag-of-words Model, which serves to turn natural language information into numerical arrays. A review in the form of the Bag-of-words Model is totally independent of its grammar and word order; the choice of words as well as the words' frequencies are the decisive elements in the Bag-of-words Model. Our system utilizes sklearn in Python to achieve this goal (a code sketch of steps 2, 3, and 6 follows this list).
As to the specific process of the Bag-of-words Model transformation, our system would first transfer every review in our film review data into a list made up of each word in the review. It would then go over all the review data and construct a dictionary that reveals the vocabulary used in our review data. Yet since the dictionary at this stage is too redundant and includes much vocabulary with little analytical value, the system would rearrange the dictionary by the frequency of each word in descending order, and the dictionary would be reduced to the top m most frequent words.
Then the system would construct a matrix with the shape [1, m] from each review according to the new dictionary. The indexes in this matrix correspond to the indexes in the dictionary, while every element in the matrix reflects the frequency of the word with the same index in the dictionary. Thus the final processed review data is a matrix with the shape [n, m] (where n is the total number of reviews in the data).
3) The system would then transfer the sentiment labels in the data into the form of one-hot encoding. While the system is executing the Bag-of-words Model transformation, it labels the bag of words from each review with its corresponding sentiment. It then transfers every sentiment into a list with i elements (i represents the number of sentiment categories: for instance, if the sentiment ranges from 1 star to 5 stars, then i would be 5). The actual value of the sentiment is used as the index for one-hot encoding; the element at that index is given the value 1 while all other elements in the list have the value 0. After these steps, each review has a one-hot encoded label as a binary list of length i.
4) The data processed in this way, along with the sentiments in the form of one-hot encoding, would be imported as input samples and labels into the deep learning network in our system.
As indicated in Figure 1, the basic structure of this network is composed of four convolutional (CNN) layers and two max-pooling layers, followed by two fully-connected (FC) layers.
With the Rectified Linear Unit (ReLU) as the activation function, the CNN layers mainly use filters to go over the whole data and extract features from the review data, namely the choice of words and the frequency of words. Then, for the new matrix (of size a) the system gets from the CNN layers, max pooling extracts the maximum value from each portion and constructs another matrix (of size a/2) from the maximum values extracted from the convolutional layers. This process ensures that not too many features enter the fully-connected layers.
By executing the above process twice, the shape of the input review data is transformed from n×m×1×1 to n×(m/4)×1×1 before it enters the FC layers. In the FC layers, we use the Softmax function to categorize the features extracted from the CNN layers.
5) We then use three main methods to optimize the learning structure.
Firstly, we use regularization, which utilizes the L2 norm to calculate an additional loss for the weight tensors; but instead of taking the square root of the summed squares, we halve the value. We use this value as the new loss term to prevent over-fitting.
We also apply random dropout to the FC layers during the training phase of our learning network. By dropping nodes at a given probability p, the over-fitting situation is also effectively suppressed.
Lastly, we can update our parameters through gradient descent, Newton's method, momentum, or Adam. Moreover, we optimize the learning rate by decaying it with a given batch size.
6) Last but not least, we split the film review data into a training set and a testing set at a ratio of p:q. The system uses the training set as input data to train and refine our emotional analysis model, and uses the testing set to calculate the accuracy of the model.
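As a concrete illustration of steps 2), 3), and 6), the following Python sketch builds the [n, m] bag-of-words matrix, the one-hot labels, and the p:q split. It is a minimal sketch under our own naming, assuming the reviews have already been tokenised into lists of words; none of the function names below come from the patent.

```python
import numpy as np
from collections import Counter

def bag_of_words(reviews, m=5000):
    """Build an [n, m] word-frequency matrix from tokenised reviews."""
    freq = Counter(word for review in reviews for word in review)
    vocab = [word for word, _ in freq.most_common(m)]     # top-m dictionary
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(reviews), m), dtype=np.float32)
    for row, review in enumerate(reviews):
        for word in review:
            col = index.get(word)
            if col is not None:
                X[row, col] += 1                          # word frequency
    return X, vocab

def one_hot(sentiments, i=2):
    """Encode integer sentiment labels 0..i-1 as one-hot rows of length i."""
    Y = np.zeros((len(sentiments), i), dtype=np.float32)
    Y[np.arange(len(sentiments)), sentiments] = 1
    return Y

def split(X, Y, p=7, q=3):
    """Split samples and labels into train/test sets at a ratio of p:q."""
    n_train = len(X) * p // (p + q)
    return (X[:n_train], Y[:n_train]), (X[n_train:], Y[n_train:])
```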
DESCRIPTION OF DRAWINGS
Figure 1 shows the data flow of our convolutional neural network.
Figure 2 shows the data flow of our convolutional neural network, substituted with real data.
Figures 3-5 show the fully-connected simple neural network.
Figures 6-7 show the mechanism of the optimizing method, dropout.
Figure 8 shows the mechanism of the functional chain.
DESCRIPTION OF PREFERRED EMBODIMENT
Data Processing & One-hot Encoding
To accomplish our final goal, the first and inevitable step is data preprocessing. As our data is given in the form of a tsv file, we must import a library to open such a file, and our choice is without doubt "pandas". Like any other library, it can be installed by typing "pip install pandas" at the command line (for Windows users), but note that this method is valid only when the file suffix is set to tsv in the read parameters. In the next few steps, we use regular expressions, provided by the library "re" once imported. Normally, the basic means of data preprocessing cover, but are not limited to, removing html tags, converting the text to all lower-case letters, and removing non-characters and meaningless stop words. We used two "for" loops to make sure the process goes through the whole file and produces the expected outcome. Using re.compile(), we transform the string expression into a compiled pattern for the following steps. Then we replace every non-character with an empty string and use the .lower() function provided by Python to convert the whole file to lowercase letters. The last yet crucial move is to import the nltk library, which stands for Natural Language Toolkit, to clean the stop words from the text and split the text into words, a step academically known as tokenization. We disregard the possible connections among words (which most likely exist and in a way affect the accuracy, although we still reached an accuracy over eighty-five percent), as our work is based on an old model called the "Naive Bayes Classifier". When all of the above is done, we apply the .append method within the Python interpreter to put the preprocessed data into an empty list "a" that we defined beforehand.
The following task is turning the labels, in this case the "sentiment" column of the data, into one-hot encoded labels. One-hot encoding is a process that converts categorical values into a numerical form that suits machine learning algorithms for better predictions. We chose one-hot encoding instead of label encoding because label encoding implies that categories with higher numerical values are generally "better data", which is obviously not the case. The same problem does not arise with one-hot encoding, as it is more of a "binarization" process that is far more objective and efficient, since not all data provided for machine learning is sequential; in other words, much of it is categorical. To some extent, this even adds one more feature for extraction. We assigned these movie reviews two values, zero and one, with one meaning positive and zero representing negative (the opposite convention would work just as well). Through one-hot encoding, we successfully pivoted ones into [0, 1] vectors and zeros into [1, 0] vectors.
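The preprocessing just described can be condensed into a short Python sketch. This is a reconstruction rather than the patent's verbatim code: the file name and column names are hypothetical, and the html-stripping expression is one common choice.

```python
import re
import pandas as pd
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

data = pd.read_csv('reviews.tsv', sep='\t')        # hypothetical tsv file

pattern = re.compile('[^a-zA-Z]')                  # compiled non-character pattern
stops = set(stopwords.words('english'))

clean_reviews = []
for review in data['review']:
    text = re.sub(r'<[^>]+>', ' ', review)         # remove html tags
    text = pattern.sub(' ', text).lower()          # drop non-characters, lowercase
    words = [w for w in text.split() if w not in stops]   # tokenise, drop stop words
    clean_reviews.append(words)

# One-hot encode the sentiment column: 1 (positive) -> [0, 1], 0 (negative) -> [1, 0]
labels = [[0, 1] if s == 1 else [1, 0] for s in data['sentiment']]
```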
Bag of Words
After the success of data cleaning, we conducted data processing and feature construction. We use the bag-of-words model, originally developed in the field of information retrieval, to construct text features. For a document, it is assumed that the order and syntax of the words are disregarded and only the occurrence of each word is considered. Suppose there are five categories of topics and our task is to determine which topic a document belongs to. In the training set, we have several documents whose topic types are known. We pick out some documents, each containing some words, and use these words to build the word bag. The word bag can take this form: {watch, sports, phone, like, Roman, ...}, and each document can then be converted into a histogram with each word on the horizontal axis and its occurrence count on the vertical axis. After that, normalization is carried out and the frequency of each word is taken as a feature of the document. This model ignores the grammar and word order of the text and converts every comment into a vector. We took 996 reviews and broke them down into individual words, then culled the top 5,000 words to form a dictionary. This dictionary is built from the training data and is applied again during the later testing process. The next step is to create a document vector that converts each free-text document into a text vector we can use as input or output for the machine-learning model. The simplest way to encode a word is to mark its presence as a Boolean, with 0 for absent and 1 for present.
Using any fixed order of the words in our dictionary, we can convert reviews to binary vectors. Words outside the dictionary are discarded, and we can use this generic method to extract features from any document in our corpus, which can then be used for modeling.
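Since the summary names sklearn for this step, a minimal sketch of the dictionary and vector construction might look as follows; max_features=5000 matches the dictionary size described above, and the Boolean-presence variant mentioned in the text is available through the binary=True flag.

```python
from sklearn.feature_extraction.text import CountVectorizer

# clean_reviews holds the tokenised reviews from the preprocessing sketch;
# CountVectorizer expects whole strings, so the tokens are joined back up.
docs = [' '.join(words) for words in clean_reviews]

vectorizer = CountVectorizer(max_features=5000)    # top-5000-word dictionary
X = vectorizer.fit_transform(docs).toarray()       # shape [n_reviews, 5000]
vocabulary = vectorizer.get_feature_names_out()    # the dictionary itself
```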
Convolutional Neural Network
When the data preprocessing is done, we use a neural network, which comprises four convolutional layers, two pooling layers, and two fully connected layers, to train our model. Besides neural networks, random forest is also a commonly used machine-learning algorithm; we will discuss their differences and explain why our final option is the neural network. First of all, the random forest algorithm is rather independent and conventional: each of its decision trees stands alone, whereas a neural network is closely bonded, with all of its neurons depending on one another. Secondly, the random forest algorithm can only process data provided in tabular form, which would have caused us a lot of inconvenience had we used it. In comparison, a neural network can deal with a variety of data forms, including audio, pictures, text, and so on. Since we chose the bag-of-words model, our best choice is the neural network; theoretically the random forest algorithm could achieve the same goal, but it is clearly unnecessary here. Putting random forest aside, we split the original text into "train" and "test" sets with a proportion of seven to three, adequate for both. The next step is to use the "reshape" feature of the NumPy library to change the shapes into a [17500, 5000, 1, 1] matrix for "train" and a [7500, 5000, 1, 1] matrix for "test", the result of splitting 25,000 movie reviews in total. For the inputs, there are a few parameters we need to explain and specify. To start with, we defined the batch size to be 64, neither too large nor too small, as either extreme could result in lower accuracy. The bigger the batch size, the bigger the learning rate should be, to keep the standard deviation of the gradient constant; accordingly, we set the base learning rate to 0.001. Since a learning rate that is too big causes oscillation due to overly large steps, it is critically important to have a high decay rate to prevent the model from overfitting; the decay is usually fastest when the decay rate approaches 1, so here we set the value to 0.99. Unlike an ordinary neural network, the method we apply is the convolutional neural network. Its advantage shows most significantly in its ability to extract local features, thereby reducing the input feature variables and the number of parameters as well. The same filter goes through the input matrix several times; in this case, we set up four convolutional layers in total. After the convolutional layers, we continued to use the max-pooling method, with one max-pooling layer after every two layers of convolution, to be precise. We set the patch size, and in_depth and out_depth are both 32 for every layer except the first, which is assigned the value of feature_col. The activation yields a result equal to the input if the input is greater than zero, and the input times a coefficient if the input is less than or equal to zero. The pooling scale is two: the largest value is selected from each pooling group to form a new matrix, extracting the maximal features and thus improving accuracy. The final step is to add two fully connected layers, in which every neuron in one layer connects to every neuron in the next. The number of input nodes of our first fully connected layer equals the feature size divided by four (since we added two max-pooling layers), times feature_col, times thirty-two, which is the out_depth of the fourth convolutional layer. The number of output nodes is one hundred and twenty-eight, and the activation function is still ReLU as discussed before; this process can be simply described as "wx + b", where w is the weight, x is the input value, and b is the bias. The second fully connected layer has the same number of input nodes as the output of the first, one hundred and twenty-eight; its number of output nodes is 2, and this time, since it is the last layer, it has no activation function.
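The architecture walked through above can be sketched in Keras as follows. This is a reconstruction under stated assumptions, not the patent's actual code: the kernel size (5, 1) is assumed, since the patch size is not stated, while the filter depth of 32, the pooling scale of 2, and the 128-node and 2-node FC layers follow the description.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

feature_size = 5000   # length of the bag-of-words vector (top 5000 words)

model = models.Sequential([
    layers.InputLayer(input_shape=(feature_size, 1, 1)),  # [n, 5000, 1, 1] input
    layers.Conv2D(32, kernel_size=(5, 1), padding='same', activation='relu'),
    layers.Conv2D(32, kernel_size=(5, 1), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 1)),                # 5000 -> 2500 features
    layers.Conv2D(32, kernel_size=(5, 1), padding='same', activation='relu'),
    layers.Conv2D(32, kernel_size=(5, 1), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 1)),                # 2500 -> 1250 features
    layers.Flatten(),                                     # (feature_size / 4) * 32 nodes
    layers.Dense(128, activation='relu'),                 # first FC layer
    layers.Dense(2),                                      # second FC layer: logits, no activation
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
```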
Optimize
The last thing we need to do is optimize the program. The optimization has two aspects: the first is the optimization of the code as a whole, and the second is the optimization of the design architecture. When we actually implemented the convolution, we found that each convolution layer was coded under a different name while the contents were the same. At this point, we chose to refactor the code as a whole to make it cleaner and faster. The goal of the design architecture optimization is to prevent our model from overfitting. Since our convolutional neural network is nonlinear, it overfits easily. Overfitting usually produces very few mistakes on the training set, but on new data in the test set the results are usually not accurate. Underfitting and overfitting are both harmful and need to be addressed. First, we use regularization, which usually takes one of two forms, the L1 and L2 norms. We adopted the L2 loss method because of its advantages: it is flexible and can be added to the loss function quickly and easily to achieve smooth control over the weights. Then we used dropout, applied during the training phase of the FC layers, not at the convolution layers. Dropout randomly drops some nodes at a given probability on the fully connected layers during the training stage to reduce overfitting, on which it has a strong inhibiting effect. The last thing we want to do is update the parameters; the ways to update parameters include gradient descent, Newton's method, momentum, Adam, etc. Here we use Adam, which has the best effect: comparing an animation of parameter updating methods, Adam is the fastest method with the least error.
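A minimal sketch of the three optimisations, in the same Keras style as the architecture sketch above; the regularisation factor, dropout rate, and decay interval are illustrative assumptions, as flagged in the comments.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, optimizers

# 1) L2 regularisation on the FC weights. tf.nn.l2_loss(w) computes
#    sum(w**2) / 2 -- the squared norm halved rather than square-rooted,
#    matching the description; Keras regularizers.l2 applies
#    factor * sum(w**2), so the halving can be folded into the factor.
fc1 = layers.Dense(128, activation='relu',
                   kernel_regularizer=regularizers.l2(0.0005))

# 2) Dropout between the FC layers, active only during training. The table
#    below reports dropout rates around 0.90-0.95, which read as TF1-style
#    keep probabilities; in Keras, rate is the fraction dropped, so a keep
#    probability of 0.90 would correspond to rate=0.10.
drop = layers.Dropout(rate=0.10)

# 3) Adam with an exponentially decaying learning rate. The base rate 0.001
#    and decay rate 0.99 come from the description; decay_steps=100 is an
#    assumed batch-size-linked interval.
schedule = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=100, decay_rate=0.99)
optimizer = optimizers.Adam(learning_rate=schedule)
```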
Table 1. The training data.

Run | Average Accuracy | Standard Deviation | Dropout Rate | Base Learning Rate | Decay Rate | Iteration Step
1   | 84.814 | 2.042 | 0.95 | 0.001 | 0.99 | 4000
2   | 84.771 | 2.185 | 0.95 | 0.001 | 0.99 | 3500
3   | 85.207 | 2.042 | 0.90 | 0.001 | 0.99 | 4000
4   | 49.514 | 1.718 | 0.90 | 0.005 | 0.99 | 4000
5   | 84.214 | 1.850 | 0.91 | 0.001 | 0.99 | 4000
6   | 84.542 | 2.163 | 0.92 | 0.001 | 0.99 | 4000
7   | 84.899 | 1.922 | 0.93 | 0.001 | 0.99 | 4000
8   | 83.642 | 2.294 | 0.94 | 0.001 | 0.99 | 4000
9   | 84.314 | 2.203 | 0.93 | 0.002 | 0.99 | 4000
10  | 82.685 | 2.259 | 0.93 | 0.003 | 0.99 | 4000
11  | 49.515 | 1.718 | 0.93 | 0.004 | 0.99 | 4000
12  | 84.671 | 2.215 | 0.90 | 0.001 | 0.98 | 4000
13  | 84.614 | 2.071 | 0.90 | 0.001 | 0.97 | 4000
14  | 84.786 | 2.018 | 0.90 | 0.001 | 0.96 | 4000

Claims (2)

1. A sentimental analysis system for film review based on deep learning, wherein the top 5000 high-frequency words are chosen and only 2 MaxPooling layers are needed, which makes the structure relatively simple and the processing speed higher.
2. The sentimental analysis system for film review based on deep learning according to claim 1, wherein a convolutional neural network, a structure with high accuracy, is used to build up our system; consequently, with appropriate parameters, our system can maintain a relatively high accuracy.
FIGURE 1
FIGURE 2
FIGURE 3
FIGURE 4
FIGURE 5 (fully-connected network: input layer, hidden layer, output layer)
AU2019101147A 2019-09-30 2019-09-30 A sentimental analysis system for film review based on deep learning Ceased AU2019101147A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101147A AU2019101147A4 (en) 2019-09-30 2019-09-30 A sentimental analysis system for film review based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101147A AU2019101147A4 (en) 2019-09-30 2019-09-30 A sentimental analysis system for film review based on deep learning

Publications (1)

Publication Number Publication Date
AU2019101147A4 true AU2019101147A4 (en) 2019-10-31

Family

ID=68341989

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101147A Ceased AU2019101147A4 (en) 2019-09-30 2019-09-30 A sentimental analysis system for film review based on deep learning

Country Status (1)

Country Link
AU (1) AU2019101147A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144108A (en) * 2019-12-26 2020-05-12 北京百度网讯科技有限公司 Emotion tendency analysis model modeling method and device and electronic equipment
CN111144108B (en) * 2019-12-26 2023-06-27 北京百度网讯科技有限公司 Modeling method and device of emotion tendentiousness analysis model and electronic equipment
CN112580351A (en) * 2020-12-31 2021-03-30 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN112580351B (en) * 2020-12-31 2022-04-19 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation

Similar Documents

Publication Publication Date Title
Onan Sentiment analysis on product reviews based on weighted word embeddings and deep neural networks
CN110210037B (en) Syndrome-oriented medical field category detection method
Yasen et al. Movies reviews sentiment analysis and classification
KR102155768B1 (en) Method for providing question and answer data set recommendation service using adpative learning from evoloving data stream for shopping mall
Beysolow Applied natural language processing with python
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
Kulkarni et al. Deep learning for NLP
CN110580287A (en) Emotion classification method based ON transfer learning and ON-LSTM
Gangadharan et al. Paraphrase detection using deep neural network based word embedding techniques
Rodrigues et al. Machine & deep learning techniques for detection of fake reviews: A survey
Huang et al. Text classification with document embeddings
AU2019101147A4 (en) A sentimental analysis system for film review based on deep learning
CN112036189A (en) Method and system for recognizing gold semantic
Ribeiro et al. Acceptance decision prediction in peer-review through sentiment analysis
Villmow et al. Automatic keyphrase extraction using recurrent neural networks
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
Rezaei et al. Hierarchical three-module method of text classification in web big data
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
Basarslan et al. Sentiment analysis with various deep learning models on movie reviews
CN114003773A (en) Dialogue tracking method based on self-construction multi-scene
M Alashqar A Classification of Quran Verses Using Deep Learning
Falzone et al. Measuring similarity for technical product descriptions with a character-level siamese neural network
Hameed User ticketing system with automatic resolution suggestions
Padia et al. Automating class/instance representational choices in knowledge bases
Zouari French AXA insurance word embeddings: Effects of fine-tuning bert and camembert on AXA france’s data

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry