AU2021102725A4 - Sentiment Analysis of Human being with Effective Word Embedding Methodologies - Google Patents
- Publication number
- AU2021102725A4
- Authority
- AU
- Australia
- Prior art keywords
- words
- sentiment
- word
- text
- embeddings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G06F40/30 — Electric digital data processing; handling natural language data; semantic analysis
- G06F40/205 — Handling natural language data; natural language analysis; parsing
- G06F40/279 — Handling natural language data; natural language analysis; recognition of textual entities
- G06N20/00 — Computing arrangements based on specific computational models; machine learning
- G06N3/02 — Computing arrangements based on biological models; neural networks
Abstract
The present disclosure relates to a method of enhancing the performance of word embedding approaches by integrating sentiment-based information. A method 100 of enhancing the performance of word embedding approaches by integrating sentiment-based information comprises the following steps: at step 102, cleaning text by considering individual tokens of said text, consisting of words, numbers or symbols, and removing symbols and hypertext that are not relevant to the sentiment presented by said text; at step 104, parsing sentences into word sets, representing each sentence as a group of words; at step 106, generating word embeddings to represent words by preparing the text in the form of numerical inputs; at step 108, generating averaged sentence vectors by taking the average of the vectors of all words contained in a sentence, wherein the averaged sentence vector is extended to obtain vectors for the entire input text; and at step 110, categorizing into a positive or negative class, or into a 1-star to 5-star rating, using classifiers.
Description
Sentiment Analysis of Human being with Effective Word Embedding Methodologies
The present disclosure relates to a method of enhancing performance of word embedding approaches by integrating sentiment-based information.
For the sentiment analysis task, several attempts have been made to solve problems in a multitude of domains using machine learning methods, which were otherwise difficult owing to the intractability of processing large-scale data with traditional approaches. For example, polarity detection, i.e., categorizing an opinion as positive or negative, is essentially a classification problem, and hence classifiers like the support vector machine (SVM), naive Bayes (NB), random forest (RF), etc. have proved to be extremely useful and efficient. Many research works have been carried out using supervised, unsupervised and semi-supervised techniques on all kinds of domains, such as product reviews, movie reviews, news and tweets. The application of naive Bayes, maximum entropy and support vector machine classifiers to sentiment analysis has been demonstrated in the initial works in this domain, and this work became the basis of many further applications of supervised learning methods. A movie review dataset has also been presented that is used as a standard dataset for sentiment analysis tasks. A technique to extract sentiments from multimodal data has been provided, combining language-based formalization with a machine learning approach and using a hidden Markov model to represent the extracted emotions for the purpose of classification. Unsupervised learning methods have also been applied, employing part-of-speech tagging to identify recommendable reviews. Recently, semi-supervised learning techniques have been described for sentiment analysis in cases where labelled data is inadequately available. The solution of a multi-dimensional classification problem was demonstrated using a semi-supervised Bayesian classifier to identify subjectivity, polarity and will-to-influence aspects.

The invention (EP2515242B1) relates to a sentiment classifier in a machine learning system for sentiment classification of content associated with a domain comprising several aspects, wherein lexicon knowledge is incorporated into the machine learning system and wherein an aspect describes a feature of the domain.

The invention (US20180232362A1) relates to a method and a system where users receive information which must be filtered, processed, analysed, reviewed, consolidated and distributed or acted upon. Prior art tools that automatically process content to assign sentiment to it are ineffective, as essential aspects such as context are not considered. Embodiments of the invention provide automatic context-based sentiment classification of content in terms of both the sentiments expressed and their intensity. Further, a content set is analysed to rapidly establish an "at-a-glance" assessment of the key topics/themes present within the content set and to sentimentally annotate each. Importantly, embodiments of the invention also allow a user to establish the basis for the sentiment associated with an item or set of content, i.e., make it explainable. Further embodiments provide for the establishment of psychological tone for sentiments, where the sentiments and psychological tones can be tuned to the context or domain of the content.
The invention (CN106383815B) provides a neural network sentiment analysis method combining user and product information, comprising: obtaining the text segment information to be analysed, the user information and the target information to be evaluated; constructing semantic vectors of the sentences and text segments from the text segment information using a long short-term memory (LSTM) neural network model; obtaining enhanced vectors of the sentences and text segments from their semantic vectors, the user information and the target information to be evaluated; and obtaining the sentiment classification of the text segment from its enhanced vector. The method proposes an effective neural network sentiment classification model by combining the information of user and product, introducing attention mechanisms at the word level and sentence level of the text in connection with the characteristic information of user and product, so that sentiment classification performance is greatly improved.

The invention (US20180165554A1) relates to a method of modelling data, comprising: training an objective function of a linear classifier, based on a set of labelled data, to derive a set of classifier weights; defining a posterior probability distribution on the set of classifier weights of the linear classifier; approximating a marginalized loss function for an autoencoder as a Bregman divergence, based on the posterior probability distribution on the set of classifier weights learned from the linear classifier; and classifying unlabelled data using the autoencoder according to the marginalized loss function.

The invention (US20170308523A1) relates to a system and a method for classifying text messages, such as social media messages, into sentiment valence categories. The system comprises a module for decomposing text messages, a module for cleaning text messages, a module for producing feature data of text messages, and a module for classifying text messages into sentiment valence categories. The module for decomposing text messages is configured to: receive a text message; parse the text message into separate portions in response to parsing criteria based on sentence delimiters, wherein the separate portions are sentences, phrases and words; and rejoin at least some of the separate portions of the text message into sentences in response to predefined linguistic conditions.

The invention (US9704097B2) relates to the automatic construction of training data for training a neural network usable for electronic sentiment analysis. For example, an electronic communication usable for training the neural network and including multiple characters can be received. A sentiment dictionary including multiple expressions mapped to multiple sentiment values representing different sentiments can be received. Each expression in the sentiment dictionary can be mapped to a corresponding sentiment value. An overall sentiment for the electronic communication can be determined using the sentiment dictionary. Training data usable for training the neural network can be automatically constructed based on the overall sentiment of the electronic communication. The neural network can be trained using the training data. A second electronic communication including an unknown sentiment can be received.
At least one sentiment associated with the second electronic communication can be determined using the neural network. In order to overcome the aforementioned drawbacks, there is a need to develop a method of enhancing the performance of word embedding approaches by integrating sentiment-based information.

SUMMARY OF THE INVENTION
The present disclosure relates to a method of enhancing the performance of word embedding approaches by integrating sentiment-based information. The proposed method is a supporting step in the automatic analysis of sentiments and ratings, which reduces manual effort and time. The technique captures sentiment at the word level, in correlation to its neighbouring and similar words. It improves the performance of sentiment analysis, as opposed to using generalized pre-trained word embeddings. The method can work on smaller as well as larger datasets and provide reasonably good performance. It works well on both Word2Vec and GloVe embeddings and enhances their performance, which demonstrates that it is generalized enough to work on different embedding methods. The experiment highlights the suitability of the CNN, SVM and MLP classifiers towards text processing in general and sentiment analysis in particular.
This disclosure attempts to achieve more accurate results in sentiment classification, using a mechanism to modify pre-trained word embeddings, namely Word2Vec and GloVe. These embeddings are modified based on each word's sentiment value, referred from an existing sentiment corpus. Words which are sentimentally similar are brought closer to one another using the PSO algorithm. This helps the classifier learn their interrelationships better and, in turn, provide higher accuracy. The mechanism is tested on various datasets and classifiers, and an improvement in accuracy is observed. Future work can involve speeding up the process of minimizing the distances between word vectors. Besides PSO, other optimization methods can also be used and tested for faster or better results.
In an embodiment, a method 100 of enhancing the performance of word embedding approaches by integrating sentiment-based information comprises the following steps: at step 102, cleaning text by considering individual tokens of said text, consisting of words, numbers or symbols, and removing symbols and hypertext that are not relevant to the sentiment presented by said text; at step 104, parsing sentences into word sets, representing each sentence as a group of words; at step 106, generating word embeddings to represent words by preparing the text in the form of numerical inputs; at step 108, generating averaged sentence vectors by taking the average of the vectors of all words contained in a sentence, wherein the averaged sentence vector is extended to obtain vectors for the entire input text; and at step 110, categorizing into a positive or negative class, or into a 1-star to 5-star rating, using classifiers.
To further clarify the advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a method of enhancing performance of word embedding approaches by integrating sentiment-based information in accordance with an embodiment of the present disclosure.

Figure 2 illustrates (a) a schematic representation of the workflow; (b) the task performed by a SOM; and (c) the steps of the PSO algorithm in accordance with an embodiment of the present disclosure.

Figure 3 illustrates (A) a ROC curve comparison of the top 3 classifiers on the IMDb dataset using generalized and modified Word2Vec; and (B) a ROC curve comparison of the top 3 classifiers on the IMDb dataset using generalized and modified GloVe in accordance with an embodiment of the present disclosure.

Figure 4 illustrates (A) a ROC curve comparison of the top 3 classifiers on the Yelp dataset using generalized and modified Word2Vec; and (B) a ROC curve comparison of the top 3 classifiers on the Yelp dataset using generalized and modified GloVe in accordance with an embodiment of the present disclosure.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure, so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.

DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components proceeded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting. Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
Referring to Figure 1, a method of enhancing the performance of word embedding approaches by integrating sentiment-based information is illustrated in accordance with an embodiment of the present disclosure. The method 100 comprises the following steps: at step 102, cleaning text by considering individual tokens of said text, consisting of words, numbers or symbols, and removing symbols and hypertext that are not relevant to the sentiment presented by said text; at step 104, parsing sentences into word sets, representing each sentence as a group of words; at step 106, generating word embeddings to represent words by preparing the text in the form of numerical inputs; at step 108, generating averaged sentence vectors by taking the average of the vectors of all words contained in a sentence, wherein the averaged sentence vector is extended to obtain vectors for the entire input text; and at step 110, categorizing into a positive or negative class, or into a 1-star to 5-star rating, using classifiers.
Figure 2 illustrates (a) a schematic representation of the workflow; (b) the task performed by a SOM; and (c) the steps of the PSO algorithm in accordance with an embodiment of the present disclosure. The schematic flow diagram of the proposed sentiment analysis model using an effective word embedding approach is shown in Fig. 2a. The operational steps of the proposed model are similar to those of a generic sentiment analysis model, except for the operation performed in the word embedding step. This disclosure aims to provide an improved method of informative word vector representation (the third step in Fig. 2a). A brief description of the operational steps of the proposed sentiment analysis model is given below. The different steps of the workflow are detailed as follows.
Preprocessing

The input text is generally in the form of a review, tweet, or blog post. The purpose is to categorize it into a class (positive/negative, etc.) or a rating (1 star/5 stars, etc.). The input text is unstructured and noisy. The first step is to clean the text by considering the individual tokens of the text, such as words, numbers or symbols, and removing symbols, hypertext, etc. which are not relevant to the sentiment presented by the text. Stop words like "a", "the", "of", "is", etc. are also removed to provide better understandability. Finally, the words are stemmed, i.e. reduced to their base form, by removing tense and converting them to the singular. In this way, the general preprocessing is done.
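A minimal Python sketch of this step, using NLTK (the package named later in the experimental setup), may look as follows; the regular expressions and the choice of the Porter stemmer are illustrative assumptions rather than the disclosed implementation.

```python
# Illustrative preprocessing sketch (assumed details): strip hypertext and
# symbols, remove stop words, and stem words to their base form.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)       # drop hypertext/HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop numbers and symbols
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]  # reduce to base form

print(clean_text("The movie was <b>surprisingly</b> good!!!"))
# -> ['movi', 'surprisingli', 'good']
```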
Parsing
This step deals with the specific preprocessing required by the model used in this work. Once cleaned, the reviews are parsed sentence-wise, representing each sentence as a group of words. A corpus is created out of all the sentences split in this manner. This is done because the method considers reviews as an aggregate of words, which together represent the sentiment of the entire review.
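As a sketch (assuming NLTK tokenizers), the sentence-wise corpus can be built like this:

```python
# Sketch of the parsing step: split each review into sentences and
# represent each sentence as a group (list) of words.
from nltk.tokenize import sent_tokenize, word_tokenize

def build_corpus(reviews):
    corpus = []
    for review in reviews:
        for sentence in sent_tokenize(review):
            corpus.append(word_tokenize(sentence.lower()))
    return corpus

print(build_corpus(["Great food. Terrible service."]))
# -> [['great', 'food', '.'], ['terrible', 'service', '.']]
```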
Generating word embeddings to represent words

This step deals with feature representation, i.e., preparing the text in the form of numerical inputs, to make it ready to be fed into a classifier for analysis. In this step, a word embedding model is taken, which represents each word as a numeric vector of fixed dimensions. Various techniques exist for this, and the authors take neural network based embedding methods, namely Word2Vec and GloVe, as explained previously. Instead of using generalized pre-defined word embeddings, the authors modify the word vectors to include the sentiment of the word, based on the part of speech the word belongs to. The modification enhances the sentiment classification performance of the model that is trained on the modified word embeddings. This modification is done based on how close the word is to other words that share a similar sentiment with it. In this work, the authors have modified word embeddings generated by Word2Vec and GloVe using sentiment information based on a neural network algorithm known as the self-organizing map (SOM). The modification takes place based on sentiments, i.e., sentimentally similar words (e.g., good and awesome) are clustered together, while sentimentally dissimilar words (e.g., good and bad) are moved farther from each other. The modified word embeddings are then used for sentiment analysis, and tested on different datasets and classifiers. The technique proposed in this disclosure for modification of word embeddings can be described according to the following algorithm.
Proposed algorithm - adjusting word embeddings
1. Take a set of generalized pre-trained word embeddings, created by Word2Vec or GloVe.
2. Use a sentiment lexicon which contains words and their sentiment ratings.
3. Identify the affect words using Part-of-Speech (PoS) tagging.
4. Create a two-dimensional (2-D) mapping of the affect words in the sentiment lexicon, using a Self-Organizing Map (SOM).
5. For an affect word A, find its pre-trained word embedding W.
6. Find its grid location G(x, y) from the 2-D mapping.
7. Find all other words present in G, which form the set of sentimentally similar words for A.
8. Adjust the word embedding W to W', using the Particle Swarm Optimization (PSO) algorithm with Eqn. 4 as its fitness function.
9. Replace the pre-trained word embedding W with the modified word embedding W'.
10. Repeat steps 5 to 9 for all affect words in the vocabulary.

The algorithm steps are detailed as follows.
Initial preparation (Steps 1-3)

Initially, a sentiment lexicon is taken which contains sentiment scores of various words. For this work, the authors take E-ANEW as the sentiment lexicon, which contains words and their valence scores in a range of 0 to 10, 0 being the most negative and 10 being the most positive. Part-of-speech tagging is done to identify the affect words (words that convey feeling or emotion) in the lexicon.
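A sketch of these preparation steps is shown below; the lexicon file format, the column order and the set of PoS tags treated as affect-bearing are assumptions made for illustration.

```python
# Sketch of steps 1-3: load a sentiment lexicon of valence scores and keep
# only affect words, identified here via NLTK PoS tags for adjectives,
# adverbs and verbs (an assumed choice of tags).
import nltk

def load_lexicon(path):
    lexicon = {}
    with open(path) as f:
        next(f)                              # skip the header line
        for line in f:
            word, valence = line.strip().split(",")[:2]
            lexicon[word] = float(valence)   # valence score in [0, 10]
    return lexicon

def find_affect_words(lexicon):
    affect_tags = ("JJ", "RB", "VB")         # adjectives, adverbs, verbs
    tagged = nltk.pos_tag(list(lexicon))
    return {w: lexicon[w] for w, tag in tagged if tag.startswith(affect_tags)}
```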
Finding sentimentally alike words (Steps 4-7)

The identified affect words are considered and a self-organizing map is created to cluster these words based on their sentiment scores. The concept of a self-organizing map is explained as follows.
Self-organizing maps

The concept behind the SOM is to create a mapping that projects multi-dimensional data onto a simple lower-dimensional grid, generally two-dimensional, especially suitable for visualization. It works as a clustering method, since it preserves the proximity of data points in the generated 2-D map. Experiments show that its clustering performance is comparable with the most popular clustering techniques like k-means. The SOM is an unsupervised neural network having only an input and an output layer. It follows a competitive learning approach to adjust the weights of the neurons in the output layer. It preserves topological connections in the input data, which means data points which are closer to each other in the multi-dimensional space are also closer to each other in the output two-dimensional grid. Figure 2b shows the concept of a SOM. Every neuron in the output grid has a weight vector. The points in the input space are mapped onto one of the points in the output grid, whose weight vector is the nearest to the input point. Most of the time this leads to multiple input data points being mapped to the same output data point. This helps in simplifying visualization of the data, as well as clustering the data. A simple algorithmic approach to implement a SOM is described as follows.
Algorithm - implementing a SOM

1. Let the input data be a set of points $I = \{i_1, i_2, \ldots, i_n\} \in \mathbb{R}^m$, where $m$ is the number of dimensions. Let the output grid consist of nodes $O = \{o_{xy} : x \in [0, p],\ y \in [0, q]\}$, where $p \times q$ is the output grid size. Let $\alpha$ be the learning rate and $r$ the initial neighbour radius.
2. Initialize each node's weight randomly: $W = \{w_{xy} : x \in [0, p],\ y \in [0, q]\}$.
3. For input data point $i_k$, find its distance from all output nodes, i.e. $d_{xy} = \mathrm{dist}(i_k, w_{xy})$, where $\mathrm{dist}()$ is the Euclidean distance.
4. Find the node which gives the minimum distance, denoted by $o_{win}$, and its associated weight $w_{win}$.
5. Locate all neighbouring nodes of $o_{win}$ using a neighbourhood function $N(w_{win}, w_{xy}, r)$, which is a function of the iteration number and the initial neighbour radius, for the two nodes $w_{win}$ and $w_{xy}$.
6. Update the weights of all neighbouring nodes to move them closer to $o_{win}$. Mathematically,

$$w_{xy} = w_{xy} + \alpha \cdot \mathrm{dist}(i_k, w_{xy}) \cdot N(w_{win}, w_{xy}, r) \tag{1}$$

7. Decrease the values of $\alpha$ and $r$.
8. Repeat steps 3-6 until $\alpha$ approaches 0.
The neighbourhood function $N()$ is designed such that the nodes closer to the winner node $o_{win}$ are updated by a larger amount, while the farther nodes are updated by a smaller amount. As the number of iterations progresses, the neighbour radius decreases, which isolates the winner node more and more from the effect of its neighbours. On the other hand, the learning rate decreases over time, which eventually leads to convergence of the algorithm. The SOM clustering is done taking the sentiment ratings of the words as input vectors. Words in the same output grid cell are considered sentimentally similar to each other, while words in distant cells are considered sentimentally dissimilar. The SOM also maps this clustering onto a two-dimensional grid, which makes it easy and less time-consuming to access the closest words. Thus, words with similar sentiment scores tend to be allotted to the same or nearby grid points in the map, whereas words with dissimilar scores are allotted to grid points which are farther apart. After this map is created, it is used for the next step, which is the modification of the word embeddings of affect words.
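A compact sketch of this clustering, using the third-party MiniSom package (a library choice assumed here; the disclosure does not name one), is given below. The sentiment ratings serve as the input vectors, and the winning grid cell of each word gives its location G(x, y).

```python
# Sketch of clustering affect words by sentiment score with a SOM.
import numpy as np
from collections import defaultdict
from minisom import MiniSom

def cluster_by_sentiment(affect_words, p=10, q=10, iters=1000):
    words = list(affect_words)
    scores = np.array([[affect_words[w]] for w in words])  # 1-D inputs
    som = MiniSom(p, q, input_len=1, sigma=1.0, learning_rate=0.5)
    som.train_random(scores, iters)
    cells = defaultdict(list)
    for w, s in zip(words, scores):
        cells[som.winner(s)].append(w)       # grid location G(x, y)
    return cells                             # cell -> sentimentally alike words
```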
Modifying words according to the SOM (Step 8)

For a given affect word, first all the words in the same grid point are identified. Then the vector of the word is modified, to move it closer to the vectors of all other affect words in the same grid point. This modification is done keeping in mind that the word maintains almost equal distances from all similar words, while not moving too far from its original vector, so that its identity is preserved. By performing this modification, it is ensured that affect words are closer to sentimentally alike words and farther from sentimentally different words, because the SOM makes sure that words with similar sentiment are clustered closer to each other and words with different sentiments are clustered separately. Thus, words like "good" and "awesome" will be closer to each other, being sentimentally positive. Similarly, words like "bad" and "terrible" will be closer to each other, being sentimentally negative. At the same time, both these groups of words will be farther from each other, due to the modification technique applied. As an example, the word "good" is progressively moved towards similar words like "great", "beautiful" and "fantastic", while simultaneously moving away from the dissimilar word "bad".

Let $W$ be the word embedding of a given affect word, generated by a word embedding algorithm (Word2Vec or GloVe). The modified vector $W'$ is generated by shifting $W$ closer to the sentimentally similar words, but not too far from its original position $W$. Hence, this modification can be represented as a mathematical expression having two parts: the distance of $W'$ from $W$, i.e. $D(W', W)$, and the total distance of $W'$ from the sentimentally alike words, i.e. $\sum_{k=1}^{n} D(W', W_k)$, where $W_k$ is the $k$-th sentimentally alike word to $W$ and $n$ is the total number of sentimentally alike words. There are various ways to calculate the distance between two vectors; the authors use one of the most widely used methods, the Euclidean distance. $ED(x, y)$ denotes the Euclidean distance between two vectors $x$ and $y$, given by

$$ED(x, y) = \sqrt{\sum_{d=1}^{m} (x_d - y_d)^2} \tag{2}$$

where $m$ is the number of dimensions of the vectors. $D(x, y)$ denotes the square of the Euclidean distance:

$$D(x, y) = ED^2(x, y) \tag{3}$$
To find the optimal value of $W'$, the sum of the two distances mentioned above needs to be minimized. Depending on how far it is desirable to shift the embedding from its original position, a corresponding weightage can be assigned to each distance component. Thus, the shifting process can be expressed as
$$\text{Minimize} \left\{ \alpha \cdot D(W', W) + (1 - \alpha) \cdot \sum_{k=1}^{n} D(W', W_k) \right\} \tag{4}$$

Here, $\alpha \in [0, 1]$ is the weightage parameter assigned to control the movement of the word embedding; it maintains the proportion between the two parts of the equation. A greater value of $\alpha$ keeps $W'$ closer to the original embedding $W$, whereas a smaller value of $\alpha$ moves $W'$ closer to the sentimentally alike words. Equation 4 is the objective function for this problem, which is solved using an optimization algorithm, namely Particle Swarm Optimization (PSO).

The concept of particle swarm optimization follows the behaviour of birds searching for food. It is a swift and efficient technique that performs optimization over large search spaces, which is an NP-complete problem. It uses a population of birds/particles starting at random locations, which search their neighbourhood for food, i.e., solutions. Depending on their proximity to the solutions, they decide whether to proceed in that direction or change the search direction. Each bird retains its own best position (local best) and uses it to direct its search. This best position is also shared with the other birds, to find the best position among all birds (global best). Together, the local best and global best direct the new velocity, and consequently the new position, so that the birds converge at the best solution after a certain number of iterations. This makes the method work in parallel, thereby achieving optimal results in a feasible amount of time. Figure 2c shows the working steps of a basic PSO algorithm. Here, the velocity and position of the particles are determined as follows:

$$V_i^{new} = w \cdot V_i^{old} + a_1 \cdot r_1 \cdot (pbest_i^{old} - P_i^{old}) + a_2 \cdot r_2 \cdot (gbest^{old} - P_i^{old}) \tag{5}$$

$$P_i^{new} = P_i^{old} + V_i^{new} \tag{6}$$

In these equations, $V_i^{old}$ is the velocity of the $i$-th particle in the previous generation, $V_i^{new}$ is the velocity of the $i$-th particle in the current generation, $P_i^{old}$ is the position of the $i$-th particle in the previous generation, $P_i^{new}$ is the position of the $i$-th particle in the current generation, $a_1$ and $a_2$ are the accelerating factors for the local and global information respectively, $r_1$ and $r_2$ are random values between 0 and 1, and $w$ is the inertia weight. $pbest_i$ is the personal best of the $i$-th particle and $gbest$ is the global best among all particles.
Here, PSO is employed to minimize the fitness function given in Equation 4. There are numerous swarm-based methods available for optimization; the authors chose PSO because of its simplicity and fast convergence in less time and fewer iterations, which is beneficial for the task at hand. For a given affect word, a population of particles is taken, where each particle represents a solution, i.e., a possible modified word embedding for the word. The fitness of each particle is assessed, and the local and global bests are found. New velocities and positions are calculated. This process goes on iteratively till the best solution is found, which is the best possible modified word embedding for the affect word. For every affect word, the PSO algorithm is re-executed with a fresh population of particles. At the end, the set of modified word embeddings is obtained for the vocabulary of the pre-trained word embeddings.
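A compact NumPy sketch of this optimization is given below; the population size, inertia weight and accelerating factors are assumed values for illustration.

```python
# Sketch of PSO minimizing Eqn. 4: shift embedding W towards its
# sentimentally alike words W_k while staying near its original position.
import numpy as np

def fitness(Wp, W, alike, alpha=0.5):
    # Eqn. 4: alpha*D(W', W) + (1-alpha)*sum_k D(W', W_k),
    # with D the squared Euclidean distance
    return (alpha * np.sum((Wp - W) ** 2)
            + (1 - alpha) * np.sum((Wp - alike) ** 2))

def pso_adjust(W, alike, n_particles=30, iters=100, w=0.7, a1=1.5, a2=1.5):
    np.random.seed(0)
    P = W + 0.1 * np.random.randn(n_particles, W.shape[0])  # positions
    V = np.zeros_like(P)                                    # velocities
    pbest = P.copy()
    pbest_fit = np.array([fitness(p, W, alike) for p in P])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iters):
        r1 = np.random.rand(n_particles, 1)
        r2 = np.random.rand(n_particles, 1)
        V = w * V + a1 * r1 * (pbest - P) + a2 * r2 * (gbest - P)  # Eqn. 5
        P = P + V                                                  # Eqn. 6
        fit = np.array([fitness(p, W, alike) for p in P])
        better = fit < pbest_fit
        pbest[better] = P[better]
        pbest_fit[better] = fit[better]
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest  # the modified embedding W'
```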
Final steps (Steps 9-10)
The obtained word embedding is used to replace the original word embedding in the pre-trained Word2Vec or GloVe set. This is done for all affect words in the vocabulary. The remaining word embeddings stay as they are. The further steps of training the classifier can now proceed, using the modified set of word embeddings.
Averaged word vector generation
After obtaining the modified word embeddings, sentence vectors are generated by taking the average of the vectors of all words contained in a sentence. This is extended to obtain vectors for the entire input text. Now the input is ready in numeric form to be passed to a classifier. The set of input vectors is then split into training and testing sets.
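A minimal sketch of the averaging, assuming a gensim-style keyed-vector lookup for the (modified) embeddings:

```python
# Sketch: a sentence vector is the mean of its word vectors, and a review
# vector is the mean over its sentence vectors.
import numpy as np

def sentence_vector(words, embeddings, dim=300):
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def text_vector(sentences, embeddings, dim=300):
    return np.mean([sentence_vector(s, embeddings, dim) for s in sentences],
                   axis=0)
```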
Classification
The training set feature vectors are used to train a classifier, against their class or rating. The authors experiment with various classifiers to find more suitable ones for sentiment analysis. The trained classifier is then tested on the testing set vectors to predict their class or rating. Parameters like accuracy, precision, recall and kappa coefficient are utilized to quantify the performance.
Implementation details
Datasets used
To demonstrate the effectiveness of the proposed sentiment analysis using a modified word embedding approach, we have used two different datasets in the present study. One is the Internet Movie Database (IMDb) movie reviews dataset, and the other is the Yelp dataset that contains restaurant reviews and corresponding star ratings. The IMDb dataset is a collection of movie reviews and their ratings. This dataset contains 25,000 reviews and their sentiment scores. The scores are scaled as per the ratings, i.e. ratings less than 5 have a score of 0, and ratings above 7 have a score of 1, in order to make the dataset suitable for a binary classification problem.
The Yelp dataset consists of 10,000 restaurant reviews, with star ratings between 1 and 5. However, the dataset is reduced to include only the 1-star and 5-star ratings, in order to make it a binary classification problem.
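As a sketch, assuming the reviews are available as CSV files with hypothetical file and column names, the binary labelling described above can be prepared as follows:

```python
# Sketch of dataset preparation (file names and columns are hypothetical):
# IMDb ratings < 5 -> label 0, ratings > 7 -> label 1; Yelp reduced to
# 1-star and 5-star reviews only.
import pandas as pd

imdb = pd.read_csv("imdb_reviews.csv")
imdb = imdb[(imdb["rating"] < 5) | (imdb["rating"] > 7)]
imdb["label"] = (imdb["rating"] > 7).astype(int)

yelp = pd.read_csv("yelp_reviews.csv")
yelp = yelp[yelp["stars"].isin([1, 5])]
yelp["label"] = (yelp["stars"] == 5).astype(int)
```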
Classifiers used
For a comparative analysis of the proposed sentiment classification model, we have used different classifiers that are given both the generalized word embedding (Word2Vec and GloVe) based inputs and their modified versions. The classifiers considered here are Gaussian naive Bayes, random forest, decision tree, gradient boosting, support vector machine, multi-layer perceptron, convolutional neural network (CNN), and CNN layered with long short-term memory (LSTM) classifiers.
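A sketch of the scikit-learn side of this line-up is shown below; the hyperparameters are assumptions, and the CNN and CNN-LSTM models would be built separately in Keras on the same input vectors.

```python
# Sketch of the classical classifiers (assumed hyperparameters).
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

classifiers = {
    "Gaussian NB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "SVM": SVC(kernel="rbf", probability=True),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,)),
}
```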
Experimental setup
The simulation is performed using Python 3.5 on an Intel i5 desktop with 32 GB RAM and a 2.71 GHz clock frequency. The significant packages used are NLTK for NLP tasks, Keras, Theano and TensorFlow for the implementation of deep architectures, and Scikit-learn and SciPy for standard machine learning architectures and performance measurements. For the classification task, the whole dataset is partitioned randomly into two parts: one part is used for training the model and the other for testing the performance. Three different train-test split ratios (80-20%, 70-30%, and 60-40%) are prepared for the experiment. Each model is trained and tested over 10 rounds of experiments with the same train-test splits, and the average of the 10 readings is reported in the results section. The random splits test the consistency of the model performance over varying sets of training and testing data across the multiple iterations.
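A sketch of this protocol (the seeding scheme, which keeps the splits identical across models, is an assumption):

```python
# Sketch: 10 rounds of random train-test splits per ratio, reporting the
# mean accuracy and its standard deviation (Astd).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate(clf, X, y, test_size=0.2, rounds=10):
    accs = []
    for seed in range(rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        clf.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.mean(accs), np.std(accs)
```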
Performance measurement indices
In the present study, various performance measurement indices are used for comparative analysis; they are Precision (P), Recall (R), Accuracy (A), Kappa Score (K), and the Receiver Operating Characteristic (ROC) curve. True Positives (TP) is the number of data items predicted to be true which are actually true in the dataset. False Positives (FP) is the number of data items predicted to be true which are actually false in the dataset. True Negatives (TN) is the number of data items predicted to be false which are actually false in the dataset. False Negatives (FN) is the number of data items predicted to be false which are actually true in the dataset. Accuracy (A) signifies the fraction of correct classifications out of the total number of data items provided. It represents the ability of the model to correctly identify data items belonging to each of the classes. Precision (P) represents the fraction of positive data items correctly classified out of the positive data items provided. It shows the ratio of the relevant cases found correctly, out of all the cases that are found to be relevant. Recall (R) is the fraction of positive data items correctly classified out of the total data items provided. It shows the ratio of the relevant cases found correctly, out of all the cases that are actually relevant in the entire dataset.

Kappa score (K) compares the obtained accuracy with the accuracy of a random system. It controls for data items that might have been correctly classified by chance, by measuring how closely the data items classified by the model match the data items labelled as ground truth. A kappa value of 1 denotes a perfect match, while a kappa value of 0 denotes no match. Equation 7 gives the formula for the Kappa score.
$$\text{Kappa Score}(K) = \frac{N \cdot (TP + TN) - X}{N^2 - X} \tag{7}$$

where

$$X = (TP + FP) \cdot (TP + FN) + (TN + FN) \cdot (TN + FP) \tag{8}$$
Here, N denotes the total number of data items in the dataset. Receiver Operating Characteristic (ROC) curves are used to provide a graphical analysis of the results. A ROC curve plots the true positive rate against the false positive rate for each classifier. The area under the curve (AUC) provides an aggregate measure of performance, i.e., the greater the area, the better the model.
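These indices can be computed with scikit-learn (named in the experimental setup), as sketched below; the exact reporting code is an assumption.

```python
# Sketch of computing the reported indices: precision, recall, accuracy,
# kappa score, and the ROC curve / AUC.
from sklearn.metrics import (accuracy_score, auc, cohen_kappa_score,
                             precision_score, recall_score, roc_curve)

def report(y_true, y_pred, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score)  # y_score: positive-class prob.
    return {
        "P": precision_score(y_true, y_pred),
        "R": recall_score(y_true, y_pred),
        "A": accuracy_score(y_true, y_pred),
        "K": cohen_kappa_score(y_true, y_pred),
        "AUC": auc(fpr, tpr),
    }
```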
In addition to the above indices, we have used the standard deviation of accuracy (Astd) over the ten rounds of each model in order to analyse the statistical consistency of performance. This analysis demonstrates the variation of the model performances, and a lower value of standard deviation signifies higher stability.
Figure 3 illustrates (A) ROC curve comparison of top 3 classifiers on IMDb dataset using generalized and modified Word2Vec; and (B) ROC curve comparison of top 3 classifiers on IMDb dataset using generalized and modified GloVe in accordance with an embodiment of the present disclosure.
Results and discussion
The proposed sentiment analysis model, along with the different classifiers, is applied on both datasets. At first, the sentiment analysis for each dataset is performed using the generalized Word2Vec and GloVe embeddings, and the same process is then repeated using the modified Word2Vec and GloVe embedding approach, comparing the performances of the various classifier-based models on these datasets using both types of embeddings. The performance of models with different classifiers using the modified Word2Vec embeddings of the IMDb dataset as input has been compared with that using the generalized Word2Vec embeddings. The results show better accuracy, precision, recall and kappa score for all classifiers. The MLP, CNN and SVM based models gave the best accuracies for all train-test split ratios using the modified word embeddings, but these models have higher standard deviations, showing some inconsistency over different train-test split sets. The Gradient Boosting and CNN classifiers showed the most consistent results, as indicated by the minimum standard deviation values. For all classifiers, the usage of modified word embeddings has shown better performance on these parameters than the generalized ones.
Figure 3a shows the ROC curves of the top three classifiers (MLP, CNN and SVM) that performed sentiment analysis using modified Word2Vec embeddings on the IMDb dataset, in comparison to the generalized Word2Vec embeddings. The ROC curves reiterate similar results: the area under the curve is higher in all cases for the modified Word2Vec embeddings.
The performance of models with different classifiers taking the modified GloVe embeddings of the IMDb dataset as input has been compared with that using the generalized GloVe embeddings. Here too, the results show better accuracy, precision, recall and kappa score for all classifiers. The CNN, MLP and SVM based models gave the best accuracies for all train-test split ratios using the modified word embeddings. The accuracies of the GloVe embeddings are slightly lower than those of the Word2Vec embeddings for the IMDb dataset, but the variation shown by the classifiers over different train-test split sets using GloVe embeddings is lower, as seen in the standard deviation column. Hence, the GloVe embeddings are more stable than the Word2Vec embeddings for most classifiers. The CNN classifier showed the most consistent results, as indicated by the minimum standard deviation values. For all classifiers, the usage of modified word embeddings has shown better performance on these parameters than the generalized ones.
Figure 3b shows the ROC curves of the top three classifiers (MLP, CNN and SVM) that performed sentiment analysis using modified GloVe embeddings on the IMDb dataset, in comparison to the generalized GloVe embeddings. The ROC curves graphically confirm the results: the area under the curve is higher in all cases for the modified GloVe embeddings.
The performance of models with different classifiers taking the modified Word2Vec embeddings of the Yelp dataset as input has been compared with that using the generalized Word2Vec embeddings. As with the IMDb dataset, the results show better accuracy, precision, recall and kappa score for all classifiers. The MLP, CNN, SVM and CNN-LSTM based models gave the best accuracies for all train-test split ratios using the modified word embeddings. The variation shown by these classifiers over different train-test split sets is quite considerable, due to the smaller size of the dataset, as seen in the standard deviation column; more entries in the dataset could help stabilize the variations. The Decision Tree classifier showed the most consistent results, as indicated by the minimum standard deviation values, although its accuracy is average. For all classifiers, the usage of modified word embeddings has shown better performance on these parameters than the generalized ones.
Figure 4 illustrates (A) ROC curve comparison of top 3 classifiers on Yelp dataset using generalized and modified Word2Vec; and (B) ROC curve comparison of top 3 classifiers on Yelp dataset using generalized and modified GloVe in accordance with an embodiment of the present disclosure.
Figure 4a shows the ROC curves of the top three classifiers (MLP, CNN and SVM) that performed sentiment analysis using modified Word2Vec embeddings on the Yelp dataset, in comparison to the generalized Word2Vec embeddings. The ROC curves graphically confirm the results: the area under the curve is higher in all cases for the modified Word2Vec embeddings.
The performance of models with different classifiers taking the modified GloVe embeddings of the Yelp dataset as input has been compared with that using the generalized GloVe embeddings. In this case also, the results show better accuracy, precision, recall and kappa score for all classifiers. The CNN, SVM, MLP and Gradient Boosting models gave the best accuracies for all train-test split ratios using the modified word embeddings. The variation shown by these classifiers over different train-test split sets is moderate, as seen in the standard deviation column. It is observed from this that the performance of the GloVe vectors is more stable than that of the Word2Vec vectors. The SVM classifier showed the most consistent results, as indicated by the minimum standard deviation value. For all classifiers, the usage of modified word embeddings has shown better performance on these parameters than the generalized ones.
Figure 4b shows the ROC curves of the top three classifiers (MLP, CNN and SVM) that performed sentiment analysis using modified GloVe embeddings on the Yelp dataset, in comparison to the generalized GloVe embeddings. The ROC curves graphically reiterate the results: the area under the curve is higher in all cases for the modified GloVe embeddings.
Word vectors are representative of the linguistic properties of words. They can be applied to most natural language processing tasks to give acceptable results. By focusing specifically on the sentiment aspect of words, the proposed method tries to make the word embeddings more suitable and sentiment-representative, and hence achieves better results for sentiment analysis. Some additional comparative analysis shows that the performance of different classifiers is closer to one another on the IMDb dataset than on the Yelp dataset. This can be attributed to the fact that the Yelp dataset is smaller in size, and hence enough training data is not provided; the larger size of the IMDb dataset trains the classifiers well and makes them more accurate.
The Word2Vec embeddings perform better on the IMDb dataset, whereas the GloVe embeddings work better on the Yelp dataset. This goes to show that different datasets are characteristically different, and the suitability of a particular type of embedding can vary over different datasets. So, there is no clear superiority of either of the embeddings. An additional observation of our work is the assessment of classifier suitability for sentiment analysis. In all cases, the SVM, MLP and CNN classifiers perform well and have relatively better accuracies. The better performance of these classifiers is because they are suitable for high-dimensional feature spaces.
Since we use 300-dimensional word embeddings, SVM is able to deal with such high dimensionality using appropriate kernel functions. MLP has higher approximation quality within a single hidden layer, and can efficiently work on various feature combinations. CNN, because of its ability to identify features at higher levels of abstraction, is also effective on text processing, since it can identify word connections and sentence structures. Thus, it is observed that these classifiers are quite suitable for text processing, especially sentiment analysis.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
Claims (10)
1. A method of enhancing performance of word embedding approaches by integrating sentiment-based information, said method comprises:
cleaning text by considering individual tokens of said text consisting of words, numbers or symbols, and removing symbols and hypertext that are not relevant to sentiment presented by said text; parsing sentences into word sets representing each sentence as a group of words; generating word embeddings to represent words by preparing a text in a form of numerical inputs; generating averaged sentence vectors by taking an average of vectors of all words contained in a sentence, wherein averaged sentence vector is extended to obtain vectors for an entire input text; and categorizing in a positive or negative class or in 1-star to 5-star rating using classifiers.
2. The method as claimed in claim 1, wherein in text cleaning, stop words are removed to provide better understandability, wherein words are reduced to base form by removing tense and thereafter converting into singular.
3. The method as claimed in claim 1, wherein a corpus is created out of all sentences split in this manner, which is done because method considers reviews as an aggregate of words which combinedly represent a sentiment of entire review.
4. The method as claimed in claim 1, wherein steps for generating word embeddings comprises:
taking a set of generalized pre-trained word embeddings created by Word2Vec or GloVe; using a sentiment lexicon that contains words and their sentiment ratings; identifying affect words using part-of-speech (PoS) tagging; creating a two-dimensional (2-D) mapping of affect words in a sentiment lexicon using a self-organizing map (SOM); finding pre-trained word embedding for an affect word; finding grid location from a 2-D mapping and finding all other words which form a set of sentimentally similar words for said affect word; adjusting word embedding using particle swarm optimization (PSO) approach; and replacing pre-trained word embedding with a modified word embedding.
5. The method as claimed in claim 4, wherein modified word embeddings are generated by Word2Vec and GloVe using sentiment information based on a self-organizing map (SOM) neural network approach.
6. The method as claimed in claim 5, wherein modifying words according to SOM comprises:
identifying all words in same grid point and modifying vector of said word to move it closer to vectors of all other affect words in same grid point, wherein said modification is done keeping in mind that word maintains almost equal distances from all similar words, while not moving too far from its original vector, so that its identity is preserved; ensuring that affect words are closer to sentimentally alike words, and are farther from sentimentally different words by performing said modification; and wherein said SOM makes sure that words with similar sentiment are clustered closer to each other and words with different sentiments are clustered separately.
7. The method as claimed in claim 1, wherein obtained word embedding is used to replace original word embedding of pretrained Word2Vec or GloVe set.
8. The method as claimed in claim 1, wherein said classifiers are Gaussian naive bayes, random forest, decision tree, gradient boosting, support vector machine, multi-layer perceptron, convolutional neural network (CNN), and CNN layered with long short-term memory (LSTM) classifiers.
9. The method as claimed in claim 4, wherein part-of-speech tagging is performed to identify affect words in a lexicon.
10. The method as claimed in claim 4, wherein an E-ANEW sentiment lexicon is considered which contains words and their valence scores in a range of 0 to 10, 0 being most negative and 10 being most positive.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021102725A AU2021102725A4 (en) | 2021-05-21 | 2021-05-21 | Sentiment Analysis of Human being with Effective Word Embedding Methodologies |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021102725A AU2021102725A4 (en) | 2021-05-21 | 2021-05-21 | Sentiment Analysis of Human being with Effective Word Embedding Methodologies |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021102725A4 true AU2021102725A4 (en) | 2022-03-17 |
Family
ID=80629174
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021102725A Ceased AU2021102725A4 (en) | 2021-05-21 | 2021-05-21 | Sentiment Analysis of Human being with Effective Word Embedding Methodologies |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021102725A4 (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |