AU2021102957A4

AU2021102957A4 - A system and method for predicting the stock market news sentiments using machine learning

Info

Publication number: AU2021102957A4
Application number: AU2021102957A
Authority: AU
Inventors: Gurpal Singh Chhabra; Mohit Chowdhary; Santosh Kumar; Neha Sharma; Shelja Sharma; Ishpreet Singh Virk; Gurwinder SINGH; Durgesh Srivastava
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-05-29
Filing date: 2021-05-29
Publication date: 2021-09-30
Anticipated expiration: 2029-05-29

Abstract

The present disclosure relates to a system and method for predicting the stock market news sentiments using machine learning. The sentiments are predicted based on the polarity and textual information using the Convolutional Neural Network (CNN) as a machine learning approach. The stock market news is collected from public websites and portals related to the stock market. The features are extracted from this data using a Lexicon-based dictionary. The opinions are generated and optimized using the Artificial Bee Colony (ABC) algorithm to achieve better results. The ABC algorithm is used as a feature selection and optimization approach, using the fitness function, to train the model with the CNN classifier. The designed model predicts the sentiment as positive, negative, or neutral for the stock market data collected.

Description

A SYSTEM AND METHOD FOR PREDICTING THE STOCK MARKET NEWS SENTIMENTS USING MACHINE LEARNING

FIELD OF THE INVENTION

The present disclosure relates to a system and method for predicting the stock market news sentiments using machine learning.

BACKGROUND OF THE INVENTION

Initially, stock market prediction was studied in terms of the efficient market hypothesis (EMH) of Fama and the random walk theory. According to these theories, it is impossible to predict a stock price trend accurately: the financial market is randomly driven, which is why the accuracy of prediction is limited to 50%. Later, it was found that the sentiment index of investors affects the financial market. Sentiment analysis technology studies the sentiments, opinions, and feedback of investors, and seeks to understand their emotions and why they choose a particular investment instrument. This is one of the key reasons for the success of sentiment analysis technologies in modern times.

Because Twitter is currently well suited to discussing financial instruments, it is used for current sentiment studies and analysis.

In one existing solution, a combined lexicon-based approach was used to achieve a prediction accuracy of 71%. In another existing solution, an ensemble method using random forest, support vector machine, and regression algorithms, combined with the study and analysis of embedded words and text, was implemented; this method won the first prize for stock market news sentiment analysis. In another existing solution, deep learning and data mining methods were used to analyze the stock market tweets from Stocktwits, and some regression algorithms were also evaluated. It was found that implementing a CNN can achieve an accuracy of 90.8%.

In one prior art solution (US8515739B2), a method was proposed for determining the sentiment associated with an entity. The method comprises: inputting a plurality of texts associated with the entity; labeling seed words in the plurality of texts as positive or negative; determining a score estimate for the plurality of words based on the labeling; re-enumerating paths of the plurality of words and determining a number of sentiment alternations; determining a final score for the plurality of words using only paths whose number of alternations is within a threshold; converting the final scores to corresponding z-scores for each of the plurality of words; and outputting the sentiment associated with the entity.

In another prior art solution (CN103778215B), the invention proposed a stock market forecasting method based on merging sentiment analysis and a hidden Markov model (HMM). The method comprises: gathering the information; pre-processing the gathered information; building language material; analyzing sentiment; performing technical analysis of the stock market; and using the proposed methodology to predict the stock market trend.

In another prior art solution (US8856056B2), the invention proposed a sentiment calculator which uses social media messages for the real-time evaluation of publicly traded assets, in particular equities and commodities, wherein a sentiment is an integer computed based upon pairs of lexical items in local syntactic context. The sentiment calculator includes a mechanism for determining polarity in social media messages and a mechanism for determining a strength value of lexical items used in social media messages.

However, most of the present studies have low prediction accuracy because the datasets used are specific to their prediction context. In the existing solutions, the pre-processing of data does not produce normalized data, so irrelevant features are more likely to appear because of un-normalized data, punctuation, and stop words. It was also seen that unsupervised clustering techniques operate on an estimated centroid, and if the centroid values vary there is a high chance of irrelevant results. Therefore, there is a need for a more efficient and effective system and method for predicting the stock market news sentiments using machine learning.

SUMMARY OF THE INVENTION

The present disclosure relates to a system and method for predicting the stock market news sentiments using machine learning. The main objective of the disclosure is to predict the emotions of the stock market news efficiently based on the polarity and textual information using the Convolutional Neural Network (CNN) as a machine learning approach. To predict the sentiment of the stock's textual reviews accurately, the swarm-based Artificial Bee Colony (ABC) algorithm is used with the Lexicon feature extraction approach and a novel fitness function. For better model training, the ABC algorithm is integrated with the CNN so that the proposed approach can predict the stock market news sentiment efficiently. The stock news data is collected from public websites and portals related to the stock market, and the repository used for the simulation of the proposed model is called the Stocktwits Database. For the simulation and validation of the proposed architecture, 15,000 twits and 5,000 datasets are taken for each category of sentiments. The predictions are classified as positive, negative, and neutral for the stock news data. The sentiment classification is done by convolutional neural networks, and the generated opinions are then optimized by the ABC algorithm to achieve the best results.

The present disclosure seeks to provide a system for predicting the stock market news sentiment using machine learning. The system comprises: a pre-processing unit for data normalization, removing punctuation, removing stop words, and tokenizing the data; a feature extraction unit for extracting the feature sets from the pre-processed data using the Lexicon-based dictionary; a feature selection unit for selecting the relevant features and discarding irrelevant features from the extracted features according to a novel fitness function; and a database unit consisting of the trained CNN structure for sentiment classification.

The present disclosure also seeks to provide a method for predicting the stock market news sentiments using machine learning. The method comprises: uploading data for training and testing of the model; pre-processing the uploaded data to generate consistent data with the help of data normalization, punctuation removal, stop word removal, and tokenization of data; extracting feature sets for positive, negative, and neutral data from the pre-processed data using the Lexicon-based dictionary; optimizing features to remove the unwanted feature sets and select only relevant feature sets from the extracted features according to a novel fitness function; initializing the Convolutional Neural Network (CNN) classifier to train the dataset based on the optimized data and storing the trained datasets into a database; and testing the uploaded test data.

An objective of the present disclosure is to provide a system and method for predicting the stock market news sentiments using machine learning.

Another object of the present disclosure is to collect the stock market data from various websites and portals related to the stock market.

Another object of the present disclosure is to integrate the ABC algorithm with the CNN to provide efficient stock market prediction.

Another object of the present disclosure is to design the lexicon dictionary from the collected twits for feature extraction, with ABC as the feature selection approach.

Another object of the present disclosure is to classify the stock market news as positive, negative, and neutral.

Yet another object of the present disclosure is to calculate performance metrics such as Precision, Recall, F-score, execution time, error, and classification accuracy and compare them with existing solutions.

To further clarify advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.

BRIEF DESCRIPTION OF FIGURES

These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

Figure 1 illustrates a block diagram of a system for predicting the stock market news sentiment using machine learning in accordance with an embodiment of the present disclosure;

Figure 2 illustrates a flow chart of a method for predicting the stock market news sentiment using machine learning in accordance with an embodiment of the present disclosure;

Figure 3 illustrates the architecture of the proposed model in accordance with an embodiment of the present disclosure;

Figure 4 illustrates the flow chart of the proposed model in accordance with an embodiment of the present disclosure;

Figure 5 illustrates the user interface of the proposed model in accordance with an embodiment of the present disclosure;

Figure 6 illustrates a table of average results of the different parameters in accordance with an embodiment of the present disclosure.

Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.

DETAILED DESCRIPTION

For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.

Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.

Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.

Figure 1 illustrates a block diagram of a system for predicting the stock market news sentiment using machine learning in accordance with an embodiment of the present disclosure. The system 100 includes a pre-processing unit 102 for data normalization, removing punctuations, removing stop words, and tokenizing the data.

In an embodiment, a feature extraction unit 104 is used for extracting the feature sets from the pre-processed data using the Lexicon-based dictionary.

In an embodiment, a feature selection unit 106 is used for selecting the relevant features and discarding irrelevant features from the extracted features according to a novel fitness function.

In an embodiment, a database unit 108 contains the trained CNN structure for sentiment classification.

Figure 2 illustrates a flow chart of a method for predicting the stock market news sentiment using machine learning in accordance with an embodiment of the present disclosure. At step 202, the method 200 includes uploading data for training and testing of the model. The data is collected from public websites and portals related to the stock market, and 15,000 stocktwits and 5,000 datasets are taken from each category, such as positive, negative, and neutral.

At step 204, the method 200 includes pre-processing the uploaded data to generate consistent data with the help of data normalization, punctuation removal, stop word removal, and tokenization of data. The said pre-processing is applied in both the testing and training sections.
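The four pre-processing operations named in step 204 can be illustrated with a minimal Python sketch. The disclosure does not specify how each operation is implemented, so the following are assumptions for illustration only: normalization is taken to mean lower-casing, the stop-word list `STOP_WORDS` is a tiny hypothetical one, and tokenization is a plain whitespace split.

```python
import string

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the disclosure).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "in", "on", "for", "it", "this", "that"}

def preprocess(text):
    """Normalize, strip punctuation, tokenize, and drop stop words."""
    # Normalization: assumed here to be lower-casing and trimming whitespace.
    normalized = text.lower().strip()
    # Punctuation removal: delete every ASCII punctuation character.
    no_punct = normalized.translate(str.maketrans("", "", string.punctuation))
    # Tokenization: split on runs of whitespace.
    tokens = no_punct.split()
    # Stop-word removal.
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The market is UP today, and $AAPL is rallying!"))
# → ['market', 'up', 'today', 'aapl', 'rallying']
```

Applying the same function to both training and test data keeps the two sections consistent, as step 204 requires.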

At step 206, the method 200 includes extracting feature sets for positive, negative, and neutral data from the pre-processed data using the Lexicon-based dictionary.
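As a rough illustration of lexicon-based feature extraction, the sketch below maps a token list to a small numeric feature vector. The dictionary `LEXICON`, its scores, and the choice of four features are hypothetical; the disclosure does not publish its dictionary or its feature layout.

```python
# Hypothetical polarity lexicon: word -> score in [-1, 1]. Not the disclosure's dictionary.
LEXICON = {
    "gain": 1.0, "rally": 0.8, "bullish": 1.0, "profit": 0.7,
    "loss": -1.0, "crash": -1.0, "bearish": -1.0, "drop": -0.6,
}

def lexicon_features(tokens):
    """Return [positive count, negative count, neutral count, summed polarity]."""
    scores = [LEXICON.get(tok, 0.0) for tok in tokens]  # unknown words count as neutral
    pos = sum(1 for s in scores if s > 0)
    neg = sum(1 for s in scores if s < 0)
    neutral = len(tokens) - pos - neg
    return [pos, neg, neutral, sum(scores)]

features = lexicon_features(["shares", "rally", "after", "profit", "despite", "drop"])
print(features)  # two positive words, one negative word, three neutral words
```

Counting words per polarity class is only one possible design; a per-word score vector or weighted counts would fit the same description.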

At step 208, the method 200 includes optimizing features to remove the unwanted feature sets and select only relevant feature sets from the extracted features according to a novel fitness function. As the feature optimization technique, an Artificial Bee Colony (ABC) algorithm is applied to the extracted lexicon-based feature sets.
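The following is a simplified sketch of how an ABC search could be used for feature selection, with each food source encoded as a binary mask over the features. The disclosure does not state its fitness function, so the one used here (per-feature relevance minus a fixed penalty per selected feature) is a stand-in, and the `relevance` scores, colony size, cycle count, and abandonment `limit` are all assumed values.

```python
import random

def abc_feature_selection(relevance, n_sources=10, n_cycles=50, limit=5,
                          penalty=0.3, seed=0):
    """Simplified binary Artificial Bee Colony search over feature masks.

    relevance[i] is an assumed usefulness score for feature i. The fitness
    below is a stand-in, not the (unspecified) fitness of the disclosure.
    """
    rng = random.Random(seed)
    n = len(relevance)

    def fitness(mask):
        # Reward relevant features, charge a fixed cost per selected feature.
        return sum(r - penalty for r, keep in zip(relevance, mask) if keep)

    def random_mask():
        return [rng.random() < 0.5 for _ in range(n)]

    def try_neighbor(i):
        # Flip one randomly chosen bit; keep the neighbor only if it is better.
        candidate = sources[i][:]
        j = rng.randrange(n)
        candidate[j] = not candidate[j]
        if fitness(candidate) > fitness(sources[i]):
            sources[i], trials[i] = candidate, 0
        else:
            trials[i] += 1

    sources = [random_mask() for _ in range(n_sources)]
    trials = [0] * n_sources
    best = max(sources, key=fitness)[:]

    for _ in range(n_cycles):
        # Employed bees: one local move per food source.
        for i in range(n_sources):
            try_neighbor(i)
        # Onlooker bees: revisit sources with probability tied to fitness.
        fits = [fitness(s) for s in sources]
        low = min(fits)
        weights = [f - low + 1e-9 for f in fits]
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_neighbor(i)
        # Remember the best mask seen so far, before scouts discard anything.
        cycle_best = max(sources, key=fitness)
        if fitness(cycle_best) > fitness(best):
            best = cycle_best[:]
        # Scout bees: abandon sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i], trials[i] = random_mask(), 0

    return best

relevance = [0.9, 0.1, 0.7, 0.05, 0.6, 0.2]  # assumed scores for six features
mask = abc_feature_selection(relevance)
selected = [i for i, keep in enumerate(mask) if keep]
print(selected)
```

With this stand-in fitness the best mask simply keeps the features whose relevance exceeds the penalty. A fitness tied to classifier performance would make the search non-trivial, which is where a swarm method such as ABC is typically applied.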

At step 210, the method 200 includes initializing the Convolutional Neural Network (CNN) classifier to train the dataset based on the optimized data and storing the trained datasets into a database. The database will be used for the classification of the test data.
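To make the classifier stage concrete, here is a minimal sketch of the forward pass of a small one-dimensional CNN over a feature vector, written in plain Python. The layer layout (convolution, ReLU, global max pooling, dense layer, softmax), the filter count, and the kernel width are assumptions, since the disclosure does not describe its CNN architecture. The weights are random and untrained, so the output only shows the shape of the computation, and training is not shown.

```python
import math
import random

CLASSES = ["positive", "negative", "neutral"]

def conv1d(x, kernel, bias):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)) + bias)
            for i in range(len(x) - k + 1)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cnn_forward(x, kernels, biases, dense_w, dense_b):
    """Conv -> ReLU -> global max pool -> dense -> softmax."""
    pooled = [max(conv1d(x, k, b)) for k, b in zip(kernels, biases)]
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)

# Untrained, randomly initialized weights: the output is not a meaningful prediction.
rng = random.Random(0)
kernels = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 filters, width 3
biases = [0.0] * 4
dense_w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in CLASSES]
dense_b = [0.0] * len(CLASSES)

features = [2.0, 1.0, 3.0, 0.9, 0.0, 1.0]  # hypothetical optimized feature vector
probs = cnn_forward(features, kernels, biases, dense_w, dense_b)
print(dict(zip(CLASSES, probs)))
```

In practice the weights would be learned from the labeled training data, and the trained parameters are what the method stores in the database for later classification of test data.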

At step 212, the method 200 includes testing the uploaded test data. The data is tested with the help of the trained datasets in the database. If an element is matched, the results are classified into categories, the performance parameters are calculated, and the process stops. If the element is not matched, only the performance parameters are calculated.

Figure 3 illustrates the architecture of the proposed model in accordance with an embodiment of the present disclosure. The proposed system comprises a pre-processing unit, a feature extraction and feature selection unit, and a database of the trained convolutional neural network (CNN) structure. The architecture can be divided into two parts: one is designing a framework for sentiment analysis, and the other is training and testing the proposed system. The stock market datasets are collected from various websites and online portals related to the stock market, and the dataset is then pre-processed so that the data meets the requirements. The pre-processing unit includes steps such as data normalization, punctuation removal, stop word removal, and tokenizing the data.

In the feature extraction unit, feature sets for positive, negative, and neutral data are extracted from the pre-processed data using the Lexicon-based dictionary. Then, in the feature selection unit, unwanted feature sets are removed from the extracted features according to the fitness function. The Artificial Bee Colony (ABC) algorithm is used on the extracted features as the feature selection technique.

The CNN classifier is initialized to train the system based on the optimized features. The optimized features are used as the input of the CNN for training and testing. After this, the data is classified according to the classifier's trained structure in the trained CNN unit. Finally, parameters such as Precision, Recall, F-measure, Execution Time, and Accuracy are calculated to validate the proposed system.
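The validation parameters can be computed from true and predicted labels as in the following sketch. The disclosure does not say how per-class values are combined across the three categories, so macro-averaging is assumed here; the labels and the small example are hypothetical. Execution time and error are left out because the disclosure does not define them.

```python
def classification_metrics(y_true, y_pred,
                           labels=("positive", "negative", "neutral")):
    """Return macro-averaged precision, recall, F-measure and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f_measures = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F-measure is the harmonic mean of precision and recall.
        f_measures.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
    }

y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]
print(classification_metrics(y_true, y_pred))
```

For this example, four of the six predictions are correct, so the accuracy is 4/6; the macro-averaged precision and recall differ because the errors are spread unevenly across the classes.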

Figure 4 illustrates the flow chart of the proposed model in accordance with an embodiment of the present disclosure. The proposed methodology can be divided into two parts: the first is uploading the data to the database for training purposes, and the second is uploading the test data for sentiment analysis. After the data is uploaded, the dataset is pre-processed, which includes data normalization, punctuation removal, stop word removal, and tokenization, so that the data meets the requirements. Once pre-processing is done, the features are extracted from the pre-processed data using a Lexicon-based dictionary and then optimized for better accuracy in sentiment analysis; the optimization removes the unwanted feature sets and selects only the relevant features from the extracted features according to the fitness function. These steps are carried out with some activation function when uploading the database for training. After that, the datasets are trained using the CNN, and the trained CNN structure is stored in a database that is used later for testing the data and extracting the information regarding that element. While testing the data, if the element is matched, the result is classified with a category and performance parameters such as Precision, Recall, F-measure, Execution Time, and Accuracy are calculated; if the element does not match, only the performance parameters are calculated.

Figure 5 illustrates the user interface of the proposed model in accordance with an embodiment of the present disclosure. The user interface has two panels. One is the training panel, which includes a training button used for training the dataset. The other is the testing panel, which includes an upload test data button for uploading the dataset and a pre-processing button for normalizing the test data, removing the punctuation, removing the stop words, and generating the tokenized data by assigning token values. The feature extraction button then shows the feature data values in the last text box, labeled feature data. On clicking the ABC button and the Classification button, the work is done in the code window, and when the processing is complete, a message is displayed in the output window. On clicking the result button, the class of the dataset is shown, that is, whether the dataset is positive, negative, or neutral, along with the values of performance parameters such as error percentage, execution time, precision, recall, F-measure, and accuracy.

Figure 6 illustrates a table of average results of the different parameters in accordance with an embodiment of the present disclosure. The table shows that the proposed model has achieved a minimum error of 0.636 with an execution time of 0.44 sec and a maximum precision value of 94.57. It has achieved a recall value of 93.72 with an F-measure value of 92.85, along with 99.98% accuracy.

The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.

Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims

WE CLAIM

1. A system for predicting the stock market news sentiments using machine learning, wherein the system comprises:

a pre-processing unit for data normalization, removing punctuations, removing stop words, and tokenizing the data;

a feature extraction unit for extracting the feature sets from the pre-processed data using the Lexicon-based dictionary;

a feature selection unit for selecting the relevant features and discarding irrelevant features from the extracted features according to a novel fitness function; and

a database unit consisting of the trained CNN structure for sentiment classification.

2. The system as claimed in claim 1, wherein said data is collected from public websites and portals related to the stock market.

3. The system as claimed in claim 1, wherein 15,000 stocktwits and 5000 datasets are taken from each category, such as positive, negative and neutral.

4. The system as claimed in claim 1, wherein said Lexicon-based dictionary creates a list of words based on their polarity.

5. The system as claimed in claim 1, wherein said feature selection is done by applying an Artificial Bee Colony (ABC) algorithm as a feature selection approach to the extracted feature sets.

6. A method for predicting the stock market news sentiments using machine learning, wherein the method comprises:

uploading data for training and testing of the model;

pre-processing the uploaded data to generate consistent data with the help of data normalization, punctuation removal, stop word removal, and tokenization of data;

extracting feature sets for positive, negative, and neutral data from the pre-processed data using the Lexicon-based dictionary;

optimizing features to remove the unwanted feature sets and select only relevant feature sets from the extracted features according to a novel fitness function;

initializing the Convolutional Neural Network (CNN) classifier to train the dataset based on the optimized data and storing the trained datasets into a database; and

testing the uploaded test data.

7. The method as claimed in claim 6, wherein said pre-processing is applied in both the testing and training sections.

8. The method as claimed in claim 6, wherein, as a feature optimization technique, an Artificial Bee Colony (ABC) algorithm is applied to the extracted lexicon-based feature sets.

9. The method as claimed in claim 6, wherein an initialization of the CNN classifier comprises:

selecting an optimized feature as an input of CNN for training and testing; and

computing the total emotion categories generated by the optimized data using classifiers, wherein said emotions are positive, negative, and neutral.

10. The method as claimed in claim 6, wherein said testing of the uploaded data, comprises:

uploading the test data;

classifying the results with categories if the data or element gets matched; and

calculating the performance parameters, and wherein said performance parameters are error percentage, execution time, precision, recall, F-measure, and accuracy.

Figure 1: block diagram of the system, showing pre-processing unit 102, feature extraction unit 104, feature selection unit 106, and database unit 108.

Figure 2: flow chart of the method, steps 202 to 212 (uploading data; pre-processing; feature extraction using the Lexicon-based dictionary; feature optimization according to the fitness function; CNN initialization and training; testing the uploaded test data).

Figure 3: architecture of the proposed model.

Figure 4: flow chart of the proposed model.

Figure 5: user interface of the proposed model, with controls labeled Upload Test-Data, Pre-Processing, Feature-Extraction, ABC-Algorithm, and Classification, and a panel titled Graphical Representation of Feature Value.

Figure 6: table of average results of the different parameters.