WO2021060967A1 - A system and method for predictive analytics of articles - Google Patents

A system and method for predictive analytics of articles

Info

Publication number
WO2021060967A1
Authority
WO
WIPO (PCT)
Prior art keywords
article
articles
module
entity
trend
Prior art date
Application number
PCT/MY2020/050056
Other languages
French (fr)
Inventor
Mohamed Farid Bin NOOR BATCHA
Duc Nghia PHAM
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2021060967A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9537Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Definitions

  • the present invention relates to a system and method for predictive analytics of articles.
  • the present invention provides a system and method for a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provide a prediction of the next course of action due to an event.
  • United States Patent Application Publication No. US 2013/0030981 A1 entitled “Stock Market Prediction Using Natural Language Processing” (hereinafter referred to as the US 981 A1 Publication) having a filing date of 5 October 2012 (Applicant: Herz Frederick SM) relates to a method of using natural language processing (NLP) techniques to extract information from online news feeds and then using the extracted information to predict changes in stock prices or volatilities.
  • keywords such as company names are recognized, and simple templates describing the actions of the company are automatically filled using pattern matching on words in or around the sentence containing the company name.
  • prediction is performed based only on the information available in the given template and on statistical patterns.
  • articles or news are updated by using a weighted textual attribute in order to determine the similarity of present news releases to previous ones.
  • Chinese Patent Application Publication No. CN 106227802 entitled “Chinese natural language processing and multi-core classifier based multi-information-source stock price prediction method” (hereinafter referred to as the CN 802 Publication) having a filing date of 20 July 2016 (Applicant: UNIV GUANGDONG TECHNOLOGY) relates to data mining, machine learning and artificial intelligence, and more particularly to a keyword-based text analysis on the extracted emotion model score.
  • the invention as disclosed in the CN 802 Publication provides a stock price prediction method based on natural language processing and a multi-core classifier, mainly for the Chinese language, which utilizes a predefined text sentiment dictionary and a predefined keyword-based dictionary for the research report.
  • in the CN 802 Publication, articles are analyzed based on numerical and text-type variables.
  • the invention as disclosed in the CN 802 Publication further utilizes a Support Vector Machine for prediction analysis, whereby, for article trend analysis, an evaluation process using the K-fold cross-validation method is introduced to evaluate and further verify the performance of prediction.
  • Chinese Patent Application Publication No. CN 106384166 entitled “Deep learning stock market prediction method combined with financial news” (hereinafter referred to as the CN 166 Publication) having a filing date of 12 September 2016 (Applicant: SUN YAT-SEN UNIV) relates to a deep learning stock market prediction method combined with financial news.
  • the prediction method as disclosed in the CN 166 Publication comprises steps of first using web crawling technology for financial news, to crawl for relevant financial information related to stocks from Sina Finance News and Netease Financial News; processing financial news information and conduct news sentiment analysis; using Recurrent Neural Network, RNN deep learning network of historical trained data for prediction; training the feature extraction; and performing model training and prediction. Further, in CN 166 Publication, articles are analyzed based on a number of positive words against a number of negative words whereby web crawler is used to crawl for related data and thereafter processing the information based on sentiment analysis only.
  • the system (100) comprising at least one Entity of Interest Query Module (102) for analysing an entity of interest received from a user; at least one Corpus Generator Module (104) for collecting data relating to an entity of interest; at least one Crawler Module (106) for crawling information on the entity of interest on provided sources of information continuously and crawling keywords defined in the at least one Corpus Generator Module (104) for latest updates in articles; at least one Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106); at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action; at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend; at least one Suggestion Generation Module (114) for suggesting to user on next course of action predicted; and at least one Actual Outcome Module (116) for providing feedback on actual outcome.
  • the at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400) at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles and weighing the sentiments according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction by using parameters obtained from at least historical trend, global trend, relevant statements and updated documents in articles.
  • the Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106) further comprises (200) at least one Statement Extraction Module (202) for extracting statements using machine learning tool; at least one Statement Entity Relation Module (204) for associating a respective entity to a statement; at least one Statement Weightage Module (206) for analysing weightage of statement according to influence of entities through closeness of corpus relationship; at least one Statement Sentiment Module (208) for analysing sentiment of statement according to influence of entities through closeness of corpus relationship; at least one Article Categorisation Module (210) for categorising articles using rule based technique; at least one Topic Sentiment Module (212) for analysing sentiment of each article by grouping of articles to its respective category based on nouns extracted and matching articles to a predefined topic grouping; at least one Duplicate Article Filter Module (214) for filtering duplicate articles; and at least one Update Detection Module (216) for updating articles that have new updates to a recent issue and linking to a previous
  • a further aspect of the invention provides that the at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend further comprises (300) at least one Global Analysis Module (302) for extracting global parameters from global corpus related to category of article; at least one Feedback Monitoring Module (304a) for providing feedback of actual outcome; and at least one Trend Monitor Module (304) for applying weightage to trend based on importance of category.
  • Another aspect of the invention provides a method (500) for predictive analytics of articles.
  • the method comprising steps of determining if input from user is available (502); proceeding to step (514) if input from user is not available; obtaining user query for entity of interest if input from user is available (504, 504a); determining if an entity corpus has been built for entity of interest (506); building and updating the entity corpus with keywords if the entity corpus has yet to be built (508, 510); selecting and updating the entity corpus with keywords if the entity corpus has been built (512); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514); extracting metadata of articles (518) upon receipt of articles (516); determining if timestamp on article is current (520); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514) if timestamp of article is not current and reiterate step (516
  • the step for performing prediction and suggesting to user on next course of action predicted (538) further comprises steps of (900) determining if statement is from Master Entity keyword (902); increasing relevance if statement is from Master Entity keyword (904); reducing relevance if statement is not from Master Entity keyword (906); adjusting weightage according to relevance (908) and producing statement positivity weight (916); determining if update is available (910); using previous outcome with high weightage (914) based on recent prediction and outcome database (912); processing decision based aggregation (926) from trend positivity (920), article positivity weight (918), historical trend weight (922) and global trend weight (924); and providing suggested outcome based on prediction (928).
  • a further aspect of the invention provides that the step for performing duplicate record detection filtering (522) further comprises steps of (600) determining timestamp of article (602); querying a list of historical articles titles from a database of articles (602a) in the last X days (604); performing article comparison on document similarity using known algorithms (606); determining if article is a duplicate by performing thresholding comparison (608); continuing with analysis of article if article is not a duplicate (610); and discarding article if article is found to be a duplicate resulting from thresholding comparison (612).
  • the step for updating detection of article (532) further comprises steps of (700) querying article summary (702) from a database of articles for summarizing similarity of articles (704); comparing percentage of similarity of keyword by determining if percentage of similarity is above threshold (706); confirming article update is false if percentage of similarity of keyword is below threshold (718); performing timestamp comparison if percentage of similarity is above threshold (708); and determining if timestamp is too far (710); and determining if article is of a same category if timestamp is not too far (712) and confirming article update is correct if articles are of the same category with recent predictions and outcome (714) from recent predictions and outcome database (716) else confirming article update is false if articles are not of the same category (718); identifying article as a new article if timestamp is too far (720); and confirming article update is false (718).
  • the step for performing trend analysis (536) further comprises steps of (800) extracting global parameters related to article category (804) from global entity parameters database (802); querying article related to global parameters (806); extracting time zone of article from metadata (808); determining if article is earlier in time zone (810); comparing article with previous categories showing similar sentiments (822, 820) upon undergoing relevance filter (812), noun extraction (814), article categorization (816) and sentiment analysis (818); and applying weightage to trend positivity based on category of importance (824).
  • the present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
  • FIG. 1.0 illustrates a general architecture of the system block diagram of the present invention.
  • FIG. 1.0a illustrates an example of an entity corpus.
  • FIG. 1.0b illustrates an example of a global corpus.
  • FIG. 1.0c illustrates an example of a neural network.
  • FIG. 2.0 illustrates a detailed block diagram of the Article Analysis Engine of the present invention.
  • FIG. 3.0 illustrates a detailed block diagram of the Trend Analysis Engine of the present invention.
  • FIG. 4.0 illustrates a detailed block diagram of the Prediction Engine of the present invention.
  • FIG. 5.0 is a flowchart illustrating the general methodology of the present invention.
  • FIG. 6.0 is a flowchart illustrating the steps of performing duplicate record detection filtering.
  • FIG. 7.0 is a flowchart illustrating the steps of updating detection of article.
  • FIG. 8.0 is a flowchart illustrating the steps of performing trend analysis.
  • FIG. 9.0 is a flowchart illustrating the steps of performing prediction and suggesting to user on next course of action predicted.
  • this specification describes the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.
  • article is defined as a piece of writing on a particular subject in a newspaper or magazine, or on the internet.
  • An event may occur in the article.
  • Event is an occurrence happening at a determinable time and place, with or without the participation of human agents. It may be a part of a chain of occurrences as an effect of a preceding occurrence and as the cause of a succeeding occurrence.
  • a sentiment analysis is used in analytics of articles to allow us to obtain an overview of the wider public opinion for certain topics or events that have been discussed.
  • FIG. 1.0 illustrates a general architecture of the system block diagram of the present invention.
  • the system (100) for predictive analytics of articles comprising at least one Entity of Interest Query Module (102) for analysing an entity of interest received from a user; at least one Corpus Generator Module (104) for collecting data relating to an entity of interest; at least one Crawler Module (106) for crawling information on the entity of interest on provided sources of information continuously and crawling keywords defined in the at least one Corpus Generator Module (104) for latest updates in articles; at least one Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106); at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action; at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend; at least one Suggestion Generation Module (114) for suggesting to user on next course of action predicted; and at least one Actual Outcome Module (116) for providing feedback on actual outcome.
  • the Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106) further comprises (200) at least one Statement Extraction Module (202) for extracting statements using machine learning tool; at least one Statement Entity Relation Module (204) for associating a respective entity to a statement; at least one Statement Weightage Module (206) for analysing weightage of statement according to influence of entities through closeness of corpus relationship; at least one Statement Sentiment Module (208) for analysing sentiment of statement according to influence of entities through closeness of corpus relationship; at least one Article Categorisation Module (210) for categorising articles using rule based technique; at least one Topic Sentiment Module (212) for analysing sentiment of each article by grouping of articles to its respective category based on nouns extracted and matching articles to a predefined topic grouping; at least one Duplicate Article Filter Module (214) for filtering duplicate articles; and at least one Update
  • At least one Trend Analysis Engine (112) is used for categorising the article based on the output obtained from Article Categorisation Module (210) and Topic Sentiment Module (212). Based on the data obtained from Article Sentiment Module (218), it will determine the predicted output or the next possible outcome for the article by performing prediction by Prediction Engine (110).
  • the at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend further comprises (300) at least one Global Analysis Module (302) for extracting global parameters from global corpus related to category of article and for monitoring the impact of foreign nations (prior time zones) to the article; at least one Trend Monitor Module (304) for providing feedback of actual outcome and comparing category and sentiment of the article and at least one weightage of articles by category (306) is applied to the article trend based on importance of category before the actual outcome of the article is produced by Actual Outcome Module (116).
  • the at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400) at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles for master entity influential statement’s sentiment and weighing the sentiments of entire articles according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction using data from the previous prediction outcome of the article (if the article update is true), categorizing the article based on historical trend and analysing the current positivity of the article.
  • FIG. 1.0a illustrates an example of an entity corpus while FIG. 1.0b illustrates an example of a global corpus and FIG. 1.0c illustrates an example of a neural network.
  • FIG. 1.0a illustrates an example of a corpus for Air Asia. As illustrated in the figure, the corpus comprises factors that affect the company, which include statements from major shareholders, fuel price hikes, tourism tax, airport tax, GST, recent air incidents involving the company, statements from directors, and political or geographical incidents in the destination countries. Further, booking information could reveal the seat availability for each destination, for monitoring the popularity of the airline in terms of choice. External factors affecting the company, including currency exchange due to government credit ratings as well as weather forecasts, are also important elements in the stock market performance of the company.
  • the duplicate record detection filter checks if the article has been processed recently, while the article update detection checks if the newly crawled article is an update of another recent article.
  • Various levels of importance are assigned to articles from different domains. Domains or specific social media accounts set by the user or predefined by the user or an administrator, as well as social media accounts of users whose names are mentioned in the corpus, are allocated the highest weightage, followed by news sites or official company sites, and finally random blogs and social media. Each article that has been crawled is categorized accordingly. The categorization process uses the nouns extracted from the article to fill a set of rules that match the selected category.
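The source weighting and rule-based categorization described above can be sketched as follows. This is a minimal illustration only: the weight values, the domain names, the tier table and the keyword rules are all assumptions made for the example and do not appear in the specification.

```python
from urllib.parse import urlparse

# Hypothetical weights: the ordering of tiers follows the text, the numbers are assumed.
SOURCE_WEIGHTS = {"user_defined": 1.0, "news": 0.7, "company": 0.7, "other": 0.3}

# Hypothetical mapping of known domains to tiers.
DOMAIN_TIERS = {"example-news.com": "news", "example-airline.com": "company"}

# Hypothetical rule set: a category matches when its keywords appear among the nouns.
CATEGORY_RULES = {
    "FUEL": {"fuel", "oil", "kerosene"},
    "TAX": {"tax", "gst", "levy"},
}


def source_weight(url, user_domains):
    """Return the weight of an article based on its source domain."""
    host = urlparse(url).hostname or ""
    if host in user_domains:
        return SOURCE_WEIGHTS["user_defined"]
    return SOURCE_WEIGHTS[DOMAIN_TIERS.get(host, "other")]


def categorize(nouns):
    """Pick the category whose rule keywords overlap most with the extracted nouns."""
    lowered = {n.lower() for n in nouns}
    best, best_hits = "UNCATEGORIZED", 0
    for category, keywords in CATEGORY_RULES.items():
        hits = len(lowered & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

A lookup table keeps the tier ordering easy to adjust; a real deployment would presumably load both tables from configuration rather than hard-code them.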
  • this type of categorization helps build a historical trend of events, so that if a similar trend occurs in the future, for example where the fuel price was hiked and sentiment was negative, the system would predict a dip of around “3% x delta” for ‘X’ days.
  • trend analysis takes two considerations into account: firstly the historical trend and secondly the global trend.
  • FIG. 1.0b illustrates an example of a global corpus.
  • the global trend requires the timestamp metadata of crawled data from a global corpus, which requires an additional relevance filter to fit the relevance of the entity corpus.
  • Article summarization is performed in the article update detection filter. Statements in the articles are also given special attention: statements are extracted, analyzed and linked back to the person who made them, and if the person is in the corpus list, the statement's positivity carries a higher weightage as it is considered influential.
  • the parameters retrieved from the historical trend, statement weightage, article update detection, sentiment of article, will be used in the prediction algorithm to suggest to the user the possible next outcome.
  • the prediction algorithm adjusts the weightages of the parameters and computes the aggregation.
  • the “Decision based Aggregation” could use a simple hard decision, a soft decision, or a more complex neural network model that could be updated as the system accumulates more historical data and feedback, to improve accuracy over time.
  • Information from the Statement, Article update, Article positivity, historical trend, and global trend will be fed to a decision based aggregation to compute using a neural network model that could train its weights over time from historical data.
  • the weights in the neural network would be able to update in a smooth manner over time for an improved accuracy.
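As a rough illustration of the simpler aggregation options mentioned above, the sketch below contrasts a soft decision (a weighted average of signal scores) with a hard decision (a majority vote on their signs). The signal names, the score range of -1 to 1 and the weight values are assumptions for the example; the specification does not define them, and the neural network variant is not shown.

```python
def soft_decision(signals, weights):
    """Weighted average of signal scores, each assumed to lie in [-1, 1]."""
    total = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total


def hard_decision(signals):
    """Majority vote on the sign of each signal: returns 1, -1, or 0 on a tie."""
    votes = sum((score > 0) - (score < 0) for score in signals.values())
    return (votes > 0) - (votes < 0)


# Hypothetical inputs corresponding to the parameters named in the text.
example_signals = {
    "statement": 0.6,
    "article": -0.2,
    "historical_trend": 0.4,
    "global_trend": 0.1,
}
example_weights = {
    "statement": 2.0,
    "article": 1.0,
    "historical_trend": 1.0,
    "global_trend": 1.0,
}
```

With these example values the soft decision is mildly positive and the hard decision votes positive; adjusting the weights from feedback on actual outcomes would be one way to approximate the gradual improvement the text describes.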
  • FIG. 5.0 is a flowchart illustrating the general methodology of the present invention.
  • the method (500) for predictive analytics of articles of the present invention is first initiated by determining if input from the user is available (502), and proceeding to step (514) if input from the user is not available. If input from the user is available, a user query is obtained for an entity of interest (504, 504a). Thereafter, it is determined if an entity corpus has been built for the entity of interest (506). A user is able to search for an entity of interest and is provided with the option to set some predefined URLs and keywords (504a).
  • if the entity corpus has been built, the new keywords provided by the user and the new pronouns detected from the URL provided are selected, extracted and updated into the entity corpus (512). Otherwise, the entity corpus is built and updated with keywords (508, 510). Subsequently, keywords are extracted from predefined or user-defined URLs for corpus generation and, from the keywords extracted, crawler robots are launched to crawl necessary articles related to the keywords of the corpus from various sites including the predefined sites (514). The articles extracted from the predefined sites have higher importance than those that are not from the predefined sites. Metadata of the articles is extracted (518) upon receipt of the articles (516).
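The build-or-update branch for the entity corpus (steps 506 to 512) can be sketched as below. Representing a corpus as a set of lowercase keywords held in a dictionary keyed by entity name is an assumption made for the example; the specification does not prescribe a storage format.

```python
def update_entity_corpus(corpora, entity, keywords):
    """Create the entity corpus if missing, then merge in new keywords.

    Returns the number of keywords that were actually new.
    """
    corpus = corpora.setdefault(entity, set())  # build the corpus if not yet built
    before = len(corpus)
    corpus.update(k.strip().lower() for k in keywords if k.strip())
    return len(corpus) - before
```

For example, calling it twice for the same entity with overlapping keywords only adds the keywords not already present, so the same function covers both the "build" and the "update" paths.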
  • it is then determined if the timestamp on the article is current (520). If the timestamp of the article is not current, keywords from the predefined or user-defined URLs are extracted for corpus generation and crawling of necessary articles related to keywords of the corpus (514), and step (516) and step (518) are reiterated accordingly.
  • if the timestamp on the article is current, duplicate record detection filtering is performed (522), the article weightage is retrieved from a table depending on the article source domain (524), and sentiment analysis is performed on the article and scaled by the article weight (526a, 526b).
  • the sentiment of each article is analyzed while grouping the articles into their respective categories based on the nouns extracted and matching to the predefined topic grouping.
  • the sets of data consisting of the sentiment and article category are saved to a database, for example a “Negative” sentiment for category “FUEL”. This would indicate that an increase in fuel price has impacted the turnover of the company. If the sentiment were positive for the same category, it would indicate savings for the company due to a fuel price drop.
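A minimal sketch of scaling an article's sentiment by its source weight and producing the (category, sentiment) record that is saved follows. The raw sentiment score is assumed to come from some upstream sentiment analyser returning a value between -1 and 1; that analyser, the record layout and the "Neutral" label are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SentimentRecord:
    category: str        # e.g. "FUEL"
    label: str           # "Positive", "Negative" or "Neutral" (assumed labels)
    scaled_score: float  # raw sentiment multiplied by the article weight


def make_record(category, raw_score, article_weight):
    """Scale the raw sentiment by the article weight and attach a label."""
    scaled = raw_score * article_weight
    if scaled > 0:
        label = "Positive"
    elif scaled < 0:
        label = "Negative"
    else:
        label = "Neutral"
    return SentimentRecord(category, label, scaled)
```

A strongly negative article from a low-weight source therefore contributes less than the same article from a user-defined source, while its label stays "Negative".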
  • keyword extraction (528a) and article categorization (528b) are performed, the article is summarized (530), and update detection of the article is performed (532).
  • a statement from the article is extracted (534a), the respective entity is associated to the statement through the statement entity relation (534b), and the sentiment of the statement is analyzed and weighted according to entities of influence based on closeness of corpus relationship (534c).
  • trend analysis (536) and prediction are performed, and the user is provided with a suggestion on the next course of action predicted (538).
  • FIG. 6.0 is a flowchart illustrating the steps of performing duplicate record detection filtering.
  • Duplicate records or articles are filtered in order to reduce the processing power required, and articles that have new updates to a recent issue have to be linked to the previous outcomes that were observed.
  • the timestamp of the article is first determined (602) and a list of historical article titles from a database of articles (602a) in the last X number of days (604) is queried. Thereafter, article comparison on document similarity is performed using known algorithms such as cosine similarity or Euclidean distance to compare the documents (606).
  • it is then determined if the article is a duplicate by performing thresholding comparison (608). Analysis of the article continues, as represented by the yes option, if the article is not a duplicate (610). Else, the article is discarded, as represented by the no option, if the article is found to be a duplicate resulting from thresholding comparison (612).
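The duplicate filter of steps (602) to (612) can be sketched using cosine similarity over bag-of-words title vectors. The whitespace tokenisation, the 7-day window standing in for "X days" and the 0.8 threshold are assumptions chosen for the example, not values from the specification.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def cosine_similarity(a, b):
    """Cosine similarity between two texts using simple word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_duplicate(title, timestamp, history, window_days=7, threshold=0.8):
    """Compare a title against titles seen in the last `window_days` days.

    `history` is a list of (title, timestamp) pairs, standing in for the
    database of articles.
    """
    cutoff = timestamp - timedelta(days=window_days)
    recent = [old for old, ts in history if cutoff <= ts <= timestamp]
    return any(cosine_similarity(title, old) >= threshold for old in recent)
```

Restricting the comparison to a recent window keeps the number of pairwise comparisons small, which matches the stated goal of reducing processing effort.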
  • FIG. 7.0 is a flowchart illustrating the steps of updating detection of article.
  • an article summary (702) is queried from a database of articles for summarizing similarity of articles using, e.g., Euclidean distance (704), and the percentage of similarity of keywords is compared by determining if the percentage of similarity is above a threshold (706). It is confirmed that the article update is false if the percentage of similarity of keywords is below the threshold (718). Timestamp comparison is performed if the percentage of similarity is above the threshold (708). It is further determined if the timestamp is too far (710); if the timestamps are too far apart, the outcome correlation may vary significantly and the article will therefore be treated as a new article (720).
  • the article update is confirmed as false if the articles are not of the same category (718). Likewise, the article is identified as a new article if the timestamp is too far (720), and the article update is confirmed as false (718).
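The decision flow of the update detection (steps 706 to 720) reduces to three checks in sequence, sketched below. The threshold of 0.6, the 3-day limit for "too far" and the treatment of a similarity exactly equal to the threshold as not above it are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_article_update(
    similarity,
    new_ts,
    old_ts,
    new_category,
    old_category,
    threshold=0.6,
    max_gap=timedelta(days=3),
):
    """Return True only if the new article is an update of the older one."""
    if similarity <= threshold:          # (706) not above threshold -> (718)
        return False
    if abs(new_ts - old_ts) > max_gap:   # (710) too far apart -> new article (720)
        return False
    return new_category == old_category  # (712) same category -> (714), else (718)
```

Ordering the checks from cheapest to most specific mirrors the flowchart: an article that fails the similarity test never reaches the timestamp or category comparison.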
  • performing trend analysis further comprises steps of (800) first extracting global parameters related to the article category (804) from a global entity parameters database (802). Thereafter, articles related to the global parameters are queried (806) and the time zone of each article is extracted from its metadata (808). The timestamp information of the articles is extracted and checked to see if the time zone is earlier. This is used in scenarios where, for example, the Federal Reserve in the USA has increased interest rates, which will have an impact on the Malaysian stock exchange the following working day. Therefore, due to the difference in time zone, this information could be used to help predict the impact on a selected entity.
  • it is determined if the article is earlier in time zone (810), and the article is compared with previous categories showing similar sentiments (822, 820) after undergoing a relevance filter (812), noun extraction (814), article categorization (816) and sentiment analysis (818). Then weightage is applied to trend positivity based on category of importance (824).
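One possible reading of the time zone check (810) is sketched below: an article counts as "earlier in time zone" when the UTC offset of its timestamp is smaller than that of the entity's market. This interpretation, the fixed UTC offsets (which ignore daylight saving) and the function names are assumptions; the specification does not define the comparison precisely.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

# Fixed offsets used only for illustration; daylight saving time is ignored.
US_EASTERN = timezone(timedelta(hours=-5))
MALAYSIA = timezone(timedelta(hours=8))


def is_earlier_time_zone(article_ts: datetime, entity_tz: tzinfo) -> bool:
    """True if the article's UTC offset is behind the entity's UTC offset."""
    offset = article_ts.utcoffset()
    if offset is None:
        raise ValueError("article timestamp must be timezone-aware")
    entity_offset = article_ts.astimezone(entity_tz).utcoffset()
    return offset < entity_offset


def local_date_for_entity(article_ts: datetime, entity_tz: tzinfo) -> date:
    """Calendar date in the entity's zone on which the article appeared."""
    return article_ts.astimezone(entity_tz).date()
```

In the interest-rate scenario from the text, an announcement timestamped in the afternoon in the US falls on the next calendar day in Malaysia, which is why its effect would be expected on the following working day.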
  • the relevance filter (812) is required to only extract information that is related to the entity of interest. Since the global parameters could be of many topics, the topics that fulfil the entity of interest from the article category will be used as one of the criteria for relevance filtering.
  • the global articles extracted are then categorized and sentiment analysis is performed.
  • the information is then stored in a database for historical reference. Historical trends are also retrieved from the database to analyze whether a similar scenario has occurred in the past. For example, if a fuel price increase has reduced the market share of a company by X% in the past Y months, this historical outcome can be used as one of the parameters to predict the future outcome of an event in a similar category.
  • FIG. 9.0 is a flowchart illustrating the steps of performing prediction and suggesting to user on next course of action predicted.
  • performing prediction and suggesting to user on next course of action predicted further comprises steps of (900) first determining if statement is from Master Entity keyword (902).
  • Statements extracted from the article usually play a major role in a company's performance. If a distinct person makes a negative remark about a company, the share price is at risk of being impacted negatively. If a distinct person has agreed to award a new contract to a company, this in turn would increase the share price. The relevance of the person is weighted for each statement.
  • the present invention assists a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provides a system and method where the next course of action due to an event is predicted.

Abstract

The present invention relates to a system and method for predictive analytics of articles. The system (100) of the present invention comprising at least one Entity of Interest Query Module (102); at least one Corpus Generator Module (104); at least one Crawler Module (106); at least one Articles Analysis Engine (108); at least one Prediction Engine (110); at least one Trend Analysis Engine (112); at least one Suggestion Generation Module (114); and at least one Actual Outcome Module (116). The at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400) at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles and weighing the sentiments according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction by using parameters obtained from at least historical trend, global trend, relevant statements and updated documents in articles. The present invention provides a system and method for a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provide a prediction of the next course of action due to an event.

Description

A SYSTEM AND METHOD FOR PREDICTIVE ANALYTICS OF ARTICLES
FIELD OF INVENTION
The present invention relates to a system and method for predictive analytics of articles. In particular, the present invention provides a system and method for a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provide a prediction of the next course of action due to an event.
BACKGROUND ART
Availability of abundance of information and data over the internet has caused a huge amount of issues from blogs, forums or social sites on the perception of an entity which includes a person or an organization. It is therefore difficult to create intelligence from the huge amount of data to visualize an overall sentiment of the entities caused by many perspectives. Currently, an automated system for prediction of possible outcome upon processing the article is not available.
United States Patent Application Publication No. US 2013/0030981 A1 entitled “Stock Market Prediction Using Natural Language Processing” (hereinafter referred to as the US 981 A1 Publication) having a filing date of 5 October 2012 (Applicant: Herz Frederick SM) relates to a method of using natural language processing (NLP) techniques to extract information from online news feeds and then using the extracted information to predict changes in stock prices or volatilities. In the invention as disclosed in the US 981 A1 Publication, keywords such as company names are recognized and simple templates describing the actions of the company are automatically filled using pattern matching on words in or around the sentence containing the company name. In the US 981 A1 Publication, prediction is performed based on information available in the given template and based on statistical patterns only. Further, in the US 981 A1 Publication, articles or news are updated by using a weighted textual attribute in order to determine similarities of the present news releases to previous ones.
Chinese Patent Application Publication No. CN 106227802 entitled “Chinese natural language processing and multi-core classifier based multi-information-source stock price prediction method” (hereinafter referred to as the CN 802 Publication) having a filing date of 20 July 2016 (Applicant: UNIV GUANGDONG TECHNOLOGY) relates to data mining, machine learning and artificial intelligence, and more particularly to a keyword-based text analysis on the extracted emotion model score. The invention as disclosed in the CN 802 Publication provides a stock price prediction method based on natural language processing and a multi-core classifier, mainly for the Chinese language, in which a predefined text sentiment dictionary and a predefined keyword-based dictionary for the research report are utilized. Further, in the CN 802 Publication, articles are analyzed based on numerical and text-type variables. The invention as disclosed in the CN 802 Publication further utilizes a Support Vector Machine for prediction analysis whereby, for article trend analysis, an evaluation process that evaluates and further verifies the performance of prediction through a K-fold cross-validation method is introduced.
Chinese Patent Application Publication No. CN 106384166 entitled “Deep learning stock market prediction method combined with financial news” (hereinafter referred to as the CN 166 Publication) having a filing date of 12 September 2016 (Applicant: SUN YAT-SEN UNIV) relates to a deep learning stock market prediction method combined with financial news. The prediction method as disclosed in the CN 166 Publication comprises steps of first using web crawling technology to crawl for relevant financial information related to stocks from Sina Finance News and Netease Financial News; processing financial news information and conducting news sentiment analysis; using a Recurrent Neural Network (RNN) deep learning network of historical trained data for prediction; training the feature extraction; and performing model training and prediction. Further, in the CN 166 Publication, articles are analyzed based on the number of positive words against the number of negative words, whereby a web crawler is used to crawl for related data and the information is thereafter processed based on sentiment analysis only.
With reference to the above-mentioned disclosures, there is indeed a need for a system and method that is able to automatically predict the outcome after processing an article considering the enormous amount of information on a perception of an entity.
SUMMARY OF INVENTION
The present invention relates to a system and method for predictive analytics of articles. In particular, the present invention provides a system and method for a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provide a prediction of the next course of action due to an event.
One aspect of the invention provides a system (100) for predictive analytics of articles. The system (100) comprising at least one Entity of Interest Query Module (102) for analysing an entity of interest received from a user; at least one Corpus Generator Module (104) for collecting data relating to an entity of interest; at least one Crawler Module (106) for crawling information on the entity of interest on provided sources of information continuously and crawling keywords defined in the at least one Corpus Generator Module (104) for latest updates in articles; at least one Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106); at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action; at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend; at least one Suggestion Generation Module (114) for suggesting to user on next course of action predicted; and at least one Actual Outcome Module (116) for providing feedback on actual outcome. The at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400) at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles and weighing the sentiments according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction by using parameters obtained from at least historical trend, global trend, relevant statements and updated documents in articles.
Another aspect of the invention provides that the Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106) further comprises (200) at least one Statement Extraction Module (202) for extracting statements using machine learning tool; at least one Statement Entity Relation Module (204) for associating a respective entity to a statement; at least one Statement Weightage Module (206) for analysing weightage of statement according to influence of entities through closeness of corpus relationship; at least one Statement Sentiment Module (208) for analysing sentiment of statement according to influence of entities through closeness of corpus relationship; at least one Article Categorisation Module (210) for categorising articles using rule based technique; at least one Topic Sentiment Module (212) for analysing sentiment of each article by grouping of articles to its respective category based on nouns extracted and matching articles to a predefined topic grouping; at least one Duplicate Article Filter Module (214) for filtering duplicate articles; and at least one Update Detection Module (216) for updating articles that have new updates to a recent issue and linking to a previous outcome that was observed.
A further aspect of the invention provides that the at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend further comprises (300) at least one Global Analysis Module (302) for extracting global parameters from global corpus related to category of article; at least one Feedback Monitoring Module (304a) for providing feedback of actual outcome; and at least one Trend Monitor Module (304) for applying weightage to trend based on importance of category.
Another aspect of the invention provides a method (500) for predictive analytics of articles. The method comprising steps of determining if input from user is available (502); proceeding to step (514) if input from user is not available; obtaining user query for entity of interest if input from user is available (504, 504a); determining if an entity corpus has been built for entity of interest (506); building and updating the entity corpus with keywords if the entity corpus has yet to be built (508, 510); selecting and updating the entity corpus with keywords if the entity corpus has been built (512); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514); extracting metadata of articles (518) upon receipt of articles (516); determining if timestamp on article is current (520); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514) if timestamp of article is not current and reiterate step (516) and step (518); performing duplicate record detection filtering (522) if the timestamp on article is current by retrieving article weightage from table depending on article source domain (524) and performing sentiment analysis on article and scaled by article weight (526a, 526b); performing keyword extraction (528a), article categorization (528b), summarizing article (530) and updating detection of article (532); extracting a statement from article (534a); associating respective entity to the statement (534b); analyzing sentiment of the statement and the statement is weighted according to entities of influence based on closeness of corpus relationship (534c); performing trend analysis (536); and performing prediction and suggesting to user on next course of action predicted (538). 
The step for performing prediction and suggesting to user on next course of action predicted (538) further comprises steps of (900) determining if statement is from Master Entity keyword (902); increasing relevance if statement is from Master Entity keyword (904); reducing relevance if statement is not from Master Entity keyword (906); adjusting weightage according to relevance (908) and producing statement positivity weight (916); determining if update is available (910); using previous outcome with high weightage (914) based on recent prediction and outcome database (912); processing decision based aggregation (926) from trend positivity (920), article positivity weight (918), historical trend weight (922) and global trend weight (924); and providing suggested outcome based on prediction (928).
A further aspect of the invention provides that the step for performing duplicate record detection filtering (522) further comprises steps of (600) determining timestamp of article (602); querying a list of historical articles titles from a database of articles (602a) in the last X days (604); performing article comparison on document similarity using known algorithms (606); determining if article is a duplicate by performing thresholding comparison (608); continuing with analysis of article if article is not a duplicate (610); and discarding article if article is found to be a duplicate resulting from thresholding comparison (612).
Yet another aspect of the invention provides that the step for updating detection of article (532) further comprises steps of (700) querying article summary (702) from a database of articles for summarizing similarity of articles (704); comparing percentage of similarity of keyword by determining if percentage of similarity is above threshold (706); confirming article update is false if percentage of similarity of keyword is below threshold (718); performing timestamp comparison if percentage of similarity is above threshold (708); determining if timestamp is too far (710); determining if article is of a same category if timestamp is not too far (712) and confirming article update is correct if articles are of the same category with recent predictions and outcome (714) from recent predictions and outcome database (716), else confirming article update is false if articles are not of the same category (718); identifying article as a new article if timestamp is too far (720); and confirming article update is false (718).
Still another aspect of the invention provides that the step for performing trend analysis (536) further comprises steps of (800) extracting global parameters related to article category (804) from global entity parameters database (802); querying article related to global parameters (806); extracting time zone of article from metadata (808); determining if article is earlier in time zone (810); comparing article with previous categories showing similar sentiments (822, 820) upon undergoing relevance filter (812), noun extraction (814), article categorization (816) and sentiment analysis (818); and applying weightage to trend positivity based on category of importance (824).
The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.
BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS
To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.
FIG. 1.0 illustrates a general architecture of the system block diagram of the present invention.
FIG. 1.0a illustrates an example of an entity corpus.
FIG. 1.0b illustrates an example of a global corpus.
FIG. 1.0c illustrates an example of a neural network.
FIG. 2.0 illustrates a detailed block diagram of the Article Analysis Engine of the present invention.
FIG. 3.0 illustrates a detailed block diagram of the Trend Analysis Engine of the present invention.
FIG. 4.0 illustrates a detailed block diagram of the Prediction Engine of the present invention.
FIG. 5.0 is a flowchart illustrating the general methodology of the present invention.
FIG. 6.0 is a flowchart illustrating the steps of performing duplicate record detection filtering.
FIG. 7.0 is a flowchart illustrating the steps of updating detection of article.
FIG. 8.0 is a flowchart illustrating the steps of performing trend analysis.
FIG. 9.0 is a flowchart illustrating the steps of performing prediction and suggesting to user on next course of action predicted.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention relates to a system and method for predictive analytics of articles. In particular, the present invention provides a system and method for a user to extract multiple levels of sentiment from an article, extract the trend of related happenings, and provide a prediction of the next course of action due to an event. Hereinafter, this specification will describe the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention and it is envisioned without departing from the scope of the appended claims.
According to the Cambridge Business English Dictionary by Cambridge University Press, an article is defined as a piece of writing on a particular subject in a newspaper or magazine, or on the internet. An event may be described in the article. An event is an occurrence happening at a determinable time and place, with or without the participation of human agents. It may be a part of a chain of occurrences as an effect of a preceding occurrence and as the cause of a succeeding occurrence. Sentiment analysis is used in analytics of articles to obtain an overview of the wider public opinion on certain topics or events that have been discussed.
Reference is first made to FIG. 1.0 which illustrates a general architecture of the system block diagram of the present invention. As illustrated in FIG. 1.0, the system (100) for predictive analytics of articles comprises at least one Entity of Interest Query Module (102) for analysing an entity of interest received from a user; at least one Corpus Generator Module (104) for collecting data relating to an entity of interest; at least one Crawler Module (106) for crawling information on the entity of interest on provided sources of information continuously and crawling keywords defined in the at least one Corpus Generator Module (104) for latest updates in articles; at least one Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106); at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action; at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend; at least one Suggestion Generation Module (114) for suggesting to user on next course of action predicted; and at least one Actual Outcome Module (116) for providing feedback on actual outcome.
Reference is now made to FIG. 2.0 which illustrates a detailed block diagram of the Article Analysis Engine of the present invention. As illustrated in FIG. 2.0, the Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106) further comprises (200) at least one Statement Extraction Module (202) for extracting statements using machine learning tool; at least one Statement Entity Relation Module (204) for associating a respective entity to a statement; at least one Statement Weightage Module (206) for analysing weightage of statement according to influence of entities through closeness of corpus relationship; at least one Statement Sentiment Module (208) for analysing sentiment of statement according to influence of entities through closeness of corpus relationship; at least one Article Categorisation Module (210) for categorising articles using rule based technique; at least one Topic Sentiment Module (212) for analysing sentiment of each article by grouping of articles to its respective category based on nouns extracted and matching articles to a predefined topic grouping; at least one Duplicate Article Filter Module (214) for filtering duplicate articles; and at least one Update Detection Module (216) for updating articles that have new updates to a recent issue and linking to a previous outcome that was observed. At least one Trend Analysis Engine (112) is used for categorising the article based on the output obtained from Article Categorisation Module (210) and Topic Sentiment Module (212). Based on the data obtained from Article Sentiment Module (218), it will determine the predicted output or the next possible outcome for the article by performing prediction by Prediction Engine (110).
Reference is now made to FIG. 3.0 which illustrates a detailed block diagram of the Trend Analysis Engine (112) of the present invention. As illustrated in FIG. 3.0, the at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend further comprises (300) at least one Global Analysis Module (302) for extracting global parameters from global corpus related to category of article and for monitoring the impact of foreign nations (prior time zones) to the article; at least one Trend Monitor Module (304) for providing feedback of actual outcome and comparing category and sentiment of the article and at least one weightage of articles by category (306) is applied to the article trend based on importance of category before the actual outcome of the article is produced by Actual Outcome Module (116).
Reference is now made to FIG. 4.0 which illustrates a detailed block diagram of the Prediction Engine of the present invention. As illustrated in FIG. 4.0, the at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400) at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles for master entity influential statement’s sentiment and weighing the sentiments of entire articles according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction using data from previous prediction outcome of article (if the article update is true), categorizing article based on historical trend and analysing current positivity of the article.
The system of the present invention provides a user interface where a user is able to search for an entity of interest including a company name, a specific product, tax policy, or even fuel price, gather relevant data on the search query and predict the impact on other entities in the future.
Reference is now made to FIG. 1.0a, FIG. 1.0b and FIG. 1.0c respectively. FIG. 1.0a illustrates an example of an entity corpus while FIG. 1.0b illustrates an example of a global corpus and FIG. 1.0c illustrates an example of a neural network. FIG. 1.0a illustrates an example of a corpus for Air Asia. As illustrated in FIG. 1.0a, the corpus comprises factors that contribute to the company, which include statements from major shareholders, fuel price hike, tourism tax, airport tax, GST, recent air incidents involving the company, statements from directors, and political or geographical incidents in the destination countries. Further, booking information could reveal the seat availability for each destination, for monitoring of the popularity of the airline in terms of choice. External factors affecting the company, including currency exchange due to government credit ratings as well as weather forecasts, are also important elements in the stock market performance of the company.
Once a related article has been crawled, the metadata of the article is extracted. The timestamp information is checked to ensure that the crawled article is recent. The article will go through two filters, which are the duplicate record detection filter and the article update detection filter. The duplicate record detection filter checks if the article has been processed recently, while the article update detection filter checks if the newly crawled article is an update of another recent article.
Various levels of importance are provided to articles from different domains. Domains or specific social media accounts set by the user or predefined by the user or administrator, or social media accounts of users whose names are mentioned in the corpus, will be allocated the highest weightage, followed by news sites or official company sites, and finally by random blogs and social media. Each article that has been crawled will be categorized accordingly. The categorization process uses the nouns extracted from the article to fill a set of rules that will match the selected category.
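The source weightage and the rule based categorization described above may be sketched as follows. This is a minimal illustration only: the tier names, the numeric weights, the host names in the lookup table and the noun rules are all assumptions, since the specification gives only the ordering of the tiers and states that nouns are matched against a set of rules.

```python
from urllib.parse import urlparse

# Hypothetical weights: the specification only gives the ordering of the tiers.
SOURCE_WEIGHTS = {"corpus_account": 1.0, "news_or_official": 0.7, "blog_or_social": 0.3}

# Hypothetical stand-in for the table of article weightage by source domain.
DOMAIN_TIERS = {
    "www.thestar.com.my": "news_or_official",
    "www.airasia.com": "news_or_official",
}

# Hypothetical rules: a category is matched by the nouns it shares with the article.
CATEGORY_RULES = {
    "FUEL": {"fuel", "barrel", "jet"},
    "TAX": {"tax", "gst", "levy"},
}


def source_weight(url, corpus_hosts=frozenset()):
    """Return the weight of an article based on the host it was crawled from."""
    host = urlparse(url).netloc.lower()
    if host in corpus_hosts:
        # Sources set by the user or tied to names in the corpus rank highest.
        return SOURCE_WEIGHTS["corpus_account"]
    # Unknown hosts fall back to the lowest tier.
    tier = DOMAIN_TIERS.get(host, "blog_or_social")
    return SOURCE_WEIGHTS[tier]


def categorize(nouns):
    """Return the category whose rule set shares the most nouns, or None."""
    found = {n.lower() for n in nouns}
    best, best_hits = None, 0
    for category, rule_nouns in CATEGORY_RULES.items():
        hits = len(found & rule_nouns)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

For instance, `categorize(["Fuel", "prices", "barrel"])` selects the hypothetical "FUEL" rule set, and an article from a host absent from the lookup table receives the lowest weight.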
Example of an article on Air Asia:
Higher fuel prices to pressure AirAsia’s earnings
“KUALA LUMPUR: CIMB Equities Research expects higher fuel prices to pressure Air Asia Group Bhd’s (AAGB) earnings as it only hedged about 12% of its FY18F jet fuel needs at US$68.55 per barrel” Read more at https://www.thestar.com.my/business/business-news/2018/05/25/higher-fuel-prices-to-pressure-airasia-earnings/#ccFQSAaPWK8RruKX.99
The above segment of the article will be categorized under a category of “FUEL”. The sentiment of the article is extracted and saved into a database of that category. For example: Negative sentiment on category “FUEL” with reason “Price Increase” and, as the assumed result of this article, a 3% dip in the share price.
As described, this type of categorization helps build a historical trend of events, so that if a similar trend occurs in the future, where fuel price is hiked and sentiment is negative, the system would predict a dip of around “3% x delta” for ‘X’ days. In general, trend analysis takes two considerations into account: firstly the historical trend and secondly the global trend.
Reference is now made to FIG. 1.0b which illustrates an example of a global corpus. The global trend requires the timestamp metadata of crawled data from a global corpus, which requires an additional relevance filter to fit the relevance of the entity corpus. Article summarization is performed in the article update detection filter. Statements in the articles are also given special attention: statements are extracted, analyzed and linked back to the person, and if the person is in the corpus list, the positivity of the statement will carry a higher weightage as it is considered influential. The parameters retrieved from the historical trend, statement weightage, article update detection and sentiment of article will be used in the prediction algorithm to suggest to the user the possible next outcome. The prediction algorithm adjusts the weightages of the parameters and computes the aggregation. The “Decision based Aggregation” could use a simple hard decision, a soft decision or a more complex neural network model that could be updated as the system accumulates more historical data and feedback to improve accuracy over time.
As illustrated in FIG. 1.0c, information from the statement, article update, article positivity, historical trend and global trend will be fed to a decision based aggregation, which computes the result using a neural network model that could train its weights over time from historical data. The weights in the neural network would be updated smoothly over time for improved accuracy.
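By way of illustration, the simpler soft decision and hard decision variants of the decision based aggregation may be sketched as below. The field names, the signal range of -1 to 1, the default weights and the decision threshold are assumptions made for this sketch and are not values given in the specification; a neural network model would instead learn the weights from historical data.

```python
from dataclasses import dataclass, astuple


@dataclass
class Signals:
    """Five inputs to the aggregation, each assumed to lie in [-1, 1]."""
    statement: float
    article_update: float
    article_positivity: float
    historical_trend: float
    global_trend: float


# Hypothetical fixed weights, one per signal; illustrative values only.
DEFAULT_WEIGHTS = Signals(0.3, 0.1, 0.2, 0.25, 0.15)


def soft_decision(signals, weights=DEFAULT_WEIGHTS):
    """Weighted average of the signals, normalised by the total weight."""
    values, ws = astuple(signals), astuple(weights)
    total = sum(ws)
    return sum(v * w for v, w in zip(values, ws)) / total


def hard_decision(score, threshold=0.1):
    """Map the aggregated score to a discrete suggested outcome."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

A strongly negative statement and historical trend outweigh a mildly positive global trend under these assumed weights, so the suggested outcome in that case would be "negative".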
Reference is now made to FIG. 5.0 which is a flowchart illustrating the general methodology of the present invention. As illustrated in FIG. 5.0, the method (500) for predictive analytics of articles of the present invention is first initiated by determining if input from user is available (502); and proceeding to step (514) if input from user is not available. If input from user is available, user query is obtained for an entity of interest (504, 504a). Thereafter, it is determined if an entity corpus has been built for an entity of interest (506). A user is able to search for an entity of interest and the user is provided with the option to set some predefined URLs and keywords (504a). If the corpus of the entity of interest has been built, the new keywords provided by the user and the new pronouns detected from the URL provided will be selected, extracted and updated into the entity corpus (512). Else, the entity corpus is built and updated with keywords if it has yet to be built (508, 510). Subsequently, keywords are extracted from predefined or user defined URLs for corpus generation and, from the keywords extracted, crawler robots are launched to crawl necessary articles related to keywords of corpus from various sites including the predefined sites (514). The articles extracted from the predefined sites will have higher importance compared to those that are not from the predefined sites. Metadata of articles (518) are extracted upon receipt of articles (516). Thereafter, it is determined if the timestamp on article is current (520). If the timestamp of article is not current, keywords from the predefined or user defined URLs are extracted for corpus generation and crawling of necessary articles related to keywords of corpus (514), and step (516) and step (518) are reiterated accordingly.
Duplicate record detection filtering is performed (522) if the timestamp on article is current by retrieving article weightage from table depending on article source domain (524) and performing sentiment analysis on article and scaled by article weight (526a, 526b). The sentiment of each article is analyzed while grouping the articles to its respective category based on the nouns extracted and matching to the predefined topic grouping. The sets of data consisting of the sentiment and article category are saved to a database, for example “Negative” sentiment for category “FUEL”. This would indicate that an increase in fuel price has impacted the turnover of the company. Whereas, if the sentiment were positive for the same category, this would indicate savings for the company from a drop in fuel price.
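The scaling of article sentiment by article weight and the saving of sentiment per category may be sketched as follows. The in-memory store, the sentiment range of -1 to 1 and the averaging used to derive a label are assumptions made for illustration; the specification states only that the sentiment is scaled by the article weight and saved with its category.

```python
from collections import defaultdict


def scaled_sentiment(raw_sentiment, article_weight):
    """Scale a raw sentiment score (assumed in [-1, 1]) by the source weight."""
    return raw_sentiment * article_weight


class CategorySentimentStore:
    """Hypothetical in-memory stand-in for the per-category sentiment database."""

    def __init__(self):
        self._scores = defaultdict(list)

    def add(self, category, raw_sentiment, article_weight):
        self._scores[category].append(scaled_sentiment(raw_sentiment, article_weight))

    def label(self, category):
        """Return "Positive", "Negative" or "Neutral" from the mean scaled score."""
        scores = self._scores.get(category)
        if not scores:
            return None  # nothing recorded for this category yet
        mean = sum(scores) / len(scores)
        if mean < 0:
            return "Negative"
        if mean > 0:
            return "Positive"
        return "Neutral"
```

Under this sketch, a negative article from a highly weighted source dominates a mildly positive article from a low weighted blog in the same category.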
Simultaneously, keyword extraction (528a) and article categorization (528b) are performed, the article is summarized (530) and detection of article is updated (532). A statement is extracted from the article (534a), the respective entity is associated to the statement (534b), and the sentiment of the statement is analyzed and the statement is weighted according to entities of influence based on closeness of corpus relationship (534c). Thereafter, trend analysis (536) and prediction are performed and the user is provided with a suggestion on the next course of action predicted (538).
Reference is now made to FIG. 6.0, which is a flowchart illustrating the steps of performing duplicate record detection filtering. Duplicate records or articles are filtered in order to reduce processing load, while articles that bring new updates to a recent issue are linked to the previously observed outcomes. As illustrated in FIG. 6.0, in performing duplicate record detection filtering, the timestamp of the article is first determined (602) and a list of historical article titles from the last X days (604) is queried from a database of articles (602a). Thereafter, the article is compared for document similarity using known measures such as cosine similarity or Euclidean distance (606). It is then determined whether the article is a duplicate by performing a thresholding comparison (608). If the article is not a duplicate, analysis of the article continues, as represented by the yes option (610). Otherwise, if the thresholding comparison finds the article to be a duplicate, the article is discarded, as represented by the no option (612).
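The duplicate filter of FIG. 6.0 can be sketched as follows, assuming cosine similarity over simple word counts of article titles. The tokenisation, the 7-day window standing in for "X days" and the 0.8 threshold are illustrative assumptions rather than values taken from the specification.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta


def term_counts(text):
    """Bag-of-words term frequencies for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recent_titles(history, now, days):
    """Titles of historical articles (title, timestamp) from the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [title for title, ts in history if ts >= cutoff]


def is_duplicate(title, history, now, days=7, threshold=0.8):
    """Thresholding comparison: True if any recent title is similar enough."""
    return any(
        cosine_similarity(title, old) >= threshold
        for old in recent_titles(history, now, days)
    )
```

An article for which is_duplicate returns True would be discarded (612); otherwise analysis would continue (610). Euclidean distance could be substituted for cosine_similarity, with the comparison direction reversed because a smaller distance means greater similarity.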
Reference is now made to FIG. 7.0, which is a flowchart illustrating the steps of updating detection of article. As illustrated in FIG. 7.0, the article summary (702) is queried from a database of articles and the similarity of the articles is computed using, e.g., Euclidean distance (704). The percentage of keyword similarity is then compared against a threshold (706). If the percentage of similarity is below the threshold, it is confirmed that the article update is false (718). If it is above the threshold, timestamp comparison is performed (708) and it is determined whether the timestamps are too far apart (710). If they are too far apart, the outcome correlation may vary significantly; the article is therefore treated as a new article (720) and the article update is confirmed to be false (718). If the timestamps are not too far apart, it is determined whether the articles are of the same category (712). If they are, the article update is confirmed to be correct (714) with reference to recent predictions and outcomes from the recent predictions and outcome database (716); otherwise, the article update is confirmed to be false (718).
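The decision sequence of FIG. 7.0 can be expressed as a short Python sketch. The ArticleSummary structure, the use of keyword-set overlap (Jaccard) as the similarity percentage, the 50% threshold and the 7-day limit are hypothetical choices for illustration; the specification names Euclidean distance only as one example and does not state threshold values. A similarity exactly equal to the threshold is treated here as passing, which is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ArticleSummary:
    keywords: frozenset
    category: str
    timestamp: datetime


def keyword_similarity(a, b):
    """Percentage overlap between two keyword sets (Jaccard index times 100)."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


def is_update(new, old, sim_threshold=50.0, max_gap=timedelta(days=7)):
    """Return True if `new` is an update to the issue covered by `old`."""
    # (706) similarity below threshold: article update is false (718)
    if keyword_similarity(new.keywords, old.keywords) < sim_threshold:
        return False
    # (710) timestamps too far apart: treated as a new article (720, 718)
    if abs(new.timestamp - old.timestamp) > max_gap:
        return False
    # (712) same category: update confirmed (714), otherwise false (718)
    return new.category == old.category
```

When is_update returns True, the article would be linked to the recent prediction and outcome recorded for the earlier article (716).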
Reference is now made to FIG. 8.0, which is a flowchart illustrating the steps of performing trend analysis. As illustrated in FIG. 8.0, performing trend analysis (536) further comprises the steps (800) of first extracting global parameters related to the article category (804) from a global entity parameters database (802). Thereafter, articles related to the global parameters are queried (806) and the time zone of each article is extracted from its metadata (808). The timestamp information of the articles is extracted and checked to see whether the time zone is earlier. This is useful in scenarios where, for example, the Federal Reserve in the USA has increased interest rates, which will have an impact on the Malaysian stock exchange the following working day. The difference in time zone can therefore be used to help predict the impact on a selected entity. Subsequently, it is determined whether the article is earlier in time zone (810); if so, the article undergoes the relevance filter (812), noun extraction (814), article categorization (816) and sentiment analysis (818), and is then compared with previous categories showing similar sentiments (822, 820). Weightage is then applied to trend positivity based on category of importance (824). The relevance filter (812) ensures that only information related to the entity of interest is extracted. Since the global parameters may span many topics, the topics from the article category that match the entity of interest are used as one of the criteria for relevance filtering. The global articles extracted are then categorized and sentiment analysis is performed. The information is stored in a database for historical reference. Historical trends are also retrieved from the database to analyze whether a similar scenario has occurred in the past.
For example, if an increase in the price of fuel has reduced the market share of a company by X% in the past Y months, this historical outcome could be used as one of the parameters to predict a future outcome in a similar category.
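Two of the trend analysis steps, the time zone check (810) and the weighting of trend positivity by category importance (824), can be sketched as below. Reading "earlier in time zone" as "published before a reference instant for the entity, such as its next market open" is one possible interpretation and is an assumption here, as are the CATEGORY_IMPORTANCE table, its values and the default weight.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical importance of each article category for the entity of interest.
CATEGORY_IMPORTANCE = {"INTEREST_RATE": 1.0, "FUEL": 0.6}
DEFAULT_IMPORTANCE = 0.1  # assumed weight for categories not in the table


def is_earlier(article_time: datetime, reference_time: datetime) -> bool:
    """True if the article precedes the reference instant.

    Both arguments must be timezone-aware; Python compares aware datetimes
    on the absolute timeline, so differing UTC offsets are accounted for.
    """
    return article_time < reference_time


def weighted_trend_positivity(observations) -> float:
    """Importance-weighted mean of (category, sentiment) pairs, sentiment in [-1, 1]."""
    weighted_sum = 0.0
    total_weight = 0.0
    for category, sentiment in observations:
        weight = CATEGORY_IMPORTANCE.get(category, DEFAULT_IMPORTANCE)
        weighted_sum += weight * sentiment
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

In the interest rate example above, an article stamped 14:00 at UTC-5 on one day precedes a 09:00 market open at UTC+8 on the following calendar day, so is_earlier returns True and the article would pass on to the relevance filter (812).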
Reference is now made to FIG. 9.0, which is a flowchart illustrating the steps of performing prediction and suggesting to the user the predicted next course of action. As illustrated in FIG. 9.0, performing prediction and suggesting to the user the predicted next course of action (538) further comprises the steps (900) of first determining whether a statement is from a Master Entity keyword (902). Statements extracted from an article usually play a significant role in a company's performance. If a distinguished person makes a negative remark about a company, the share price is at risk of being negatively impacted. If such a person has agreed to award a new contract to a company, this in turn would increase the share price. The relevance of the person is weighted for each statement. Accordingly, relevance is increased if the statement is from a Master Entity keyword (904) and reduced if it is not (906). Thereafter, weightage is adjusted according to relevance (908) and a statement positivity weight is produced (916). It is further determined whether an update is available (910). If an update is available, the previous outcome is used with high weightage (914), based on the recent prediction and outcome database (912). If no update is available, decision based aggregation is processed (926) from trend positivity (920), article positivity weight (918), historical trend weight (922) and global trend weight (924), using either a simple hard decision or a neural network model that could train its weights over time from historical data. The weights in the neural network could be updated smoothly over time to improve accuracy.
Finally, a suggested outcome is provided based on the prediction (928).
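The prediction flow of FIG. 9.0 with the "simple hard decision" option can be sketched as follows. The relevance multipliers, the majority-vote rule over the signs of the inputs and the outcome labels are illustrative assumptions. Returning the previous outcome directly when an update is available is a simplification of "using previous outcome with high weightage"; the neural network alternative is not shown.

```python
def statement_positivity(sentiment, from_master_entity, boost=1.5, damp=0.5):
    """Adjust statement sentiment by relevance (902 to 908) to give a positivity weight (916).

    `boost` and `damp` are hypothetical multipliers: relevance is increased
    for Master Entity statements and reduced otherwise.
    """
    relevance = boost if from_master_entity else damp
    return sentiment * relevance


def predict(signals, previous_outcome=None):
    """Suggest an outcome (928).

    signals: mapping of names such as trend positivity, article positivity
    weight, historical trend weight and global trend weight to signed scores.
    previous_outcome: outcome linked to the article when an update is
    available (910, 914); None when no update is available.
    """
    if previous_outcome is not None:
        # Simplification: the previous outcome dominates the decision.
        return previous_outcome
    # Hard decision (926): each signal casts one vote according to its sign.
    votes = sum(1 if v > 0 else -1 if v < 0 else 0 for v in signals.values())
    if votes > 0:
        return "positive"
    if votes < 0:
        return "negative"
    return "neutral"
```

A call such as predict({"trend": 0.4, "article": 0.2, "historical": -0.1, "global": 0.3}) gives three positive votes against one negative vote and therefore suggests a positive outcome.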
The present invention assists a user in extracting multiple levels of sentiment from an article and the trend of related happenings, and provides a system and method in which the next course of action following an event is predicted.
Unless the context requires otherwise or specifically stated to the contrary, integers, steps or elements of the invention recited herein as singular integers, steps or elements clearly encompass both singular and plural forms of the recited integers, steps or elements. Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.

Claims

1. A system (100) for predictive analytics of articles, the system (100) comprising: at least one Entity of Interest Query Module (102) for analysing an entity of interest received from a user; at least one Corpus Generator Module (104) for collecting data relating to the entity of interest; at least one Crawler Module (106) for crawling information on the entity of interest on provided sources of information continuously and crawling keywords defined in the at least one Corpus Generator Module (104) for latest updates in articles; at least one Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106); at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action; at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend; at least one Suggestion Generation Module (114) for suggesting to user on next course of action predicted; and at least one Actual Outcome Module (116) for providing feedback on actual outcome, characterized in that the at least one Prediction Engine (110) for performing predictive analysis of articles for next course of action further comprises (400): at least one Sentiment Correlation Engine (402) for analysing sentiment of statement in articles and weighing the sentiments according to influence of entities through closeness of corpus relationship; and at least one Prediction Correlation Engine (404) for performing prediction by using parameters obtained from at least historical trend, global trend, relevant statements and updated documents in articles.
2. The system (100) according to Claim 1, wherein the Articles Analysis Engine (108) for analysing information in articles received from the at least one Crawler Module (106) further comprises (200): at least one Statement Extraction Module (202) for extracting statements using machine learning tool; at least one Statement Entity Relation Module (204) for associating a respective entity to a statement; at least one Statement Weightage Module (206) for analysing weightage of statement according to influence of entities through closeness of corpus relationship; at least one Statement Sentiment Module (208) for analysing sentiment of statement according to influence of entities through closeness of corpus relationship; at least one Article Categorisation Module (210) for categorising articles using rule based technique; at least one Topic Sentiment Module (212) for analysing sentiment of each article by grouping of articles to its respective category based on nouns extracted and matching articles to a predefined topic grouping; at least one Duplicate Article Filter Module (214) for filtering duplicate articles; and at least one Update Detection Module (216) for updating articles that have new updates to a recent issue and linking to a previous outcome that was observed.
3. The system (100) according to Claim 1, wherein the at least one Trend Analysis Engine (112) for analysing articles based on historical trend and global trend further comprises (300): at least one Global Analysis Module (302) for extracting global parameters from global corpus related to category of article; at least one Trend Monitor Module (304) for providing feedback of actual outcome; and at least one weightage of articles by category (306) is applied to the article trend based on importance of category before the actual outcome of the article is produced by Actual Outcome Module (116).
4. A method (500) for predictive analytics of articles, the method comprising steps of: determining if input from user is available (502); proceeding to step (514) if input from user is not available; obtaining user query for entity of interest if input from user is available (504, 504a); determining if an entity corpus has been built for entity of interest (506); building and updating the entity corpus with keywords if the entity corpus has yet to be built (508, 510); selecting and updating the entity corpus with keywords if the entity corpus has been built (512); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514); extracting metadata of articles (518) upon receipt of articles (516); determining if timestamp on article is current (520); extracting keywords from a predefined or a user defined URL's for corpus generation and crawling of necessary articles related to keywords of corpus (514) if timestamp of article is not current and reiterate step (516) and step (518); performing duplicate record detection filtering (522) if the timestamp on article is current; retrieving article weightage from table depending on article source domain (524) and performing sentiment analysis on article and scaled by article weight (526a, 526b); performing keyword extraction (528a), article categorization (528b), summarizing article (530) and updating detection of article (532); extracting a statement from article (534a); associating respective entity to the statement (534b); analyzing sentiment of the statement and the statement is weighted according to entities of influence based on closeness of corpus relationship (534c); performing trend analysis (536); and performing prediction and suggesting to user on next course of action predicted (538), characterized in that performing prediction and suggesting to user on next course of action predicted (538) further comprises steps of (900): determining if statement is from Master Entity keyword (902); increasing relevance if statement is from Master Entity keyword (904); reducing relevance if statement is not from Master Entity keyword (906); adjusting weightage according to relevance (908) and producing statement positivity weight (916); determining if update is available (910); using previous outcome with high weightage (914) based on recent prediction and outcome database (912) if the update is available; processing decision based aggregation (926) from trend positivity (920), article positivity weight (918), historical trend weight (922) and global trend weight (924) if the update is not available; and providing suggested outcome based on prediction (928).
5. The method (500) according to Claim 4, wherein performing duplicate record detection filtering (522) further comprises steps of (600): determining timestamp of article (602); querying a list of historical articles titles from a database of articles (602a) in the last X days (604); performing article comparison on document similarity using known algorithms (606); determining if article is a duplicate by performing thresholding comparison (608); continuing with analysis of article, if article is not a duplicate (610); and discarding article, if article is found to be a duplicate resulting from thresholding comparison (612).
6. The method (500) according to Claim 4, wherein updating detection of article (532) further comprises steps of (700): querying article summary (702) from a database of articles for summarizing similarity of articles (704); comparing percentage of similarity of keyword by determining if percentage of similarity is above threshold (706); confirming article update is false if percentage of similarity of keyword is below threshold (718); performing timestamp comparison if percentage of similarity is above threshold (708); determining if timestamp is too far (710); determining if article is of a same category if timestamp is not too far (712); confirming article update is correct (714) if articles are of the same category with recent predictions and outcome from recent predictions and outcome database (716); else confirming article update is false if articles are not of the same category (718); identifying article as a new article if timestamp is too far (720); and confirming article update is false (718).
7. The method (500) according to Claim 4, wherein performing trend analysis (536) further comprises steps of (800): extracting global parameters related to article category (804) from global entity parameters database (802); querying article related to global parameters (806); extracting time zone of article from metadata (808); determining if article is earlier in time zone (810); comparing article with previous categories showing similar sentiments (822, 820) upon undergoing relevance filter (812), noun extraction (814), article categorization (816) and sentiment analysis (818) if the article is earlier in time zone; and applying weightage to trend positivity based on category of importance (824).
PCT/MY2020/050056 2019-09-27 2020-07-23 A system and method for predictive analytics of articles WO2021060967A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2019005732 2019-09-27
MYPI2019005732 2019-09-27

Publications (1)

Publication Number Publication Date
WO2021060967A1 (en) 2021-04-01

Family

ID=75165960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2020/050056 WO2021060967A1 (en) 2019-09-27 2020-07-23 A system and method for predictive analytics of articles

Country Status (1)

Country Link
WO (1) WO2021060967A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020169762A1 (en) * 1999-05-07 2002-11-14 Carlos Cardona System and method for database retrieval, indexing and statistical analysis
JP2007183903A (en) * 2005-12-06 2007-07-19 Nippon Hoso Kyokai <Nhk> Trend information analyzer
US20090187540A1 (en) * 2008-01-22 2009-07-23 Microsoft Corporation Prediction of informational interests
US8037068B2 (en) * 2005-04-06 2011-10-11 Google Inc. Searching through content which is accessible through web-based forms
US8775406B2 (en) * 2007-08-14 2014-07-08 John Nicholas Gross Method for predicting news content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867315

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20867315

Country of ref document: EP

Kind code of ref document: A1