WO2015044934A1

WO2015044934A1 - A method for adaptively classifying sentiment of document snippets

Info

Publication number: WO2015044934A1
Application number: PCT/ID2014/000012
Authority: WO
Inventors: Ismail FAHMI; Widanardi SATRYATOMO
Original assignee: ABIDIN, Indira Ratna Dewi
Priority date: 2013-09-30
Filing date: 2014-08-29
Publication date: 2015-04-02

Abstract

The present invention relates to a method for adaptively classifying sentiment of document snippets, comprising the steps of: crawling Web documents to search document snippets; populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets; creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set; classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets; characterised by the steps of: inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval; adapting the initial training data sets according to the corrected results of the sentiment classification, building a new machine learning model according to the adapted training data sets; and classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.

Description

A METHOD FOR ADAPTIVELY CLASSIFYING SENTIMENT OF DOCUMENT SNIPPETS

Background of the Invention

Field of the Invention

This invention relates to a computer-implemented method for classifying sentiment of document snippets, and more particularly to a method for adaptively classifying sentiment of document snippets.

Description of Related Arts

In business marketing and corporate institutions, brand management is an important strategy for establishing and sustaining a presence in a particular industry or even globally. Measuring the efficacy of brand management has become a challenge now that content sharing is made easy by the widening availability of the internet and mobile devices. Gossip, debate and conversation about a brand move continuously from one media platform to another, worldwide and at any time of day. Moreover, a high volume of brand citations does not necessarily mean a good reception. Sentiment tones need to be considered to get a clearer picture. Hence, there is a growing need for brand owners or managers to collate information about who is talking about their brands, and where, when and how they talk about them.

Therefore, a few approaches to monitoring media platforms have been developed to cope with the rising number of online and social media channels, while traditional media such as print, television and radio are also taking the digital road to distribute their brands. For example, patent publication US 2012/0101808 A1 discloses methods and systems for extracting and analyzing user-generated content (UGC) in order to provide opinion-bearing information concerning different categories of a product. The cited publication discloses a computer-implemented method for determining sentiment in Web documents, comprising: harvesting one or more Web documents from one or more content sources and extracting keywords from the Web documents; filtering at a server, according to a phase transition method, to produce filtered keywords; then determining sentiment expressed in the Web documents on a category-by-category basis, said sentiment being determined by analysis of words identified as sentiment-bearing or opinion-bearing within the Web document; and reporting the sentiment so determined. However, the cited publication does not provide a method for adaptive domain-specific sentiment classification that adapts to the evolving informal words and acronyms of document snippets.

Another patent, US 8266148 B2, discloses various embodiments of a method for Business Intelligence (BI) metrics on unstructured data. The disclosed method is a machine-implemented method for a pipelined process of capture, classification and dimensioning of data from a plurality of data sources that include unstructured data having no explicit dimensions associated with the unstructured data, to generate a domain-relevant classified data index that is usable by a plurality of different intelligence metrics to perform different kinds of business intelligence analytics. In one embodiment of the cited patent, user feedback is obtained from the user in response to the analytics results that are presented to the user, and causes a data processing machine to adaptively utilize the user feedback to modify the relevance classification. However, the cited patent does not disclose a method for generating document snippet features in detail.

Accordingly, it can be seen from the prior art that there exists a need for an improved method for adaptively classifying sentiment of document snippets that adapts to the evolving informal words and acronyms of document snippets.

Summary of Invention

It is an objective of the present invention to provide a method for classifying sentiment of document snippets.

It is also an objective of the present invention to provide an adaptive method for classifying sentiment of document snippets.

It is yet another objective of the present invention to provide a natural language processing technology that adapts to the evolving informal words and acronyms of document snippets.

Accordingly, these objectives may be achieved by following the teachings of the present invention. The present invention relates to a method for adaptively classifying sentiment of document snippets, comprising the steps of: crawling Web documents to search for document snippets; populating document snippets by entity and using a sentiment lexicon to annotate sentiment-bearing snippets; creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set; classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets; characterised by the steps of: inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval; adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.

Brief Description of the Drawings

The features of the invention will be more readily understood and appreciated from the following detailed description when read in conjunction with the accompanying drawings of the preferred embodiment of the present invention, in which:

Fig. 1 is a flow chart of a method for adaptively classifying sentiment of document snippets.

Fig. 2 is a flow chart of a method for crawling process.

Fig. 3 is a flow chart of a method for populating document snippets by entity.

Fig. 4 is a flow chart of a method for generating document snippets features.

Fig. 5 is a flow chart of a method for creating initial training data sets.

Fig. 6 is a flow chart of a method for classifying sentiment of new document snippets.

Fig. 7 is a flow chart of a method for adaptive sentiment classification.

Detailed Description of the Invention

As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting but merely as a basis for the claims. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.

As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words "include," "including," and "includes" mean including, but not limited to. Further, the words "a" or "an" mean "at least one" and the word "plurality" means one or more, unless otherwise mentioned. Where abbreviations or technical terms are used, these carry the commonly accepted meanings known in the technical field.

For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures. The present invention will now be described with reference to Figs. 1-7.

The present invention relates to a method (100) for adaptively classifying sentiment of document snippets, comprising the steps of:

crawling Web documents to search for document snippets;

populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets;

creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set;

classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets;

characterised by the steps of:

inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval;

adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and

classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.

The method (100) in the present invention is preferably for providing a monitoring service on online multi-media platforms for the benefit of a brand owner. By means of said monitoring service, the brand owner can monitor and analyse public conversation about its brand reputation while comparing it with that of competitors. On top of that, the sentiment analysis allows the brand owner to track and keep up with positive, neutral or negative conversation without having to conduct a survey manually. This allows the brand owner to resolve any complaint as soon as possible and to further improve its service.

Referring to Fig. 1, the method (100) for adaptively classifying sentiment of document snippets preferably begins with crawling Web documents to search for document snippets. A Web document herein refers to a document or information resource that can be accessed by users over a network such as the Internet or an intranet. The content of the Web documents may include text, multimedia files and images, which are typically viewed as Web pages with the aid of a Web browser. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the Web documents include social networking services, discussion sites and web newspapers. Examples of the social networking services are Facebook and Twitter, while a discussion site may be an internet forum or blog. A web newspaper, on the other hand, may be an online version of a printed periodical that exists on the Internet, commonly known as the World Wide Web (WWW). However, it should be understood that the Web documents herein are not limited to documents or information resources extracted from the Surface Web of the WWW but may also include the Deep Web.

In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of crawling Web documents includes parsing and indexing search results of the document snippets. Fig. 2 shows a preferred embodiment of the step of crawling Web documents, namely a method for crawling process (101). In general, the method for crawling process (101) browses the contents of the Web documents to search for document snippets related to the brand owner, the brand reputation and/or their products or services. A document snippet as used herein refers to the programming term for a re-usable piece of text. A document snippet may be a sentence or a paragraph. The method for crawling process (101) may be performed by a Web crawler, which is mainly used to create a copy of all the browsed Web documents for later processing by a local search engine. Said search engine then performs indexing, collecting, parsing and storing of the document snippets, as shown in Fig. 2, to facilitate fast and accurate information retrieval.
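The parsing and indexing described above can be illustrated with a minimal Python sketch. It assumes the Web documents have already been fetched as plain text, treats each sentence as one snippet, and builds a simple inverted index. The function names, the sentence-splitting rule and the sample texts (with the made-up brands AirAlpha and BetaJet) are illustrative assumptions and are not taken from the specification.

```python
import re
from collections import defaultdict

def split_into_snippets(text):
    """Split a document into sentence-level snippets (naive punctuation split)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def build_index(documents):
    """Return the snippet list and an inverted index: token -> set of snippet ids."""
    snippets = []
    index = defaultdict(set)
    for doc in documents:
        for snippet in split_into_snippets(doc):
            snippet_id = len(snippets)
            snippets.append(snippet)
            for token in re.findall(r"[a-z0-9']+", snippet.lower()):
                index[token].add(snippet_id)
    return snippets, index

# Hypothetical crawled texts; a real crawler would fetch these from the Web.
docs = [
    "AirAlpha cut its fares today. Customers love the new AirAlpha lounge!",
    "BetaJet delayed my flight again.",
]
snippets, index = build_index(docs)
matching = [snippets[i] for i in sorted(index["airalpha"])]
```

Looking up a token in the index returns the ids of every snippet that mentions it, which is what makes later retrieval by entity fast.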

Fig. 3 shows a preferred embodiment of the step of populating document snippets by entity and using a sentiment lexicon to annotate sentiment-bearing snippets, namely a method for populating document snippets by entity (102). The method for populating document snippets by entity (102) comprises searching the index obtained from the method for crawling process (101), and retrieving relevant document snippets by entity. An entity as referred to herein is a domain of the document snippets, such as politics, economics, market players, industries or any domain concerning the client as the brand owner, the brand reputation and/or their products or services. These entities are preferably identified and filtered using the method shown in Fig. 4, namely a method for generating document snippets features (103). The method for generating document snippets features (103) further identifies sentiment-bearing snippets in order to classify the sentiment of document snippets.
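As a rough illustration of retrieving snippets by entity, the Python sketch below groups snippets under each entity whose keywords they mention. The keyword map, the entity names and the function name are hypothetical; the specification does not prescribe how an entity is matched to a snippet.

```python
import re
from collections import defaultdict

# Hypothetical entity-to-keyword map (an assumption for illustration only).
ENTITY_KEYWORDS = {
    "airline industry": {"airalpha", "betajet", "fares", "flight"},
    "politics": {"election", "minister"},
}

def populate_by_entity(snippets, entity_keywords):
    """Group snippets under every entity whose keywords appear in the snippet."""
    populated = defaultdict(list)
    for snippet in snippets:
        tokens = set(re.findall(r"[a-z0-9']+", snippet.lower()))
        for entity, keywords in entity_keywords.items():
            if tokens & keywords:
                populated[entity].append(snippet)
    return dict(populated)

snippets = [
    "AirAlpha cut its fares today.",
    "The minister called an election.",
    "Nice weather this week.",
]
populated = populate_by_entity(snippets, ENTITY_KEYWORDS)
```

Snippets that match no entity are simply left out of the populated result.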

In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of populating document snippets includes normalizing words in document snippets and generating n-grams from the entity. Referring to Fig. 4, the method for generating document snippets features (103) is further characterized by normalizing words in document snippets for bag-of-words features; generating n-grams from words surrounding the entity; using a sentiment lexicon to annotate sentiment-bearing snippets; and representing each document snippet according to the method for generating document snippets features (103).

The step of normalizing words in document snippets for bag-of-words features in the preferred embodiment simplifies the document snippets for use in natural language processing and information retrieval. In the preferred embodiment, all words found in the document snippets are first collected in an unordered collection of words, which is also known as a bag-of-words. The bag-of-words is then normalized to reduce redundant words, for example by normalizing past tense to present tense, plural to singular, and capital letters to lower-case letters. The number of words considered in a document snippet is thereby reduced, which improves classification speed.
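The normalization step can be sketched in Python as follows. This is a deliberately naive illustration: the small irregular-verb table and the plural-stripping rules are assumptions made for the example, and a production system would more likely rely on a proper stemmer or lemmatizer.

```python
import re
from collections import Counter

# Hypothetical lookup for a few past-tense forms; not part of the specification.
IRREGULAR = {"was": "is", "were": "are", "flew": "fly", "went": "go"}

def normalize_word(word):
    """Lower-case a word, map known past-tense forms, and strip simple plurals."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def bag_of_words(snippet):
    """Count normalized words; word order is discarded."""
    tokens = re.findall(r"[A-Za-z']+", snippet)
    return Counter(normalize_word(t) for t in tokens)

bag = bag_of_words("Fares were low and the flights WERE on time")
```

After normalization, "were" and "WERE" collapse into a single entry, so the vocabulary the classifier has to handle becomes smaller.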

The words in the bag-of-words (for example, n words) from the document snippets are then modelled using an n-grams model in the step of generating n-grams from words surrounding the entity. In the preferred embodiment, generating n-grams generally means generating contiguous sequences of n words from the words collected from the document snippets. In addition, an n-grams model is a type of probabilistic language model for predicting the next item in said sequence in the form of an (n-1)-order Markov model. The n-grams model is therefore generally a language model that exploits the ordering of n words from the document snippets, and generating said n-grams model serves to classify the sentiment of the document snippets. Thus, by generating the n-grams model, the sentiment of the document snippets can be predicted and valued in a range of -1 to +1, wherein -1 is for negative sentiment, 0 is for neutral sentiment and +1 is for positive sentiment. In a simple example, if a document snippet states that company A is better than company B in an entity of airline industry in terms of their services and fares, then by generating the n-grams, the word "better" will be valued as +1 to refer to a positive adjective. Hence, from the document snippet, it can be predicted that company A gains a positive sentiment while company B automatically gains a negative sentiment. It should be understood by persons skilled in the art that generating an n-grams model is common in speech recognition and sentiment analysis.
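To make the notion of n-grams from words surrounding the entity concrete, the following Python sketch extracts contiguous n-word sequences from a window of tokens around an entity mention. The window size, the token spelling `company_a` and the function names are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_around_entity(tokens, entity, n=2, window=3):
    """Generate n-grams from the words within `window` positions of each entity mention."""
    grams = []
    for pos, token in enumerate(tokens):
        if token == entity:
            start = max(0, pos - window)
            context = tokens[start:pos + window + 1]
            grams.extend(ngrams(context, n))
    return grams

tokens = "company_a is better than company_b on fares".split()
bigrams = ngrams(tokens, 2)
near_a = ngrams_around_entity(tokens, "company_a", n=2, window=2)
```

Restricting the n-grams to a window around the entity keeps the features tied to what is being said about that entity rather than about the whole snippet.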

Said value is then indexed through a dictionary of phrases arranged according to their comparative scores, which is known as a sentiment lexicon. The step of using a sentiment lexicon to annotate sentiment-bearing snippets in the preferred embodiment lists similar adjectives according to their comparative scores. The sentiment-bearing snippets are therefore annotated according to the sentiment lexicon determined at this step. Finally, the step of representing each document snippet according to the method for generating document snippets features (103) populates the document snippets by entity and by sentiment-bearing snippets.
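A lexicon-based annotation of this kind might look like the Python sketch below. The lexicon entries and the rule of summing word scores into a label of -1, 0 or +1 are assumptions chosen for the example; the specification only states that words carry comparative scores in the range -1 to +1.

```python
# Hypothetical sentiment lexicon: word -> score in the range -1 to +1.
SENTIMENT_LEXICON = {
    "better": 1, "good": 1, "love": 1,
    "worse": -1, "bad": -1, "delayed": -1,
}

def annotate(tokens, lexicon):
    """Mark a snippet as sentiment-bearing and derive a coarse label from lexicon hits."""
    matched = {t: lexicon[t] for t in tokens if t in lexicon}
    total = sum(lexicon[t] for t in tokens if t in lexicon)
    label = 1 if total > 0 else -1 if total < 0 else 0
    return {"sentiment_bearing": bool(matched), "matched": matched, "label": label}

positive = annotate("company_a is better than company_b".split(), SENTIMENT_LEXICON)
neutral = annotate("the flight left at noon".split(), SENTIMENT_LEXICON)
```

Snippets with no lexicon hit are flagged as not sentiment-bearing, which lets them be set aside when the training data sets are prepared.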

Fig. 5 shows a preferred embodiment of the step of creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set, namely a method for creating initial training data sets (104). The method for creating initial training data sets (104) comprises manually creating initial training data sets from the document snippets populated by entity. In the preferred embodiment, users may create the initial training data sets using a table to accommodate the numerous entities generated in the method for generating document snippets features (103). For example, a training data set is created for each entity in table form to simplify monitoring.

The initial training data sets are then used as input to train the machine learning model. A machine learning model is preferably built for each initial training data set. In the preferred embodiment, the machine learning model learns to classify the sentiment of document snippets according to the initial training data sets. During the training session, the machine learning model deals with generalization, which is the ability of said machine learning model to accurately classify sentiment on new, unseen examples of document snippets after having been trained on the initial training data sets.

There are several types of algorithmic approach by which the machine learning model can learn to classify the sentiment of document snippets, such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning and learning-to-learn algorithms. The selection of the algorithm implemented for the machine learning model depends on the desired outcome or the input available. For example, supervised learning generates a function that maps inputs to desired outputs, wherein the machine learning model approximates a function mapping a vector into classes by looking at input-output examples of the function.
However, any algorithm for the machine learning model may be used in the present invention, depending on the needs and on the person skilled in the art. The machine learning model is then expected to automatically perform sentiment classification of new document snippets.

Fig. 6 shows a preferred embodiment of the step of classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets, namely a method for classifying sentiment of new document snippets (105). In said method (105), each machine learning model automatically performs sentiment classification of new document snippets. The machine learning model trained on the initial training data sets is herein also regarded as a default model. However, the default model is not capable of coping with the informal words and acronyms of document snippets found throughout various social media such as Facebook and Twitter.

Therefore, to adapt to the evolving informal words and acronyms of document snippets, the method (100) in the present invention is characterised by the step of inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of inspecting and correcting results of the sentiment classification is done manually by a user to adapt to evolving informal words and acronyms of document snippets. Users may inspect and correct results of the sentiment classification performed using the default model at a predetermined time interval such as weekly, daily or hourly. The step of inspecting and correcting results of the sentiment classification may be performed randomly or systematically, depending on the users' management system.
In the inspecting step, users look for new words that are overlooked by the default model and that may represent sentiment tones of the document snippets. The new words may be slang, short forms, new abbreviations or acronyms that were not found or defined using the method for generating document snippets features (103). The users also inspect the results of the sentiment classification performed using the default model, which was trained on the initial training data sets. If there is any mistake or inconsistency between said results and the initial training data sets, the users then correct the results of the sentiment classification and the initial training data sets. Besides users of the monitoring service on online multi-media platforms, the client may also contribute feedback by inspecting and correcting results of the sentiment classification.
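The idea of building one model per initial training data set and using it as a default model can be sketched in Python as below. The specification leaves the learning algorithm open, so the multinomial Naive Bayes classifier with add-one smoothing used here is purely an assumed stand-in, and the class name, entity name and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, examples):
        """Train on an iterable of (tokens, label) pairs and return self."""
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        """Return the label with the highest log-probability for the tokens."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical manually created training table: one data set per entity.
initial_training_sets = {
    "airline industry": [
        ("fares are good".split(), 1),
        ("service is great".split(), 1),
        ("flight was delayed".split(), -1),
        ("staff were rude".split(), -1),
    ],
}
default_models = {
    entity: NaiveBayesSentiment().fit(examples)
    for entity, examples in initial_training_sets.items()
}
model = default_models["airline industry"]
```

In this sketch a snippet made up only of unseen slang contributes no evidence at all, so the prediction falls back on the class priors. That is the kind of gap the periodic manual inspection is meant to catch.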

Any correction made to the results is adapted into the initial training data sets. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of adapting the initial training data sets includes creating new training data sets. Fig. 7 shows a preferred embodiment of the step of adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets, namely a method for adaptive sentiment classification (106). The method for adaptive sentiment classification (106) comprises creating a new training data set and machine learning model if there is more than one correction on a particular entity. The more accurate training data sets are added and custom machine learning models are built for each entity, the higher the accuracy of the results of the sentiment classification will be. However, a higher number of machine learning models may increase the time consumed by the sentiment classification in general.

The method for adaptive sentiment classification (106) is then characterized by the step of classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets. The adapted training data sets include the initial training data sets, the corrected initial training data sets and the newly created training data sets. Machine learning models are also built and corrected according to the adapted training data sets. Therefore, each machine learning model may automatically perform sentiment classification of new document snippets adaptively.
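The adaptation loop can be illustrated with the following Python sketch, in which reviewer corrections are merged into the per-entity training data sets and a new model is rebuilt from the adapted sets. The toy token-count model, the function names and the slang token "gr8" are assumptions made for the example; the sketch also applies every correction immediately and does not model the condition of more than one correction per entity.

```python
from collections import Counter, defaultdict

def train(examples):
    """Build a toy model: per-label counts of the tokens seen in training."""
    model = defaultdict(Counter)
    for tokens, label in examples:
        model[label].update(tokens)
    return model

def classify(model, tokens):
    """Pick the label that saw the tokens most often; 0 (neutral) if nothing matches."""
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    if not scores:
        return 0
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else 0

def adapt(training_sets, corrections):
    """Merge (entity, tokens, corrected_label) corrections, then rebuild each model."""
    adapted = {entity: list(examples) for entity, examples in training_sets.items()}
    for entity, tokens, corrected_label in corrections:
        adapted.setdefault(entity, []).append((tokens, corrected_label))
    models = {entity: train(examples) for entity, examples in adapted.items()}
    return adapted, models

initial = {"airline industry": [(["good", "fares"], 1), (["flight", "delayed"], -1)]}
default_models = {entity: train(examples) for entity, examples in initial.items()}

new_snippet = ["service", "is", "gr8"]  # informal word unknown to the default model
before = classify(default_models["airline industry"], new_snippet)

# A reviewer inspects the result and corrects the label to positive.
corrections = [("airline industry", new_snippet, 1)]
adapted_sets, adapted_models = adapt(initial, corrections)
after = classify(adapted_models["airline industry"], ["gr8", "lounge"])
```

The initial training data sets are left untouched and the adapted sets are returned as copies, so the default model and the adapted model can be compared side by side.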

The present invention may also be adapted to understand the English language or any local language to cater to the client's needs.

Although the present invention has been described with reference to specific embodiments, also shown in the appended figures, it will be apparent to those skilled in the art that many variations and modifications can be made within the scope of the invention as described in the specification and defined in the following claims.

Description of the reference numerals used in the accompanying drawings according to the present invention:

Reference Numerals    Description

100 method for adaptively classifying sentiment of document snippets

101 method for crawling process

102 method for populating document snippets by entity

103 method for generating document snippets features

104 method for creating initial training data sets

105 method for classifying sentiment of new document snippets

106 method for adaptive sentiment classification

Claims

I/We claim:

1. A method (100) for adaptively classifying sentiment of document snippets, comprising the steps of:

crawling Web documents to search for document snippets;

populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets;

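creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set;

classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets;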

characterised by the steps of:

inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval;

adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and

classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.

2. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the Web documents include social networking services, discussion sites and web newspapers.

3. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of crawling Web documents includes parsing and indexing search results of the document snippets.

4. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of populating document snippets includes normalizing words in document snippets and generating n-grams from the entity.

5. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of inspecting and correcting results of the sentiment classification is done manually by a user to adapt to evolving informal words and acronyms of document snippets.

6. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of adapting the initial training data sets includes creating new training data sets.