WO2015044934A1 - A method for adaptively classifying sentiment of document snippets - Google Patents

Info

Publication number
WO2015044934A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentiment
snippets
document snippets
training data
data sets
Prior art date
Application number
PCT/ID2014/000012
Other languages
French (fr)
Inventor
Ismail FAHMI
Widanardi SATRYATOMO
Original Assignee
ABIDIN, Indira Ratna Dewi
Priority date
Filing date
Publication date
Application filed by ABIDIN, Indira Ratna Dewi
Publication of WO2015044934A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This invention relates to a computer implemented method for classifying sentiment of document snippets, and more particularly to a method for adaptively classifying sentiment of document snippets.
  • brand management is an important strategy for establishing and sustaining a brand in a particular industry or even globally.
  • measuring the efficacy of brand management has become a challenge since content sharing is made easy by the widening availability of the internet and mobile devices. Gossip, debate or conversation about a brand moves continuously from one media platform to another, worldwide and at any time of the day. Moreover, a high volume of brand citations does not necessarily mean good reception. Sentiment tones need to be considered to get a clearer picture. Hence, there is a growing need for brand owners or managers to collate information on who is talking about their brands, and where, when and how they talk about them.
  • the cited publication disclosed a computer-implemented method for determining sentiment in Web documents, comprising: harvesting one or more Web documents from one or more content sources and extracting keywords from the Web documents; filtering at a server, according to a phase transition method, to produce filtered keywords; then determining sentiment expressed in the Web documents on a category-by-category basis; said sentiment determined by analysis of words identified as sentiment-bearing or opinion-bearing within the Web document; and reporting the sentiment so determined.
  • the cited patent does not provide a method for adaptive domain-specific sentiment classification that adapts to the evolving informal words and acronyms of document snippets.
  • Another US patent, 8266148 B2, disclosed various embodiments of a method for Business Intelligence (BI) metrics on unstructured data.
  • the disclosed method is a machine-implemented method for a pipelined process of capture, classification and dimensioning of data from a plurality of data sources that include unstructured data having no explicit dimensions associated with the unstructured data, to generate a domain-relevant classified data index that is useable by a plurality of different intelligence metrics to perform different kinds of business intelligence analytics.
  • user-feedback is obtained from the user in response to the analytics results that are presented to the user, and causes a data processing machine to adaptively utilize the user-feedback to modify the relevance classification.
  • the cited patent does not disclose a method for generating document snippet features in detail.
  • the present invention relates to a method for adaptively classifying sentiment of document snippets, comprising the steps of: crawling Web documents to search for document snippets; populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets; creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set; classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets; characterised by the steps of: inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval; adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.
  • Fig. 1 is a flow chart of a method for adaptively classifying sentiment of document snippets.
  • Fig. 2 is a flow chart of a method for crawling process.
  • Fig. 3 is a flow chart of a method for populating document snippets by entity.
  • Fig. 4 is a flow chart of a method for generating document snippets features.
  • Fig. 5 is a flow chart of a method for creating initial training data sets.
  • Fig. 6 is a flow chart of a method for classifying sentiment of new document snippets.
  • Fig. 7 is a flow chart of a method for adaptive sentiment classification.
  • the present invention relates to a method (100) for adaptively classifying sentiment of document snippets, comprising the steps of:
  • Method (100) in the present invention is preferably for providing a monitoring service on online multi-media platforms for the benefit of a brand owner.
  • the brand owner gets to monitor and analyse public conversation about their brand reputation while comparing it to that of competitors.
  • the sentiment analysis allows the brand owner to track and keep up with positive, neutral or negative conversation without having to do a survey manually.
  • the brand owner can address any complaint as soon as possible and further improve their service.
  • the method (100) for adaptively classifying sentiment of document snippets preferably begins with crawling Web documents to search for document snippets.
  • the Web document herein refers to a document or information resources that can be accessed by users over a network such as the Internet or an intranet.
  • Content of the Web documents may include text, multimedia files and images, which are typically viewed as Web pages with the aid of a Web browser.
  • the Web documents include social networking services, discussion sites and web newspapers. Examples of the social networking services are Facebook and Twitter, while the discussion sites may be an internet forum or blog.
  • the web newspaper may be an online version of a printed periodical that exists on the Internet, commonly known as the World Wide Web (WWW).
  • the Web documents herein should not be limited to documents or information resources which are extracted from the Surface Web of the WWW but may also include the Deep Web.
  • the step of crawling Web documents includes parsing and indexing search results of the document snippets.
  • Fig. 2 shows a preferred embodiment for the step of crawling Web documents, namely a method for crawling process (101).
  • the method for crawling process (101) browses contents of the Web documents to search for document snippets related to the brand owner, brand reputation and/or their products or services.
  • the term document snippet as used herein borrows the programming term for a piece of re-usable text.
  • the document snippets may be a sentence or a paragraph.
  • the method for crawling process (101) may be performed by a Web crawler that is mainly used to create a copy of all the browsed Web documents for later processing by a local search engine. Said search engine then performs indexing, collecting, parsing and storing of the document snippets, as shown in Fig. 2, to facilitate fast and accurate information retrieval.
  • Fig. 3 shows a preferred embodiment for the step of populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets, namely a method for populating document snippets by entity (102).
  • the method for populating document snippets by entity (102) comprises searching the index obtained from the method for crawling process (101); and retrieving relevant document snippets by entity.
  • the entity referred to herein is the domain of the document snippets, such as politics, economics, market players, industries and any domain concerning the client as the brand owner, brand reputation and/or their products or services.
  • These entities are preferably identified and filtered using the method shown in Fig. 4, namely a method for generating document snippets features (103).
  • the method for generating document snippets features (103) further identifies sentiment-bearing snippets in order to classify sentiment of document snippets.
  • the step of populating document snippets includes normalizing words in document snippets and generating n-grams from the entity.
  • the method for generating document snippets features (103) is further characterized by normalizing words in document snippets for bag-of-words features; generating n-grams from words surrounding the entity; using a sentiment lexicon to annotate sentiment-bearing snippets; and representing each document snippet according to the method for generating document snippets features (103).
  • the step of normalizing words in document snippets for bag-of-words features in the preferred embodiment simplifies document snippets to be used in natural language processing and information retrieval.
  • all words found in the document snippets will first be collected in an unordered collection of words, which is also known as the bag-of-words.
  • the bag-of-words is then further normalized to reduce redundant words, for example by normalizing past tense to present tense, plural to singular, and capital letters to lower case letters. Therefore, the number of words considered in the document snippet can be reduced to improve classification speed.
  • the words in the bag-of-words from the document snippets (for example, n words) are then modelled using an n-gram model in the step of generating n-grams from words surrounding the entity.
  • generating n-grams generally means generating contiguous sequences of n words from the words collected from the document snippets.
  • an n-gram model is a type of probabilistic language model that predicts the next item in such a sequence from the preceding (n-1) items. The n-gram model is therefore a language model that exploits the ordering of n words in the document snippets, and generating it allows the sentiment of the document snippets to be classified.
  • sentiment of the document snippets can be predicted and valued in a range of -1 to +1, wherein -1 is for negative sentiment, 0 is for neutral sentiment and +1 is for positive sentiment.
  • a document snippet states that company A is better than company B in an entity of airline industry in terms of their services and fares
  • the word "better" is valued as +1, as a positive adjective.
  • company A gains a positive sentiment while company B automatically gains a negative sentiment.
  • Said value is then indexed through a dictionary of phrases according to their comparative scores, which is known as a sentiment lexicon.
  • the step of using a sentiment lexicon to annotate sentiment-bearing snippets in the preferred embodiment lists similar adjectives according to their comparative scores. Therefore, the sentiment-bearing snippets are annotated according to the sentiment lexicon determined at this step.
  • the step of representing each document snippet according to the method for generating document snippets features (103) populates document snippets by entity and by sentiment-bearing snippets.
  • Fig. 5 shows a preferred embodiment for the step of creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set, namely a method for creating initial training data sets (104).
  • the method for creating initial training data sets (104) comprises manually creating initial training data sets from the populated document snippets by entity.
  • users may create the initial training data sets using a table to accommodate the numerous entities generated in the method for generating document snippets features (103). For example, a training data set is created for each entity in table form to simplify monitoring.
  • the initial training data sets will then be used as an input to train the machine learning model.
  • a machine learning model is preferably built for each initial training data set.
  • the machine learning model learns to classify sentiment of document snippets according to the initial training data sets.
  • the machine learning model deals with generalization which is the ability of said machine learning model to accurately classify sentiment on new, unseen examples of document snippets after having trained on the initial training data sets.
  • there are several algorithmic approaches by which the machine learning model can learn to classify the sentiment of document snippets, such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning and learning-to-learn algorithms.
  • the selection of the algorithm implemented for the machine learning model depends on the desired outcome or input available.
  • the supervised learning generates a function that maps input to desired output wherein the machine learning model approximates a function mapping a vector into classes by looking at input-output examples of the function.
  • any algorithm for the machine learning model may be used in the present invention, depending on the needs and on the person skilled in the art.
  • the machine learning model is then expected to automatically perform sentiment classification of new document snippets.
  • Fig. 6 is a preferred embodiment of the step of classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets, namely a method for classifying sentiment of new document snippets (105). In said method (105), each machine learning model automatically performs sentiment classification of new document snippets.
  • the machine learning model being trained by the initial training data sets herein is also regarded as a default model.
  • the default model is not capable of coping with the informal words and acronyms of document snippets found throughout various social media such as Facebook and Twitter. Therefore, to adapt to the evolving informal words and acronyms of document snippets, the method (100) in the present invention is characterised by the step of inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval.
  • the step of inspecting and correcting results of the sentiment classification is done manually by a user to adapt to the evolving informal words and acronyms of document snippets.
  • Users may inspect and correct results of the sentiment classification performed using the default model at a predetermined time interval such as weekly, daily or hourly.
  • the step of inspecting and correcting results of the sentiment classification may be performed randomly or systematically, depending on the users' management system. In the step of inspecting, users inspect new words that were overlooked by the default model and that may represent sentiment tones of the document snippets. The new words may be slang, short forms, new abbreviations and acronyms that were not found or defined using the method for generating document snippets features (103).
  • the users will also inspect results of the sentiment classification performed using the default model which was trained by the initial training data sets.
  • the client may also contribute feedback by inspecting and correcting results of the sentiment classification.
  • the step of adapting the initial training data sets includes creating new training data sets.
  • Fig. 7 shows a preferred embodiment for the step of adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets, namely a method for adaptive sentiment classification (106).
  • the method for adaptive sentiment classification (106) comprises creating a new training data set and machine learning model if there is more than one correction on a particular entity. The more accurate training data sets are added, with a custom machine learning model built for each entity, the higher the accuracy of the sentiment classification results. However, a higher number of machine learning models may increase the time consumed by the sentiment classification in general.
  • the method for adaptive sentiment classification (106) is then characterized by the step of classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.
  • the adapted training data sets include the initial training data sets, the corrected initial training data sets and the newly created training data sets.
  • Machine learning models are also built and corrected according to the adapted training data sets. Therefore, each machine learning model may automatically perform sentiment classification of new document snippets adaptively.
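
The inspect, correct and rebuild cycle described in the items above can be sketched roughly as follows. This is only an illustration under stated assumptions: the publication leaves the learning algorithm open, so a tiny multinomial Naive Bayes classifier (`TinyNaiveBayes`) is swapped in as a stand-in model, and the names `adapt`, `training_sets` and the sample snippets are hypothetical. For simplicity the sketch rebuilds the models after any correction rather than applying the "more than one correction" condition.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case a snippet and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyNaiveBayes:
    """Stand-in model (an assumption): multinomial Naive Bayes, add-one smoothing."""

    def __init__(self, examples):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def adapt(training_sets, corrections):
    """Add user-corrected (entity, snippet, label) items, then rebuild one model per entity."""
    for entity, text, label in corrections:
        training_sets.setdefault(entity, []).append((text, label))
    return {entity: TinyNaiveBayes(examples) for entity, examples in training_sets.items()}


# Initial (manually created) training data for one hypothetical entity.
training_sets = {
    "airline": [
        ("great service and fares", 1),
        ("flight delayed again", -1),
        ("flight departs at noon", 0),
    ]
}
models = adapt(training_sets, [])                      # default model
before = models["airline"].predict("the crew was gr8")

# A user inspects the result and corrects it; the informal word "gr8" is learned.
models = adapt(training_sets, [("airline", "the crew was gr8", 1)])
after = models["airline"].predict("the crew was gr8")
```

Before the correction the default model has never seen any word of the snippet and falls back on label priors; after the correction the rebuilt model labels it as positive. In practice the rebuild would run at the predetermined interval (for example weekly) rather than after every single correction.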

Abstract

The present invention relates to a method for adaptively classifying sentiment of document snippets, comprising the steps of: crawling Web documents to search document snippets; populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets; creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set; classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets; characterised by the steps of: inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval; adapting the initial training data sets according to the corrected results of the sentiment classification, building a new machine learning model according to the adapted training data sets; and classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.

Description

A METHOD FOR ADAPTIVELY CLASSIFYING SENTIMENT OF DOCUMENT SNIPPETS
Background of the Invention
Field of the Invention
This invention relates to a computer implemented method for classifying sentiment of document snippets, and more particularly to a method for adaptively classifying sentiment of document snippets.
Description of Related Arts
In business marketing or in a corporate institution, brand management is an important strategy for establishing and sustaining a brand in a particular industry or even globally. Measuring the efficacy of brand management has become a challenge since content sharing is made easy by the widening availability of the internet and mobile devices. Gossip, debate or conversation about a brand moves continuously from one media platform to another, worldwide and at any time of the day. Moreover, a high volume of brand citations does not necessarily mean good reception. Sentiment tones need to be considered to get a clearer picture. Hence, there is a growing need for brand owners or managers to collate information on who is talking about their brands, and where, when and how they talk about them.
Therefore, a few approaches to monitoring media platforms have been developed to cope with the rising number of online and social media channels, while traditional media like print, television and radio are also taking the digital road to distribute their brands. For example, patent publication US 2012/0101808 A1 disclosed methods and systems for extracting and analyzing user-generated content (UGC) in order to provide opinion-bearing information concerning different categories of a product. The cited publication disclosed a computer-implemented method for determining sentiment in Web documents, comprising: harvesting one or more Web documents from one or more content sources and extracting keywords from the Web documents; filtering at a server, according to a phase transition method, to produce filtered keywords; then determining sentiment expressed in the Web documents on a category-by-category basis; said sentiment determined by analysis of words identified as sentiment-bearing or opinion-bearing within the Web document; and reporting the sentiment so determined. However, the cited publication does not provide a method for adaptive domain-specific sentiment classification that adapts to the evolving informal words and acronyms of document snippets.
Another US patent, 8266148 B2, disclosed various embodiments of a method for Business Intelligence (BI) metrics on unstructured data. The disclosed method is a machine-implemented method for a pipelined process of capture, classification and dimensioning of data from a plurality of data sources that include unstructured data having no explicit dimensions associated with the unstructured data, to generate a domain-relevant classified data index that is useable by a plurality of different intelligence metrics to perform different kinds of business intelligence analytics. In one embodiment of the cited patent, user-feedback is obtained from the user in response to the analytics results that are presented to the user, and causes a data processing machine to adaptively utilize the user-feedback to modify the relevance classification. However, the cited patent does not disclose a method for generating document snippet features in detail.
Accordingly, it can be seen from the prior art that there exists a need for an improved method for adaptively classifying sentiment of document snippets that adapts to the evolving informal words and acronyms of document snippets.
Summary of Invention
It is an objective of the present invention to provide a method for classifying sentiment of document snippets.
It is also an objective of the present invention to provide an adaptive method for classifying sentiment of document snippets. It is yet another objective of the present invention to provide a natural language processing technology to adapt with the evolving informal words and acronyms of documents snippets. Accordingly, these objectives may be achieved by following the teachings of the present invention. The present invention relates to a method for adaptively classifying sentiment of document snippets, comprising the steps of: crawling Web documents to search for document snippets; populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets; creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set; classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets; characterised by the steps of: inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval; adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.
Brief Description of the Drawings
The features of the invention will be more readily understood and appreciated from the following detailed description when read in conjunction with the accompanying drawings of the preferred embodiment of the present invention, in which:
Fig. 1 is a flow chart of a method for adaptively classifying sentiment of document snippets.
Fig. 2 is a flow chart of a method for crawling process.
Fig. 3 is a flow chart of a method for populating document snippets by entity.
Fig. 4 is a flow chart of a method for generating document snippets features.
Fig. 5 is a flow chart of a method for creating initial training data sets.
Fig. 6 is a flow chart of a method for classifying sentiment of new document snippets.
Fig. 7 is a flow chart of a method for adaptive sentiment classification.
Detailed Description of the Invention
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting but merely as a basis for claims. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims. As used throughout this application, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words "include," "including," and "includes" mean including, but not limited to. Further, the words "a" or "an" mean "at least one" and the word "plurality" means one or more, unless otherwise mentioned. Where abbreviations or technical terms are used, these indicate the commonly accepted meanings as known in the technical field. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures. The present invention will now be described with reference to Figs. 1-7.
The present invention relates to a method (100) for adaptively classifying sentiment of document snippets, comprising the steps of:
crawling Web documents to search for document snippets;
populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets;
creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set;
classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets;
characterised by the steps of:
inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval;
adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and
classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.
Method (100) in the present invention is preferably for providing a monitoring service on online multi-media platforms for the benefit of a brand owner. By means of said monitoring service, the brand owner gets to monitor and analyse public conversation about their brand reputation while comparing it to that of competitors. On top of that, the sentiment analysis allows the brand owner to track and keep up with positive, neutral or negative conversation without having to do a survey manually. This allows the brand owner to address any complaint as soon as possible and further improve their service.
Referring to Fig. 1, the method (100) for adaptively classifying sentiment of document snippets preferably begins with crawling Web documents to search for document snippets. The Web document herein refers to a document or information resource that can be accessed by users over a network such as the Internet or an intranet. Content of the Web documents may include text, multimedia files and images, which are typically viewed as Web pages with the aid of a Web browser. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the Web documents include social networking services, discussion sites and web newspapers. Examples of the social networking services are Facebook and Twitter, while the discussion sites may be an internet forum or blog. On the other hand, the web newspaper may be an online version of a printed periodical that exists on the Internet, commonly known as the World Wide Web (WWW). However, it should be understood that the Web documents herein should not be limited to documents or information resources which are extracted from the Surface Web of the WWW but may also include the Deep Web.
In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of crawling Web documents includes parsing and indexing search results of the document snippets. Fig. 2 shows a preferred embodiment for the step of crawling Web documents, namely a method for crawling process (101). In general, the method for crawling process (101) browses contents of the Web documents to search for document snippets related to the brand owner, brand reputation and/or their products or services. The term document snippet as used herein borrows the programming term for a piece of re-usable text. The document snippets may be a sentence or a paragraph. The method for crawling process (101) may be performed by a Web crawler that is mainly used to create a copy of all the browsed Web documents for later processing by a local search engine. Said search engine then performs indexing, collecting, parsing and storing of the document snippets, as shown in Fig. 2, to facilitate fast and accurate information retrieval.
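
As a rough illustration of the parsing and indexing part of this step, the following minimal sketch splits already-downloaded document text into sentence-level snippets and builds an inverted index from words to snippet ids. The fetching itself is omitted, and the function names `split_into_snippets` and `build_index`, the naive sentence-splitting rule and the sample documents are assumptions made for the example, not details taken from the publication.

```python
import re
from collections import defaultdict


def split_into_snippets(document_text):
    """Split a crawled document into sentence-level snippets (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", document_text.strip())
    return [p for p in parts if p]


def build_index(documents):
    """Store snippets and map each lower-cased word to the ids of snippets containing it."""
    snippets = []
    index = defaultdict(set)
    for doc in documents:
        for snippet in split_into_snippets(doc):
            snippet_id = len(snippets)
            snippets.append(snippet)
            for word in re.findall(r"[a-z0-9']+", snippet.lower()):
                index[word].add(snippet_id)
    return snippets, index


# Hypothetical crawled text standing in for copies of Web documents.
docs = [
    "Airline A has great fares. The lounge was crowded!",
    "I flew Airline B yesterday. Airline A is better than Airline B.",
]
snippets, index = build_index(docs)
```

Looking up `index["airline"]` then returns the ids of every snippet mentioning the word, which is the kind of fast retrieval the local search engine is meant to provide.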
Fig. 3 shows a preferred embodiment for the step of populating document snippets by entity and using sentiment lexicon to annotate sentiment-bearing snippets, namely a method for populating document snippets by entity (102). The method for populating document snippets by entity (102) comprises searching the index obtained from the method for crawling process (101); and retrieving relevant document snippets by entity. The entity referred to herein is the domain of the document snippets, such as politics, economics, market players, industries and any domain concerning the client as the brand owner, brand reputation and/or their products or services. These entities are preferably identified and filtered using the method shown in Fig. 4, namely a method for generating document snippets features (103). The method for generating document snippets features (103) further identifies sentiment-bearing snippets in order to classify sentiment of document snippets.
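
A minimal sketch of retrieving snippets by entity might look like the following. It assumes each entity is described by a hand-written list of search terms and uses plain substring matching; the function name `populate_by_entity`, the `entity_terms` mapping and the sample snippets are hypothetical and only illustrate the grouping idea.

```python
def populate_by_entity(snippets, entity_terms):
    """Group snippets under every entity whose search terms they mention."""
    populated = {entity: [] for entity in entity_terms}
    for snippet in snippets:
        text = snippet.lower()
        for entity, terms in entity_terms.items():
            # Naive substring match; a real system would query the search index.
            if any(term.lower() in text for term in terms):
                populated[entity].append(snippet)
    return populated


snippets = [
    "Airline A has great fares.",
    "The lounge was crowded!",
    "Airline A is better than Airline B.",
]
entity_terms = {
    "airline industry": ["Airline A", "Airline B"],
    "lounges": ["lounge"],
}
populated = populate_by_entity(snippets, entity_terms)
```

Each entity ends up with its own pool of snippets, which is what later allows one training data set and one model per entity.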
In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of populating document snippets includes normalizing words in document snippets and generating n-grams from the entity. Referring to Fig. 4, the method for generating document snippets features (103) is further characterized by normalizing words in document snippets for bag-of-words features; generating n-grams from words surrounding the entity; using a sentiment lexicon to annotate sentiment-bearing snippets; and representing each document snippet according to the method for generating document snippets features (103).
The step of normalizing words in document snippets for bag-of-words features in the preferred embodiment simplifies the document snippets for use in natural language processing and information retrieval. In the preferred embodiment, all words found in the document snippets are first collected in an unordered collection of words, which is also known as the bag-of-words. The bag-of-words is then normalized to reduce redundant words, for example by normalizing past tense to present tense, plural to singular, and capital letters to lower case letters. Therefore, the number of words considered in the document snippet can be reduced to improve classification speed.
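The normalization described above may be illustrated by the following minimal sketch. The suffix rules and the `IRREGULAR` table are crude assumptions chosen for brevity; a practical system would more likely use a stemmer or lemmatizer, and nothing here is prescribed by the described method.

```python
import re
from collections import Counter

# Hypothetical lookup for irregular forms; illustrative only.
IRREGULAR = {"was": "be", "were": "be", "flew": "fly", "went": "go"}


def normalize(word):
    """Lower-case a word and strip a few common plural / past-tense endings."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # companies -> company
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]                # delayed -> delay
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # flights -> flight
    return word


def bag_of_words(snippet):
    """Return an unordered multiset of normalized words for one snippet."""
    tokens = re.findall(r"[A-Za-z]+", snippet)
    return Counter(normalize(t) for t in tokens)
```

In this sketch "Delays" and "delayed" collapse into the single feature "delay", which is the reduction in distinct words that the paragraph above attributes to normalization.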
The words in the bag-of-words (for example, n words) from the document snippets are then modelled using an n-grams model in the step of generating n-grams from words surrounding the entity. In the preferred embodiment, generating n-grams generally means generating a contiguous sequence of n words from the words collected from the document snippets. In addition, an n-grams model is a type of probabilistic language model for predicting the next item in said sequence in the form of an (n-1)-order Markov model. Therefore, the n-grams model is generally a language model that exploits the ordering of n words from the document snippets, and generating said n-grams model serves to classify the sentiment of the document snippets. Thus, by generating the n-grams model, the sentiment of the document snippets can be predicted and valued in a range of -1 to +1, wherein -1 is for negative sentiment, 0 is for neutral sentiment and +1 is for positive sentiment. In a simple example, if a document snippet states that company A is better than company B in an entity of the airline industry in terms of their services and fares, then by generating the n-grams, the word "better" will be valued as +1 to denote a positive adjective. Hence, from the document snippet, it can be predicted that company A gains a positive sentiment while company B automatically gains a negative sentiment. It should be understood by persons skilled in the art that generating the n-grams model is common in speech recognition and sentiment analysis.
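As a hedged illustration of generating n-grams from words surrounding the entity, the sketch below extracts contiguous n-word sequences from a window around each mention of an entity. The window size, the default n = 2 and the function names are assumptions of this sketch; it only produces the n-gram features and does not itself assign sentiment values.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entity_context_ngrams(tokens, entity, n=2, window=3):
    """Collect n-grams from up to `window` tokens on either side of each entity mention."""
    grams = []
    for pos, tok in enumerate(tokens):
        if tok == entity:
            start = max(0, pos - window)
            context = tokens[start:pos + window + 1]
            grams.extend(ngrams(context, n))
    return grams
```

For the airline example, tokenizing the snippet as `company_a is better than company_b on fares` and asking for bigrams around `company_a` yields pairs such as `("is", "better")`, which keep the word order that a plain bag-of-words discards.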
Said value is then indexed through a dictionary of phrases ordered according to their comparative scores, which is known as a sentiment lexicon. The step of using a sentiment lexicon to annotate sentiment-bearing snippets in the preferred embodiment lists similar adjectives according to their comparative scores. Therefore, the sentiment-bearing snippets are annotated according to the sentiment lexicon determined at this step. Finally, the step of representing each document snippet according to the method for generating document snippets features (103) populates the document snippets by entity and by sentiment-bearing snippets.
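The annotation of sentiment-bearing snippets with a lexicon may be sketched as follows. The `LEXICON` entries and their scores are invented for illustration, and summing the scores to obtain a polarity in the range -1 to +1 is one simple choice assumed here rather than a rule stated in the description.

```python
# Hypothetical sentiment lexicon: word -> score in {-1, +1}.
LEXICON = {"better": 1, "good": 1, "excellent": 1, "worse": -1, "bad": -1, "poor": -1}


def annotate(tokens, lexicon=LEXICON):
    """Mark a snippet as sentiment-bearing and derive a polarity in {-1, 0, +1}."""
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    total = sum(score for _, score in hits)
    polarity = (total > 0) - (total < 0)  # sign of the summed scores
    return {"sentiment_bearing": bool(hits), "hits": hits, "polarity": polarity}
```

A snippet with no lexicon words is reported as not sentiment-bearing, whereas a snippet with balanced positive and negative words is sentiment-bearing with a neutral polarity of 0.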
Fig. 5 shows a preferred embodiment of the step of creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set, namely a method for creating initial training data sets (104). The method for creating initial training data sets (104) comprises manually creating initial training data sets from the document snippets populated by entity. In the preferred embodiment, users may create the initial training data sets using a table to accommodate the numerous entities generated in the method for generating document snippets features (103). For example, a training data set is created for each entity in table form to simplify monitoring. The initial training data sets are then used as input to train the machine learning model. A machine learning model is preferably built for each initial training data set. In the preferred embodiment, the machine learning model learns to classify the sentiment of document snippets according to the initial training data sets. During the training session, the machine learning model deals with generalization, which is the ability of said machine learning model to accurately classify sentiment on new, unseen examples of document snippets after having been trained on the initial training data sets. There are several types of algorithmic approach by which the machine learning model may learn to classify the sentiment of document snippets, such as supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning and learning-to-learn algorithms. The selection of the algorithm implemented for the machine learning model depends on the desired outcome or the input available. For example, supervised learning generates a function that maps input to desired output, wherein the machine learning model approximates a function mapping a vector into classes by looking at input-output examples of the function.
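The supervised case described above, with one model built per initial training data set, may be sketched as below. The choice of a multinomial naive Bayes classifier with add-one smoothing is an assumption made purely for illustration, since the description does not name a particular algorithm; the class `NaiveBayes` and the function `build_models` are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over word features."""

    def fit(self, examples):
        # examples: list of (tokens, label) pairs, with label in {-1, 0, +1}
        self.label_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        for tokens, label in examples:
            self.word_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in examples for t in tokens}
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for tok in tokens:
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def build_models(training_sets):
    """Build one model per entity from a dict of entity -> labelled examples."""
    return {entity: NaiveBayes().fit(examples) for entity, examples in training_sets.items()}
```

Note that this sketch silently ignores words absent from the training vocabulary, which is one concrete way in which a model trained once can overlook newly coined informal words.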
However, any algorithm may be used for the machine learning model in the present invention, depending on the needs and on the person skilled in the art. The machine learning model is then expected to automatically perform sentiment classification of new document snippets. Fig. 6 shows a preferred embodiment of the step of classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets, namely a method for classifying sentiment of new document snippets (105). In said method (105), each machine learning model automatically performs sentiment classification of new document snippets. The machine learning model trained on the initial training data sets is also regarded herein as a default model. However, the default model is not capable of coping with the informal words and acronyms of document snippets found throughout various social media such as Facebook and Twitter. Therefore, to adapt to the evolving informal words and acronyms of document snippets, the method (100) of the present invention is characterised by the step of inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of inspecting and correcting results of the sentiment classification is done manually by a user to adapt to evolving informal words and acronyms of document snippets. Users may inspect and correct results of the sentiment classification performed using the default model at a predetermined time interval, such as weekly, daily or hourly. The step of inspecting and correcting results of the sentiment classification may be performed randomly or systematically, depending on the users' management system.
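The random or systematic selection of results for manual inspection may be illustrated as follows. The function name `pick_for_inspection`, the fixed-stride rule used for the systematic case and the `seed` parameter are assumptions of this sketch; scheduling the inspection at a weekly, daily or hourly interval is left to the surrounding system.

```python
import random


def pick_for_inspection(results, k, systematic=True, seed=None):
    """Choose up to k classification results for a human reviewer.

    systematic=True takes every step-th result; otherwise a random sample is drawn.
    """
    if k <= 0 or not results:
        return []
    if systematic:
        step = max(1, len(results) // k)
        return results[::step][:k]
    return random.Random(seed).sample(results, min(k, len(results)))
```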
In the step of inspecting, users inspect new words that were overlooked by the default model and that may represent sentiment tones of the document snippets. The new words may be slang, short forms, new abbreviations and acronyms that were not found or defined using the method for generating document snippets features (103). The users also inspect results of the sentiment classification performed using the default model, which was trained on the initial training data sets. If there is any mistake or inconsistency between said results and the initial training data sets, the users then correct the results of the sentiment classification and the initial training data sets. Besides users of the monitoring service on online multi-media platforms, the client may also contribute feedback by inspecting and correcting results of the sentiment classification.
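One possible aid to the inspection described above is to surface words that recur in recently classified snippets but are absent from the vocabulary of the training data. The following sketch assumes such a helper; the name `unseen_words` and the `min_count` threshold are hypothetical, and the decision on what each new word means remains with the human inspector.

```python
from collections import Counter


def unseen_words(classified_snippets, known_vocab, min_count=2):
    """List (word, count) pairs for tokens missing from the training vocabulary."""
    counts = Counter(
        tok
        for tokens in classified_snippets
        for tok in tokens
        if tok not in known_vocab
    )
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]
```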
Any correction made to the results is adapted into the initial training data sets. In a preferred embodiment of the method (100) for adaptively classifying sentiment of document snippets, the step of adapting the initial training data sets includes creating new training data sets. Fig. 7 shows a preferred embodiment of the step of adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets, namely a method for adaptive sentiment classification (106). The method for adaptive sentiment classification (106) comprises creating a new training data set and machine learning model if there is more than one correction on a particular entity. As more, and more accurate, training data sets are added and a custom machine learning model is built for each entity, the results of the sentiment classification will have a higher accuracy. However, a higher number of machine learning models may affect the time consumed by the sentiment classification in general.
The method for adaptive sentiment classification (106) is then characterized by the step of classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets. The adapted training data sets include the initial training data sets, the corrected initial training data sets and the newly created training data sets. Machine learning models are also built and corrected according to the adapted training data sets. Therefore, each machine learning model may automatically perform sentiment classification of new document snippets adaptively.
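The adaptation of the training data sets may be sketched as follows. The data layout (a dict of entity to labelled examples, and corrections given as tuples), the rule that a corrected label replaces an earlier example with the same words, and the name `apply_corrections` are assumptions of this sketch; only the condition of more than one correction per entity is taken from the description above.

```python
from collections import Counter


def apply_corrections(training_sets, corrections):
    """Merge user corrections into per-entity training sets.

    training_sets: dict of entity -> list of (tokens, label) examples.
    corrections:   list of (entity, tokens, corrected_label) tuples.
    Returns the adapted training sets and the entities needing a new model.
    """
    # Copy the lists so the initial training data sets are left untouched.
    adapted = {entity: list(examples) for entity, examples in training_sets.items()}
    per_entity = Counter(entity for entity, _, _ in corrections)
    for entity, tokens, label in corrections:
        examples = adapted.setdefault(entity, [])
        # Drop any earlier example with the same tokens so the corrected label wins.
        examples[:] = [(t, l) for t, l in examples if t != tokens]
        examples.append((tokens, label))
    # Rebuild a model only where there is more than one correction on an entity.
    to_rebuild = sorted(e for e, n in per_entity.items() if n > 1)
    return adapted, to_rebuild
```

The entities returned in `to_rebuild` would then be passed to whatever training routine builds the per-entity models, so that subsequent classification uses the adapted training data sets.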
The present invention may also be adapted to understand the English language or any local language to cater to the client's needs. Although the present invention has been described with reference to specific embodiments, also shown in the appended figures, it will be apparent to those skilled in the art that many variations and modifications can be made within the scope of the invention as described in the specification and defined in the following claims.
Description of the reference numerals used in the accompanying drawings according to the present invention:
Reference Numerals Description
100 method for adaptively classifying sentiment of document snippets
101 method for crawling process
102 method for populating document snippets by entity
103 method for generating document snippets features
104 method for creating initial training data sets
105 method for classifying sentiment of new document snippets
106 method for adaptive sentiment classification

Claims

I/We claim:
1. A method (100) for adaptively classifying sentiment of document snippets, comprising the steps of:
crawling Web documents to search for document snippets;
populating document snippets by entity and using a sentiment lexicon to annotate sentiment-bearing snippets;
creating initial training data sets from the populated document snippets and building a machine learning model for each initial training data set;
classifying sentiment of new document snippets of an entity using the machine learning model from the initial training data sets;
characterised by the steps of:
inspecting and correcting results of the sentiment classification performed by the machine learning model at a predetermined time interval;
adapting the initial training data sets according to the corrected results of the sentiment classification, then building a new machine learning model according to the adapted training data sets; and
classifying sentiment of new document snippets of an entity adaptively using the adapted training data sets.
2. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the Web documents include social networking services, discussion sites and web newspapers.
3. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of crawling Web documents includes parsing and indexing search results of the document snippets.
4. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of populating document snippets includes normalizing words in document snippets and generating n-grams from the entity.
5. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of inspecting and correcting results of the sentiment classification is done manually by a user to adapt to evolving informal words and acronyms of document snippets.
6. A method (100) for adaptively classifying sentiment of document snippets according to claim 1, wherein the step of adapting the initial training data sets includes creating new training data sets.
PCT/ID2014/000012 2013-09-30 2014-08-29 A method for adaptively classifying sentiment of document snippets WO2015044934A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ID20130770 2013-09-30
IDP00201300770 2013-09-30

Publications (1)

Publication Number Publication Date
WO2015044934A1 (en) 2015-04-02

Family

ID=52742179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ID2014/000012 WO2015044934A1 (en) 2013-09-30 2014-08-29 A method for adaptively classifying sentiment of document snippets

Country Status (1)

Country Link
WO (1) WO2015044934A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14850076; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14850076; Country of ref document: EP; Kind code of ref document: A1)