US20130191735A1 - Advanced summarization on a plurality of sentiments based on intents - Google Patents
- Publication number
- US20130191735A1 (application US13/746,324)
- Authority
- US
- United States
- Prior art keywords
- sentences
- sentiment
- sentence
- keywords
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G06F17/24—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/358—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
Definitions
- the embodiments herein generally relate to content summarization, and more particularly to a system and method for summarizing content around a sentiment using weighted formal concept analysis (wFCA) based on user intent.
- wFCA weighted formal concept analysis
- Documents obtained via an electronic medium are often provided in such volume that it is important to summarize them. It is desirable to be able to quickly obtain a brief summary of a document rather than reading it in its entirety. Typically, such a document may span multiple paragraphs to several pages in length.
- Summarization or abstraction is even more essential in the framework of emerging “push” technologies, where a user has hardly any control over what documents arrive at the desktop for his/her attention. Summarization is always a key feature in content extraction and there is currently no solution available that provides a summary that is comparable to that of a human. Conventionally, summarization of content is manually performed by users, which is time consuming and also expensive. Further, it is slow and also not scalable for a large number of documents.
- Summarization involves representing the whole content in a limited set of words without losing the main crux of the content.
- Traditional summarization of content (in general, a document) is based on lexical chaining, in which the longest chain is assumed to best represent the content, and the first sentence of a summary is taken from the first sentence of the longest chain. The second-longest chain is assumed to be the next best, and the second sentence of the summary is then taken from the first sentence of the second-longest chain.
- this lexical chaining approach tends not only to miss important content related to the intent of the user but also fails to elaborate it in a manner in which it can be easily understood. Accordingly, there remains a need for a system to automatically analyze one or more documents and generate an accurate summary based on user intent.
- an embodiment herein provides a method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA).
- the method includes identifying one or more sentences associated with the content based on parts of speech, identifying one or more sentiments from the one or more sentences based on parts of speech, identifying one or more keywords in the one or more sentences, disambiguating at least one ambiguous keyword from the one or more keywords using the wFCA, computing a weight for each sentence of the one or more sentences based on a number of keywords of the one or more keywords associated with each sentence, processing an input including an indication of the sentiment in the content, and generating a summary on the content around the sentiment based on (i) the weight, (ii) the one or more sentiments, and (iii) the indication.
- the sentiment may be a positive sentiment. Each sentence of the one or more sentences includes at least one positive statement.
- the sentiment may be a negative sentiment. Each sentence of the one or more sentences includes at least one negative statement.
- the one or more sentences may include (a) a first set of sentences. Each sentence of the first set of sentences includes at least one sentiment.
- the one or more sentences may further include (b) a second set of sentences that are not associated with the at least one sentiment.
- the weight may be computed based on a number of associations of each keyword within the one or more sentences.
- the summary may be expanded based on (a) the weight assigned for each sentence of the one or more sentences, and (b) at least one of i) the one or more sentiments, and ii) the indication.
- a system for summarizing content around a sentiment based on weighted Formal Concept Analysis (wFCA) using a content summarization engine includes (a) a memory unit that stores (i) a set of modules, and (ii) a database, (b) a display unit, and (c) a processor that executes said set of modules.
- the set of modules includes (i) a sentence identifying module executed by the processor that identifies (a) a first set of sentences, and (b) a second set of sentences in the content based on parts of speech.
- Each sentence of the first set of sentences includes at least one keyword that indicates at least one sentiment.
- the second set of sentences is not associated with sentiments.
- the summary may include at least one sentence that is only obtained from the first set of sentences.
- the summary may include at least one sentence from the second set of sentences.
- the weight may be computed based on a number of associations of each keyword within (a) the first set of sentences, and (b) the second set of sentences.
- FIG. 2 illustrates an exploded view of the user system with a memory storage unit for storing the content summarization engine of FIG. 1 , and an external database according to an embodiment herein;
- FIG. 3 illustrates an exploded view of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 4 illustrates a user interface view of the content collection module of FIG. 3 of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 5 illustrates a user interface view of an input content to be summarized using the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 6 illustrates an exploded view of the content annotation module of FIG. 3 of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 7 illustrates a user interface view of intent selection by the user of FIG. 1 according to an embodiment herein;
- FIG. 8 is a flow diagram illustrating a method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA) according to an embodiment herein;
- FIG. 9 illustrates a graphical representation of a lattice that is generated to disambiguate a keyword “kingfisher” of the input content using the lattice construction module of FIG. 3 according to an embodiment herein;
- FIG. 10 is a graphical representation illustrating a graph that indicates an association between one or more keywords and sentences of the input content of FIG. 5 according to an embodiment herein;
- FIG. 11 is a table view illustrating a weight that is computed for each keyword that is identified by the keyword identifying module of FIG. 3 based on number of associations of each keyword with sentences of the input content according to an embodiment herein;
- FIG. 12 is a table view illustrating a weight that is computed for each sentence of the input content based on associated keywords according to an embodiment herein;
- FIG. 13 illustrates a schematic diagram of a computer architecture according to an embodiment herein.
- the user intent may be an overall summarization, a keyword based summarization, a page wise summarization, and/or a section wise summarization.
- the content summarization engine computes crux of content of a document, and generates a main story surrounding the content.
- the content summarization engine also provides an option for an end user to expand a summary for better understanding.
- the content summarization engine takes the end user directly to a main concept from where he/she can expand, and read the content in a flow that can give a better understanding without expending time in reading entire content.
- FIG. 1 illustrates a system view 100 of a user 102 communicating with a user system 104 to generate a summary based on one or more sentiments using a content summarization engine 106 according to an embodiment herein.
- the user system 104 A-N may be a personal computer (PC) 104 A, a tablet 104 B and/or a smart phone 104 N.
- the content summarization engine 106 summarizes content.
- FIG. 2 illustrates an exploded view of the user system 104 A-N with a memory storage unit 202 for storing the content summarization engine 106 of FIG. 1 , and an external database 216 according to an embodiment herein.
- the user system 104 A-N includes the memory storage unit 202 , a bus 204 , a communication device 206 , a processor 208 , a cursor control 210 , a keyboard 212 , and a display 214 .
- the memory storage unit 202 stores the content summarization engine 106 that includes one or more modules to perform various functions on an input content and generates a summary intent surrounding the input content.
- the external database 216 includes a knowledge base 218 that is constructed based on concepts of linked data.
- the knowledge base 218 further includes a set of categories that correspond to various keywords.
- FIG. 3 illustrates an exploded view of the content summarization engine 106 of FIG. 1 according to an embodiment herein.
- the content summarization engine 106 includes a database 302 , a content collection module 304 , a content parsing/extraction module 306 , a content cleaning module 308 , a content annotation module 310 , an annotation extractor module 312 , a sentence identifying module 314 A that includes a sentiment identifying module 314 B, a keyword identifying module 316 , a disambiguating module 318 , a graph generating module 320 , a sentiment indicating module 321 , an intent building module 322 , and an intent expanding module 324 .
- the content collection module 304 collects content from at least one format of text (e.g., including multiple different documents of different formats) provided by the user 102 .
- Such formats may include, for example, .doc, .pdf, .rtf, a URL, a blog, a feed, etc.
- the content parsing/extraction module 306 fetches the content from these one or more documents (e.g., abc.doc, xyz.pdf, etc.), and provides the content required to generate a summary. Further, the content parsing/extraction module 306 parses HTML content in case one of the sources of input is a URL. The content parsing/extraction module 306 extracts one or more sentences from the content.
- the content cleaning module 308 cleans the content that may include removal of junk characters, new lines that are not useful, application specific symbols (e.g., MS Word bullets), non-unicode characters, etc.
- specific parts of a document e.g., footer
- exclusions can be specified based on a type of content that is provided (e.g., news article, resume etc), and the content cleaning module 308 is configurable accordingly.
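- The cleaning step described above can be illustrated with a minimal sketch. The specific rules below (which bullet glyphs to drop, how blank lines are collapsed) and the function name `clean_content` are assumptions made for illustration; the source does not specify how the content cleaning module 308 is implemented.

```python
import re
import unicodedata


def clean_content(text: str) -> str:
    """Illustrative cleaning pass; the rules are assumptions, not the patent's."""
    # Normalize Unicode so equivalent characters share one representation.
    text = unicodedata.normalize("NFKC", text)
    # Replace common word-processor bullet glyphs with a space.
    text = re.sub(r"[\u2022\u25cf\u25aa\uf0b7]", " ", text)
    # Drop control/format/private-use characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n", text)
    # Trim each remaining line.
    return "\n".join(line.strip() for line in text.splitlines()).strip()


print(clean_content("\u2022 Item one\x00\n\n\n  Item   two \t"))
```

- A configurable version could accept the list of patterns to exclude per content type (e.g., news article, resume), in line with the configurability described above.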
- the graph generating module 320 constructs/generates a graph to obtain an association between (i) sentences, (ii) keywords and sentences, and (iii) sentences and durations.
- the graph includes one or more nodes, and each node corresponds to a sentence of the content.
- the graph generating module 320 includes a weight assigning module 330 that assigns a weight to each sentence of the content.
- the weight assigning module 330 computes a weight for each sentence of (a) the first set of sentences, and (b) the second set of sentences based on a number of keywords associated with each of the sentence.
- the intent building module 322 tailors the one or more sentences based on a) the weight assigned for each sentence of the content, b) at least one sentiment that is identified by the sentiment identifying module 314 B, and c) the indication provided by the user 102 .
- the intent expanding module 324 expands the summary while preserving intent and elaborating further.
- the summary may include at least one of (i) at least one sentence obtained only from the first set of sentences, (ii) at least one sentence from the second set of sentences, and (iii) at least one sentence from the third set of sentences.
- the content annotation module 310 includes a sentence annotations module 602 , a token annotations module 604 , a stem annotations module 606 , a forced new lines, paragraphs and indentations computing module 608 , a parts of speech tag (POS) token annotations module 610 , a POS line annotation module 612 , a duration and quantities determining module 614 , a section annotations module 616 , a section span annotations module 618 , and a section duration annotation module 620 .
- the dotted lines (arrows having a dotted line property) of FIG. 6 represent internal dependencies among various modules, whereas the solid lines (arrows having a solid line property) represent the flow of the annotation process.
- a cleaned content is annotated by performing various levels of annotations using sub-modules of the content annotation module 310 .
- the sentence annotations module 602 extracts each and every sentence in the input content 502 .
- a first sentence of the input content 502 may be extracted by the sentence annotations module 602 .
- the first sentence includes
- the forced new lines, paragraphs and indentations computing module 608 determines white spaces such as new lines that are forced (an enter received, a list of sentences that are not properly phrased), paragraphs, and/or indentations, etc. It is used to extract new lines and sentences separately as content in the case of documents like feeds and tweets, which most often do not follow the language semantics. Such documents may also contain sentences that are not phrased correctly. In such cases, the extraction of new lines and one or more sentences is more valuable.
- the POS token annotations module 610 generates one or more parts of speech (POS) tag such as a noun, a verb, etc. for each token in the sentences such that each token annotation has an associated POS tag.
- POS parts of speech
- the user 102 can specify a section name of the input content 502 around which a summary has to be generated. For example, from the input content 502 , two sections names are extracted such as “introduction” and “shares”. The user 102 can obtain information from the content associated with any of these two sections by using the section span annotations module 618 .
- the section duration annotations module 620 determines (i) one or more durations that appear in the section name specified by the user 102 , and (ii) text associated with the one or more durations. If the user 102 does not specify the section name, a summary may be generated for an entire content.
- the annotation extractor module 312 extracts all the required artifacts (e.g., sentences, keywords, duration, sections, etc) from the annotations.
- the annotation extractor module 312 extracts one or more sentences, one or more keywords, one or more sections, one or more durations within the section, one or more spans of the one or more durations, etc. that occur within the input content 502, and provides them to the user 102 for an intent selection.
- FIG. 7 illustrates a user interface view 700 of the intent selection by the user 102 of FIG. 1 according to an embodiment herein.
- the user interface view 700 of the intent selection includes the header 402 , the input content 502 , a create folder(s) or organize content button 702 , an intent analytics field 704 , and an intent selection field 706 .
- the input content 502 may be obtained from one or more documents (e.g., an uploaded content). These documents may be listed and/or displayed as one or more scrollable lists (e.g., a left to right scrollable list, a right to left scrollable list, an up to down scrollable list and/or a down to up scrollable list).
- the create folder(s) or organize content button 702 is used for organizing the content, and enables the user to create new folders, where the content from the scrollable lists can be dragged and dropped into a required folder to organize it.
- the intent analytics field 704 displays an analysis for the input content 502 which is selected from the scrollable lists. The analysis includes, but is not limited to, sections, summary, identified keywords, and other details such as duration information.
- the intent selection field 706 provides one or more options to specify various intents (analysis) around which summarization is to be done.
- the user 102 can specify a section, and/or a page of a document that includes content to be summarized.
- the user 102 specifies a keyword around which summarization of content needs to be done.
- the user 102 specifies an overall summarization when the user 102 intends to summarize the entire content.
- FIG. 8 is a flow diagram illustrating a method of summarizing content around a sentiment using the weighted Formal Concept Analysis (wFCA) according to an embodiment herein.
- one or more sentences associated with the content are identified based on parts of speech.
- the one or more sentences may include the first set of sentences and the second set of sentences.
- the first set of sentences may include either positive sentences or negative sentences.
- a positive sentence S1 may be “REUTERS: the chairman of kingfisher Airlines, Vijay Mallya, said in an interview with the financial times that he was close to sealing a $370 million deal with an Indian private investor and a consortium of banks that would save the airlines”.
- a positive sentence S2 may be “The Bangalore-based entrepreneur told the FT he was nearing a deal with 14 banks led by State Bank of India that would provide the loss-making carrier with working capital of 6 billion rupees”.
- a negative sentence S11 may be “The airlines became No. 2 private carrier since it began its operations in 2005 as the economy boomed but has become one of the main casualties of high fuel costs and a fierce price war between a handful of airlines which, between them, have ordered hundreds of aircraft on delivery over the next decade in an ambitious bet on the future”.
- one or more sentiments are identified in the one or more sentences based on the parts of speech.
- a positive statement and a negative statement may indicate type of sentiment in a sentence.
- one or more keywords in the one or more sentences are identified.
- the keyword identifying module 316 identifies the one or more keywords based on the Parts of speech (POS) tagged by the POS token annotations module 610 , and/or the POS line annotations module 612 of FIG. 6 .
- one or more ambiguous keywords from the one or more keywords are disambiguated using the wFCA, by generating a lattice that includes one or more concepts.
- the one or more concepts are generated with (i) the one or more keywords as objects, and (ii) categories associated with the one or more keywords as attributes.
- the categories may be obtained from the knowledge base 218 of FIG. 2 .
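- As a rough illustration of how concepts can be formed with keywords as objects and categories as attributes, the sketch below enumerates formal concepts by brute force. The category sets are hypothetical (only “Management” and “Business” are named in this document), the function names are invented for illustration, and the exhaustive enumeration is only practical for a handful of keywords; it is not presented as the lattice construction used by the embodiments.

```python
from itertools import combinations

# Hypothetical keyword -> categories context (illustrative values only).
context = {
    "chairman": {"Management", "Business"},
    "Vijay Mallya": {"Management", "Business", "Indian businesspeople"},
    "Kingfisher Airlines": {"Airlines of India", "Business"},
    "Kingfisher (bird)": {"Birds"},
}


def common_attributes(objects, ctx, all_attrs):
    """Attributes shared by every object in `objects` (the intent)."""
    attrs = set(all_attrs)
    for obj in objects:
        attrs &= ctx[obj]
    return frozenset(attrs)


def objects_with(attrs, ctx):
    """Objects that carry every attribute in `attrs` (the extent)."""
    return frozenset(o for o, a in ctx.items() if attrs <= a)


def formal_concepts(ctx):
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    all_attrs = set().union(*ctx.values())
    objs = list(ctx)
    concepts = set()
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            intent = common_attributes(subset, ctx, all_attrs)
            extent = objects_with(intent, ctx)
            concepts.add((extent, intent))
    return concepts


for extent, intent in sorted(formal_concepts(context), key=lambda c: len(c[0])):
    print(sorted(extent), "<->", sorted(intent))
```

- In this toy context, “chairman” never appears in a concept without “Vijay Mallya”, because its category set is a subset of the latter's.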
- a weight is computed and assigned for each sentence of the one or more sentences based on a number of keywords of the one or more keywords that are associated with each sentence.
- the weight is computed based on a number of associations of each keyword within the one or more sentences.
- a graph may be generated based on the one or more sentences and the one or more keywords to compute the weight.
- the graph includes one or more nodes. Each node indicates a sentence of the one or more sentences that are associated with the sentiment.
- an input that includes an indication of the sentiment is processed.
- the indication may include summarization around the sentiment (e.g., a positive sentiment, or a negative sentiment). In one embodiment, the indication may be provided by the user 102 .
- a summary on the content around the sentiment is generated based on (i) the weight and at least one of (a) the one or more sentiments, and (b) the indication.
- the summary may include at least one of (i) at least one sentence obtained only from the first set of sentences, and (ii) at least one sentence from the second set of sentences.
- the content summarization engine 106 expands the summary based on (a) the weight assigned for each sentence of the one or more sentences, and (b) at least one of i) the one or more sentiments, and ii) the indication.
- the one or more keywords are identified and extracted based on parts of speech (POS) tag generated by the POS token annotations module 610 , and/or the POS line annotations module 612 .
- a noun is very likely to be a keyword in a sentence.
- co-occurring nouns and their derivatives are also keywords.
- a keyword chunker is used to obtain these keywords and keyword phrases depending on the noun and related tags.
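- A minimal sketch of such a keyword chunker is shown below. It assumes the tokens have already been tagged and that the tags follow the Penn Treebank convention (`NN`, `NNS`, `NNP`, `NNPS`); both the tag set and the function name `chunk_keywords` are assumptions, since the document does not name a tagger or tag scheme.

```python
from typing import List, Tuple

# Penn Treebank noun tags (an assumed tag set).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def chunk_keywords(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group each maximal run of consecutive nouns into one keyword phrase."""
    keywords: List[str] = []
    run: List[str] = []
    for token, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(token)
        elif run:
            keywords.append(" ".join(run))
            run = []
    if run:  # flush a noun run that ends the sentence
        keywords.append(" ".join(run))
    return keywords


tagged_sentence = [
    ("the", "DT"), ("chairman", "NN"), ("of", "IN"),
    ("Kingfisher", "NNP"), ("Airlines", "NNPS"), (",", ","),
    ("Vijay", "NNP"), ("Mallya", "NNP"), ("said", "VBD"),
]
print(chunk_keywords(tagged_sentence))
# ['chairman', 'Kingfisher Airlines', 'Vijay Mallya']
```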
- entire input content 502 is summarized.
- the annotation extractor module 312 extracts the one or more keywords (e.g., 6 keywords) using POS tag.
- the extracted keywords are:
- Vijay Mallya—POS Tag says that it is a noun followed by a noun (phrase)
- the content summarization engine 106 determines different disambiguated terms from the one or more keywords, and their related categories. Further, the content summarization engine 106 uses the knowledge base 218 stored in the external database 216 for obtaining categories for the one or more keywords. Each keyword is queried separately against the knowledge base 218 and corresponding categories are obtained. For example, for the above keywords, the categories obtained are
- the keyword “kingfisher” has two disambiguations (e.g., one for “Kingfisher (Bird)” and one for “Kingfisher Airlines”).
- the categories corresponding to each word are shown against them.
- the categories may be modified by the user 102 .
- the modification is taken as a feedback to the categories suggested for the keywords and is used to train the knowledge base 218 for preferred categories.
- the content summarization engine 106 uses the lattice construction module 326 .
- the lattice construction module 326 constructs a lattice based on the weighted Formal Concept Analysis (wFCA).
- FIG. 9 illustrates a graphical representation 900 of a lattice that is generated to disambiguate a keyword “kingfisher” of the input content 502 using the lattice construction module 326 of FIG. 3 according to an embodiment herein.
- the lattice construction module 326 forms various concepts with one or more keywords, and their associated categories. For example, concept-1 to concept-9 associated with FIG. 9 are:
- concept 1 to concept 5 define distinct category sets for each keyword.
- the keyword “chairman” does not have a distinct concept because it is a subset of the category set of the keyword “Vijay Mallya”. This implies that the keywords “chairman” and “Vijay Mallya” are strongly related in a context of the input content 502 .
- the concept 6 and concept 7 provide contextual information that the keywords “chairman”, “Kingfisher Airlines”, “Shares” and “Vijay Mallya” are related in the context of the input content 502 .
- the keywords “reuters” and “Kingfisher” are not related to any other keywords and are treated as unimportant (lower priority) in the context of the input content 502, and there is no concept that covers all the categories.
- the keyword “chairman” does not have a distinct category because it is a subset of the category set of keyword “Vijay Mallya”.
- the categories are “Management occupations”, “Management” and “Business”. Further, these categories are common to both keywords “chairman” and “Vijay Mallya”, and hence they are strongly associated in the context of the input content 502. This forms the concept 6.
- a category “business” is associated with the categories of the concept 2.
- the keywords “kingfisher Airlines”, “shares”, “vijay Mallya”, and “chairman” are strongly associated with the category business.
- the score for the concept 7 will be (1/6)*4, which is 66.67%.
- the keyword “kingfisher Airlines” as described is strongly associated with the category “business”.
- the keyword “kingfisher” is treated as “kingfisher Airlines” and not as “kingfisher (bird)” by using weighted FCA.
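- The score arithmetic above can be sketched as follows, assuming each of the 6 keywords carries an equal weight of 1/6 and a concept's score is that weight multiplied by the number of keywords it covers. The competing extent for the bird sense is a hypothetical placeholder, and the function name is invented for illustration.

```python
def concept_score(extent, total_keywords):
    """Equal per-keyword weight (1/total) times the number of keywords covered."""
    return len(extent) / total_keywords


TOTAL_KEYWORDS = 6  # number of keywords extracted in the example

# Keywords covered by the "business" concept (concept 7 in the text).
business_extent = {"kingfisher Airlines", "shares", "vijay Mallya", "chairman"}
# Hypothetical extent for the bird sense: it shares no category with other keywords.
bird_extent = {"kingfisher (bird)"}

candidate_senses = {
    "kingfisher Airlines": concept_score(business_extent, TOTAL_KEYWORDS),
    "kingfisher (bird)": concept_score(bird_extent, TOTAL_KEYWORDS),
}
best_sense = max(candidate_senses, key=candidate_senses.get)

print(f"{candidate_senses['kingfisher Airlines']:.2%}")  # 66.67%
print(best_sense)  # kingfisher Airlines
```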
- all the keywords K1, K2, K3, . . . and K21 have an equal weight of 1/21 (i.e., 0.04762).
- the actual weight may vary.
- the keyword K1 “kingfisher Airlines” is associated with S1 directly, also indirectly with S4, S5, S9, and S10 as “kingfisher”, and with S7, S10, and S11 as “Airline or Airlines”.
- the keyword “kingfisher” is treated here as “kingfisher Airlines” as already disambiguated.
- the keyword “kingfisher Airlines” is thus associated with 7 sentences (S1, S4, S5, S7, S9, S10, and S11).
- a weight for the keyword K1 “kingfisher Airlines” is computed as 0.04762/7.
- a weight is computed for each keyword based on number of associations.
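- A short sketch of this keyword weighting follows, using only the figures stated above for K1 (21 keywords of equal initial weight, 7 associated sentences). The function name `keyword_weight` is an illustrative assumption.

```python
def keyword_weight(base_weight, associated_sentences):
    """Divide a keyword's initial weight by its number of associated sentences."""
    return base_weight / len(associated_sentences)


BASE_WEIGHT = 1 / 21  # 21 keywords, each starting at 1/21 (about 0.04762)

# Sentences associated with K1 "kingfisher Airlines", directly or indirectly.
k1_sentences = {"S1", "S4", "S5", "S7", "S9", "S10", "S11"}

k1_weight = keyword_weight(BASE_WEIGHT, k1_sentences)
print(round(k1_weight, 5))  # 0.0068
```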
- Keywords K4, K7, K8, K9, K11 and K12 are associated with the second sentence S2.
- Keywords K8, and K13 are associated with the third sentence S3.
- Keywords K1, and K13 are associated with the fourth sentence S4.
- Keywords K1, and K6 are associated with the fifth sentence S5.
- Keywords K7, K13 and K14 are associated with the sixth sentence S6.
- Keywords K1, K8, K9, K13, and K16 are associated with the seventh sentence S7.
- Keywords K5, and K13 are associated with the eighth sentence S8.
- Keywords K1, and K2 are associated with the ninth sentence S9.
- FIG. 11 is a table view 1100 illustrating a weight 1102 that is computed for each keyword that is identified by the keyword identifying module 316 of FIG. 3 based on number of associations 1104 of each keyword with sentences of the input content 502 according to an embodiment herein.
- a weight is computed for each keyword as shown in the table.
- the weight assigning module 330 assigns a weight for a sentence in the input content 502 based on a count that corresponds to the keywords associated with a node that corresponds to the sentence using simple heuristics.
- a weight is computed for each sentence of the input content 502 .
- a weight for the first sentence S1 is computed based on a count that corresponds to the keywords associated with a node that corresponds to the first sentence S1.
- The keywords associated with the first sentence S1 are K1, K3, K5, K6, K7, K8, K9, K10, and K13.
- the weight for the first sentence S1 is computed as the summation of the weights associated with the keywords that correspond to the first sentence S1.
- FIG. 12 is a table view 1200 illustrating a weight 1202 that is computed for each sentence of the input content 502 based on associated keywords 1204 according to an embodiment herein. In the same way as described for the sentence S1, a weight is computed for each sentence of the input content 502 as shown in the table.
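- The summation can be sketched as below. The keyword list for S1 is the one stated above; the per-keyword weights are placeholders (every keyword is given the K1 value 0.04762/7), because the values of FIG. 11 are not reproduced in this text, so the result is not expected to match FIG. 12.

```python
from typing import Dict, Iterable


def sentence_weight(keywords: Iterable[str], keyword_weights: Dict[str, float]) -> float:
    """Sum the weights of all keywords associated with a sentence."""
    return sum(keyword_weights[k] for k in keywords)


s1_keywords = ["K1", "K3", "K5", "K6", "K7", "K8", "K9", "K10", "K13"]

# Placeholder weights (assumption): real per-keyword values come from FIG. 11.
placeholder_weights = {k: 0.04762 / 7 for k in s1_keywords}

s1_weight = sentence_weight(s1_keywords, placeholder_weights)
print(s1_weight)
```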
- the graph generating module 320 interprets that S1 is the most important sentence when compared to other sentences, because it has the largest number of associations with the keywords.
- the current example explains a simple weighting scheme based on keywords.
- weighting scheme can also depend on various factors, like sentence selection, section selection and sentiments selection.
- the intent building module 322 is used for tailoring sentences together in the exact sequence in which they appear in the original text, and provides summary of the input content 502 .
- the number of sentences to be used as a summary depends on an input parameter from the user 102 as well as a weighted cut-off that is configurable.
- the user 102 can expand the summary of the input content 502 using the intent expanding module 324 .
- a first level of summary for the input content 502 has only S1 (having the highest weight). If the user wants to expand the summary to a second level, the intent expanding module 324 relaxes the weight of sentences. Then, the next most important sentence, S11 (associated with 5 keywords, and having the next highest weight), is tailored with S1 in the exact sequence in which they appear in the input content 502.
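- The level-by-level expansion can be sketched as follows. The sentence weights are the ones quoted elsewhere in this document for S1, S2, S3, S5, S6, and S11; treating each level as "one more sentence by descending weight" is a simplifying assumption standing in for the configurable weighted cut-off, and the function names are illustrative.

```python
# Sentence weights quoted in this document (other sentences omitted).
weights = {
    "S1": 0.2013, "S11": 0.1973, "S2": 0.1865,
    "S6": 0.07143, "S5": 0.03061, "S3": 0.01984,
}


def sentence_index(sentence_id: str) -> int:
    """Position of a sentence in the original text, e.g. 'S11' -> 11."""
    return int(sentence_id[1:])


def summary_at_level(sentence_weights, level):
    """Take the `level` heaviest sentences, then restore original order."""
    top = sorted(sentence_weights, key=sentence_weights.get, reverse=True)[:level]
    return sorted(top, key=sentence_index)


print(summary_at_level(weights, 1))  # ['S1']
print(summary_at_level(weights, 2))  # ['S1', 'S11']
print(summary_at_level(weights, 3))  # ['S1', 'S2', 'S11']
```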
- the summary is generated by considering only sentences occurring in that section.
- the summary is generated by considering only sentences occurring in that page.
- the user 102 intends to summarize content around a particular keyword.
- the sentiments that have the highest weight are arranged at the top, followed by the sentiments that have higher, high, low, lower, and lowest weights, in one example embodiment.
- the positive sentiments are at the top followed by neutral sentiments, in another example embodiment.
- the content summarization engine 106 arranges the sentence S1 having a weight 0.2013 and the sentence S6 having a weight 0.07143 as positive sentences at the top, which may be followed by one or more neutral sentences (e.g., the neutral sentence S2 having a weight 0.1865, the neutral sentence S5 having a weight 0.03061, and the neutral sentence S3 having a weight 0.01984).
- the content summarization engine 106 arranges the sentence S11 having a weight 0.1973 and the sentence S10 as negative sentences at the top, which may be followed by one or more neutral sentences (e.g., the neutral sentences S2, S5, and S3).
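- A sketch of this arrangement is given below: sentences carrying the requested sentiment come first, ordered by descending weight, followed by the neutral sentences in descending weight. The weights and sentiment labels are the ones quoted above; S10 is left out because its weight is not stated, and the function name `arrange` is an illustrative assumption.

```python
# (sentence id, weight, sentiment) as quoted in this document.
sentences = [
    ("S1", 0.2013, "positive"),
    ("S2", 0.1865, "neutral"),
    ("S3", 0.01984, "neutral"),
    ("S5", 0.03061, "neutral"),
    ("S6", 0.07143, "positive"),
    ("S11", 0.1973, "negative"),
]


def arrange(items, sentiment):
    """Requested-sentiment sentences first, then neutral ones, each by weight."""
    def heaviest_first(group):
        return [sid for sid, _, _ in sorted(group, key=lambda s: s[1], reverse=True)]

    chosen = [s for s in items if s[2] == sentiment]
    neutral = [s for s in items if s[2] == "neutral"]
    return heaviest_first(chosen) + heaviest_first(neutral)


print(arrange(sentences, "positive"))  # ['S1', 'S6', 'S2', 'S5', 'S3']
print(arrange(sentences, "negative"))  # ['S11', 'S2', 'S5', 'S3']
```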
- the techniques provided by the embodiments herein may be implemented on an integrated circuit chip (not shown).
- the chip design is created in a graphical computer programming language, and stored in a computer storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly.
- the stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication of photolithographic masks, which typically include multiple copies of the chip design in question that are to be formed on a wafer.
- the photolithographic masks are utilized to define areas of the wafer (and/or the layers thereon) to be etched or otherwise processed.
- the resulting integrated circuit chips can be distributed by the fabricator in raw wafer form (that is, as a single wafer that has multiple unpackaged chips), as a bare die, or in a packaged form.
- the chip is mounted in a single chip package (such as a plastic carrier, with leads that are affixed to a motherboard or other higher level carrier) or in a multichip package (such as a ceramic carrier that has either or both surface interconnections or buried interconnections).
- the chip is then integrated with other chips, discrete circuit elements, and/or other signal processing devices as part of either (a) an intermediate product, such as a motherboard, or (b) an end product.
- the end product can be any product that includes integrated circuit chips, ranging from toys and other low-end applications to advanced computer products having a display, a keyboard or other input device, and a central processor.
- the embodiments herein can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements.
- the embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc.
- the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
- Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
- a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
- the system further includes a user interface adapter 19 that connects a keyboard 15 , mouse 17 , speaker 24 , microphone 22 , and/or other user interface devices such as a touch screen device (not shown) to the bus 12 to gather user input.
- a communication adapter 20 connects the bus 12 to a data processing network 25.
- a display adapter 21 connects the bus 12 to a display device 23 which may be embodied as an output device such as a monitor, printer, or transmitter, for example.
- the content summarization engine 106 provides the user 102 a more precise summary of the content, and helps the user 102 grasp it quickly. Moreover, sentences of the content are stitched together in the exact order in which they appear in the content. This provides continuity and clear understanding to the user 102 while reviewing the summary.
- the content summarization engine 106 saves a considerable amount of the user's time by providing the summary, and the user 102 can also expand the summary for better understanding. Also, the user 102 obtains a summary of the content based on their intent.
- the content summarization engine 106 may also enable summarization around a keyword.
- a summary around a keyword may be generated based on (i) a sentence having a positive sentiment, or (ii) a sentence having a negative sentiment.
- the user 102 intends to summarize the input content 502 around the keyword K1 (Kingfisher Airlines) that is associated with either a set of positive sentences or a set of negative sentences.
- when the user 102 intends to summarize around the keyword K1 (Kingfisher Airlines) that is associated with the set of positive sentences (S1 and S6), only the sentence S1 is selected for summarization, since the keyword K1 is not associated with the sentence S6.
- At least one sentence is selected for summarization (e.g., S4). Since the keyword K1 is not associated with the sentence S8, the sentence S8 is not selected for summarization by the content summarization engine 106.
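The keyword-based selection for the positive case can be pictured as a simple filter. The sketch below is an assumption about how such a filter might look, not the engine's actual code: the function name `select_for_keyword` and the placeholder keyword `K_other` are hypothetical, and only the association of K1 with S1 (and not with S6) is taken from the text.

```python
# Keyword associations for the two positive sentences discussed above.
# "K_other" is a hypothetical placeholder for keywords other than K1.
sentence_keywords = {"S1": {"K1", "K_other"}, "S6": {"K_other"}}
sentence_sentiment = {"S1": "positive", "S6": "positive"}


def select_for_keyword(keyword, indicated):
    """Keep sentences that carry the indicated sentiment and contain the keyword."""
    return [
        sid
        for sid, keywords in sentence_keywords.items()
        if sentence_sentiment[sid] == indicated and keyword in keywords
    ]


print(select_for_keyword("K1", "positive"))  # ['S1']
```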
Abstract
A method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA) is provided. The method includes identifying one or more sentences associated with the content based on parts of speech, identifying at least one sentiment associated with the one or more sentences based on the parts of speech, identifying one or more keywords in the one or more sentences, disambiguating at least one ambiguous keyword from the one or more keywords using the wFCA, computing a weight for each sentence of the one or more sentences based on a number of keywords of the one or more keywords associated with each sentence, processing an input including an indication of the sentiment, and generating a summary on the content around the sentiment based on (a) the weight, and (b) at least one of (i) the at least one sentiment, and (ii) the indication.
Description
- This application claims priority to Indian patent application no. 263/CHE/2012 filed on Jan. 23, 2012, the complete disclosure of which, in its entirety, is herein incorporated by reference.
- 1. Technical Field
- The embodiments herein generally relate to content summarization, and more particularly to a system and method for summarizing content around a sentiment using weighted formal concept analysis (wFCA) based on user intent.
- 2. Description of the Related Art
- Documents obtained via an electronic medium (i.e., Internet or on-line services, or any other services) are often provided in such volume that it is important to summarize them. It is desirable to be able to quickly obtain a brief summary of a document rather than reading it in its entirety. Typically, such a document may span multiple paragraphs to several pages in length.
- Summarization or abstraction is even more essential in the framework of emerging “push” technologies, where a user has hardly any control over what documents arrive at the desktop for his/her attention. Summarization is always a key feature in content extraction and there is currently no solution available that provides a summary that is comparable to that of a human. Conventionally, summarization of content is manually performed by users, which is time consuming and also expensive. Further, it is slow and also not scalable for a large number of documents.
- Summarization involves representing whole content in a limited set of words without losing the main crux of the content. Traditional summarization of content (in general a document) is based on lexical chaining, in which the longest chain is assumed to best represent the content, and the first sentence of a summary is taken from the first sentence of the longest chain. The second-longest chain is assumed to be the next best, and the second sentence of the summary is then taken from the first sentence of the second-longest chain. However, this lexical chaining approach tends not only to miss important content related to the intent of the user but also to fail to elaborate it in a manner in which it can be easily understood. Accordingly, there remains a need for a system to automatically analyze one or more documents and generate an accurate summary based on user intent.
- In view of the foregoing, an embodiment herein provides a method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA). The method includes identifying one or more sentences associated with the content based on parts of speech, identifying one or more sentiments from the one or more sentences based on the parts of speech, identifying one or more keywords in the one or more sentences, disambiguating at least one ambiguous keyword from the one or more keywords using the wFCA, computing a weight for each sentence of the one or more sentences based on a number of keywords of the one or more keywords associated with each sentence, processing an input including an indication of the sentiment in the content, and generating a summary on the content around the sentiment based on (i) the weight, (ii) the one or more sentiments, and (iii) the indication.
- The sentiment may be a positive sentiment. Each sentence of the one or more sentences includes at least one positive statement. The sentiment may be a negative sentiment. Each sentence of the one or more sentences includes at least one negative statement. The one or more sentences may include (a) a first set of sentences. Each sentence of the first set of sentences includes at least one sentiment. The one or more sentences may further include (b) a second set of sentences that are not associated with the at least one sentiment. The weight may be computed based on a number of associations of each keyword within the one or more sentences. The summary may be expanded based on (a) the weight assigned for each sentence of the one or more sentences, and (b) at least one of i) the one or more sentiments, and ii) the indication.
- In another aspect, a non-transitory program storage device readable by computer, and including a program of instructions executable by the computer to perform a method for summarizing content around a sentiment based on weighted Formal Concept Analysis (wFCA) is provided. The method includes identifying one or more sentences associated with the content based on parts of speech, identifying one or more sentiments associated with the one or more sentences based on the parts of speech, identifying one or more keywords in the one or more sentences, disambiguating at least one ambiguous keyword from the one or more keywords using the wFCA, generating a graph to compute a weight for each sentence of the one or more sentences based on a number of keywords of the one or more keywords associated with each sentence, processing an input including an indication of the sentiment in the content, and generating a summary on the content based on (a) the weight, and (b) at least one of (i) one or more sentiments, and (ii) the indication. The weight may be computed based on a number of associations of each keyword within the one or more sentences. The summary may be expanded based on (a) the weight, and (b) at least one of (i) one or more sentiments, and (ii) the indication.
- In yet another aspect, a system for summarizing content around a sentiment based on weighted Formal Concept Analysis (wFCA) using a content summarization engine is provided. The system includes (a) a memory unit that stores (i) a set of modules, and (ii) a database, (b) a display unit, and (c) a processor that executes said set of modules. The set of modules includes (i) a sentence identifying module executed by the processor that identifies (a) a first set of sentences, and (b) a second set of sentences in the content based on parts of speech. Each sentence of the first set of sentences includes at least one keyword that indicates at least one sentiment. The second set of sentences is not associated with sentiments.
- The set of modules further includes (ii) a sentiment identifying module executed by the processor that identifies one or more sentiments associated with each sentence of the first set of sentences based on the parts of speech, (iii) a keyword identifying module executed by the processor that identifies one or more keywords from the first set of sentences and the second set of sentences, and (iv) a disambiguating module executed by the processor that disambiguates at least one ambiguous keyword from the one or more keywords using the wFCA by generating a lattice with the one or more keywords as objects, and categories associated with the one or more keywords as attributes. The categories are obtained from a knowledge base.
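A lattice with keywords as objects and categories as attributes can be illustrated with plain (unweighted) formal concept analysis. The sketch below is a minimal example under stated assumptions: the keyword-to-category context is invented for illustration rather than taken from the knowledge base, and the scoring in `disambiguate` (prefer the category shared with the most other keywords) is an assumed stand-in, since the text here does not spell out how the wFCA weights or scores are computed.

```python
from itertools import combinations

# Hypothetical formal context: keywords (objects) -> categories (attributes).
context = {
    "kingfisher": {"Airline", "Bird", "Beer"},
    "airlines": {"Airline"},
    "carrier": {"Airline"},
    "beer": {"Beer"},
}


def common_attributes(objects):
    """Attributes shared by every object (all attributes for the empty set)."""
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else set().union(*context.values())


def objects_having(attributes):
    """Objects that carry every attribute in the given set."""
    return {o for o, attrs in context.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    names = list(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def disambiguate(keyword):
    """Assumed scoring: pick the category whose extent is largest."""
    return max(sorted(context[keyword]), key=lambda c: len(objects_having({c})))


print(len(formal_concepts()), disambiguate("kingfisher"))  # 4 Airline
```

Enumerating every subset of objects is exponential and only suitable for a toy context like this one; it is used here because it makes the extent/intent closure explicit.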
- The set of modules further includes (v) a weight computing module executed by the processor that computes a weight for each sentence of (a) the first set of sentences, and (b) the second set of sentences based on a number of keywords associated with each sentence, (vi) a sentiment indicating module executed by the processor that processes an input including an indication of the sentiment in the content, and (vii) an intent building module executed by the processor that generates a summary on the content based on (a) the weight, and (b) at least one of (i) the one or more sentiments, and (ii) the indication.
- The summary may include at least one sentence that is only obtained from the first set of sentences. The summary may include at least one sentence from the second set of sentences. The weight may be computed based on a number of associations of each keyword within (a) the first set of sentences, and (b) the second set of sentences.
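One way to picture a weight that depends on the number of keyword associations is sketched below. The normalization is an assumption made for illustration (a keyword's weight is its share of all keyword-sentence associations, and a sentence's weight is the sum of its keywords' weights); it is not claimed to reproduce the values of FIGS. 11 and 12, and the keyword sets are hypothetical.

```python
from collections import Counter

# Hypothetical keyword sets per sentence (not the data of FIGS. 11 and 12).
sentence_keywords = {
    "S1": {"kingfisher", "deal", "banks"},
    "S2": {"deal", "banks"},
    "S3": {"banks"},
}


def keyword_weights(sentence_keywords):
    """Weight of a keyword = its number of sentence associations / all associations."""
    counts = Counter(k for kws in sentence_keywords.values() for k in kws)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def sentence_weights(sentence_keywords):
    """Weight of a sentence = sum of the weights of its associated keywords."""
    kw = keyword_weights(sentence_keywords)
    return {sid: sum(kw[k] for k in kws) for sid, kws in sentence_keywords.items()}


weights = sentence_weights(sentence_keywords)
print(sorted(weights, key=weights.get, reverse=True))  # ['S1', 'S2', 'S3']
```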
- The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
- FIG. 1 illustrates a system view of a user communicating with a user system to generate a summary based on one or more sentiments using a content summarization engine according to an embodiment herein;
- FIG. 2 illustrates an exploded view of the user system with a memory storage unit for storing the content summarization engine of FIG. 1, and an external database according to an embodiment herein;
- FIG. 3 illustrates an exploded view of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 4 illustrates a user interface view of the content collection module of FIG. 3 of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 5 illustrates a user interface view of an input content to be summarized using the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 6 illustrates an exploded view of the content annotation module of FIG. 3 of the content summarization engine of FIG. 1 according to an embodiment herein;
- FIG. 7 illustrates a user interface view of intent selection by the user of FIG. 1 according to an embodiment herein;
- FIG. 8 is a flow diagram illustrating a method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA) according to an embodiment herein;
- FIG. 9 illustrates a graphical representation of a lattice that is generated to disambiguate a keyword "kingfisher" of the input content using the lattice construction module of FIG. 3 according to an embodiment herein;
- FIG. 10 is a graphical representation illustrating a graph that indicates an association between one or more keywords and sentences of the input content of FIG. 5 according to an embodiment herein;
- FIG. 11 is a table view illustrating a weight that is computed for each keyword that is identified by the keyword identifying module of FIG. 3 based on a number of associations of each keyword with sentences of the input content according to an embodiment herein;
- FIG. 12 is a table view illustrating a weight that is computed for each sentence of the input content based on associated keywords according to an embodiment herein; and
- FIG. 13 illustrates a schematic diagram of a computer architecture according to an embodiment herein.
- The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
- As mentioned, there remains a need for a system to automatically analyze one or more documents and generate an accurate summary based on user intent. The user intent may be an overall summarization, a keyword based summarization, a page wise summarization, and/or a section wise summarization. The content summarization engine computes the crux of the content of a document, and generates a main story surrounding the content. The content summarization engine also provides an option for an end user to expand a summary for better understanding. The content summarization engine takes the end user directly to a main concept from where he/she can expand, and read the content in a flow that can give a better understanding without expending time in reading the entire content. Referring now to the drawings, and more particularly to FIGS. 1 through 13, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
- FIG. 1 illustrates a system view 100 of a user 102 communicating with a user system 104 to generate a summary based on one or more sentiments using a content summarization engine 106 according to an embodiment herein. In one embodiment, the user system 104A-N may be a personal computer (PC) 104A, a tablet 104B and/or a smart phone 104N. The content summarization engine 106 summarizes content.
- FIG. 2 illustrates an exploded view of the user system 104A-N with a memory storage unit 202 for storing the content summarization engine 106 of FIG. 1, and an external database 216 according to an embodiment herein. The user system 104A-N includes the memory storage unit 202, a bus 204, a communication device 206, a processor 208, a cursor control 210, a keyboard 212, and a display 214. The memory storage unit 202 stores the content summarization engine 106 that includes one or more modules to perform various functions on an input content and generates a summary intent surrounding the input content. The external database 216 includes a knowledge base 218 that is constructed based on concepts of linked data. The knowledge base 218 further includes a set of categories that correspond to various keywords.
- FIG. 3 illustrates an exploded view of the content summarization engine 106 of FIG. 1 according to an embodiment herein. The content summarization engine 106 includes a database 302, a content collection module 304, a content parsing/extraction module 306, a content cleaning module 308, a content annotation module 310, an annotation extractor module 312, a sentence identifying module 314A that includes a sentiment identifying module 314B, a keyword identifying module 316, a disambiguating module 318, a graph generating module 320, a sentiment indicating module 321, an intent building module 322, and an intent expanding module 324.
- The
content collection module 304 collects content from at least one format of text (e.g., including multiple different documents of different formats) provided by the user 102. Such formats may include, for example, .doc, .pdf, .rtf, a URL, a blog, a feed, etc. The content parsing/extraction module 306 fetches the content from these one or more documents (e.g., abc.doc, xyz.pdf, etc.), and provides the content required to generate a summary. Further, the content parsing/extraction module 306 parses HTML content in case one of the sources of input is a URL. The content parsing/extraction module 306 extracts one or more sentences from the content.
- The
content cleaning module 308 cleans the content, which may include removal of junk characters, new lines that are not useful, application specific symbols (e.g., MS Word bullets), non-Unicode characters, etc. In one embodiment, specific parts of a document (e.g., footer) are specified as to be excluded. Such exclusions can be specified based on a type of content that is provided (e.g., news article, resume, etc.), and the content cleaning module 308 is configurable accordingly.
- The
content annotation module 310 annotates the content for useful information, such as sentences, keywords, tokens, sentiments, durations, sections, durations within the sections, quantities, sentences associated with the sections, and sentences associated with durations and quantities. The annotation extractor module 312 extracts annotated information from the content annotation module 310. The sentence identifying module 314A identifies the one or more sentences from the content. The sentiment identifying module 314B identifies one or more sentiments from the one or more sentences. The one or more sentences may include a first set of sentences that are associated with a sentiment, and a second set of sentences that may not be associated with the sentiment (e.g., a set of neutral sentences). The first set of sentences may be a set of positive sentiments. The first set of sentences may be a set of negative sentiments. The sentence identifying module 314A may identify a third set of sentences that includes positive sentiments and negative sentiments. When a sentence is identified as having at least one of (a) a positive sentiment, (b) a negative sentiment, and (c) a neutral sentiment, the sentence identifying module 314A may prompt the user 102 to identify, edit/modify, and validate the sentence as (a) a positive sentiment sentence, (b) a negative sentiment sentence, or (c) a neutral sentiment sentence. The sentence identifying module 314A may then accept an input from the user 102 that confirms a type of the sentence and one or more sentiments associated with the sentence. The sentence identifying module 314A also identifies one or more sentences associated with a keyword when the user 102 intends to summarize the content around the sentiment.
- The
keyword identifying module 316 identifies one or more keywords from the first set of sentences and the second set of sentences of the content based on parts of speech. The keyword identifying module 316 may further identify one or more keywords from the third set of sentences. The disambiguating module 318 disambiguates at least one ambiguous keyword from the one or more keywords in a context of its meaning using weighted formal concept analysis (wFCA). The wFCA includes generation of a lattice with the one or more keywords as objects, and one or more categories associated with the one or more keywords as attributes. The one or more categories are obtained from the knowledge base 218.
- The
disambiguating module 318 further includes a lattice construction module 326 that generates a lattice to disambiguate the at least one ambiguous keyword. The lattice includes one or more concepts that are generated based on the one or more keywords, and the one or more categories associated with the one or more keywords. In one embodiment, the disambiguating module 318 further includes a score computing module 328 that computes a score for each concept of the one or more concepts. The score is used to disambiguate the at least one ambiguous keyword.
- The graph generating module 320 constructs/generates a graph to obtain an association between (i) sentences, (ii) keywords and sentences, and (iii) sentences and durations. In one embodiment, the graph includes one or more nodes, and each node corresponds to a sentence of the content. The graph generating module 320 includes a
weight assigning module 330 that assigns a weight to each sentence of the content. The weight assigning module 330 computes a weight for each sentence of (a) the first set of sentences, and (b) the second set of sentences based on a number of keywords associated with each sentence.
- The
graph generating module 320 generates a graph based on (a) the first set of sentences, (b) the second set of sentences, and (c) the plurality of keywords. Each node indicates a sentence of (a) the first set of sentences, or (b) the second set of sentences. The sentiment indicating module 321 processes an input that includes an indication of the sentiment in the content. The indication is provided by the user 102, in one example embodiment. The user 102 may indicate a type of sentiment for which summarization of the content occurs. The intent building module 322 generates a summary by tailoring one or more sentences in the exact order in which they appear in the content. In one embodiment, the intent building module 322 tailors the one or more sentences based on (a) the weight assigned for each sentence of the content, (b) at least one sentiment that is identified by the sentiment identifying module 314B, and (c) the indication provided by the user 102. The intent expanding module 324 expands the summary while preserving intent and elaborating further. The summary may include at least one of (i) at least one sentence obtained only from the first set of sentences, (ii) at least one sentence from the second set of sentences, and (iii) at least one sentence from the third set of sentences.
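The idea of picking sentences by weight and sentiment while keeping them in document order can be sketched as below. This is an assumed illustration rather than the intent building module itself: the function name `build_summary`, the cut-off `k`, and the choice to admit neutral sentences alongside the indicated sentiment are all hypothetical, and the weights are example values.

```python
# (sentence id, sentiment label, weight), listed in document order.
document = [
    ("S1", "positive", 0.2013),
    ("S2", "neutral", 0.1865),
    ("S3", "neutral", 0.01984),
    ("S5", "neutral", 0.03061),
    ("S6", "positive", 0.07143),
]


def build_summary(document, indicated, k):
    """Pick the k heaviest eligible sentences, then restore document order."""
    eligible = [
        (position, item)
        for position, item in enumerate(document)
        if item[1] in (indicated, "neutral")
    ]
    top = sorted(eligible, key=lambda pair: -pair[1][2])[:k]
    return [item[0] for _, item in sorted(top, key=lambda pair: pair[0])]


print(build_summary(document, "positive", 3))  # ['S1', 'S2', 'S6']
```

Raising `k` is one simple way to model expanding the summary: more sentences are admitted, but their relative order in the output never changes.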
- FIG. 4 illustrates a user interface view of the content collection module 304 of FIG. 3 of the content summarization engine 106 of FIG. 1 according to an embodiment herein. The user interface view of the content collection module 304 includes a header 402, a text field 404, an upload button 406, a URL text field 408, a fetch button 410, a drag and drop field 412, an upload file button 414, a task status table 416, a task progress field 418, and a proceed button 420. The header 402 displays a logo, a welcome message, and a status of an application. Through the text field 404, the user 102 can provide plain text to be summarized and submit it to a server using the upload button 406. The text may also be specified by a uniform resource locator (URL) in the URL text field 408, and text associated with the URL is crawled using the fetch button 410. The drag and drop field 412 is used to drag and drop any files with text to be uploaded. Through the upload file button 414, the user 102 can browse for a file to be uploaded. The task status table 416 displays the uploaded text, the URL and/or the file. The task progress field 418 notifies the user 102 about progress of a summarization process for each uploaded content, and the proceed button 420 directs the user 102 to a next page.
- FIG. 5 illustrates a user interface view 500 of an input content 502 to be summarized using the content summarization engine 106 of FIG. 1 according to an embodiment herein. The input content 502 may be obtained in the form of a document, a URL, and/or plain text as specified by the user 102 using the content collection module 304. The content collection module 304 collects the input content 502 and stores it on the server.
- In one embodiment, the
input content 502 is collected from one or more documents (e.g., abc.doc, and/or xyz.pdf), and is parsed/extracted (e.g., using the content parsing/extraction module 306 of FIG. 3). In another embodiment, the input content 502 may be fed as a URL (e.g., www.xyzairlines.com/content.html). The content collection module 304 fetches content associated with the URL, and the content is parsed/extracted using the content parsing/extraction module 306. The content summarization engine 106 obtains the input content 502 to generate a summary on the input content 502.
- The
content cleaning module 308 cleans the input content 502 before annotation is performed. Cleaning the content includes removing (i) junk characters, (ii) new lines that are not useful, (iii) application specific symbols (word processing bullets, etc.), and/or (iv) non-Unicode characters, etc. In one example embodiment, the input content 502 may already be in the form of cleaned text that does not require such removal.
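A cleaning pass of the kind listed above might look like the following sketch. It is an assumption about one reasonable implementation, not the module's actual code: the `BULLETS` set, the use of Unicode general categories to detect junk characters, and the function name `clean` are all hypothetical choices.

```python
import re
import unicodedata

# Assumed set of application-specific bullet symbols to strip.
BULLETS = "\u2022\u25e6\u25aa\u2023\u00b7"


def clean(text):
    # Drop control and format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Remove a leading bullet symbol (and surrounding spaces) on each line.
    text = re.sub(
        r"^[ \t]*[" + re.escape(BULLETS) + r"][ \t]*", "", text, flags=re.MULTILINE
    )
    # Collapse runs of blank lines into a single line break.
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()


raw = "\u2022 First item\x00\n\n\n\u2022 Second\u200b item"
print(clean(raw))
```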
- FIG. 6 illustrates an exploded view of the content annotation module 310 of FIG. 3 of the content summarization engine 106 of FIG. 1 according to an embodiment herein. The content annotation module 310 annotates the input content 502 for useful information (e.g., one or more keywords, one or more sentences, etc.). The content annotation module 310 includes a sentence annotations module 602, a token annotations module 604, a stem annotations module 606, a forced new lines, paragraphs and indentations computing module 608, a parts of speech (POS) token annotations module 610, a POS line annotations module 612, a duration and quantities determining module 614, a section annotations module 616, a section span annotations module 618, and a section duration annotations module 620. The dotted lines (arrows having a dotted line property) of FIG. 6 represent internal dependencies among various modules, whereas the solid lines (arrows having a solid line property) represent the flow of the annotation process.
- After parsing and cleaning the
input content 502, the cleaned content is annotated by performing various levels of annotations using sub-modules of the content annotation module 310. The sentence annotations module 602 extracts each and every sentence in the input content 502. For example, a first sentence of the input content 502 may be extracted by the sentence annotations module 602. The first sentence is:
  - S1: REUTERS: the chairman of kingfisher Airlines, Vijay Mallaya, said in an interview with the financial times that he was close to sealing a $370 million deal with an Indian private investor and a consortium of banks that would save the airlines.
- Similarly, the sentence annotations module 602 extracts all the sentences of the input content 502.
- S2: The Bangalore-based entrepreneur told the FT he was nearing a deal with 14 banks led by State Bank of India that would provide the loss-making carrier with working capital of 6 billion rupees.
- S3: He did not name the banks.
- S4: Earlier this week, kingfisher said its net loss for the September quarter doubled but Mallaya offered little to revive its finances.
- S5: It had also said it has been approached by strategic investors.
- S6: Mallaya, a flamboyant liquor baron who owns a Formula One Motor-racing team, told the paper he was finalizing a separate $250 million equity injection from an unnamed wealthy Indian individual to recapitalize the cash-strapped carrier.
- S7: He added that he was about to conclude a deal with the banks to reduce the interest rate which the airlines is currently paying on its $1.4 billion debt pile.
- S8: Mallaya said on the social networking site Twitter® that the report was “factually wrong” but he did not elaborate Reuters could not immediately reach company officials for a comment.
- S9: Shares kingfisher which is named after its best selling beer, were down more than 5 percentage in early trade on Friday in Mumbai.
- S10: Kingfisher, which listed when it bought out budget airline, Air Deccan in 2008, has never made a profit and its market value has plunged 64 percentage this year.
- S11: The airlines become No. 2 private carrier since it began its operations in 2005 as the economy boomed but has become one of the main causalities of high fuel costs and a fierce price war between a handful of airlines which, between them, have ordered hundreds of aircraft on delivery over the next decade in an ambitious bet on the future.
- The token annotations module 604 determines each and every token in the sentences of the input content 502. For example, “The”, “chairman”, “of”, “kingfisher”, “airlines”, “vijay”, “mallya”, “said”, “in”, “an”, “interview”, “with” are all tokens in a first line of the input content 502. The stem annotations module 606 computes a root word for each and every token identified by the token annotations module 604. For example, the stemmed form of the first sentence is:
- “reuter—the chairman of kingfish airlin, vijay mallaya, said in an interview with 370 million deal with an indian privat investor and a consortium of bank that would save the airlin.”
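The tokenization step can be illustrated with a small regular-expression tokenizer. This is a sketch only: the pattern and the function name `tokenize` are assumptions, and the stemming step is deliberately not reproduced because the text does not name the stemming algorithm that produces forms such as "kingfish" and "airlin".

```python
import re

# Opening words of sentence S1 from the input content.
S1_START = (
    "REUTERS: the chairman of kingfisher Airlines, Vijay Mallaya, "
    "said in an interview with"
)


def tokenize(sentence):
    """Split on anything that is not a letter, a digit or a dollar sign."""
    return re.findall(r"[A-Za-z0-9$]+", sentence)


tokens = tokenize(S1_START)
print(tokens[1:5])  # ['the', 'chairman', 'of', 'kingfisher']
```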
- The forced new lines, paragraphs and indentations computing module 608 determines white space such as new lines that are forced (an enter received, or a list of sentences that are not properly phrased), paragraphs, and/or indentations, etc. It is used to extract new lines and sentences separately as content in the case of documents like feeds and tweets, which most often do not follow the language semantics. Such documents may also contain sentences that are not phrased correctly. In such cases, the extraction of new lines and one or more sentences is more valuable. The POS token annotations module 610 generates one or more parts of speech (POS) tags such as a noun, a verb, etc. for each token in the sentences such that each token annotation has an associated POS tag. Further, the POS line annotations module 612 tags each token in the new lines as a noun or a verb, etc. In addition, the new lines are also useful for section extraction because section names may not be proper sentences. For example, in the input content 502, “Shares” and “Introduction” are not proper sentences but single words, and they are captured using the section annotations module 616 as new lines because they each occur on a separate line.
- The duration and
quantities determining module 614 extracts durations and quantities wherever they occur in the text of the content 502. For example, it extracts durations like “2008” and “2005”, and quantities like “64 percentage”, from the input content 502. The section annotations module 616 determines a group of sentences that form a section that has a heading. To determine a start point and an end point of the section, various heuristics are used, including lookup for well-known sections, sentence construction based on the parts of speech, relevance with respect to surrounding text, exclusion terms, term co-occurrence, etc. - In one embodiment, the
user 102 can specify a section name of the input content 502 around which a summary has to be generated. For example, from the input content 502, two section names are extracted, such as “introduction” and “shares”. The user 102 can obtain information from the content associated with either of these two sections by using the section span annotations module 618. The section duration annotations module 620 determines (i) one or more durations that appear in the section name specified by the user 102, and (ii) text associated with the one or more durations. If the user 102 does not specify the section name, a summary may be generated for the entire content. - Once annotations are done, the
annotation extractor module 312 extracts all the required artifacts (e.g., sentences, keywords, durations, sections, etc.) from the annotations. The annotation extractor module 312 extracts one or more sentences, one or more keywords, one or more sections, one or more durations within the section, one or more spans of the one or more durations, etc. that occur within the input content 502, and provides them to the user 102 for an intent selection. - With reference to
FIG. 6, FIG. 7 illustrates a user interface view 700 of the intent selection by the user 102 of FIG. 1 according to an embodiment herein. The user interface view 700 of the intent selection includes the header 402, the input content 502, a create folder(s) or organize content button 702, an intent analytics field 704, and an intent selection field 706. In one embodiment, the input content 502 may be obtained from one or more documents (e.g., an uploaded content). These documents may be listed and/or displayed as one or more scrollable lists (e.g., a left to right scrollable list, a right to left scrollable list, an up to down scrollable list and/or a down to up scrollable list). The create folder(s) or organize content button 702 is used for organizing those contents, and enables the creation of new folders, where the content from the scrollable lists can be dragged and dropped to a required folder to organize it. The intent analytics field 704 displays an analysis for the input content 502 which is selected from the scrollable lists. The analysis includes, but is not limited to, sections, summary, identified keywords, and other details such as duration information. The intent selection field 706 provides one or more options to specify various intents (analysis) around which summarization is to be done. In one embodiment, the user 102 can specify a section and/or a page of a document that includes content to be summarized. In another embodiment, the user 102 specifies a keyword around which summarization of content needs to be done. In yet another embodiment, the user 102 specifies an overall summarization when the user 102 intends to summarize the entire content. -
FIG. 8 is a flow diagram illustrating a method of summarizing content around a sentiment using the weighted Formal Concept Analysis (wFCA) according to an embodiment herein. In step 802, one or more sentences associated with the content are identified based on parts of speech. The one or more sentences may include the first set of sentences and the second set of sentences. The first set of sentences may include either positive sentences or negative sentences. A positive sentence S1 may be “REUTERS: the chairman of kingfisher Airlines, Vijay Mallya, said in an interview with the financial times that he was close to sealing a $370 million deal with an Indian private investor and a consortium of banks that would save the airlines”. Similarly, a positive sentence S2 may be “The Bangalore-based entrepreneur told the FT he was nearing a deal with 14 banks led by State Bank of India that would provide the loss-making carrier with working capital of 6 billion rupees”. Similarly, a negative sentence S11 may be “The airlines become No. 2 private carrier since it began its operations in 2005 as the economy boomed but has become one of the main causalities of high fuel costs and a fierce price war between a handful of airlines which, between them, have ordered hundreds of aircraft on delivery over the next decade in an ambitious bet on the future”. - In
step 804, one or more sentiments are identified in the one or more sentences based on the parts of speech. A positive statement and a negative statement may indicate the type of sentiment in a sentence. - In
step 806, one or more keywords in the one or more sentences are identified. In one embodiment, the keyword identifying module 316 identifies the one or more keywords based on the parts of speech (POS) tagged by the POS token annotations module 610, and/or the POS line annotations module 612 of FIG. 6. In step 808, one or more ambiguous keywords from the one or more keywords are disambiguated using the wFCA, by generating a lattice that includes one or more concepts. The one or more concepts are generated with (i) the one or more keywords as objects, and (ii) categories associated with the one or more keywords as attributes. The categories may be obtained from the knowledge base 218 of FIG. 2. In step 810, a weight is computed and assigned for each sentence of the one or more sentences based on a number of keywords of the one or more keywords that are associated with each sentence. The weight is computed based on a number of associations of each keyword within the one or more sentences. A graph may be generated based on the one or more sentences and the one or more keywords to compute the weight. The graph includes one or more nodes. Each node indicates a sentence of the one or more sentences that are associated with the sentiment. In step 812, an input that includes an indication of the sentiment is processed. The indication may include summarization around the sentiment (e.g., a positive sentiment, or a negative sentiment). In one embodiment, the indication may be provided by the user 102. In step 814, a summary on the content around the sentiment is generated based on (i) the weight and at least one of (a) the one or more sentiments, and (b) the indication. The summary may include at least one of (i) at least one sentence obtained only from the first set of sentences, and (ii) at least one sentence from the second set of sentences.
When the user 102 intends to expand the summary, the content summarization engine 106 expands the summary based on (a) the weight assigned for each sentence of the one or more sentences, and (b) at least one of (i) the one or more sentiments, and (ii) the indication. - For example, from the
input content 502, the one or more keywords are identified and extracted based on the parts of speech (POS) tags generated by the POS token annotations module 610, and/or the POS line annotations module 612. For instance, a noun is very likely to be a keyword in a sentence. Similarly, co-occurring nouns and their derivatives are also keywords. A keyword chunker is used to obtain these keywords and keyword phrases depending on the noun and related tags. In one embodiment, when the user 102 does not specify a section of the input content 502, then the entire input content 502 is summarized. The annotation extractor module 312 extracts the one or more keywords (e.g., 6 keywords) using the POS tags. For instance, the extracted keywords are: - reuters—POS Tag says that it is a noun
- chairman—POS Tag says that it is a noun
- Kingfisher Airlines—POS Tag says that it is a noun followed by a noun (phrase)
- Vijay Mallya—POS Tag says that it is a noun followed by a noun (phrase)
- Shares—POS Tag says that it is a noun
- Kingfisher—POS Tag says that it is a noun
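As a rough illustration of the keyword chunker described above, the following Python sketch merges consecutive noun-tagged tokens into keywords and keyword phrases. The Penn Treebank-style tags, the hand-tagged sample tokens, and the function name `chunk_keywords` are assumptions made for illustration only; the description does not specify a tag set or an implementation.

```python
from typing import List, Tuple

# Assumed noun tags (Penn Treebank style); the description only says "noun and related tags".
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def chunk_keywords(tagged_tokens: List[Tuple[str, str]]) -> List[str]:
    """Merge runs of consecutive noun tokens into keyword phrases."""
    keywords: List[str] = []
    current: List[str] = []
    for token, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            current.append(token)
        elif current:
            # A non-noun token closes the current noun run.
            keywords.append(" ".join(current))
            current = []
    if current:
        keywords.append(" ".join(current))
    return keywords


# Hand-tagged fragment of the first sentence (tags are illustrative assumptions).
tagged_sentence = [
    ("REUTERS", "NNP"), (":", ":"), ("The", "DT"), ("chairman", "NN"),
    ("of", "IN"), ("Kingfisher", "NNP"), ("Airlines", "NNPS"), (",", ","),
    ("Vijay", "NNP"), ("Mallya", "NNP"), (",", ","), ("said", "VBD"),
]

keywords = chunk_keywords(tagged_sentence)
print(keywords)
```

In this sketch a single noun yields a one-word keyword, while adjacent nouns such as "Kingfisher Airlines" are kept together as a phrase, mirroring the "noun followed by a noun (phrase)" entries in the list above.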
- Once these keywords are identified and extracted, they need to be disambiguated to find the right meaning in which the one or more keywords are used in the
input content 502. To disambiguate, the content summarization engine 106 determines different disambiguated terms from the one or more keywords, and their related categories. Further, the content summarization engine 106 uses the knowledge base 218 stored in the external database 216 for obtaining categories for the one or more keywords. Each keyword is queried separately against the knowledge base 218 and corresponding categories are obtained. For example, for the above keywords, the categories obtained are -
- REUTERS—{Society, Corporate Groups, Organizations, Organizations by type, Agencies, News agencies}
- chairman—{Business, Management, Management occupations}
- Kingfisher Airlines—{Business, Industry, Service Industries, Travel, Transportation, Transport by mode, Aviation, Aviation by Continent, Aviation in Asia, Aviation in India, Airlines of India}
- Vijay Mallya—{Business, Management, Management occupations, Business executives, Chief executives, Chief executives by nationality, Indian chief executives}
- Shares—{Business, Finance, Financial Economics, Financial markets, Stock market, Share (finance)}
- Kingfisher—{Nature, Natural Sciences, Biology, Zoology, Animals, Chordates, Vertebrates, Birds, Birds by common name, Kingfishers}
- For example, the keyword “kingfisher” has two disambiguations (one for “Kingfisher (Bird)” and one for “Kingfisher Airlines”). The categories corresponding to each word are shown against them. In one embodiment, the categories may be modified by the
user 102. The modification is taken as feedback on the categories suggested for the keywords and is used to train the knowledge base 218 for preferred categories. In order to disambiguate the keyword “kingfisher” and to compute a context in the right meaning, the content summarization engine 106 uses the lattice construction module 326. The lattice construction module 326 constructs a lattice based on the weighted Formal Concept Analysis (wFCA). -
FIG. 9 illustrates a graphical representation 900 of a lattice that is generated to disambiguate a keyword “kingfisher” of the input content 502 using the lattice construction module 326 of FIG. 3 according to an embodiment herein. The lattice construction module 326 forms various concepts with one or more keywords, and their associated categories. For example, concept-1 to concept-9 associated with FIG. 9 are: -
- Concept-1: [Kingfisher Airlines]: [Aviation in India, Travel, Aviation, Transport by mode, Aviation by Continent, Transportation, Business, Aviation in Asia, Service Industries, Airlines of India, Industry]
- Concept-2: [Shares]: [Finance, Business, Share (finance), Financial Economics, Stock market, Financial markets]
- Concept-3: [Vijay Mallya]: [Management occupations, Management, Business, Chief executives, Business executives, Indian chief executives, Chief executives by nationality]
- Concept-4: [Kingfisher]: [Animals, Zoology, Natural Sciences, Chordates, Vertebrates, Birds, Birds by common name, Kingfishers, Nature, Biology]
- Concept-5: [REUTERS]: [Society, Organizations by type, Agencies, Organizations, Corporate Groups, News agencies]
- Concept-6: [chairman, Vijay Mallya]: [Management occupations, Management, Business]
- Concept-7: [chairman, Kingfisher Airlines, Shares, Vijay Mallya]: [Business]
- Concept-8: [ ]: [Aviation in India, Society, Travel, Animals, Management, Management occupations, Organizations, Chief executives, Indian chief executives, Corporate Groups, Share (finance), Financial Economics, Chordates, Airlines of India, Transport by mode, Transportation, Agencies, Aviation in Asia, Aviation, Organizations by type, Zoology, Aviation by Continent, Business, Finance, Business executives, Natural Sciences, News agencies, Chief executives by nationality, Birds by common name, Nature, Service Industries, Stock market, Vertebrates, Birds, Financial markets, Kingfishers, Industry, Biology]
- Concept-9: [chairman, Kingfisher Airlines, Shares, Kingfisher, Vijay Mallya, REUTERS]: [ ]
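The nine concepts listed above can be reproduced with a small brute-force formal concept analysis over the keyword/category table. The Python sketch below is illustrative only: it enumerates every keyword subset (feasible for six keywords), and it assumes a simple score equal to the share of keywords in a concept's extent, skipping concepts with an empty category set. The function names `formal_concepts` and `best_score` are hypothetical and do not come from the description.

```python
from itertools import combinations

# Keyword -> categories, copied from the category listing above.
CONTEXT = {
    "REUTERS": {"Society", "Corporate Groups", "Organizations",
                "Organizations by type", "Agencies", "News agencies"},
    "chairman": {"Business", "Management", "Management occupations"},
    "Kingfisher Airlines": {"Business", "Industry", "Service Industries", "Travel",
                            "Transportation", "Transport by mode", "Aviation",
                            "Aviation by Continent", "Aviation in Asia",
                            "Aviation in India", "Airlines of India"},
    "Vijay Mallya": {"Business", "Management", "Management occupations",
                     "Business executives", "Chief executives",
                     "Chief executives by nationality", "Indian chief executives"},
    "Shares": {"Business", "Finance", "Financial Economics", "Financial markets",
               "Stock market", "Share (finance)"},
    "Kingfisher": {"Nature", "Natural Sciences", "Biology", "Zoology", "Animals",
                   "Chordates", "Vertebrates", "Birds", "Birds by common name",
                   "Kingfishers"},
}


def formal_concepts(context):
    """Return every (extent, intent) pair of the context by brute force."""
    all_categories = frozenset().union(*context.values())
    intents = {all_categories}  # intent shared by the empty keyword set
    keywords = list(context)
    for size in range(1, len(keywords) + 1):
        for group in combinations(keywords, size):
            shared = set.intersection(*(context[k] for k in group))
            intents.add(frozenset(shared))
    # The extent of an intent is every keyword that carries all of its categories.
    return [
        (frozenset(k for k, cats in context.items() if intent <= cats), intent)
        for intent in intents
    ]


def best_score(keyword, concepts, context):
    """Assumed heuristic: largest keyword share among concepts that
    contain the keyword and have at least one shared category."""
    return max(
        len(extent) / len(context)
        for extent, intent in concepts
        if keyword in extent and intent
    )


concepts = formal_concepts(CONTEXT)
candidates = ["Kingfisher Airlines", "Kingfisher"]
chosen = max(candidates, key=lambda k: best_score(k, concepts, CONTEXT))
print(len(concepts), chosen)
```

Under these assumptions, "Kingfisher Airlines" shares the category "Business" with three other keywords, whereas "Kingfisher" shares no category with any other keyword, so the airline sense is preferred.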
- In one embodiment, the
lattice construction module 326 interprets that the concept 4 “Kingfisher” is not associated with any other concept and there are no matching contexts. In contrast, the concept 1 “Kingfisher Airlines” is associated with the concept 3 “Vijay Mallya”, the concept 2 “shares” and the concept 6 “chairman”, and it also has an overlapping context “business” (concept 7). Thus, the word kingfisher is treated as “Kingfisher Airlines” and not “Kingfisher (bird)”. - Further,
concept 1 to concept 5 define distinct category sets for each keyword. The keyword “chairman” does not have a distinct concept because its category set is a subset of the category set of the keyword “Vijay Mallya”. This implies that the keywords “chairman” and “Vijay Mallya” are strongly related in a context of the input content 502. In addition, the concept 6 and concept 7 provide contextual information that the keywords “chairman”, “Kingfisher Airlines”, “Shares” and “Vijay Mallya” are related in the context of the input content 502. The keywords “reuters” and “Kingfisher” are not related to any other keywords and are treated as unimportant (lower priority) in the context of the input content 502, and there is no concept that covers all the categories. - The
score computing module 328 computes a score (shown as a percentage) for each node or concept in FIG. 9 using the weighted FCA. A simple heuristic model of the weighted FCA computes the score of the nodes, and the node with the highest score is used to disambiguate a keyword in the context of the right meaning. For computing the score, the heuristic may assign equal probability to all six keywords. Hence, there are 6 keywords in total, each having a score of ⅙. The concept 1 to concept 5 define a distinct category set for each keyword. Therefore, the score for each keyword of concept 1 to concept 5 is ⅙ (16.67%). - However, as described, the keyword “chairman” does not have a distinct category because its category set is a subset of the category set of the keyword “Vijay Mallya”. Such categories are “Management occupations”, “Management” and “Business”. Further, these categories are common to both keywords “chairman” and “Vijay Mallya” and hence they are strongly associated in the context of the
input content 502. This forms the concept 6. - Further, a score for the
concept 6 is (⅙)*2 which is 33.33%. - In addition, a category “business” is associated with the categories of the
concept 2. Thus, the keywords “kingfisher Airlines”, “shares”, “vijay Mallya”, and “chairman” are strongly associated with the category business. This forms the concept 7. The score for the concept 7 will be (⅙)*4 which is 66.67%. Hence, the keyword “kingfisher Airlines” as described is strongly associated with the category “business”. Thus, the keyword “kingfisher” is treated as “kingfisher Airlines” and not as “kingfisher (bird)” by using weighted FCA. - In one embodiment, the weighted FCA can be further drilled down to provide more precise results, which is often useful to obtain more contextual information for disambiguation. In one embodiment, using the weighted FCA, the disambiguation is done by treating all the categories at the same level and ignoring hierarchy. Whereas, in drill down FCA all the associated categories form a hierarchy in the
knowledge base 218. - For example, consider the hierarchy for Chairman, Kingfisher Airlines and Vijay Mallya.
-
- Chairman—{Business->Management->Management occupations}
- Kingfisher Airlines—{Business->Industry->Service Industries->Travel->Transportation->Transport by mode->Aviation->Aviation by Continent->Aviation in Asia->Aviation in India, Airlines of India}
- Vijay Mallya—{Business->Management->Management occupations->Business executives->Chief executives->Chief executives by nationality->Indian chief executives}
At the first level of the weighted FCA, i.e., considering only the root element “business”, the weight for all three keywords in the concept is (⅓)*3=1.0. Hence, in the context of “business” all are related. But, using the drill down FCA, if the “Business” category is drilled down to a set of categories such as {Business, Management, Industry}, two drill down concepts will be obtained with respect to the three concepts.
- Concept-10: [chairman, Vijay Mallya]: [Business, Management] Weight: ⅔ ≈ 0.67
- Concept-11: [Kingfisher Airlines]: [Business, Industry] Weight: ⅓ ≈ 0.33
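The drill-down step above can be sketched as grouping keywords by a prefix of their category path and weighting each group by its share of the keywords. This Python sketch is a minimal illustration under stated assumptions: the helper name `drill_down`, the use of a path prefix as the drill-down key, and treating the final two Kingfisher Airlines categories as consecutive path levels are all choices made here, not details given in the description.

```python
from collections import defaultdict
from fractions import Fraction

# Category paths from the hierarchy listed above (root first).
HIERARCHY = {
    "chairman": ["Business", "Management", "Management occupations"],
    "Kingfisher Airlines": ["Business", "Industry", "Service Industries", "Travel",
                            "Transportation", "Transport by mode", "Aviation",
                            "Aviation by Continent", "Aviation in Asia",
                            "Aviation in India", "Airlines of India"],
    "Vijay Mallya": ["Business", "Management", "Management occupations",
                     "Business executives", "Chief executives",
                     "Chief executives by nationality", "Indian chief executives"],
}


def drill_down(hierarchy, depth):
    """Group keywords by the first `depth` levels of their category path.

    Returns {path_prefix: (keywords, weight)}, where weight is the share
    of all keywords that fall under that prefix.
    """
    groups = defaultdict(list)
    for keyword, path in hierarchy.items():
        groups[tuple(path[:depth])].append(keyword)
    total = len(hierarchy)
    return {
        prefix: (members, Fraction(len(members), total))
        for prefix, members in groups.items()
    }


level_1 = drill_down(HIERARCHY, 1)  # root only: everything falls under "Business"
level_2 = drill_down(HIERARCHY, 2)  # one level deeper: Management vs. Industry

for prefix, (members, weight) in level_2.items():
    print(prefix, members, float(weight))
```

Increasing `depth` corresponds to drilling further down the hierarchy; the sketch stops at depth 2 because that is the level at which Concept-10 and Concept-11 separate.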
- By performing the drill-down FCA with the subset of categories, contextual information is obtained. The contextual information indicates an affinity among the keywords “chairman”, “Kingfisher Airlines” and “Vijay Mallya”. For instance,
concept 10 and concept 11 show that although “chairman”, “Kingfisher Airlines” and “Vijay Mallya” are related in a context of “Business”, “Kingfisher Airlines” is a different concept in a context of “Business” with respect to “Industry”, whereas “chairman” and “Vijay Mallya” are related in a context of “Business” with respect to “Management”. Similarly, the drill down FCA can be performed until all the contextual information is retrieved and the disambiguation is achieved. Further, the content summarization engine 106 accepts the input content 502 at the disambiguation step as well, and the user 102 can correct incorrect associations by viewing alternative category associations. - In an embodiment, the
user 102 may also disambiguate one or more keywords in the context of the right meaning using the popularity of the one or more keywords. In yet another embodiment, the user 102 is provided with a graphical representation for visualizing summarization around the one or more keywords. The user 102 can view a graph having the one or more keywords around which related text such as keywords, sentences, and/or content are associated. Once disambiguation is over, the content summarization engine 106 has one or more disambiguated keywords to generate a content graph. -
FIG. 10 is a graphical representation illustrating a graph 1000 that indicates an association between one or more keywords and sentences of the input content 502 of FIG. 5 according to an embodiment herein. The graph generating module 320 generates the graph 1000 that indicates the sentences S1 through S11 and their associated keywords. The sentences S1 and S6 are positive sentences. The sentences S4, S7, S8, S9, S10, and S11 are negative sentences. The sentences S2, S3 and S5 are the set of neutral sentences. The graph 1000 includes one or more nodes, and each node corresponds to at least one sentence of the input content 502. The graph 1000 further includes one or more keywords of the input content 502 identified by the keyword identifying module 316. - For example, the
graph 1000 is generated for the input content 502 with sentences S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, and S11 associated with the keywords K1, K2, K3 . . . and K21 such as, - K1: Kingfisher Airlines
- K2: Shares
- K3: chairman
- K4: State bank of India
- K5: Reuters
- K6: Investor
- K7: Financial times
- K8: Banks
- K9: Deal
- K10: Interview
- K11: Bangalore
- K12: Entrepreneur
- K13: Vijay Mallya
- K14: Equity/Equity injection
- K15: Market value
- K16: Debt
- K17: Profit
- K18: Economy
- K19: Aircraft
- K20: Fuel
- K21: Causalities
- For instance, all the keywords K1, K2, K3, . . . and K21 initially have an equal weight of 1/21 (i.e., 0.04762). However, based on the number of associations of each keyword with the sentences, the actual weight may vary. For example, the keyword K1 “kingfisher Airlines” is associated with S1 directly, and also indirectly with S4, S5, S9, and S10 as “kingfisher”, and with S7, S10, and S11 as “Airline or Airlines”. The keyword “kingfisher” is treated here as “kingfisher Airlines” as already disambiguated. The keyword “kingfisher Airlines” is thus associated with 7 distinct sentences (S10 is counted once). Hence, the weight for the keyword K1 “kingfisher Airlines” is computed as 0.04762/7. Similarly, a weight is computed for each keyword based on its number of associations.
- Similarly, keywords K4, K7, K8, K9, K11 and K12 are associated with the second sentence S2. Keywords K8 and K13 are associated with the third sentence S3. Keywords K1 and K13 are associated with the fourth sentence S4. Keywords K1 and K6 are associated with the fifth sentence S5. Keywords K7, K13 and K14 are associated with the sixth sentence S6. Keywords K1, K8, K9, K13, and K16 are associated with the seventh sentence S7. Keywords K5 and K13 are associated with the eighth sentence S8. Keywords K1 and K2 are associated with the ninth sentence S9. Keywords K1, K15, and K17 are associated with the tenth sentence S10. Keywords K1, K18, K19, K20 and K21 are associated with the eleventh sentence S11.
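The keyword weighting described above can be sketched by inverting the sentence-to-keyword associations and dividing the equal base weight by each keyword's association count. In the Python sketch below, the S2 to S11 associations are copied from the paragraph above and the S1 associations (K1, K3, K5, K6, K7, K8, K9, K10, K13) are those given in the description of the first sentence's weight; the function name `keyword_weights` and the data layout are assumptions for illustration.

```python
from collections import Counter

# Sentence -> associated keywords, as listed in the description.
SENTENCE_KEYWORDS = {
    "S1": ["K1", "K3", "K5", "K6", "K7", "K8", "K9", "K10", "K13"],
    "S2": ["K4", "K7", "K8", "K9", "K11", "K12"],
    "S3": ["K8", "K13"],
    "S4": ["K1", "K13"],
    "S5": ["K1", "K6"],
    "S6": ["K7", "K13", "K14"],
    "S7": ["K1", "K8", "K9", "K13", "K16"],
    "S8": ["K5", "K13"],
    "S9": ["K1", "K2"],
    "S10": ["K1", "K15", "K17"],
    "S11": ["K1", "K18", "K19", "K20", "K21"],
}


def keyword_weights(sentence_keywords):
    """Return (association counts, weights) for every keyword.

    Each keyword starts from an equal base weight of 1 / (number of keywords),
    which is then divided by the number of sentences it is associated with.
    """
    counts = Counter(
        keyword
        for keywords in sentence_keywords.values()
        for keyword in set(keywords)  # count each sentence at most once
    )
    base = 1 / len(counts)
    weights = {keyword: base / n for keyword, n in counts.items()}
    return counts, weights


association_counts, weights = keyword_weights(SENTENCE_KEYWORDS)
print(association_counts["K1"], round(weights["K1"], 5))
```

The sketch uses the exact base 1/21 rather than the rounded 0.04762, so its values can differ from the quoted figures in the last decimal place.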
-
FIG. 11 is a table view 1100 illustrating a weight 1102 that is computed for each keyword that is identified by the keyword identifying module 316 of FIG. 3 based on a number of associations 1104 of each keyword with sentences of the input content 502 according to an embodiment herein. In the same way that the weight for the keyword K1 “kingfisher Airlines” is computed based on the number of associations of the keyword K1 with sentences of the input content 502, a weight is computed for each keyword as shown in the table. - Once the weight is computed for each keyword, the
weight assigning module 330 assigns a weight for a sentence in the input content 502 based on a count that corresponds to the keywords associated with a node that corresponds to the sentence using simple heuristics. Similarly, a weight is computed for each sentence of the input content 502. For example, a weight for the first sentence S1 is computed based on a count that corresponds to the keywords associated with a node that corresponds to the first sentence S1. The keywords associated with the first sentence S1 are K1, K3, K5, K6, K7, K8, K9, K10, and K13. The weight for the first sentence S1 is computed as the summation of the weights associated with the keywords that correspond to the first sentence S1. - Thus,
-
Weight of S1=(Weight of K1)+(Weight of K3)+(Weight of K5)+(Weight of K6)+(Weight of K7)+(Weight of K8)+(Weight of K9)+(Weight of K10)+(Weight of K13)=(0.04762/7)+(0.04762)+(0.04762/2)+(0.04762/2)+(0.04762/3)+(0.04762/4)+(0.04762/3)+(0.04762)+(0.04762/6)=0.2013 -
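The summation above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the association counts are read off the denominators in the formula (a keyword with no denominator has a count of 1), the rounded base weight 0.04762 is used as in the text, and the names `BASE_WEIGHT` and `sentence_weight` are hypothetical.

```python
# Rounded base weight of 1/21, as used in the formula above.
BASE_WEIGHT = 0.04762

# Association counts of the keywords of S1, taken from the formula's denominators.
S1_ASSOCIATION_COUNTS = {
    "K1": 7, "K3": 1, "K5": 2, "K6": 2, "K7": 3,
    "K8": 4, "K9": 3, "K10": 1, "K13": 6,
}

# Counts for the keywords of S11, derived from the keyword-sentence
# associations listed earlier (K18 to K21 occur only in S11).
S11_ASSOCIATION_COUNTS = {"K1": 7, "K18": 1, "K19": 1, "K20": 1, "K21": 1}


def sentence_weight(association_counts, base=BASE_WEIGHT):
    """Sum the weights (base / association count) of a sentence's keywords."""
    return sum(base / count for count in association_counts.values())


s1_weight = sentence_weight(S1_ASSOCIATION_COUNTS)
s11_weight = sentence_weight(S11_ASSOCIATION_COUNTS)
print(round(s1_weight, 4), round(s11_weight, 4))
```

Ranking sentences by this value is one simple way to pick the sentences that carry the most keyword associations.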
FIG. 12 is a table view 1200 illustrating a weight 1202 that is computed for each sentence of the input content 502 based on associated keywords 1204 according to an embodiment herein. In the same way that the weight for the sentence S1 is computed, a weight is computed for each sentence of the input content 502 as shown in the table. - From the
graph 1000 of FIG. 10, the graph generating module 320 interprets that S1 is the most important sentence when compared to other sentences, because it has the largest number of associations with the keywords. In one embodiment, the current example explains a simple weighting scheme based on keywords. However, a weighting scheme can also depend on various factors, like sentence selection, section selection and sentiment selection. - Once the most important sentences are identified, the
intent building module 322 is used for tailoring sentences together in the exact sequence in which they appear in the original text, and provides a summary of the input content 502. The number of sentences to be used as a summary depends on an input parameter from the user 102 as well as a weighted cut-off that is configurable. - In one embodiment, the
user 102 can expand the summary of the input content 502 using the intent expanding module 324. For example, a first level of summary for the input content 502 has only S1 (having the highest weight). If the user wants to expand the summary to a second level, the intent expanding module 324 relaxes the weight of sentences. Then, the next most important sentence, S11 (associated with 5 keywords, and having the next highest weight), is tailored with S1 in the exact sequence in which they appear in the input content 502. In one embodiment, when a summary on a particular section is requested, then the summary is generated by considering only sentences occurring in that section. Similarly, in another embodiment, when a summary on a particular page is requested, then the summary is generated by considering only sentences occurring on that page. In yet another embodiment, the user 102 intends to summarize content around a particular keyword. - The sentiments that have the highest weight are arranged at the top, followed by the sentiments with progressively lower weights, in one example embodiment. For example, the positive sentiments are at the top followed by neutral sentiments, in another example embodiment. When the
user 102 intends to summarize the input content 502 around the one or more positive sentiments, the content summarization engine 106 arranges the sentence S1 having a weight 0.2013 and the sentence S6 having a weight 0.07143 as positive sentences at the top, and these may be followed by one or more neutral sentences (e.g., the neutral sentence S2 having a weight 0.1865, the neutral sentence S5 having a weight 0.03061, and the neutral sentence S3 having a weight 0.01984). Similarly, when the user 102 intends to summarize the input content 502 around the one or more negative sentiments, the content summarization engine 106 arranges the sentence S11 having a weight 0.1973 and the sentence S10 as negative sentences at the top, and these may be followed by one or more neutral sentences (e.g., the neutral sentences S2, S5, and S3). - Where there are no neutral sentences, only the positive sentences having the highest weight are arranged at the top, and may be followed by the positive sentences that have lesser weight. Similarly, where there are no neutral sentences, only the negative sentences having the highest weight are arranged at the top, and may be followed by the negative sentences that have lesser weight. Further, to summarize around a positive sentiment, (i) one or more keywords are identified from one or more positive sentences and not from the
entire input content 502, and (ii) one or more ambiguous keywords are disambiguated from the one or more keywords, in one example embodiment. Similarly, to summarize around a negative sentiment, (i) one or more keywords are identified from one or more negative sentences and not from the entire input content 502, and (ii) one or more ambiguous keywords are disambiguated from the one or more keywords, in another example embodiment. Once the summary is generated around the positive sentiment, or around the negative sentiment, the summary may be displayed either in a text format, or in a graphical representation (e.g., a bar chart, or a pie chart). - The techniques provided by the embodiments herein may be implemented on an integrated circuit chip (not shown). The chip design is created in a graphical computer programming language, and stored in a computer storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication of photolithographic masks, which typically include multiple copies of the chip design in question that are to be formed on a wafer. The photolithographic masks are utilized to define areas of the wafer (and/or the layers thereon) to be etched or otherwise processed.
- The resulting integrated circuit chips can be distributed by the fabricator in raw wafer form (that is, as a single wafer that has multiple unpackaged chips), as a bare die, or in a packaged form. In the latter case the chip is mounted in a single chip package (such as a plastic carrier, with leads that are affixed to a motherboard or other higher level carrier) or in a multichip package (such as a ceramic carrier that has either or both surface interconnections or buried interconnections). In any case the chip is then integrated with other chips, discrete circuit elements, and/or other signal processing devices as part of either (a) an intermediate product, such as a motherboard, or (b) an end product. The end product can be any product that includes integrated circuit chips, ranging from toys and other low-end applications to advanced computer products having a display, a keyboard or other input device, and a central processor. The embodiments herein can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc.
- Furthermore, the embodiments herein can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
- A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
- A representative hardware environment for practicing the embodiments herein is depicted in
FIG. 13. This schematic drawing illustrates a hardware configuration of an information handling/computer system in accordance with the embodiments herein. The system comprises at least one processor or central processing unit (CPU) 10. The CPUs 10 are interconnected via system bus 12 to various devices such as a random access memory (RAM) 14, read-only memory (ROM) 16, and an input/output (I/O) adapter 18. The I/O adapter 18 can connect to peripheral devices, such as disk units 11 and tape drives 13, or other program storage devices that are readable by the system. The system can read the inventive instructions on the program storage devices and follow these instructions to execute the methodology of the embodiments herein. The system further includes a user interface adapter 19 that connects a keyboard 15, mouse 17, speaker 24, microphone 22, and/or other user interface devices such as a touch screen device (not shown) to the bus 12 to gather user input. Additionally, a communication adapter 20 connects the bus 12 to a data processing network 25, and a display adapter 21 connects the bus 12 to a display device 23 which may be embodied as an output device such as a monitor, printer, or transmitter, for example. - The
content summarization engine 106 provides the user 102 a more precise summary on content, and assists theuser 102 to grasp it quickly. Moreover, sentences of the content are stitched in an exact order as appear in the content. This provides continuity and clear understanding to theuser 102 while reviewing the summary. Thecontent summarization engine 106 saves considerable amount of user's time by providing the summary, and theuser 102 can also expand the summary for better understanding. Also, theuser 102 obtains a summary on the content based on their intent. - The
content summarization engine 106 may also enable summarization around a keyword. For example, a summary around a keyword may be generated based on (i) a sentence having a positive sentiment, or (ii) a sentence having a negative sentiment. For instances, theuser 102 intents to summarize theinput content 502 around the keyword K1—Kingfisher Airlines that is either associated with a set of positive sentences, or associated with a set of negative sentence. When theuser 102 intents to summarize the keyword K1—Kingfisher Airlines that is associated with the set of positive sentences (S1 and S6), only the sentence S1 is selected for summarization since the keyword K1 is not associated the sentence S6. When theuser 102 intents to summarize the keyword K1—Kingfisher Airlines that is associated with the set of negative sentences (S4, S7, S8, S9, S10, S11), at least one sentence (S4, S7, S9, S10, and/or S11) is selected for summarization (e.g., S4). Since the keyword K1 is not associated the sentence S8, the sentence S8 is not selected for summarization by thecontent summarization engine 106. - The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.
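The keyword-centered selection described for K1 (Kingfisher Airlines) can be illustrated with a minimal Python sketch. It is not the engine's implementation: the `Sentence` type, the function name, and the placeholder keyword `K2` are assumptions introduced here, and only the sentiment labels and the K1 associations of S1, S4, and S6 through S11 are taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    sid: str              # sentence label, e.g. "S1"
    sentiment: str        # "positive" or "negative"
    keywords: frozenset   # keyword labels associated with the sentence


def candidates_around_keyword(sentences, keyword, sentiment):
    """Return the sentences that carry the requested sentiment and are
    associated with the keyword, preserving document order."""
    return [s for s in sentences
            if s.sentiment == sentiment and keyword in s.keywords]


# Sentiments and K1 associations follow the example in the description.
# "K2" is a hypothetical stand-in for whatever other keywords S6 and S8 contain.
SENTENCES = [
    Sentence("S1", "positive", frozenset({"K1"})),
    Sentence("S4", "negative", frozenset({"K1"})),
    Sentence("S6", "positive", frozenset({"K2"})),
    Sentence("S7", "negative", frozenset({"K1"})),
    Sentence("S8", "negative", frozenset({"K2"})),
    Sentence("S9", "negative", frozenset({"K1"})),
    Sentence("S10", "negative", frozenset({"K1"})),
    Sentence("S11", "negative", frozenset({"K1"})),
]

positive_ids = [s.sid for s in candidates_around_keyword(SENTENCES, "K1", "positive")]
negative_ids = [s.sid for s in candidates_around_keyword(SENTENCES, "K1", "negative")]
print(positive_ids)  # ['S1']
print(negative_ids)  # ['S4', 'S7', 'S9', 'S10', 'S11']
```

S6 and S8 drop out because they are not associated with K1, which matches the outcome described above. Choosing which of the remaining negative candidates to include (e.g., S4) would additionally depend on the sentence weights, which this sketch does not model.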
Claims (14)
1. A method of summarizing content around a sentiment using weighted Formal Concept Analysis (wFCA), said method comprising:
(i) identifying, by a processor, a plurality of sentences associated with said content based on parts of speech;
(ii) identifying, by said processor, at least one sentiment associated with said plurality of sentences based on said parts of speech;
(iii) identifying, by said processor, a plurality of keywords in said plurality of sentences;
(iv) disambiguating, by said processor, at least one ambiguous keyword from said plurality of keywords using said wFCA;
(v) computing, by said processor, a weight for each sentence of said plurality of sentences based on a number of keywords of said plurality of keywords associated with said each sentence;
(vi) processing, by said processor, an input comprising an indication of said sentiment; and
(vii) generating, by said processor, a summary of said content based on (a) said weight, and (b) at least one of i) said at least one sentiment, and ii) said indication.
2. The method of claim 1, wherein said sentiment is a positive sentiment, wherein at least one sentence of said plurality of sentences comprises at least one positive statement.
3. The method of claim 1, wherein said sentiment is a negative sentiment, wherein at least one sentence of said plurality of sentences comprises at least one negative statement.
4. The method of claim 1, wherein said plurality of sentences comprise (a) a first set of sentences, wherein each sentence of said first set of sentences comprises said at least one sentiment.
5. The method of claim 4, wherein said plurality of sentences further comprise (b) a second set of sentences that are not associated with said at least one sentiment.
6. The method of claim 1, wherein said weight is computed based on a number of associations of each keyword within said plurality of sentences.
7. The method of claim 6, further comprising expanding said summary based on (a) said weight assigned for each sentence of said plurality of sentences, and (b) at least one of i) said at least one sentiment, and ii) said indication.
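Steps (v) through (vii) of claim 1 and the expansion of claim 7 can be illustrated with a minimal Python sketch. It assumes that the weight of a sentence is simply the count of identified keywords associated with it, and that a summary takes the `n` heaviest sentences matching the indicated sentiment before restoring document order. The function names, the `n` parameter, and all sample data are hypothetical and are not taken from the claims.

```python
def sentence_weights(sentence_keywords):
    """Weight of each sentence = number of identified keywords associated with it."""
    return {sid: len(kws) for sid, kws in sentence_keywords.items()}


def summarize(order, weights, sentiments, wanted, n):
    """Pick the n heaviest sentences carrying the wanted sentiment, then
    restore document order so the summary reads continuously."""
    matching = [sid for sid in order if sentiments[sid] == wanted]
    # sorted() is stable, so equally weighted sentences keep document order.
    top = sorted(matching, key=lambda sid: weights[sid], reverse=True)[:n]
    return [sid for sid in order if sid in top]


# Hypothetical content: five sentences in document order.
order = ["S1", "S2", "S3", "S4", "S5"]
sentence_keywords = {
    "S1": {"K1", "K2"},
    "S2": {"K3"},
    "S3": {"K1", "K2", "K3"},
    "S4": {"K2"},
    "S5": {"K1", "K3"},
}
sentiments = {
    "S1": "negative",
    "S2": "positive",
    "S3": "negative",
    "S4": "negative",
    "S5": "positive",
}

weights = sentence_weights(sentence_keywords)
summary = summarize(order, weights, sentiments, "negative", n=1)
expanded = summarize(order, weights, sentiments, "negative", n=2)
print(summary)   # ['S3']
print(expanded)  # ['S1', 'S3']
```

Expanding the summary is modeled here as raising `n`, which admits the next-heaviest matching sentence; this is one possible reading of claim 7, not a statement of how the expansion is actually performed.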
8. A non-transitory program storage device readable by computer, and comprising a program of instructions executable by said computer to perform a method for summarizing content around a sentiment based on weighted Formal Concept Analysis (wFCA), said method comprising:
(i) identifying, by a processor, a plurality of sentences associated with said content based on parts of speech;
(ii) identifying, by said processor, at least one sentiment associated with said plurality of sentences based on said parts of speech;
(iii) identifying, by said processor, a plurality of keywords in said plurality of sentences;
(iv) disambiguating, by said processor, at least one ambiguous keyword from said plurality of keywords using said wFCA;
(v) generating, by said processor, a graph to compute a weight for each sentence of said plurality of sentences based on a number of keywords of said plurality of keywords associated with each sentence;
(vi) processing, by said processor, an input comprising an indication of said sentiment; and
(vii) generating, by said processor, a summary of said content based on (a) said weight, and (b) at least one of i) said at least one sentiment, and ii) said indication.
9. The non-transitory program storage device of claim 8, wherein said weight is computed based on a number of associations of each keyword within said plurality of sentences.
10. The non-transitory program storage device of claim 9, wherein said method further comprises expanding said summary based on (a) said weight, and (b) at least one of i) said at least one sentiment, and ii) said indication.
11. A system for summarizing content around a sentiment based on weighted Formal Concept Analysis (wFCA) using a content summarization engine, said system comprising:
(a) a memory unit that stores (i) a set of modules, and (ii) a database;
(b) a display unit;
(c) a processor that executes said set of modules, wherein said set of modules comprise:
(i) a sentence identifying module executed by said processor that identifies (a) a first set of sentences, and (b) a second set of sentences in said content based on parts of speech, wherein each sentence of said first set of sentences comprises at least one keyword that indicates at least one sentiment, wherein said second set of sentences are not associated with sentiments;
(ii) a sentiment identifying module executed by said processor that identifies said at least one sentiment associated with said each sentence of said first set of sentences based on said parts of speech;
(iii) a keyword identifying module executed by said processor that identifies a plurality of keywords from said first set of sentences, and said second set of sentences;
(iv) a disambiguating module executed by said processor that disambiguates at least one ambiguous keyword from said plurality of keywords using said wFCA, wherein said wFCA comprises generation of a lattice with said plurality of keywords as objects, and categories associated with said plurality of keywords as attributes, wherein said categories are obtained from a knowledge base;
(v) a weight computing module executed by said processor that computes a weight for each sentence of (a) said first set of sentences, and (b) said second set of sentences based on a number of keywords associated with said each sentence;
(vi) a sentiment indicating module executed by said processor that processes an input comprising an indication of said sentiment in said content; and
(vii) an intent building module executed by said processor that generates a summary of said content based on (a) said weight, and (b) at least one of i) said at least one sentiment, and ii) said indication.
12. The system of claim 11, wherein said summary comprises at least one sentence that are only obtained from said first set of sentences.
13. The system of claim 11, wherein said summary comprises at least one sentence from said second set of sentences.
14. The system of claim 11, wherein said weight is computed based on a number of associations of each keyword within (a) said first set of sentences, and (b) said second set of sentences.
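The lattice recited in claim 11(iv), with keywords as objects and knowledge-base categories as attributes, can be illustrated with a minimal Python sketch of plain (unweighted) Formal Concept Analysis. The weighting scheme of wFCA is not described in this section and is therefore not modeled. The function name, the sample keywords, and the category labels are hypothetical, and the brute-force enumeration is only suitable for very small contexts.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate the formal concepts (extent, intent) of a small context by
    closing every subset of objects. Exponential in the number of objects."""
    objects = sorted(context)
    all_attrs = set().union(*context.values())

    def common_attrs(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        attrs = set(all_attrs)
        for obj in objs:
            attrs &= context[obj]
        return attrs

    def objects_having(attrs):
        # Objects that carry every attribute in attrs.
        return {obj for obj in objects if attrs <= context[obj]}

    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attrs(subset)
            extent = objects_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical context: keywords as objects, knowledge-base categories as attributes.
context = {
    "Kingfisher Airlines": {"airline", "company"},
    "Kingfisher": {"airline", "company", "bird"},
    "pilot": {"occupation"},
}

concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

In this toy context the ambiguous keyword "Kingfisher" shares the concept with intent {airline, company} with "Kingfisher Airlines", while "bird" appears only in its own concept. A disambiguation step could use that kind of grouping to prefer the airline sense; how the weights enter that decision is left open here.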
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN263/CHE/2012 | 2012-01-23 | ||
IN263CH2012 | 2012-01-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130191735A1 (en) | 2013-07-25 |
Family
ID=48798096
Family Applications (2)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
US13/746,324 | Abandoned | US20130191735A1 (en) | 2012-01-23 | 2013-01-22 | Advanced summarization on a plurality of sentiments based on intents |
US13/746,316 | Active 2033-02-15 | US9037590B2 (en) | 2012-01-23 | 2013-01-22 | Advanced summarization based on intents |
Country Status (1)
Country | Link |
---|---|
US (2) | US20130191735A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150058248A1 (en) * | 2013-08-23 | 2015-02-26 | Wal-Mart Stores, Inc. | Systematic discovery of business ontology |
US20150082161A1 (en) * | 2013-09-17 | 2015-03-19 | International Business Machines Corporation | Active Knowledge Guidance Based on Deep Document Analysis |
US20190066676A1 (en) * | 2016-05-16 | 2019-02-28 | Sony Corporation | Information processing apparatus |
US10579630B2 (en) * | 2015-01-14 | 2020-03-03 | Microsoft Technology Licensing, Llc | Content creation from extracted content |
WO2021072321A1 (en) * | 2019-10-11 | 2021-04-15 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for generating knowledge graphs and text summaries from document databases |
US20210279407A1 (en) * | 2017-08-01 | 2021-09-09 | Samsung Electronics Co., Ltd. | Apparatus and method for providing summarized information using an artificial intelligence model |
US11468243B2 (en) | 2012-09-24 | 2022-10-11 | Amazon Technologies, Inc. | Identity-based display of text |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9460200B2 (en) | 2012-07-02 | 2016-10-04 | International Business Machines Corporation | Activity recommendation based on a context-based electronic files search |
US9262499B2 (en) | 2012-08-08 | 2016-02-16 | International Business Machines Corporation | Context-based graphical database |
US9251237B2 (en) | 2012-09-11 | 2016-02-02 | International Business Machines Corporation | User-specific synthetic context object matching |
US8620958B1 (en) | 2012-09-11 | 2013-12-31 | International Business Machines Corporation | Dimensionally constrained synthetic context objects database |
US9619580B2 (en) | 2012-09-11 | 2017-04-11 | International Business Machines Corporation | Generation of synthetic context objects |
US9223846B2 (en) | 2012-09-18 | 2015-12-29 | International Business Machines Corporation | Context-based navigation through a database |
US9741138B2 (en) | 2012-10-10 | 2017-08-22 | International Business Machines Corporation | Node cluster relationships in a graph database |
US8931109B2 (en) | 2012-11-19 | 2015-01-06 | International Business Machines Corporation | Context-based security screening for accessing data |
US8983981B2 (en) | 2013-01-02 | 2015-03-17 | International Business Machines Corporation | Conformed dimensional and context-based data gravity wells |
US9229932B2 (en) | 2013-01-02 | 2016-01-05 | International Business Machines Corporation | Conformed dimensional data gravity wells |
US9069752B2 (en) * | 2013-01-31 | 2015-06-30 | International Business Machines Corporation | Measuring and displaying facets in context-based conformed dimensional data gravity wells |
US9053102B2 (en) * | 2013-01-31 | 2015-06-09 | International Business Machines Corporation | Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects |
US9292506B2 (en) | 2013-02-28 | 2016-03-22 | International Business Machines Corporation | Dynamic generation of demonstrative aids for a meeting |
US10152526B2 (en) | 2013-04-11 | 2018-12-11 | International Business Machines Corporation | Generation of synthetic context objects using bounded context objects |
US9348794B2 (en) | 2013-05-17 | 2016-05-24 | International Business Machines Corporation | Population of context-based data gravity wells |
US9195608B2 (en) | 2013-05-17 | 2015-11-24 | International Business Machines Corporation | Stored data analysis |
US9189514B1 (en) | 2014-09-04 | 2015-11-17 | Lucas J. Myslinski | Optimized fact checking method and system |
US20160179924A1 (en) | 2014-12-23 | 2016-06-23 | International Business Machines Corporation | Persona based content modification |
CN106293785A (en) * | 2015-05-21 | 2017-01-04 | 富士通株式会社 | The method and apparatus that the rule set of Cascading Style Sheet is optimized |
US10354227B2 (en) * | 2016-01-19 | 2019-07-16 | Adobe Inc. | Generating document review workflows |
US9886501B2 (en) | 2016-06-20 | 2018-02-06 | International Business Machines Corporation | Contextual content graph for automatic, unsupervised summarization of content |
US9881082B2 (en) | 2016-06-20 | 2018-01-30 | International Business Machines Corporation | System and method for automatic, unsupervised contextualized content summarization of single and multiple documents |
US10127323B1 (en) | 2017-07-26 | 2018-11-13 | International Business Machines Corporation | Extractive query-focused multi-document summarization |
US10685050B2 (en) * | 2018-04-23 | 2020-06-16 | Adobe Inc. | Generating a topic-based summary of textual content |
US11934781B2 (en) | 2020-08-28 | 2024-03-19 | Salesforce, Inc. | Systems and methods for controllable text summarization |
US20230177250A1 (en) * | 2021-12-06 | 2023-06-08 | Salesforce.Com, Inc. | Visual text summary generation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5978820A (en) * | 1995-03-31 | 1999-11-02 | Hitachi, Ltd. | Text summarizing method and system |
US20090112892A1 (en) * | 2007-10-29 | 2009-04-30 | Claire Cardie | System and method for automatically summarizing fine-grained opinions in digital text |
US20120272160A1 (en) * | 2011-02-23 | 2012-10-25 | Nova Spivack | System and method for analyzing messages in a network or across networks |
US20130151538A1 (en) * | 2011-12-12 | 2013-06-13 | Microsoft Corporation | Entity summarization and comparison |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7113910B1 (en) * | 2000-02-18 | 2006-09-26 | At&T Corp. | Document expansion in speech retrieval |
US6823331B1 (en) * | 2000-08-28 | 2004-11-23 | Entrust Limited | Concept identification system and method for use in reducing and/or representing text content of an electronic document |
US8135711B2 (en) * | 2002-02-04 | 2012-03-13 | Cataphora, Inc. | Method and apparatus for sociological data analysis |
US20060074980A1 (en) * | 2004-09-29 | 2006-04-06 | Sarkar Pte. Ltd. | System for semantically disambiguating text information |
US10002325B2 (en) * | 2005-03-30 | 2018-06-19 | Primal Fusion Inc. | Knowledge representation systems and methods incorporating inference rules |
US7849090B2 (en) * | 2005-03-30 | 2010-12-07 | Primal Fusion Inc. | System, method and computer program for faceted classification synthesis |
US8849860B2 (en) * | 2005-03-30 | 2014-09-30 | Primal Fusion Inc. | Systems and methods for applying statistical inference techniques to knowledge representations |
US7606781B2 (en) * | 2005-03-30 | 2009-10-20 | Primal Fusion Inc. | System, method and computer program for facet analysis |
US7580918B2 (en) * | 2006-03-03 | 2009-08-25 | Adobe Systems Incorporated | System and method of efficiently representing and searching directed acyclic graph structures in databases |
US8209665B2 (en) * | 2008-04-08 | 2012-06-26 | Infosys Limited | Identification of topics in source code |
US8676732B2 (en) * | 2008-05-01 | 2014-03-18 | Primal Fusion Inc. | Methods and apparatus for providing information of interest to one or more users |
CA2734756C (en) * | 2008-08-29 | 2018-08-21 | Primal Fusion Inc. | Systems and methods for semantic concept definition and semantic concept relationship synthesis utilizing existing domain definitions |
US8751218B2 (en) * | 2010-02-09 | 2014-06-10 | Siemens Aktiengesellschaft | Indexing content at semantic level |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11468243B2 (en) | 2012-09-24 | 2022-10-11 | Amazon Technologies, Inc. | Identity-based display of text |
US9519872B2 (en) * | 2013-08-23 | 2016-12-13 | Wal-Mart Stores, Inc. | Systematic discovery of business ontology |
US20150058248A1 (en) * | 2013-08-23 | 2015-02-26 | Wal-Mart Stores, Inc. | Systematic discovery of business ontology |
US20150082161A1 (en) * | 2013-09-17 | 2015-03-19 | International Business Machines Corporation | Active Knowledge Guidance Based on Deep Document Analysis |
US20150081714A1 (en) * | 2013-09-17 | 2015-03-19 | International Business Machines Corporation | Active Knowledge Guidance Based on Deep Document Analysis |
US9817823B2 (en) * | 2013-09-17 | 2017-11-14 | International Business Machines Corporation | Active knowledge guidance based on deep document analysis |
US9824088B2 (en) * | 2013-09-17 | 2017-11-21 | International Business Machines Corporation | Active knowledge guidance based on deep document analysis |
US10698956B2 (en) | 2013-09-17 | 2020-06-30 | International Business Machines Corporation | Active knowledge guidance based on deep document analysis |
US10579630B2 (en) * | 2015-01-14 | 2020-03-03 | Microsoft Technology Licensing, Llc | Content creation from extracted content |
US20190066676A1 (en) * | 2016-05-16 | 2019-02-28 | Sony Corporation | Information processing apparatus |
US20210279407A1 (en) * | 2017-08-01 | 2021-09-09 | Samsung Electronics Co., Ltd. | Apparatus and method for providing summarized information using an artificial intelligence model |
US11574116B2 (en) * | 2017-08-01 | 2023-02-07 | Samsung Electronics Co., Ltd. | Apparatus and method for providing summarized information using an artificial intelligence model |
WO2021072321A1 (en) * | 2019-10-11 | 2021-04-15 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for generating knowledge graphs and text summaries from document databases |
Also Published As
Publication number | Publication date |
---|---|
US9037590B2 (en) | 2015-05-19 |
US20130191392A1 (en) | 2013-07-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9037590B2 (en) | Advanced summarization based on intents | |
US9053418B2 (en) | System and method for identifying one or more resumes based on a search query using weighted formal concept analysis | |
US10019515B2 (en) | Attribute-based contexts for sentiment-topic pairs | |
US9411790B2 (en) | Systems, methods, and media for generating structured documents | |
US8533208B2 (en) | System and method for topic extraction and opinion mining | |
US10572473B2 (en) | Optimized data visualization according to natural language query | |
US20140115439A1 (en) | Methods and systems for annotating web pages and managing annotations and annotated web pages | |
US20110184960A1 (en) | Methods and systems for content recommendation based on electronic document annotation | |
US20100198802A1 (en) | System and method for optimizing search objects submitted to a data resource | |
US20130024440A1 (en) | Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation | |
KR20190062391A (en) | System and method for context retry of electronic records | |
US20130254126A1 (en) | Method of annotating portions of a transactional legal document related to a merger or acquisition of a business entity with graphical display data related to current metrics in merger or acquisition transactions | |
US11748577B1 (en) | Computer-generated content based on text classification, semantic relevance, and activation of deep learning large language models | |
Qian et al. | A formative study on designing accurate and natural figure captioning systems | |
US20120179709A1 (en) | Apparatus, method and program product for searching document | |
US9613012B2 (en) | System and method for automatically generating keywords | |
Quan et al. | Feature-level sentiment analysis by using comparative domain corpora | |
Ispirova et al. | Mapping Food Composition Data from Various Data Sources to a Domain-Specific Ontology. | |
Khalid et al. | Real-time feedback query expansion technique for supporting scholarly search using citation network analysis | |
Duque et al. | Can multilinguality improve biomedical word sense disambiguation? | |
Qumsiyeh et al. | Searching web documents using a summarization approach | |
Im et al. | Confirmatory aspect-level opinion mining processes for tourism and hospitality research: a proposal of DiSSBUS | |
Houssein et al. | Semantic protocol and resource description framework query language: a comprehensive review | |
JP2022187507A (en) | Technical research support device, technical research support method and technical research support program | |
Borin et al. | Swe-Clarin: Language resources and technology for digital humanities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: FORMCEPT TECHNOLOGIES AND SOLUTIONS PVT LTD, INDIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: KUMAR, ANUJ; SRINIVASAN, SURESH; REEL/FRAME: 030119/0753; Effective date: 20130311 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |