EP3970031A1 - Systems and methods for event summarization from data - Google Patents

Systems and methods for event summarization from data

Info

Publication number
EP3970031A1
Authority
EP
European Patent Office
Prior art keywords
sentence
extracted
sentences
keyword
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20809703.0A
Other languages
English (en)
French (fr)
Other versions
EP3970031A4 (de
Inventor
Eleanor HAGERMAN
Blake HOWALD
Berk EKMEKCI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Reuters Enterprise Centre GmbH
Original Assignee
Thomson Reuters Enterprise Centre GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/700,746 (US11461555B2)
Priority claimed from US16/848,739 (US11182539B2)
Application filed by Thomson Reuters Enterprise Centre GmbH
Publication of EP3970031A1
Publication of EP3970031A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G06F40/56 Natural language generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Definitions

  • The present subject matter is directed generally to event summarization, and more particularly but without limitation, to generating a summary of a risk event from textual data.
  • Identifying or predicting risk events in textual data associated with individuals, companies, and other entities is a common natural language processing (NLP) task known as risk mining.
  • Monitoring systems rely on risk mining to describe risk events that are generally passed on to an expert for analysis.
  • These mining tasks are generally computationally intensive because they require processing the large amounts of data available to search. Processing even a portion of such data would require significant processing resources and energy consumption, which may not be supported by many types of electronic devices.
  • Risk mining technologies are designed to determine relevant textual extractions that capture entity-risk relationships. These risk mining technologies may be applied to large, high volume data sets. When such data sets are processed, a multitude of relevant extractions can be returned. Such voluminous extractions can take substantial time for an analyst to review. Additionally, the extractions may include only a phrase or a single sentence. A phrase or a single sentence may not provide enough information to an analyst to properly determine the relevance of a particular document. To improve the information provided by the text extractions, the text extractions may be used to generate summaries. Two categories of automatic text summarization include abstractive summarization and extractive summarization. Abstractive summarization techniques identify relevant phrases or sentences, then rewrite the identified phrases or sentences to form a summary.
  • Abstractive summarization may be performed based on graphs or using neural networks. Extractive summarization techniques identify relevant phrases or sentences (e.g., extracts), rank the extracts to find the most informative extracts, and combine selected extracts into a summary. Abstractive summaries are typically preferred by humans for content or readability, but abstractive summarization techniques are typically more computationally expensive than extractive summarization techniques, and thus require more hardware resources to implement than extractive summarization techniques.
  • The present disclosure provides systems, methods, and computer-readable media for data summarization, such as summarization of textual data.
  • The data summarization may include or correspond to event summarization, where an "event," as used herein, corresponds to a combination of a keyword and entity in text of a document.
  • the systems and methods described herein extract sentences from data corresponding to one or more textual documents, order the extracted sentences, identify types of extracted sentences that include matched pairs including keywords from two different keyword sets, and generate a summary that includes sentences having the two types of sentences intermixed based on a predetermined order rule set.
  • data including text from a data source may be received and natural language processing (NLP) performed on the data.
  • a first set of keywords are compared to the data to detect keywords included in the data, and for each keyword, a corresponding entity is identified that is positioned closest to the corresponding keyword to determine a matched pair for the keyword.
  • a second set of keywords may be generated based on the first set of keywords and the data, or may be supplied so that, for each keyword, a matched pair for the keyword may be determined.
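The matched-pair step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes whitespace tokenization and single-token keywords and entities, and the function name `find_matched_pairs` and the sample tokens are hypothetical.

```python
def find_matched_pairs(tokens, keywords, entities):
    """Pair each detected keyword with the entity closest to it in the token list.

    Assumption: keywords and entities are single lowercase tokens.
    Ties are resolved in favor of the entity that appears first.
    """
    entity_positions = [i for i, tok in enumerate(tokens) if tok in entities]
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in keywords and entity_positions:
            nearest = min(entity_positions, key=lambda j: abs(j - i))
            # Store (keyword, entity, token distance) for later ranking.
            pairs.append((tok, tokens[nearest], abs(nearest - i)))
    return pairs


tokens = "acme faces a lawsuit after the recall hit globex".split()
pairs = find_matched_pairs(tokens, {"lawsuit", "recall"}, {"acme", "globex"})
print(pairs)
```

Keeping the token distance in each pair is a design convenience here, since the same distance can later serve as a relevance signal when the extracted sentences are ordered.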
  • The systems and methods may extract sentences, such as a single sentence or multiple sentences, from documents that include the matched pairs (e.g., the entities and the keywords). After extracting the sentences, the systems and methods may order the extracted sentences based on a predicted relevance of the extracted sentences. For example, the predicted relevance may be based on a distance (e.g., a token distance) between a keyword and a corresponding entity in each extracted sentence. In some implementations, the predicted relevance may be further based on frequency of the keywords included in each of the extracted sentences. The systems and methods may also identify a first type of extracted sentences and a second type of extracted sentences from the ordered extracted sentences.
  • Extracted sentences having the first type include one or more keywords that are included in the first keyword set, and extracted sentences having the second type include one or more keywords that are included in the second keyword set.
  • the systems and methods may generate an extracted summary that includes at least one sentence having the first type and at least one sentence having the second type.
  • the at least one sentence having the first type may be intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • the extracted summary may include sentences having the first type followed by sentences having the second type, in an alternating order indicated by the predetermined order rule set.
  • the predetermined order rule set may indicate other orderings of sentence types.
  • the systems and methods may also output the extracted summary.
  • the extracted summary may be stored or provided to an electronic device for review and/or analysis.
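The intermixing of sentence types under a predetermined order rule set might look like the sketch below. It assumes the rule set can be expressed as a repeating pattern of type labels, which is only one possible encoding; `build_summary`, the labels "first" and "second", and the `max_sentences` cap are hypothetical.

```python
def build_summary(ranked, pattern=("first", "second"), max_sentences=4):
    """Pick sentences in the order given by a repeating pattern of type labels.

    ranked: list of (sentence, type_label) tuples, already ordered by relevance.
    pattern: assumed encoding of the order rule set, e.g. alternate the two types.
    """
    queues = {}
    for sentence, stype in ranked:
        queues.setdefault(stype, []).append(sentence)

    summary = []
    slot = 0
    # Stop when the summary is full or no sentence of any pattern type remains.
    while len(summary) < max_sentences and any(queues.get(t) for t in pattern):
        stype = pattern[slot % len(pattern)]
        if queues.get(stype):
            summary.append(queues[stype].pop(0))
        slot += 1
    return summary


ranked = [("A", "first"), ("B", "second"), ("C", "first"),
          ("D", "first"), ("E", "second")]
print(build_summary(ranked))
```

A different ordering, such as two first-type sentences followed by one second-type sentence, would be expressed by passing `pattern=("first", "first", "second")`.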
  • the systems and methods may also expand an initial seed taxonomy, such as the first keyword set, using word vector encodings.
  • A corresponding semantic vector may be generated, e.g., based on a skip-gram model that utilizes words and subwords from the document.
  • the at least one keyword is compared to each of one or more semantic vectors to determine a corresponding similarity score.
  • a semantic vector having a highest similarity score to the keyword is identified to determine a term of the identified semantic vector as a candidate term.
  • The similarity score of the semantic vector having the highest similarity score is compared to a threshold to determine whether or not to discard the candidate term. For example, the term is discarded if the score is less than or equal to the threshold.
  • the candidate term may be added to the second keyword set to generate the second keyword set (e.g., an expanded keyword set).
  • the initial keyword set and the expanded keyword set may be applied to the extracted sentences to identify sets of extracted sentences as described above.
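A single pass of the keyword expansion described above can be sketched with toy vectors. The disclosure does not name a similarity measure, so cosine similarity is an assumption here, as are the two-dimensional vectors, the threshold value of 0.8, and the name `expand_keywords`. In practice the vectors would come from a trained embedding model rather than a hand-written dictionary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seed_keywords, vectors, threshold=0.8):
    """Return the new candidate terms found for the seed keywords.

    For each seed keyword, the most similar non-seed term is the candidate.
    The candidate is discarded if its score is less than or equal to the threshold.
    """
    new_terms = set()
    for keyword in seed_keywords:
        if keyword not in vectors:
            continue
        scored = [
            (cosine(vectors[keyword], vec), term)
            for term, vec in vectors.items()
            if term not in seed_keywords
        ]
        if not scored:
            continue
        best_score, candidate = max(scored)
        if best_score > threshold:
            new_terms.add(candidate)
    return new_terms


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "lawsuit": [1.0, 0.0],
    "litigation": [0.9, 0.1],
    "picnic": [0.0, 1.0],
}
expanded = expand_keywords({"lawsuit"}, toy_vectors)
print(expanded)
```

The sketch returns only the newly found terms, so the caller can decide whether the expanded keyword set should also contain the seed keywords.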
  • The process of automatically expanding a keyword set (e.g., the second keyword set) from an initial keyword set may broaden or generalize the keywords included in the expanded keyword set. The first keyword set may include more specific keywords, and the second keyword set may include more general keywords.
  • the systems and methods may generate summaries that are preferable to a human analyst (e.g., based on subject matter, grammatical naturalness, and/or readability) as compared to summaries generated by other systems, without requiring more resource-intensive natural language processing (NLP) used in abstractive summarization systems.
  • the system could be used for any data, entities and taxonomies to support generalized event summarization.
  • the systems and methods may be equally applicable to other areas of summarization, such as document review, auditing, and the like, as illustrative, non-limiting examples.
  • a method for summarizing data includes extracting a plurality of sentences from data corresponding to one or more documents, each including text.
  • Each extracted sentence includes at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set.
  • Each extracted sentence includes a single sentence or multiple sentences.
  • the method includes ordering the plurality of extracted sentences based on a distance between a respective keyword and a respective entity in each extracted sentence of the plurality of extracted sentences.
  • the method also includes identifying a first type of extracted sentences from the ordered plurality of extracted sentences. Extracted sentences having the first type include one or more keywords included in the first keyword set.
  • the method includes identifying a second type of extracted sentences from the ordered plurality of extracted sentences.
  • Extracted sentences having the second type include one or more keywords included in the second keyword set.
  • the method also includes generating an extracted summary that includes at least one sentence having the first type and at least one sentence having the second type.
  • the at least one sentence having the first type is intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • the method further includes outputting the extracted summary.
  • a system may be provided.
  • the system includes a sentence extractor configured to extract a plurality of sentences from data corresponding to one or more documents each comprising text.
  • Each extracted sentence includes at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set.
  • Each extracted sentence includes a single sentence or multiple sentences.
  • the system includes a sentence organizer configured to order the plurality of extracted sentences based on a distance between a respective keyword and a respective entity in each extracted sentence of the plurality of extracted sentences.
  • the system also includes a sentence identifier configured to identify a first type of extracted sentences from the ordered plurality of extracted sentences and to identify a second type of extracted sentences from the ordered plurality of extracted sentences.
  • Extracted sentences having the first type include one or more keywords included in the first keyword set.
  • Extracted sentences having the second type include one or more keywords included in the second keyword set.
  • the system includes a summary extractor configured to extract a summary that includes at least one sentence having the first type and at least one sentence having the second type. The at least one sentence having the first type is intermixed with the at least one sentence having the second type.
  • the system further includes an output generator configured to output the extracted summary.
  • a computer-based tool may include non-transitory computer readable media having stored thereon computer code which, when executed by a processor, causes a computing device to perform operations that include extracting a plurality of sentences from data corresponding to one or more documents each comprising text.
  • Each extracted sentence includes at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set.
  • Each extracted sentence includes a single sentence or multiple sentences.
  • the operations include ordering the plurality of extracted sentences based on a distance between a respective keyword and a respective entity in each extracted sentence of the plurality of extracted sentences.
  • the operations also include identifying a first type of extracted sentences from the ordered plurality of extracted sentences.
  • Extracted sentences having the first type include one or more keywords included in the first keyword set.
  • the operations include identifying a second type of extracted sentences from the ordered plurality of extracted sentences. Extracted sentences having the second type include one or more keywords included in the second keyword set.
  • the operations also include generating an extracted summary that includes at least one sentence having the first type and at least one sentence having the second type. The at least one sentence having the first type is intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • the operations further include outputting the extracted summary.
  • FIG. 1 shows a system configured to perform operations in accordance with aspects of the present disclosure.
  • FIG. 2 shows a flow diagram illustrating functionality of the system of FIG. 1 implemented in accordance with aspects of the present disclosure.
  • FIG. 3 is a block diagram of a system for summarizing data and testing the summary in accordance with the present disclosure.
  • FIG. 4 illustrates a graph of expert preference ratings.
  • FIG. 5 is a flow chart illustrating an example of a method of summarizing data.
  • FIG. 1 is a block diagram of an exemplary system 100 configured with capabilities and functionality for event summarization. As shown in FIG. 1, system 100 includes server 110, at least one user terminal 160, at least one data source 170, and network 180. These components, and their individual components, may cooperatively operate to provide functionality in accordance with the discussion herein.
  • The various components of server 110 may cooperatively operate to perform text summarization from data (e.g., textual data or documents).
  • The various components of server 110 may cooperatively operate to identify matched pairs (e.g., a keyword from a keyword set and an entity from an entity set) in the data and to extract one or more sentences that include the matched pairs.
  • the various components of server 110 may order the extracted sentences based on distances (e.g., token distances) between the keywords and the entities in the extracted sentences, based on frequency of the keywords in the extracted sentences, or both.
  • a first type of extracted sentences is identified. Extracted sentences having the first type include keywords that are included in a first keyword set, which may have a greater specificity than some other keywords.
  • the first keyword set may be human-generated and may include keywords having a high degree of specificity.
  • a second type of extracted sentences is also identified. Extracted sentences having the second type include keywords that are included in a second keyword set, which may have a greater generality than some other keywords.
  • the second keyword set may be an automatically expanded keyword set that is generated by the system based on the first keyword set and the data, such as by using one or more machine learning techniques.
  • the various components of server 110 may generate a summary, such as multiple extracted sentences, using at least one sentence having the first type and at least one sentence having the second type.
  • the at least one sentence having the first type may be intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • Such intermixing may be implemented in accordance with rules, such as the predetermined order rule set, configured to provide a more grammatically natural/readable summary.
  • the summary may be stored or provided to an electronic device for review and/or analysis.
  • various aspects of the present disclosure allow text summarization using extracted sentences that include keywords from different keyword sets (e.g., having different types), which may correspond to different levels of specificity or generality in the keywords, as further described herein.
  • The functional blocks, and components thereof, of system 100 of implementations of the present invention may be implemented using processors, electronic devices, hardware devices, electronic components, logical circuits, memories, software codes, firmware codes, etc., or any combination thereof.
  • One or more functional blocks, or some portion thereof, may be implemented as discrete gate or transistor logic, discrete hardware components, or combinations thereof configured to provide logic for performing the functions described herein.
  • One or more of the functional blocks, or some portion thereof, may comprise code segments operable upon a processor to provide logic for performing the functions described herein.
  • each of the various illustrated components may be implemented as a single component (e.g., a single application, server module, etc.), may be functional components of a single component, or the functionality of these various components may be distributed over multiple devices/components. In such aspects, the functionality of each respective component may be aggregated from the functionality of multiple modules residing in a single, or in multiple devices.
  • server 110, user terminal 160, and data sources 170 may be communicatively coupled via network 180.
  • Network 180 may include a wired network, a wireless communication network, a cellular network, a cable transmission system, a Local Area Network (LAN), a Wireless LAN (WLAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), the Internet, the Public Switched Telephone Network (PSTN), etc., that may be configured to facilitate communications between user terminal 160 and server 110.
  • User terminal 160 may be implemented as a mobile device, a smartphone, a tablet computing device, a personal computing device, a laptop computing device, a desktop computing device, a computer system of a vehicle, a personal digital assistant (PDA), a smart watch, another type of wired and/or wireless computing device, or any part thereof.
  • User terminal 160 may be configured to provide a graphical user interface (GUI) via which a user may be provided with information related to data and information received from server 110.
  • user terminal 160 may receive results of event summarization from server 110.
  • the results may include one or more summaries, one or more extracted sentences, a document identifier, or a combination thereof, as illustrative, non-limiting examples.
  • a user may review the results and provide an analysis or feedback regarding the results.
  • the analysis or feedback may be provided to server 110 from user terminal 160 as an input.
  • Data sources 170 may comprise at least one source of textual data.
  • the data source(s) may include a streaming data source, news data, a database, a social media feed, a data room, another data source, the like, or a combination thereof.
  • the data from data source 170 may include or correspond to one or more entities.
  • the one or more entities may include an individual, a company, a government, an agency, an organization, the like, or a combination thereof, as illustrative, non-limiting examples.
  • Server 110 may be configured to receive data from data sources 170 and to apply customized natural language processing algorithms and/or other processing to generate one or more summaries based on the received data.
  • the summaries may be event summaries that summarize an event described in the received data and indicated by detection of a keyword and an entity, as further described herein.
  • This functionality of server 110 may be provided by the cooperative operation of various components of server 110, as will be described in more detail below.
  • Although FIG. 1 shows a single server 110, it will be appreciated that server 110 and its individual functional blocks may be implemented as a single device or may be distributed over multiple devices having their own processing resources, whose aggregate functionality may be configured to perform operations in accordance with the present disclosure.
  • server 110 may be implemented, wholly or in part, on an on-site system, or on a cloud-based system.
  • server 110 includes processor 111, memory 112, database 113, sentence extractor 120, sentence organizer 121, sentence identifier 122, summary extractor 123, output generator 124, and, optionally, taxonomy expander 125.
  • Although the various components of server 110 are illustrated as single and separate components in FIG. 1, each of the various components of server 110 may be a single component (e.g., a single application, server module, etc.), may be functional components of a same component, or the functionality may be distributed over multiple devices/components. In such aspects, the functionality of each respective component may be aggregated from the functionality of multiple modules residing in a single, or in multiple devices.
  • Processor 111 may comprise a processor, a microprocessor, a controller, a microcontroller, a plurality of microprocessors, an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), or any combination thereof, and may be configured to execute instructions to perform operations in accordance with the disclosure herein.
  • implementations of processor 111 may comprise code segments (e.g., software, firmware, and/or hardware logic) executable in hardware, such as a processor, to perform the tasks and functions described herein.
  • processor 111 may be implemented as a combination of hardware and software.
  • Processor 111 may be communicatively coupled to memory 112.
  • Memory 112 may comprise read only memory (ROM) devices, random access memory (RAM) devices, one or more hard disk drives (HDDs), flash memory devices, solid state drives (SSDs), other devices configured to store data in a persistent or non-persistent state, network memory, cloud memory, local memory, or a combination of different memory devices.
  • Memory 112 may store instructions that, when executed by processor 111, cause processor 111 to perform operations in accordance with the present disclosure.
  • memory 112 may also be configured to facilitate storage operations.
  • memory 112 may comprise database 113 for storing one or more keywords (e.g., one or more keyword sets), one or more entities (e.g., an entity set), one or more thresholds, one or more matched pairs, one or more semantic vectors, one or more candidate terms, one or more similarity scores, one or more extracted sentences, one or more summaries, one or more predetermined order rule sets, input (e.g., from user terminal 160), other information, etc., which system 100 may use to provide the features discussed herein.
  • Database 113 may be integrated into memory 112, or may be provided as a separate module.
  • database 113 may be a single database, or may be a distributed database implemented over a plurality of database modules.
  • database 113 may be provided as a module external to server 110.
  • server 110 may include an interface configured to enable communication with data source 170, user terminal 160 (e.g., an electronic device), or a combination thereof.
  • Sentence extractor 120 may be configured to extract a plurality of sentences from data corresponding to one or more documents each comprising text. Each extracted sentence may include at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set. Each extracted sentence may include a single sentence or multiple sentences.
  • In some implementations, the keywords of the keyword sets and the entities of the entity set are distinct. In some other implementations, there is at least some overlap between the keywords and the entities. One or more of the keyword sets may include one or more of the entities, the entity set may include one or more of the keywords, or the entity set may be a subset of one of the keyword sets, as non-limiting examples.
  • sentence extractor 120 may be further configured to receive data at a receiver from data sources 170, detect one or more keywords for each keyword of the first keyword set and the second keyword set in the data, determine one or more matched pairs corresponding to the detected keywords, and extract the plurality of sentences that include the one or more matched pairs.
  • Sentence organizer 121 may be configured to order the plurality of extracted sentences based on a distance between a respective keyword and a respective entity in each extracted sentence of the plurality of extracted sentences. Ordering the plurality of extracted sentences based on distance may correspond to ordering the plurality of extracted sentences based on predicted relevance. For example, a short distance between a respective keyword and a respective entity may indicate a sentence having a relatively high predicted relevance. In some implementations, the distance includes or corresponds to a token distance (e.g., a number of words) between the keyword and the entity. In some implementations, the sentence organizer 121 is configured to order the plurality of extracted sentences based further on frequencies of respective one or more keywords included in each extracted sentence. For example, the frequencies of respective keywords may also be indicative of the predicted relevance of the corresponding sentences (e.g., identification of a keyword with a high frequency may indicate a sentence having a relatively high predicted relevance).
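The ordering performed by sentence organizer 121 can be sketched as a two-part sort key. This is an assumption-laden illustration: it treats keyword frequency as the number of extractions that share a keyword, which is only one reading of the text, and the dictionary fields and the name `order_extractions` are hypothetical.

```python
from collections import Counter


def order_extractions(extractions):
    """Sort extractions by ascending token distance, then by descending keyword frequency.

    extractions: list of dicts with "text", "keyword", and "distance" keys.
    Frequency is counted over the extractions themselves (an assumption).
    """
    freq = Counter(e["keyword"] for e in extractions)
    return sorted(extractions, key=lambda e: (e["distance"], -freq[e["keyword"]]))


extractions = [
    {"text": "a", "keyword": "recall", "distance": 5},
    {"text": "b", "keyword": "lawsuit", "distance": 2},
    {"text": "c", "keyword": "recall", "distance": 2},
]
ordered = order_extractions(extractions)
print([e["text"] for e in ordered])
```

Using distance as the primary key and frequency as the tie-breaker is a design choice for this sketch; a weighted combination of the two signals would be an equally plausible reading.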
  • Sentence identifier 122 may be configured to identify a first type of extracted sentences from the ordered plurality of extracted sentences. Extracted sentences having the first type include one or more keywords included in the first keyword set. Sentence identifier 122 may be further configured to identify a second type of extracted sentences from the ordered plurality of extracted sentences. Extracted sentences having the second type include one or more keywords included in the second keyword set.
  • In some implementations, the first keyword set includes a user-generated keyword set, and the second keyword set includes an expanded keyword set based on the first keyword set, as further described herein.
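The type identification performed by sentence identifier 122 can be sketched as a set-membership check. The sketch assumes single-token keywords and gives the first keyword set precedence when a sentence matches both sets; that precedence, the labels, and the name `label_sentence_type` are hypothetical and not stated in the disclosure.

```python
import re


def label_sentence_type(sentence, first_keywords, second_keywords):
    """Label a sentence "first" or "second" according to which keyword set it matches.

    Assumption: if keywords from both sets occur, the first set wins.
    Returns None when no keyword from either set is present.
    """
    tokens = set(re.findall(r"\w+", sentence.lower()))
    if tokens & first_keywords:
        return "first"
    if tokens & second_keywords:
        return "second"
    return None


first_set = {"lawsuit"}
second_set = {"dispute"}
labels = [
    label_sentence_type(s, first_set, second_set)
    for s in (
        "Acme faces a lawsuit.",
        "Globex is in a dispute with regulators.",
        "The weather was mild.",
    )
]
print(labels)
```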
  • Summary extractor 123 may be configured to extract (e.g., generate) a summary that includes at least one sentence having the first type and at least one sentence having the second type.
  • the summary may include alternating sentences having the first type and sentences having the second type.
  • the sentences are ordered based on a predetermined order rule set.
  • the predetermined order rule set includes one or more rules configured to provide a grammatically natural or readable summary.
  • the predetermined order rule set may include one or more rules that are stored at (or accessible to) server 110 and that indicate an order of sentences for inclusion in summaries based on sentence type (e.g., the first type, the second type, etc.).
  • the predetermined order rule set may indicate that sentences having the first type and sentences having the second type are to be intermixed in an alternating order for inclusion in an extracted summary.
  • the predetermined order rule set may indicate a different ordering of extracted sentences. Such ordering may be predetermined to enable generation of summaries that are more grammatically natural or readable than other computer-generated summaries.
  • Output generator 124 may be configured to output the extracted summary.
  • output generator 124 may store the extracted summary, may output the extracted summary to a display device, or may output the extracted summary to another device, such as user terminal 160, as non-limiting examples.
  • Taxonomy expander 125 may be configured to generate, based on the data and the first keyword set, the second keyword set having a greater number of keywords than the first keyword set. Additional functionality of taxonomy expander 125 is described further herein at least with reference to blocks 240-248 of FIG. 2. It is noted that the functionality of taxonomy expander 125 to expand a keyword set to generate an expanded keyword set may be used prior to, during, or after event identification or summarization.
  • the database 113 may be coupled to sentence extractor 120, sentence organizer 121, sentence identifier 122, summary extractor 123, output generator 124, taxonomy expander 125, or a combination thereof.
  • database 113 is configured to store the first keyword set, the second keyword set, the entity set, processed data, one or more thresholds, one or more extracted sentences, a plurality of matched pairs, one or more extracted summaries, the predetermined order rule set, or a combination thereof.
  • FIG. 2 shows a flow diagram illustrating functionality of system 100 for summarizing an event in data.
  • Blocks of method 200 illustrated in FIG. 2 may be performed by one or more components of system 100 of FIG. 1.
  • blocks 210 and 212 may be performed by sentence extractor 120
  • block 214 may be performed by sentence organizer 121
  • blocks 216 and 218 may be performed by sentence identifier 122
  • block 220 may be performed by summary extractor 123
  • block 222 may be performed by output generator 124
  • blocks 240-248 may be performed by taxonomy expander 125.
  • data is received (e.g., at a receiver).
  • the data may include one or more documents and may be received from data sources 170.
  • data sources 170 may include a streaming data source, news data, a database, or a combination thereof.
  • sentence extraction is performed. For example, a plurality of sentences may be extracted from the data.
  • each extracted sentence includes at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set.
  • Each extracted sentence includes a single sentence or multiple sentences.
  • the keyword and the entity may be included in a single sentence, or the keyword and the entity may be included in different sentences, such as different consecutive sentences.
  • the keywords are distinct from the entities. Alternatively, there may be overlap between the keywords and the entities.
  • each extracted sentence includes at least one keyword from the first keyword set or the second keyword set (regardless of whether an entity is included). Extracting sentences that include a keyword (without a corresponding entity) may result in a significantly larger number of extractions, which may widen the scope of the extracted sentences while increasing the processing time and use of processing resources.
  • the first keyword set includes or corresponds to a user-generated keyword set
  • the second keyword set includes or corresponds to an expanded keyword set.
  • the first keyword set may be received via input to the server 110, or may be a previously user-generated keyword set stored in the database 113.
  • the second keyword set may be an automatically expanded keyword set based on the first keyword set, such as a keyword set generated by taxonomy expander 125.
  • taxonomy expander 125 may expand the first keyword set by identifying additional keywords that are similar to the keywords included in the first keyword set using one or more machine learning processes. Because the first keyword set is user-generated, and the second keyword set is automatically expanded, the first keyword set may include keywords having greater specificity, and the second keyword set may include keywords having greater generality.
  • extracting the plurality of sentences from the data includes multiple operations.
  • extracting the plurality of sentences may include receiving the first keyword set, the second keyword set, and the entity set (e.g., from a database, such as database 113).
  • a selection of a first event category of multiple event categories may be received, and the first keyword set (and the second keyword set) may be retrieved based on the selection of the first event category.
  • the multiple event categories include cybersecurity, terrorism, legal/non-compliance, or a combination thereof.
  • Extracting the plurality of sentences may also include performing natural language processing (NLP) on the data to generate processed data, the processed data indicating one or more sentences.
  • NLP may include tokenization, lemmatization, and/or sentencization on the data.
  • the NLP is performed by a natural language processing pipeline including (in sequence) a tokenizer, a part-of-speech tagger, a dependency parser, and a named entity recognizer.
  • a dependency-based sentencizer may be used rather than a simpler stop-character-based approach because of the unpredictable formatting of certain domains of text (e.g., web-mined news and regulatory filings). Extracting the plurality of sentences also includes, after the NLP, performing keyword and entity detection.
  • keywords from the first keyword set and the second keyword set may be identified in a list of tokens.
  • the sets of keywords are compared to the processed data to detect keywords in the processed data.
  • entities from the entity set are likewise detected in the processed data.
  • keyword and entity matching may be performed. For example, for each detected keyword, a corresponding entity is identified that is positioned closest to the corresponding keyword to determine a matched pair for the keyword.
  • the closest entity may be before or after the keyword, and may be in the same sentence or a different sentence.
  • matched pair filtering is performed.
  • a distance (in tokens) between the keyword and the entity of a matched pair is determined, and if the distance is greater than or equal to a threshold, the matched pair is discarded (e.g., filtered out).
  • sentences that include the matched pairs are extracted.
  • the extracted sentences may be single sentences (if the keyword and entity are in the same sentence) or multiple sentences (if the keyword and entity are in different sentences).
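  • The matching and filtering steps above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the names `MatchedPair` and `match_pairs`, the whitespace tokenization, and the sample tokens are assumptions introduced here. Each keyword occurrence is paired with its nearest entity token, and a pair is discarded when its token distance is greater than or equal to the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """Hypothetical container for one keyword/entity pairing."""
    keyword: str
    entity: str
    keyword_index: int
    entity_index: int

    @property
    def distance(self) -> int:
        # Token distance between the keyword and the entity.
        return abs(self.keyword_index - self.entity_index)


def match_pairs(tokens, keywords, entities, max_distance):
    """Pair each keyword occurrence with its nearest entity token.

    Pairs whose distance is greater than or equal to max_distance
    are filtered out.
    """
    entity_positions = [i for i, tok in enumerate(tokens) if tok in entities]
    pairs = []
    for k_idx, token in enumerate(tokens):
        if token not in keywords or not entity_positions:
            continue
        # The nearest entity may appear before or after the keyword.
        e_idx = min(entity_positions, key=lambda i: abs(i - k_idx))
        pair = MatchedPair(token, tokens[e_idx], k_idx, e_idx)
        if pair.distance < max_distance:
            pairs.append(pair)
    return pairs


# Assumed sample data, already lowercased and tokenized.
tokens = ("costco faces a lawsuit over pricing while analysts say "
          "walmart avoided a similar fine").split()
pairs = match_pairs(tokens, {"lawsuit", "fine"}, {"costco", "walmart"},
                    max_distance=4)
```

  • In this sample, the "lawsuit"/"costco" pair (distance 3) is kept, while the "fine"/"walmart" pair (distance 4) is discarded because its distance equals the threshold.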
  • extracted sentence ordering is performed.
  • the plurality of extracted sentences may be ordered based on predicted relevance of the extracted sentences.
  • the plurality of extracted sentences are ordered based on a distance (in tokens) between the keyword and the entity in each extracted sentence of the plurality of extracted sentences. For example, matched pairs (e.g., keywords and entities) having a smaller distance between the keyword and the entity may be ordered higher (e.g., prioritized) over matched pairs having a larger distance between the keyword and the entity. The distance may indicate the predicted relevance.
  • the plurality of extracted sentences may be ordered based on frequencies of one or more keywords included in each extracted sentence. For example, matched pairs that include keywords that are identified in the data with a higher frequency may be ordered higher (e.g., prioritized) over matched pairs that include keywords that are identified in the data with a lower frequency. The frequency may indicate the predicted relevance.
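  • One way to combine the two ordering signals is a sort on a composite key, sketched below. The dictionary layout, the function name `order_extracts`, and the choice to count keyword frequency across the extracts themselves (rather than across the full data) are assumptions made for illustration.

```python
from collections import Counter


def order_extracts(extracts):
    """Order extracts by ascending token distance.

    Ties are broken by descending keyword frequency, so extracts with
    frequently detected keywords rank higher.
    """
    # Assumption: frequency is counted over the extracts passed in.
    keyword_counts = Counter(e["keyword"] for e in extracts)
    return sorted(
        extracts,
        key=lambda e: (e["distance"], -keyword_counts[e["keyword"]]),
    )


# Hypothetical extracts.
extracts = [
    {"sentence": "A", "keyword": "lawsuit", "distance": 5},
    {"sentence": "B", "keyword": "fine", "distance": 2},
    {"sentence": "C", "keyword": "lawsuit", "distance": 2},
]
ordered = order_extracts(extracts)
```

  • Here extracts B and C share the shortest distance, and C is placed first because its keyword occurs more often.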
  • identification of a first type of extracted sentences is performed. For example, a first type of extracted sentences that include keywords included in the first keyword set are identified.
  • identification of a second type of extracted sentences is performed. For example, a second type of extracted sentences that include keywords included in the second keyword set are identified. Because the first keyword set is user-generated, and the second keyword set is automatically expanded, the first type of extracted sentences may include more specific information, and the second type of extracted sentences may include more general information.
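  • The type identification can be illustrated with simple set membership. In this sketch, the function name `classify_extract` and two design choices are assumptions: a sentence containing a seed keyword is treated as the first type even if it also contains an expanded keyword, and keywords present in both sets count as seed keywords.

```python
def classify_extract(tokens, seed_keywords, expanded_keywords):
    """Return 'first', 'second', or None for a tokenized extract."""
    present = set(tokens)
    if present & seed_keywords:
        # Contains a user-generated (more specific) keyword.
        return "first"
    if present & (expanded_keywords - seed_keywords):
        # Contains only an automatically expanded (more general) keyword.
        return "second"
    return None


# Hypothetical keyword sets.
seed = {"lawsuit"}
expanded = {"lawsuit", "litigation"}
kind_a = classify_extract(["costco", "faces", "lawsuit"], seed, expanded)
kind_b = classify_extract(["costco", "litigation", "continues"], seed, expanded)
```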
  • summary generation (e.g., extraction) is performed.
  • an extracted summary may be generated that includes at least one sentence having the first type and at least one sentence having the second type.
  • the at least one sentence having the first type may be intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • the extracted summary may include multiple extracted sentences, and, in some implementations, the order of the multiple extracted sentences may alternate between sentences having the first type (or the second type), and sentences having the second type (or the first type), or according to another ordering scheme.
  • the ordering of the sentences included in the extracted summary is indicated by the predetermined rule set.
  • Such ordering may be predetermined to enable generation of summaries that are more grammatically natural or readable than other computer-generated summaries.
  • a summary that includes a “general” sentence, followed by one or two “specific” sentences, as a non-limiting example, may be more likely to be grammatically natural and more easily readable to a user, as compared to summaries generated according to a random order of sentences.
  • the extracted summary may include one or more sets of three extracted sentences (e.g., sentence triples).
  • each set of three extracted sentences may include a general sentence (e.g., having the second type), followed by a specific sentence (e.g., having the first type), followed by another specific sentence, based on the predetermined order rule set.
  • each set of three extracted sentences may include a general sentence, followed by a specific sentence, followed by another general sentence, based on the predetermined order rule set.
  • the predetermined order rule set may indicate a different ordering, such as an alternating ordering, as a non-limiting example.
  • the predetermined order rule set is configured to enable generation of summaries that are more grammatically natural and readable than other computer-generated summaries.
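  • A predetermined order rule set of this kind can be represented as a repeating pattern of sentence types. The sketch below is one possible encoding, not the disclosed one: the function name `apply_order_rules`, the pattern tuples, and the rule of stopping as soon as the required type is exhausted are all assumptions.

```python
from itertools import cycle


def apply_order_rules(first_type, second_type,
                      pattern=("second", "first", "first")):
    """Interleave two ranked sentence lists by repeating `pattern`.

    The default pattern is general -> specific -> specific. Selection
    stops when the list required by the next slot is empty.
    """
    queues = {"first": list(first_type), "second": list(second_type)}
    summary = []
    for slot in cycle(pattern):
        if not queues[slot]:
            break
        summary.append(queues[slot].pop(0))
    return summary


# Hypothetical ranked sentences: S* are specific, G* are general.
specific = ["S1", "S2", "S3"]
general = ["G1", "G2"]
mixed = apply_order_rules(specific, general)
alternate = apply_order_rules(specific, general,
                              pattern=("second", "first", "second"))
```

  • With these inputs, the default pattern yields G1, S1, S2, G2, S3, and the general->specific->general pattern yields G1, S1, G2.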
  • the extracted summary may be limited to a maximum number of characters or a maximum number of words.
  • generating the extracted summary may include determining whether to include an additional sentence from the first set of extracted sentences or the second set of extracted sentences in the extracted summary based on a determination of whether a sum of a length of the extracted summary and a length of the additional sentence is less than or equal to a threshold.
  • sentences may be included in the extracted summary until a total length of the extracted summary exceeds a threshold. At this point, the most recently added sentence is discarded to maintain the total length of the extracted summary below or equal to the threshold.
  • the threshold may be any value, based on considerations of amount of information included in the summaries, storage space used to store the summaries, processing power used to generate the summaries, etc.
  • the threshold may be 100 words.
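  • The length check can be sketched as a running word count. The function name `build_summary` and the whitespace-based word count are assumptions. Checking the length before adding a sentence gives the same result as adding it and then discarding it when the threshold is exceeded.

```python
def build_summary(ranked_sentences, max_words=100):
    """Add sentences in order while the total stays within max_words."""
    summary = []
    total = 0
    for sentence in ranked_sentences:
        length = len(sentence.split())
        if total + length > max_words:
            # Adding this sentence would exceed the threshold, so stop.
            break
        summary.append(sentence)
        total += length
    return summary


# Hypothetical sentences of 40, 50, and 30 words.
sentences = [" ".join(["word"] * n) for n in (40, 50, 30)]
summary = build_summary(sentences, max_words=100)
```

  • The first two sentences total 90 words and are kept. The third would bring the total to 120 words, so it is left out.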
  • a summary output result is generated. For example, a summary that includes at least one specific sentence (e.g., at least one sentence having the first type) and at least one general sentence (e.g., at least one sentence having the second type) may be output.
  • the extracted summary may be output to an electronic device for display to a user for review and/or analysis or the extracted summary may be stored in a memory for later processing.
  • Method 200 also enables expansion of an initial seed taxonomy.
  • semantic vectors are generated. For example, for at least one document of the received data, a corresponding semantic vector may be generated.
  • the semantic vector may be generated based on a skipgram model that utilizes words and subwords from the document.
  • a similarity calculation is performed. For example, at least one keyword is compared to each of the generated semantic vectors to determine corresponding similarity scores.
  • candidate term identification is performed. For example, a semantic vector having a highest similarity score to the keyword is identified to identify a term of the semantic vector as a candidate term.
  • candidate terms are filtered. For example, the similarity score of the candidate term is compared to a threshold to determine whether or not to discard the candidate term (e.g., the candidate term is discarded if the score is less than or equal to the threshold).
  • the taxonomy is expanded. For example, one or more candidate terms are added to the taxonomy to generate the expanded taxonomy (e.g., an expanded keyword set). The expanded taxonomy may be used in performing sentence extraction and summary generation, as described with reference to the operations of blocks 212-222.
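  • The expansion steps (similarity calculation, candidate identification, filtering, and addition to the taxonomy) can be sketched with cosine similarity over term vectors. This is an illustration under several assumptions: the vectors are hand-written two-dimensional toys rather than the output of a trained skipgram model, each vector is keyed by a single term, and the names `cosine` and `expand_taxonomy` are hypothetical.

```python
import math


def cosine(u, v):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def expand_taxonomy(seed_keywords, vectors, threshold):
    """Add, for each seed keyword, its most similar non-seed term.

    A candidate is discarded when its similarity score is less than
    or equal to the threshold.
    """
    expanded = set(seed_keywords)
    for keyword in seed_keywords:
        if keyword not in vectors:
            continue
        candidates = [
            (cosine(vectors[keyword], vec), term)
            for term, vec in vectors.items()
            if term not in seed_keywords
        ]
        if not candidates:
            continue
        score, term = max(candidates)
        if score > threshold:
            expanded.add(term)
    return expanded


# Toy vectors (assumed values, not trained embeddings).
vectors = {
    "lawsuit": (1.0, 0.0),
    "litigation": (0.9, 0.1),
    "picnic": (0.0, 1.0),
}
expanded = expand_taxonomy({"lawsuit"}, vectors, threshold=0.8)
```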
  • system 100 (e.g., server 110 and its corresponding operations and functions) provides the ability to generate and output text summaries, such as event (e.g., risk) summaries, that more closely conform to summaries generated by humans than other summaries generated by other systems.
  • the generated summaries include a combination of specific sentences (e.g., extracted sentences including keywords from a user generated keyword set) and general sentences (e.g., extracted sentences including keywords from an automatically expanded keyword set)
  • the summaries may more closely resemble human-generated summaries, such as by being more grammatically natural.
  • the predetermined order rule set enables system 100 (e.g., server 110) to generate summaries having improved quality compared to other computer-generated summaries. For example, these summaries may be more preferable to a human analyst than other computer-generated summaries and/or may have improved readability compared to other computer-generated summaries. Additionally, system 100 (e.g., server 110) may generate the improved summaries using fewer computing resources, and less power consumption, than typical abstractive summarization systems. Thus, the techniques of the present disclosure may be implemented on electronic devices with reduced processing capabilities, as compared to typical abstractive summarization systems.
  • the systems and methods disclosed herein may be used for risk mining.
  • Risk mining seeks to identify the expression of entity-risk relationships in textual data. For example, in example sentences (1) below, a CNN-Terrorism relationship is described that is indicated by the reference to CNN in sentence (1)(a) and the keywords representative of the Terrorism risk category, “pipe bomb” in sentence (1)(a) and “bomb threat” in sentence (1)(b).
  • a goal of risk mining systems is to identify the highest value and most relevant text extractions that embody an entity-risk relationship, indexed by an entity and a keyword/phrase - obviating the need for a manual review of numerous sources.
  • Extractive summarization may address this problem. Summarization performed by the systems and methods described herein includes extractive summarization with a focus on creating high quality output that appropriately orders the specificity of information in the extracted summaries.
  • sentence (1)(a) provides details about time (“Later Wednesday”), events (“receiv[ing] a pipe bomb”), locations (“Time Warner Center headquarters in Manhattan”), people (“ex-CIA director John Brennan”), and the resulting event (“evacuat[ing] its [CNN’s] offices”).
  • Sentence (1)(b) generalizes that this was the second such event in two days.
  • Example sentences (1) may be reordered as example sentences (2) below.
  • the systems and methods of the present disclosure operate to identify two groups of extracts (e.g., sentences) from a keyword-based risk mining system: one characterized as more specific (from a manually curated/user generated set of keywords) and one characterized as more general (from a semantically encoded set of keywords).
  • Risk mining systems typically start with a keyword list that captures, from a subject matter expert’s perspective, a risk category of interest and entities that are subject to that risk (e.g., media outlets subject to terrorism, persons subject to fraud, etc.). Systems also expand the initial keyword list and fine tune output through some combination of machine learning and human-in-the-loop review until a desired level of performance is achieved. Domains where risk mining has been applied include financial risks based on filings and stock prices, general risks in news, and supply chain risks, as non-limiting examples. Methods of keyword list expansion include ontology merging, crowdsourcing, and paraphrase detection. A goal of keyword list expansion is to reduce or minimize human involvement while still preserving expert judgment and maintaining or improving performance through the return of highly relevant extracts.
  • Extractive techniques attempt to identify relevant text extractions in single and multi-document source material, rank the extracts to find the most informative, and combine the selected extracts into a summarized discourse.
  • Some systems identify and rank relevant extracts based on queries, document word frequencies, probabilities, TF-IDF weighting, topic modeling, graph-based methods, and neural networks.
  • At least some implementations described herein are configured to perform extractions based on entity-keyword matching, with subsequent ranking of token distances between entities and risk keywords, and with summarization being considered multi-document rather than single-document.
  • Improvement on the sentence level includes compression and sentence fusion.
  • Improvement on the discourse (e.g., summary) level includes lexical chains, WordNet-based concepts, and discourse relation and graph representations.
  • the dog is in the backyard.
  • Generics describe either a class of entities, such as dogs in sentence (3)(a), or a member of a class of entities, such as the dog in sentence (3)(b).
  • Habituals describe either specific or regular events, such as trouble walking in sentence (3)(c) or slipped and fell in sentence (3)(d).
  • word-level features such as plurals, quantifiers, verb tenses, categories of noun phrases, and lexical resources such as WordNet.
  • occurrences of information specificity may be linked to rhetorical relations.
  • “background” relation provides general backdrop information for subsequent clauses
  • “elaboration” provides more specific unfolding of events
  • “specification” provides more specific detail of the previous information.
  • Annotated granularities may improve the Naive Bayes and Decision Tree prediction of Segmented Discourse Representation Theory (SDRT). Spatial granularities may be leveraged to improve SDRT rhetorical relation prediction between clauses in narratives and also observe a more global distribution of general to specific (and possibly back to general) as narratives progress globally.
  • Shifts in specificity are generally associated with texts of higher quality, which can be further broken down into increased readability, higher coherence, and accommodation of the intended audience. It has also been observed that automatic summaries tend to be much more specific than human authored counterparts and, consequently, are judged to be incoherent and of lower comparative quality.
  • the systems and methods described herein model specificity by alternating selection of sets of extracts that are more or less specific - a more discourse primitive endeavor - rather than explicitly identifying and explaining habituals, generics, or rhetorical relations.
  • a system of the present disclosure is a custom NLP processing pipeline capable of ingesting and analyzing hundreds of thousands of text documents relative to an initial manually-curated (e.g., user defined) seed taxonomy, such as a first keyword set.
  • the system includes at least five components:
  • Raw text documents are read and tokenization, lemmatization, and sentencization are performed.
  • Keyword/Entity Detection: Instances of both keywords and entities are identified in the processed text, and each risk keyword occurrence is matched to the nearest entity token.
  • Match Filtering and Sentence Retrieval: Matches within the documents are filtered and categorized by pair distance and/or sentence co-occurrence, and the filtered sentences are retrieved for context.
  • Semantic Encoding and Taxonomy Expansion: A semantic vectorization algorithm is trained on domain-specific text and used to perform automated expansion of the keyword taxonomy.
  • Extractive Summarization Construction: From the total collection of extracts, summaries are formed based on different combinations of distances, keyword frequencies, and taxonomy.
  • This design architecture allows for significant customization, high throughput, and modularity for uses in experimental evaluation and deployment in production use-cases.
  • the system may support decentralized or streaming architectures, with each document being processed independently and learning systems (specifically at the semantic encoding/expansion steps) configured for continuous learning or batch model training.
  • One or more known systems can be used for document ingest and low level NLP, such as spaCy, as a non-limiting example.
  • the system may be configured for high speed parsing, out-of-the-box parallel processing, and Python compatibility.
  • the system may allow for a text generator object to be provided, and may take advantage of multi-core processing to parallelize batching.
  • each processed document piped in by the system is converted to its lemmatized form with sentence breaks noted so that sentence and multi-sentence identification of keyword/entity distances can be captured.
  • example sentence (5) extends the extract of example sentence (4) to the prior contiguous sentence which contains settlement. This extension provides greater context for Verizon’s lawsuit.
  • Example sentence (5) is actually background for a larger proposition being made in the document that Verizon is in violation of settlement terms from a previous lawsuit.
  • (5) McDonald says this treatment violated the terms of a settlement the company reached a few years earlier regarding its treatment of employees with disabilities.
  • The “shallow” parsing approach (e.g., the token distance approach) of the system promotes efficiency and is preferable to more complex NLP, e.g., chunking or coreference resolution. Nonetheless, this flexibility comes at a computational cost: a total of (m · a) × (n · b) comparisons must be made for each document, where m is the number of keyword terms across all taxonomic categories, a is the average number of instances of each keyword per document, n is the number of entities provided, and b is the average number of entity instances per document. Changing any single one of these variables will result in computational load changing with O(n) complexity, but their cumulative effects can quickly add up.
  • each keyword is independent of each other keyword and each entity is independent of each other entity. This means that in an infinitely parallel (theoretical) computational scheme, the system runs in O(a × b), which will vary as a function of the risk and text domains.
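  • As a worked example of this cost, the comparison count per document is the number of keyword occurrences multiplied by the number of entity occurrences. The values below are hypothetical, chosen only to show that doubling a single variable doubles the load, and that the per-worker load falls to a × b when every keyword/entity combination is processed in parallel.

```python
def comparisons_per_document(m, a, n, b):
    """Keyword occurrences (m * a) times entity occurrences (n * b)."""
    return (m * a) * (n * b)


# Hypothetical sizes: 50 keywords seen twice each, 10 entities seen
# three times each.
baseline = comparisons_per_document(m=50, a=2, n=10, b=3)
doubled_keywords = comparisons_per_document(m=100, a=2, n=10, b=3)

# With one worker per keyword/entity combination, each worker only
# compares the a instances of its keyword with the b instances of
# its entity.
per_worker = 2 * 3
```

  • With these values, the baseline is 3,000 comparisons, doubling the keyword count gives 6,000, and the fully parallel per-worker load is 6.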
  • the system may automate term expansion by using similarity calculations of semantic vectors. These vectors are generated by training a skipgram model, which relies on words and subwords from the same data source as the initial extractions. This ensures that domain usage of language is well-represented, and any rich domain-specific text may be used to train semantic vectors.
  • the model vocabulary for the minimized normalized dot product e.g., a basic similarity score
  • Extracts are deduped, preserving the lowest distance rank version. Extracts may then be rank ordered by shortest distance and highest frequency keyword, and selection for inclusion in a summary proceeds. For example, selection may occur according to the process of Algorithm 2.
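  • The deduplication step can be sketched as keeping, for each distinct extract text, the version with the smallest token distance. The dictionary layout and the name `dedupe_extracts` are assumptions, and duplicates are assumed to be exact text matches. This sketch does not reproduce Algorithm 2.

```python
def dedupe_extracts(extracts):
    """Keep one entry per sentence text, preferring the lowest distance."""
    best = {}
    for extract in extracts:
        key = extract["sentence"]
        if key not in best or extract["distance"] < best[key]["distance"]:
            best[key] = extract
    # Dict insertion order preserves the first-seen order of sentences.
    return list(best.values())


# Hypothetical extracts: the same sentence was matched twice.
extracts = [
    {"sentence": "Costco faces a lawsuit.", "keyword": "lawsuit", "distance": 7},
    {"sentence": "Costco was fined.", "keyword": "fined", "distance": 2},
    {"sentence": "Costco faces a lawsuit.", "keyword": "lawsuit", "distance": 3},
]
unique = dedupe_extracts(extracts)
```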
  • FIG. 3 shows an example system 300 in accordance with the present disclosure.
  • the system 300 includes stream of text documents 302, initial natural language pre-processing 304, entity-risk detection 306, entity-risk detections output file 308, analyst summarization 310, human summaries 312, re-ordering prioritization and grouping of detections 314, summarization processes 316, risk summarization output file 318, shuffling of methods for comparison 320, machine and human evaluation 322, and system performance results 324.
  • Stream of text documents 302 includes a corpus of documents (e.g., one or more documents) that are provided to system 300, such as to initial natural language pre-processing 304.
  • Initial natural language pre-processing 304 is configured to perform low level natural language processing on stream of text documents 302 to generate processed data that indicates one or more sentences. For example, tokenization and/or sentencization may be performed on the stream of text documents 302 to generate the processed data.
  • Entity-risk detection 306 is configured to identify one or more matched pairs of entities and keywords based on an entity list, a first keyword list (e.g., a user-generated keyword list), and a second keyword list (e.g., an automatically expanded keyword list). For example, for each keyword, entity-risk detection may determine a nearest entity to the keyword (and whether it is in the same sentence or not), and then determine bi-directional pairings, e.g., the entity that is closest to the keyword, whether the entity is before or after the keyword (even if the keyword is in a different sentence). In some implementations, entity-risk detection 306 is configured to operate in parallel such that multiple keywords may be paired with entities concurrently.
  • entity-risk detection 306 extracts a plurality of sentences that include the one or more matched pairs. Each extracted sentence may include a single sentence or multiple sentences. The plurality of extracted sentences are output as entity-risk detections output file 308. Additionally, the plurality of extracted sentences are provided to analyst summarization 310. Analyst summarization 310 represents one or more human analysts that generate a plurality of human summaries based on the plurality of extracted sentences. The plurality of human summaries are provided downstream as human summaries 312.
  • the plurality of extracted sentences are provided to re-ordering prioritization and grouping of detections 314.
  • Re-ordering prioritization and grouping of detections 314 is configured to order the plurality of extracted sentences based on distances (e.g., token distances) between the keyword and the entity in each extracted sentence, frequencies of the keywords in each extracted sentence, or both.
  • Re-ordering prioritization and grouping of detections 314 may output an ordered plurality of extracted sentences.
  • the ordered plurality of extracted sentences may be provided to summarization processes 316.
  • Summarization processes 316 may be configured to identify a first set of extracted sentences and a second set of extracted sentences from the ordered plurality of extracted sentences, and to generate an extracted summary that includes at least one sentence of the first set of extracted sentences and at least one sentence of the second set of extracted sentences.
  • the first set of extracted sentences corresponds to an entity and includes one or more keywords from the first keyword set
  • the second set of extracted sentences corresponds to the entity and includes one or more keywords from the second keyword set.
  • Summarization processes 316 may include multiple summarization processes, such as summarization processes that generate summaries with different orders of general and specific sentences, as well as summaries that include only a single extracted sentence, and off-the-shelf text summary programs (for comparison against the systems of the present disclosure), as further described herein.
  • Summarization processes 316 may output one or more extracted summaries as risk summarization output file 318. Additionally, the one or more extracted summaries may be provided to shuffling of methods for comparison 320.
  • Shuffling of methods for comparison 320 may be configured to receive the one or more extracted summaries and human summaries 312, and may shuffle (e.g., randomize or pseudo-randomize the order of) the various results so that humans selected to compare the results do not know which results come from which summarization process (either automatic or human generated).
  • the shuffled summaries are provided to machine and human evaluation 322.
  • Machine and human evaluation 322 may be configured to enable one or more humans to read the shuffled summaries and to rank the shuffled summaries based on preference, readability, and/or any other criteria. Results of the human selections may be output by system 300 as system performance results 324.
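  • The blinding performed by shuffling of methods for comparison 320 can be sketched as follows. The function name `blind_shuffle`, the use of numeric identifiers, and the separate answer key are assumptions introduced for illustration. Evaluators would see only the blinded list, while the answer key is held back to attribute rankings afterwards.

```python
import random


def blind_shuffle(summaries_by_system, seed=None):
    """Shuffle summaries and hide which system produced each one.

    Returns the blinded list shown to evaluators and an answer key
    mapping each blinded id back to its source system.
    """
    rng = random.Random(seed)
    items = list(summaries_by_system.items())
    rng.shuffle(items)
    blinded = [{"id": i, "text": text} for i, (_, text) in enumerate(items)]
    answer_key = {i: system for i, (system, _) in enumerate(items)}
    return blinded, answer_key


# Hypothetical summaries keyed by the process that produced them.
summaries = {
    "Human": "Summary written by an analyst.",
    "Seed": "Summary built from seed extracts.",
    "MixedThirds": "Summary built from mixed extracts.",
}
blinded, answer_key = blind_shuffle(summaries, seed=7)
```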
  • the probability of a multi-sentence extract occurring in the output is high - 70% (30% single sentence) with an average token distance of 30 for multi or single sentence extraction (standard deviation is as high as 25 tokens). Based on distances, a threshold of 100 words was selected for the experiment to control, as best as possible, an extract per third.
  • the experiment included asking six human analysts (e.g., subject matter experts in risk analysis) to write multiple human summaries (e.g., human summaries 312) for each entity-risk relationship using extracts filtered by lowest distance and keyword (rather than all possible extracts and identified documents).
  • human summaries were used in three evaluations involving four systems designed according to implementations described herein, such that it could be determined if modeling of information specificity translated into improved performance.
  • the four systems included: “Seed” - seed extracts (e.g., extracted sentences including keywords included in the seed/first keyword set) selection only; “Expanded” - expanded extracts (e.g., extracted sentences including keywords included in the expanded/second keyword set); “MixedThirds” - selection in thirds (e.g., three sentence combinations), selected based on expanded->seed->seed (general->specific->specific); and “AlternateThirds” - selection in thirds, expanded->seed->expanded (general->specific->general). Additionally, the three evaluations included a random baseline system as well as two existing extractive summarization systems, TextRank and LexRank.
  • In TextRank, each extract is a node in a graph whose edges are weighted by normalized word overlap between sentences.
  • In LexRank, each extract is a node in a graph whose edges are weighted by cosine similarity of the extract set’s TF-IDF vectors.
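The two edge weightings can be illustrated with a short sketch. It makes several assumptions: whitespace tokenization, a sum-of-log-lengths normalization for the word overlap, and an unsmoothed `log(n / df)` IDF. The example extracts are invented for illustration, and the ranking step that both baselines run over the resulting graph is omitted.

```python
import math
from collections import Counter

def overlap_weight(a, b):
    """Word overlap between two extracts, normalized by their lengths."""
    wa, wb = a.lower().split(), b.lower().split()
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # log(1) = 0 would make the denominator degenerate
    shared = len(set(wa) & set(wb))
    return shared / (math.log(len(wa)) + math.log(len(wb)))

def tfidf_vectors(extracts):
    """One sparse TF-IDF vector (a dict of term -> weight) per extract."""
    docs = [Counter(e.lower().split()) for e in extracts]
    n = len(docs)
    # Iterating a Counter yields each term once, so this is document frequency.
    df = Counter(term for d in docs for term in d)
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical extracts for illustration only.
extracts = [
    "Costco faces a lawsuit over labeling",
    "The lawsuit against Costco was filed in court",
    "Shares rose after strong quarterly earnings",
]
vectors = tfidf_vectors(extracts)
overlap_edge = overlap_weight(extracts[0], extracts[1])
cosine_edge = cosine(vectors[0], vectors[1])
```

In both cases the first two extracts are connected by a positive edge weight, while the third, which shares no words with them, is disconnected.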
  • Example summaries generated by some of the various systems are shown below in Table 3.
  • Table 3 includes example summaries output based on a Costco-Legal entity-risk relation.
  • Intrinsic evaluations may provide insight into how informative the systems are, whereas the manual ‘extrinsic’ evaluations provide insight as to how the information is packaged. Both evaluations are relative to the human summaries, assumed to be of the highest quality.
  • FIG. 4 illustrates a chart 400 of preference values for the various summaries tested. As shown in FIG. 4, there is a trend of greater preference of the expanded over non-expanded systems (e.g., the preference values corresponding to system-generated summaries are closer to the preference values for human-generated summaries for expanded systems, such as MixedThirds and AlternateThirds).
  • Discourse awareness in the system comes from semantic coherence associated with token distances, and rhetorical coherence associated with the multi-sentence extractions and the nature of specificity in the extraction sets, all of which are artifacts of the risk mining extraction (which has linear complexity relative to the volume of data). While current research on the detection of text specificity (and granularity), as well as on sentence ordering approaches generally, shows a great deal of promise, it is a very difficult and complex problem where investigations into WordNet and autoencoding can only begin to scratch the surface.
  • the system could be used for any data, entities and taxonomies to support generalized event monitoring and summarization. Additionally, the system may address high system recall relative to maintaining flexibility for analyst users and dynamic definition of the risk problem space - this may include summarization of results for better presentation, alternative source data at the direction of the analyst for given risk categories, and token distance thresholding.
  • FIG. 5 is a flow diagram of a method 500 of summarizing data.
  • the method 500 may be performed by system 100 of FIG. 1, one or more components to execute operations of FIG. 2, or system 300 of FIG. 3.
  • Method 500 includes extracting a plurality of sentences from data corresponding to one or more documents each comprising text, at block 502.
  • Each extracted sentence includes at least one matched pair including a keyword from a first keyword set or a second keyword set and an entity from an entity set.
  • Each extracted sentence includes a single sentence or multiple sentences.
  • sentence extractor 120 may extract a plurality of sentences from data received from data sources 170, the data corresponding to one or more documents.
  • Method 500 includes ordering the plurality of extracted sentences based on a distance between a respective keyword and a respective entity in each extracted sentence of the plurality of extracted sentences, at block 504.
  • sentence organizer 121 may order (e.g., prioritize) the plurality of extracted sentences based on a distance (e.g., a token distance) between the keyword and the entity in each extracted sentence.
  • the distance may indicate a predicted relevance of the extracted sentence, such that the extracted sentences are ordered based on predicted relevance.
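A minimal sketch of the ordering at block 504 is shown below, assuming single-token keywords and entities, naive whitespace tokenization, and ascending distance as the ranking (smaller distance treated as more relevant). The helper names and example sentences are hypothetical.

```python
def token_distance(sentence, keyword, entity):
    """Smallest number of token positions between keyword and entity, or None."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in sentence.split()]
    kw_pos = [i for i, t in enumerate(tokens) if t == keyword.lower()]
    en_pos = [i for i, t in enumerate(tokens) if t == entity.lower()]
    if not kw_pos or not en_pos:
        return None  # not a matched pair
    return min(abs(k - e) for k in kw_pos for e in en_pos)

def order_by_distance(extracts, keyword, entity):
    """Keep extracts containing both terms, closest keyword-entity pair first."""
    scored = [(token_distance(s, keyword, entity), s) for s in extracts]
    scored = [(d, s) for d, s in scored if d is not None]
    return [s for d, s in sorted(scored, key=lambda pair: pair[0])]

# Hypothetical extracts for illustration only.
extracts = [
    "Costco, a large retailer with many stores, settled the lawsuit.",
    "The lawsuit against Costco continues.",
    "Costco opened a new warehouse.",
]
ordered = order_by_distance(extracts, "lawsuit", "Costco")
```

The third sentence is dropped because it lacks the keyword, and the second sentence ranks first because its keyword and entity are only two tokens apart.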
  • Method 500 also includes identifying a first type of extracted sentences from the ordered plurality of extracted sentences, at block 506.
  • Extracted sentences having the first type include one or more keywords included in the first keyword set.
  • sentence identifier 122 may identify, from the ordered plurality of extracted sentences, a first type of extracted sentences that include one or more keywords included in the first keyword set.
  • the first type may be “specific.”
  • Method 500 includes identifying a second type of extracted sentences from the ordered plurality of extracted sentences, at block 508.
  • Extracted sentences having the second type include one or more keywords included in the second keyword set.
  • sentence identifier 122 may identify, from the ordered plurality of extracted sentences, a second type of extracted sentences that include one or more keywords included in the second keyword set.
  • the second type may be “general.”
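Blocks 506 and 508 can be sketched as a simple keyword lookup. One assumption is made loudly here: because the expanded set may contain the seed keywords, a sentence matching a seed keyword is labeled "specific" first, and only sentences matching expansion-only terms are labeled "general". The keyword sets and sentences are invented for illustration.

```python
def classify_extract(sentence, seed_keywords, expanded_keywords):
    """Label an extract 'specific', 'general', or None if no keyword matches."""
    tokens = {t.strip(".,;:!?\"'()").lower() for t in sentence.split()}
    if tokens & seed_keywords:
        return "specific"  # matches the first (seed) keyword set
    if tokens & expanded_keywords:
        return "general"   # matches only terms added by expansion
    return None

# Hypothetical keyword sets for a "legal" category.
seed = {"lawsuit"}
expanded = {"lawsuit", "litigation", "dispute"}

labels = [
    classify_extract("Costco settled the lawsuit.", seed, expanded),
    classify_extract("Costco is involved in a dispute.", seed, expanded),
    classify_extract("Costco opened a store.", seed, expanded),
]
```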
  • Method 500 also includes generating an extracted summary that includes at least one sentence having the first type and at least one sentence having the second type, at block 510.
  • the at least one sentence having the first type is intermixed with the at least one sentence having the second type based on a predetermined order rule set.
  • summary extractor 123 may generate an extracted summary that includes at least one sentence having the first type and at least one sentence having the second type by intermixing the at least one sentence having the first type with the at least one sentence having the second type based on a predetermined order rule set.
  • the predetermined order rule set may be configured to enable generation of summaries that are more grammatically natural and readable and may indicate an order of sentences for inclusion in summaries based on sentence type (e.g., the first type and the second type).
  • Method 500 further includes outputting the extracted summary, at block 512.
  • output generator 124 may output the extracted summary, for example, for display to a user. Additionally, or alternatively, the extracted summary may be stored at a memory.
  • the first keyword set includes a user-generated keyword set.
  • the second keyword set includes an expanded keyword set.
  • the first keyword set may be generated based on an input to server 110, and the second keyword set may be automatically generated by server 110, such as by taxonomy expander 125, based on the data received from data sources 170 and the first keyword set.
  • generating the extracted summary includes including, in the extracted summary, a first sentence having the second type, followed by a second sentence having the first type, followed by a third sentence having the first type, based on the predetermined order rule set.
  • summary extractor 123 may include a general sentence (e.g., a sentence having the second type), followed by a specific sentence (e.g., a sentence having the first type), followed by a second specific sentence in the extracted summary based on the predetermined order rule set indicating inclusion of sentence triples ordered general->specific->specific.
  • generating the extracted summary includes including, in the extracted summary, a first sentence having the second type, followed by a second sentence having the first type, followed by a third sentence having the second type, based on the predetermined order rule set.
  • summary extractor 123 may include a general sentence, followed by a specific sentence, followed by a second general sentence in the extracted summary based on the predetermined order rule set indicating inclusion of sentence triples ordered general->specific->general.
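The two triple orderings can be expressed as patterns that are filled from the ranked sentences of each type. This is a sketch under assumptions: the pattern tuples, the `build_summary` name, and the choice to stop as soon as a complete group can no longer be filled are all illustrative and not stated in the source.

```python
MIXED_THIRDS = ("general", "specific", "specific")
ALTERNATE_THIRDS = ("general", "specific", "general")

def build_summary(specific, general, pattern):
    """Fill `pattern` repeatedly, taking the next ranked sentence of each type."""
    if not pattern:
        return []
    queues = {"specific": list(specific), "general": list(general)}
    summary = []
    while True:
        group = []
        for kind in pattern:
            if not queues[kind]:
                return summary  # assumption: only complete groups are emitted
            group.append(queues[kind].pop(0))
        summary.extend(group)

# Placeholder sentence labels, already ordered by predicted relevance.
specific = ["S1", "S2", "S3"]
general = ["G1", "G2"]

mixed = build_summary(specific, general, MIXED_THIRDS)
alternate = build_summary(specific, general, ALTERNATE_THIRDS)
```

With these inputs each pattern yields exactly one complete triple, since neither queue holds enough sentences for a second one.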
  • method 500 may further include determining whether to include an additional sentence in the extracted summary based on a determination whether a sum of a length of the extracted summary and a length of the additional sentence is less than or equal to a threshold.
  • summary extractor 123 may, after adding (e.g., appending) an extracted sentence to the extracted summary, determine whether a sum of a length of the extracted summary and a length of an additional sentence is less than or equal to a threshold. If the sum is less than or equal to the threshold, summary extractor 123 may include the additional sentence in the extracted summary.
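The length check can be sketched as below. Measuring length in whitespace-separated words and stopping at the first sentence that does not fit are assumptions; the default of 100 words mirrors the threshold reported for the experiment.

```python
def append_within_limit(summary, candidates, max_words=100):
    """Append candidates in order while summary length stays within max_words."""
    result = list(summary)
    length = sum(len(s.split()) for s in result)
    for sentence in candidates:
        n = len(sentence.split())
        if length + n > max_words:
            break  # assumption: stop rather than skip to a shorter sentence
        result.append(sentence)
        length += n
    return result

# Toy example with a five-word limit.
extended = append_within_limit(["a b c"], ["d e", "f g h i", "j"], max_words=5)
```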
  • method 500 also includes generating the second keyword set.
  • Generating the second keyword set includes generating one or more semantic vectors.
  • Generating the second keyword set also includes, for each keyword of the first keyword set, determining a semantic vector having a highest similarity score to the keyword and identifying one or more terms of the determined semantic vector as a candidate term.
  • Generating the second keyword set further includes selecting at least one candidate term to be added to the first keyword set to generate the second keyword set.
  • taxonomy expander 125 may generate semantic vectors and identify terms of semantic vectors as candidate terms based on similarity scores.
  • generating the one or more semantic vectors includes, for each of the one or more documents, generating a corresponding semantic vector based on a skipgram model that utilizes words and subwords from the document.
  • a skipgram generator, such as fastText, may be used to generate the semantic vectors.
  • Generating the second keyword set further includes, for each keyword of the first keyword set, comparing a similarity score of the determined semantic vector having the highest similarity score to a threshold. The semantic vector is used to identify the candidate term based on a determination that the similarity score of the determined semantic vector is greater than or equal to the threshold.
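The expansion logic can be illustrated with a simplified sketch. It departs from the description in one labeled way: instead of document-level semantic vectors produced by a skip-gram model, it uses a small hand-written dictionary of per-term vectors, so the numbers are purely illustrative. The function names and the threshold of 0.8 are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(seed_keywords, vectors, threshold=0.8):
    """Add, per seed keyword, its most similar term if it clears the threshold."""
    expanded = set(seed_keywords)
    for keyword in seed_keywords:
        if keyword not in vectors:
            continue
        candidates = [
            (cosine(vectors[keyword], vec), term)
            for term, vec in vectors.items()
            if term not in seed_keywords
        ]
        if not candidates:
            continue
        score, term = max(candidates)  # highest similarity score wins
        if score >= threshold:
            expanded.add(term)
    return expanded

# Toy vectors; in practice these would come from a trained embedding model.
vectors = {
    "lawsuit": [0.9, 0.1, 0.0],
    "litigation": [0.8, 0.2, 0.1],
    "earnings": [0.0, 0.1, 0.9],
    "breach": [0.1, 0.9, 0.2],
}
second_keyword_set = expand_keywords({"lawsuit"}, vectors)
```

Raising the threshold makes the expansion more conservative: with a threshold of 0.99, no candidate qualifies in this toy example and the second keyword set equals the first.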
  • method 500 also includes generating a second extracted summary that includes at least one sentence having the first type and at least one sentence having the second type.
  • the at least one sentence having the first type is intermixed with the at least one sentence having the second type based on the predetermined order rule set.
  • summary extractor 123 may generate a second extracted summary that includes at least one sentence having the first type (e.g., specific) and at least one sentence having the second type (e.g., general).
  • the sentences may be ordered in one of multiple configurations based on the predetermined order rule set, such as general->specific->specific, general->specific->general, or alternating specific (or general) followed by general (or specific), as non-limiting examples.
  • ordering the plurality of extracted sentences is based further on frequencies of respective one or more keywords included in each extracted sentence.
  • sentence organizer 121 may order the plurality of extracted sentences based further on frequencies of the keywords included in each extracted sentence, in addition to ordering the plurality of extracted sentences based on the distance between the keyword and the entity in each extracted sentence.
  • sentence organizer 121 may order the plurality of extracted sentences based only on the distances (and not the frequencies), or based only on the frequencies (and not the distances).
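The three ordering variants (distance and frequency, distance only, frequency only) can be sketched with a configurable sort key. How the two signals are combined is not specified in the text, so using distance as the primary key and keyword frequency as a tie-breaker is an assumption, as are the field names `distance` and `keyword_frequency`.

```python
def sort_key(extract, use_distance=True, use_frequency=True):
    """Build a sort key: smaller distance first, then higher frequency first."""
    key = []
    if use_distance:
        key.append(extract["distance"])
    if use_frequency:
        key.append(-extract["keyword_frequency"])
    return tuple(key)

def order_extracts(extracts, use_distance=True, use_frequency=True):
    return sorted(extracts, key=lambda e: sort_key(e, use_distance, use_frequency))

# Hypothetical precomputed scores for three extracts.
extracts = [
    {"text": "A", "distance": 5, "keyword_frequency": 10},
    {"text": "B", "distance": 2, "keyword_frequency": 3},
    {"text": "C", "distance": 2, "keyword_frequency": 8},
]
both = [e["text"] for e in order_extracts(extracts)]
distance_only = [e["text"] for e in order_extracts(extracts, use_frequency=False)]
frequency_only = [e["text"] for e in order_extracts(extracts, use_distance=False)]
```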
  • method 500 further includes receiving a selection of a first event category of multiple event categories and retrieving the first keyword set based on the selection of the first event category.
  • different keyword sets may correspond to different event categories.
  • one keyword set may correspond to “terrorism” and another keyword set may correspond to “legal.”
  • the multiple event categories include cybersecurity, terrorism, legal/non-compliance, or a combination thereof.
  • an extracted sentence of the plurality of extracted sentences includes the multiple sentences, and the multiple sentences include a sentence that includes the at least one matched pair, a sentence that includes the keyword of the at least one matched pair, a sentence preceding the sentence that includes the keyword of the at least one matched pair, a sentence following the sentence with the keyword of the at least one matched pair, a sentence that includes the entity of the at least one matched pair, a sentence preceding the sentence that includes the entity of the at least one matched pair, a sentence following the sentence with the entity of the at least one matched pair, or a combination thereof.
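One way to picture a multi-sentence extract is as a window around the sentence containing the keyword, kept only when the entity appears somewhere in that window. The sketch below covers just this keyword-centered case; the window sizes, helper names, and example sentences are assumptions for illustration.

```python
def contains(sentence, term):
    """True if `term` appears as a whole token in `sentence` (case-insensitive)."""
    tokens = {t.strip(".,;:!?\"'()").lower() for t in sentence.split()}
    return term.lower() in tokens

def extract_with_context(sentences, keyword, entity, before=1, after=1):
    """Return keyword-centered windows that also mention the entity."""
    extracts = []
    for i, sentence in enumerate(sentences):
        if not contains(sentence, keyword):
            continue
        window = sentences[max(0, i - before): i + after + 1]
        if any(contains(s, entity) for s in window):
            extracts.append(" ".join(window))
    return extracts

# Hypothetical document split into sentences.
sentences = [
    "Costco reported results on Monday.",
    "A lawsuit was filed the same day.",
    "Analysts expect little impact.",
    "Weather was mild.",
]
multi = extract_with_context(sentences, "lawsuit", "Costco")
single = extract_with_context(sentences, "lawsuit", "Costco", before=0, after=0)
```

Here the keyword and the entity sit in adjacent sentences, so only the windowed variant produces an extract; restricting the window to a single sentence finds no matched pair.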
  • the data is received from a data source that includes a streaming data source, news data, a database, or a combination thereof, and the entity set indicates an individual, a company, a government, an organization, or a combination thereof.
  • The functional blocks and modules in FIGS. 1-5 may comprise processors, electronics devices, hardware devices, electronics components, logical circuits, memories, software codes, firmware codes, etc., or any combination thereof. Consistent with the foregoing, various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
  • a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may be integral to the processor.
  • the processor and the storage medium may reside in an ASIC.
  • the ASIC may reside in a user terminal, base station, a sensor, or any other communication device.
  • the processor and the storage medium may reside as discrete components in a user terminal.
  • the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Computer-readable storage media may be any available media that can be accessed by a general purpose or special purpose computer.
  • such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor.
  • a connection may be properly termed a computer-readable medium.
  • if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, or digital subscriber line (DSL), then the coaxial cable, fiber optic cable, twisted pair, or DSL are included in the definition of medium.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
EP20809703.0A 2019-05-17 2020-04-28 Systems and methods for event summarization from data Pending EP3970031A4 (de)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962849182P 2019-05-17 2019-05-17
US16/700,746 US11461555B2 (en) 2018-11-30 2019-12-02 Systems and methods for identifying an event in data
US16/848,739 US11182539B2 (en) 2018-11-30 2020-04-14 Systems and methods for event summarization from data
PCT/IB2020/054007 WO2020234673A1 (en) 2019-05-17 2020-04-28 Systems and methods for event summarization from data

Publications (2)

Publication Number Publication Date
EP3970031A1 EP3970031A1 (de) 2022-03-23
EP3970031A4 EP3970031A4 (de) 2023-06-07

Family

ID=73458405

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20809703.0A Pending EP3970031A4 (de) 2019-05-17 2020-04-28 Systems and methods for event summarization from data

Country Status (4)

Country Link
EP (1) EP3970031A4 (de)
AU (1) AU2020278972B2 (de)
CA (1) CA3139081C (de)
WO (1) WO2020234673A1 (de)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949299A (zh) * 2021-02-26 2021-06-11 深圳市北科瑞讯信息技术有限公司 新闻稿件的生成方法及装置、存储介质、电子装置
CN114637601A (zh) * 2022-03-02 2022-06-17 马上消费金融股份有限公司 信息获取方法、装置、电子设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2715875B2 (ja) * 1993-12-27 1998-02-18 日本電気株式会社 多言語要約生成装置
US7283951B2 (en) * 2001-08-14 2007-10-16 Insightful Corporation Method and system for enhanced data searching
US8631001B2 (en) * 2004-03-31 2014-01-14 Google Inc. Systems and methods for weighting a search query result
US11080295B2 (en) * 2014-11-11 2021-08-03 Adobe Inc. Collecting, organizing, and searching knowledge about a dataset
US10534815B2 (en) * 2016-08-30 2020-01-14 Facebook, Inc. Customized keyword query suggestions on online social networks

Also Published As

Publication number Publication date
CA3139081A1 (en) 2020-11-26
CA3139081C (en) 2024-04-09
AU2020278972A1 (en) 2021-11-25
AU2020278972B2 (en) 2023-07-06
EP3970031A4 (de) 2023-06-07
WO2020234673A1 (en) 2020-11-26

Similar Documents

Publication Publication Date Title
US11182539B2 (en) Systems and methods for event summarization from data
Ahuja et al. The impact of features extraction on the sentiment analysis
Elghazaly et al. Political sentiment analysis using twitter data
AU2019389172B2 (en) Systems and methods for identifying an event in data
US9317498B2 (en) Systems and methods for generating summaries of documents
Weren et al. Examining multiple features for author profiling
Wu et al. Detection of hate speech in videos using machine learning
Oussous et al. Impact of text pre-processing and ensemble learning on Arabic sentiment analysis
Delizo et al. Philippine twitter sentiments during covid-19 pandemic using multinomial naïve-bayes
Alabbas et al. Classification of colloquial Arabic tweets in real-time to detect high-risk floods
AU2020278972B2 (en) Systems and methods for event summarization from data
Mouratidis et al. Domain-specific term extraction: a case study on Greek Maritime legal texts
Cajueiro et al. A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding
CN114756675A (zh) 文本分类方法、相关设备及可读存储介质
Pereira et al. Taxonomy extraction for customer service knowledge base construction
Pasarate et al. Comparative study of feature extraction techniques used in sentiment analysis
Ahmad et al. Aspect Based Sentiment Analysis and Opinion Mining on Twitter Data Set Using Linguistic Rules
Wang The evaluation of ensemble sentiment classification approach on airline services using twitter
Hamad et al. Emotion and polarity prediction from Twitter
Glickman et al. Investigating lexical substitution scoring for subtitle generation
Ekmekci et al. Specificity-based sentence ordering for multi-document extractive risk summarization
Kiminos et al. Using Machine Learning for Text Classification to identify useful information in texts: A comparison of Naïve Bayes and Support Vector Machines to identify decisions in business meeting transcripts
Lad Sarcasm Detection in English and Arabic Tweets Using Transformer Models
Phoo et al. Sentiment analysis for travel and tourism domain using hybrid approach
Yürütücü The Use of Pretrained Language Models in Sentiment Analysis

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20211112

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20230508

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 40/284 20200101ALI20230428BHEP

Ipc: G06F 40/247 20200101ALI20230428BHEP

Ipc: G06F 40/56 20200101ALI20230428BHEP

Ipc: G06Q 50/26 20120101ALI20230428BHEP

Ipc: G06Q 50/10 20120101ALI20230428BHEP

Ipc: G06F 40/295 20200101ALI20230428BHEP

Ipc: G06F 40/20 20200101ALI20230428BHEP

Ipc: G06F 40/10 20200101ALI20230428BHEP

Ipc: G06F 16/953 20190101ALI20230428BHEP

Ipc: G06F 16/738 20190101AFI20230428BHEP

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230524