WO2019055654A1 - Systems and methods for cross-media event detection and coreferencing - Google Patents
Systems and methods for cross-media event detection and coreferencing
- Publication number
- WO2019055654A1 (PCT/US2018/050885)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- event
- social media
- determining
- similarity
- news
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/101—Collaborative creation, e.g. joint development of products or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
Definitions
- Timely knowledge of events enables better decision-making in a broad range of fields including finance, security, policy, governance, planning and disaster coordination efforts.
- a government may use knowledge of an event to make better decisions regarding political unrest in a region
- a trader may use event knowledge to gain insights into companies vulnerable to natural disasters
- a non-governmental organization may use event knowledge to optimize the allocation of aid workers to where they are needed most.
- Extracting attributes of events such as one or more of the who, what, where, when, why and how of the event, in real time from the text of media, entails many challenges.
- postings of social media platforms may be mostly noise, such as spam, chit chat, etc., be related to events that are not newsworthy or otherwise of interest for decision making, be one of many postings that discuss a same event, and use custom lingo that obscures the attributes of the event.
- News articles, while in some respects inherently more event-related than an average social media posting, nonetheless also present difficulties corresponding to their particular format, such as correctly extracting the event attributes from a relatively larger amount of information.
- event information extracted from any given type of media may be subject to limits on its usefulness related to limitations of the type of media itself.
- Although social media such as Twitter has proven to be a major source of breaking news across a variety of topics, with one study indicating that Twitter led mainstream news media in more than 20% of disaster-related stories, it is often unreliable and provides only limited information about the event.
- Although traditional news articles are typically rigorously verified before publication, and thus more reliable, and present a rich context that completes the semantic picture about an event, news articles may report breaking news more slowly than social media.
- FIG. 1 is an exemplary architectural diagram of the system.
- FIG. 2 is an exemplary event processing server.
- FIG. 3a is an exemplary flow chart of one implementation of the disclosure.
- FIG. 3b is an exemplary flow chart of another implementation of the disclosure.
- FIG. 4a illustrates exemplary elements in a veracity calculation.
- FIG. 4b illustrates exemplary elements in an alternative verification calculation.
- FIG. 5a illustrates an exemplary processing of an item of social media data.
- FIG. 5b illustrates an example table representation of mapping key concepts to the respective social media data.
- FIG. 5c illustrates an example database representation in relation to the exemplary social media data of Fig. 5a.
- FIG. 5d illustrates an example unit cluster.
- FIG. 5e illustrates exemplary ingested data.
- FIGS. 5f-5k show exemplary metadata of the ingested data in FIG. 5e.
- FIGS. 5l-5n show exemplary metadata of an event detected cluster with the ingested data of FIG. 5e as one of the related unit data.
- FIG. 6a illustrates default event detected clusters viewable through an exemplary graphical user interface (GUI).
- FIG. 6b illustrates exemplary event detected clusters viewable through an exemplary graphical user interface (GUI).
- FIG. 6c illustrates a selected event detected cluster viewable through an exemplary graphical user interface (GUI).
- FIGS. 7a-7e illustrate additional filters on event detected clusters available through an exemplary graphical user interface (GUI).
- FIG. 8 is a schematic diagram depicting an embodiment of a system for detecting and coreferencing events across media types.
- FIG. 9 is a schematic diagram depicting an embodiment of a cross-media event detection and coreferencing system.
- FIG. 10 is a flowchart depicting an embodiment of a method of detecting and coreferencing events across media types.
- FIG. 11 is a schematic diagram depicting an embodiment of a news event extraction module.
- FIG. 12 is a flowchart depicting an embodiment of a method of detecting and generating representations of events referenced by news articles.
- FIGS. 13A-13C depict embodiments of news articles, social media postings and corresponding generated event representations for events that are coreferenced by the depicted news articles and social media postings.
- FIGS. 13D-13F depict embodiments of displays of coreferenced events of the event types in FIGS. 13A-13C, respectively, detected for a predetermined time period.
- FIG. 14 is a schematic diagram depicting an embodiment of a social media event extraction module.
- FIG. 15 is a flowchart depicting an embodiment of a method of detecting and generating representations of events referenced by social media postings.
- FIG. 16 is a schematic diagram depicting an embodiment of an event coreferencing module.
- FIG. 17 is a flowchart depicting an embodiment of a method of determining event coreferencing across media types.
- FIG. 18 is a schematic diagram depicting an embodiment of a similarity calculation module.
- FIG. 19 is a flowchart depicting an embodiment of a method of calculating similarities between a news article and a social media cluster.
- FIG. 20 is a schematic diagram depicting an embodiment of a computer system for implementing components of the system for detecting and coreferencing events across media types.
- FIG. 21 is a schematic diagram depicting further embodiments of the cross-media event extraction and coreferencing system and user system.
- FIG. 22 is a flowchart depicting an embodiment of a method of providing an alert for a coreferenced event.
- FIGS. 23A-23C depict embodiments of email, text, and feed alerts, respectively, for a coreferenced event.
- FIG. 24 depicts an embodiment of an alert application of the user system for interfacing with an API of the cross-media event extraction and coreferencing system.
- FIG. 25 is a map depicting embodiments of event information of an event production system.
- FIG. 26 is a chart depicting embodiments of event information of an event production system.
- FIG. 1 shows an exemplary system 100 for detecting and verifying an event from social media data.
- the system 100 is configured to include an event detection server 110 that is in communication with a social media platform 180 over a network 160.
- the system 100 further comprises an access device 170 that is in communication with an event processing server 210 over the network 160. Further details of an exemplary event processing server 210 are illustrated in FIG. 2.
- the event detection server 110 is in communication with the event processing server 210 over the network 160.
- Access device 170 can include a personal computer, laptop computer, or other type of electronic device, such as a mobile phone, smart phone, tablet, PDA or PDA phone.
- the access device 170 is coupled to I/O devices (not shown) that include a keyboard in combination with a pointing device such as a mouse for sending an event request to the event processing server 210.
- the access device 170 is configured to include a browser 172 that is used to request and receive information from the event processing server 210. Communication between the browser 172 of the access device 170 and event processing server 210 may utilize one or more networking protocols, which may include HTTP, HTTPS, RTSP, or RTMP.
- Although only one access device 170 is shown in FIG. 1, the system 100 can support one or multiple access devices.
- the network 160 can include various devices such as routers, servers, and switching elements connected in an Intranet, Extranet or Internet configuration.
- the network 160 uses wired communications to transfer information between the access device 170 and the event processing server 210, the social media platform 180 and the event detection server 110.
- the network 160 employs wireless communication protocols.
- the network 160 employs a combination of wired and wireless technologies.
- the event detection server 110 may be a special purpose server, and preferably includes a processor 112, such as a central processing unit ("CPU"), random access memory ("RAM") 114, input-output devices 116, such as a display device (not shown), and non-volatile memory 120, all of which are interconnected via a common bus 111 and controlled by the processor 112.
- the non-volatile memory 120 is configured to include an ingestion module 122 for receiving social media data from the social media platform 180.
- exemplary social media platforms include, but are not limited to, Twitter®, Reddit®, Facebook®, Instagram® or LinkedIn®.
- the phrase "ingested data" refers to received social media data, which may be, but is not limited to, tweets and/or online messages from the social media platform 180.
- the non-volatile memory 120 also includes a filtering module 124 for processing ingested data.
- processing of the ingested data may comprise but is not limited to, detecting language of the ingested data and filtering out ingested data that contains profanity, spam, chat and advertisements.
- the non-volatile memory 120 is also configured to include an organization module 126 for analyzing semantic and syntactic structures in the ingested data.
- the organization module 126 may apply part-of-speech tagging of the ingested data.
- the organization module 126 detects key concepts included in the ingested data.
- the non-volatile memory 120 may also be configured to include a clustering module 128 for storing key concepts identified by the organization module 126 into a database, an example of which may be but is not limited to a hashmap, and generating an event detected cluster upon reaching a threshold of distinct ingested data containing common key concepts.
- the non-volatile memory 120 is further configured to include a topic categorization module 131 for classifying the event detected cluster by topic; a summarization module 132 for selecting a representative description for the event detected cluster; and a newsworthiness module 133 for determining a newsworthy score to indicate the importance of the event detected cluster.
- the non-volatile memory 120 is also configured to include an opinion module 134 for detecting whether each item of ingested data in the event detected cluster contains an opinion of a particular person or is factual (e.g., non-opinionated tone), and a credibility module 135 for determining the credibility score of the ingested data.
- the credibility score is associated with three components: user/source credibility: who is providing the information, cluster credibility: what is the information, and tweet credibility: how is the information related to other information.
- the non-volatile memory 120 is further configured to include verification module 150 for determining the accuracy of the event detected cluster.
- verification may be done by a veracity algorithm which generates a veracity score.
- the verification module 150 may generate a probability score for an assertion being true based on evidence collected from ingested data.
- the non-volatile memory 120 is further configured to include a knowledge base module 152 that develops a database of information pertaining to credible sources and stores the information in a knowledge base data store 248 (FIG. 2).
- a data store 140 is provided that is utilized by one or more of the software modules 124, 126, 128, 131, 132, 133, 134, 135, 150, 152 to access and store information relating to the ingested data.
- the data store 140 is a relational database.
- the data store 140 is a file server.
- the data store 140 is a configured area in the nonvolatile memory 120 of the event detection server 110.
- Although the data store 140 shown in FIG. 1 is part of the event detection server 110, it will be appreciated by one skilled in the art that the data store 140 can be distributed across various servers and be accessible to the server 110 over the network 160.
- the data store 140 is configured to include a filtered data store 141, an organization data store 142, a cluster data store 143, a topic categorization data store 144, a summarization data store 145, a newsworthiness data store 146, an opinion fact data store 147, a credibility data store 148 and a veracity data store 154.
- the filtered data store 141 includes ingested data that has been processed by the filtering module 124.
- the ingested data processed by the filtering module 124 may be English language tweets that do not contain profanity, spam, chat or advertisements.
- the organization data store 142 includes ingested data that has been processed by the organization module 126.
- the ingested data in organization data store 142 may include parts-of-speech tagging notations or identified key concepts, which are stored as a part of ingested data metadata.
- the cluster data store 143 includes ingested data that has been processed by filtering module 124 and organization module 126 and is queued to be formed into a cluster.
- the cluster data store 143 may also contain a data store or database of key concepts (e.g. hashmap) identified by the organization module 126 matched to corresponding ingested data.
- ingested data (e.g., tweets and/or online messages) may also be referred to as unit data.
- the topic categorization data store 144 includes the classification of the event detected cluster determined by the topic categorization module 131.
- Exemplary topics may include but are not limited to business/finance, technology/science, politics, sports, entertainment, health/medical, crisis(war/disaster), weather, law/crime, life/society, and other.
- the summarization data store 145 includes a selected unit data that is representative of the event detected cluster as determined by the summarization module 132.
- the newsworthiness data store 146 includes the newsworthy score computed by newsworthiness module 133. For example, a higher score would imply that the event detected cluster is likely to be important from a journalistic standard.
- the opinion data store 147 includes information pertaining to the determination by the opinion module 134 of whether a given unit data comprises an opinion of a particular person or an assertion of a fact.
- the credibility data store 148 includes a credibility or confidence score as determined by the credibility module 135.
- the veracity data store 154 includes metrics generated by the verification module 150 regarding the level of accuracy of the event detected cluster. In one implementation, it may be the veracity score determined through a veracity algorithm. In another implementation, it may be a verification score indicating the probability of accuracy based on all the evidence collected from social media.
- the Event Processing Server 210 includes a processor (not shown), random access memory (not shown) and non-volatile memory (not shown) which are interconnected via a common bus and controlled by the processor.
- the Event Processing Server 210 is responsible for storing processed information generated or to be used by the Event Detection Server 110.
- the Event Processing Server 210 also communicates directly with the user.
- the Event Processing Server 210 is further illustrated in relation to FIG. 2.
- system 100 shown in FIG. 1 is one implementation of the disclosure.
- Other system implementations of the disclosure may include additional structures that are not shown, such as secondary storage and additional computational devices.
- various other implementations of the disclosure include fewer structures than those shown in FIG. 1.
- the Event Processing Server 210 in one implementation contains a web server 220 with a non-volatile memory 230 and a UI (user interface) module 232.
- the UI module 232 communicates with the access device 170 over the network 160 via a browser 172.
- the UI module 232 may present to a user through the browser 172 detected events clusters and their associated metadata.
- Exemplary associated metadata may be but are not limited to the topic, newsworthiness indication and verification score associated with one or more event detected clusters.
- the event processing server 210 may further comprise a data store 240 to host an ingested data store 242, a generated cluster data store 244, an emitted data store 246 and the knowledge base data store 248.
- the ingested data store 242 includes ingested data received from the social media platform 180 and processed by the ingestion module 122.
- the generated cluster data store 244 includes the event detected clusters that have been processed by modules 122, 124, 126, 128, 131, 132, 133, 134, 135 and 150.
- the emitted data store 246 includes key concepts and corresponding ingested data that were discarded by the clustering module 128, as explained in relation to steps 330-332 of FIG. 3a.
- the emitted data store may be located in the event detection server 110.
- the knowledge base data store 248 includes a list of credible sources as determined by knowledge base module 152.
- the Event Processing Server 210 communicates with the Event Detection Server 110 over the network 160.
- the Event Processing Server 210 is included in the nonvolatile memory 120 of Event Detection Server 110.
- the Event Processing Server 210 is configured to communicate directly with the Event Detection Server 110.
- An exemplary event processing server 210 may be but is not limited to MongoDB® or ElasticSearch® .
- an exemplary method 300 of detecting and verifying social media events is disclosed.
- information from social media platform 180 is retrieved by the ingestion module 122 of event detection server 110.
- the ingestion module 122 may include scripts or code that interface with the social media platform 180 application API. The scripts or code are also able to request and pull information from the APIs.
- the ingestion module 122 may determine the location of the ingested data and the user and append location information as metadata to the ingested data.
- the ingestion module 122 stores the ingested data into the ingested data store 242 of event processing server 210.
- metadata may also be generated by the ingestion module 122 and appended to the ingested data prior to storage in the ingested data store 242.
- the knowledge base module 152 may compile the list of credible sources using information gathered from the ingested data.
- the knowledge base module 152 stores the list of credible sources in the knowledge base data store 248.
- the knowledge base module 152 may analyze user profiles from the ingested data to capture information such as user affiliations or geography to be used for compilation of the list of credible sources.
- the knowledge base module 152 takes established credible users and reviews lists generated by the user for relevant information that may be used to generate the list of credible sources.
- the knowledge base module 152 continually updates knowledge base data store 248 as further social media data are ingested and may be evaluated at a predetermined frequency to ensure the information is current.
- the filtering module 124 retrieves the ingested data from ingested data store 242 and processes the ingested data. Exemplary processing by the filtering module 124 may include language detection and profanity detection. In one implementation, the filtering module 124 determines the language of the ingested data and eliminates ingested data that are not in English. In an alternative implementation, elimination of ingested data can be for other languages.
- the filtering module 124 may also detect profane terms in the ingested data and flag the ingested data that contains profanity. Ingested data containing profanity is then eliminated by the filtering module 124. In one implementation, the detection of profanity is based on querying a dictionary set of profane terms.
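The language and profanity steps described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `IngestedItem` type, its `lang` field (assumed to be filled in upstream by some language detector), and the placeholder contents of `PROFANE_TERMS` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the "dictionary set of profane terms";
# the entries are mild placeholders chosen for illustration only.
PROFANE_TERMS = {"darn", "heck"}

PUNCTUATION = ".,!?;:\"'()"


@dataclass
class IngestedItem:
    text: str
    lang: str  # assumed to be supplied by an upstream language detector


def contains_profanity(text: str, dictionary: set[str] = PROFANE_TERMS) -> bool:
    """Return True if any token of the text appears in the dictionary."""
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return any(tok in dictionary for tok in tokens)


def filter_items(items: list[IngestedItem], keep_lang: str = "en") -> list[IngestedItem]:
    """Keep items in the target language that contain no dictionary term."""
    return [
        item
        for item in items
        if item.lang == keep_lang and not contains_profanity(item.text)
    ]
```

Passing a different `keep_lang` corresponds to the alternative implementation in which a language other than English is retained.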
- the filtering module 124 may utilize a classification algorithm that removes ingested data that is recognized to be spam, chat or advertisements. Exemplary indication of spam would be ingested data saying "follow me @xyz”. Exemplary chat in ingested data may be general chatter about daily lives like "good morning”.
- Exemplary advertisements in ingested data may contain language such as "click here to buy this superb T-shirt for $10.”
- the classification algorithm is based on a machine learning model that has been trained on a number of features based on language (e.g., terms used in constructing the data), message quality (e.g., presence of capitalization, emoticons), and user features (e.g., average registration age).
- Exemplary machine learning models include, but are not limited to, Support Vector Machines, Random Forests, and Regression Models.
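To make the feature families above concrete, the sketch below turns one message into a small numeric feature dictionary that a classifier such as a Support Vector Machine could consume. The specific features, the emoticon pattern, and the `account_age_days` parameter are illustrative assumptions; the disclosure does not enumerate the exact features used.

```python
import re

# Assumed, deliberately small emoticon pattern, e.g. ":)", ";-)", "=D".
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def message_features(text: str, account_age_days: float) -> dict[str, float]:
    """Build illustrative language, message-quality and user features."""
    tokens = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = (
        sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    )
    return {
        "num_tokens": float(len(tokens)),
        "upper_ratio": upper_ratio,  # share of capitalized letters
        "num_emoticons": float(len(EMOTICON_RE.findall(text))),
        "has_mention": float(any(tok.startswith("@") for tok in tokens)),
        "has_price": float(bool(re.search(r"\$\d", text))),
        "account_age_days": float(account_age_days),  # user feature
    }
```

Feature dictionaries of this kind would then be vectorized and passed to whichever trained model is in use.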
- the filtered ingested data is then stored in filtered data store 141.
- the organization module 126 retrieves the now filtered ingested data from filtered data store 141 and detects key concepts in the ingested data. In one implementation, the organization module 126 detects semantic and syntactic structures in the ingested data.
- the organization module 126 may apply part-of-speech tagging, through a Part-Of-Speech tagger, on the ingested data. For example, the organization module 126 recognizes verbs, adverbs, proper nouns, and adjectives in the ingested data.
- Part-of-speech tagging notations or identified key concepts may then be stored into the organization data store 142.
- the Part-of-speech tagging notations or identified key concepts may be appended to the ingested data metadata and stored into the organization data store 142.
- All key concepts, proper nouns, hashtags, and any list terms found in the ingested data are designated as a 'markable'.
- the markable may be further concatenated to produce markables that are more meaningful. For example, if "New" followed by "York" has been identified as a markable, then the terms are concatenated.
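A rough sketch of markable extraction with concatenation of adjacent terms follows. It substitutes a capitalization heuristic for the Part-Of-Speech tagger, which is an assumption made only to keep the example self-contained; a real tagger would be needed to avoid treating sentence-initial words as proper nouns.

```python
PUNCTUATION = ".,!?;:\"'()"


def extract_markables(text: str) -> list[str]:
    """Collect hashtags and runs of capitalized tokens as markables.

    Adjacent capitalized tokens are joined, so "New" followed by "York"
    yields the single markable "New York". Capitalization is only a crude
    stand-in for proper-noun tags from a part-of-speech tagger.
    """
    markables: list[str] = []
    run: list[str] = []  # current run of adjacent capitalized tokens

    def flush() -> None:
        if run:
            markables.append(" ".join(run))
            run.clear()

    for raw in text.split():
        tok = raw.strip(PUNCTUATION)
        if tok.startswith("#") and len(tok) > 1:
            flush()
            markables.append(tok)
        elif tok[:1].isupper():
            run.append(tok)
            if raw[-1] in PUNCTUATION:
                flush()  # punctuation ends the run
        else:
            flush()
    flush()
    return markables
```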
- the clustering module 128 obtains organized ingested data from organization data store 142 and creates a database of key concepts with a reference to the corresponding ingested data.
- the referenced corresponding ingested data may be in the form of a unit data. This database is then stored in cluster data store 143.
- each key concept has a predefined time frame to grow to a minimum count of unit data required to be considered a unit cluster or else it is discarded.
- An exemplary threshold count may be but is not limited to, three (3) unit data for a key concept.
- At step 314, the clustering module 128 generates a unit cluster.
- the unit data corresponding to the markable are generated as the unit cluster in step 314 and are removed from the database in step 316.
- the markables in the database may be reviewed. For markables that have not exceeded a predefined time window (e.g., 2 hours), the process starts again from step 302 with newly ingested data. To illustrate, this may be social media information that is so fresh that other users have not yet had a chance to mention it.
- markables that never grow to the minimum threshold of unit data within the predefined time window (e.g., 2 hours) are discarded.
- the discarded markables and unit data may be sent to the emitted data store 246 along with other metadata about it.
- social media information that no other users are mentioning might not be an event of importance to a professional consumer.
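The markable database, the unit data threshold, and the time-window discard can be sketched with a hashmap keyed by markable. The class and method names are hypothetical; the threshold of three unit data and the two-hour window are the exemplary values from the text, and timestamps are assumed to be plain seconds.

```python
from collections import defaultdict

MIN_UNIT_DATA = 3               # exemplary threshold from the text
WINDOW_SECONDS = 2 * 60 * 60    # exemplary 2-hour window from the text


class MarkableIndex:
    """Hashmap of markable -> distinct unit data ids (illustrative only)."""

    def __init__(self, min_count: int = MIN_UNIT_DATA, window: float = WINDOW_SECONDS):
        self.min_count = min_count
        self.window = window
        self.units: defaultdict[str, set[str]] = defaultdict(set)
        self.first_seen: dict[str, float] = {}

    def add(self, markable: str, unit_id: str, now: float):
        """Record a mention; return the unit cluster once the threshold is met.

        On reaching the threshold the markable and its unit data are removed
        from the index, mirroring the removal from the database after the
        unit cluster is generated.
        """
        self.first_seen.setdefault(markable, now)
        self.units[markable].add(unit_id)
        if len(self.units[markable]) >= self.min_count:
            del self.first_seen[markable]
            return self.units.pop(markable)
        return None

    def expire(self, now: float) -> dict[str, set[str]]:
        """Discard markables older than the window and return them,
        e.g. for forwarding to an emitted data store."""
        stale = [m for m, t in self.first_seen.items() if now - t > self.window]
        emitted = {}
        for markable in stale:
            del self.first_seen[markable]
            emitted[markable] = self.units.pop(markable)
        return emitted
```

Storing ids in a set means repeated mentions by the same unit data do not count toward the threshold, which matches the requirement for distinct ingested data.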
- Once the unit cluster is generated, its corresponding markables and unit data are removed from the database in step 316.
- the newly generated unit cluster is checked against a set of previously generated event detected clusters, at step 318.
- the set of previously generated event detected clusters may be located in the cluster data store 143.
- generated clusters may be located in the generated cluster data store 244 of the event processing server 210.
- At step 324, the unit cluster is determined to be a new event detected cluster by the clustering module 128 and is stored into cluster data store 143.
- If there is a match to existing generated event detected clusters, based on a set of predefined rules, a decision is made at step 320 either to merge the two similar clusters or to keep them as two separate clusters.
- the decision to merge may be based on the same underlying concepts.
- the clustering module 128 merges the clusters and stores the merged event detected cluster into cluster data store 143. For example, if social media information is the same as a previously detected event, the social media information is then merged with the previously detected event.
- the unit cluster is determined to be a new event detected cluster and is stored into cluster data store 143.
- For example, social media information that is distinct from the previously detected events may be an event of importance to a professional consumer and should be noted as such; therefore the unit cluster is considered by the clustering module 128 as an event detected cluster.
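The merge-or-keep decision can be illustrated as below. The disclosure refers only to "a set of predefined rules" and "the same underlying concepts", so the Jaccard overlap of markables and the 0.5 threshold used here are assumptions standing in for those rules, and `EventCluster` is a hypothetical type.

```python
from dataclasses import dataclass, field


@dataclass
class EventCluster:
    markables: set[str]
    unit_ids: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def place_unit_cluster(
    unit: EventCluster, existing: list[EventCluster], threshold: float = 0.5
) -> EventCluster:
    """Merge the unit cluster into its best match, or register it as new.

    Returns the event detected cluster that now holds the unit data.
    """
    best = max(
        existing,
        key=lambda c: jaccard(c.markables, unit.markables),
        default=None,
    )
    if best is not None and jaccard(best.markables, unit.markables) >= threshold:
        best.markables |= unit.markables
        best.unit_ids |= unit.unit_ids
        return best
    existing.append(unit)  # no sufficiently similar cluster: a new event
    return unit
```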
- enrichments may be applied to the event detected cluster.
- exemplary enrichments include, but are not limited to, topic categorization, summarization, newsworthiness, opinion and credibility.
- the topic categorization module 131 may determine one or more classifications for the event detected cluster.
- the classification may be drawn from a taxonomy of predefined categories (e.g., politics, entertainment).
- the classification is added to the metadata for the event detected cluster.
- the summarization module 132 may select a unit data in the event detected cluster that best describes the cluster. The selected unit data is used as a summary for the event detected cluster. In a further implementation, the summarization module 132 may also utilize metrics such as the earliest unit data or a popular unit data in the generation of the summary for the event detected cluster. The summary is added to the metadata for the event detected cluster.
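One way to picture the selection of a representative unit data is a simple word-overlap heuristic. The source does not specify how "best describes" is computed, so the overlap score below and the names `tokens` and `pick_summary` are assumptions for illustration only.

```python
def tokens(text):
    """Lowercased whitespace tokens of a unit data text."""
    return set(text.lower().split())

def pick_summary(unit_data):
    """Return the unit data whose words overlap most with the rest of the cluster."""
    token_sets = [tokens(u) for u in unit_data]

    def score(i):
        return sum(
            len(token_sets[i] & token_sets[j])
            for j in range(len(unit_data))
            if j != i
        )

    best = max(range(len(unit_data)), key=score)
    return unit_data[best]
```

Other metrics mentioned in the text, such as the earliest or most popular unit data, could be used as tie-breakers.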
- the newsworthiness module 133 uses a newsworthiness algorithm to calculate a newsworthy score.
- the newsworthy score is an indication of the importance of the event detected cluster from a journalistic standard. For example, an event detected cluster concerning an airplane crash for a breaking news event is considered more important than a cluster around a viral celebrity picture.
- the newsworthiness algorithm is a supervised machine learning algorithm that has been trained on a newsworthy set of ingested data and predicts a newsworthy score for any ingested data that is passed through it.
- the newsworthy score is added to the metadata for the event detected cluster.
- the opinion module 134 determines whether each unit data in the event detected cluster contains an opinion of a particular person or an assertion of a fact. In one implementation, for unit data that is an assertion of fact, a score indicative of the assertion being a fact is also assigned to the unit data, and likewise for an opinion. In a further implementation, the opinion module 134 executes in a two-stage process. In the first stage, a rule-based classifier is applied that uses simple rules based on the presence or absence of certain types of opinion or sentiment words, and/or the usage of personal pronouns, to identify opinions. In the second stage, all unit data indicated to be non-opinions are passed through a bag-of-words classifier that has been trained specifically to recognize fact assertions.
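The two-stage process can be sketched as a small cascade. The word lists and weights below are hypothetical, and the second stage is a toy bag-of-words scorer standing in for the trained classifier the text describes; none of these values come from the source.

```python
import re

# Assumed lexicons and weights, for illustration only.
OPINION_WORDS = {"think", "believe", "feel", "awful", "great", "hate", "love"}
PERSONAL_PRONOUNS = {"i", "my", "me", "we", "our"}
FACT_WEIGHTS = {"reported": 1.0, "confirmed": 1.5, "police": 0.8, "officials": 0.8}

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def classify(text, fact_threshold=1.0):
    """Return (label, score) where label is 'opinion', 'fact', or 'unknown'."""
    toks = words(text)
    # Stage 1: simple rules on opinion words and personal pronouns.
    if any(t in OPINION_WORDS or t in PERSONAL_PRONOUNS for t in toks):
        return "opinion", 0.0
    # Stage 2: bag-of-words score over text not flagged as opinion.
    score = sum(FACT_WEIGHTS.get(t, 0.0) for t in toks)
    return ("fact" if score >= fact_threshold else "unknown"), score
```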
- the determination of fact or opinion is then stored as a part of the event detected cluster metadata.
- the credibility module 135 determines the confidence score of each unit data in the event detected cluster.
- the confidence score is associated with three components: source credibility, cluster credibility, and tweet credibility.
- the score and information generated by the components are then stored as a part of the event detected cluster metadata.
- Source credibility relates to the source of the unit data. For example, an authority such as the White House stating that an event occurred is more credible than a random unknown user. In one implementation, source credibility is measured by an algorithm that uses features such as, but not limited to, the age of the user, the description, and the presence of a profile image on the social media account.
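As a rough illustration of a feature-based source credibility measure, the sketch below combines the three example features with fixed weights. The weights, the one-year cap on age, and the function name `source_credibility` are assumptions; the source does not state how the features are combined.

```python
def source_credibility(age_days, has_description, has_profile_image):
    """Toy credibility score in [0, 1] from three profile features.

    Weights are hypothetical: 0.5 for age (capped at one year),
    0.25 for a description, 0.25 for a profile image.
    """
    score = 0.5 * min(age_days / 365.0, 1.0)
    score += 0.25 if has_description else 0.0
    score += 0.25 if has_profile_image else 0.0
    return score
```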
- Cluster credibility relates to what the information is.
- event detected clusters containing genuine events may have different growth patterns from fake event detected clusters; for example, a fake event might be driven by negative motivations such as purposely spreading rumors.
- a supervised learning model is used based on historical data that identifies likelihood of the event detected cluster being true or false based on growth patterns.
- Tweet credibility relates to the content of the individual tweets in the unit data and the language being mentioned therein.
- the unit data is evaluated against a set of textual words trained on credible and noncredible unit data.
- the verification module 150 analyzes the enrichments applied to the event detected cluster and its related unit data to determine the level of accuracy of the event detected cluster.
- the verification module 150 may generate a veracity calculation from the unit data based on three categories: user, tweet level (or social media data level), and event.
- the verification module 150 may compute a probability of the propagating rumor being true using extracted language, user and other metadata features from event detected cluster and its related unit data.
- step 346 the enriched event detected cluster is then stored in generated cluster data store 244 of the event processing server 210.
- FIG. 4a illustrates an exemplary description of categories used in a veracity calculation.
- the first category for consideration pertains to a user category.
- the user features 402a are boolean and may include, but are not limited to: name, description, url, location, matches cluster location, witness, protected (i.e., private or not), verified, as illustrated in FIG. 4a.
- the user category captures user-specific information gathered from the user's social media profile. Exemplary features such as location or url can weigh into the credibility of the user. For example, if the user's location is not disclosed, it is hard to determine the accuracy of what they are saying. However, if their location matches the location of the event detected cluster, the incident as gathered from the ingested data is more likely to be viewed as accurate.
- the secondary category for consideration is on the social media level.
- the social media features 402b of boolean type may include, but are not limited to: multimedia, elongated word, url and news url, as illustrated in FIG. 4a.
- the social media category may further include features of numerical type: number of positive sentiment words, number of negative sentiment words, and sentiment score. For example, if a user attaches a picture or other multimedia to the reported incident, that can be a clear indication of the accuracy of the reporting in the social media data.
- the type of words used by the user, especially elongated words (e.g., OMMMMMMGGG!), might convey the user's shock related to the event and lend itself to a more credible event. However, if the user uses a url in the social media data, the user might be sharing by reiteration.
- the sentiment of the ingested data is also examined. The ingested data may be checked against a set of positive and negative words for an indication of the sentiment. As an example, if the event detected cluster pertains to a disaster, the general tone of the ingested data should be negative.
- the third category for consideration is event features.
- the event features 402c may include: event topic, which may be categorical type, and highest retweet count, retweet sum, hashtag sum, negation fraction, support fraction, question fraction, which may be of numerical type, as illustrated in FIG. 4a.
- event topic which may be categorical type
- if the ingested data are Twitter tweets, the highest retweet count and retweet sum are valued, on the assumption that the count correlates with the popularity of the event, which weighs in favor of the event being accurate.
- hashtags may also be an indicator of the event.
- sports related ingested data may contain many hashtags, while disaster related ingested data may not, as there might not be time to list many hashtags when a disaster is unfolding at the location of the user.
- the algorithm also takes into consideration the fraction of ingested data that deny, believe or question the event.
- the verification module 150 generates a matrix that is aggregated based on the three categories to generate a veracity score between -1 and 1, ranging from a false rumor to a true story.
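The event-level fractions and the bounded score can be pictured with a deliberately simplified sketch. It assumes each unit data has already been labeled as supporting, denying, or questioning the event, and it reduces the aggregation to "support fraction minus negation fraction"; this is a stand-in and not the aggregated feature matrix described in the text.

```python
from collections import Counter

def stance_fractions(stances):
    """stances: list of 'support', 'deny', or 'question' labels, one per unit data."""
    counts = Counter(stances)
    total = len(stances) or 1  # avoid division by zero for an empty cluster
    return {k: counts.get(k, 0) / total for k in ("support", "deny", "question")}

def toy_veracity(stances):
    """Toy score in [-1, 1]: -1 leans false rumor, 1 leans true story."""
    f = stance_fractions(stances)
    score = f["support"] - f["deny"]
    return max(-1.0, min(1.0, score))
```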
- the veracity score 550 may be added to the metadata of the event detected cluster.
- the veracity score 614 may be presented to the user in the form of circle representations.
- FIG. 4b illustrates the determination by the verification module 150 of a probability score for the event detected cluster being true based on information collected from social media.
- Twitter is used as an exemplary social media platform.
- the verification module 150 first determines if the unit data of the event detected cluster is an expert type assertion or a witness type assertion.
- Expert type assertions are assertions that are likely to be made only by people or organizations that are considered authoritative for that assertion.
- An exemplary expert type assertion may be the company Apple® asserting that they will be releasing a new iPhone®.
- the verification module 150 may invoke the knowledge base module 152 to determine if the identified user of the unit data (i.e., Apple®) is a credible source and awards a higher score if the unit data is originating from a credible source.
- witness type assertions are assertions any random user may potentially make. These include crisis-type events (for example, User123 asserts that an explosion took place in a particular area).
- the verification module 150 compares either the topic or the geography of the unit data against other unit data from the same geographic area. If other users are not mentioning the same assertion during the same time period, then a lower score may be assigned.
- a knowledge base of organizations as determined by the knowledge base module 152 may also be considered.
- Social media data from the collective knowledge base of organizations may also be processed by the Event Detection Server 110 to determine whether they are discussing a similar assertion, and may be compared with the current unit data to determine its level of authenticity.
- the verification module 150 may then assign a probability that indicates its likeliness to be true or false.
- the verification module may algorithmically compute a score between -1 and 1, where 0 is neutral and reflects a lack of information on the matter, 1 reflects the highest level of confidence that the assertion is true, and -1 reflects the highest level of confidence that it is false. For example, if very credible sources have confirmed that an assertion is true, then its score is likely 1. However, where concrete evidence of its authenticity or truthfulness cannot be found, the score will fall between -1 and 1 depending on the type of evidence collected. The confidence may be re-evaluated when new evidence is included in the assessment.
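A small sketch of a score that stays at 0 without information and is re-evaluated as evidence arrives is given below. The class name `AssertionConfidence`, the credibility-weighted average, and the `prior_weight` term that pulls the score toward neutral are all assumptions; the source does not give the formula.

```python
class AssertionConfidence:
    """Running confidence in [-1, 1] for a single assertion."""

    def __init__(self, prior_weight=1.0):
        self.prior_weight = prior_weight  # assumed pull toward the neutral score 0
        self.evidence = []                # list of (credibility, direction)

    def add(self, credibility, supports):
        """Record one piece of evidence; direction is +1 (supports) or -1 (denies)."""
        self.evidence.append((credibility, 1 if supports else -1))

    def score(self):
        total = sum(w for w, _ in self.evidence)
        signed = sum(w * d for w, d in self.evidence)
        return signed / (total + self.prior_weight)
```

Calling `score()` after each `add()` gives the re-evaluated confidence; with no evidence it returns 0.0.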
- the ingested data may be but is not limited to a tweet.
- the organization module 126 analyzes semantic and syntactic structures in the ingested data to identify key concepts.
- terms 502a - 502d such as "confederate flag", "rally", "Linn Park" and "Birmingham" are key concepts identified by organization module 126.
- although four key concepts are identified in this example, there may be n number of terms identified by the organization module 126.
- the key concepts are stored in a database 500, with the key concepts designated as a "markable" and the corresponding originating ingested data as a "unit data", as illustrated in FIG. 5b. As shown in FIG. 5b, there may be a column 504 for n number of markables, each with a corresponding column 506 pertaining to n number of unit data.
- the database may be a hash table or a hashmap.
- each xth ingested data is represented as "Unit data x".
- the second ingested data may be represented as "Unit data 2".
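The markable-to-unit-data table described above maps naturally onto a hash map. The sketch below assumes the markables have already been extracted for each ingested data; the function name `index_markables` and the lowercasing of keys are illustrative choices, not details from the source.

```python
from collections import defaultdict

def index_markables(ingested):
    """Build a hash map from markable to the unit data that mention it.

    ingested: iterable of (unit_data_id, list_of_markables) pairs.
    """
    table = defaultdict(list)
    for unit_id, markables in ingested:
        for markable in markables:
            table[markable.lower()].append(unit_id)
    return table
```

For example, indexing "Unit data 1" with markables "rally" and "Linn Park" and "Unit data 2" with "rally" yields a "rally" entry listing both unit data.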
- the unit cluster becomes the event detected cluster if the clustering module 128 determines that there is not already an existing cluster, or if there is an existing cluster but based on predetermined rules, the clustering module 128 determines not to merge with an existing cluster.
- the unit cluster comprises a threshold number n of unit data (e.g., 3 unit data).
- FIG. 5e is another exemplary ingested data in the form of a tweet.
- This ingested data is one of the many unit data from an exemplary event detected cluster pertaining to "Mugabe: Foreign firms 'stole diamonds': Zimbabwean President Robert Mugabe accuse foreign mining companies of ...". This ingested data was also selected by the summarization module 132 as a representative summary of the event detected cluster.
- FIGS. 5f-5k are exemplary metadata of ingested data in FIG. 5e.
- the ingested data comprises default metadata generated by the social media platform (e.g., Twitter metadata), as illustrated in FIGS. 5f - 5h and 5k.
- the Event Detection Server generates additional metadata, which is appended to the metadata of the ingested data described above, as illustrated in FIGS. 5i and 5j.
- the added metadata includes, but is not limited to, the credibility score 535 as determined by the credibility module 135; the opinion score 534 as determined by the opinion module 134; the profanity indicator 524 as determined by filtering module 124 and the markables 526 as determined by organization module 126.
- FIGS. 5l - 5n are exemplary metadata of an event detected cluster with the ingested data of FIG. 5e as one of the related unit data.
- the cluster metadata includes, but is not limited to, the
- Each markable 504a may also include the respective unit data 506a information.
- the cluster metadata includes, but is not limited to, unit data 506b forming the event detected cluster and the veracity score 550 as computed by verification module 150.
- GUI graphical user interface
- the browser 172 includes an application interface 600 that includes a plurality of columns for viewing of a list of event detected clusters pertaining to channels 602. Within each channel are the event detected clusters relating to the topic of the channel.
- the default channels provided by the application interface 600 allow the user to be notified of events that might be new or trending without having to search by key terms.
- a user through the browser 172 of access device 170 may enter a search term in search field 601 to tailor the application interface 600 to their needs.
- the UI module 232 of Event Processing Server 210 will then retrieve any event detected clusters matching the user's search term from the generated cluster datastore 244.
- the results are rendered by the UI module 232 and presented to the user through browser 172 under channel 602a of application interface 600, with the channel representing the search term.
- channel 602c representing the search term "GOP" and channel 602d for "Democrats" may be presented for viewing.
- the indication 604 provided before the text of the event detected cluster depicts the number of unit data in the event detected cluster.
- the event detected cluster may also be presented with the topic 606 as determined by topic categorization module 131; categories 608, which may be customized terms; and summary 616 as determined by summarization module 132.
- the event detected cluster may also contain concepts 610, which are the markables from the unit data that formed the event detected cluster, as determined by organization module 126.
- the event detected cluster may further be presented with the hashtags 612 used in the ingested data as detected by the organization module 126, and with newsworthiness indication 618 as determined by newsworthiness module 133.
- newsworthiness indication 618 might be depicted as a filled in star.
- the event detected cluster may also be presented with veracity score 614 as determined by verification module 150.
- the veracity score may be in the form of filled-in circles indicative of the strength of the veracity determination, with 5 solid circles as near accurate.
- the user may select create new channel 620 based on concepts in an event detected cluster. The newly created channel is based on identified concepts 610.
- in FIG. 6c, the set of unit data 632a-632n corresponding to the selected event detected cluster 631 is presented.
- the user may utilize link 634 to view a specific unit data.
- channel options 622 allow for filtering of the event detected cluster results presented by UI module 232 on browser 172 of the access device 170.
- the UI module 232 receives the filter designation as selected by the user in the application interface 600 and processes the request in accordance with the filters illustrated in relation to FIGS. 7a-7e.
- filtering is available based on topic 710, sort method 720, category 730 and advance 740 filtering.
- FIG. 7b illustrates an exemplary topic filter 710.
- the topic filter 710 contains a list of topic filters 712a - 712n. They may be, but are not limited to, topics pertaining to:
- business/finance, crisis, entertainment, hard news, health/medical, law/crime, life/society, politics, sports, technology, weather, or other as identified by the topic categorization module 131.
- FIG. 7c illustrates an exemplary sort filter 720.
- the sort filter 720 contains options 722a - 722n, which may be but are not limited to sorting by: newest, updated, most popular, trending, newsworthy, and veracity.
- FIG. 7d illustrates an exemplary category filter 730.
- the category filter 730 contains a list of category filters 732a - 732n.
- the category options may be but are not limited to: breaking news, conflict, disaster, dow, financial risks, geopolitical risks, legal, legal risks, markets, oil, politics, shootings, U.S. elections.
- FIG. 7e illustrates the advanced options available upon selection of advance 740 on application interface 600.
- the advanced options for the selected channel may include reset defaults 744, timeline 746 with a time frame selection, minimum posts 748 count, and three levels, strict 760, medium 762 or loose 764, for each of fact 750, newsworthiness 752 and veracity 754.
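The channel filters can be sketched as a simple predicate over cluster metadata. The numeric cutoffs attached to strict, medium and loose are hypothetical, as are the metadata keys `newsworthy_score` and `veracity_score` and the function name `filter_clusters`; the source names the levels but not their values.

```python
# Hypothetical cutoffs for the three levels.
LEVELS = {"strict": 0.8, "medium": 0.5, "loose": 0.2}

def filter_clusters(clusters, min_posts=1, newsworthiness="loose", veracity="loose"):
    """Keep clusters that meet the minimum post count and the selected levels."""
    return [
        c for c in clusters
        if len(c["unit_data"]) >= min_posts
        and c["newsworthy_score"] >= LEVELS[newsworthiness]
        and c["veracity_score"] >= LEVELS[veracity]
    ]
```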
- FIGS. 1 through 7e are conceptual illustrations allowing for an explanation of the present disclosure.
- Various features of the system may be implemented in hardware, software, or a combination of hardware and software.
- some features of the system may be implemented in one or more computer programs executing on
- Each program may be implemented in a high level procedural or object-oriented programming language to communicate with a computer system or other machine. Furthermore, each such computer program may be stored on a storage medium such as read-only-memory (ROM) readable by a general or special purpose programmable computer or processor, for configuring and operating the computer to perform the functions described above.
- FIG. 8 depicts an embodiment of a system 800 for detecting and coreferencing events across media types, the system 800 including a cross-media event detection and coreferencing system 804, a news production system 808, a social media system 812, an event production system 816, and a user system 820.
- the cross-media event detection and coreferencing system 804 includes a news event extraction module 824, a social media event extraction module 828, an event system interface module 832, and an event coreferencing and alerting module 836.
- the news event extraction module 824 intakes a stream of news articles from the news production system 808, detects and extracts information about events referenced by the news articles, and generates and stores representations of the events.
- the news production system 808 may be any system that produces news articles, such as a newspaper system or online service, news content system or online service, etc.
- the social media event extraction module 828 intakes a stream of social media postings from the social media system 812, detects and extracts information about events referenced by the postings, and generates and stores representations of the events.
- the social media system 812 may be any social media platform that produces social media postings, such as Twitter, Facebook, Instagram, etc.
- An optional event system interface module 832 intakes a stream of event information from the event production system 816, and stores representations of the events.
- the event production system 816 may be any system that produces event information, such as scientific systems that directly produce weather data, earthquake data, tsunami data, etc.
- the event coreferencing and alerting module 836 receives the representations of events generated by the news and social media event extraction modules 824, 828, determines whether any of the news articles and social media postings reference the same event, i.e., coreference the event, and generates and stores a coreferenced event representation for any such coreferenced events.
- the event coreferencing and alerting module 836 also receives the event representations from the event system interface module 832, and integrates these into its event coreferencing.
- the event coreferencing and alerting module 836 further generates and outputs alerts for coreferenced events to the user system 820, for use in decision making and/or system control by the user and/or user system 820.
- the user system 820 may be any system used by a user, such as an individual, organizational, or governmental entity, etc., to receive coreferenced event alerting from the cross-media event detection and coreferencing system 804.
- the cross-media event detection and coreferencing system 804 thus detects, extracts and provides coreferenced event representations for events referenced across different media types, including both news articles and social media, and therefore greatly improves the quality of generated event information by combining aspects of different media types, including the ubiquitous coverage of social media and the reliability and context of news articles, which provides a correspondingly improved basis for decision making and/or control by users and user systems 820.
- a system for detecting and coreferencing events across media types may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 8.
- FIG. 9 depicts an embodiment of the cross-media event detection and coreferencing system 804, showing embodiments of the news event extraction module 824, social media event extraction module 828, event system interface module 832, and event coreferencing and alerting module 836 in more detail.
- the news event extraction module 824 includes a news intake module 840, a news event extraction module 844, and a news event database module 848.
- the news intake module 840 retrieves a stream of news articles from the news production system 808.
- the news event extraction module 844 detects events referenced by the news articles, extracts information about the detected events, and generates an event representation including attributes of the event based on the extracted information.
- the news event database module 848 stores the generated event representations.
- the social media event extraction module 828 includes a social media intake module 852, a social media event extraction module 856, and a social media event database module 860.
- the social media intake module 852 retrieves a stream of social media postings from the social media system 812.
- the social media event extraction module 856 detects events referenced by the social media postings, extracts information about the detected events, and generates an event representation including attributes of the event based on the extracted information.
- the social media event database module 860 stores the generated event representations.
- the event system interface module 832 includes an event system intake module 864, an event system processing module 868, and an event database module 872.
- the event system intake module 864 retrieves a stream of event information from the event production system 816.
- the event processing module 868 processes the received event information to generate event representations including attributes of the events based on the event information.
- the event database module 872 stores the generated event representations.
- the event coreferencing and alerting module 836 includes an event coreferencing module 876, a coreferenced event database module 880, and an event alerting module 884.
- the event coreferencing module 876 retrieves the event representations stored for the stream of news articles and for the stream of social media postings, determines whether any news article and social media posting references a same event, and generates a
- the event coreferencing module 876 also retrieves the event representations stored from the retrieved external event information, determines whether any news article, social media posting and external event information reference a same event, and generates a
- the coreferenced event database module 880 stores the generated coreferenced event representations.
- the event alerting module 884 provides an alert to the user system 820 including coreferenced event representations, for use by the user or user system in decision making and/or system control.
- FIG. 10 depicts an embodiment of a method 1000 of detecting and coreferencing events across media types. The method may be performed by or involving components of the system 800 for detecting and coreferencing events across media types of FIG. 8, such as by embodiments of the cross-media event detection and coreferencing system 804 of FIG. 9. The method detects, extracts and provides to the user system 820 alerts with coreferenced event representations for events referenced by both news articles and social media.
- the method thus greatly improves the quality of generated event information by combining aspects of different media types, including the ubiquitous coverage of social media and the reliability and context of news articles.
- the method thus also provides a correspondingly improved basis for decision making and/or control by the user and/or user system 820.
- the method begins at step 1002.
- a stream of news articles is retrieved.
- the stream of news articles may be retrieved by the news article intake module 840, such as by communicating over one or more communication networks with an interface module 890 of the news production system 808.
- the news article intake module 840 may generate and transmit over the communication network one or more requests to the interface module 890 of the news production system 808, which may be an application programming interface (API), and receive from the interface module 890 one or more transmissions over the API
- event representations for events referenced by the news articles are generated and stored.
- the event representations may be generated by the news event extraction module 844, such as discussed in more detail below.
- the generated event representations may be stored by the news event database module 848.
- a stream of social media postings is retrieved.
- the stream of social media postings may be retrieved by the social media intake module 852, such as by communicating over one or more communication networks with an interface module 894 of the social media system 812.
- the social media intake module 852 may generate and transmit over the communication network one or more requests to the interface module 894 of the social media system 812, which may be an API, and receive from the interface module 894 one or more transmissions over the communication network including the stream of social media postings in response.
- event representations for events referenced by the social media postings are generated and stored.
- the event representations may be generated by the social media event extraction module 856, such as discussed in more detail below.
- the generated event representations may be stored by the social media event database module 860.
- a stream of externally generated event information is retrieved.
- the stream of externally generated event information may be retrieved by the event system intake module 864, such as by communicating over one or more communication networks with an interface module 898 of the event production system 816.
- the event system intake module 864 may be configured to receive a feed of event information from the interface module 898 of the event production system 816, which may be an API.
- event representations corresponding to the externally generated event information are generated and stored.
- the event representations may be generated by the event processing module 868.
- the generated event representations may be stored by the event database module 872.
- although FIG. 9 separately depicts the news event database module 848, social media event database module 860, event database module 872, and coreferenced event database module 880, in embodiments these modules may be implemented using either separate databases or a single database.
- at step 1016, it is determined whether any of the news articles and social media postings reference a same event.
- a news article and a social media posting referencing a same event are referred to herein as the news article and the social media posting coreferencing the event.
- Event coreferencing is determined by the event coreferencing module 876, such as discussed in more detail below.
- coreferenced event representations are generated and stored for determined coreferenced events.
- the coreferenced event representations are generated by the event coreferencing module 876, and stored by the coreferenced event database module 880.
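The coreferencing step can be pictured as pairing news and social media event representations that agree on their attributes. The matching rule below (same type, same location, times within a window) is an assumption made for illustration, as are the `Event` fields and the six-hour default window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    source: str       # "news" or "social"
    event_type: str
    location: str
    time: datetime

def coreferenced_pairs(news_events, social_events, window=timedelta(hours=6)):
    """Return (news, social) pairs judged to reference the same event."""
    pairs = []
    for n in news_events:
        for s in social_events:
            if (n.event_type == s.event_type
                    and n.location.lower() == s.location.lower()
                    and abs(n.time - s.time) <= window):
                pairs.append((n, s))
    return pairs
```

A production system would likely replace the exact-match rule with a similarity measure over the event attributes.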
- alerts regarding any coreferenced events are provided.
- the alerts may be provided by the event alerting module 884, such as by communicating over one or more communication networks with an interface module 902 of the user system 820.
- the alerting module 884 may generate and transmit over the communication network one or more alert emails, text messages, feed items, API messages, etc. containing the coreferenced event representations to the interface module 902 of the user system 820.
- the alerting module 884 may receive a transmission from the interface module 902 of the user system 820 containing one or more criteria defining what types of alerts are to be provided to the user system 820, such as defining the type, location, time, etc., of events, and the alerts transmitted by the alerting module 884 may contain information for correspondingly selected coreferenced events.
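Selecting coreferenced events against user-supplied criteria can be sketched as below. The dictionary layout of events and criteria, and the function name `select_alerts`, are assumptions; the source only says criteria may define the type, location, time, etc., of events.

```python
def select_alerts(events, criteria):
    """Return the events whose attributes satisfy every criterion.

    events: list of dicts such as {"type": ..., "location": ...}
    criteria: dict mapping attribute name -> set of accepted values
    """
    def matches(event):
        return all(event.get(attr) in accepted for attr, accepted in criteria.items())

    return [e for e in events if matches(e)]
```

An empty criteria dictionary selects every event.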
- the user and/or user system 820 performs control of components of the user system 820 based on the received alert.
- the type of control performed will generally depend upon the type of user system 820.
- manufacturing, supply chain management, or other business operations action may be performed based on the alert.
- a manufacturing system may contain a control component that performs supply chain management control, such as scheduling or routing supply chain deliveries, in response to an alert regarding an event in a region also containing a manufacturing plant.
- a trading action may be performed based on the alert.
- a financial trading system may contain a control component that performs trading, such as selling or buying financial commodities, in response to an alert regarding an event affecting a business stance of an organization. Other types of control are also possible.
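The supply chain example can be sketched as a control component that reacts to an alert when the event's region also contains a plant. The alert and plant data layouts, the action name `reroute_deliveries`, and the function name are hypothetical.

```python
def plan_supply_chain_actions(alert, plants):
    """Return rerouting actions for plants located in the alert's region.

    alert: dict with "location" and "event_type"
    plants: dict mapping plant name -> region
    """
    return [
        {"plant": name, "action": "reroute_deliveries", "reason": alert["event_type"]}
        for name, region in sorted(plants.items())
        if region == alert["location"]
    ]
```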
- the method ends at step 1024.
- a method of detecting and co-referencing events across media types may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 10.
- the event representations generated by the event detection and coreferencing system 804 provide a number of functionalities, including for storage by the event detection and coreferencing system 804, for use in comparing events by the event detection and coreferencing system 804, and for use to perform decision making and system control based on events in the user system 820.
- the event representations may include one or more attributes defining the event.
- event representations include an event type, an event location, an event time, and an event impact.
- event representations may include one or more of the who, what, where, when, why and how of the event (i.e., who was involved in the event, what type of event was it and/or what type of human and/or material impact did it have, where did the event occur, when did the event occur, why did the event occur, and how did the event occur), or variations thereof.
- Other embodiments may use other event attributes for event representations.
- the event representations also may include one or more of the news article and/or social media posting referencing the event, or links thereto.
- an event representation generated for an event referenced by a news article may include the news article or a link to the news article.
- An event representation generated for an event referenced by a social media posting or cluster of social media postings may include the social media postings or cluster of social media postings or a link or links thereto.
- An event representation generated for an event coreferenced by both a news article and a social media posting or cluster of social media postings may include the news article, the social media postings or cluster of social media postings, a link or links thereto, or any combination thereof.
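The attributes described above can be pictured as a simple record. The following is a minimal sketch only; the class name `EventRepresentation`, its field names, and the `is_coreferenced` helper are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class EventRepresentation:
    """Hypothetical container for the event attributes listed above."""
    event_type: str                                     # e.g. "fire", "flood"
    location: Optional[str] = None                      # resolved place name
    coordinates: Optional[Tuple[float, float]] = None   # (lat, lon), if geocoded
    time: Optional[datetime] = None
    impact: Dict[str, int] = field(default_factory=dict)    # e.g. {"injured": 10}
    news_links: List[str] = field(default_factory=list)     # articles or links
    social_links: List[str] = field(default_factory=list)   # postings or links

    def is_coreferenced(self) -> bool:
        # A coreferenced event carries sources from both media types.
        return bool(self.news_links) and bool(self.social_links)
```

Optional fields default to empty so that a representation built from only one media type remains valid.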
- FIG. 11 depicts an embodiment of the news event extraction module 824, including an event detection module 906 and an event attribute extraction module 910.
- the event detection module 906 detects events, and corresponding event types, referenced by the retrieved news articles.
- the event detection module 906 includes a filter module 914 and an event classifier module 918.
- the filter module 914 removes non-event related news articles from the stream of news articles.
- the event classifier module 918 classifies the type of event referenced by the remaining articles.
- the event attribute extraction module 910 extracts further information about the detected events, and generates an event representation including attributes of the event based on the extracted information.
- the event attribute extraction module 910 includes a candidate attribute extraction module 922, a location attribute extraction module 926, a time attribute extraction module 930, and an impact attribute extraction module 934.
- the candidate attribute extraction module 922 processes the news article to generate candidate event attributes.
- the location attribute extraction module 926 generates a location attribute for the event using the candidate attributes.
- the time attribute extraction module 930 generates a time attribute for the event using the candidate attributes.
- the impact attribute extraction module 934 generates an impact attribute for the event using the candidate attributes.
- a news event extraction module may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 11.
- FIG. 12 depicts an embodiment of a method 1200 of detecting and generating a representation of events referenced by news articles.
- Embodiments of the method of FIG. 12 may be used to perform the event representation generation and storage step 1006 of the method 1000 of FIG. 10.
- the method may be performed by or involving components of the event detection and coreferencing system 804, such as by or involving components of the news event extraction module 844 of FIG. 11.
- the method processes each of the retrieved stream of news articles to detect whether the article references an event of a predetermined set of event types, and, if so, generates a representation of the event referenced by the article.
- the method may operate on the stream of news articles in real time to provide a corresponding stream of detected and generated event representations.
- the method may be performed for each article in the stream.
- the method begins at step 1202.
- a filtering may be performed to remove articles not related to events.
- the filtering may be performed by comparing the article to a set of key words related to events, and if the article does not have any of the key words, deeming the article to be non-event related, and if it has any of the key words, deeming it to be event related.
- at step 1206, if in step 1204 the article is deemed to be not event-related, the method proceeds to step 1222, where the method ends, but if in step 1204 the article is deemed to be event-related, the method proceeds to step 1208.
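The key word filter of step 1204 can be sketched as a set-membership test. This is an illustrative sketch: the key word list below is invented for the example (the source does not enumerate the key words), and `is_event_related` is a hypothetical name.

```python
import re

# Illustrative key words only; the source does not list the actual set.
EVENT_KEYWORDS = {"fire", "flood", "storm", "strike", "attack", "explosion"}


def is_event_related(article_text: str, keywords=EVENT_KEYWORDS) -> bool:
    """Return True if the article contains at least one event key word."""
    tokens = set(re.findall(r"[a-z]+", article_text.lower()))
    return not tokens.isdisjoint(keywords)
```

Articles for which the function returns False would be dropped before the event type classification.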
- the type of event referenced by the article is determined.
- the determination may be performed using supervised machine learning, including composing a feature vector based on the news article, inputting the vector to a classifier, and the classifier predicting that the news article is one of a predetermined set of event types, or none of these event types, based on the vector.
- the predetermined set of event types may be selected to include types of events that will be useful for the user system to have knowledge of. For example, for a user system focused on events relevant to manufacturing, finance, security, policy, governance, planning and disaster coordination, in embodiments the event types may include: conflict, fire, flood, infrastructure breakdown, labor unavailability, storms, terrorism.
- the classifier predicts a discrete class label $y_i$, where $y_i \in \{\text{'conflict'}, \text{'fire'}, \text{'flood'}, \text{'infrastructure breakdown'}, \text{'labour unavailability'}, \text{'storms'}, \text{'terrorism'}, \text{'none'}\}$, for a given news article $x_i$.
- example classifiers include a Support Vector Machine (SVM), a Random Forest, and a Convolution Neural Network.
- the input feature vector may be composed to include word embeddings for words of the news article.
- the word embeddings may be customized by training a word embedding model using a combination of data sources, such as a filtered English Wikipedia dump and tokens extracted from news articles tagged with disaster or accident topic codes, allowing the model to capture the semantic structure of event-related news.
- at step 1210, if at step 1208 the article is classified as related to none of the predetermined set of event types, the method proceeds to step 1222, where the method ends, but if at step 1208 the article is classified as related to one of the predetermined set of event types, the event type attribute for the event is selected as the predicted event type, and the method proceeds to step 1212.
- the article is processed to extract candidates for remaining attributes of the event representation.
- the processing may include natural language processing the article to split the raw text into tokens based on morphological aspects of the text, and also provide additional information for each token, such as a part-of-speech tag, a named entity type, and a dependency tree, and using these determined enriched tokens as candidate attributes.
- mentions of entity types such as locations, dates, numerals etc. in the text, provides a set of candidate event attributes that efficiently narrows down the search space for extracting the true event attributes such as the event location, event time and event impact.
- the generated enriched tokens also provide leverage in the further stages of the news event extraction module.
- the part-of-speech tags capture the syntactic structure around the words, while the dependency trees can resolve structural ambiguity.
- the location attribute is determined.
- the location attribute may be determined by classifying locations of the candidate attributes. For example, the classification may be performed using supervised machine learning, such as SVM, including for each candidate location composing a feature vector, inputting the vector to a classifier, the classifier predicting whether or not that candidate location is the event location with an associated confidence level, and selecting the candidate location predicted as the event location with the highest confidence as the event location attribute.
- the determination of the location attribute may also include determining geographical coordinates of the selected location.
- the location attribute may be organized using a four-level hierarchy: country as level 0; first administrative area (e.g., state, province, etc.) as level 1; second administrative area (e.g. county, department, etc.) as level 2; and localities (e.g. city, towns, villages, etc.) as level 3.
- the last two components of the feature vector encode the geographical taxonomy while the rest of the features capture the syntactic and semantic context.
- a problem to address may be location ambiguity: several distinct locations may have the same name. For example, if the event location is identified as "Naples", it may be important to disambiguate whether the location referred to is "Naples, Italy", "Naples, Florida (US)", or "Naples, Illinois (US)". The ambiguity in the event location may be resolved based on spatial proximity clues. It is assumed that all the candidate locations are likely to be near each other (and hence near the event location). When the event location is ambiguous, all potential addresses for the selected event location are compared with the remaining candidate locations to
- a geocoder is queried to retrieve all the potential addresses corresponding to all candidate locations. Each address is arranged in the four-level hierarchy described above. For each potential address of the event location, a score is computed as the linear combination of an overlap score and a popularity score. The overlap score is computed by summing the height of the common subtrees between the potential event location address and the addresses of all other candidate locations. The popularity score is returned by the geocoder and is calculated using frequency-based statistics over Wikipedia articles. Finally, the address with the maximum score is selected and the corresponding geographical coordinates are used as the coordinates of the location attribute.
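The scoring described above can be sketched as follows. Several points are assumptions made for illustration: the "height of the common subtree" is interpreted as the number of leading hierarchy levels two addresses share, the weights `w_overlap` and `w_popularity` default to 1.0, and the addresses and popularity values are invented rather than returned by any real geocoder.

```python
from typing import List, Optional, Tuple

# (country, first admin area, second admin area, locality); None if unknown.
Address = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def common_prefix_height(a: Address, b: Address) -> int:
    """Number of leading hierarchy levels (country first) two addresses share."""
    height = 0
    for x, y in zip(a, b):
        if x is None or y is None or x != y:
            break
        height += 1
    return height


def disambiguate(
    event_candidates: List[Tuple[Address, float]],  # (address, popularity)
    context_addresses: List[Address],               # other candidate locations
    w_overlap: float = 1.0,
    w_popularity: float = 1.0,
) -> Address:
    def score(item):
        address, popularity = item
        overlap = sum(common_prefix_height(address, c) for c in context_addresses)
        return w_overlap * overlap + w_popularity * popularity

    return max(event_candidates, key=score)[0]


# Invented example data for the "Naples" case.
naples = [
    (("Italy", "Campania", "Naples", "Naples"), 0.9),
    (("United States", "Florida", "Collier County", "Naples"), 0.6),
    (("United States", "Illinois", "Scott County", "Naples"), 0.1),
]
context = [
    ("United States", "Florida", None, None),
    ("United States", "Florida", "Collier County", "Marco Island"),
]
best = disambiguate(naples, context)
```

With this data the Florida address wins because it shares the most hierarchy levels with the other candidate locations, even though the Italian address has the higher popularity value.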
- the time attribute is determined.
- the time attribute may be determined by using a rule-based model to select one of the temporal expressions of the candidate attributes as the time attribute for the event.
- a rule-based model may select as the time attribute the first occurring temporal expression in the article text.
- the following four types of temporal expressions may be considered: absolute values (e.g. 12-March), explicit offsets (e.g. yesterday), implicit offsets (e.g. Thursday) and positional offsets (e.g. last week).
- An exception to the above rule of the rule-based model may be when the following two conditions are simultaneously true: (1) the news article begins with an absolute value (usually the publication date/time) and (2) the first sentence of the news article contains multiple temporal expressions.
- the rule-based model may ignore the first absolute value and select the second temporal expression as the time attribute of the event.
- the time attribute may be composed as a date and time.
- Generating the time attribute may include converting the selected temporal expression to a canonical form, with the publication timestamp of the article used to resolve offsets (e.g., yesterday, last week).
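The selection rule and its exception can be sketched as below. This is a minimal sketch under stated assumptions: the `TemporalExpr` record, its fields, and the small `EXPLICIT_OFFSETS` table are hypothetical, temporal expressions are assumed to be supplied in document order by an upstream tagger, and only explicit offsets are resolved.

```python
from datetime import date, timedelta
from typing import List, NamedTuple


class TemporalExpr(NamedTuple):
    text: str
    kind: str       # "absolute", "explicit_offset", "implicit_offset", "positional_offset"
    sentence: int   # index of the sentence containing the expression
    at_start: bool  # True if the expression opens the article


def select_time_expression(exprs: List[TemporalExpr]) -> TemporalExpr:
    """Pick the first expression, skipping a leading dateline-style absolute value."""
    first = exprs[0]
    in_first_sentence = [e for e in exprs if e.sentence == 0]
    if first.kind == "absolute" and first.at_start and len(in_first_sentence) > 1:
        return exprs[1]
    return first


# Only a few explicit offsets are handled in this sketch.
EXPLICIT_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def resolve(expr: TemporalExpr, published: date) -> date:
    """Convert an explicit offset to a calendar date using the publication date."""
    if expr.kind == "explicit_offset":
        return published + timedelta(days=EXPLICIT_OFFSETS[expr.text.lower()])
    raise NotImplementedError("only explicit offsets are sketched here")


exprs = [
    TemporalExpr("12-March", "absolute", 0, True),
    TemporalExpr("yesterday", "explicit_offset", 0, False),
]
chosen = select_time_expression(exprs)
event_date = resolve(chosen, date(2018, 3, 12))
```

Here the leading "12-March" is treated as the publication dateline, so "yesterday" is selected and resolved against the publication date.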
- the impact attribute is determined.
- the impact of an event may include one or more of a human impact (e.g., how were humans impacted), a material impact (e.g., how were material things such as structures, goods, financial quantities, etc. impacted), etc.
- an impact attribute may indicate one or more of: a number of human casualties, a number of humans relocated, or an amount of financial damages.
- the impact of large-scale events is expressed in quantifiable units in association with an effect (e.g., ten people injured, 15 drowned).
- the impact attribute may be determined by classifying numeric references of the candidate attributes and adjacent word sequences as either representing an impact of the event or not. For example, for each sentence of the article that contained tokens with a cardinal number part-of-speech tag, that numeric value is considered as a putative unit of human impact (e.g., ten, 15).
- a feature vector is generated as a concatenation of: (1) an average embedding vector corresponding to the word sequence, (2) a length of the word sequence (n), (3) a pre and post token offset of the cardinal number token, relative to the word sequence, (4) a binary vector corresponding to one part-of-speech tag for each word in the word sequence, (5) a binary vector corresponding to the entity types of the word sequence, and (6) a binary vector corresponding to the dependency tree relations of the word sequence.
- the feature vector is input to a classifier, such as an SVM classifier, and the classifier classifies the input numeric value and word sequence as either indicating a human impact or not.
- the determined impacts may be mapped into broad categories such as dead, injured, missing and displaced.
- the mapped impacts predicted by the classifier may be selected as the impact attribute.
- the impact attribute may include all of the predicted impacts, or a single or predetermined number of impacts may be selected as the impact attribute, such as the impacts predicted with the highest confidence.
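The candidate generation and category mapping can be illustrated with a simplified sketch. Note the substitution: where the source uses a trained classifier over a feature vector, this sketch uses a plain trigger-word lookup. The `CATEGORY_TRIGGERS` table, the `NUMBER_WORDS` table, and the three-token window are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Hypothetical trigger words for each broad impact category.
CATEGORY_TRIGGERS = {
    "dead": {"dead", "killed", "drowned", "died"},
    "injured": {"injured", "wounded", "hurt"},
    "missing": {"missing"},
    "displaced": {"displaced", "evacuated", "relocated"},
}


def impact_candidates(sentence: str, window: int = 3) -> List[Tuple[int, List[str]]]:
    """Pair each cardinal number with the word sequence that follows it."""
    tokens = re.findall(r"[A-Za-z]+|\d+", sentence.lower())
    out = []
    for i, tok in enumerate(tokens):
        value = int(tok) if tok.isdigit() else NUMBER_WORDS.get(tok)
        if value is not None:
            out.append((value, tokens[i + 1:i + 1 + window]))
    return out


def map_impacts(sentence: str) -> Dict[str, int]:
    """Map each numeric candidate to the first category whose trigger appears nearby."""
    impacts: Dict[str, int] = {}
    for value, context in impact_candidates(sentence):
        for category, triggers in CATEGORY_TRIGGERS.items():
            if triggers.intersection(context):
                impacts[category] = impacts.get(category, 0) + value
                break
    return impacts
```

For the sentence "Ten people injured, 15 drowned in the flood." the sketch yields 10 under "injured" and 15 under "dead".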
- the event representation, including the determined event attributes, for the event referenced by the news article is stored.
- the event representation may be stored by the news event database module 848.
- the stored event representation also may include the news article itself, or a link to the news article.
- a method of detecting and generating a representation of events referenced by news articles may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 12.
- FIGS. 13A-13C depict embodiments of news articles and social media postings from the month of October 2017 that an embodiment of the cross-media event detection and coreferencing system 804 determined to coreference the same events.
- FIG. 13A depicts in the top half of the figure a news article that references an event related to the wildfires that affected California, and a corresponding event representation extracted by the event detection and coreferencing system 804 having a fire event type.
- FIG. 13B depicts in the top half of the figure a news article that references an event related to the Hurricane
- FIG. 13C depicts in the top half of the figure a news article that references an event related to armed conflicts in Afghanistan, and a corresponding event representation extracted by the event detection and coreferencing system 804 having a conflict event type.
- FIGS. 13D-13F depict embodiments of a display of coreferenced events of the event types in FIGS. 13A-13C, respectively, detected by embodiments of the event detection and coreferencing system 804 for the first three weeks of October 2017, shown as points on a map having the coordinates of the coreferenced event representations.
- FIG. 14 depicts an embodiment of the social media event extraction module 856, including an event detection module 938, an event filtering module 942, and an event attribute extraction module 946.
- the event detection module 938 detects events referenced by the retrieved social media postings, and clusters social media postings that reference the same event.
- the event detection module 938 includes a noise filter module 950 and an event detection and clustering module 954.
- the noise filter module 950 removes non-event related social media postings from the stream of social media postings.
- the event detection and clustering module 954 detects events in the social media postings, and clusters social media postings that reference a same event.
- the event filtering module 942 removes social media clusters that are not related to events of predetermined event types, such as newsworthy events, and that are not related to current events.
- the event filtering module 942 includes a topic classification module 958 and a novelty detection module 962.
- the topic classification module 958 classifies the type of event referenced by the social media cluster.
- the novelty detection module 962 determines whether the event referenced by the social media cluster is a current event.
- the event attribute extraction module 946 extracts further information about the detected events, and generates an event representation including attributes of the event based on the extracted information.
- the event attribute extraction module 946 includes an event summarization module 966, a location attribute extraction module 970, a time attribute extraction module 974, and an impact attribute extraction module 978.
- the event summarization module 966 produces a summary of the social media cluster.
- the location attribute extraction module 970 generates a location attribute for the event referenced by the social media cluster.
- the time attribute extraction module 974 generates a time attribute for the event referenced by the social media cluster.
- the impact attribute extraction module 978 generates an impact attribute for the event referenced by the social media cluster.
- a social media event extraction module may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 14.
- the social media event extraction module 856 may instead be, or be composed of components of, the system 100 for detecting and verifying an event from social media data.
- the social media event extraction module 856 may include components of the system 100 that perform the event detection and extraction
- the social media event extraction module 856 may include any combination of the components and/or features of the embodiment of the social media event extraction module of FIG. 14 and the embodiment of the system 100 for detecting and verifying an event from social media data of FIG. 1, or any combination of any subset of, or connection or ordering of, such components and/or features.
- FIG. 15 depicts an embodiment of a method 1500 of detecting and generating a representation of events referenced by social media postings.
- Embodiments of the method of FIG. 15 may be used to perform the event representation generation and storage step 1010 of the method 1000 of FIG. 10.
- the method may be performed by or involving components of the event detection and coreferencing system 804, such as by or involving components of the social media event extraction module 856 of FIG. 14.
- the method processes the retrieved stream of social media postings to detect whether the postings reference events of a predetermined set of event types, clusters postings that refer to the same event, and generates representations of the referenced events.
- the method may operate on the stream of social media postings in real time to provide a corresponding stream of detected and generated event representations.
- the method begins at step 1502.
- the social media postings are filtered to remove postings not related to events.
- Postings not related to an event may be considered to be noise, and may include spam, such as advertisements and bot-generated content, and daily chit-chat.
- the filtering of the social media posting may include applying an iterative set of filters to the postings to remove the noise.
- a rule-based classifier is used to filter suspicious spam users or messages from certain domains such as ebay.com.
- a topic model is used to identify and filter out chitchat. The model is trained on two corpora of online conversations that are unrelated to news.
- cost-sensitive learning is used to filter the remaining postings. Since the signal-to-noise ratio is very small, the model is tuned to penalize false positives, so that messages that may have some valuable content in them are not filtered.
- event detection and clustering is performed on the remaining social media postings.
- the event detection and clustering detects events referenced by the social media postings, and groups into a cluster, or collection, postings that refer to the same event.
- the postings may be processed using natural language processing to identify attributes of a referenced event.
- an event may be conceptualized as a semantic entity with four main dimensions: what, where, who, and when.
- a natural language processing tool may be used to identify the first three dimensions, if present, in each posting.
- a rule-based model that identifies explicit or implicit expressions of time such as "on Monday," "this morning," or "1926" may be used to identify the fourth dimension.
- a soft-matching process is used to align postings along each dimension, and a linear classifier, trained on the interpolation of these dimensions to group postings around real-world events, is used to group into clusters postings that refer to the same event.
- the result is a cache of clusters, where each cluster consists of postings that discuss a particular event. This identifies events dynamically and in real-time, and if a cluster forms around an event, as new postings emerge about the same event, they can be dynamically added to the cluster.
- the type of event referenced by the social media cluster is classified.
- the event type classification may be performed similarly to the news event type classification.
- the event referenced by the clusters may be classified by determining a feature vector for one or more postings of the cluster, providing the feature vector as an input to a classifier, and predicting by the classifier whether the posting references one of a predetermined set of event types, or none of these event types, based on the vector.
- the feature vector may include word embeddings for words of the social media postings of the social media cluster.
- LDA Latent Dirichlet Allocation
- the two-hot vector comprises the concatenation of two individual vectors in one-hot encoding, the first of which constitutes a redundant encoding of the word itself (where the k-th bit set to 1 means the word is the k-th word in the lexicon and all other bits are set to 0), and the second of which encodes the topic.
- the resulting 600-dimensional embeddings are used as features for a Sequential Minimal Optimization (SMO) Support Vector Machine (SVM) model to predict the topic. It is trained on a set of 26,300 postings using n+1 topics, including the topic model induced as described above modified to include a new catch-all topic (the (n+1)st) to capture tweets that do not fall under any of the target topics.
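The two-hot encoding described above can be sketched in a few lines. The toy lexicon and topic list are invented for the example, and `one_hot` and `two_hot` are hypothetical helper names; how such vectors feed the embedding training is not shown.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[int]:
    """Vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def two_hot(word: str, topic: str,
            lexicon: Sequence[str], topics: Sequence[str]) -> List[int]:
    """Concatenate a one-hot word encoding with a one-hot topic encoding."""
    return (one_hot(lexicon.index(word), len(lexicon))
            + one_hot(topics.index(topic), len(topics)))


lexicon = ["fire", "flood", "storm"]   # toy lexicon
topics = ["disaster", "sports"]        # toy topic set
vec = two_hot("flood", "disaster", lexicon, topics)
```

The resulting vector has exactly two bits set: one in the word segment and one in the topic segment.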
- Clusters not related to current events are removed.
- Current events may be defined as events occurring within a predetermined time period from a current time of the real time processing of the social media stream.
- a hybrid approach may be utilized.
- clusters are analyzed for expressions of time, with clusters having expressions of time referencing a time period before a time period of current events being removed. For example, postings that explicitly mention a historical event such as Halloween or 9/11 are removed by a taxonomy-based filter, and postings that mention an expression of time such as "last week" are removed by another filter.
- a similarity score may be calculated for newly formed clusters relative to previously formed clusters, with newly formed clusters similar to previously formed clusters to within a predetermined degree being regarded as mere updates on the event of the previous cluster, and thus also removed.
- a pairwise similarity score may be calculated between a newly formed cluster and every other cluster in the cache, and if the incoming cluster closely resembles an older one to a predetermined degree based on the score, it is likely to be an update on an event that is previously reported, and these residual updates are ignored and the new cluster removed.
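The pairwise novelty check can be sketched as follows. The source does not name the similarity function or the threshold; cosine similarity over bag-of-words counts and a threshold of 0.8 are assumptions chosen for illustration, and `is_novel` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_novel(new_cluster: Counter, cache: List[Counter], threshold: float = 0.8) -> bool:
    """True if no cached cluster reaches the similarity threshold."""
    return all(cosine(new_cluster, old) < threshold for old in cache)
```

A new cluster for which `is_novel` returns False would be treated as an update on a previously reported event and removed.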
- a summary of each cluster is determined for inclusion in the event representation for that cluster.
- the summary may be generated as a selected one of the postings in the cluster that may be most representative, objective, and/or informative. For example, given a cluster, each posting is treated as a document and represented by a tf-idf vector. Each cluster is then represented by a centroid vector. Each posting vector is then scored based on its similarity to the centroid. A rule-based approach is utilized to penalize tweets that include opinionated terms or patterns such as repeated characters or punctuation. The posting with the highest score is selected as the summary.
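The centroid-based selection can be sketched as below. This is an illustrative sketch: the smoothed idf formula, the `OPINION_PATTERN` regular expression (three or more repeated characters, or repeated "!" or "?"), and the fixed penalty of 0.5 are assumptions, since the source does not specify the weighting or the penalty rules.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical markers of opinionated or noisy text.
OPINION_PATTERN = re.compile(r"(.)\1{2,}|[!?]{2,}")


def tfidf_vectors(postings: List[str]) -> List[Dict[str, float]]:
    """Represent each posting as a tf-idf vector over the cluster."""
    docs = [Counter(re.findall(r"[a-z']+", p.lower())) for p in postings]
    df = Counter(term for d in docs for term in d)
    n = len(docs)
    # Smoothed idf keeps a nonzero weight for terms present in every posting.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
            for d in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(postings: List[str], penalty: float = 0.5) -> str:
    """Return the posting closest to the cluster centroid, after penalties."""
    vectors = tfidf_vectors(postings)
    centroid: Dict[str, float] = {}
    for v in vectors:
        for t, w in v.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(vectors)

    def score(i: int) -> float:
        s = cosine(vectors[i], centroid)
        return s - penalty if OPINION_PATTERN.search(postings[i]) else s

    return postings[max(range(len(postings)), key=score)]
```

The penalty is subtracted from the similarity score, so an opinionated posting can still be chosen if every other posting is far from the centroid.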
- a location attribute of the event referenced by the social media cluster is determined. Since social media postings may have a character limit, the number of locations mentioned in them is often limited. For example, more than 60% of tweets in one dataset mention a single location, and fewer than 2% have more than three locations in them.
- the location attribute may be selected for the cluster using a rule-based approach. If multiple locations are mentioned but some are included within others, the less granular locations are ignored. If the remaining pool of locations includes more than one location, it is handled based on the nature of the event. For example, floods can span multiple locations but terror attacks are often limited to one. In cases when the event is limited to one location, the location mentioned last in the tweet is selected.
- each posting generates one location for the cluster.
- the location for the cluster is selected using a voting system to select from among the locations selected for the postings of the cluster.
- a least-common-distance metric may be used to disambiguate toponyms such as "Paris" that can be mapped to multiple coordinates around the world.
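The per-posting rules and the cluster-level vote can be sketched as follows for an event type limited to one location. The `CONTAINED_IN` lookup is a hypothetical stand-in for a gazetteer, and the function names are illustrative; toponym disambiguation is not included.

```python
from collections import Counter
from typing import Dict, List, Optional, Set

# Hypothetical containment lookup (place -> enclosing places).
CONTAINED_IN: Dict[str, Set[str]] = {
    "Santa Rosa": {"California"},
    "Napa": {"California"},
}


def posting_location(mentions: List[str]) -> Optional[str]:
    """Drop less granular locations, then take the last remaining mention."""
    enclosing = set().union(*(CONTAINED_IN.get(m, set()) for m in mentions))
    granular = [m for m in mentions if m not in enclosing]
    return granular[-1] if granular else None


def cluster_location(postings_mentions: List[List[str]]) -> Optional[str]:
    """Majority vote over the locations selected for each posting."""
    votes = Counter(loc for loc in map(posting_location, postings_mentions) if loc)
    return votes.most_common(1)[0][0] if votes else None
```

For event types that can span several locations, such as floods, the last-mention rule would need to be replaced by logic that keeps multiple locations.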
- a time attribute for the event referenced by the social media cluster is determined.
- the timestamp of the real-time formation of the cluster by the social media event extraction module 828 may be selected as the time attribute.
- the timestamp of the posting may be selected as the time attribute.
- an impact attribute for the event referenced by the social media cluster is determined.
- the impact attribute may be selected as done for the news article in step 1218 of the method of FIG. 12, with the social media cluster or postings of the cluster being analyzed instead of a news article.
- Each of the postings of a cluster may be analyzed to determine impacts, or a single or predetermined number of representative postings of the cluster, such as the posting used for the summary, may be analyzed to determine impact.
- the event representation, including the determined event attributes, for the event referenced by the social media cluster is stored.
- the event representation may be stored by the social media event database module 860.
- the stored event representation also may include some or all of the social media postings of the social media cluster, or a link or links thereto. The method ends at step 1522.
- a method of detecting and generating a representation of events referenced by social media postings may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 15.
- FIG. 13A depicts in the bottom half of the figure a cluster of social media postings that references the event related to the wildfires that affected California, and a corresponding extracted event representation
- FIG. 13B depicts in the bottom half of the figure a cluster of social media postings that references the event related to the Hurricane Ophelia storms in Ireland and United Kingdom, and a corresponding extracted event representation
- FIG. 13C depicts in the bottom half of the figure a cluster of social media postings that references the event related to armed conflicts in Afghanistan, and a corresponding extracted event representation.
- FIG. 16 depicts an embodiment of the event coreferencing module 876, including a similarity calculation module 982 and a coreferencing module 986.
- the similarity calculation module 982 receives the event representations generated for the stream of news articles and for the stream of social media postings, and determines one or more similarity measures between the news article event representations and the social media event representations.
- the coreferencing module 986 receives the determined similarity measures, and determines whether any news articles and social media clusters retrieved from corresponding streams within a predetermined timeframe, anchored back from the current time, reference the same event, i.e., coreference the event.
- an event coreferencing module may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 16.
- FIG. 17 depicts an embodiment of a method 1700 of determining event coreferencing across media types. Embodiments of the method of FIG. 17 may be used to perform the event coreferencing and coreferenced event representation generation and storage of steps 1016 and 1018 of the method 1000 of FIG. 10. The method may be performed by or involving components of the event detection and coreferencing system 804, such as the event coreferencing module 876 of FIG. 16. The method determines one or more similarity measures between a given pair of a news article and social media cluster, classifies the pair as coreferencing a same event or not based on the similarity measures, and for those pairs that coreference a same event, generates a coreferenced event representation for the coreferenced event.
- the method may operate on the streams of event representations produced from the streams of news articles and social media postings in real time to provide a corresponding stream of detected and generated coreferenced event representations.
- the steps of the method may be performed for each possible pair of news article and social media cluster generating event representations within a predetermined time window anchored back from the current time.
- the method thus greatly improves the quality of generated event information, by combining qualities of the different media types, including the ubiquitous coverage of social media and the reliability and context of news articles, to produce coreferenced event representations, which provides a correspondingly improved basis for decision making and/or control by the user and/or user system 820.
- the method begins at step 1702.
- one or more similarity measures are determined between the news article, or the event representation for the news article, and the social media cluster, or the event representation for the social media cluster.
- the one or more similarity measures may be based on values of corresponding attributes of the event representation for the news article and the event representation for the social media cluster.
- the one or more similarity measures may include one or more of a similarity measure based on a location attribute of the news article event representation and a location attribute of the social media event representation, or a similarity measure based on a time attribute of the news article event representation and a time attribute of the social media cluster event representation.
- the one or more similarity measures also may be based on the text of or information extracted from the news articles and social media clusters, such as candidate attributes, tokens, etc.
- the one or more similarity measures may include one or more of a similarity measure based on a person or organization entity extracted from the news article and a person or organization entity extracted from the social media cluster, or a similarity measure based on a title or text of the news article and a text of the social media cluster.
- At step 1706, a classification of whether the pair of the news article and the social media cluster coreferences the same event is performed.
- the classification may be performed by composing a feature vector for the pair of the news article and social media cluster based on the determined one or more similarity measures between the news article and social media cluster, inputting the feature vector into a trained classifier, such as an SVM classifier, and the classifier then determining if the news article and social media cluster coreference the same event or not based on the input vector.
- the feature vector for the pair of the news article and social media cluster may be composed from the determined one or more similarity measures between the news article and social media cluster, such as by concatenating each of the determined one or more similarity measures into a vector.
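- The composition of the feature vector and its classification can be sketched as follows. This is a hedged illustration: the feature names, their order, the default of 0.0 for a missing measure, and the linear decision rule are all assumptions made here. The linear rule merely stands in for a trained classifier such as the SVM classifier mentioned above, and the weights in any usage would be hypothetical rather than learned values from the described system.

```python
# Hypothetical fixed ordering of the similarity measures in the feature vector.
FEATURE_ORDER = ("spatial", "temporal", "person", "organization",
                 "body_text", "title_text")


def compose_feature_vector(similarities):
    """Concatenate the similarity measures into a fixed-order vector.
    Measures that were not computed default to 0.0 (an assumption)."""
    return [float(similarities.get(name, 0.0)) for name in FEATURE_ORDER]


def coreference_same_event(vector, weights, bias):
    """Linear decision rule w.x + b > 0, standing in for a trained classifier."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return score > 0.0
```

- Because the temporal measure is a time difference rather than a bounded similarity, a classifier of this kind would be expected to give it a negative weight, so that larger gaps count against a match.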
- At step 1708, if it was determined at step 1706 that the news article and social media cluster pair does not coreference the same event, the method proceeds to step 1712, where the method ends; if instead it was determined at step 1706 that the pair coreferences the same event, the method proceeds to step 1710, where an event representation for the coreferenced event is generated and stored.
- the coreferenced event representation may use the event representation of one or the other of the news article or social media cluster, or combine these event representations, such as where corresponding attributes of the event representation agree, using that attribute value, where corresponding attributes of the event representation do not fully agree, either selecting one or the other of the attribute values or using no value, and where one of the event representations includes an attribute value but the other does not, using that value or no value.
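- The attribute-combining rules described above can be sketched as follows. This is an illustrative sketch only: event representations are modeled here as plain dictionaries, and the function name and the `prefer` parameter are assumptions, not part of the described system.

```python
def merge_event_representations(news_event, cluster_event, prefer="news"):
    """Combine two event representations attribute by attribute.

    prefer="news" or "cluster" selects that source when values disagree;
    prefer=None leaves a disagreeing attribute without a value.
    """
    merged = {}
    for key in sorted(set(news_event) | set(cluster_event)):
        a, b = news_event.get(key), cluster_event.get(key)
        if a is not None and b is not None:
            if a == b:
                merged[key] = a          # attributes agree: use the shared value
            elif prefer == "news":
                merged[key] = a          # disagreement: select the news value
            elif prefer == "cluster":
                merged[key] = b          # disagreement: select the cluster value
            else:
                merged[key] = None       # disagreement: use no value
        else:
            # Only one representation has a value: use that value.
            merged[key] = a if a is not None else b
    return merged
```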
- the coreferenced event representation also may include the corresponding news article, social media cluster, a link or links thereto, or combinations thereof.
- the coreferenced event representation may be stored by the coreferenced event database module 880.
- a method of determining event coreferencing across media types may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 17.
- FIG. 18 depicts an embodiment of the similarity calculation module 982, including a spatial similarity calculation module 990, a temporal similarity calculation module 992, an entity similarity calculation module 994, and a text similarity calculation module 996.
- the spatial similarity calculation module 990 calculates a similarity based on the location attributes of the event representations of the news article and social media cluster.
- the temporal similarity calculation module 992 calculates a similarity based on the temporal attributes of the event representations of the news article and social media cluster.
- the entity similarity calculation module 994 calculates one or more similarities based on entities, such as persons or organizations, extracted from the news article and social media cluster.
- the text similarity calculation module 996 calculates a similarity based on text, such as the title or body, of the news article and the text of the social media cluster.
- a similarity calculation module may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 18.
- FIG. 19 depicts an embodiment of a method 1900 of calculating similarities between a news article and social media cluster.
- Embodiments of the method of FIG. 19 may be used to perform the similarity determining step 1704 of the method 1700 of FIG. 17.
- the method may be performed by or involving components of the event detection and coreferencing system 804, such as the similarity calculation module 984 of FIG. 18.
- the method begins at step 1902.
- a spatial similarity SL between the news article and social media cluster is determined as a similarity based on locations of the event representations of and/or extracted from the news article and social media cluster.
- the determining of the spatial similarity SL may include determining feature vectors for the news article and the social media cluster based on the candidate locations extracted from the news article and the candidate locations extracted from the social media cluster, calculating similarities between each potential pair of such locations of the news article and the social media cluster using the feature vectors, and determining the spatial similarity SL as function of the determined candidate location similarities.
- each location can be represented as a tree.
- a similarity between two locations x and y can be calculated based on the length of the common path, as follows:
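- [Equation garbled in the source text. The surviving fragment appears to combine, over the candidate locations $x_i$ of the news article and $y_j \in r_y$ of the social media cluster, the maximum pairwise location similarities $\max s(x_i, y_j)$.]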
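- A common-path similarity between tree-structured locations can be sketched as follows. The exact formula is not recoverable from the text above, so everything here is an assumption made for illustration: locations are modeled as root-to-leaf paths such as (country, state, city), the pairwise similarity is normalized by the longer path, and the aggregate averages the best match of each location in both directions.

```python
def common_path_length(x, y):
    """Number of leading path components shared by two locations."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def location_similarity(x, y):
    """Shared path length divided by the longer path (assumed normalization)."""
    if not x or not y:
        return 0.0
    return common_path_length(x, y) / max(len(x), len(y))


def spatial_similarity(news_locations, cluster_locations):
    """Average best-match similarity in both directions (assumed aggregation)."""
    if not news_locations or not cluster_locations:
        return 0.0
    best_for_news = [max(location_similarity(x, y) for y in cluster_locations)
                     for x in news_locations]
    best_for_cluster = [max(location_similarity(x, y) for x in news_locations)
                        for y in cluster_locations]
    total = sum(best_for_news) + sum(best_for_cluster)
    return total / (len(news_locations) + len(cluster_locations))
```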
- a temporal similarity ST between the news article and social media cluster is determined as a similarity based on the temporal attributes of the event representations of and/or temporal expressions extracted from the news article and social media cluster.
- the determining of the temporal similarity ST may include determining feature vectors of temporal expressions extracted from the news article and of temporal expressions extracted from the social media cluster, and determining the temporal similarity ST as the minimum time difference between temporal expressions in the news temporal vector and in the social media temporal vector.
- Let $T = [t_1, t_2, \ldots, t_z]$ be a vector of all $z$ temporal expressions extracted from the news article $r_x$.
- Let $V = [v_1, v_2, \ldots, v_w]$ be a vector of all $w$ temporal expressions extracted from $r_y$.
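- The minimum time difference between the two vectors of temporal expressions can be sketched as follows. It assumes the temporal expressions have already been resolved to `datetime` values, which is an assumption made for this illustration. Returning `None` when either vector is empty is also a choice made here.

```python
from datetime import datetime, timedelta


def temporal_similarity(news_times, cluster_times):
    """Smallest absolute difference between any news temporal expression
    and any social media temporal expression, as a timedelta."""
    if not news_times or not cluster_times:
        return None
    return min(abs(t - v) for t in news_times for v in cluster_times)
```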
- a person entity similarity SP between the news article and social media cluster is determined.
- the determining of the person entity similarity SP may include determining sets of person entities extracted from the news article and the social media cluster using natural language processing on the news article and social media cluster, and determining a similarity, such as Jaccard similarity, between the sets of extracted persons for the news article and the social media cluster as the person entity similarity SP.
- an organization entity similarity So between the news article and social media cluster is determined.
- the determining of the organization entity similarity So may include determining sets of organization entities extracted from the news article and the social media cluster using natural language processing on the news article and social media cluster, and determining a similarity, such as Jaccard similarity, between the sets of extracted organizations for the news article and the social media cluster as the organization entity similarity So.
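- The Jaccard similarity named above for the person and organization entity sets can be sketched as follows. Entity extraction itself is outside this sketch, and treating two empty sets as having similarity 0.0 is an assumption made here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither side has any entity
    return len(a & b) / len(union)
```

- The same function would serve for both the person entity similarity SP and the organization entity similarity So, applied to the respective sets of extracted entities.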
- a text similarity SB between the body of the news article and one or more postings of the social media cluster is determined.
- the determining of the text similarity SB may include generating vectors for tokenized text of the news article and the social media posting based on word embeddings, and determining a similarity, such as a cosine similarity, between the determined vectors.
- the vector representing the text of the body of the news article or social media posting is computed as:
- the text similarity SB may be calculated as the cosine similarity between the vector $r_{txt}$ for the news article and the vector $r_{txt}$ for the social media posting.
- the text similarity may be determined for one or more postings of the cluster, such as a representative posting of the cluster, such as the posting used for the summary.
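- A text similarity based on word embeddings and cosine similarity can be sketched as follows. The formula for the text vector is not reproduced in the text above, so averaging the embeddings of the known tokens is an assumption made here, as are the function names and the toy embedding table used in any usage.

```python
import math


def text_vector(tokens, embeddings):
    """Average the embeddings of the tokens found in the table
    (assumed aggregation); returns None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

- The title similarity could reuse the same two functions, with the tokens of the news article title in place of the tokens of its body.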
- a text similarity ST between the title of the news article and one or more postings of the social media cluster is determined.
- the text similarity ST may be determined in the same way as for the similarity between the body of the news article and the posting, except the title of the news article is used instead of the body of the news article.
- a method of calculating one or more similarities between a news article and social media cluster may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 19.
- FIGS. 13A-13C depict embodiments of news articles, social media postings, and corresponding event representations, for events that an embodiment of the cross-media event detection and coreferencing system 804 determined to be coreferenced by both the depicted news articles and social media postings.
- FIGS. 13D-13F depict embodiments of a display of coreferenced events of the event types in FIGS. 13A-13C, respectively, detected by the embodiment of the cross-media event detection and coreferencing system 804 for a predetermined time period, shown as points on a map having the coordinates of the coreferenced event representations.
- coreferencing module 836 may be implemented as hardware, software, or a mixture of hardware and software.
- each of the cross-media event detection and coreferencing system 804, user system 820, news production system 808, social media system 812, and/or event production system 816, and/or any individual one, subset, or all of the components thereof, may be implemented using a processor and a non-transitory machine-readable storage medium, where the non-transitory machine-readable storage medium includes program instructions that, when executed by the processor, perform embodiments of the functions of such components discussed herein.
- each of cross-media event detection and coreferencing system 804, user system 820, news production system 808, social media system 812, and/or event production system 816, and/or any individual one, subset, or all of the components thereof, may be implemented using one or more computer systems, such as, e.g., a mobile computing device, a desktop computer, laptop computer, network device, server, Internet server, cloud server, etc.
- FIG. 20 depicts an embodiment of a computer system 1030 that may be used to implement any of cross-media event detection and coreferencing system 804, user system 820, news production system 808, social media system 812, and/or event production system 816, and/or any individual one, subset, or all of the components thereof.
- the computer system 1030 includes a processor 1034, a non-transitory machine-readable storage medium 1042, a communication circuit 1038, and optionally other components 1046.
- the processor 1034 executes program instructions stored in the non-transitory machine-readable storage medium 1042 to perform the functionality of the system or component that the computer system 1030 is implementing, as discussed herein.
- the communication circuit 1038 can be controlled by the processor 1034 to communicate with other devices, such as any other of the cross-media event detection and coreferencing system 804, user system 820, news production system 808, social media system 812, and/or event production system 816, to perform the functionality of the system or component that the computer system 1030 is implementing, as discussed herein.
- the optional other components 1046 may include any further components required by the computer system 1030 to perform this functionality.
- a computer system that may be used to implement any of the cross-media event extraction and coreferencing system, user system, news production system, social media system, or event production system, and/or any individual one, subset, or all of the components thereof, may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 20.
- FIG. 21 depicts embodiments of the cross-media event detection and coreferencing system 804 and user system 820, showing further details of the event alerting module of the event detection and coreferencing system 804 and the interface module 902 and other components of the user system 820.
- In FIG. 21, for clarity of illustration, only the components of the cross-media event detection and coreferencing system 804 and the user system 820 that are discussed further with respect to the figure are shown; other components are omitted.
- the event alerting module 884 may include an interface component including one or more of a publishing module 1050 or an API module 1054.
- the publishing module 1050 publishes alerts containing generated coreferenced event representations.
- the publishing module 1050 may publish the alerts in a variety of ways, such as by transmitting emails containing the alerts to the user system 820, sending text messages containing the alerts to the user or user system 820, or providing a feed received by the user system 820 containing the alerts, etc.
- the API module 1054 implements an API that provides the alerts containing the generated coreferenced event representations.
- the API module 1054 may provide the alerts in a variety of ways, such as by transmitting responses to the user system 820 responsive to specific requests for alerts of the API module 1054 by the user system, by periodically transmitting alerts to the user system 820 based on established preferences for receiving alerts received by the API module 1054 from the user system 820, etc.
- the user system 820 includes the interface module 902, a control module 1058, and other system components 1062.
- the interface module 902 interfaces with the event alerting module 884 over the one or more communication networks to receive the alerts, such as from the publishing module 1050 or API module 1054, as discussed above.
- the control module 1058 implements control of the user system 820 in response to the alerts, such as to implement the control of step 1022 of FIG. 10.
- the control module 1058 may include a standalone controller or processor, or may be implemented by a processor of a computer system implementing the control module 1058 and other components of the user system 820.
- the control module 1058 receives the alert and transmits control instructions to the other components 1062 of the user system 820 to implement control of these components 1062 based on the alert.
- the type of control and other components 1062 may depend on the context and uses of the user system.
- In one example, the user system 820 is a supply chain management system for a manufacturing or business organization, and the control module 1058 transmits a signal to a supply chain management module 1062 to control a supply chain, such as to schedule or reschedule a supply chain delivery, based upon the coreferenced event of the alert, such as an event near the manufacturing or business organization.
- In another example, the user system 820 is a financial trading system, and the control module 1058 transmits a signal to a trading module 1062 to control trading of financial commodities, such as to buy or sell the financial commodities, based on the coreferenced event of the alert, such as an event affecting an organization related to the financial commodities.
- In a further example, the user system 820 is a manufacturing system, and the control module 1058 transmits a signal to a manufacturing module 1062 to control manufacturing activities, such as to suspend manufacturing activities, based on the coreferenced event of the alert, such as an event affecting an area of the manufacturing.
- Many other types of alert-based control are possible.
- a cross-media event detection and coreferencing system 804 and user system may include only any subset of, or an alternative connection or ordering of, the features depicted in or discussed herein in regard to FIG. 21.
- FIG. 22 depicts an embodiment of a method 2200 of providing an alert for a coreferenced event.
- Embodiments of the method of FIG. 22 may be used to perform the alerting of step 1020 of the method 1000 of FIG. 10.
- the method may be performed by or involving components of the event detection and coreferencing system 804, such as the event alerting module 884 of the event coreferencing and alerting module 836 of FIG. 9.
- the method begins at step 2202.
- a trigger condition for providing an alert may be determined to have occurred.
- the trigger condition may be one or more of the generation of the coreferenced event by the event coreferencing and alerting module 836, passage of a predetermined amount of time since a previous alert, receipt of a request for an alert by the API module 1054, etc.
- the types and recipients of the alert may be determined.
- an operator of the event detection and coreferencing system 804 may provide both an alert publishing service and an alert API service that may be subscribed to by persons and organizations desiring to receive alerts.
- the event alerting module 884 may maintain a list of recipients for different types of alerts, such as for publication or provision by API, and determine from the list a set of recipients and corresponding alert types for alert generation upon occurrence of the alert trigger.
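- The trigger check and the lookup of recipients by alert type can be sketched as follows. This is a hypothetical illustration: the function names, the parameters, and the representation of the recipient list as (recipient, alert type) pairs are assumptions made here, not details of the event alerting module 884.

```python
from datetime import datetime, timedelta


def trigger_occurred(new_event, last_alert_at, now, min_interval,
                     api_request_pending):
    """True if any of the three trigger conditions named above holds:
    a new coreferenced event, enough elapsed time, or a pending API request."""
    elapsed = now - last_alert_at >= min_interval
    return bool(new_event) or elapsed or bool(api_request_pending)


def recipients_by_alert_type(subscriptions):
    """Group recipients by alert type, e.g. "email", "text", "feed", "api"."""
    grouped = {}
    for recipient, alert_type in subscriptions:
        grouped.setdefault(alert_type, []).append(recipient)
    return grouped
```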
- the alert is generated and published by the publishing module 1050.
- the alert publication may take a variety of forms.
- the alert may be included in an email or text message addressed to a recipient, such as the user or user system 820, that has subscribed to such a service, and the publishing may include transmitting the email or text message to the recipient.
- the alert may be included in a feed, such as an RSS feed, and the publishing may include providing the feed to the recipient that has subscribed to such a service.
- the alert is generated and provided by the API module 1054.
- the alert may be transmitted to the interface module 902 of the user system 820 in response to a request to the API module 1054 from an alert application executing on the user system 820.
- the method ends at step 2212.
- a method of providing an alert for a coreferenced event may include only any subset of, or an alternative ordering of, the features depicted in or discussed above in regard to FIG. 22.
- FIGS. 23A-23C depict embodiments of an email, text message, and feed item, respectively, that the publishing module may transmit to the user system.
- In FIG. 23A, the email is addressed to a subscriber of an alert service, and contains a coreferenced event representation, a copy of or link to a news article referencing the coreferenced event, a copy of or link to a social media cluster referencing the coreferenced event, and a link for further information such as additional coreferenced event representation attributes, additional news articles referencing the coreferenced event, additional social media postings referencing the coreferenced event, etc.
- In FIG. 23B, the text message is addressed to a subscriber of an alert service, and contains a summary of the coreferenced event, and a link for further information such as coreferenced event representation attributes, news articles referencing the coreferenced event, social media postings referencing the coreferenced event, etc.
- In FIG. 23C, the feed item contains a coreferenced event representation, a copy of or link to a news article referencing the coreferenced event, a copy of or link to a social media cluster referencing the coreferenced event, and a link for further information such as additional coreferenced event representation attributes, additional news articles referencing the coreferenced event, additional social media postings referencing the coreferenced event, etc.
- FIG. 24 depicts an embodiment of a display of an alert application that the interface module of the user system may execute and display to the user for interfacing with the API module of the cross-media event extraction and coreferencing system to request and receive alerts.
- the application display includes a section 1066 for the user to indicate the types, timeframe and location of events that it wants to request and receive alerts for, a section 1070 to display alerts and included event representations that it has received in response to requests, and a section 1074 to display further information for the events.
- the cross-media event detection and coreferencing system 804 also may receive event information from the event production system 816, and process and store this event information in an event representation form as used for the news article, social media and coreferenced events.
- the event coreferencing and alerting module 836 may incorporate such event representations into its coreferencing, coreferenced event representation generation, and alerting. That is, the event coreferencing and alerting module 836 may determine whether the event referenced by the event representation based on the information received from the event production system 816 coreferences an event referenced by either a news article, or a social media cluster, or coreferenced by both a news article and social media cluster.
- FIGS. 25 and 26 depict embodiments of event information of the event production system 816 that may be retrieved and utilized by the cross-media event detection and coreferencing system 804.
- FIG. 25 depicts a map showing flood event information output by the National Oceanic and Atmospheric Administration.
- FIG. 26 shows a detailed set of flood information for one location on the map in FIG. 25, showing a timewise evolution of a flooding state at the location.
- the event coreferencing and alerting module 836, in addition to determining coreferencing between news articles and social media clusters, may also determine coreferencing between news articles and news articles, social media clusters and social media clusters, etc. That is, the event coreferencing and alerting module 836 may determine coreferencing between any event representations resulting from any source, and generate corresponding coreferenced event representations and alerts.
- the event coreferencing and alerting module 836 may use the above system and methods to determine coreferencing between any two different types of media instead of or in addition to between news articles and social media.
- cross-media event extraction and coreferencing system 804, user system 820, news production system 808, social media system 812, event production system 816, and associated methods, as discussed herein, are possible.
- any feature of any of the embodiments of these systems and methods described herein may be used in any other embodiment of these systems and methods.
- embodiments of these systems and methods may include only any subset of the
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Theoretical Computer Science (AREA)
- Economics (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Operations Research (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
NZ762583A NZ762583A (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
CA3075865A CA3075865A1 (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
SG11202002303PA SG11202002303PA (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
AU2018331397A AU2018331397A1 (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
EP18855535.3A EP3682400A4 (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762559079P | 2017-09-15 | 2017-09-15 | |
US62/559,079 | 2017-09-15 | ||
US201762579218P | 2017-10-31 | 2017-10-31 | |
US62/579,218 | 2017-10-31 | ||
US16/130,390 US11061946B2 (en) | 2015-05-08 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
US16/130,390 | 2018-09-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019055654A1 true WO2019055654A1 (en) | 2019-03-21 |
Family
ID=65723414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/050885 WO2019055654A1 (en) | 2017-09-15 | 2018-09-13 | Systems and methods for cross-media event detection and coreferencing |
Country Status (5)
Country | Link |
---|---|
AU (1) | AU2018331397A1 (en) |
CA (1) | CA3075865A1 (en) |
NZ (1) | NZ762583A (en) |
SG (1) | SG11202002303PA (en) |
WO (1) | WO2019055654A1 (en) |
-
2018
- 2018-09-13 CA CA3075865A patent/CA3075865A1/en active Pending
- 2018-09-13 NZ NZ762583A patent/NZ762583A/en unknown
- 2018-09-13 AU AU2018331397A patent/AU2018331397A1/en active Pending
- 2018-09-13 SG SG11202002303PA patent/SG11202002303PA/en unknown
- 2018-09-13 WO PCT/US2018/050885 patent/WO2019055654A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
SG11202002303PA (en) | 2020-04-29 |
CA3075865A1 (en) | 2019-03-21 |
NZ762583A (en) | 2024-01-26 |
AU2018331397A1 (en) | 2020-04-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18855535 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 3075865 Country of ref document: CA |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2018331397 Country of ref document: AU Date of ref document: 20180913 Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2018855535 Country of ref document: EP Effective date: 20200415 |