CN106484767A - A cross-media event extraction method - Google Patents


Info

Publication number
CN106484767A
Authority
CN
China
Prior art keywords
event
key
data
alternate message
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610809600.1A
Other languages
Chinese (zh)
Other versions
CN106484767B (en
Inventor
尹芷仪
薛聪
向继
查达仁
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201610809600.1A
Publication of CN106484767A
Application granted
Publication of CN106484767B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention discloses a cross-media event extraction method. The method comprises: setting up a seed event feature library and the required knowledge data; collecting news web pages from trusted news sources and extracting the news text and metadata; extracting event element information from each news text to generate a set of primary events; calculating the importance of each element of a primary event in characterizing the event, and generating an initial event summary framework; searching social network message texts based on the elements in the initial event summary framework to generate a candidate message set; filtering the candidate messages according to the similarity between each candidate message's summary framework and the event summary framework, obtaining the message queue corresponding to the primary event; and generating complete event data from the event elements in the initial summary framework together with the event elements that appear in the message queue but not in the initial summary framework. The invention enables accurate extraction of major events in a massive cross-media data environment.

Description

A cross-media event extraction method
Technical field
The present invention relates to an event extraction method for a cross-media data environment based on news media and social networks, and belongs to the field of information retrieval.
Background
Scientific, quantitative analysis of news event data can be applied in social research such as situational awareness, emergency response, and risk early warning. Event data records a human activity in a specific scene and comprises multiple classes of elements, such as the subject and object involved, the actor's behavior, time, place, type, and sociological attributes. It is usually represented as a tuple and is an atomic description of the real world. Event elements can be expressed as numeric, descriptive, or assertion types: numeric data usually represents quantities in the event, descriptive data is usually a keyword belonging to an event element category, and assertion data represents a specific attribute feature. Before and after an event on a particular topic occurs, news media and social networks pay attention to it, and information about the event spreads across the Internet through carriers such as text and images. This has made information retrieval the main way of obtaining event data and has given rise to event extraction technology.
The main task of event extraction is to discover events in massive network data and to structure the data around event elements, ultimately generating event data that can be used for automated machine analysis. The conventional main processing steps are as follows: (1) data extraction: for each class of data source, establish matching data detection rules and data extraction interfaces, and set a rule update strategy to cope with interface changes in the data source; (2) preprocess the raw data, clean up data noise, and encapsulate the different classes of data, such as text, images, and metadata, in a suitable form; (3) combine knowledge information with machine learning methods to further understand the data, find the positioning anchors or data features related to event elements, and identify and extract event-related element information; (4) process the identified event elements through deduplication, clustering, normalization, and similar operations to generate candidate event data; (5) fuse the event data to generate finely structured event data, and store it centrally to form an event library. Users can retrieve event data through a unified event library access interface, which greatly simplifies data processing and provides more room for mining in the study of political and social evolution.
Because news articles have a uniform structure and a rigorous style, the event extraction methods in common use today mainly target text data in news media and ultimately generate event data in a predetermined format. With the spread of social networks, microblog messages posted by users have become first-hand accounts of events, and users spontaneously supplement event information as it propagates, producing a crowd effect around key events within social networks. At the same time, social networks increasingly play an important role in driving the evolution of events (for example, the "Arab Spring"), which exposes the limitations of traditional event extraction methods based on news text. In addition, event analysis in complex scenarios requires a greater variety of extracted event elements, and a major event usually triggers a series of associated events whose developing interactions are difficult to capture in traditional news data. A fine-grained event extraction method and a dynamically variable event data storage structure are therefore needed. No method has yet been found that performs event extraction by combining news media and social network data. As knowledge linking and machine learning methods continue to mature, the conditions are in place for accurate extraction of major events from massive heterogeneous media data.
Summary of the invention
In view of the above problems, the present invention provides a cross-media event extraction method divided into three stages: knowledge preparation (step 1), basic event element extraction (steps 2-4), and event element extension (steps 5-9). It covers the initial event summary framework, candidate event elements extracted from social network information, and event fusion. The main steps are as follows:
(1) Set up a seed event feature library and the required knowledge data, including entity element information libraries for specific organizations, institutions, places, persons, and the like; open knowledge graph data sets that associate ontologies with categories; and event behavior category pattern libraries or corpus resources.
(2) Collect news web pages in real time from the configured trusted news sources, preprocess them, and extract the news text and metadata.
(3) Extract basic event element information from each news text to generate primary event data, then deduplicate or merge similar event data to form a primary event set.
(4) Calculate the importance of each element of a primary event in characterizing the event, and generate an initial event summary framework composed of the basic elements.
(5) Generate a retrieval framework for social network data from the initial event summary framework, update the retrieval framework in real time with a dynamic iterative retrieval scheme, extract the social network message texts that satisfy the search conditions, and generate a candidate message set.
(6) Using text semantic analysis, analyze the element information in the candidate message set and the categories it belongs to, analyze the importance of each key-value pair, and generate a summary framework for each candidate message from the key-value pair analysis results.
(7) Compare the similarity between each candidate message summary framework and the event summary framework, and add the candidate messages that meet the requirement to the message queue corresponding to the primary event.
(8) According to preset message ordering conditions (such as the importance of the social network message or its posting time), select the key-value pairs in the message queue in turn as candidate event elements of the event data. For deterministic information such as geographic coordinates, cluster the key-value pairs added to the message queue and add the analysis result to the candidate event elements.
(9) Classify the candidate event elements extracted from the news text and social network data further by time, place, entity, category, result, scale, sociological attributes, and so on, then apply event fusion rules to normalize and integrate the event elements, generating complete event data.
The beneficial effects of the present invention are:
1. It provides a multi-category event element extraction method for a cross-media data environment, achieving fine-grained and extensible event element extraction. It exploits the standardized descriptions of news text to extract the basic event elements, and it also uses the large scale, user-driven updates, and broad content coverage of social network text data to add element information in categories such as event result, scale and impact, and sociological attributes.
2. The retrieval framework based on the event summary and the candidate message summary framework together provide a two-way query across the retrieval stage and the filtering stage, so that social network messages related to the event can be screened out more accurately.
3. It takes into account how the importance of each event element affects the characterization of the event, and therefore retains the more critical and credible event element information.
4. It extracts event elements not only from text data across the media environment but also draws on the strengths of social network metadata in describing event-related time, location, and popularity.
Brief description of the drawings
Fig. 1 is a flow chart of the cross-media event extraction method according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention provides a cross-media event extraction method for quickly generating fine-grained structured event data, after a major event of a particular category occurs, from related information in news media and social networks. It includes data extraction, the event summary framework, event element extraction, and event fusion. The invention is described in detail below with reference to a specific embodiment in which social network event extraction is illustrated with microblog data. It should be understood that this embodiment only serves to explain the invention and does not limit its scope.
Fig. 1 shows a schematic flow chart of the cross-media event extraction method of the present invention, which comprises the following steps:
(1) Set up a seed event feature library and the required knowledge data, including entity element information libraries for specific organizations, institutions, places, persons, and the like; open knowledge graph data sets that associate ontologies with categories; and event behavior category pattern libraries or corpus resources.
In implementation, suitable event feature libraries and knowledge sets are collected and selected according to the topic type of the target event and the main characteristics of common data resources. They contain the feature word set of the target event and a typical corpus of news events, are used for subsequent event recognition and filtering, and are kept current through synchronized update rules. Besides a reference name, entity elements and ontologies are given associations such as synonyms and categories. In person information, for example, the synonymous associated terms of a given person's name may include the president of a certain country or the supreme leader of a certain country; the person also belongs to the category of government personnel, and the association is time-sensitive. The WordNet corpus and data resources provided by official organizations can be used. As another example, the event "two countries conclude an agreement" is a cooperative event and also an event with a positive sentiment orientation, and it can be labeled using a coding tree. Open-source knowledge bases such as DBpedia or Freebase provide ontology information and the knowledge graph of corresponding categories; for example, "United Nations" corresponds to the category "non-profit international organization". Event behavior patterns can be defined with language templates in terms of syntactic structure and syntax trees. The syntactic structure is associated with conventional entity recognition rules to obtain the event behavior and entity relationship features in the text representation for use in subsequent event element extraction.
(2) Collect news web pages in real time from trusted news sources, preprocess them, and extract the news text and metadata.
Trusted news media should be selected when extracting event data from news text. Trusted news sources usually report a major event promptly after it occurs and cover event categories comprehensively, which reduces the number of website RSS seeds that need to be integrated. Their layout and the veracity of what they cite are also of higher quality than those of their peers, which makes subsequent processing easier. The list of trusted news sources should be chosen with authority, region, and freshness in mind. Collection of news web pages must meet large-scale, real-time requirements, and a Redis-based distributed crawling mechanism can be used. Text and metadata can be extracted from news web pages with the Goose information extraction mechanism while irrelevant data is filtered out. The processing is described in more detail as follows:
a) Define the trusted news source seed list: label the coverage category of each news source by region of interest, including domestic, international, local, and so on, and set an update interval for each; the default is one update every 15 minutes.
b) Store the news source list on the master server and divide it into subtasks for the subordinate servers. Each news source seed is assigned its own background worker thread, which starts the text and metadata extraction module.
c) In the extraction module, use structures such as the DOM and CSS to extract all text-tagged parts from the HTML of the original web page. For nodes that contain multiple pieces of text, score each node according to the number of stop words under it and its position in the page layout in order to judge its importance. In general, a larger number of stop words indicates fuller and more detailed content, and content closer to the center of the page layout is more important. The core node is found in this way, and the text in the core node is extracted as the core news text.
d) Filter out news texts that describe irrelevant events. Irrelevant events that are easily confused with target events usually have obvious text features. For example, when studying political and social events, news reports on sports competitions often use wording that suggests a contest between countries, but they also contain many sports terms such as "international league". A rejection dictionary containing such irrelevant word features can therefore be used to filter out irrelevant events.
e) According to predefined rules or templates, remove structural tags unrelated to the content, such as CSS and script tags, retain the publication date and headline information, and complete text extraction and cleaning.
f) Integrate the extracted news text and metadata into a file of the specified format and upload it to a database with a NoSQL storage architecture.
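The node scoring in sub-step c) can be illustrated with a minimal Python sketch. The patent does not give a scoring formula, so the stop word list, the `center_offset` layout measure, and the way the two signals are multiplied together are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny stop word list (an assumption, not from the patent).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "that", "on", "for"}


@dataclass
class TextNode:
    text: str
    # Assumed layout measure: 0.0 = center of the page, 1.0 = edge of the page.
    center_offset: float


def node_score(node: TextNode) -> float:
    """More stop words and a more central position both raise the score."""
    words = node.text.lower().split()
    stop_count = sum(1 for w in words if w in STOP_WORDS)
    # Assumed combination rule: nodes at the page edge keep half their weight.
    return stop_count * (1.0 - 0.5 * node.center_offset)


def core_text(nodes: list[TextNode]) -> str:
    """Return the text of the highest-scoring node as the core news text."""
    return max(nodes, key=node_score).text


nodes = [
    TextNode("Home | Login | Share", 0.9),
    TextNode("The blast in the port was reported to the police on Wednesday", 0.1),
]
print(core_text(nodes))
```

Navigation bars and share widgets contain few stop words, so they score near zero under this heuristic, while running prose near the center of the page scores highest.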
(3) Using the knowledge data from step (1), extract basic event element information from each news text to generate primary event data, then deduplicate or merge similar event data to form a primary event set.
News text follows specific writing conventions: the earlier paragraphs usually outline the news event, and the later paragraphs supplement the main event. News text can therefore be analyzed by combining template analysis with statistical learning methods. The process is described in more detail as follows:
a) Split the news text into sentences with a sentence splitter, and apply natural language processing tools (such as Stanford CoreNLP or NLPIR from Beijing Institute of Technology) to perform lexical and syntactic analysis on the news lead (the first six sentences can be selected), parsing it into syntax trees and identifying dependency relations.
b) Perform named entity recognition on the news lead according to the structural features of words in the syntax tree and the entity element information library, finding the entity objects involved in the event, such as person names, place names, and organization names.
c) According to the core action word in the news lead, determine the subject and object involved in the behavior. According to the predefined event behavior category patterns (illustrated in Table 1), identify the behavior relation of the event and the category it belongs to, and calculate the sentiment intensity of the event. For example, when studying international events, events are divided into 20 major categories ranging from political cooperation to large-scale violent incidents, with corresponding subcategories and word usage features defined for each. Sentiment intensity is scored from -10 to 10: a military attack or large-scale violent incident scores -10, ending military operations scores +10, and issuing a statement scores 0.
Table 1
d) Locate the time descriptors in the news text and apply the TimeML text time annotation standard together with the publication time to convert vague time expressions (such as "this Saturday" or "yesterday") into normalized time notation through inference rules. Infer the temporal order of events from the temporal relations in the text, and match events to time labels.
e) Locate the location descriptors in the text; an open-source geographic information tagging service can be used. Choose the first tagged word identified as a locative adverbial as the place of occurrence, and look up and complete it automatically from the place names in the text, reaching a minimum recognition granularity of country, administrative region, and city. If the text gives fine-grained location information such as a street or building, recognize down to the city and retain that descriptive field as well.
f) Integrate the above elements into primary event data. The event element types may take, but are not limited to, the following form: Event = (time, location, actor1, actor2, action, type, scale, url)
Here time is the time of occurrence, a descriptive or numeric element. location is the place of occurrence, comprising parts such as the descriptive name, country, administrative region, and city, and is empty when missing. actor1 and actor2 denote the acting subject and the acted-upon object respectively; each can be represented with multiple fields, including both a descriptive name and assertion information marking the entity's nature (such as person, official institution, unofficial institution, or international organization). action records the behavior descriptor. type denotes the event category and is an assertion element. scale denotes the sentiment orientation of the event and is a numeric element. url is supplementary information giving the source of the original data.
For example, for the news published on August 13:
Table 2
The corresponding primary event data can be represented as
Table 3
g) When the similarity between primary event data from the same period exceeds a specific threshold, keep the most recently generated event data for that period to deduplicate. At the same time, taking the more complete record as the basis, merge the information in the event elements and record all corresponding source information.
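As a minimal sketch of sub-steps f) and g), the event tuple can be modeled as a Python dataclass and near-duplicate events merged. The patent does not define the similarity measure or the threshold, so the field-agreement ratio, the 0.8 threshold, and the rule of filling gaps in the newer record from the older one are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Event:
    time: str
    location: Optional[str]
    actor1: Optional[str]
    actor2: Optional[str]
    action: Optional[str]
    type: Optional[str]
    scale: Optional[float]
    url: tuple  # every source URL that reported this event


def similarity(a: Event, b: Event) -> float:
    """Assumed measure: share of fields filled in both events that agree."""
    keys = ("time", "location", "actor1", "actor2", "action", "type")
    pairs = [(getattr(a, k), getattr(b, k)) for k in keys]
    both = [(x, y) for x, y in pairs if x is not None and y is not None]
    return sum(x == y for x, y in both) / len(both) if both else 0.0


def merge(newer: Event, older: Event) -> Event:
    """Keep the newer values, fill gaps from the older event, record all sources."""
    filled = {}
    for f in fields(Event):
        if f.name == "url":
            continue
        value = getattr(newer, f.name)
        filled[f.name] = value if value is not None else getattr(older, f.name)
    urls = tuple(dict.fromkeys(newer.url + older.url))  # de-duplicated, order kept
    return Event(url=urls, **filled)


def deduplicate(events: list[Event], threshold: float = 0.8) -> list[Event]:
    """`events` is assumed to be ordered from oldest to newest within one period."""
    result: list[Event] = []
    for ev in events:
        for i, kept in enumerate(result):
            if similarity(ev, kept) >= threshold:
                result[i] = merge(ev, kept)
                break
        else:
            result.append(ev)
    return result
```

A production system would likely compare normalized entities and nearby timestamps rather than exact string equality; the sketch only shows the keep-latest, merge-complete, record-all-sources flow.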
(4) Calculate the importance of each element of a primary event in characterizing the event, and generate an initial event summary framework composed of the basic elements.
a) The more critical an event element is to characterizing the event, the larger its importance value, which ranges from 0 to 1. The importance of the time-of-occurrence element is 1. The importance of a descriptive element is determined by its co-occurrence frequency in the news text corresponding to the event, and is normalized. For event elements that use a multi-level description, such as a place of occurrence described by place name, city name, administrative region name, and country name, the importance of the descriptive name is computed as above, and as the description granularity widens the importance is reduced accordingly from the base value of that element.
b) Expand each element value of the primary event data into key-value pairs and assign each key-value pair an importance according to the importance of its element, generating the initial event summary framework: $P(e) = \{\, ((k_i, v_i),\ \omega_i(e, (k_i, v_i))) \mid (k_i, v_i) \in E,\ \omega_i(e, (k_i, v_i)) \in [0, 1] \,\}$, where $E$ denotes the set of key-value pairs of all element components of event $e$, $i$ ranges up to the number of key-value pairs, $(k_i, v_i)$ is the i-th key-value pair, $k_i$ is the name of the element component, $v_i$ is the corresponding value of the component, and $\omega_i$ is the importance of the key-value pair.
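The construction of P(e) can be sketched in Python as follows. The patent states only that time has importance 1, that descriptive elements use a normalized co-occurrence frequency, and that coarser description levels are reduced; the plain substring count, the normalization by the largest count, and the `level_decay` factor of 0.8 are assumptions.

```python
def event_summary(event: dict, news_text: str, level_decay: float = 0.8) -> dict:
    """Return {(key, value): weight} with every weight in [0, 1].

    Multi-level elements are given as lists ordered from finest to coarsest,
    for example ["Tianjin", "China"].
    """
    text = news_text.lower()
    counts = {}
    for key, value in event.items():
        if key == "time" or value is None:
            continue
        levels = value if isinstance(value, list) else [value]
        for level, v in enumerate(levels):
            # Frequency in the news text, reduced for coarser description levels.
            counts[(key, v)] = text.count(v.lower()) * (level_decay ** level)
    top = max(counts.values(), default=0) or 1  # avoid dividing by zero
    summary = {pair: c / top for pair, c in counts.items()}
    if event.get("time") is not None:
        summary[("time", event["time"])] = 1.0  # time of occurrence is always 1
    return summary


event = {"time": "2015-08-12", "location": ["Tianjin", "China"], "actor1": "Ruihai"}
text = ("An explosion hit Tianjin. Tianjin officials said Ruihai "
        "stored chemicals in Tianjin, China.")
summary = event_summary(event, text)
```

Representing P(e) as a dictionary keyed by (key, value) tuples makes the later comparison with message summaries a simple matter of intersecting keys.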
(5) Generate a retrieval framework for social network data from the initial event summary framework, update the retrieval framework in real time with a dynamic iterative retrieval scheme, extract the social network message texts that satisfy the search conditions, and generate a candidate message set.
The process is described in more detail as follows:
a) Use the key-value pair information in the initial event summary framework as search keyword seeds, expand the keywords with synonym sets, and generate the microblog retrieval framework. Through the microblog platform's open data retrieval interface, retrieve the microblog data from the most recent period around the event (for example, within 7 days).
b) In the retrieved microblog messages, rank the messages and their words or phrases by the TF-IDF values of the words or phrases, choose the higher-ranked words as keywords, update the retrieval framework, and retrieve microblog messages again as above.
c) Terminate the iterative search when the keyword discovery process converges, extract the retrieved microblog message texts, and add them to the candidate message set.
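The iterative retrieval loop in sub-steps a) to c) can be sketched as below. The `search` callable stands in for the microblog platform's retrieval interface and is hypothetical; whitespace tokenization, the smoothed IDF, the top-k cut-off, and the `max_rounds` safety limit are likewise assumptions.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def top_tfidf_terms(messages: list[str], k: int) -> list[str]:
    """Rank terms by summed TF-IDF over the retrieved messages (smoothed IDF)."""
    docs = [m.lower().split() for m in messages]
    df = Counter(term for doc in docs for term in set(doc))
    scores: Counter = Counter()
    for doc in docs:
        for term, n in Counter(doc).items():
            idf = math.log((1 + len(docs)) / (1 + df[term])) + 1
            scores[term] += (n / len(doc)) * idf
    return [term for term, _ in scores.most_common(k)]


def iterative_search(seed_keywords: Iterable[str],
                     search: Callable[[set], list[str]],
                     k: int = 3, max_rounds: int = 5):
    """Expand the keyword set until no new high-ranking keyword is found."""
    keywords = set(seed_keywords)
    for _ in range(max_rounds):
        candidates = search(keywords)
        new_terms = set(top_tfidf_terms(candidates, k)) - keywords
        if not new_terms:  # keyword discovery has converged
            return keywords, candidates
        keywords |= new_terms
    return keywords, search(keywords)


# Stand-in for the real retrieval interface: match any keyword in a tiny corpus.
corpus = ["tianjin explosion port", "port warehouse fire",
          "warehouse chemicals", "football match tonight"]


def fake_search(keywords: set) -> list[str]:
    return [m for m in corpus if keywords & set(m.split())]


keywords, candidates = iterative_search({"tianjin"}, fake_search)
```

Because new keywords can only come from messages already retrieved, the unrelated football post is never reached, while posts linked to the seed through shared vocabulary are gradually pulled in.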
(6) Using the knowledge data from step (1) together with text semantic analysis, analyze the element information in the candidate message set and the categories it belongs to, analyze the importance of each key-value pair, and generate a summary framework for each candidate message from the key-value pair analysis results.
The process is described in more detail as follows:
a) Extract the picture metadata from the microblog message, or the user's geographic location information from the candidate message metadata, to obtain the geographic coordinates corresponding to the candidate message.
b) Perform named entity recognition and shallow semantic parsing on the candidate messages to locate the entity information and semantic roles involved in each microblog post.
c) Use the knowledge graph data sets and associated tools to map the entity information in the post to related concepts, obtaining the key-value pair information contained in the microblog post. For example, in the post "The Wantong Xincheng International Community, about 2 kilometers from the blast site, suffered serious property damage", the entity "Wantong Xincheng International Community" is identified as belonging to the "residential area" category.
d) Classify or cluster the microblog texts, establish associations between categories and keywords to form a group of key-value pairs, and store them together with the microblog text. After a major event occurs, microblog content generally falls into the following categories: event impact, cause analysis, potential risk, eyewitness experience, user comments, and so on. Texts are classified according to text features and the corresponding category recognition rules, and the identified key-value pair information is then mapped to the corresponding category. For example, microblog posts in the "event impact" category may contain the following key-value pairs: (death toll, 165), (number of injured, 798), (residential area, Wantong Xincheng International Community), and so on.
e) Assess the importance of the microblog message content in terms of the microblog metadata, user attention, and the geographic location from which the post was published. Microblog metadata includes measures of attention to the post such as its repost and comment popularity; the higher the popularity, the more important the message content. User attention is the publisher's number of followers and represents the publisher's influence. The geographic location from which the post was published is compared with the geographic location in the primary event framework; if the geographic distance is within a certain range, the post is marked as an eyewitness message and its importance is raised. The importance model can take the form score = MS + US + LS, where MS is the microblog popularity score computed from the metadata, US is the score computed from user information, and LS is the score computed from the relative geographic position. The final score is normalized to a value between 0 and 1.
f) Integrate the key-value pair information of each microblog post and, from the query scores of the key-value pairs and the microblog importance information, form the summary framework of the candidate microblog message $m$, that is,
$P(m) = \{\, ((k_i, v_i),\ s_i(m, (k_i, v_i))) \mid (k_i, v_i) \in M,\ s_i(m, (k_i, v_i)) \in [0, 1] \,\}$, where $s_i(m, (k_i, v_i))$ is the importance of the key-value pair $(k_i, v_i)$ extracted from the message text, computed jointly from the importance score of microblog $m$ and the TF-IDF value of the key-value pair among the candidate message key-value pairs; $i$ ranges up to the number of key-value pairs contained in this microblog message (text and metadata). The summary framework of a microblog message may contain no key-value pairs or several groups of them. $M$ denotes the set of key-value pairs of all element components of candidate message $m$, $k_i$ is the name of the i-th element component, and $v_i$ is the corresponding value of the component.
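A minimal Python sketch of the importance model score = MS + US + LS and of the message summary P(m) follows. The patent does not say how each component is computed or how the score and TF-IDF value are combined, so the log-scaled `squash` helper, the saturation caps, the 5 km eyewitness radius, the averaging used for normalization, and the product rule in `message_summary` are all assumptions.

```python
import math
from typing import Optional


def squash(count: float, cap: float) -> float:
    """Map a raw count to [0, 1] on a log scale; `cap` is an assumed saturation point."""
    return min(1.0, math.log1p(max(count, 0)) / math.log1p(cap))


def message_importance(reposts: int, comments: int, followers: int,
                       distance_km: Optional[float],
                       eyewitness_radius_km: float = 5.0) -> float:
    """score = MS + US + LS, normalized to [0, 1] by averaging the three parts."""
    ms = squash(reposts + comments, cap=10_000)   # popularity from post metadata
    us = squash(followers, cap=1_000_000)         # influence of the publisher
    near = distance_km is not None and distance_km <= eyewitness_radius_km
    ls = 1.0 if near else 0.0                     # posted close to the event site
    return (ms + us + ls) / 3.0


def message_summary(pair_tfidf: dict, importance: float) -> dict:
    """Build P(m): weight each key-value pair by importance times its TF-IDF value.

    `pair_tfidf` maps (key, value) tuples to TF-IDF values assumed to lie in [0, 1].
    """
    return {pair: importance * min(1.0, max(0.0, t)) for pair, t in pair_tfidf.items()}


score = message_importance(reposts=120, comments=30, followers=5_000, distance_km=2.0)
p_m = message_summary({("death toll", "165"): 0.5, ("residential area", "Wantong"): 0.9},
                      score)
```

Log scaling keeps a handful of very popular accounts from dominating the score; any monotone mapping into [0, 1] would serve the same purpose.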
(7) Compare the similarity between each candidate message summary framework and the event summary framework, and add the candidate messages that meet the requirement to the message queue of the event.
The microblog retrieval triggered by the event summary framework P(e) is a text-based query filter. The similarity between the summary framework P(m) of each candidate message and P(e) is calculated by adjusted cosine similarity or the Minkowski distance, and a filtering rule for candidate messages is established from a similarity threshold. This achieves semantic filtering and yields a more accurate event message queue.
(8) According to preset message ordering conditions (such as the importance of the social network message or its posting time), select the key-value pairs in the message queue in turn as candidate event elements of the event data. For deterministic information such as geographic coordinates, cluster the key-value pairs added to the message queue and add the analysis result to the candidate event elements.
The microblog message queue of the event contains finer-grained element information about the event, which needs to be added to the event data according to specific rule conditions, as described further below:
a) Sort the messages in the microblog message list. The messages can be sorted by microblog importance score, by the similarity between the microblog summary framework and the initial event summary framework, or in ascending order of the gap between the publication time of the microblog message and the time in the event summary framework. Users can also combine these criteria to build a customized ordering strategy.
b) Take the microblog messages from the queue in order. If a key-value pair of the message does not appear in the current event summary framework, add it to the candidate event elements of the event data; continue until no new information is added.
c) For the large amount of geographic coordinate data in the message queue, outlier removal and cluster analysis yield the accurate longitude and latitude at which the event occurred. This step is particularly useful for improving accuracy for events with multiple scenes.
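Steps a) to c) can be sketched as below. The sketch is illustrative only: messages are assumed to be dictionaries with hypothetical `score` and `pairs` fields, and a simple median-based outlier filter followed by averaging stands in for the cluster analysis. That simplification assumes a single scene; an event with multiple scenes would need a real clustering algorithm.

```python
from statistics import median


def rank_messages(messages, key="score"):
    """Step a): order the queue, highest ranking value first."""
    return sorted(messages, key=lambda m: m[key], reverse=True)


def collect_candidate_elements(ranked, event_pairs):
    """Step b): gather pairs not already in the event summary framework."""
    seen = set(event_pairs)
    candidates = []
    for message in ranked:
        for pair in message["pairs"]:
            if pair not in seen:
                seen.add(pair)
                candidates.append(pair)
    return candidates


def estimate_location(coords, max_dev=0.5):
    """Step c), simplified: drop points far from the median, average the rest.

    coords  -- list of (latitude, longitude) tuples
    max_dev -- hypothetical tolerance in degrees around the median
    """
    if not coords:
        return None
    lat_med = median(lat for lat, _ in coords)
    lon_med = median(lon for _, lon in coords)
    kept = [
        (lat, lon)
        for lat, lon in coords
        if abs(lat - lat_med) <= max_dev and abs(lon - lon_med) <= max_dev
    ]
    if not kept:
        return (lat_med, lon_med)  # no point is near the median; fall back
    return (
        sum(lat for lat, _ in kept) / len(kept),
        sum(lon for _, lon in kept) / len(kept),
    )
```

Sorting by another criterion, such as publication time, only requires passing a different `key` field to `rank_messages`.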
(9) The candidate event elements extracted from the news text and the social network data above are further classified by aspects such as time, place, entity, category, result, scale, and sociological attributes. Event fusion rules are then applied to normalize and integrate the event elements, generating complete event data.
The event summary framework and the candidate event elements obtained from microblog data may overlap in content. For example, for the involved-party element of the "8.12 Tianjin Port major explosion" event, the corresponding values may be "Ruihai Logistics", "Ruihai Company", "Tianjin Port Group", and so on. Information of the same category therefore needs to be integrated and similar information merged, as described further below:
a) Classify the element category names according to knowledge and training data. The categories include time of occurrence, place of occurrence, agent (acting subject), patient (acted-upon object), event category, event result, scale and impact, sociological attributes, and so on. The categories involved serve as the outermost description labels of the event data.
b) According to the concept network provided by the knowledge graph, add the element category names to the sub-labels of the event data; intermediate concept nodes can be added where necessary.
c) Normalize the value types of the candidate event elements, and add the type labels (descriptive, assertive, numeric, etc.) and the value contents to the event data, forming complete event data.
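The fusion steps a) to c) can be sketched as a small routine that groups candidate elements under outermost category labels, merges aliases of the same entity, and tags each value with a type label. The lookup tables `category_of` and `canonical_name` are hypothetical stand-ins for the knowledge and training data and the knowledge graph, and the Python-type-based rule in `value_type` is an assumption, not the patent's normalization rule.

```python
def value_type(value):
    """Assign a type label; bool is checked first because it is also an int."""
    if isinstance(value, bool):
        return "assertive"
    if isinstance(value, (int, float)):
        return "numeric"
    return "descriptive"


def fuse_event(candidates, category_of, canonical_name):
    """Build event data as {category: {element name: [typed values]}}.

    candidates     -- list of (element name, value) pairs
    category_of    -- hypothetical map from element name to outermost label
    canonical_name -- hypothetical alias table used to merge similar values
    """
    event = {}
    for name, value in candidates:
        category = category_of.get(name, "other")
        value = canonical_name.get(value, value)  # merge aliases of one entity
        entry = {"type": value_type(value), "value": value}
        bucket = event.setdefault(category, {}).setdefault(name, [])
        if entry not in bucket:  # skip values that became duplicates
            bucket.append(entry)
    return event
```

With an alias table that maps "Ruihai Company" to "Ruihai Logistics", the two mentions collapse into a single value under the same element.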
Table 4. Excerpt of complete event data

Claims (10)

1. A cross-media event extraction method, comprising the steps of:
1) setting a seed event feature library and the required knowledge data;
2) collecting news web pages from configured trusted news sources, and extracting news text and metadata from the collected news web pages;
3) extracting event element information from each news text according to the seed event feature library and the required knowledge data, and generating initial event data to obtain an initial event set;
4) calculating the significance level of each element of an initial event in portraying the event, and generating an initial summary framework of the event;
5) searching social network message text based on each element in the initial summary framework of the event, and generating an alternate message set;
6) analyzing, in combination with a text semantic analysis method, the element information contained in the alternate message set and the categories to which it belongs, and generating a summary framework of each alternate message;
7) filtering the alternate messages according to the similarity between the summary framework of each alternate message and the initial summary framework of the event, to obtain a message queue corresponding to the initial event;
8) adding the event elements in the initial summary framework of the event, together with the event elements that exist in the message queue but not in the initial summary framework of the event, to a candidate event element set;
9) generating complete event data according to the event elements in the candidate event element set.
2. The method of claim 1, characterized in that the initial summary framework of the event is P(e) = {((k_i, v_i), ω_i(e, (k_i, v_i))) | (k_i, v_i) ∈ E, ω_i(e, (k_i, v_i)) ∈ [0, 1]}, where E denotes the set of key-value pairs of all element components of event e, k_i is the name of the i-th element component, v_i is the corresponding value of that component, and ω_i is the significance level of the i-th key-value pair (k_i, v_i).
3. The method of claim 2, characterized in that the alternate message set is generated by:
a) using the key-value pair information in the initial summary framework of the event as search keyword seeds, expanding the keywords according to a synonym set, generating an alternate message retrieval framework, and retrieving the alternate messages within a set time;
b) ranking the segmented words in the retrieved alternate messages by their TF-IDF values, selecting several words by rank as keywords to update the alternate message retrieval framework, and then iteratively retrieving the alternate messages within the set time; when the keyword discovery process converges, terminating the iterative search and taking the retrieved alternate messages as the alternate message set.
4. The method of claim 1, 2, or 3, characterized in that the summary framework of an alternate message is generated by:
a) extracting picture metadata or user geographic location information from the metadata of the alternate message, to obtain the geographic coordinate information corresponding to the alternate message;
b) performing named entity recognition and shallow semantic parsing on the alternate messages, to locate the entity information and semantic roles of each alternate message;
c) mapping the entity information of the alternate message according to the knowledge data, to obtain the key-value pair information contained in the alternate message;
d) performing classification or clustering on the key-value pair information obtained in step c), establishing the association between categories and keywords, obtaining several groups of key-value pairs of the alternate message, and assessing the significance level of the alternate message;
e) forming the summary framework of the alternate message from its key-value pairs and their significance level information.
5. The method of claim 2 or 3, characterized in that the summary framework of the alternate message is P(m) = {((k_i, v_i), s_i(m, (k_i, v_i))) | (k_i, v_i) ∈ M, s_i(m, (k_i, v_i)) ∈ [0, 1]}, where s_i(m, (k_i, v_i)) is the significance level of the key-value pair (k_i, v_i) extracted from alternate message m, M denotes the set of key-value pairs of all element components of alternate message m, k_i is the name of the i-th element component, and v_i is the corresponding value of that component.
6. The method of claim 1, 2, or 3, characterized in that the seed event feature library and the required knowledge data comprise: an entity element information library, open knowledge graph datasets of associated ontologies and classifications, an event behavior category pattern library, or corpus resources.
7. The method of claim 6, characterized in that event element information is extracted from each news text according to the seed event feature library and the required knowledge data, and initial event data is generated, by:
a) splitting the news text into sentences, parsing each news sentence into a syntax tree, and identifying dependency relations;
b) performing named entity recognition on the news sentences according to the structural features of words in the syntax tree and the entity element information library, to obtain the entity objects involved in the event;
c) determining the subject and object involved in the behavior according to the action core word in the news sentence, identifying the behavior relation and category of the event according to predefined event behavior category patterns, and calculating the sentiment orientation intensity of the event;
d) locating the time descriptors in the news text, converting fuzzy time expressions into a standard time notation, and matching the event with the time labels;
e) locating the place descriptors in the news text, choosing the first token identified as a locative adverbial as the place where the event occurred, and automatically looking up and completing it according to the place names in the news text;
f) integrating the above elements into initial event data.
8. The method of claim 1, 2, or 3, characterized in that the significance level of each element of the initial event in portraying the event is calculated as follows: the more important an event element is to portraying the event, the larger its significance level value; wherein the significance level of the time-of-occurrence element is the maximum significance level, and the significance level of a descriptive element is determined according to its co-occurrence frequency in the news text.
9. The method of claim 1, 2, or 3, characterized in that the complete event data is generated by:
a) classifying the event elements in the candidate event element set, the categories including time of occurrence, place of occurrence, agent (acting subject), patient (acted-upon object), event category, event result, scale and impact, and sociological attributes; and using the categories of the event elements as the outermost description labels of the event data;
b) adding the category names of the event elements to the sub-labels of the event data according to the concept network provided by the knowledge graph in the knowledge data;
c) normalizing the value types of the candidate event elements to obtain value type labels, and then adding the value type labels and the value contents to the event data, forming complete event data.
10. The method of claim 1, characterized in that the element format of the initial event data is event = (time, location, actor1, actor2, action, type, scale, url), where time is the time of occurrence, location is the place of occurrence, actor1 is the agent (acting subject), actor2 is the patient (acted-upon object), action is the behavior descriptor, type denotes the event category, scale denotes the sentiment orientation of the event, and url is the source of the original data.
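For illustration only, the eight-field event tuple recited in claim 10 could be held in a small record type such as the one below. The field names follow the claim; the class name `InitialEvent` and the field types (strings, and a float for the sentiment orientation) are assumptions, since the claim does not fix any data types.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InitialEvent:
    """One initial event; field names follow claim 10, types are assumed."""

    time: str       # time of occurrence, in a normalized notation
    location: str   # place of occurrence
    actor1: str     # agent (acting subject)
    actor2: str     # patient (acted-upon object)
    action: str     # behavior descriptor
    type: str       # event category
    scale: float    # sentiment orientation of the event
    url: str        # source of the original data


def to_key_value_pairs(event):
    """Flatten an event into (name, value) pairs, in declaration order."""
    return list(asdict(event).items())
```

Flattening the record into key-value pairs gives the form in which elements enter the initial summary framework of the event.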
CN201610809600.1A 2016-09-08 2016-09-08 A kind of event extraction method across media Active CN106484767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610809600.1A CN106484767B (en) 2016-09-08 2016-09-08 A kind of event extraction method across media

Publications (2)

Publication Number Publication Date
CN106484767A true CN106484767A (en) 2017-03-08
CN106484767B CN106484767B (en) 2019-06-21

Family

ID=58273654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610809600.1A Active CN106484767B (en) 2016-09-08 2016-09-08 A kind of event extraction method across media

Country Status (1)

Country Link
CN (1) CN106484767B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778200A (en) * 2014-01-09 2014-05-07 中国科学院计算技术研究所 Method for extracting information source of message and system thereof
CN104408093A (en) * 2014-11-14 2015-03-11 中国科学院计算技术研究所 News event element extracting method and device
CN105389304A (en) * 2015-10-27 2016-03-09 小米科技有限责任公司 Event extraction method and apparatus
CN105389354A (en) * 2015-11-02 2016-03-09 东南大学 Social media text oriented unsupervised method for extracting and sorting events

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
金璐钰: "基于框架的事件抽取研究", 《高科技与产业化》 *

Also Published As

Publication number Publication date
CN106484767B (en) 2019-06-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant