CN106484767A - Cross-media event extraction method
- Publication number
- CN106484767A (application number CN201610809600.1A)
- Authority
- CN
- China
- Prior art keywords
- event
- key
- data
- alternate message
- message
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a cross-media event extraction method. The method comprises: setting up a seed event feature library and the required knowledge data; collecting news web pages from credible news sources and extracting the news text and metadata; extracting event element information from each news text to generate an initial event set; calculating the importance of each element of an initial event in characterizing the event, and generating an initial summary frame of the event; searching social network message texts based on the elements in the initial summary frame to generate a candidate message set; filtering the candidate messages according to the similarity between each candidate message's summary frame and the event summary frame, to obtain the message queue corresponding to the initial event; and generating complete event data from the event elements in the initial summary frame together with the event elements that appear in the message queue but not in the initial summary frame. The invention enables accurate extraction of major events from massive cross-media data.
Description
Technical field
The present invention relates to an event extraction method for cross-media data environments based on news media and social networks, and belongs to the field of information retrieval.
Background
Scientific, quantitative analysis of news event data can be applied in social research areas such as situational awareness, emergency response, and risk early warning. Event data records a human activity in a specific scenario. It comprises several classes of elements, such as the subject and object involved, the agent's behavior, time, place, type, and sociological attributes; it is usually represented as a tuple and is an atomic description of the real world. By form of expression, event elements can be divided into numeric, descriptive, and assertion types: numeric data usually expresses quantities in the event, descriptive data is usually a keyword carrying an element category, and assertion data indicates a specific attribute feature. Before and after an event on a particular topic occurs, news media and social networks turn their attention to it, and information about the event spreads over the Internet through carriers such as text and images. This has made information retrieval the main way of obtaining event data and has given rise to event extraction techniques.
The main task of event extraction is to discover events in massive network data and structure them around event elements, ultimately generating event data suitable for automated machine analysis. The conventional processing steps are as follows: (1) data extraction: for each class of data source, set up matching data detection rules and data extraction interfaces, and define a rule-update policy to cope with interface changes in the data source; (2) preprocess the raw data: remove noise and package the different classes of data (text, images, metadata, and so on) appropriately; (3) combine knowledge information with machine learning methods to understand the data further, find the anchors or data features related to event elements, and identify and extract the event-related element information; (4) process the identified event elements through deduplication, clustering, normalization, and similar steps to generate candidate event data; (5) fuse the event data to generate finely structured event data, and store it centrally to form an event library. Users can retrieve event data through a unified event library access interface, which greatly simplifies data processing and opens up more room for mining in studies of political and social evolution.
Because news articles have a uniform structure and a rigorous style, the event extraction methods in common use today mainly target text data from news media and generate event data in a predetermined format. With the spread of social networks, microblog messages posted by users have become first-hand sources about events; as the messages propagate, users spontaneously supplement the event information, producing a crowd effect around key events on social networks. Social networks are also playing an increasingly important role in driving how events evolve (for example, the "Arab Spring"), which exposes the limitations of traditional extraction methods based on news text. In addition, event analysis in complex scenarios requires a more diverse set of extracted event elements, and a major event usually triggers a series of related events whose linkage is hard to capture in traditional news data. A finer-grained event extraction method and a dynamically variable storage structure for event data are therefore needed. No method has yet been found that extracts events from news media and social network data jointly. As knowledge linking and machine learning methods continue to mature, the conditions are in place for accurate extraction of major events from massive heterogeneous cross-media data.
Summary of the invention
In view of the above problems, the present invention provides a cross-media event extraction method. It is divided into three stages: knowledge preparation (step 1), basic event element extraction (steps 2-4), and event element extension (steps 5-9). It covers the initial summary frame of an event, the candidate event elements extracted from social network information, and event fusion. The main steps are as follows:
(1) Set up the seed event feature library and the required knowledge data, including entity element information bases for specific organizations, institutions, places, persons, and so on; open knowledge graph datasets; an event behavior category pattern library; and corpus resources that link ontologies with categories.
(2) Collect news web pages in real time from the configured credible news sources, preprocess them, and extract the news text and metadata.
(3) Extract basic event element information from each news text to generate initial event data; deduplicate or merge similar event data to form the initial event set.
(4) Calculate the importance of each element of an initial event in characterizing the event, and generate an initial summary frame of the event composed of the basic elements.
(5) Generate a retrieval frame for social network data from the event's initial summary frame, update the retrieval frame in real time with a dynamic iterative retrieval scheme, extract the social network message texts that satisfy the search conditions, and generate the candidate message set.
(6) Using text semantic analysis, analyze the element information in the candidate message set and the categories it belongs to, analyze the importance of each key-value pair, and generate a summary frame for each candidate message from the analysis results.
(7) Compare the similarity between each candidate message's summary frame and the event summary frame; when the requirement is met, add the candidate message to the message queue of the corresponding initial event.
(8) According to preset message ordering conditions (such as the importance of the social network message or its posting time), select key-value pairs from the message queue in turn as candidate event elements of the event data; for deterministic information such as geographic coordinates, cluster the key-value pairs added to the message queue and add the analysis result to the candidate event elements.
(9) Further classify the candidate event elements extracted from the news text and social network data by time, place, entity, category, result, scale, sociological attributes, and so on; apply event fusion rules to normalize and integrate the event elements; and generate complete event data.
The beneficial effects of the invention are:
1. It provides a method for extracting multiple classes of event elements in a cross-media data environment and achieves fine-grained, extensible event element extraction. It draws on the standardized descriptions of news text to extract the basic event elements, and it uses the large scale, user-driven updates, and broad content coverage of social network text to add element information such as event results, scale and impact, and sociological attributes.
2. The retrieval frame based on the event summary and the summary frame of candidate messages query in both directions, in the retrieval stage and in the filtering stage, so social network messages related to the event can be selected more accurately.
3. It takes into account how important each event element is in characterizing the event, and so retains the more critical and credible event element information.
4. It extracts event elements not only from the text data of the cross-media environment but also from social network metadata, which is strong at describing event-related time, location, and popularity.
Brief description of the drawings
Fig. 1 is a flow chart of the cross-media event extraction method according to an embodiment of the invention.
Detailed description
The invention provides a cross-media event extraction method that, after a major event of a particular category occurs, quickly generates fine-grained structured event data from the relevant information in news media and social networks. It includes data extraction, event summary frames, event element extraction, and event fusion. The invention is described in detail below with reference to a specific embodiment, in which social network event extraction takes microblog data as an example. It should be understood that this embodiment only illustrates the invention and does not limit its scope.
Fig. 1 shows the flow of the cross-media event extraction method of the invention, which comprises the following steps:
(1) Set up the seed event feature library and the required knowledge data, including entity element information bases for specific organizations, institutions, places, persons, and so on; open knowledge graph datasets; an event behavior category pattern library; and corpus resources that link ontologies with categories.
In implementation, suitable event feature libraries and knowledge sets are collected and selected according to the topic type of the target events and the main characteristics of public data resources. They contain the feature word set of the target events and typical news event corpora, are used for subsequent event recognition and filtering, and have synchronized update rules. For entity elements and ontologies, associations such as synonyms and categories are established in addition to the reference name. In personal information, for example, the synonyms associated with a person's name include "president of country X" and "supreme leader of country X"; the person also belongs to the category of government personnel, and these associations are time-sensitive. The WordNet corpus and data resources provided by official organizations can be used. As another example, the event "two countries conclude an agreement" is a cooperation event and also an event with positive sentiment, and can be annotated in the form of a code tree. Open-source knowledge bases such as DBpedia or Freebase provide ontology information and knowledge graphs of the corresponding categories; for example, "United Nations" corresponds to the category "non-profit international organization". Event behavior patterns can be defined as language templates in terms of syntactic structure, syntax trees, and so on. Associating the syntactic structure with the rules of conventional entity recognition methods yields the features of event behaviors and entity relations as expressed in text, which are used in the subsequent event element extraction.
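The synonym and category associations described above can be pictured as a small lookup structure. The following is only a minimal sketch under assumed names: `EntityEntry`, `KnowledgeBase`, and the sample entry are hypothetical and stand in for whatever WordNet-, DBpedia-, or Freebase-backed store an implementation would actually use.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityEntry:
    """One entity in a toy knowledge base (illustrative structure only)."""
    name: str
    synonyms: set[str] = field(default_factory=set)
    categories: set[str] = field(default_factory=set)


class KnowledgeBase:
    """Maps every known surface form (name or synonym) to its entity entry."""

    def __init__(self) -> None:
        self._by_surface: dict[str, EntityEntry] = {}

    def add(self, entry: EntityEntry) -> None:
        # Index the reference name and each synonym, case-insensitively.
        for surface in {entry.name, *entry.synonyms}:
            self._by_surface[surface.lower()] = entry

    def lookup(self, surface: str) -> Optional[EntityEntry]:
        return self._by_surface.get(surface.lower())

    def expand(self, surface: str) -> set[str]:
        """Return the reference name plus all synonyms, or the input if unknown."""
        entry = self.lookup(surface)
        if entry is None:
            return {surface}
        return {entry.name, *entry.synonyms}


kb = KnowledgeBase()
kb.add(EntityEntry("United Nations", {"UN"}, {"non-profit international organization"}))
```

The same `expand` call is what a later retrieval step could use to widen a keyword into its synonym set; the time-sensitivity of associations mentioned in the text is not modeled here.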
(2) Collect news web pages in real time from credible news sources, preprocess them, and extract the news text and metadata.
Credible news media should be chosen when extracting event data from news text. A credible news source usually reports a major event as soon as it occurs and covers event categories comprehensively, which reduces the number of website RSS seeds that must be integrated. Its layout and the veracity of what it cites are also of higher quality than those of its peers, which makes the downstream processing modules easier. The list of credible news sources should be chosen with authority, region, freshness, and similar factors in mind. Collecting news web pages must meet large-scale, real-time requirements, so a Redis-based distributed crawling mechanism can be used. The text and metadata can be extracted from news web pages with the Goose extraction mechanism, filtering out irrelevant data at the same time. The process in more detail is as follows:
A) Define the seed list of credible news sources: label the coverage category of each source by region of interest (domestic, international, specific regions, and so on) and set an update interval for each, defaulting to one update every 15 minutes.
B) Store the news source list on the master server and assign subtasks to the subordinate servers. Each news source seed is given its own background worker thread, which starts the text and metadata extraction module.
C) In the extraction module, extract all text-bearing parts from the HTML of the original page using structures such as the DOM and CSS. For nodes that contain multiple pieces of text, score each node by the number of stop words under it and by its position in the page layout, and use the score to judge the node's importance. In general, a higher stop-word count indicates fuller content, and content closer to the center of the page layout is more important. Find the core node in this way and extract the text it contains as the core news text.
D) Filter out news texts that describe irrelevant events. Irrelevant events that are easily confused with target events usually have distinctive text features. When studying political and social events, for example, sports reports often use wording that suggests a contest between countries, but they also contain many sports terms such as "international league". An exclusion dictionary containing the word features of irrelevant events can therefore be used to filter them out.
E) Using predefined rules or templates, remove tags unrelated to the content structure, such as CSS and script, while keeping the publication date and title, to complete text extraction and cleaning.
F) Integrate the extracted news text and metadata into a file of the prescribed format and upload it to a database on a NoSQL storage architecture.
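The node scoring in sub-step C can be illustrated with a short sketch. The scoring formula, the tiny `STOP_WORDS` set, and the representation of a node as `(text, position)` are all assumptions made for illustration; the text only states that stop-word count and layout position both contribute, not how they are combined.

```python
# Tiny illustrative stop-word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "that", "for"}


def score_node(text: str, position: float) -> float:
    """Score a text node by stop-word count and closeness to the page centre.

    position is the node's vertical centre, scaled so 0.0 is the top of the
    page and 1.0 is the bottom (an assumed convention).
    """
    stop_count = sum(1 for word in text.lower().split() if word in STOP_WORDS)
    centrality = 1.0 - abs(position - 0.5) * 2.0  # 1.0 at centre, 0.0 at an edge
    # Assumed combination: stop-word count damped by up to half near the edges.
    return stop_count * (0.5 + 0.5 * centrality)


def pick_core_node(nodes: list[tuple[str, float]]) -> str:
    """Return the text of the highest-scoring node."""
    return max(nodes, key=lambda node: score_node(*node))[0]


nodes = [
    ("Home | World | Sports | Login", 0.02),
    ("An explosion was reported in the port area on Wednesday night "
     "and the fire spread to nearby warehouses.", 0.45),
    ("Copyright 2015 and terms of use", 0.98),
]
core_text = pick_core_node(nodes)
```

Navigation bars and footers tend to contain few stop words and sit at the page edges, so both signals push them below the article body in this toy scoring.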
(3) Using the knowledge data required in step (1), extract basic event element information from each news text to generate initial event data; deduplicate or merge similar event data to form the initial event set.
News text follows specific writing conventions: the earlier paragraphs usually outline the news event, and the later paragraphs supplement the main event. The news text can therefore be analyzed by combining template analysis with statistical learning methods. The process in more detail is as follows:
A) Split the news text into sentences with a sentence splitter, and apply natural language processing tools (such as Stanford University's CoreNLP or Beijing Institute of Technology's NLPIR) to perform lexical and syntactic analysis on the news summary (the first six sentences can be used), parsing it into syntax trees and identifying the dependency relations.
B) Based on the structural features of words in the syntax tree and on the entity element information base, perform named entity recognition on the news summary to find the entity objects involved in the event, such as person names, place names, and organization names.
C) From the core action word in the news summary, determine the subject and object involved in the behavior. Using the predefined event behavior category patterns (illustrated in Table 1), identify the behavior relation of the event and the category it belongs to, and calculate the sentiment intensity of the event. When studying international events, for example, events are divided into 20 major classes ranging from political cooperation to large-scale violence, each with its own subclasses and word-usage features. Sentiment intensity is scored from -10 to 10: a military attack or large-scale violent event scores -10, ending a military operation scores +10, and issuing a statement scores 0.
Table 1
D) Locate the time expressions in the news text. Apply the TimeML standard for temporal relations in text together with the publication time, and convert vague time expressions (such as "this Saturday" or "yesterday") into normalized time notation by inference rules. Infer the temporal order of events from the time relations across the text, and match events with time labels.
E) Locate the place expressions in the text; an open-source geographic tagging service can be used. Take the first word tagged as a locative adverbial as the place of the event, and look it up and complete it automatically against the place names in the text, reaching a minimum granularity that runs from country through administrative region down to city. If the text gives finer location information such as a street or building, recognize down to the city and also keep that descriptive field.
F) Integrate the above elements into initial event data. The event element tuple can take, but is not limited to, the following form: Event = (time, location, actor1, actor2, action, type, scale, url)
Here time is the time of occurrence, a descriptive or numeric element. location is the place of occurrence, with components such as descriptive name, country, administrative region, and city, and is empty when missing. actor1 and actor2 represent the agent (subject) and the patient (object) respectively; they can be expressed with several kinds of fields, including a descriptive name and assertion information marking the entity's nature (such as person name, official body, unofficial body, or international organization). action records the word describing the behavior. type is the event category, an assertion-type element. scale is the sentiment tendency of the event, a numeric element. url is supplementary information giving the source of the original data.
For example, for a news item published on August 13:
Table 2
the corresponding initial event data can be expressed as:
Table 3
G) When the similarity between initial event data from the same period exceeds a given threshold, keep the most recently generated event data in that period as the deduplicated record. At the same time, take the record with more complete information as the basis, merge the information of the event elements, and record all of the corresponding sources.
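The event tuple of sub-step F and the deduplication of sub-step G can be sketched as follows. This is a simplified illustration, not the patent's procedure: the field-overlap `similarity` measure, the 0.6 threshold, and the merge policy (newer values win, gaps filled from the older record) are assumptions, and the single `url` field is widened to a hypothetical `urls` list so that all sources can be recorded.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Event = (time, location, actor1, actor2, action, type, scale, url)."""
    time: str
    location: str = ""
    actor1: str = ""
    actor2: str = ""
    action: str = ""
    type: str = ""
    scale: float = 0.0                              # sentiment tendency, -10..10
    urls: list[str] = field(default_factory=list)   # all contributing sources


COMPARED = ("time", "location", "actor1", "actor2", "action", "type")
MERGED = ("location", "actor1", "actor2", "action", "type")


def similarity(a: Event, b: Event) -> float:
    """Assumed measure: share of compared fields that are non-empty and equal."""
    same = sum(1 for name in COMPARED
               if getattr(a, name) and getattr(a, name) == getattr(b, name))
    return same / len(COMPARED)


def merge(older: Event, newer: Event) -> Event:
    """Keep the newer record, fill its empty fields from the older one."""
    merged = Event(time=newer.time or older.time, scale=newer.scale)
    for name in MERGED:
        setattr(merged, name, getattr(newer, name) or getattr(older, name))
    merged.urls = list(dict.fromkeys(older.urls + newer.urls))  # ordered, unique
    return merged


def deduplicate(events: list[Event], threshold: float = 0.6) -> list[Event]:
    """events are assumed to arrive in generation order, oldest first."""
    kept: list[Event] = []
    for event in events:
        for i, existing in enumerate(kept):
            if similarity(existing, event) > threshold:
                kept[i] = merge(existing, event)
                break
        else:
            kept.append(event)
    return kept
```

For instance, two reports of the same explosion that differ only in whether the agent was identified would collapse into one record carrying both source URLs, while an unrelated event from the next day would be kept separately.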
(4) Calculate the importance of each element of an initial event in characterizing the event, and generate an initial summary frame of the event composed of the basic elements.
A) The more critical an event element is to characterizing the event, the larger its importance value, which ranges from 0 to 1. Specifically: the importance of the time-of-occurrence element is 1; the importance of a descriptive element is determined by its co-occurrence frequency in the news text corresponding to the event, and is normalized. For event elements described at multiple levels, such as an event location described by place name, city name, administrative region name, and country name, the descriptive name is computed in the same way, and as the granularity of the description becomes coarser, the importance is reduced appropriately from the element's own base value.
B) Expand each element value of the initial event data into key-value pairs and assign each pair an importance according to the importance of its element, generating the event's initial summary frame:
$$P(e) = \{\, ((k_i, v_i),\ \omega_i(e, (k_i, v_i))) \mid (k_i, v_i) \in E,\ \omega_i(e, (k_i, v_i)) \in [0, 1] \,\}$$
where $E$ is the set of key-value pairs of all element components of event $e$, the maximum value of $i$ is the number of key-value pairs, $(k_i, v_i)$ is the $i$-th key-value pair, $k_i$ is the name of the element component, $v_i$ is the component's value, and $\omega_i$ is the importance of the key-value pair.
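As a rough illustration of how a frame P(e) might be built, the sketch below fixes the weight of the time element at 1 and derives the weights of descriptive elements from how often their words occur in the news text, normalized by the largest count. The normalization by maximum and the word-level counting are assumptions; the multi-level weight reduction for coarser location granularity is not modeled.

```python
import re


def summary_frame(event: dict[str, str], text: str) -> dict[tuple[str, str], float]:
    """Build {(key, value): weight} with weights in [0, 1] for one event."""
    tokens = re.findall(r"\w+", text.lower())
    counts: dict[tuple[str, str], int] = {}
    for key, value in event.items():
        if key == "time":
            continue  # handled separately below
        # Crude frequency: total occurrences of the value's words in the text.
        words = re.findall(r"\w+", value.lower())
        counts[(key, value)] = sum(tokens.count(word) for word in words)
    peak = max(counts.values(), default=0) or 1  # avoid dividing by zero
    frame = {pair: count / peak for pair, count in counts.items()}
    if "time" in event:
        frame[("time", event["time"])] = 1.0     # time element is always 1
    return frame


text = ("An explosion hit a warehouse in Tianjin. "
        "Tianjin officials said the warehouse stored chemicals.")
event = {"time": "2015-08-12", "location": "Tianjin",
         "actor1": "warehouse", "action": "explosion"}
frame = summary_frame(event, text)
```

In this example "Tianjin" and "warehouse" each occur twice and "explosion" once, so the location and agent pairs receive weight 1.0 and the action pair 0.5.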
(5) Generate a retrieval frame for social network data from the event's initial summary frame, update the retrieval frame in real time with a dynamic iterative retrieval scheme, extract the social network message texts that satisfy the search conditions, and generate the candidate message set.
The process in more detail is as follows:
A) Use the key-value pairs in the event's initial summary frame as seed search keywords, expand the keywords with synonym sets, and generate the microblog retrieval frame. Through the microblog platform's open data retrieval interface, retrieve microblog data from a recent period after the event occurred (for example, within 7 days).
B) Rank the retrieved microblog messages and their words or phrases by TF-IDF value, choose the higher-ranked words as keywords, update the retrieval frame, and retrieve microblog messages again as above.
C) Stop the iterative search when keyword discovery converges, extract the retrieved microblog message texts, and add them to the candidate message set.
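The iterate-until-convergence loop of sub-steps A to C can be sketched against an in-memory corpus. This is only an illustration under stated assumptions: `search` is a stand-in for the microblog platform's retrieval interface (no real API is called), the TF-IDF formula is the plain `tf * log(N / df)` variant, and synonym expansion and the 7-day time window are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def search(corpus: list[str], keywords: set[str]) -> list[str]:
    """Stand-in for a retrieval interface: any keyword match is a hit."""
    return [doc for doc in corpus if keywords & set(tokenize(doc))]


def top_tfidf_terms(hits: list[str], corpus: list[str], k: int) -> set[str]:
    """Top-k terms of the hits, scored by tf (in hits) * idf (in corpus)."""
    doc_freq = Counter(t for doc in corpus for t in set(tokenize(doc)))
    term_freq = Counter(t for doc in hits for t in tokenize(doc))
    scores = {t: term_freq[t] * math.log(len(corpus) / doc_freq[t])
              for t in term_freq}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties: alphabetical
    return set(ranked[:k])


def iterative_retrieval(corpus: list[str], seeds: set[str],
                        k: int = 3, max_rounds: int = 10):
    """Expand keywords from retrieved messages until no new keyword appears."""
    keywords = set(seeds)
    hits = search(corpus, keywords)
    for _ in range(max_rounds):
        expanded = keywords | top_tfidf_terms(hits, corpus, k)
        if expanded == keywords:      # keyword discovery has converged
            break
        keywords = expanded
        hits = search(corpus, keywords)
    return hits, keywords


corpus = [
    "explosion tianjin port warehouse",
    "tianjin port fire spreads overnight",
    "football league final tonight",
]
hits, keywords = iterative_retrieval(corpus, {"explosion"})
```

Starting from the single seed "explosion", the first round adds "port" as a keyword, which pulls in the second message; the unrelated sports message shares no vocabulary with the hits and is never retrieved. The `max_rounds` cap is a safety limit added here, since unconstrained expansion can drift on real data.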
(6) Using the knowledge data from step (1) and text semantic analysis, analyze the element information in the candidate message set and the categories it belongs to, analyze the importance of each key-value pair, and generate a summary frame for each candidate message from the analysis results.
The process in more detail is as follows:
A) Extract the picture metadata from the microblog message, or the user's geographic location from the candidate message's metadata, to obtain the geographic coordinates corresponding to the candidate message.
B) Perform named entity recognition and shallow semantic parsing on the candidate messages to locate the entities and semantic roles involved in each microblog post.
C) Using knowledge graph datasets and associated tools, map the entities in the post to related concepts to obtain the key-value pairs contained in the post. For example, in the post "Wantong Xincheng International residential community, about 2 kilometers from the explosion site, suffered serious property losses", the entity "Wantong Xincheng International residential community" is recognized as belonging to the category "residential area".
D) Classify or cluster the microblog texts and associate each category with keywords, forming a group of key-value pairs that is stored together with the post text. After a major event, microblog content generally falls into the following categories: event impact, cause analysis, potential risks, first-hand experience, user comments, and so on. Classify the texts using text features and the corresponding category recognition rules, then map the recognized key-value pairs to the corresponding categories. For example, posts in the "event impact" category may include key-value pairs such as (death toll, 165), (number of injured, 798), and (residential area, Wantong Xincheng International residential community).
E) Assess the importance of the microblog message content from the microblog metadata, user attention, and the geographic location from which the post was published. The microblog metadata includes measures of attention such as reposts and comments; the higher the popularity, the more important the message content. User attention is the publisher's follower count and represents the publisher's influence. The location from which the post was published is compared with the location in the initial event frame; if the distance is within a certain range, the message is marked as an eyewitness message and its importance is raised. The importance model can be score = MS + US + LS, where MS is the popularity score computed from the metadata, US is the score computed from user information, and LS is the score computed from relative geographic position. The final score is normalized to a value between 0 and 1.
F) Integrate the key-value pairs of each post and, from the query score of each key-value pair and the importance of the post, form the summary frame of candidate microblog message m:
$$P(m) = \{\, ((k_i, v_i),\ s_i(m, (k_i, v_i))) \mid (k_i, v_i) \in M,\ s_i(m, (k_i, v_i)) \in [0, 1] \,\}$$
where $s_i(m, (k_i, v_i))$ is the importance of the key-value pair $(k_i, v_i)$ extracted from the message text, computed jointly from the importance score of post $m$ and the TF-IDF value of the key-value pair among the key-value pairs of the candidate messages. The maximum value of $i$ is the number of key-value pairs contained in the message (text and metadata). The summary frame of a message may contain no key-value pairs, or it may contain several groups. $M$ is the set of key-value pairs of all element components of candidate message $m$, $k_i$ is the name of the $i$-th element component, and $v_i$ is the component's value.
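The importance model score = MS + US + LS from sub-step E can be illustrated as below. Only the additive form and the final normalization to [0, 1] come from the text; the log-damped component scores, the 10 km eyewitness radius, and min-max normalization over the candidate set are assumptions chosen to keep the sketch small.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    text: str
    reposts: int
    comments: int
    followers: int
    distance_km: Optional[float]  # distance to the event location, if known


def raw_score(msg: Message, near_km: float = 10.0) -> float:
    """score = MS + US + LS with assumed component definitions."""
    ms = math.log1p(msg.reposts + msg.comments)        # popularity score
    us = math.log1p(msg.followers)                     # publisher influence
    is_near = msg.distance_km is not None and msg.distance_km <= near_km
    ls = 1.0 if is_near else 0.0                       # eyewitness bonus
    return ms + us + ls


def normalized_scores(messages: list[Message]) -> list[float]:
    """Min-max normalize raw scores over the candidate set into [0, 1]."""
    if not messages:
        return []
    raw = [raw_score(m) for m in messages]
    low, high = min(raw), max(raw)
    if high == low:
        return [1.0 for _ in raw]
    return [(r - low) / (high - low) for r in raw]


messages = [
    Message("quiet post", reposts=0, comments=0, followers=0, distance_km=None),
    Message("eyewitness post", reposts=100, comments=50, followers=10000, distance_km=2.0),
]
scores = normalized_scores(messages)
```

A per-pair weight s_i for the frame P(m) could then be derived by combining this message score with a TF-IDF value for the pair; the text says the two are combined but does not give the formula, so that step is left out.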
(7) Compare the similarity between each candidate message's summary frame and the event summary frame; when the requirement is met, add the candidate message to the message queue of the event.
The microblog retrieval triggered by the event summary frame P(e) is a query filter based on text. The similarity between the summary frame P(m) of each candidate message and P(e) is then computed with adjusted cosine similarity or Minkowski distance, and a filtering rule for candidate messages is established from a similarity threshold. This achieves semantic filtering and yields a more accurate event message queue.
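A minimal sketch of the frame comparison follows, treating each frame as a sparse vector indexed by key-value pair. Note the substitution: plain cosine similarity is used here instead of the adjusted (mean-centred) variant named in the text, and the 0.3 threshold is an arbitrary illustrative value. A Minkowski distance over the same vectors is included for comparison.

```python
import math

Frame = dict  # maps (key, value) -> weight in [0, 1]


def cosine(p: Frame, q: Frame) -> float:
    """Plain cosine similarity between two sparse frames."""
    dot = sum(weight * q.get(pair, 0.0) for pair, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty frame matches nothing
    return dot / (norm_p * norm_q)


def minkowski(p: Frame, q: Frame, order: float = 2.0) -> float:
    """Minkowski distance over the union of key-value pairs."""
    pairs = set(p) | set(q)
    total = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) ** order for k in pairs)
    return total ** (1.0 / order)


def filter_messages(event_frame: Frame, message_frames: list[Frame],
                    threshold: float = 0.3) -> list[int]:
    """Return indices of messages similar enough to join the event queue."""
    return [i for i, frame in enumerate(message_frames)
            if cosine(frame, event_frame) >= threshold]


event_frame = {("location", "Tianjin"): 1.0, ("action", "explosion"): 0.5}
message_frames = [
    {("location", "Tianjin"): 0.8, ("death toll", "165"): 0.6},
    {("sport", "football"): 0.9},
    {},
]
queue = filter_messages(event_frame, message_frames)
```

The first message shares the location pair with the event and passes the threshold even though it also carries a new pair (death toll), which is exactly the kind of extra element the later steps harvest.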
(8) According to preset message ordering conditions (such as the importance of the social network message or its posting time), select key-value pairs from the message queue in turn as candidate event elements of the event data; for deterministic information such as geographic coordinates, cluster the key-value pairs added to the message queue and add the analysis result to the candidate event elements.
The microblog message queue of an event contains finer-grained element information about the event, which must be added to the event data under specific rules, as described below:
A) Sort the messages in the microblog message list. They can be sorted by microblog importance score, by the similarity between the microblog summary frame and the initial event summary frame, or by how close the posting time is to the time in the event summary frame. Users can also combine these to build a custom ordering strategy.
B) Take posts from the queue in order. If a key-value pair of a post does not appear in the current event summary frame, add it to the candidate event elements of the event data, until no new information is added.
C) For the large amount of geographic coordinate data in the message queue, outlier removal and cluster analysis yield the precise latitude and longitude at which the event occurred. This step is especially useful for events with multiple sites.
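For the coordinate analysis in sub-step C, the sketch below shows the simplest possible variant: discard points far from the median, then average the rest. It handles only a single event site; the multi-site case mentioned in the text would need a real clustering algorithm (for example a density-based one), which is not shown. The `max_dev` tolerance in degrees is an assumed parameter.

```python
from statistics import mean, median


def estimate_location(points: list[tuple[float, float]],
                      max_dev: float = 0.5) -> tuple[float, float]:
    """Estimate one event location from (latitude, longitude) pairs in degrees.

    Points whose latitude or longitude differs from the median by more than
    max_dev degrees are treated as outliers and ignored.
    """
    lat_med = median(p[0] for p in points)
    lon_med = median(p[1] for p in points)
    inliers = [p for p in points
               if abs(p[0] - lat_med) <= max_dev and abs(p[1] - lon_med) <= max_dev]
    if not inliers:  # pathological spread: fall back to the medians
        return (lat_med, lon_med)
    return (mean(p[0] for p in inliers), mean(p[1] for p in inliers))


points = [
    (39.04, 117.73),
    (39.03, 117.74),
    (39.05, 117.72),
    (31.23, 121.47),  # a post published from a distant city
]
latitude, longitude = estimate_location(points)
```

Using the median as the reference keeps a handful of distant posts from dragging the estimate away from where most messages were published.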
(9) Further classify the candidate event elements extracted from the news text and social network data by time, place, entity, category, result, scale, sociological attributes, and so on; apply event fusion rules to normalize and integrate the event elements; and generate complete event data.
The event summary frame and the candidate event elements obtained from microblog data may overlap in content. In the "812 Tianjin Port major explosion" event, for example, the value of the involved-entity element may be "Ruihai Logistics", "Ruihai Company", "Tianjin Port Group", and so on. Information of the same kind about an event therefore needs to be integrated, with similar information merged, as described below:
a) Classify the element category names according to knowledge and training data. The categories include occurrence time, occurrence place, agent (acting subject), patient (affected object), event category, event result, scale and impact, sociological attributes, and so on. The categories involved serve as the outermost description labels of the event data.
b) According to the concept network provided by the knowledge graph, add the element category names to the sub-labels of the event data. Intermediate concept nodes can be added if necessary.
c) Normalize the value types of the candidate event elements, and add the type labels (descriptive, assertive, numeric, etc.) and the value contents to the event data, forming complete event data.
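The fusion steps a) to c) might look like the sketch below. The alias table, the element-to-category mapping, and the two-way value typing are hypothetical stand-ins for the knowledge graph, the trained classifier, and the full set of type labels that the text refers to; intermediate concept nodes from step b) are omitted.

```python
from collections import defaultdict

# Hypothetical alias table; the text obtains this knowledge from the knowledge graph.
ALIASES = {"Ruihai Company": "Ruihai Logistics"}

# Hypothetical mapping from element names to outermost category labels (step a).
CATEGORY_OF = {
    "involved_party": "patient_object",
    "affected_count": "scale_and_impact",
}

def value_type(value):
    """Crude stand-in for step c): label a value as numeric or descriptive."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "descriptive"

def fuse(candidates):
    """Build nested event data: category -> element name -> typed values."""
    event = defaultdict(lambda: defaultdict(list))
    for name, value in candidates:
        category = CATEGORY_OF.get(name, "other")   # a) outermost label
        canonical = ALIASES.get(value, value)         # merge similar values
        entry = {"type": value_type(canonical), "value": canonical}
        if entry not in event[category][name]:        # b) element name as sub-label
            event[category][name].append(entry)
    return {cat: dict(sub) for cat, sub in event.items()}
```

With this layout, "Ruihai Logistics" and "Ruihai Company" collapse into a single entry under the same sub-label, which is the merging behaviour the example in the text calls for.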
Table 4. Excerpt of complete event data
Claims (10)
1. A cross-media event extraction method, comprising the steps of:
1) setting a seed event feature library and the required knowledge data;
2) collecting news web pages from the configured trusted news sources, and extracting news text and metadata information from the collected news web pages;
3) extracting event element information from each news text according to the seed event feature library and the required knowledge data, generating primary event data, and obtaining a primary event set;
4) calculating the importance of each element of a primary event in characterizing the event, and generating the initial summary framework of the event;
5) searching social network message text based on each element in the initial summary framework of the event, and generating an alternate message set;
6) analyzing, in combination with text semantic analysis methods, the element information contained in the alternate message set and the categories it belongs to, and generating a summary framework for each alternate message;
7) filtering the alternate messages according to the similarity between the summary framework of each alternate message and the initial summary framework of the event, obtaining the message queue corresponding to the primary event;
8) adding the event elements in the initial summary framework of the event, together with the event elements that exist in the message queue but not in the initial summary framework of the event, to a candidate event element set;
9) generating complete event data according to the event elements in the candidate event element set.
2. the method for claim 1 is it is characterised in that the initial summary framework of described event is P (e)={ ((ki,
vi),ωi(e,(ki,vi)))|(ki,vi)∈E,ωi(e,(ki,vi))∈[0,1]};Wherein, E represents that all key elements of event e are divided
The key-value pair set of amount, kiIt is the title wanting prime component for i-th, viCorrespond to value, ω for componentiFor i-th key-value pair (ki,vi)
Significance level.
3. The method of claim 2, wherein the alternate message set is generated by:
a) using the key-value pair information in the initial summary framework of the event as search keyword seeds, expanding the keywords according to a synonym set, generating an alternate message retrieval framework, and retrieving the alternate messages within a set time period;
b) ranking the tokens in the retrieved alternate messages by their TF-IDF values, selecting several tokens according to the ranking as keywords to update the alternate message retrieval framework, and then iteratively retrieving the alternate messages within the set time period; when the keyword discovery process converges, terminating the iterative search and taking the retrieved alternate messages as the alternate message set.
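The iterative retrieval in step b) of claim 3 can be illustrated with the toy sketch below. It assumes an in-memory corpus of already tokenized messages, a simple "contains any keyword" search, and summed TF-IDF as the ranking score; none of these details is fixed by the claim, and the synonym expansion of step a) and the time window are omitted.

```python
import math
from collections import Counter

def search(corpus, keywords):
    """Return indices of messages that contain at least one keyword."""
    return [i for i, tokens in enumerate(corpus) if keywords & set(tokens)]

def top_tfidf_terms(corpus, hits, k):
    """Rank tokens of the retrieved messages by TF-IDF summed over the hits."""
    n = len(corpus)
    df = Counter(tok for tokens in corpus for tok in set(tokens))
    scores = Counter()
    for i in hits:
        tf = Counter(corpus[i])
        for tok, cnt in tf.items():
            scores[tok] += (cnt / len(corpus[i])) * math.log(n / df[tok])
    # Break score ties alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def iterative_search(corpus, seeds, k=2, max_rounds=10):
    """Grow the keyword set until no new keyword is discovered."""
    keywords = set(seeds)
    for _ in range(max_rounds):
        hits = search(corpus, keywords)
        updated = keywords | top_tfidf_terms(corpus, hits, k)
        if updated == keywords:         # keyword discovery has converged
            break
        keywords = updated
    return search(corpus, keywords), keywords
```

Because the keyword set only grows and is bounded by the vocabulary, the loop always terminates; in practice a small `k` and a round limit also help keep the search from drifting away from the event.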
4. The method of claim 1, 2 or 3, wherein the summary framework of an alternate message is generated by:
a) extracting the picture metadata or the user geographic location information in the alternate message metadata to obtain the geographic coordinate information corresponding to the alternate message;
b) performing named entity recognition and shallow semantic parsing on the alternate message to locate the entity information and semantic roles of each alternate message;
c) mapping the entity information of the alternate message according to the knowledge data to obtain the key-value pair information contained in the alternate message;
d) performing classification or clustering on the key-value pair information obtained in step c), establishing the association between categories and keywords, obtaining several groups of key-value pairs for the alternate message, and assessing the importance of the alternate message;
e) forming the summary framework of the alternate message from its key-value pairs and their importance information.
5. The method of claim 2 or 3, wherein the summary framework of the alternate message is $P(m) = \{((k_i, v_i), s_i(m, (k_i, v_i))) \mid (k_i, v_i) \in M,\ s_i(m, (k_i, v_i)) \in [0, 1]\}$, where $s_i(m, (k_i, v_i))$ is the importance of the key-value pair $(k_i, v_i)$ extracted from alternate message $m$, $M$ denotes the set of key-value pairs of all element components of alternate message $m$, $k_i$ is the name of the $i$-th element component, and $v_i$ is the corresponding value of that component.
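Since $P(e)$ and $P(m)$ are both sets of weighted key-value pairs, the similarity-based filtering of claim 1, step 7) can be sketched with plain dictionaries. The claims do not name a similarity measure; the weighted Jaccard score and the `threshold` value below are assumptions made only for illustration.

```python
def frame_similarity(event_frame, message_frame):
    """Weighted Jaccard similarity between two summary frameworks.

    Both arguments map (key, value) pairs to importance weights in [0, 1].
    """
    pairs = set(event_frame) | set(message_frame)
    num = sum(min(event_frame.get(p, 0.0), message_frame.get(p, 0.0)) for p in pairs)
    den = sum(max(event_frame.get(p, 0.0), message_frame.get(p, 0.0)) for p in pairs)
    return num / den if den else 0.0

def filter_messages(event_frame, message_frames, threshold=0.3):
    """Return indices of messages at or above the threshold, best match first."""
    scored = [(frame_similarity(event_frame, f), i)
              for i, f in enumerate(message_frames)]
    scored.sort(key=lambda item: -item[0])
    return [i for score, i in scored if score >= threshold]
```

A measure of this kind penalizes messages that carry many pairs absent from the event framework, so the threshold would have to stay low enough to keep messages that contribute new elements.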
6. The method of claim 1, 2 or 3, wherein the seed event feature library and the required knowledge data include: an entity element information base, an open knowledge graph data set of associated ontologies and categories, an event behavior category pattern library, or corpus resources.
7. The method of claim 6, wherein extracting event element information from each news text according to the seed event feature library and the required knowledge data and generating primary event data comprises:
a) splitting the news text into sentences, parsing each news sentence into a syntax tree, and identifying dependency relations;
b) performing named entity recognition on each news sentence according to the structural features of words in the syntax tree and the entity element information base, obtaining the entity objects involved in the event;
c) determining the subject and object involved in the behavior according to the core action word of the news sentence; identifying the behavior relation and category of the event according to predefined event behavior category patterns; and calculating the sentiment tendency intensity of the event;
d) locating the time descriptors in the news text, converting fuzzy time expressions into canonical time notation, and matching events with time markers;
e) locating the location descriptors in the news text, selecting the first marker word identified as a locative adverbial as the place where the event occurred, and automatically looking up and completing it according to the place names in the news text;
f) integrating the above elements into primary event data.
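Step d) of claim 7, converting fuzzy time expressions into a canonical notation, could be approached as in the sketch below. The table of relative expressions, the use of the news publication date as the reference point, and ISO 8601 dates as the canonical form are all assumptions; the claim does not specify them.

```python
from datetime import date, timedelta

# Hypothetical table of relative expressions and their day offsets.
RELATIVE_DAYS = {
    "today": 0,
    "yesterday": -1,
    "the day before yesterday": -2,
    "tomorrow": 1,
}

def normalize_time(expression, published):
    """Resolve a time expression against the publication date.

    Returns an ISO 8601 date string, or None if the expression is not understood.
    """
    expr = expression.strip().lower()
    if expr in RELATIVE_DAYS:
        return (published + timedelta(days=RELATIVE_DAYS[expr])).isoformat()
    try:
        # Expressions that are already ISO dates are passed through unchanged.
        return date.fromisoformat(expr).isoformat()
    except ValueError:
        return None
```

A production system would need a far richer grammar (weekday names, "last month", partial dates), but the principle of anchoring relative expressions to the article metadata stays the same.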
8. The method of claim 1, 2 or 3, wherein the importance of each element of a primary event in characterizing the event is calculated as follows: the more important an event element is in characterizing the event, the larger its importance value; wherein the importance of the occurrence time element is the maximum importance, and the importance of a descriptive element is determined by its co-occurrence frequency in the news text.
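One possible reading of claim 8 is sketched below: the time element receives the maximum weight of 1.0, and every other element is weighted by the fraction of news texts that mention its value. Treating "co-occurrence frequency" as this document fraction, and the key name `"time"`, are assumptions of the sketch.

```python
def build_event_frame(elements, news_texts):
    """Assign an importance weight in [0, 1] to each (key, value) element."""
    frame = {}
    for key, value in elements:
        if key == "time":
            weight = 1.0                # the occurrence time gets the maximum importance
        elif news_texts:
            # Share of news texts in which the element's value appears.
            hits = sum(1 for text in news_texts if value in text)
            weight = hits / len(news_texts)
        else:
            weight = 0.0
        frame[(key, value)] = weight
    return frame
```

The resulting dictionary has the same shape as the initial summary framework $P(e)$ of claim 2: each key-value pair is associated with a weight between 0 and 1.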
9. The method of claim 1, 2 or 3, wherein the complete event data is generated by:
a) classifying the event elements in the candidate event element set, the categories including occurrence time, occurrence place, agent (acting subject), patient (affected object), event category, event result, scale and impact, and sociological attributes; and using the categories of the event elements as the outermost description labels of the event data;
b) adding the category names of the event elements to the sub-labels of the event data according to the concept network provided by the knowledge graph in the knowledge data;
c) normalizing the value types of the candidate event elements to obtain value type labels; then adding the value type labels and the value contents to the event data, forming complete event data.
10. the method for claim 1 is it is characterised in that the element type form of described primary event data is event
=(time, location, actor1, actor2, action, type, scale, url);Wherein, time is time of origin,
Location is that position occurs, and actor1 is agent main body, and actor2 is word denoting the receiver of an action object, and action is behavior description word, type
Represent event category, scale represents the Sentiment orientation of event, url is the source of initial data.
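The eight-field tuple of claim 10 maps naturally onto a typed record. The sketch below keeps the field names from the claim; the Python types are assumptions, since the claim does not state how each field is encoded.

```python
from typing import NamedTuple, Optional

class PrimaryEvent(NamedTuple):
    """Primary event record with the field order given in claim 10."""
    time: str                # occurrence time (assumed to be a normalized string)
    location: str            # place of occurrence
    actor1: str              # agent (acting subject)
    actor2: Optional[str]    # patient (affected object); assumed optional
    action: str              # behavior descriptor
    type: str                # event category
    scale: float             # sentiment tendency of the event (assumed numeric)
    url: str                 # source of the original data
```

A record built this way can still be unpacked positionally, so it stays compatible with code that treats the event as a plain tuple.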
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610809600.1A CN106484767B (en) | 2016-09-08 | 2016-09-08 | A kind of event extraction method across media |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610809600.1A CN106484767B (en) | 2016-09-08 | 2016-09-08 | A kind of event extraction method across media |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484767A true CN106484767A (en) | 2017-03-08 |
CN106484767B CN106484767B (en) | 2019-06-21 |
Family
ID=58273654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610809600.1A Active CN106484767B (en) | 2016-09-08 | 2016-09-08 | A kind of event extraction method across media |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484767B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103778200A (en) * | 2014-01-09 | 2014-05-07 | 中国科学院计算技术研究所 | Method for extracting information source of message and system thereof |
CN104408093A (en) * | 2014-11-14 | 2015-03-11 | 中国科学院计算技术研究所 | News event element extracting method and device |
CN105389304A (en) * | 2015-10-27 | 2016-03-09 | 小米科技有限责任公司 | Event extraction method and apparatus |
CN105389354A (en) * | 2015-11-02 | 2016-03-09 | 东南大学 | Social media text oriented unsupervised method for extracting and sorting events |
Non-Patent Citations (1)
Title |
---|
金璐钰: "基于框架的事件抽取研究", 《高科技与产业化》 * |
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229712A (en) * | 2017-05-27 | 2017-10-03 | 中南大学 | A kind of space-time clustering method towards occurred events of public safety acquisition of information |
CN107766477A (en) * | 2017-09-30 | 2018-03-06 | 武汉汉思信息技术有限责任公司 | Page structure data extraction method, terminal device and storage medium |
CN108920447A (en) * | 2018-05-07 | 2018-11-30 | 国家计算机网络与信息安全管理中心 | A kind of Chinese event abstracting method towards specific area |
CN109033074A (en) * | 2018-06-29 | 2018-12-18 | 北京百度网讯科技有限公司 | News in brief generation method, device, equipment and computer-readable medium |
CN108959626B (en) * | 2018-07-23 | 2023-06-13 | 四川省烟草公司成都市公司 | Efficient automatic generation method for cross-platform heterogeneous data profile |
CN108959626A (en) * | 2018-07-23 | 2018-12-07 | 四川省烟草公司成都市公司 | A kind of cross-platform efficient automatic generation method of isomeric data bulletin |
CN109408806A (en) * | 2018-09-11 | 2019-03-01 | 中国电子科技集团公司第二十八研究所 | A kind of Event Distillation method based on English grammar rule |
CN109241438A (en) * | 2018-09-27 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | Across channel focus incident discovery method, apparatus and storage medium based on element |
CN109241438B (en) * | 2018-09-27 | 2022-06-24 | 国家计算机网络与信息安全管理中心 | Element-based cross-channel hot event discovery method and device and storage medium |
CN111428041A (en) * | 2019-01-09 | 2020-07-17 | 阿里巴巴集团控股有限公司 | Case abstract generation method, device, system and storage medium |
CN111428041B (en) * | 2019-01-09 | 2023-06-16 | 阿里巴巴集团控股有限公司 | Case abstract generation method, device, system and storage medium |
CN109885698A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of knowledge mapping construction method and device, electronic equipment |
CN110134842A (en) * | 2019-04-03 | 2019-08-16 | 深圳价值在线信息科技股份有限公司 | Information matching method, device, storage medium and server based on Information Atlas |
CN110297885B (en) * | 2019-05-27 | 2021-08-17 | 中国科学院深圳先进技术研究院 | Method, device and equipment for generating real-time event abstract and storage medium |
WO2020237479A1 (en) * | 2019-05-27 | 2020-12-03 | 中国科学院深圳先进技术研究院 | Real-time event summarization generation method, apparatus and device, and storage medium |
CN110297885A (en) * | 2019-05-27 | 2019-10-01 | 中国科学院深圳先进技术研究院 | Generation method, device, equipment and the storage medium of real-time event abstract |
CN110457468A (en) * | 2019-07-05 | 2019-11-15 | 武楚荷 | A kind of classification method of event, device and storage device |
CN110471993A (en) * | 2019-07-05 | 2019-11-19 | 武楚荷 | A kind of correlating method of event, device and storage device |
CN110457468B (en) * | 2019-07-05 | 2022-08-23 | 武楚荷 | Event classification method and device and storage device |
CN110334220A (en) * | 2019-07-15 | 2019-10-15 | 中国人民解放军战略支援部队航天工程大学 | A kind of knowledge mapping construction method based on multi-data source |
CN110472066B (en) * | 2019-08-07 | 2022-03-25 | 北京大学 | Construction method of urban geographic semantic knowledge map |
CN110472066A (en) * | 2019-08-07 | 2019-11-19 | 北京大学 | A kind of construction method of urban geography semantic knowledge map |
CN111191413A (en) * | 2019-12-30 | 2020-05-22 | 北京航空航天大学 | Method, device and system for automatically marking event core content based on graph sequencing model |
CN111191413B (en) * | 2019-12-30 | 2021-11-12 | 北京航空航天大学 | Method, device and system for automatically marking event core content based on graph sequencing model |
CN111191046A (en) * | 2019-12-31 | 2020-05-22 | 北京明略软件系统有限公司 | Method, device, computer storage medium and terminal for realizing information search |
CN113495951A (en) * | 2020-04-03 | 2021-10-12 | 源析(青岛)信息技术有限公司 | Construction method of knowledge graph for persistent social events |
CN111966890B (en) * | 2020-06-30 | 2023-07-04 | 北京百度网讯科技有限公司 | Text-based event pushing method and device, electronic equipment and storage medium |
CN111966890A (en) * | 2020-06-30 | 2020-11-20 | 北京百度网讯科技有限公司 | Text-based event pushing method and device, electronic equipment and storage medium |
CN111782907B (en) * | 2020-07-01 | 2024-03-01 | 北京知因智慧科技有限公司 | News classification method and device and electronic equipment |
CN111782907A (en) * | 2020-07-01 | 2020-10-16 | 北京知因智慧科技有限公司 | News classification method and device and electronic equipment |
CN112328856A (en) * | 2020-10-30 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Common event tracking method and device, computer equipment and computer readable medium |
CN113033201B (en) * | 2020-11-06 | 2023-07-28 | 新华智云科技有限公司 | Earthquake news information extraction method and system |
CN113033201A (en) * | 2020-11-06 | 2021-06-25 | 新华智云科技有限公司 | Earthquake news information extraction method and system |
CN112328794B (en) * | 2020-11-10 | 2021-08-24 | 南京师范大学 | Typhoon event information aggregation method |
CN112328794A (en) * | 2020-11-10 | 2021-02-05 | 南京师范大学 | Typhoon event information aggregation method |
CN112560461A (en) * | 2020-12-11 | 2021-03-26 | 北京百度网讯科技有限公司 | News clue generation method and device, electronic equipment and storage medium |
CN112579738A (en) * | 2020-12-23 | 2021-03-30 | 广州博冠信息科技有限公司 | Target object label processing method, device, equipment and storage medium |
CN112597772A (en) * | 2020-12-31 | 2021-04-02 | 讯飞智元信息科技有限公司 | Hotspot information determination method, computer equipment and device |
CN113065051A (en) * | 2021-04-02 | 2021-07-02 | 西南石油大学 | Visual agricultural big data analysis interactive system |
CN113326352B (en) * | 2021-06-18 | 2022-05-24 | 哈尔滨工业大学 | Sub-event relation identification method based on heterogeneous event graph |
CN113326352A (en) * | 2021-06-18 | 2021-08-31 | 哈尔滨工业大学 | Sub-event relation identification method based on heterogeneous event graph |
WO2023022655A3 (en) * | 2021-08-16 | 2023-04-13 | 脸萌有限公司 | Knowledge map construction method and apparatus, storage medium, and electronic device |
CN113609309A (en) * | 2021-08-16 | 2021-11-05 | 脸萌有限公司 | Knowledge graph construction method and device, storage medium and electronic equipment |
CN113609309B (en) * | 2021-08-16 | 2024-02-06 | 脸萌有限公司 | Knowledge graph construction method and device, storage medium and electronic equipment |
CN114065769B (en) * | 2022-01-14 | 2022-04-08 | 四川大学 | Method, device, equipment and medium for training emotion reason pair extraction model |
CN114065769A (en) * | 2022-01-14 | 2022-02-18 | 四川大学 | Method, device, equipment and medium for training emotion reason pair extraction model |
CN114880588A (en) * | 2022-06-13 | 2022-08-09 | 四川封面传媒科技有限责任公司 | News popularity prediction method based on knowledge graph |
CN114880588B (en) * | 2022-06-13 | 2024-04-26 | 四川封面传媒科技有限责任公司 | News heat prediction method based on knowledge graph |
CN115422948A (en) * | 2022-11-04 | 2022-12-02 | 文灵科技(北京)有限公司 | Event level network identification system and method based on semantic analysis |
CN117435697B (en) * | 2023-12-21 | 2024-03-22 | 中科雨辰科技有限公司 | Data processing system for acquiring core event |
CN117435697A (en) * | 2023-12-21 | 2024-01-23 | 中科雨辰科技有限公司 | Data processing system for acquiring core event |
Also Published As
Publication number | Publication date |
---|---|
CN106484767B (en) | 2019-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||