CN110737819B - Emergency clue extraction method based on news reports - Google Patents
- Publication number
- CN110737819B (CN201910983942.9A)
- Authority
- CN
- China
- Prior art keywords
- event
- events
- topic
- emergency
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a news-report-based emergency clue extraction method comprising: preprocessing news texts; extracting events from the preprocessing result; obtaining distributed representations of the events and calculating event similarity so as to construct event semantic knowledge; constructing an event topic model to obtain the event-topic distribution and the document-topic distribution; taking the events with the highest topic probability as the topic event set; constructing an event timing relationship graph with each topic event as a node and the order in which the events occur as arcs; and outputting the final event clue using an improved topological sorting algorithm. With this design, the invention obtains emergency clues accurately and completely, and addresses the weak semantic expression of event clues and the low clue acquisition accuracy of the prior art. The method is flexible and has strong application and promotion value.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an emergency clue extraction method based on news reports.
Background
An emergency refers to a natural disaster, accident, public health event, or social security event that occurs suddenly, causes or may cause serious social harm, and requires emergency response measures. To prevent and reduce emergencies, and to control, mitigate, and eliminate the serious social harm they cause, people's governments and relevant departments need to standardize emergency response activities, comprehensively assess possible emergencies, and minimize the impact of major emergencies. Emergencies have obvious temporal characteristics, and their logical order can be represented by the topic evolution of the events, namely the emergency clue. For example, when the event "Typhoon Wilmason, the 9th typhoon of 2014" occurs, events such as "casualties", "crop damage", and "communication interruption" occur along with it; as time passes, a series of related events follows, such as "Wilmason makes landfall in China", "the weather station issues an early warning", "the relevant departments issue a notice", "the affected people are relocated", and "preventing the spread of germs". These related events are all sub-events that evolve or derive from the topic "Typhoon Wilmason", and they are linked by temporal or causal relationships. Obtaining the clue of an emergency accurately and completely is important for understanding the causes and consequences of the emergency and how the situation is developing, and it provides reference and predictive value for handling similar emergencies.
One prior-art approach takes a word or phrase as the basic unit and applies a topic model to obtain the distribution of words over topics. A set of high-frequency topic words represents each subtopic, and the document reporting time represents the evolution of the topic. This approach has the following defects: 1. words or phrases are semantically isolated, the semantic relations between words are ignored, and topics cannot be described completely; 2. words carry no notion of time, so the temporal characteristics of a topic can only be reflected through the document reporting time. Another prior-art approach takes ACE events as the basic unit and identifies and infers the relations between events to describe the evolution of a topic. It has the following defects: 1. ACE events are divided into 8 major categories and 33 subcategories, so the event domain is restricted and extraction accuracy is limited; 2. most ACE events are coarse-grained sentence-level or document-level events, and some fine-grained events cannot be extracted; 3. the definition of event relations has no unified structure, relation judgment accuracy is low, and implementation is difficult.
Therefore, an emergency clue extraction method based on news reports is designed. The event clue takes triple atomic events (Subject, Predicate, Object) as the basic unit and represents the clue through the timing relationships among events. An improved topic model generates a set of events strongly related to the topic (namely, topic events), and an improved topological sorting algorithm is applied to the constructed event timing relationship graph to output the final event clue.
Disclosure of Invention
To address the above defects in the prior art, the invention provides an emergency clue extraction method based on news reports, which solves the problems of weak semantic expression of event clues and low clue acquisition accuracy in the prior art.
In order to achieve the above purpose, the invention adopts the technical scheme that:
the scheme provides an emergency clue extraction method based on news reports, which comprises the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set using a natural language processing method;
S2, taking a sentence as a unit, and extracting events according to the preprocessing result;
S3, obtaining event distributed representations according to the event extraction result, and constructing event semantic knowledge;
S4, taking event pairs as entries, and constructing an event topic model using the event semantic knowledge and the Pólya urn model;
S5, according to the event topic model, taking the top-K events with the highest topic probability as the topic event set;
S6, constructing an event timing relationship graph according to the topic event set and the order in which the events occur;
S7, applying an improved topological sorting algorithm to the event timing relationship graph to obtain the emergency clue, thereby completing the extraction of the emergency clue.
Further, the preprocessing in step S1 includes part-of-speech tagging, dependency parsing, and coreference resolution.
Still further, the step S2 includes the following steps:
S201, taking a sentence as a unit, and extracting all predicate relation pairs in the sentence according to the preprocessing result;
S202, judging whether the predicate relation pairs share the same predicate; if so, merging the pairs with the same predicate into a triple event and proceeding to step S3; otherwise, keeping each predicate relation pair as a binary event and proceeding to step S3, thereby completing the extraction of events.
Still further, the step S3 includes the following steps:
S301, obtaining word vector representations on the news corpus using the Word2Vec algorithm according to the event extraction result;
S302, computing the event distributed representations from the word vectors using a compositional semantics algorithm;
S303, computing the similarity between events from the event distributed representations using the Euclidean distance;
S304, constructing event semantic knowledge according to the similarity between events.
Still further, the event distributed representation in step S302 includes any one of the following cases:
in the first case:
if the event is a triple event, the event distributed representation is computed from the predicate vector of the event and the Kronecker outer product of the subject vector and the object vector of the event, where the event distributed representation $\vec{e}$ is expressed as:
$$\vec{e} = \vec{P} \cdot (\vec{S} \otimes \vec{O})$$
in the second case:
if the event is a binary event, the event distributed representation is computed from the predicate vector of the event and the vector of its subject or object, where the event distributed representation $\vec{e}$ is expressed as:
$$\vec{e} = \vec{P} \cdot \vec{S} \quad \text{or} \quad \vec{e} = \vec{P} \cdot \vec{O}$$
wherein $\otimes$ denotes the Kronecker outer product operation, $\cdot$ denotes the dot product operation, $\vec{P}$ denotes the event predicate vector, $\vec{S}$ denotes the event subject vector, and $\vec{O}$ denotes the event object vector.
Still further, the step S4 includes the following steps:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating the events of a topic, $\varphi_k \sim \mathrm{Dir}(\beta)$, wherein $\varphi_k$ denotes the distribution of events under topic $k$, and $\mathrm{Dir}(\beta)$ denotes the Dirichlet distribution with hyper-parameter $\beta$;
S402, setting the multinomial distribution parameter for generating the topics of a document, $\theta_m \sim \mathrm{Dir}(\alpha)$, wherein $\theta_m$ denotes the topic distribution of document $m$, and $\mathrm{Dir}(\alpha)$ denotes the Dirichlet distribution with hyper-parameter $\alpha$;
S403, for each event co-occurrence pair $b = (e_i, e_j)$ in each news document $m$, sampling a topic $z_b \sim \mathrm{Mult}(\theta_m)$, and sampling the events $e_i \sim \mathrm{Mult}(\varphi_{z_b})$ and $e_j \sim \mathrm{Mult}(\varphi_{z_b})$, wherein event similarity is introduced during sampling using the Pólya urn model and the event semantic knowledge, and the threshold adjustment expression of the event similarity is as follows:
(expression missing from the source text)
wherein $b$ denotes any event co-occurrence pair appearing in document $m$, $e_i$ denotes event $i$, $e_j$ denotes event $j$, $z_b$ denotes the topic of the event co-occurrence pair $b$ in the current sampling process, $\mathrm{Mult}(\theta_m)$ denotes the multinomial distribution with parameter $\theta_m$, $\mathrm{Mult}(\varphi_{z_b})$ denotes the multinomial distribution with parameter $\varphi_{z_b}$, the expression yields the adjusted similarity of event $e_i$ and event $e_j$, $\sigma$ denotes the set threshold, and $\mathrm{sim}(e_i, e_j)$ denotes the similarity of event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
S404, obtaining the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ according to the topic sampling and the event sampling, and constructing the event topic model according to the event-topic distribution $\varphi$ and the document-topic distribution $\theta$.
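The generative process of steps S401 to S403 can be sketched as follows. This is a minimal illustration only: it samples synthetic event pairs from the model and omits both the inference procedure and the similarity-based adjustment. The function name `generate_corpus`, the integer event ids, and the names `phi` and `theta` (stand-ins for the event-topic and document-topic distributions) are assumptions made for the example.

```python
import numpy as np


def generate_corpus(n_docs, pairs_per_doc, n_topics, n_events, alpha, beta, seed=0):
    """Sample event co-occurrence pairs from the generative process of S401 to S403."""
    rng = np.random.default_rng(seed)
    # S401: one event distribution per topic, each drawn from Dir(beta).
    phi = rng.dirichlet(np.full(n_events, beta), size=n_topics)
    corpus, theta = [], []
    for _ in range(n_docs):
        # S402: topic distribution of document m, drawn from Dir(alpha).
        theta_m = rng.dirichlet(np.full(n_topics, alpha))
        theta.append(theta_m)
        doc = []
        for _ in range(pairs_per_doc):
            # S403: draw a topic for the pair, then both events from that topic.
            z = rng.choice(n_topics, p=theta_m)
            e_i, e_j = rng.choice(n_events, size=2, p=phi[z])
            doc.append((int(z), int(e_i), int(e_j)))
        corpus.append(doc)
    return phi, np.array(theta), corpus


phi, theta, corpus = generate_corpus(n_docs=3, pairs_per_doc=4, n_topics=2,
                                     n_events=10, alpha=0.5, beta=0.5)
print(phi.shape, theta.shape, len(corpus), len(corpus[0]))  # (2, 10) (3, 2) 3 4
```

In practice the direction is reversed: the event pairs are observed, and the distributions are estimated by iterating the topic and event sampling until it stabilizes.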
Still further, the step S6 includes the following steps:
s601, taking each topic event in the topic event set as a node, and determining the time sequence relation of any event pair by using a statistical rule;
and S602, according to the sequence of the events, taking the event which occurs first as an arc tail and the event which occurs later as an arc head, and constructing an event time sequence relation diagram.
Still further, the determining the timing relationship of any event pair by using the statistical rule in step S601 includes any one of the following cases:
in the first case:
if the probability $p_1$ that the two topic events appear in the same document is the largest, counting the position order $p_2$ of the two topic events within the same document; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is: $p = p_1 \times p_2$;
in the second case:
if the probability $p_3$ that the two topic events appear in different documents is the largest, counting the reporting-time order $p_4$ of the documents containing the two topic events; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is: $p = p_3 \times p_4$.
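One possible reading of the two statistical cases is sketched below. The text does not define how the four quantities are estimated, so the estimators here are assumptions: p1 is the share of documents mentioning either event that mention both, p3 is its complement, p2 is the fraction of shared documents in which the first event appears earlier, and p4 is the fraction of cross-document pairs in which the first event's document was reported earlier. The function names and the document layout are hypothetical.

```python
from itertools import product


def timing_strength(docs, a, b):
    """Strength of the relation "a occurs before b" (estimators are assumptions).

    docs: list of (report_time, [events in order of appearance]).
    """
    with_a = [i for i, (_, ev) in enumerate(docs) if a in ev]
    with_b = [i for i, (_, ev) in enumerate(docs) if b in ev]
    if not with_a or not with_b:
        return 0.0
    both = set(with_a) & set(with_b)
    either = set(with_a) | set(with_b)
    p1 = len(both) / len(either)   # the two events share a document
    p3 = 1.0 - p1                  # the two events sit in different documents
    if p1 >= p3:
        # Case 1: compare positions inside the shared documents.
        p2 = sum(docs[i][1].index(a) < docs[i][1].index(b) for i in both) / len(both)
        return p1 * p2
    # Case 2: compare report times of distinct documents.
    pairs = [(i, j) for i, j in product(with_a, with_b) if i != j]
    p4 = sum(docs[i][0] < docs[j][0] for i, j in pairs) / len(pairs)
    return p3 * p4


def build_timing_graph(docs, events):
    """Arc (a, b) means a precedes b; only the stronger direction of a pair is kept."""
    arcs = {}
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            ab, ba = timing_strength(docs, a, b), timing_strength(docs, b, a)
            if ab > ba:
                arcs[(a, b)] = ab      # a is the arc tail, b the arc head
            elif ba > ab:
                arcs[(b, a)] = ba
    return arcs


docs = [(1, ["landing", "warning"]), (2, ["landing", "death"]), (3, ["death"])]
graph = build_timing_graph(docs, ["landing", "warning", "death"])
print(graph)
```

Keeping only the stronger direction of each pair is a design choice of this sketch; cycles across three or more events can still arise, which is what the improved topological sort handles.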
Still further, the step S7 includes the following steps:
s701, outputting an event node sequence by utilizing an improved topological sorting algorithm according to the event timing relationship graph;
S702, judging whether the timing relationship graph still contains event nodes that have not been output; if so, the remaining subgraph of the timing relationship graph contains a cycle, and the method proceeds to step S703; otherwise, it proceeds to step S704;
S703, deleting all arcs within the remaining subgraph; then, for each un-output node event in the remaining subgraph, scanning in sequence the arcs from the node events in the output event node sequence to that node event, selecting an arc according to the strength of the timing relationship, outputting the node event, and recording the selected arc;
S704, forming the emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue.
Still further, the step S701 includes the steps of:
S7011, constructing a priority queue according to the event timing relationship graph, and enqueuing the node events whose in-degree is zero in the timing relationship graph;
S7012, dequeuing the zero-in-degree node events from the priority queue in turn, outputting each event, and deleting the arcs whose tail is the output event;
S7013, judging whether a new node event with zero in-degree exists; if so, enqueuing that node event, recording the currently deleted arc, and returning to step S7012; otherwise, proceeding to step S702.
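Steps S7011 to S7013 follow the pattern of Kahn's topological sort with a priority queue, which can be sketched as below. The text does not say what the queue is ordered by, so this sketch assumes a caller-supplied `priority` mapping that defaults to the order of the node list; the function name `topo_sort_prefix` and the arc dictionary layout are likewise hypothetical.

```python
import heapq


def topo_sort_prefix(nodes, arcs, priority=None):
    """Kahn-style pass of S7011 to S7013.

    arcs: dict {(tail, head): strength}. Returns (output order, recorded arcs,
    nodes left un-output because they lie on or behind a cycle).
    """
    priority = priority or {n: i for i, n in enumerate(nodes)}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for tail, head in arcs:
        indeg[head] += 1
        succ[tail].append(head)

    # S7011: enqueue every node whose in-degree is zero.
    heap = [(priority[n], n) for n in nodes if indeg[n] == 0]
    heapq.heapify(heap)
    order, recorded = [], []
    while heap:
        # S7012: dequeue, output, and delete the arcs whose tail is this node.
        _, node = heapq.heappop(heap)
        order.append(node)
        for head in succ[node]:
            indeg[head] -= 1
            if indeg[head] == 0:
                # S7013: the arc just deleted freed `head`; record it and enqueue.
                recorded.append((node, head))
                heapq.heappush(heap, (priority[head], head))
    remaining = [n for n in nodes if n not in order]
    return order, recorded, remaining


nodes = ["landing", "warning", "death", "damage"]
arcs = {("landing", "warning"): 0.5, ("landing", "death"): 0.7,
        ("death", "damage"): 0.4, ("damage", "death"): 0.3}
print(topo_sort_prefix(nodes, arcs))
# (['landing', 'warning'], [('landing', 'warning')], ['death', 'damage'])
```

The two events on the cycle are never freed, so they come back in `remaining`; that is the condition tested in step S702.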
The invention has the beneficial effects that:
(1) the method takes the triple event as the basic unit; the extraction algorithm is simple to implement, and the granularity of a triple event lies between a word and a sentence, so it can express the semantic relations between words while avoiding interference from noise words in the sentence;
(2) the method introduces event semantic knowledge, expresses event semantics with mainstream distributed vectors, and uses event similarity to alleviate the event sparsity problem;
(3) the method uses an event topic model to cluster topic events automatically; the topic model is built on event pairs, and the topic distributions of events and documents are obtained by combining the Pólya urn model with the event semantic knowledge;
(4) the method constructs an event timing relationship graph in which each node represents a topic event and the timing relationship of each event pair is an arc, and outputs the final event clue using an improved topological sorting algorithm.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a structural diagram of the topic model in this embodiment.
Detailed Description
The following description of embodiments is provided to help those skilled in the art understand the invention. It should be understood that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined in the appended claims, and everything that makes use of the inventive concept is protected.
Examples
The invention provides an emergency clue extraction method based on news reports, which uses a topic model and a timing relationship graph algorithm to construct emergency clues. An embodiment on topical documents collected from the Sina news site (including the topics "Typhoon Wilmason, No. 9 of this year, strikes" (92), "Taiwan passenger plane catches fire in forced landing" (102), "Guangdong hit by the most severe dengue fever epidemic in 20 years" (38), and "Hangzhou bus arson case" (54), among others) shows that the technique is simple and effective to implement, and that its unsupervised learning mode allows it to run without excessive manual intervention. The generated event clues take triple events as the basic unit; an improved topic model generates the topic event set, and event semantic knowledge is introduced into the topic model. Each topic event is used as a node, the order in which the events occur is computed statistically and used as arcs to construct a timing relationship graph between events, and the final event clue is output by an improved topological sorting algorithm. As shown in FIG. 1, the method comprises the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set using a natural language processing method, wherein the preprocessing includes part-of-speech tagging, dependency parsing, and coreference resolution;
S2, taking a sentence as a unit, and extracting events according to the preprocessing result, implemented as follows:
S201, taking a sentence as a unit, and extracting all predicate relation pairs in the sentence according to the preprocessing result;
S202, judging whether the predicate relation pairs share the same predicate; if so, merging the pairs with the same predicate into a triple event and proceeding to step S3; otherwise, keeping each predicate relation pair as a binary event and proceeding to step S3, thereby completing the extraction of events.
In this embodiment, a sentence may contain multiple events. All possible predicate relation pairs in the sentence are extracted, such as the "NSUBJ" and "DOBJ" relations. If an "NSUBJ" relation and a "DOBJ" relation share the same predicate, the two binary events are merged into one triple event; if the dependency relations cannot be merged, the binary events are retained. For example, the sentence "the weather bureau issues a typhoon warning" has two dependency pairs, "NSUBJ (issue, weather bureau)" and "DOBJ (issue, warning)"; both have the predicate "issue", so they can be merged into the triple event "(weather bureau, issue, warning)". For the sentence "the airplane was lost over the open sea", only the binary event "(airplane, lost, nil)" can be extracted ("nil" indicates that the event argument is missing). The invention takes the triple event as the basic unit; the extraction algorithm is simple to implement, and the granularity of a triple event lies between a word and a sentence, so it can express the semantic relations between words while avoiding interference from noise words in the sentence.
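The merge rule of steps S201 and S202 can be sketched as follows. This is a minimal illustration that assumes a dependency parser has already produced (relation, predicate, argument) tuples for one sentence; the function name `extract_events` and the tuple layout are hypothetical and not taken from the patent.

```python
from collections import defaultdict

NIL = "nil"  # placeholder for a missing event argument, as in the text


def extract_events(dep_pairs):
    """Merge NSUBJ/DOBJ pairs of one sentence into (subject, predicate, object) events.

    dep_pairs: iterable of (relation, predicate, argument) tuples (assumed layout).
    """
    subjects = defaultdict(list)
    objects = defaultdict(list)
    order = []  # predicates in order of first appearance
    for rel, pred, arg in dep_pairs:
        if rel not in ("NSUBJ", "DOBJ"):
            continue
        if pred not in order:
            order.append(pred)
        (subjects if rel == "NSUBJ" else objects)[pred].append(arg)

    events = []
    for pred in order:
        # A predicate with both roles yields a triple; otherwise the gap is "nil".
        subs = subjects.get(pred) or [NIL]
        objs = objects.get(pred) or [NIL]
        for s in subs:
            for o in objs:
                events.append((s, pred, o))
    return events


merged = extract_events([("NSUBJ", "issue", "weather bureau"),
                         ("DOBJ", "issue", "warning")])
binary = extract_events([("NSUBJ", "lost", "airplane")])
print(merged)  # [('weather bureau', 'issue', 'warning')]
print(binary)  # [('airplane', 'lost', 'nil')]
```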
S3, obtaining event distributed representation according to the event extraction result, and constructing event semantic knowledge, wherein the implementation method comprises the following steps:
S301, obtaining word vector representations on the news corpus using the Word2Vec algorithm according to the event extraction result;
S302, computing the event distributed representations from the word vectors using a compositional semantics algorithm, wherein the event distributed representation covers either of the following cases:
in the first case:
if the event is a triple event, the event distributed representation is computed from the predicate vector of the event and the Kronecker outer product of the subject vector and the object vector of the event, where the event distributed representation $\vec{e}$ is expressed as:
$$\vec{e} = \vec{P} \cdot (\vec{S} \otimes \vec{O})$$
in the second case:
if the event is a binary event, the event distributed representation is computed from the predicate vector of the event and the vector of its subject or object, where the event distributed representation $\vec{e}$ is expressed as:
$$\vec{e} = \vec{P} \cdot \vec{S} \quad \text{or} \quad \vec{e} = \vec{P} \cdot \vec{O}$$
wherein $\otimes$ denotes the Kronecker outer product operation, $\cdot$ denotes the dot product operation, $\vec{P}$ denotes the event predicate vector, $\vec{S}$ denotes the event subject vector, and $\vec{O}$ denotes the event object vector;
S303, computing the similarity between events from the event distributed representations using the Euclidean distance;
S304, constructing event semantic knowledge according to the similarity between events.
In this embodiment, word vector representations are obtained on the background corpus using the Word2Vec model, and the event distributed representations are then computed by semantic composition. If the event is a triple, the Kronecker outer product of the event subject vector and the event object vector is dot-multiplied by the event predicate vector; if the event is a binary tuple, the predicate vector is directly dot-multiplied by the argument vector (subject or object). After the distributed vector representations of the events are obtained, the similarity between events is computed using the Euclidean distance so as to construct the event semantic knowledge. For example, the semantic similarity between the event "(airplane, crash, nil)" and the event "(airplane, crash, nil)" is about 0.8765. The invention introduces event semantic knowledge, expresses event semantics with mainstream distributed vectors, and uses event similarity to alleviate the event sparsity problem.
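The composition and similarity steps can be sketched with NumPy as follows. Several points are assumptions of this sketch rather than statements of the patent: the "dot product" with the outer product is read as a vector-matrix product, the binary case is read as an element-wise product so that both kinds of event share one dimension, the mapping from Euclidean distance to a similarity score is 1 / (1 + distance), and random vectors stand in for trained Word2Vec embeddings.

```python
import numpy as np


def event_vector(pred, subj=None, obj=None):
    """Compose an event vector from word vectors (interpretation assumed)."""
    if subj is not None and obj is not None:
        # Triple event: predicate vector applied to the subject-object outer product.
        return pred @ np.outer(subj, obj)
    arg = subj if subj is not None else obj
    # Binary event: predicate combined element-wise with its single argument.
    return pred * arg


def event_similarity(e1, e2):
    """Map Euclidean distance to a similarity in (0, 1]; the mapping is an assumption."""
    return 1.0 / (1.0 + np.linalg.norm(e1 - e2))


rng = np.random.default_rng(0)
# Toy 8-dimensional vectors standing in for Word2Vec embeddings.
words = {w: rng.normal(size=8) for w in ["bureau", "issue", "warning", "airplane", "lost"]}

triple = event_vector(words["issue"], words["bureau"], words["warning"])
binary = event_vector(words["lost"], subj=words["airplane"])
print(triple.shape, binary.shape)        # (8,) (8,)
print(event_similarity(triple, triple))  # 1.0
```

Under this reading the triple vector equals the object vector scaled by the predicate-subject inner product, which is one reason other composition functions may be preferred in practice.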
S4, as shown in FIG. 2, taking event pairs as entries, and constructing an event topic model using the event semantic knowledge and the Pólya urn model, implemented as follows:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating the events of a topic, $\varphi_k \sim \mathrm{Dir}(\beta)$, wherein $\varphi_k$ denotes the distribution of events under topic $k$, and $\mathrm{Dir}(\beta)$ denotes the Dirichlet distribution with hyper-parameter $\beta$;
S402, setting the multinomial distribution parameter for generating the topics of a document, $\theta_m \sim \mathrm{Dir}(\alpha)$, wherein $\theta_m$ denotes the topic distribution of document $m$, and $\mathrm{Dir}(\alpha)$ denotes the Dirichlet distribution with hyper-parameter $\alpha$;
S403, for each event co-occurrence pair $b = (e_i, e_j)$ in each news document $m$, sampling a topic $z_b \sim \mathrm{Mult}(\theta_m)$, and sampling the events $e_i \sim \mathrm{Mult}(\varphi_{z_b})$ and $e_j \sim \mathrm{Mult}(\varphi_{z_b})$, wherein event similarity is introduced during sampling using the Pólya urn model and the event semantic knowledge: the Pólya urn model tilts the topic distribution toward high-frequency events, and introducing event similarity during sampling increases the probability that similar events are sampled; the event similarity is obtained from the event semantic knowledge of step S3 (the strength of the semantic knowledge can be adjusted by a threshold), and the threshold adjustment expression of the event similarity is as follows:
(expression missing from the source text)
wherein $b$ denotes any event co-occurrence pair appearing in document $m$, $e_i$ denotes event $i$, $e_j$ denotes event $j$, $z_b$ denotes the topic of the event co-occurrence pair $b$ in the current sampling process, $\mathrm{Mult}(\theta_m)$ denotes the multinomial distribution with parameter $\theta_m$, $\mathrm{Mult}(\varphi_{z_b})$ denotes the multinomial distribution with parameter $\varphi_{z_b}$, the expression yields the adjusted similarity of event $e_i$ and event $e_j$, $\sigma$ denotes the set threshold, and $\mathrm{sim}(e_i, e_j)$ denotes the similarity of event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
S404, obtaining the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ according to the topic sampling and the event sampling, and constructing the event topic model according to the event-topic distribution $\varphi$ and the document-topic distribution $\theta$. In this embodiment, the topic sampling and the event sampling are performed iteratively according to the event topic model, and the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ are obtained after the iteration converges or reaches equilibrium.
In this embodiment, because event semantic knowledge is introduced, similar events such as "(human, death, nil)", "(human, casualty, nil)", and "(human, distress, nil)" are more likely to be sampled during generation, which effectively alleviates the long-tail phenomenon that follows event extraction.
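The effect described here can be illustrated with a Pólya-urn-style count update, in which assigning one event to a topic also adds fractional counts for similar events. The threshold rule used below (keep a similarity if it is at least sigma, otherwise set it to zero) is an assumption of this sketch, since the patent's own expression is not reproduced in the text; the function names and the toy similarity values are likewise hypothetical.

```python
import numpy as np


def adjusted_similarity(sim, sigma):
    """Assumed threshold rule: keep similarities at or above sigma, zero out the rest."""
    adj = np.where(sim >= sigma, sim, 0.0)
    np.fill_diagonal(adj, 1.0)  # an event always fully supports itself
    return adj


def promote(topic_event_counts, topic, event, adj):
    """Polya-urn-style update: assigning `event` to `topic` also adds
    fractional counts for the events similar to it."""
    topic_event_counts[topic] += adj[event]
    return topic_event_counts


# Toy similarities among three events: 0 "(human, death, nil)",
# 1 "(human, casualty, nil)", 2 "(communication, interruption, nil)".
sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.1],
                [0.2, 0.1, 1.0]])
adj = adjusted_similarity(sim, sigma=0.8)
counts = promote(np.zeros((2, 3)), topic=0, event=0, adj=adj)
print(counts[0].tolist())  # [1.0, 0.9, 0.0]
```

A rare event thus inherits evidence from its frequent near-synonyms, which is how the similarity knowledge counters sparsity.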
S5, according to the event topic model, taking topK events with highest topic probability as a topic event set.
In this embodiment, taking the topic "Typhoon Wilmason, No. 9 of this year, strikes" as an example, a traditional topic model with words as the basic unit yields 10 topic words such as "typhoon, Wilmason, landing, coastal, Guangdong, Hainan, center, influence, rainstorm, forecast", whereas the event topic model yields 10 topic events such as "(Wilmason, landing, nil), (nil, influenced), (nil, loss, yuan), (man, death, nil), (weather station, issue, early warning), (man, disaster, nil), (nil, start, response), (communication, interruption, nil), (nil, damage, farm house), (crop, disaster, nil)".
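Selecting the top-K events of step S5 amounts to sorting each topic's event distribution. The sketch below assumes the event-topic distribution is available as a topics-by-events matrix; the name `top_k_events` and the probability values are hypothetical.

```python
import numpy as np


def top_k_events(phi, vocab, k):
    """Return, for each topic, the k events with the highest probability."""
    result = []
    for topic_dist in phi:
        # Sort by descending probability; the stable sort keeps ties in vocabulary order.
        idx = np.argsort(-topic_dist, kind="stable")[:k]
        result.append([vocab[i] for i in idx])
    return result


vocab = [("Wilmason", "landing", "nil"), ("human", "death", "nil"),
         ("weather station", "issue", "early warning"), ("crop", "disaster", "nil")]
phi = np.array([[0.5, 0.2, 0.25, 0.05],
                [0.1, 0.6, 0.1, 0.2]])
print(top_k_events(phi, vocab, k=2)[0])
# [('Wilmason', 'landing', 'nil'), ('weather station', 'issue', 'early warning')]
```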
S6, constructing an event time sequence relation graph according to the theme event set and the sequence of the events, wherein the implementation method comprises the following steps:
S601, taking each topic event in the topic event set as a node, and determining the timing relationship of any event pair using statistical rules, which cover either of the following cases:
in the first case:
if the probability $p_1$ that the two topic events appear in the same document is the largest, counting the position order $p_2$ of the two topic events within the same document; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is: $p = p_1 \times p_2$;
in the second case:
if the probability $p_3$ that the two topic events appear in different documents is the largest, counting the reporting-time order $p_4$ of the documents containing the two topic events; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is: $p = p_3 \times p_4$;
S602, according to the order in which the events occur, taking the earlier event as the arc tail and the later event as the arc head, and constructing the event timing relationship graph;
S7, applying an improved topological sorting algorithm to the event timing relationship graph to obtain the emergency clue, thereby completing the extraction of the emergency clue, implemented as follows:
S701, outputting an event node sequence from the event timing relationship graph using the improved topological sorting algorithm, implemented as follows:
S7011, constructing a priority queue according to the event timing relationship graph, and enqueuing the node events whose in-degree is zero in the timing relationship graph;
S7012, dequeuing the zero-in-degree node events from the priority queue in turn, outputting each event, and deleting the arcs whose tail is the output event;
S7013, judging whether a new node event with zero in-degree exists; if so, enqueuing that node event, recording the currently deleted arc, and returning to step S7012; otherwise, proceeding to step S702;
S702, judging whether the timing relationship graph still contains event nodes that have not been output; if so, the remaining subgraph of the timing relationship graph contains a cycle, and the method proceeds to step S703; otherwise, it proceeds to step S704;
S703, deleting all arcs within the remaining subgraph; then, for each un-output node event in the remaining subgraph, scanning in sequence the arcs from the node events in the output event node sequence to that node event, selecting an arc according to the strength of the timing relationship, outputting the node event, and recording the selected arc;
S704, forming the emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue.
In this embodiment, taking the topic "Typhoon Wilmason, No. 9 of this year, strikes" as an example, the three events "(person, death, nil)", "(nil, damage, farm house)", and "(crop, disaster, nil)" form a ring structure in the constructed event timing relationship graph. A traditional topological sorting algorithm cannot add these three events to the event clue, whereas the improved topological sorting breaks the ring directly: because the already output event "(Wilmason, landing, nil)" has the strongest timing relationship with each of the three events, the three events can be output directly, and the timing relationships retained in the event clue are those between the event "(Wilmason, landing, nil)" and each of the three events.
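The ring-breaking behavior of step S703 in this example can be sketched as follows. The sketch assumes the plain topological pass has already produced the output sequence and the list of un-output nodes; the function name `break_cycles`, the choice to output the remaining nodes in their given order, and the strength values are assumptions made for illustration.

```python
def break_cycles(order, recorded, remaining, arcs):
    """S703 sketch: attach each un-output node to the output sequence.

    Arcs among the remaining nodes are ignored (deleted); for every remaining
    node, the strongest arc from an already output node is selected.
    """
    order, recorded = list(order), list(recorded)
    output_before = list(order)  # scan only the nodes output by the plain sort
    for node in remaining:
        candidates = [(arcs[(src, node)], src) for src in output_before
                      if (src, node) in arcs]
        if candidates:
            # max() keeps the first maximal element, so ties favor earlier nodes.
            _, src = max(candidates, key=lambda c: c[0])
            recorded.append((src, node))
        order.append(node)
    return order, recorded


landing = ("Wilmason", "landing", "nil")
death = ("person", "death", "nil")
damage = ("nil", "damage", "farm house")
disaster = ("crop", "disaster", "nil")
# Hypothetical strengths: the three events form a ring, and each is also
# reachable from the landing event.
arcs = {(landing, death): 0.9, (landing, damage): 0.8, (landing, disaster): 0.7,
        (death, damage): 0.3, (damage, disaster): 0.3, (disaster, death): 0.3}
order, recorded = break_cycles([landing], [], [death, damage, disaster], arcs)
print(order)
print(recorded)
```

All three ring events end up attached to the landing event, which matches the retained timing relationships described above.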
Compared with the prior art, the method constructs emergency clues using a topic model and a timing relationship graph. Event clues take triple events as the basic unit, which expresses the semantic relations between words while avoiding interference from noise words in sentences; an improved topic model generates the topic event set, and introducing the Pólya urn model and event semantic knowledge alleviates the data sparsity problem and yields the event-topic distribution effectively; the constructed event timing relationship graph expresses the logical relations between events intuitively, and the improved topological sorting algorithm outputs more intuitive event clues.
Claims (8)
1. A news-report-based emergency clue extraction method, characterized by comprising the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set by using a natural language processing method;
S2, taking a sentence as a unit, and extracting events according to the preprocessing result;
S3, obtaining distributed vector representations of the events according to the event extraction result, and constructing event semantic knowledge;
S4, taking event pairs as entries, and constructing an event topic model by using the event semantic knowledge and the Pólya urn model;
S5, according to the event topic model, taking the top-K events with the highest topic probability as a topic event set;
S6, constructing an event time sequence relation graph according to the topic event set and the order in which the events occur;
S7, applying an improved topological sorting algorithm to the event time sequence relation graph to obtain an emergency clue, thereby completing the extraction of the emergency clue;
the step S7 includes the following steps:
S701, outputting an event node sequence by applying the improved topological sorting algorithm to the event time sequence relation graph;
S702, determining whether the event time sequence relation graph still contains event nodes that have not been output; if so, the remaining subgraph of the graph contains a cycle, and proceeding to step S703; otherwise, proceeding to step S704;
S703, deleting all arcs in the remaining subgraph, scanning in sequence the arcs from the node events in the output event node sequence to each not-yet-output node event in the remaining subgraph, selecting an arc according to the strength of the time sequence relation, outputting each not-yet-output node event, and recording the selected arc;
S704, forming the emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue;
the step S701 includes the following steps:
S7011, constructing a priority queue according to the event time sequence relation graph, and enqueuing the node events whose in-degree in the graph is zero;
S7012, dequeuing in sequence the node events with zero in-degree from the priority queue, outputting each event, and deleting the arcs whose tail is the output event;
S7013, determining whether a new node event with zero in-degree exists; if so, enqueuing that node event, recording the currently deleted arc, and returning to step S7012; otherwise, proceeding to step S702.
2. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein the preprocessing in step S1 includes part-of-speech tagging, dependency parsing and coreference resolution.
3. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S2 includes the steps of:
S201, taking a sentence as a unit, and extracting all predicate relation pairs of each event according to the preprocessing result;
S202, determining whether any predicate relation pairs share the same predicate; if so, merging the pairs that share the predicate into a triple event and proceeding to step S3; otherwise, keeping the predicate relation pair as a binary event and proceeding to step S3, thereby completing the extraction of the events.
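Steps S201 and S202 can be illustrated with a short Python sketch. It assumes that the predicate relation pairs have already been extracted and arrive as `(role, predicate, word)` tuples; the role labels `"subj"` and `"obj"`, the function name `merge_pairs` and the sample input are hypothetical, and a missing slot is written as `nil` in the same way as the events quoted in the description.

```python
def merge_pairs(pairs):
    """Merge predicate relation pairs that share a predicate into events.

    pairs: list of (role, predicate, word), role being "subj" or "obj".
    Returns (subject, predicate, object) tuples; a binary event keeps
    "nil" in the slot that has no word. Hypothetical helper.
    """
    slots = {}
    for role, predicate, word in pairs:
        # One slot record per predicate, in order of first appearance.
        slot = slots.setdefault(predicate, {"subj": "nil", "obj": "nil"})
        slot[role] = word
    return [(s["subj"], p, s["obj"]) for p, s in slots.items()]


events = merge_pairs([
    ("subj", "damage", "typhoon"),    # made-up pair for illustration
    ("obj", "damage", "farm house"),
    ("subj", "death", "person"),
])
```

Here the two pairs that share the predicate `damage` become one triple event, while the single `death` pair stays a binary event.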
4. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S3 includes the steps of:
S301, obtaining word vector representations on the news corpus by using the Word2Vec algorithm according to the event extraction result;
S302, computing the distributed vector representation of each event from the word vector representations by using a compositional semantics algorithm;
S303, computing the similarity between events from the distributed representations of the events by using the Euclidean distance;
S304, constructing the event semantic knowledge according to the similarity between the events.
5. The method for extracting emergency clues based on news reports as claimed in claim 4, wherein the distributed representation of the event in step S302 covers either of the following cases:
in the first case:
if the event is a triple event, the distributed representation of the event is calculated from the predicate vector of the event and the Kronecker product of the subject vector and the object vector of the event, and is expressed as follows:
in the second case:
if the event is a binary event, the distributed representation of the event is calculated from the predicate vector of the event and the vector of the subject or the object of the event, and is expressed as follows:
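The operations named in claims 4 and 5 can be tried out with NumPy. The sketch below is only an illustration under stated assumptions: the source text does not preserve the formulas, so concatenating the predicate vector with the Kronecker product, and mapping the Euclidean distance d to a similarity 1 / (1 + d), are both choices made here for the example. The function names and the toy vectors are hypothetical.

```python
import numpy as np

def triple_event_vector(subject, predicate, obj):
    """Compose a triple event vector from its three word vectors."""
    # Kronecker product of the subject and object vectors (named in claim 5).
    interaction = np.kron(subject, obj)
    # ASSUMPTION: the predicate vector is concatenated in front; the actual
    # combination used by the patent is not recoverable from the text.
    return np.concatenate([predicate, interaction])

def event_similarity(u, v):
    """Similarity derived from the Euclidean distance (claim 4, S303)."""
    # ASSUMPTION: distance d is mapped into (0, 1] as 1 / (1 + d).
    return 1.0 / (1.0 + np.linalg.norm(u - v))


subject = np.array([1.0, 2.0])
predicate = np.array([0.0, 1.0])
obj = np.array([3.0, 4.0])
vec = triple_event_vector(subject, predicate, obj)
```

For d-dimensional word vectors the Kronecker product has d × d entries, so the composed vector in this sketch has d + d² entries.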
6. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S4 includes the steps of:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating the events of each topic, which is drawn from $\mathrm{Dir}(\beta)$, wherein this parameter represents the distribution of each event under a topic k, and $\mathrm{Dir}(\beta)$ denotes the Dirichlet distribution with hyperparameter $\beta$;
S402, setting the multinomial distribution parameter for generating document topics, $\theta_m \sim \mathrm{Dir}(\alpha)$, wherein $\theta_m$ represents the topic distribution of the document m, and $\mathrm{Dir}(\alpha)$ denotes the Dirichlet distribution with hyperparameter $\alpha$;
S403, for each news document m, sampling a topic $z_b \sim \mathrm{Mult}(\theta_m)$ for each event co-occurrence pair $b(e_i, e_j)$, and sampling event $e_i$ and event $e_j$ from the multinomial distribution over events of that topic, wherein the Pólya urn model and the event semantic knowledge are used in the sampling process to introduce the event similarity, and the event similarity is adjusted against a threshold as follows:
where b denotes any event co-occurrence pair occurring in document m, $e_i$ denotes event i, $e_j$ denotes event j, $z_b$ denotes the topic of the event co-occurrence pair b in the current sampling process, $\mathrm{Mult}(\theta_m)$ denotes the multinomial distribution with parameter $\theta_m$, the events are drawn from the multinomial distribution whose parameter is the event distribution of topic $z_b$, the adjusted similarity is that of event $e_i$ and event $e_j$, $\sigma$ denotes the set threshold, and $\mathrm{sim}(e_i, e_j)$ denotes the similarity of event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
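The generative process of steps S401 to S403 can be simulated in a few lines of NumPy. This is a sketch of the forward sampling only; it does not implement the inference procedure, the Pólya urn promotion of similar events, or the threshold rule, none of which can be reconstructed from the text. The name `phi` for the per-topic event distributions, the function name, and the hyperparameter values are assumptions.

```python
import numpy as np

def generate_event_pairs(num_topics, num_events, num_docs, pairs_per_doc,
                         alpha=1.0, beta=0.5, seed=0):
    """Forward-sample event co-occurrence pairs. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    # S401: one distribution over events per topic k, drawn from Dir(beta).
    phi = rng.dirichlet(np.full(num_events, beta), size=num_topics)
    corpus = []
    for _ in range(num_docs):
        # S402: topic distribution theta_m of document m, drawn from Dir(alpha).
        theta_m = rng.dirichlet(np.full(num_topics, alpha))
        doc = []
        for _ in range(pairs_per_doc):
            # S403: sample the topic z_b of the pair, then both events
            # from the event distribution of that topic.
            z_b = rng.choice(num_topics, p=theta_m)
            e_i, e_j = rng.choice(num_events, size=2, p=phi[z_b])
            doc.append((int(z_b), int(e_i), int(e_j)))
        corpus.append(doc)
    return phi, corpus


phi, corpus = generate_event_pairs(num_topics=3, num_events=5,
                                   num_docs=2, pairs_per_doc=4)
```

Each sampled tuple holds the topic index followed by the two event indices, so both events of a pair always share one topic, which is what lets pair-level modelling ease the sparsity of short texts.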
7. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S6 includes the steps of:
S601, taking each topic event in the topic event set as a node, and determining the time sequence relation of any event pair by using statistical rules;
S602, according to the order in which the events occur, taking the event that occurs first as the arc tail and the event that occurs later as the arc head, and constructing the event time sequence relation graph.
8. The method for extracting emergency clues based on news reports as claimed in claim 7, wherein determining the time sequence relation of any event pair by using statistical rules in step S601 covers either of the following cases:
in the first case:
counting the probability $p_1$ that the two topic events appear in the same document; if $p_1$ is the maximum, counting the positional order $p_2$ of the two topic events within the same document; and if event $e_i$ precedes event $e_j$, the strength of the time sequence relation is: $p = p_1 \times p_2$;
in the second case:
counting the probability $p_3$ that the two topic events appear in different documents; if $p_3$ is the maximum, counting the reporting-time order $p_4$ of the documents containing the two topic events; and if event $e_i$ precedes event $e_j$, the strength of the time sequence relation is: $p = p_3 \times p_4$.
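The first case of claim 8 can be made concrete with a small Python sketch. The claim does not say how the two statistics are estimated, so the estimators here are assumptions: $p_1$ is taken as the fraction of documents that contain both events, and $p_2$ as the fraction of those documents in which $e_i$ first appears before $e_j$. The function name and the toy documents are hypothetical.

```python
def same_document_strength(docs, e_i, e_j):
    """Strength p = p1 * p2 of the relation "e_i precedes e_j".

    docs: list of documents, each a list of events in order of appearance.
    Hypothetical helper under the estimators described above.
    """
    both = [d for d in docs if e_i in d and e_j in d]
    if not both:
        return 0.0
    # ASSUMPTION: p1 is the share of documents in which both events occur.
    p1 = len(both) / len(docs)
    # ASSUMPTION: p2 is the share of those documents where e_i comes first.
    p2 = sum(d.index(e_i) < d.index(e_j) for d in both) / len(both)
    return p1 * p2


docs = [["a", "b"], ["a", "b", "c"], ["b", "a"], ["c"]]
forward = same_document_strength(docs, "a", "b")
backward = same_document_strength(docs, "b", "a")
```

In this toy corpus the direction from `a` to `b` is the stronger one, so under step S602 `a` would be the arc tail and `b` the arc head. The second case would follow the same pattern, with the reporting times of the documents replacing the in-document positions.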
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910983942.9A CN110737819B (en) | 2019-10-16 | 2019-10-16 | Emergency clue extraction method based on news reports |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110737819A CN110737819A (en) | 2020-01-31 |
CN110737819B true CN110737819B (en) | 2022-09-16 |
Family
ID=69269147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910983942.9A Active CN110737819B (en) | 2019-10-16 | 2019-10-16 | Emergency clue extraction method based on news reports |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110737819B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069383A (en) * | 2020-08-31 | 2020-12-11 | 杭州叙简科技股份有限公司 | News text event and time extraction and normalization system for event tracking |
CN113312490B (en) * | 2021-04-28 | 2023-04-18 | 乐山师范学院 | Event knowledge graph construction method for emergency |
CN114626339A (en) * | 2022-03-10 | 2022-06-14 | 深圳市大数据研究院 | Chinese clue generating method, system, computer equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
CN104915446A (en) * | 2015-06-29 | 2015-09-16 | 华南理工大学 | Automatic extracting method and system of event evolving relationship based on news |
CN106156299A (en) * | 2016-06-29 | 2016-11-23 | 北京小米移动软件有限公司 | The subject content recognition methods of text message and device |
CN109145224A (en) * | 2018-08-20 | 2019-01-04 | 电子科技大学 | Social networks event-order serie relationship analysis method |
CN109344239A (en) * | 2018-09-20 | 2019-02-15 | 四川昆仑智汇数据科技有限公司 | A kind of business process model querying method and inquiry system based on temporal aspect |
CN110069636A (en) * | 2019-05-05 | 2019-07-30 | 苏州大学 | Merge the event-order serie relation recognition method of dependence and chapter rhetoric relationship |
Non-Patent Citations (2)
Title |
---|
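"Research on Event Evolution Analysis Methods for Social Networks" (《社交网络事件演化分析方法研究》); Zhou Lei (周磊); Wanfang Data (《万方数据》); 20190916; Chapters 3 and 4 * |
"A Topic Representation Method Incorporating Event Knowledge" (《融入事件知识的主题表示方法》); Sun Rui (孙锐); Chinese Journal of Computers (《计算机学报》); 20170430; pp. 3-11 * |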
Also Published As
Publication number | Publication date |
---|---|
CN110737819A (en) | 2020-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230007965A1 (en) | Entity relation mining method based on biomedical literature | |
WO2022227207A1 (en) | Text classification method, apparatus, computer device, and storage medium | |
CN110737819B (en) | Emergency clue extraction method based on news reports | |
CN112487203B (en) | Relation extraction system integrated with dynamic word vector | |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device | |
CN104516947B (en) | A kind of Chinese microblog emotional analysis method for merging dominant and recessive character | |
Lytvyn et al. | Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian | |
CN112328797A (en) | Emotion classification method and system based on neural network and attention mechanism | |
CN114579746B (en) | Optimized high-precision text classification method and device | |
CN112836051A (en) | Online self-learning court electronic file text classification method | |
Xu et al. | Chinese event detection based on multi-feature fusion and BiLSTM | |
CN116383430A (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN114265943A (en) | Causal relationship event pair extraction method and system | |
JP2015007920A (en) | Extraction of social structural model using text processing | |
Shanto et al. | Cyberbullying detection using deep learning techniques on bangla facebook comments | |
Luo et al. | Unsupervised learning of morphological forests | |
Wang et al. | Construction of causality event evolutionary graph of aviation accident | |
Jia et al. | Tibetan text classification method based on BiLSTM model | |
Chen et al. | Distant supervision for relation extraction via noise filtering | |
CN110705277A (en) | Chinese word sense disambiguation method based on cyclic neural network | |
Saharia | Detecting emotion from short messages on Nepal earthquake | |
CN116089606A (en) | Method, device, electronic equipment and storage medium for classifying spam messages | |
CN113434668B (en) | Deep learning text classification method and system based on model fusion | |
CN115600584A (en) | Mongolian emotion analysis method combining DRCNN-BiGRU dual channels with GAP | |
CN113590768B (en) | Training method and device for text relevance model, question answering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||