CN110737819B - Emergency clue extraction method based on news reports - Google Patents

Emergency clue extraction method based on news reports

Info

Publication number
CN110737819B
Authority
CN
China
Prior art keywords
event
events
topic
emergency
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910983942.9A
Other languages
Chinese (zh)
Other versions
CN110737819A (en)
Inventor
孙锐
金澎
敬思远
谢红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leshan Normal University
Original Assignee
Leshan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leshan Normal University
Priority to CN201910983942.9A
Publication of CN110737819A
Application granted
Publication of CN110737819B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for extracting emergency clues from news reports. The method preprocesses news texts; extracts events from the preprocessing result; obtains distributed representations of the events and computes event similarity to construct event semantic knowledge; constructs an event topic model to obtain the event-topic distribution and the document-topic distribution; takes the events with the highest topic probability as the topic event set; constructs a timing relationship graph of the events, with each topic event as a node and the order in which the events occur as arcs; and outputs the final event clue by using an improved topological sorting algorithm. With this design, the invention can acquire emergency clues accurately and completely, and solves the problems of weak semantic expression of event clues and low clue acquisition accuracy in the prior art. The method is flexible and has strong application and popularization value.

Description

Emergency clue extraction method based on news reports
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an emergency clue extraction method based on news reports.
Background
An emergency refers to a natural disaster, accident, public health event, or social security event that occurs suddenly, causes or may cause serious social harm, and requires emergency response measures. To prevent and reduce emergencies, and to control, mitigate, and eliminate the serious social harm they cause, governments and relevant departments need to standardize emergency response activities, comprehensively assess possible emergencies, and minimize the impact of major emergencies. Emergencies have pronounced temporal characteristics, and their logical order can be represented by the topic evolution of events, that is, by emergency clues. For example, when the event "Typhoon Rammasun, No. 9 of 2014" occurred, events such as "casualties", "crop damage", and "communication interruption" occurred along with it; as time went on, a series of related events followed, such as "Rammasun makes landfall in China", "weather station issues early warning", "relevant departments issue notices", "relevant personnel are relocated", and "disease prevention". These related events are all sub-events that evolve from or are derived under the topic "Typhoon Rammasun", and they stand in temporal or causal relationships to one another. Obtaining the clues of an emergency accurately and completely is important for understanding the causes and consequences of the emergency and the trend of the situation, and provides reference and predictive value for handling similar emergencies.
One approach in the prior art takes a word or phrase as the basic unit and applies a topic model to obtain the distribution of words over topics. A set of high-frequency topic words represents each subtopic, and the document report time represents the evolution of the topic. This approach has the following defects: words or phrases, taken as basic units, are semantically isolated, so the semantic relations between words are ignored and topics cannot be described completely; and words carry no notion of time, so the temporal characteristics of a topic can be reflected only through the document report time. Another approach in the prior art takes ACE events as the basic unit and identifies and infers the relations between events to describe the evolution of topics. This approach has the following defects: ACE events are divided into 8 major categories and 33 subcategories, so the event domain is restricted and extraction accuracy is limited; most ACE events are coarse-grained sentence-level or document-level events, so some fine-grained events cannot be extracted; and the definition of event relations has no unified structure, so relation judgment accuracy is low and implementation is difficult.
Therefore, an emergency clue extraction method based on news reports is designed. An event clue takes triple atomic events (Subject, Predicate, Object) as the basic unit and represents the clue by the timing relationships among events. An improved topic model generates a set of events strongly related to the topic (namely, topic events), and an improved topological sorting algorithm is applied to the constructed event timing relationship graph to output the final event clue.
Disclosure of Invention
Aiming at the above defects in the prior art, the emergency clue extraction method based on news reports provided by the invention solves the problems of weak semantic expression of event clues and low clue acquisition accuracy in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The scheme provides an emergency clue extraction method based on news reports, which comprises the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set by using a natural language processing method;
S2, taking a statement as a unit, and extracting events according to the preprocessing result;
S3, obtaining event distributed representations according to the event extraction result, and constructing event semantic knowledge;
S4, taking event pairs as entries, and constructing an event topic model by using the event semantic knowledge and the Pólya urn model;
S5, according to the event topic model, taking the top-K events with the highest topic probability as the topic event set;
S6, constructing an event timing relationship graph according to the topic event set and the order in which the events occur;
S7, computing the emergency clue from the event timing relationship graph by using an improved topological sorting algorithm, thereby completing the extraction of the emergency clue.
Further, the preprocessing in step S1 includes part-of-speech tagging, dependency parsing, and coreference resolution.
Still further, the step S2 includes the following steps:
S201, taking a statement as a unit, and extracting all predicate relation pairs in the statement according to the preprocessing result;
S202, judging whether predicate relation pairs share the same predicate; if so, merging the pairs that share a predicate into a triple event and entering step S3; otherwise, keeping the predicate relation pair as a binary event and entering step S3, thereby completing the extraction of events.
Still further, the step S3 includes the following steps:
S301, obtaining word vector representations on the news corpus by using the Word2Vec algorithm according to the event extraction result;
S302, computing event distributed representations from the word vector representations by using a compositional semantics algorithm;
S303, computing the similarity between events from the event distributed representations by using Euclidean distance;
S304, constructing event semantic knowledge according to the similarity between events.
Still further, the event distributed representation in step S302 covers either of the following cases:
Case 1:
if the event is a triple event, the event distributed representation $\mathbf{v}_e$ is computed from the predicate vector of the event and the Kronecker outer product of the subject vector and the object vector of the event:
$$\mathbf{v}_e = \mathbf{v}_p \cdot (\mathbf{v}_s \otimes \mathbf{v}_o)$$
Case 2:
if the event is a binary event, the event distributed representation $\mathbf{v}_e$ is computed from the predicate vector of the event and the vector of its subject or its object:
$$\mathbf{v}_e = \mathbf{v}_p \cdot \mathbf{v}_s \quad \text{or} \quad \mathbf{v}_e = \mathbf{v}_p \cdot \mathbf{v}_o$$
where $\otimes$ represents the Kronecker outer product operation, $\cdot$ represents the dot product operation, $\mathbf{v}_p$ represents the event predicate vector, $\mathbf{v}_s$ represents the event subject vector, and $\mathbf{v}_o$ represents the event object vector.
Still further, the step S4 includes the following steps:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating event topics, $\varphi_k \sim \mathrm{Dir}(\beta)$, where $\varphi_k$ represents the distribution of events under topic $k$, and $\mathrm{Dir}(\beta)$ represents a Dirichlet distribution with hyper-parameter $\beta$;
S402, setting the multinomial distribution parameter for generating document topics, $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\theta_m$ represents the topic distribution of document $m$, and $\mathrm{Dir}(\alpha)$ represents a Dirichlet distribution with hyper-parameter $\alpha$;
S403, for each news document $m$ and each event co-occurrence pair $b = (e_i, e_j)$, sampling a topic $z_b \sim \mathrm{Mult}(\theta_m)$, and sampling event $e_i \sim \mathrm{Mult}(\varphi_{z_b})$ and event $e_j \sim \mathrm{Mult}(\varphi_{z_b})$, wherein event similarity is introduced during sampling by using the Pólya urn model and the event semantic knowledge, and the threshold adjustment expression of the event similarity is as follows:
Figure BDA00022361057900000413
where $b$ represents any event co-occurrence pair in document $m$, $e_i$ represents event $i$, $e_j$ represents event $j$, $z_b$ represents the topic of the event co-occurrence pair $b$ in the current sampling process, $\mathrm{Mult}(\theta_m)$ represents a multinomial distribution with parameter $\theta_m$, $\mathrm{Mult}(\varphi_{z_b})$ represents a multinomial distribution with parameter $\varphi_{z_b}$, the left-hand side of the threshold adjustment expression represents the adjusted similarity of event $e_i$ and event $e_j$, $\sigma$ represents the set threshold, and $\mathrm{sim}(e_i, e_j)$ represents the similarity of event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
S404, obtaining the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ from the topic sampling and the event sampling, and constructing the event topic model from the event-topic distribution $\varphi$ and the document-topic distribution $\theta$.
Still further, the step S6 includes the following steps:
S601, taking each topic event in the topic event set as a node, and determining the timing relationship of any event pair by using statistical rules;
S602, according to the order in which the events occur, taking the earlier event as the arc tail and the later event as the arc head, and constructing the event timing relationship graph.
Still further, determining the timing relationship of any event pair by using statistical rules in step S601 covers either of the following cases:
Case 1:
counting the probability $p_1$ that the two topic events appear in the same document; if this probability is the largest, counting the position order $p_2$ in which the two topic events appear in the same document; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is $p = p_1 \times p_2$.
Case 2:
counting the probability $p_3$ that the two topic events appear in different documents; if this probability is the largest, counting the report time order $p_4$ of the documents containing the two topic events; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is $p = p_3 \times p_4$.
Still further, the step S7 includes the following steps:
S701, outputting an event node sequence from the event timing relationship graph by using an improved topological sorting algorithm;
S702, judging whether the timing relationship graph still has event nodes that have not been output; if so, the remaining subgraph of the timing relationship graph contains a loop, and step S703 is entered; otherwise, step S704 is entered;
S703, deleting all arcs in the remaining subgraph, scanning in order the arcs from the node events in the output event node sequence to each non-output node event in the remaining subgraph, selecting an arc according to the strength of the timing relationship, outputting each non-output node event, and recording the selected arc;
S704, forming the emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue.
Still further, the step S701 includes the following steps:
S7011, constructing a priority queue according to the event timing relationship graph, and enqueuing the node events whose in-degree is zero in the timing relationship graph;
S7012, dequeuing in order the node events with zero in-degree from the priority queue, outputting these events, and deleting the arcs whose tails are the output events;
S7013, judging whether a new node event with zero in-degree exists; if so, enqueuing that node event, recording the currently deleted arc, and returning to step S7012; otherwise, entering step S702.
The invention has the beneficial effects that:
(1) the method takes the triple event as the basic unit; the extraction algorithm is simple to implement, and the granularity of triple events in a document lies between words and sentences, so the events can express the semantic relations between words while avoiding interference from noise words in sentences;
(2) the method introduces event semantic knowledge, expresses event semantics with mainstream distributed vectors, and uses event similarity to alleviate the event sparsity problem;
(3) the method uses an event topic model to cluster topic events automatically; the topic model is built on event pairs, and the topic distributions of events and documents are obtained by combining the Pólya urn model with event semantic knowledge;
(4) the method constructs an event timing relationship graph in which each node represents a topic event and each arc represents the timing relationship of an event pair, and outputs the final event clue by using an improved topological sorting algorithm.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a structural diagram of the topic model in this embodiment.
Detailed Description
The following description of specific embodiments is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined in the appended claims, and everything created by using the inventive concept is protected.
Examples
The invention provides an emergency clue extraction method based on news reports, which uses a topic model and a timing relationship graph algorithm to construct emergency clues. The embodiment uses thematic documents collected from Sina (including the topics "Typhoon Rammasun, No. 9 of this year, strikes" (92), "Taiwan passenger plane forced landing and fire" (102), "Guangdong hit by the most severe dengue fever epidemic in 20 years" (38), "Hangzhou bus arson case" (54), and others). It shows that the technique is simple and effective to implement, and that its unsupervised learning mode allows it to run without excessive manual intervention. The generated event clues take triple events as the basic unit; an improved topic model generates the topic event set, and event semantic knowledge is introduced into the topic model. Each topic event is taken as a node, the order of the events is computed statistically and taken as arcs to construct a timing relationship graph between events, and an improved topological sorting algorithm outputs the final event clue. As shown in FIG. 1, the method comprises the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set by using a natural language processing method, wherein the preprocessing comprises part-of-speech tagging, dependency parsing, and coreference resolution;
S2, taking a statement as a unit, and extracting events according to the preprocessing result, wherein the implementation method comprises the following steps:
S201, taking a statement as a unit, and extracting all predicate relation pairs in the statement according to the preprocessing result;
S202, judging whether predicate relation pairs share the same predicate; if so, merging the pairs that share a predicate into a triple event and entering step S3; otherwise, keeping the predicate relation pair as a binary event and entering step S3, thereby completing the extraction of events.
In this embodiment, a statement may contain several events. All possible predicate relation pairs in the statement, such as "NSUBJ" and "DOBJ" relations, are extracted. If an "NSUBJ" relation and a "DOBJ" relation share the same predicate, they are merged into one triple event; if the dependency relations cannot be merged, they are retained as binary events. For example, the statement "weather bureau issues typhoon warning" has two dependency pairs, "NSUBJ (issue, weather bureau)" and "DOBJ (issue, warning)", whose predicates are both "issue", so they can be merged into the triple event "(weather bureau, issue, warning)"; whereas for the statement "airplane lost over the open sea", only the binary event "(airplane, lost, nil)" can be extracted ("nil" indicates that the event argument is missing). The invention takes the triple event as the basic unit; the extraction algorithm is simple to implement, and the granularity of triple events in a document lies between words and sentences, so the events can express the semantic relations between words while avoiding interference from noise words in sentences.
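The merging rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_events`, the `(relation, predicate, argument)` tuple format, and the simplification that each predicate keeps at most one subject and one object are all hypothetical, and a real system would obtain these relations from a dependency parser.

```python
def extract_events(dependencies):
    """Merge NSUBJ/DOBJ pairs that share a predicate into events.

    `dependencies` is a list of (relation, predicate, argument) tuples for one
    statement (an assumed input format). Returns a list of
    (subject, predicate, object) tuples; a missing argument is "nil".
    """
    subjects, objects, order = {}, {}, []
    for relation, predicate, argument in dependencies:
        if relation not in ("NSUBJ", "DOBJ"):
            continue  # only subject and direct-object relations form events
        if predicate not in order:
            order.append(predicate)  # remember first-appearance order
        target = subjects if relation == "NSUBJ" else objects
        target.setdefault(predicate, argument)  # keep the first argument seen
    # A predicate with both arguments yields a triple event; otherwise the
    # missing slot is filled with "nil", giving a binary event.
    return [(subjects.get(p, "nil"), p, objects.get(p, "nil")) for p in order]


triple = extract_events([("NSUBJ", "issue", "weather bureau"),
                         ("DOBJ", "issue", "warning")])
binary = extract_events([("NSUBJ", "lost", "airplane")])
```

With the two example statements from the text, `triple` holds the single merged event and `binary` holds the event whose object slot is "nil".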
S3, obtaining event distributed representations according to the event extraction result, and constructing event semantic knowledge, wherein the implementation method comprises the following steps:
S301, obtaining word vector representations on the news corpus by using the Word2Vec algorithm according to the event extraction result;
S302, computing event distributed representations from the word vector representations by using a compositional semantics algorithm, wherein the event distributed representation covers either of the following cases:
Case 1:
if the event is a triple event, the event distributed representation $\mathbf{v}_e$ is computed from the predicate vector of the event and the Kronecker outer product of the subject vector and the object vector of the event:
$$\mathbf{v}_e = \mathbf{v}_p \cdot (\mathbf{v}_s \otimes \mathbf{v}_o)$$
Case 2:
if the event is a binary event, the event distributed representation $\mathbf{v}_e$ is computed from the predicate vector of the event and the vector of its subject or its object:
$$\mathbf{v}_e = \mathbf{v}_p \cdot \mathbf{v}_s \quad \text{or} \quad \mathbf{v}_e = \mathbf{v}_p \cdot \mathbf{v}_o$$
where $\otimes$ represents the Kronecker outer product operation, $\cdot$ represents the dot product operation, $\mathbf{v}_p$ represents the event predicate vector, $\mathbf{v}_s$ represents the event subject vector, and $\mathbf{v}_o$ represents the event object vector;
S303, computing the similarity between events from the event distributed representations by using Euclidean distance;
S304, constructing event semantic knowledge according to the similarity between events.
In this embodiment, word vector representations are obtained on a background corpus by using the Word2Vec model, and event distributed representations are then computed by semantic composition. If the event is a triple, the Kronecker outer product of the event subject vector and object vector is dot-multiplied with the event predicate vector; if the event is a binary tuple, the predicate vector is dot-multiplied directly with the argument vector (subject or object). After the distributed vector representations of the events are obtained, the similarity between events is computed by using Euclidean distance so as to construct event semantic knowledge. For example, the semantic similarity between the event "(airplane, crash, nil)" and the event "(airplane, crash, nil)" is about 0.8765. The invention introduces event semantic knowledge, expresses event semantics with mainstream distributed vectors, and uses event similarity to alleviate the event sparsity problem.
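The composition and similarity steps can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than details confirmed by the text: the outer product is taken as a d×d matrix that is then left-multiplied by the predicate vector, the binary case uses an element-wise product so that the result stays a vector, and the Euclidean distance is mapped to a similarity with 1 / (1 + distance). The function names are hypothetical.

```python
import math


def outer(u, v):
    """Outer product of two vectors as a nested list (len(u) x len(v))."""
    return [[a * b for b in v] for a in u]


def triple_vector(subject, predicate, obj):
    """Assumed composition for a triple event: predicate^T (subject ⊗ object)."""
    matrix = outer(subject, obj)
    return [sum(p * row[j] for p, row in zip(predicate, matrix))
            for j in range(len(obj))]


def binary_vector(predicate, argument):
    """Assumed composition for a binary event: element-wise product."""
    return [p * a for p, a in zip(predicate, argument)]


def similarity(u, v):
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))


# Toy 2-dimensional "word vectors" (made-up values, not Word2Vec output).
v_triple = triple_vector([1.0, 0.0], [1.0, 2.0], [3.0, 4.0])
v_binary = binary_vector([1.0, 2.0], [3.0, 4.0])
```

In practice the word vectors would come from a trained Word2Vec model, and the resulting pairwise similarities would be stored as the event semantic knowledge.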
S4, as shown in FIG. 2, taking event pairs as entries, and constructing an event topic model by using the event semantic knowledge and the Pólya urn model, wherein the implementation method is as follows:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating event topics, $\varphi_k \sim \mathrm{Dir}(\beta)$, where $\varphi_k$ represents the distribution of events under topic $k$, and $\mathrm{Dir}(\beta)$ represents a Dirichlet distribution with hyper-parameter $\beta$;
S402, setting the multinomial distribution parameter for generating document topics, $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\theta_m$ represents the topic distribution of document $m$, and $\mathrm{Dir}(\alpha)$ represents a Dirichlet distribution with hyper-parameter $\alpha$;
S403, for each news document $m$ and each event co-occurrence pair $b = (e_i, e_j)$, sampling a topic $z_b \sim \mathrm{Mult}(\theta_m)$, and sampling event $e_i \sim \mathrm{Mult}(\varphi_{z_b})$ and event $e_j \sim \mathrm{Mult}(\varphi_{z_b})$. During sampling, event similarity is introduced by using the Pólya urn model and the event semantic knowledge: the Pólya urn model tilts the topic distribution toward high-frequency events, and introducing event similarity increases the probability of sampling similar events. The event similarity is obtained from the event semantic knowledge constructed in step S3 (the strength of the semantic knowledge can be adjusted by a threshold), and the threshold adjustment expression of the event similarity is as follows:
Figure BDA0002236105790000105
where $b$ represents any event co-occurrence pair in document $m$, $e_i$ represents event $i$, $e_j$ represents event $j$, $z_b$ represents the topic of the event co-occurrence pair $b$ in the current sampling process, $\mathrm{Mult}(\theta_m)$ represents a multinomial distribution with parameter $\theta_m$, $\mathrm{Mult}(\varphi_{z_b})$ represents a multinomial distribution with parameter $\varphi_{z_b}$, the left-hand side of the threshold adjustment expression represents the adjusted similarity of event $e_i$ and event $e_j$, $\sigma$ represents the set threshold, and $\mathrm{sim}(e_i, e_j)$ represents the similarity of event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
S404, obtaining the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ from the topic sampling and the event sampling, and constructing the event topic model from the event-topic distribution $\varphi$ and the document-topic distribution $\theta$. In this embodiment, topic sampling and event sampling are performed iteratively according to the event topic model, and the event-topic distribution $\varphi$ and the document-topic distribution $\theta$ are obtained after the iteration converges or reaches equilibrium.
In this embodiment, because event semantic knowledge is introduced, the probability that similar events such as "(person, death, nil)", "(person, casualty, nil)", and "(person, distress, nil)" are sampled during generation is increased, which effectively mitigates the long-tail phenomenon that arises after event extraction.
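The effect of the urn scheme on the topic-event counts can be sketched as below. This is only an illustration under assumptions: the threshold rule (keep the similarity when it reaches σ, otherwise use zero) is a guess at the lost formula, the names `adjusted_similarity` and `add_to_urn` are hypothetical, the similarity values are made up, and the surrounding Gibbs sampler is omitted.

```python
from collections import defaultdict


def adjusted_similarity(sim, sigma):
    # Assumed threshold rule: keep the similarity only when it reaches sigma.
    return sim if sim >= sigma else 0.0


def add_to_urn(counts, topic, event, similar_events, sigma):
    """Assign `event` to `topic` and promote its similar events.

    The sampled event adds one full count to the topic; every similar event
    adds a fractional count equal to its threshold-adjusted similarity.
    """
    counts[topic][event] += 1.0
    for other, sim in similar_events.get(event, {}).items():
        counts[topic][other] += adjusted_similarity(sim, sigma)


death = ("person", "death", "nil")
casualty = ("person", "casualty", "nil")
crop = ("crop", "disaster", "nil")

# Event semantic knowledge as a nested dict of pairwise similarities
# (illustrative values only).
knowledge = {death: {casualty: 0.9, crop: 0.2}}

counts = defaultdict(lambda: defaultdict(float))
add_to_urn(counts, 0, death, knowledge, sigma=0.5)
```

After the call, the rare but similar event `casualty` has gained weight under topic 0 even though it was not sampled itself, which is how the scheme eases the long-tail problem described above.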
S5, according to the event topic model, taking topK events with highest topic probability as a topic event set.
In this embodiment, taking the topic "Typhoon Rammasun, No. 9 of this year, strikes" as an example, a traditional topic model with words as the basic unit yields topic words (10) such as "typhoon, Rammasun, landing, coastal, Guangdong, Hainan, center, influence, rainstorm, forecast", whereas the event topic model yields topic events (10) such as "(Rammasun, landing, nil), (nil, influenced), (nil, loss, yuan), (person, death, nil), (weather station, issue, early warning), (person, disaster, nil), (nil, start, response), (communication, interruption, nil), (nil, damage, farm house), (crop, disaster, nil)".
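Selecting the topic event set in step S5 amounts to a top-K selection over the event probabilities of one topic. A minimal sketch, assuming the distribution for a topic is available as a dict from event to probability (the function name and the probability values are made up for illustration):

```python
import heapq


def topic_event_set(event_probs, k):
    """Return the k events with the highest probability under one topic."""
    best = heapq.nlargest(k, event_probs.items(), key=lambda item: item[1])
    return [event for event, _ in best]


# Illustrative probabilities only; real values come from the topic model.
probs = {
    ("typhoon", "landing", "nil"): 0.30,
    ("weather station", "issue", "early warning"): 0.25,
    ("communication", "interruption", "nil"): 0.05,
}
top_two = topic_event_set(probs, 2)
```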
S6, constructing an event timing relationship graph according to the topic event set and the order in which the events occur, wherein the implementation method comprises the following steps:
S601, taking each topic event in the topic event set as a node, and determining the timing relationship of any event pair by using statistical rules, wherein determining the timing relationship of any event pair by using statistical rules covers either of the following cases:
Case 1:
counting the probability $p_1$ that the two topic events appear in the same document; if this probability is the largest, counting the position order $p_2$ in which the two topic events appear in the same document; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is $p = p_1 \times p_2$.
Case 2:
counting the probability $p_3$ that the two topic events appear in different documents; if this probability is the largest, counting the report time order $p_4$ of the documents containing the two topic events; if event $e_i$ precedes event $e_j$, the strength of the timing relationship is $p = p_3 \times p_4$.
S602, according to the sequence of events, taking the event which occurs first as an arc tail and the event which occurs later as an arc head, and constructing an event time sequence relation graph;
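One possible reading of the statistical rules in S601 is sketched below. The text does not define how the probabilities are estimated, so everything here is an assumption: the same-document and cross-document probabilities are estimated from occurrence counts, the order terms are the fraction of cases in which the first event comes first, and the document format `(report_time, [events in textual order])` and the function name are hypothetical.

```python
def timing_strength(docs, a, b):
    """Strength of the timing relation "a before b" (assumed estimator).

    `docs` is a list of (report_time, events) pairs, where `events` lists the
    topic events of one document in textual order.
    """
    same_total = same_before = 0    # documents containing both events
    cross_total = cross_before = 0  # pairs of different documents
    for i, (time_i, events_i) in enumerate(docs):
        if a not in events_i:
            continue
        if b in events_i:
            same_total += 1
            if events_i.index(a) < events_i.index(b):
                same_before += 1
        for j, (time_j, events_j) in enumerate(docs):
            if i != j and b in events_j:
                cross_total += 1
                if time_i < time_j:
                    cross_before += 1
    total = same_total + cross_total
    if total == 0:
        return 0.0
    if same_total >= cross_total:
        # Case 1: p = p1 * p2 (same-document probability times position order)
        return (same_total / total) * (same_before / same_total)
    # Case 2: p = p3 * p4 (cross-document probability times report-time order)
    return (cross_total / total) * (cross_before / cross_total)


docs = [(1, ["landing", "warning"]),
        (2, ["landing", "warning"]),
        (3, ["damage"])]
```

An arc from the earlier event to the later event would then be added to the graph with this strength as its weight, for example from "landing" to "damage" in the toy data.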
S7, computing the emergency clue from the event timing relationship graph by using an improved topological sorting algorithm, thereby completing the extraction of the emergency clue, wherein the implementation method comprises the following steps:
S701, outputting an event node sequence from the event timing relationship graph by using an improved topological sorting algorithm, wherein the implementation method comprises the following steps:
S7011, constructing a priority queue according to the event timing relationship graph, and enqueuing the node events whose in-degree is zero in the timing relationship graph;
S7012, dequeuing in order the node events with zero in-degree from the priority queue, outputting these events, and deleting the arcs whose tails are the output events;
S7013, judging whether a new node event with zero in-degree exists; if so, enqueuing that node event, recording the currently deleted arc, and returning to step S7012; otherwise, entering step S702;
S702, judging whether the timing relationship graph still has event nodes that have not been output; if so, the remaining subgraph of the timing relationship graph contains a loop, and step S703 is entered; otherwise, step S704 is entered;
S703, deleting all arcs in the remaining subgraph, scanning in order the arcs from the node events in the output event node sequence to each non-output node event in the remaining subgraph, selecting an arc according to the strength of the timing relationship, outputting each non-output node event, and recording the selected arc;
S704, forming the emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue.
In this embodiment, taking the topic "Typhoon Rammasun, No. 9 of this year, strikes" as an example, the three events "(person, death, nil)", "(nil, damage, farm house)", and "(crop, disaster, nil)" form a ring structure in the constructed event timing relationship graph. A traditional topological sorting algorithm cannot add these three events to the event clue, whereas the improved topological sorting breaks the ring directly: because the already output event "(Rammasun, landing, nil)" has the strongest timing relationship with each of the three events, they can be output directly, and the timing relationships retained in the event clue are those between the event "(Rammasun, landing, nil)" and each of the three events.
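Steps S701 to S704 can be sketched as a Kahn-style topological sort followed by a loop-breaking pass. This is a hedged illustration, not the patented implementation: the text does not say how the priority queue is ordered, so the sketch assumes nodes are prioritized by their position in the input list, and the names `extract_clue`, `nodes`, and `arcs`, as well as the strength values, are hypothetical.

```python
import heapq


def extract_clue(nodes, arcs):
    """nodes: list of events; arcs: dict {(tail, head): strength}.

    Returns (sequence, recorded): the output order of the events and the arcs
    retained in the clue.
    """
    rank = {node: i for i, node in enumerate(nodes)}  # assumed priority
    indegree = {node: 0 for node in nodes}
    successors = {node: [] for node in nodes}
    for tail, head in arcs:
        indegree[head] += 1
        successors[tail].append(head)

    # S7011: enqueue all nodes with zero in-degree.
    queue = [(rank[n], n) for n in nodes if indegree[n] == 0]
    heapq.heapify(queue)
    sequence, recorded = [], []
    while queue:
        # S7012: output the node and delete the arcs leaving it.
        _, node = heapq.heappop(queue)
        sequence.append(node)
        for head in successors[node]:
            indegree[head] -= 1
            if indegree[head] == 0:
                # S7013: record the arc that freed the new zero in-degree node.
                recorded.append((node, head))
                heapq.heappush(queue, (rank[head], head))

    # S702/S703: nodes left over lie on loops. Ignore the arcs among them and
    # attach each one to the already output node with the strongest arc to it.
    output_so_far = list(sequence)
    done = set(sequence)
    for node in nodes:
        if node in done:
            continue
        candidates = [(arcs[(src, node)], src)
                      for src in output_so_far if (src, node) in arcs]
        if candidates:
            _, best_src = max(candidates, key=lambda c: c[0])
            recorded.append((best_src, node))
        sequence.append(node)
    return sequence, recorded  # S704: together these form the clue


nodes = ["landing", "warning", "death", "damage", "disaster"]
arcs = {("landing", "warning"): 0.6, ("landing", "death"): 0.9,
        ("landing", "damage"): 0.8, ("landing", "disaster"): 0.7,
        ("warning", "death"): 0.3, ("death", "damage"): 0.4,
        ("damage", "disaster"): 0.4, ("disaster", "death"): 0.4}
sequence, recorded = extract_clue(nodes, arcs)
```

In this toy graph the last three events form a ring, mirroring the example above: a plain topological sort would stop after the first two events, while the loop-breaking pass attaches each ring event to "landing", the output event with the strongest arc to it.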
Compared with the prior art, the method constructs emergency clues by using a topic model and a time sequence relation graph. Event clues take triple events as basic units, which both expresses the semantic relations between words and avoids interference from noise words in sentences. The topic event set is generated with an improved topic model, in which introducing the Pólya urn model and event semantic knowledge alleviates the data sparsity problem and yields an effective event topic distribution. The constructed event time sequence relation graph expresses the logical relations between events intuitively, and the improved topological sorting algorithm outputs more intuitive event clues.

Claims (8)

1. A news-report-based emergency clue extraction method is characterized by comprising the following steps:
S1, acquiring a news data set, and preprocessing each news item in the news data set by using a natural language processing method;
S2, taking a sentence as a unit, extracting events according to the preprocessing result;
S3, obtaining distributed vector representations of the events according to the event extraction result, and constructing event semantic knowledge;
S4, taking event pairs as entries, constructing an event topic model by using the event semantic knowledge and the Pólya urn model;
S5, according to the event topic model, taking the top-K events with the highest topic probability as a topic event set;
S6, constructing an event time sequence relation graph according to the topic event set and the order of the events;
S7, computing an emergency clue from the event time sequence relation graph by using an improved topological sorting algorithm, thereby completing the extraction of the emergency clue;
the step S7 includes the following steps:
S701, outputting an event node sequence by using the improved topological sorting algorithm according to the event time sequence relation graph;
S702, judging whether the time sequence relation graph has event nodes that have not been output; if so, the remaining subgraph of the time sequence relation graph contains a loop, and step S703 is entered; otherwise, step S704 is entered;
S703, deleting all arcs in the remaining subgraph, scanning in sequence the arcs from the node events in the output event node sequence to each non-output node event in the remaining subgraph, selecting arcs according to the strength of the time sequence relation, outputting each non-output node event, and recording the current arc;
S704, forming an emergency clue from the output event node sequence and the recorded arcs, thereby completing the extraction of the emergency clue;
the step S701 includes the following steps:
S7011, constructing a priority queue according to the event time sequence relation graph, and enqueuing the node events whose in-degree is zero in the time sequence relation graph;
S7012, sequentially dequeuing the node events with zero in-degree from the priority queue, outputting the events, and deleting the arcs that have the output events as tails;
S7013, judging whether a new node event with zero in-degree exists; if so, enqueuing the node event, recording the currently deleted arc, and returning to step S7012; otherwise, entering step S702.
2. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein the preprocessing in step S1 includes part-of-speech tagging, dependency parsing and coreference resolution.
3. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S2 includes the steps of:
S201, taking a sentence as a unit, extracting all predicate relation pairs of the events according to the preprocessing result;
S202, judging whether predicate relation pairs share the same predicate; if so, merging the pairs with the same predicate into a triple event and entering step S3; otherwise, keeping the predicate relation pair as a binary event and entering step S3, thereby completing the extraction of the events.
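The merging rule of steps S201 and S202 can be illustrated with a short Python sketch. The input format (a list of `(role, argument, predicate)` tuples with roles `"subj"` and `"obj"`), the function name `merge_pairs`, and the first-wins handling of repeated roles are assumptions made for illustration; the `nil` placeholder follows the triples shown in the description, such as "(person, death, nil)".

```python
def merge_pairs(pairs):
    """Merge predicate relation pairs from one sentence into events.

    pairs: list of (role, argument, predicate), role is "subj" or "obj".
    Returns (subject, predicate, object) tuples; a missing slot is "nil",
    so a binary event keeps one "nil" and a triple event has none.
    """
    by_predicate = {}
    for role, argument, predicate in pairs:
        slots = by_predicate.setdefault(predicate, {"subj": None, "obj": None})
        if slots[role] is None:  # assumption: the first argument per role wins
            slots[role] = argument
    return [
        (slots["subj"] or "nil", predicate, slots["obj"] or "nil")
        for predicate, slots in by_predicate.items()
    ]


events = merge_pairs([
    ("subj", "Wilms", "landing"),
    ("obj", "farm house", "damage"),
    ("subj", "typhoon", "destroy"),
    ("obj", "road", "destroy"),
])
```

Here the two pairs that share the hypothetical predicate "destroy" are merged into one triple event, while the other two pairs stay binary events.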
4. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S3 includes the steps of:
S301, obtaining word vector representations on the news corpus by using the Word2Vec algorithm according to the event extraction result;
S302, computing the distributed vector representation of each event from the word vector representations by using a compositional semantics algorithm;
S303, computing the similarity between events from the event distributed representations by using a Euclidean distance algorithm;
S304, constructing the event semantic knowledge according to the similarity between the events.
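Steps S303 and S304 can be sketched as follows. The claim states only that similarity is computed with a Euclidean distance algorithm, so the mapping from distance to similarity used here, 1 / (1 + d), and the representation of event semantic knowledge as a dictionary of pairwise similarities are assumptions; the function names and the toy vectors are hypothetical.

```python
import math


def euclidean_similarity(u, v):
    """Map Euclidean distance to a similarity in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + math.dist(u, v))


def build_semantic_knowledge(event_vectors):
    """Return {(event_a, event_b): similarity} for every unordered event pair.

    event_vectors: dict mapping an event label to its distributed vector.
    """
    events = list(event_vectors)
    knowledge = {}
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            knowledge[(a, b)] = euclidean_similarity(
                event_vectors[a], event_vectors[b]
            )
    return knowledge


knowledge = build_semantic_knowledge({
    "e1": [0.0, 0.0],
    "e2": [3.0, 4.0],
    "e3": [0.0, 0.0],
})
```

Any monotonically decreasing function of the distance would serve the same purpose; the choice above simply keeps the similarity bounded so that a threshold can be applied to it later.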
5. The method for extracting emergency clues based on news reports as claimed in claim 4, wherein the event distributed representation in step S302 includes any one of the following cases:
in the first case:
if the event is a triple event, the event distributed representation (Figure FDA0003675818440000031) is calculated from the predicate vector of the event and the Kronecker product of the subject vector and the object vector of the event, and its expression is as follows:
Figure FDA0003675818440000032
in the second case:
if the event is a binary event, the event distributed representation (Figure FDA0003675818440000033 or Figure FDA0003675818440000034) is calculated from the predicate vector of the event and the vector of the subject or the object of the event, and the expressions are as follows:
Figure FDA0003675818440000035
Figure FDA0003675818440000036
wherein Figure FDA0003675818440000037 represents the Kronecker product operation, the other operator in the expressions represents the dot product operation, Figure FDA0003675818440000038 represents the predicate vector of the event, Figure FDA0003675818440000039 represents the subject vector of the event, and Figure FDA00036758184400000310 represents the object vector of the event.
6. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S4 includes the steps of:
S401, taking event pairs as entries, setting the multinomial distribution parameter for generating event topics (Figure FDA00036758184400000311), wherein Figure FDA00036758184400000312 represents the distribution of the events under topic k, and $\mathrm{Dir}(\beta)$ represents the Dirichlet distribution with hyperparameter $\beta$;
S402, setting the multinomial distribution parameter $\theta_m \sim \mathrm{Dir}(\alpha)$ for generating document topics, wherein $\theta_m$ represents the topic distribution of document m, and $\mathrm{Dir}(\alpha)$ represents the Dirichlet distribution with hyperparameter $\alpha$;
S403, for each news document m, sampling each event co-occurrence pair $b(e_i, e_j)$ separately to generate a topic $z_b \sim \mathrm{Mult}(\theta_m)$, sampling to generate the event (Figure FDA0003675818440000041) and the event (Figure FDA0003675818440000042), and, during sampling, using the Pólya urn model and the event semantic knowledge to introduce the event similarity, wherein the event similarity is adjusted by a threshold as follows:
Figure FDA0003675818440000043
where b represents any event co-occurrence pair occurring in document m, $e_i$ represents event i, $e_j$ represents event j, $z_b$ represents the topic of the event co-occurrence pair b in the current sampling process, $\mathrm{Mult}(\theta_m)$ represents the multinomial distribution with parameter $\theta_m$, Figure FDA0003675818440000044 represents the multinomial distribution with parameter Figure FDA0003675818440000045, Figure FDA0003675818440000046 represents the adjusted similarity between event $e_i$ and event $e_j$, $\sigma$ represents the set threshold, and $\mathrm{sim}(e_i, e_j)$ represents the similarity between event $e_i$ and event $e_j$ obtained from the event semantic knowledge;
S404, obtaining the event topic distribution (Figure FDA0003675818440000047) and the document topic distribution $\theta$ from the topic sampling and the event sampling, and constructing the event topic model from the event topic distribution (Figure FDA0003675818440000048) and the document topic distribution $\theta$.
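The generative process behind steps S401 to S403 can be sketched as a forward sampler. This is only an illustration under assumptions: the symbol `phi` for the per-topic event distribution is assumed because the claim gives it only as a formula image, symmetric Dirichlet priors are assumed, and the Pólya urn similarity adjustment and the inference procedure are omitted entirely. All function and parameter names are hypothetical.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalising gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_event_pairs(num_topics, num_events, pairs_per_doc,
                         alpha, beta, seed=0):
    """Forward-sample event co-occurrence pairs.

    pairs_per_doc: list giving the number of event pairs in each document.
    Returns (phi, corpus), where corpus[m] is a list of (z_b, e_i, e_j).
    """
    rng = random.Random(seed)
    # S401: one event distribution per topic, phi_k ~ Dir(beta) (assumed symbol).
    phi = [sample_dirichlet(beta, num_events, rng) for _ in range(num_topics)]
    corpus = []
    for n_pairs in pairs_per_doc:
        # S402: topic distribution of the document, theta_m ~ Dir(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(n_pairs):
            # S403: topic for the pair, z_b ~ Mult(theta_m), then both events
            # are drawn from the event distribution of that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            e_i, e_j = rng.choices(range(num_events), weights=phi[z], k=2)
            doc.append((z, e_i, e_j))
        corpus.append(doc)
    return phi, corpus


phi, corpus = generate_event_pairs(
    num_topics=3, num_events=5, pairs_per_doc=[4, 2], alpha=0.5, beta=0.5
)
```

Drawing both events of a pair from the same topic is what lets the model pool co-occurrence evidence across short news sentences, which is the sparsity argument made in the description.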
7. The method for extracting emergency clues based on news reports as claimed in claim 1, wherein said step S6 includes the steps of:
S601, taking each topic event in the topic event set as a node, and determining the time sequence relation of any event pair by using statistical rules;
S602, according to the order of the events, taking the event that occurs first as the arc tail and the event that occurs later as the arc head, and constructing the event time sequence relation graph.
8. The method for extracting emergency clues based on news reports as claimed in claim 7, wherein the step S601 of determining the time sequence relationship of any event pair by using statistical rules includes any one of the following cases:
in the first case:
counting the probability $p_1$ that the two topic events appear in the same document; if $p_1$ is the maximum, counting the position order $p_2$ of the two topic events in the same document, and if event $e_i$ precedes event $e_j$, the strength of the time sequence relation is: $p = p_1 \times p_2$;
in the second case:
counting the probability $p_3$ that the two topic events appear in different documents; if $p_3$ is the maximum, counting the reporting time order $p_4$ of the documents containing the two topic events, and if event $e_i$ precedes event $e_j$, the strength of the time sequence relation is: $p = p_3 \times p_4$.
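The first case of claim 8, together with the arc direction rule of step S602, can be sketched in Python. The claim does not define how the statistics are estimated, so the estimators below are assumptions: $p_1$ is taken as the share of documents mentioning either event that mention both, and $p_2$ as the share of those co-occurring documents in which $e_i$ first appears before $e_j$. The function names and the toy documents are hypothetical, and the second case (different documents, reporting time) is not covered.

```python
from itertools import combinations


def same_document_strength(docs, e_i, e_j):
    """Strength p = p1 * p2 that e_i precedes e_j (first case only).

    docs: list of documents, each a list of events in order of appearance.
    """
    either = [d for d in docs if e_i in d or e_j in d]
    both = [d for d in either if e_i in d and e_j in d]
    if not both:
        return 0.0
    p1 = len(both) / len(either)  # assumed estimator of co-occurrence
    before = sum(1 for d in both if d.index(e_i) < d.index(e_j))
    p2 = before / len(both)       # assumed estimator of position order
    return p1 * p2


def build_arcs(docs, events):
    """Return {(tail, head): strength}, earlier event as tail (S602)."""
    arcs = {}
    for a, b in combinations(events, 2):
        forward = same_document_strength(docs, a, b)
        backward = same_document_strength(docs, b, a)
        if forward > backward:
            arcs[(a, b)] = forward
        elif backward > forward:
            arcs[(b, a)] = backward
    return arcs


docs = [["a", "b"], ["a", "b", "c"], ["b", "a"], ["a"]]
arcs = build_arcs(docs, ["a", "b", "c"])
```

Keeping only the stronger direction for each pair is a design choice of this sketch; loops can still arise across three or more events, which is why the improved topological sorting of step S7 is needed afterwards.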

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910983942.9A CN110737819B (en) 2019-10-16 2019-10-16 Emergency clue extraction method based on news reports

Publications (2)

Publication Number Publication Date
CN110737819A (en) 2020-01-31
CN110737819B (en) 2022-09-16

Family

ID=69269147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910983942.9A Active CN110737819B (en) 2019-10-16 2019-10-16 Emergency clue extraction method based on news reports

Country Status (1)

Country Link
CN (1) CN110737819B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069383A (en) * 2020-08-31 2020-12-11 杭州叙简科技股份有限公司 News text event and time extraction and normalization system for event tracking
CN113312490B (en) * 2021-04-28 2023-04-18 乐山师范学院 Event knowledge graph construction method for emergency
CN114626339A (en) * 2022-03-10 2022-06-14 深圳市大数据研究院 Chinese clue generating method, system, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 Automatic extracting method and system of event evolving relationship based on news
CN106156299A (en) * 2016-06-29 2016-11-23 北京小米移动软件有限公司 The subject content recognition methods of text message and device
CN109145224A (en) * 2018-08-20 2019-01-04 电子科技大学 Social networks event-order serie relationship analysis method
CN109344239A (en) * 2018-09-20 2019-02-15 四川昆仑智汇数据科技有限公司 A kind of business process model querying method and inquiry system based on temporal aspect
CN110069636A (en) * 2019-05-05 2019-07-30 苏州大学 Merge the event-order serie relation recognition method of dependence and chapter rhetoric relationship

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《社交网络事件演化分析方法研究》;周磊;《万方数据》;20190916;第三、四章 *
《融入事件知识的主题表示方法》;孙锐;《计算机学报》;20170430;第3-11页 *

Also Published As

Publication number Publication date
CN110737819A (en) 2020-01-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant