CN106951530B

CN106951530B - Event type extraction method and device

Info

Publication number: CN106951530B
Application number: CN201710169761.3A
Authority: CN
Inventors: 洪宇; 杨雪蓉; 姚建民; 朱巧明
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2020-01-17
Anticipated expiration: 2037-03-21
Also published as: CN106951530A

Abstract

The application provides an event type extraction method and device, and the method comprises the following steps: extracting candidate corpus words from a preset corpus; determining the relevance between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology; for any reference trigger word, determining candidate corpus words with the relevance meeting preset requirements with the reference trigger word as target trigger words to obtain at least one target trigger word; determining characteristics of the target trigger words in the trigger word set; and clustering the target trigger words based on the characteristics of the target trigger words to obtain a clustered set which belongs to different event categories. The method and the device provide possibility for improving the accuracy of event extraction and increasing the application range of event extraction.

Description

Event type extraction method and device

Technical Field

The present application relates to the field of information processing technologies, and in particular, to an event type extraction method and apparatus.

Background

Event extraction is an important component of information extraction, with broad application prospects and great practical significance. Its purpose is to accurately and effectively extract event information of interest from large amounts of disordered, unstructured information. According to the task definition of event extraction, an event is an objective fact in which specific persons and objects interact at a specific time and place, and an event is composed of a trigger word and the elements that describe the event structure. Event extraction requires that structured information, including the event type, event elements and event role information, be automatically identified and extracted from unstructured source text containing event information.

Currently, existing event extraction directly uses the annotation results of Automatic Content Extraction (ACE), so research on event extraction is limited to the event types defined in ACE, that is, to limited-domain event extraction. Event types in the open domain, however, are far more numerous and varied, and the differences between event types are relatively small, which makes them hard to tell apart. If ACE is still adopted directly, event extraction cannot be carried out accurately and effectively.

Disclosure of Invention

In view of the above, the present application provides an event type extraction method and apparatus, so as to provide a possibility for improving the accuracy of event extraction and increasing the application range of event extraction.

In order to achieve the above purpose, the present application provides the following technical solutions:

an event type extraction method, comprising:

extracting a plurality of candidate corpus words from a preset corpus;

determining the relevance between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;

for any reference trigger word, determining candidate corpus words with the relevance meeting preset requirements with the reference trigger word as target trigger words to obtain at least one target trigger word corresponding to each reference trigger word;

respectively determining the characteristics of each target trigger word;

and based on the characteristics of the target trigger words, clustering all the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word.

Preferably, the extracting a plurality of candidate corpus words from the preset corpus includes:

determining to-be-determined corpus words contained in a plurality of corpus texts in the preset corpus;

and filtering out preset useless words contained in the to-be-determined corpus words to obtain the candidate corpus words, wherein the preset useless words comprise stop words and function words.

Preferably, the determining, based on the corpus, the relevance between the reference trigger word and the candidate corpus word in the preset trigger word set includes:

for each candidate corpus word, sequentially calculating the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus;

and for any pair of the reference trigger word and the candidate corpus word, summing the initial relevance of the reference trigger word and the candidate corpus word in each corpus text to obtain the relevance of the reference trigger word and the candidate corpus word in the corpus.

Preferably, the calculating an initial association between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:

for a corpus text, determining the initial relevance of the reference trigger word and the candidate corpus word in the corpus text according to the ratio of the first frequency of occurrence of the reference trigger word and the candidate corpus word in the same sentence in the corpus text to the minimum frequency of occurrence, wherein the minimum frequency of occurrence is the minimum value of the frequency of occurrence of the reference trigger word in the corpus text and the frequency of occurrence of the candidate corpus word in the corpus text.

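Preferably, the calculating an initial relevance between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:

determining a plurality of preset connecting words;

for a corpus text d_i, determining, from the corpus text, the first target sentences that contain both the reference trigger word and the candidate corpus word and in which the reference trigger word and the candidate corpus word are connected through a preset connecting word;

for each preset connecting word conj_i, determining the ratio of a second number of occurrences to the minimum number of occurrences as the correlation Con(conj_i) of the reference trigger word and the candidate corpus word with respect to the connecting word conj_i in the corpus text, wherein the second number of occurrences is the number of first target sentences in the corpus text that have the preset connecting word conj_i, and the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text;

calculating the initial relevance R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:

[formula not reproduced in the source text]

wherein i is a natural number from 1 to k, and k represents the total number of preset connecting words contained in all the first target sentences of the corpus text d_i.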

Preferably, the calculating an initial relevance between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:

determining a plurality of preset relationship types;

in any corpus text d_i, for any one of the relationship types relj_i, determining the ratio of a third number of occurrences to the minimum number of occurrences as the correlation Rel(relj_i) of the reference trigger word and the candidate corpus word with respect to the relationship type relj_i in the corpus text d_i, wherein the third number of occurrences is the number of times the reference trigger word and the candidate corpus word appear together in a second target sentence, the second target sentence is a sentence that has a designated connecting word corresponding to the relationship type relj_i and in which the reference trigger word and the candidate corpus word are connected through the designated connecting word, and the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text d_i and the number of occurrences of the candidate corpus word in the corpus text d_i;

calculating the initial relevance R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:

[formula not reproduced in the source text]

wherein i is a natural number from 1 to k, and k represents the total number of preset relationship types for the corpus text d_i.

Preferably, the determining the characteristics of each target trigger word includes any one or more of the following:

acquiring attribute characteristics of the target trigger words;

acquiring relevant words of the target trigger words, wherein the relevant words comprise synonyms, antonyms and related words of the target trigger words;

searching in the corpus text to obtain a target corpus text containing the target trigger word, positioning a feature word meeting a preset position relation with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word;

and identifying the target trigger word and the frame type of the target trigger word from sentences in the corpus text of the corpus based on a frame network FrameNet tool.

Preferably, after the obtaining of the clustered plurality of cluster sets belonging to different event categories, the method further includes:

for any cluster set, determining, according to the term frequency-inverse document frequency (TF-IDF) algorithm, at least one target trigger word in the cluster set that is suitable for use as a label of the cluster set;

and taking the at least one target trigger word as a label of the cluster set, and labeling the cluster set.

In another aspect, the present application further provides an event type extraction device, including:

the word screening unit is used for extracting a plurality of candidate corpus words from a preset corpus;

the association determining unit is used for determining the association between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;

the word expansion unit is used for determining candidate corpus words, the relevance of which with the reference trigger words meets preset requirements, as target trigger words for any reference trigger word to obtain at least one target trigger word corresponding to each reference trigger word;

the characteristic determining unit is used for respectively determining the characteristics of each target trigger word;

and the type determining unit is used for clustering all the target trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word.

Preferably, the association determining unit includes:

the first association calculation unit is used for sequentially calculating, for each candidate corpus word, the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus;

and the second correlation calculation unit is used for summing the initial correlations of the reference trigger word and the candidate corpus words in each corpus text to obtain the correlation of the reference trigger word and the candidate corpus words in the corpus for any pair of the reference trigger word and the candidate corpus words.

According to the above technical solution, the target trigger words are obtained by expanding the reference trigger words produced by the existing automatic content extraction technology, so that the obtained trigger words cover a wider range, which helps determine the core words that trigger events in event extraction.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on the provided drawings without creative efforts.

FIG. 1 is a flow chart diagram illustrating an embodiment of an event type extraction method according to the present application;

FIG. 2 is a flow chart illustrating a further embodiment of an event type extraction method of the present application;

FIG. 3 is a schematic structural diagram illustrating an example of an event type extraction apparatus according to the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

To facilitate understanding of the event extraction process, some terms involved in the event extraction will be briefly described as follows:

Entity (Entity): an object or set of objects belonging to a certain semantic class.

Entity mention (Entity mention): a phrase (typically a noun phrase) that refers to an entity.

Event trigger word (Event trigger): the core word that triggers an event (triggers in ACE are mainly verbs or nouns).

Event arguments (Event arguments): the participants of the event, which are the core parts that make up the event.

Argument roles (Argument roles): the relationship of an event participant to the event.

Event mention (Event mention): a phrase or sentence that includes the event trigger word and the event participants.

An event type extraction method of the present application will be described below.

Referring to fig. 1, which shows a schematic flow chart of an embodiment of an event type extraction method according to the present application, the method of the embodiment may include:

101, extracting candidate corpus words from a preset corpus.

For example, the corpus may be a corpus built on the basis of TDT (Topic Detection and Tracking) technology, and the corpus includes a plurality of corpus texts. The corpus texts may be news reports: TDT is oriented to news reports in multilingual text and speech form, and mainly performs tasks such as automatically identifying the boundaries of event reports, locking onto and collecting breaking news topics, tracking topic development, and cross-language detection and tracking. News texts obtained with TDT technology describe a large number of events.

Candidate corpus words extracted from a preset corpus can be regarded as candidate trigger words, so that words which can be used as expanded trigger words are selected from the candidate trigger words. Specifically, word extraction may be performed on a corpus text in a preset corpus to obtain candidate trigger words.

102, determining, based on the corpus, the relevance between the reference trigger words in a preset trigger word set and the candidate corpus words.

Wherein the reference trigger is determined by an automatic content extraction ACE technique. The reference trigger word may be understood as a seed trigger word for expanding the trigger word, so that the expansion of the trigger word is performed in combination with the candidate corpus word on the basis of the reference trigger word.

103, for any reference trigger word, determining the candidate corpus words with the relevance meeting the preset requirement with the reference trigger word as target trigger words, and obtaining at least one target trigger word corresponding to each reference trigger word.

Different from the prior art, the trigger words used for event extraction in the present application are not trigger words obtained by directly adopting the automatic content extraction technology, but trigger words obtained by the automatic content extraction technology are used as a reference to expand the trigger words.

104, respectively determining the characteristics of each target trigger word.

The characteristics of the target trigger word are used for representing the self attribute of the target trigger word, the relevance of the target trigger word and the context in the corpus text and the like, and the characteristics of the target trigger word are the basis for determining the event category.

105, clustering all the target trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories.

Each cluster set corresponds to an event category, and each cluster set comprises at least one target trigger word.

In the present application, the target trigger words are obtained by expanding the reference trigger words produced by the existing automatic content extraction technology, so that the obtained trigger words cover a wider range, which facilitates determining the core words that trigger events in event extraction.

Referring to fig. 2, which shows a schematic flow chart of another embodiment of the event type extraction method according to the present application, the method of the embodiment may include:

201, determining to-be-determined corpus words from a plurality of corpus texts in a preset corpus.

Step 201 amounts to performing word extraction on the corpus texts to determine the corpus words they contain. To distinguish these words from the candidate corpus words later used to expand the trigger words, the initial corpus words extracted from the corpus texts are called to-be-determined corpus words.

202, filtering out preset useless words contained in the to-be-determined corpus words to obtain a candidate corpus word set containing a plurality of candidate corpus words.

The preset useless words can be set as needed; for example, they can include stop words and function words. Function words cannot serve as sentence components on their own and comprise all words other than content words, whereas content words can serve as sentence components alone, that is, they have both lexical and grammatical meaning.

Of course, besides filtering the preset useless words out of the to-be-determined corpus words, the method can also apply preprocessing such as lemmatization to the remaining content words, and use the to-be-determined corpus words that remain after preprocessing as candidate corpus words, thereby obtaining the candidate corpus word set.
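Steps 201 and 202 can be illustrated with a minimal Python sketch. The word lists, the regular-expression tokenizer and the function name `extract_candidate_words` are assumptions made for illustration only and are not part of the application; lemmatization is omitted.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would load fuller lexicons.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "was", "were", "is"}
FUNCTION_WORDS = {"and", "but", "because", "or", "by", "with", "after"}


def extract_candidate_words(corpus_texts, useless_words=STOP_WORDS | FUNCTION_WORDS):
    """Return the set of candidate corpus words found across all corpus texts."""
    candidates = set()
    for text in corpus_texts:
        # Step 201: to-be-determined corpus words are all lower-cased alphabetic tokens.
        pending = re.findall(r"[a-z]+", text.lower())
        # Step 202: drop the preset useless words; the rest are candidate corpus words.
        candidates.update(w for w in pending if w not in useless_words)
    return candidates


corpus = ["The rebels attacked the town.", "Troops were injured in the attack."]
candidates = extract_candidate_words(corpus)
```

With this toy corpus, `candidates` holds the six content words of the two sentences; "attacked" and "attack" remain distinct because no lemmatization is applied.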

203, acquiring a preset trigger word set.

The trigger word set comprises a plurality of reference trigger words determined by an automatic content extraction technology.

The reference trigger word can be understood as a trigger word determined according to the prior art, and the application needs to expand the trigger word on the basis of the existing trigger word.

204, for each candidate corpus word, sequentially calculating the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text.

The relevance reflects whether two words are related and how strongly. It may include the relevance of two words within a single corpus text, in which case it reflects only how strongly the two words are related in that text; for ease of distinction, the relevance of a reference trigger word and a candidate corpus word within one corpus text is referred to as the initial relevance. It will be appreciated that, since there are multiple corpus texts, a reference trigger word and a candidate corpus word have a separate initial relevance for each corpus text.

The relevance may further include a comprehensive relevance of all documents in the corpus, where the comprehensive relevance may reflect a degree of relevance of two words in all text documents, and in this embodiment, the comprehensive relevance of the reference trigger word and the candidate corpus word in all documents in the corpus is referred to as the relevance in the corpus.

There are various ways to calculate the initial association between the candidate corpus word and the reference trigger word in a corpus text. Such as:

in one implementation of computing initial relevance:

the initial relevance of the reference trigger word and the candidate corpus word in a corpus text may be determined as the ratio of a first number of occurrences to a minimum number of occurrences. The first number of occurrences is the number of times the reference trigger word and the candidate corpus word appear in the same sentence of the corpus text. The minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text. That is, the initial relevance R_{d_i}(seed, c) may be expressed as formula one:

$$R_{d_i}(seed, c) = \frac{\mathrm{freq}(seed, c)}{\min\big(\mathrm{freq}(seed),\ \mathrm{freq}(c)\big)}$$

wherein the numerator freq(seed, c) is the number of times the reference trigger word seed and the candidate corpus word c co-occur in one sentence of the corpus text d_i, and the denominator is the minimum of the number of occurrences freq(seed) of the reference trigger word seed and the number of occurrences freq(c) of the candidate corpus word c in the corpus text d_i.

In this implementation, words appearing in the same sentence are regarded as related words: the higher the ratio of the number of times two words appear in the same sentence to the minimum number of occurrences, the stronger the correlation between the two words.
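The sentence co-occurrence ratio described above can be sketched in Python as follows. The sentence splitter, the tokenizer and the function names are illustrative assumptions, not details given in the application.

```python
import re


def split_sentences(text):
    """Split a corpus text into lists of lower-cased word tokens, one list per sentence."""
    sentences = re.split(r"[.!?]+", text)
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences if s.strip()]


def initial_relevance(text, seed, cand):
    """Sentence co-occurrence count divided by the smaller of the two word frequencies."""
    sentences = split_sentences(text)
    # First number of occurrences: sentences that contain both words.
    co_occurrences = sum(1 for s in sentences if seed in s and cand in s)
    seed_freq = sum(s.count(seed) for s in sentences)
    cand_freq = sum(s.count(cand) for s in sentences)
    lowest = min(seed_freq, cand_freq)
    # Assumption: the relevance is 0 when either word is absent from the text.
    return co_occurrences / lowest if lowest else 0.0


doc = "Rebels attack the town. Troops repel the attack. Troops leave."
score = initial_relevance(doc, "attack", "troops")  # 1 shared sentence / min(2, 2)
```

Here "attack" and "troops" each occur twice but share only one sentence, so the ratio is 0.5.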

In yet another implementation of computing initial relevance:

determining, in the corpus text, all first target sentences, that is, sentences in which the reference trigger word and the candidate corpus word are both present and are connected through a preset connecting word; then, for each connecting word conj_i that connects the reference trigger word and the candidate corpus word, calculating the correlation of the reference trigger word and the candidate corpus word in the corpus text with respect to the connecting word conj_i. This correlation is the ratio of a second number of occurrences to the minimum number of occurrences, where the second number of occurrences is the number of times the reference trigger word and the candidate corpus word appear together in a first target sentence of the corpus text connected by conj_i, and the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text.

That is, in the corpus text d_i, the correlation Con(conj_i) of the reference trigger word and the candidate corpus word with respect to the preset connecting word conj_i can be expressed as formula two:

$$\mathrm{Con}(conj_i) = \frac{\mathrm{freq}(seed, c, conj_i)}{\min\big(\mathrm{freq}(seed),\ \mathrm{freq}(c)\big)}$$

wherein, in formula two, the numerator of the fraction is the number of sentences in the corpus text d_i that contain both the reference trigger word seed and the candidate corpus word c with the two connected through the connecting word conj_i; this number can also be regarded as the number of times the reference trigger word and the candidate corpus word, connected through the connecting word, appear together in one first target sentence. The denominator of the fraction is the minimum of the number of occurrences of the reference trigger word seed in the corpus text d_i and the number of occurrences of the candidate corpus word c in the corpus text d_i.

Correspondingly, the initial relevance R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i is:

[formula not reproduced in the source text]

wherein i is a natural number from 1 to k, and k represents the total number of preset connecting words contained in all the first target sentences of the corpus text d_i.

The preset connecting words may be set as needed; optionally, the 182 connectives defined in the PDTB (Penn Discourse Treebank) may be used.

In this implementation, the variety of connecting words that link two words affects their relevance: the more kinds of connecting words link the two words, the more disordered their relationship is considered to be, and the lower their relevance; the fewer kinds of connecting words link the two words, the more stable their relationship is considered to be, and the higher their relevance.
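As a hedged illustration of the per-connective correlation Con(conj_i), the Python sketch below counts, for one corpus text, the sentences in which a connecting word lies between the two words. Treating "connected through a connecting word" as "the connecting word appears between the two words" is an assumption, as are the toy connective list and the function name. The step that combines the Con(conj_i) values into the initial relevance is not shown, because that formula is not available in the text.

```python
import re
from collections import Counter

# Hypothetical sample of single-word connectives; the application refers to the PDTB list.
CONNECTIVES = {"because", "but", "after", "and"}


def connective_correlations(text, seed, cand, connectives=CONNECTIVES):
    """Return {connective: Con value} for one corpus text."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in re.split(r"[.!?]+", text)]
    seed_freq = sum(s.count(seed) for s in sentences)
    cand_freq = sum(s.count(cand) for s in sentences)
    lowest = min(seed_freq, cand_freq)
    if lowest == 0:
        return {}
    linked = Counter()
    for s in sentences:
        if seed not in s or cand not in s:
            continue
        lo, hi = sorted((s.index(seed), s.index(cand)))
        # Count each connective found between the two words at most once per sentence.
        for conj in set(s[lo + 1:hi]) & connectives:
            linked[conj] += 1
    return {conj: n / lowest for conj, n in linked.items()}


doc = ("Civilians fled because rebels attacked. "
       "Rebels attacked but civilians stayed. Civilians fled.")
con = connective_correlations(doc, "fled", "attacked")
```

In this toy text only the first sentence contains both words, linked by "because", and each word occurs twice, so the result is `{"because": 0.5}`.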

In yet another implementation of computing initial relevance:

for each preset relationship type relj_i, a third number of occurrences is determined, namely the number of times the reference trigger word and the candidate corpus word appear together in a second target sentence, and the correlation of the reference trigger word and the candidate corpus word in the corpus text with respect to the relationship type relj_i is calculated. The second target sentence is a sentence that has a designated connecting word corresponding to the relationship type and in which the reference trigger word and the candidate corpus word are connected through the designated connecting word. The correlation of the reference trigger word and the candidate corpus word in the corpus text with respect to the relationship type relj_i is the ratio of the third number of occurrences corresponding to the relationship type to the minimum number of occurrences, where the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text.

That is, in the corpus text d_i, the correlation Rel(relj_i) of the reference trigger word seed and the candidate corpus word c with respect to the preset relationship type relj_i can be expressed as follows:

$$\mathrm{Rel}(relj_i) = \frac{\mathrm{freq}(seed, c, relj_i)}{\min\big(\mathrm{freq}(seed),\ \mathrm{freq}(c)\big)}$$

wherein the numerator is the third number of occurrences, that is, the number of second target sentences in the corpus text d_i that contain both the reference trigger word seed and the candidate corpus word c with the two connected through a designated connecting word corresponding to the relationship type; it can also be regarded as the number of times the reference trigger word and the candidate corpus word, connected by a connecting word corresponding to the relationship type, appear together in one target sentence. The denominator of the fraction is the minimum of the number of occurrences of the reference trigger word seed in the corpus text d_i and the number of occurrences of the candidate corpus word c in the corpus text d_i.

After the correlation Rel(relj_i) of the reference trigger word seed and the candidate corpus word c with respect to each preset relationship type relj_i is obtained, the initial relevance R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i may be calculated as:

[formula not reproduced in the source text]

wherein i is a natural number from 1 to k, and k represents the total number of preset relationship types for the corpus text d_i. If there are four preset relationship types, the value of k is 4.

Because the PDTB defines 182 connectives, instances of any individual connective tend to be sparse; this implementation therefore calculates the correlation over discourse relationship types instead. Optionally, in this embodiment of the present application, the discourse relationship types may include four preset relationship types: comparison (Comparison), causality (Contingency), expansion (Expansion), and temporal (Temporal).

Some connectives in the PDTB point to one particular relationship type; for example, for the connective "because", the argument preceding it and the argument following it stand in a "Cause" relationship. Other connectives may point to several relationship types, such as the connective "and". Therefore, the invention selects only specific connectives from the PDTB, a specific connective being one that points to a certain discourse relationship type with high probability. The invention counts, from the distribution of connectives in the PDTB, the probability that each connective points to each relationship type. For example, the probability that the connective "optionally" points to the "Expansion" relationship type is 100%. In the present application, only connectives that point to a certain relationship type with a probability greater than 80% are selected as the designated connecting words of that relationship type.
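The 80% selection rule can be sketched as follows, assuming the annotated corpus is available as a list of (connective, relationship type) pairs. The toy annotations below are invented for illustration and are not real PDTB statistics; the function name is likewise hypothetical.

```python
from collections import Counter, defaultdict


def designated_connectives(annotations, threshold=0.8):
    """Map each relationship type to the connectives that signal it with probability > threshold."""
    by_connective = defaultdict(Counter)
    for connective, relation in annotations:
        by_connective[connective][relation] += 1
    designated = defaultdict(set)
    for connective, counts in by_connective.items():
        total = sum(counts.values())
        # With a threshold above 0.5, only the most frequent relation can qualify.
        relation, top = counts.most_common(1)[0]
        if top / total > threshold:
            designated[relation].add(connective)
    return dict(designated)


# Invented toy annotations: (connective, relationship type).
toy = ([("then", "Temporal")] * 9 + [("then", "Expansion")]
       + [("and", "Expansion")] * 5 + [("and", "Temporal")] * 5
       + [("also", "Expansion")] * 4)
selected = designated_connectives(toy)
```

In the toy data "then" is Temporal in 90% of its instances and "also" is always Expansion, so both are kept, while "and" is split evenly between two types and is discarded.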

Accordingly, in this implementation, the "constraint range" within which two words are considered correlated is that the two words are in the same sentence and are connected by a designated connecting word. The relevance of the words seed and c can then be calculated separately for each of the four relationship types.
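A minimal Python sketch of the per-relationship-type correlation Rel follows. As an assumption, two words count as connected by a designated connecting word when that word appears between them in the sentence; the designated connecting words and the function name are toy values chosen for illustration. Combining the per-type values into the initial relevance is left out because that formula is not available in the text.

```python
import re
from collections import Counter

# Hypothetical designated connecting words per relationship type (toy values).
DESIGNATED = {"Temporal": {"after", "before"}, "Expansion": {"also", "moreover"}}


def relation_correlations(text, seed, cand, designated=DESIGNATED):
    """Return {relationship type: Rel value} for one corpus text."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in re.split(r"[.!?]+", text)]
    lowest = min(sum(s.count(seed) for s in sentences),
                 sum(s.count(cand) for s in sentences))
    if lowest == 0:
        return {}
    hits = Counter()
    for s in sentences:
        if seed not in s or cand not in s:
            continue
        lo, hi = sorted((s.index(seed), s.index(cand)))
        between = set(s[lo + 1:hi])
        # A sentence counts once per relationship type whose connecting word links the pair.
        for relation, connectives in designated.items():
            if between & connectives:
                hits[relation] += 1
    return {relation: n / lowest for relation, n in hits.items()}


doc = ("Troops withdrew after rebels attacked. "
       "Rebels attacked before troops withdrew. Troops withdrew.")
rel = relation_correlations(doc, "withdrew", "attacked")
```

Both co-occurrence sentences use a Temporal connecting word and the rarer word "attacked" occurs twice, so the result is `{"Temporal": 1.0}`.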

Of course, in practical applications, there may be other ways to calculate the initial association between the reference trigger word and the candidate corpus word in the corpus text, and the method is not limited herein.

205, for any pair of a reference trigger word and a candidate corpus word, calculating the relevance of the reference trigger word and the candidate corpus word in the corpus from their initial relevance in each corpus text.

For any pair of a reference trigger word and a candidate corpus word, the initial relevances of the pair in the individual corpus texts of the corpus are added together, which gives the relevance of the reference trigger word and the candidate corpus word in the corpus, namely their final relevance.

That is, the relevance R(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus is:

$$R(seed, c) = \sum_{i=1}^{n} R_{d_i}(seed, c)$$

wherein n represents the total number of corpus texts that contain a sentence in which the reference trigger word seed and the candidate corpus word c both appear, i is a natural number from 1 to n, and d_i represents a corpus text in which the reference trigger word seed and the candidate corpus word c appear together in one sentence.
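The summation over corpus texts can be sketched in Python as below, using the sentence co-occurrence ratio as the per-text initial relevance. The tokenization and function names are illustrative assumptions. Texts in which the two words never share a sentence contribute 0, so summing over all texts gives the same result as summing over the n co-occurring texts only.

```python
import re


def initial_relevance(text, seed, cand):
    """Per-text relevance: shared sentences divided by the smaller word frequency."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in re.split(r"[.!?]+", text)]
    together = sum(1 for s in sentences if seed in s and cand in s)
    lowest = min(sum(s.count(seed) for s in sentences),
                 sum(s.count(cand) for s in sentences))
    return together / lowest if lowest else 0.0


def corpus_relevance(corpus_texts, seed, cand):
    """Sum the per-text initial relevance over every corpus text."""
    return sum(initial_relevance(text, seed, cand) for text in corpus_texts)


corpus = [
    "Rebels attack the town. Troops repel the attack. Troops leave.",  # contributes 0.5
    "Troops attack at dawn.",                                          # contributes 1.0
    "Markets rallied today.",                                          # contributes 0.0
]
total = corpus_relevance(corpus, "attack", "troops")
```

For this toy corpus the corpus-level relevance of "attack" and "troops" is 0.5 + 1.0 + 0.0 = 1.5.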

206, for any reference trigger word, determining the candidate corpus words with the relevance meeting the preset requirement with the reference trigger word as target trigger words, and obtaining at least one target trigger word expanded from the reference trigger word.

One or more target trigger words may be expanded for each reference trigger word.

The preset requirement to be met between a target trigger word and a reference trigger word can be set as needed. For example, the preset requirement may be that the value of the relevance is greater than a preset threshold. Optionally, for each reference trigger word, the candidate corpus words may be sorted in descending order of their relevance to the reference trigger word, and a specified number of top-ranked candidate corpus words are determined as target trigger words.
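Both selection rules (threshold and top-N) can be sketched in a few lines of Python. The relevance scores below are invented toy values, and the function names are hypothetical.

```python
def select_by_threshold(relevance, threshold):
    """Keep candidates whose relevance to the reference trigger word exceeds the threshold."""
    return {seed: sorted(c for c, score in scores.items() if score > threshold)
            for seed, scores in relevance.items()}


def select_top_n(relevance, n):
    """Keep the n candidates with the highest relevance to each reference trigger word."""
    return {seed: sorted(scores, key=scores.get, reverse=True)[:n]
            for seed, scores in relevance.items()}


# Invented scores: reference trigger word -> {candidate corpus word: relevance}.
relevance = {
    "attack": {"bombing": 2.5, "raid": 1.8, "town": 0.4},
    "marry": {"wedding": 3.0, "bride": 0.9},
}
above = select_by_threshold(relevance, 1.0)
top_two = select_top_n(relevance, 2)
```

Note that a threshold can leave a reference trigger word with no target trigger words at all, whereas the top-N rule always yields up to N words; which behavior is preferable depends on how the threshold is chosen.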

207, the characteristics of each target trigger word are obtained respectively.

Wherein the characteristics of the target trigger word are used for describing the basic characteristics of the target trigger word.

For example, the characteristics of the target trigger word may include any one or more of the following:

attribute characteristics of the target trigger word;

relevant words of the target trigger word, such as its synonyms, antonyms and related words;

contextual characteristics of the target trigger word;

the frame type to which the target trigger word belongs.

The attribute features are features of the target trigger word itself, and can be obtained by performing part-of-speech tagging and named entity recognition on the target trigger word.

The relevant words of the target trigger words can be obtained by calling the appointed word bank through a preset interface.

The context feature of the target trigger word can be obtained by searching in the corpus text of the corpus to obtain a target corpus text containing the target trigger word, locating a feature word meeting a preset position relationship with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word. For example, contextual characteristics may include the following:

the first three words and the last three words of the target trigger word (excluding the stop word);

searching the corpus text, according to an N-Gram model, for sequences whose distance to the target trigger word is not more than three words, and extracting sequences of two or three words;

and extracting, from the corpus text, the word immediately preceding the target trigger word and the word immediately following the target trigger word.
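Two of the context features listed above can be sketched as follows. The stop-word list, the function names, and the sample sentence are hypothetical, and the sketch assumes that "excluding the stop word" means stop words are dropped before the three-word window is taken.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and"}  # illustrative only


def window_features(tokens, index, size=3, stop_words=STOP_WORDS):
    """Return up to `size` non-stop-words before and after tokens[index]."""
    before = [t for t in tokens[:index] if t not in stop_words][-size:]
    after = [t for t in tokens[index + 1:] if t not in stop_words][:size]
    return before, after


def adjacent_words(tokens, index):
    """Return the words immediately before and after tokens[index],
    or None where the trigger word is at a sentence boundary."""
    prev_word = tokens[index - 1] if index > 0 else None
    next_word = tokens[index + 1] if index + 1 < len(tokens) else None
    return prev_word, next_word


tokens = "the rebels attacked the village in the north yesterday".split()
i = tokens.index("attacked")
before, after = window_features(tokens, i)
prev_word, next_word = adjacent_words(tokens, i)
```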

The frame type to which the target trigger word belongs is obtained by using a tool based on FrameNet to identify the frames (Frame) in each sentence of the corpus text, including the frame of the target trigger word, so that the frame type of the target trigger word is obtained when the target trigger word has a frame. The frame type of the word before the target trigger word and the frame type of the word after the target trigger word may also be extracted. FrameNet is a corpus-based semantic network that applies the theory of frame semantics, is organized around frames, and links the meanings of lexical items to one another.

208, clustering all the obtained target trigger words based on the characteristics of the target trigger words to obtain a plurality of cluster sets belonging to different event categories.

Each cluster set comprises a plurality of target trigger words.

The different cluster sets correspond to different event categories, and one cluster set of an event category comprises a plurality of target trigger words belonging to the event category.

In the embodiment of the present application, the target trigger words may be clustered according to a preset clustering algorithm. For example, clustering may be performed according to the Affinity Propagation clustering algorithm, which may be referred to as the AP clustering algorithm for short. This algorithm treats all data points as potential cluster centers and does not require the number of clusters to be specified. In the clustering process, the feature vectors constructed from the features of the target trigger words obtained in the previous step are used as input data, so that trigger words of the same type are grouped into one class, and the target trigger words within one class of the clustering result share the same type or characteristics. Each class can be regarded as a trigger word set.
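The following is a minimal pure-Python sketch of affinity propagation, given only to make the message-passing idea concrete. It is not the implementation used in the application: the negative squared Euclidean similarity, the median preference, the damping factor, the fixed iteration count, and the toy feature vectors are all assumptions. In practice a library implementation such as scikit-learn's `AffinityPropagation` would normally be used.

```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=200):
    """Cluster `points` (equal-length numeric tuples, at least two of them).

    Returns one exemplar index per point; points that share an exemplar
    form one cluster. The number of clusters is not specified in advance.
    """
    n = len(points)
    # Similarity s(i, k): negative squared Euclidean distance (assumed).
    S = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    # Preference s(k, k): median of the off-diagonal similarities (assumed).
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = pref
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):  # fixed count; no convergence check
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k) over i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]
            for i in range(n):
                new = support if i == k else min(0.0, R[k][k] + support - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Each point chooses the exemplar k that maximizes a(i, k) + r(i, k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Hypothetical one-hot style feature vectors for four target trigger words.
vectors = [(1, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 1)]
labels = affinity_propagation(vectors)
```

Target trigger words whose entries in `labels` are equal belong to the same cluster set. The nested loops are cubic in the number of points per iteration, so this form is suitable only for small illustrations.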

Because the features of the target trigger words determined in the present application differ markedly from those determined in the prior art, the event types obtained by clustering all the target trigger words with the clustering algorithm are not limited to, and differ from, the event types defined in the ACE corpus.

209, for each cluster set, selecting at least one target trigger word from the cluster set as a label of the cluster set, and labeling the cluster set with the obtained label.

Optionally, the target trigger words suitable for being used as the labels of the cluster set in the cluster set are determined according to a TF-IDF algorithm, and specifically, for each event type generated by the cluster algorithm, a first specified number of target trigger words with the largest TF-IDF value are selected from the cluster set of the event type as the labels of the event type.

TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The main idea of TF-IDF is that if a word or phrase appears with a high frequency (TF) in one document and rarely appears in other documents, it is considered to have good category discrimination ability and to be suitable for use as a label of a category. Here, TF is the term frequency, which represents how frequently a word or phrase appears in a document; IDF is the inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the word or phrase and taking the logarithm of the quotient, and it measures the general importance of the word or phrase.

Assume that cluster sets of K event categories are obtained through the clustering algorithm, and that for each event category the target trigger words that best represent the category are to be determined; the TF-IDF calculation for each event category is then performed only within that event category. In the present invention, each event category includes several corpus texts, and for each target trigger word in an event category, the TF-IDF of the target trigger word in each corpus text (TF is the frequency of occurrence of the target trigger word in the corpus text d_i, and IDF is the reciprocal of the number of corpus texts containing the target trigger word) is defined as follows:

wherein i represents a target trigger word, and n_ij represents the number of times the target trigger word i appears in the corpus text j under the event category;

the summation term represents the sum of the numbers of occurrences of all target trigger words in the corpus text j under the event category; m represents the number of all target trigger words of the event category; n represents the total number of corpus texts corresponding to the event category (i.e., the total number of all corpus texts containing any one target trigger word under the event category); n_j represents the number of corpus texts under the event category that contain the target trigger word, and the added 1 serves as smoothing.

Therefore, the K event categories generated by the AP clustering algorithm are denoted C1, C2, …, Ck. For all documents d in each category Ci (i = 1, 2, …, k), the TF-IDF value of each target trigger word in each document is calculated; and for each event category, the specified number (for example, 100) of target trigger words with the largest TF-IDF values in the event category are taken as the label of that event category.
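The labeling step for a single event category can be sketched as follows. Several details are assumptions rather than statements of the application: the IDF is taken as log(N / (document count + 1)) following the logarithm and the plus-one smoothing mentioned in the description, a word's score for the category is taken as its maximum TF-IDF over the category's corpus texts, and the function name `top_labels` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def top_labels(category_texts, trigger_words, top_n):
    """Return the `top_n` target trigger words with the largest TF-IDF
    value inside one event category.

    `category_texts` is a list of token lists (the corpus texts of the
    category); statistics are computed only within this category.
    """
    n_texts = len(category_texts)
    # Number of corpus texts in the category that contain each trigger word.
    doc_freq = {w: sum(1 for text in category_texts if w in text)
                for w in trigger_words}
    best = {}
    for tokens in category_texts:
        counts = Counter(t for t in tokens if t in trigger_words)
        total = sum(counts.values())  # occurrences of all trigger words in this text
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_texts / (doc_freq[word] + 1))  # +1 is the smoothing
            best[word] = max(best.get(word, float("-inf")), tf * idf)
    return sorted(best, key=best.get, reverse=True)[:top_n]


# Hypothetical corpus texts of one event category.
texts = [
    ["attack", "attack", "bomb"],
    ["attack", "strike"],
    ["attack", "raid", "raid", "raid"],
]
labels = top_labels(texts, {"attack", "bomb", "strike", "raid"}, top_n=2)
```

With this smoothing, a word that occurs in every corpus text of the category (here "attack") receives a negative IDF and is therefore ranked last.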

The invention uses several target trigger words with relatively high TF-IDF values as the label of an event category (the labels distinguish the event categories from one another). This approach is free of the limit of the 33 event types defined in the ACE corpus and instead takes all language phenomena into account to form an open-domain event type system.

In another aspect, the embodiment of the application further provides an event type extraction device. Referring to FIG. 3, which shows a schematic structural diagram of an embodiment of an event type extraction device according to the present application, the device of the present embodiment may include:

a word screening unit 301, configured to extract a plurality of candidate corpus words from a preset corpus;

an association determining unit 302, configured to determine, based on the corpus, an association between a reference trigger word in a preset trigger word set and the candidate corpus word, where the reference trigger word is determined by an automatic content extraction technique;

the word expansion unit 303 is configured to, for any one reference trigger word, determine a candidate corpus word whose association with the reference trigger word meets a preset requirement as a target trigger word, and obtain at least one target trigger word corresponding to each reference trigger word;

a feature determining unit 304, configured to determine a feature of each target trigger word respectively;

a type determining unit 305, configured to cluster all the trigger words based on characteristics of the target trigger words to obtain a plurality of clustered sets belonging to different event categories, where each clustered set corresponds to one event category, and each clustered set includes at least one target trigger word.

Optionally, the word screening unit includes:

the undetermined word determining unit is used for determining undetermined corpus words contained in a plurality of corpus texts in the preset corpus;

and the word filtering unit is used for filtering out preset useless words contained in the to-be-determined corpus words to obtain the candidate corpus words, wherein the preset useless words comprise stop words and function words.
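The word screening described above can be sketched as a set difference. The word lists, the whitespace tokenization, and the function name are illustrative assumptions only.

```python
STOP_WORDS = {"the", "a", "of"}        # illustrative stop words
FUNCTION_WORDS = {"and", "but", "in"}  # illustrative function words


def candidate_corpus_words(corpus_texts):
    """Collect the words of all corpus texts and drop the preset useless words."""
    pending = {word for text in corpus_texts for word in text.lower().split()}
    return pending - STOP_WORDS - FUNCTION_WORDS


words = candidate_corpus_words(["The army attacked", "a bomb in the city"])
```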

Optionally, the association determining unit includes:

Optionally, when the first association calculating unit calculates the initial association of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus, specifically:

determining a plurality of preset conjunctions;

for each preset conjunction conj_i, determining the ratio of the number of occurrences having the preset conjunction conj_i in the corpus text to the minimum number of occurrences as the correlation Con(conj_i) of the reference trigger word and the candidate corpus word in the corpus text with respect to the conjunction conj_i;

determining a plurality of preset relationship types;

calculating, by using the following formula, the initial correlation R_di(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i:

Optionally, the manner of determining the feature of each target trigger word by the feature determining unit may include any one or more of the following:

acquiring attribute characteristics of the target trigger words;

Optionally, the apparatus further comprises:

a tagged word determining unit, configured to, after the type determining unit obtains the plurality of cluster sets belonging to different event categories, determine, for any one cluster set and according to the term frequency-inverse document frequency (TF-IDF) algorithm, at least one target trigger word in the cluster set that is suitable for use as a label of the cluster set;

and the event labeling unit is used for taking the at least one target trigger word as a label of the cluster set and labeling the cluster set.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An event type extraction method, comprising:

extracting a plurality of candidate corpus words from a preset corpus;

respectively determining the characteristics of each target trigger word;

based on the characteristics of the target trigger words, clustering all the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category and comprises at least one target trigger word;

wherein, the determining the characteristics of each target trigger word comprises any one or more of the following steps:

acquiring attribute characteristics of the target trigger words;

2. The method according to claim 1, wherein said extracting the corpus candidate words from the preset corpus comprises:

3. The method according to claim 1, wherein the determining the relevance of the reference trigger word in a preset set of trigger words to the corpus candidate words based on the corpus comprises:

4. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:

5. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:

determining a plurality of preset conjunctions;

for each preset conjunction conj_i, determining the ratio of the number of occurrences having the preset conjunction conj_i in the corpus text to the minimum number of occurrences as the correlation Con(conj_i) of the reference trigger word and the candidate corpus word in the corpus text with respect to the conjunction conj_i;

6. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:

determining a plurality of preset relationship types;

7. The method according to any one of claims 1 to 3, wherein after obtaining the clustered plurality of sets of clusters belonging to different event categories, further comprising:

8. An event type extraction device, comprising:

the type determining unit is used for clustering all the trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word;

the feature determining unit determines the features of each target trigger word, and the features include any one or more of the following:

acquiring attribute characteristics of the target trigger words;

9. The apparatus of claim 8, wherein the association determining unit comprises: