CN106951530B - Event type extraction method and device - Google Patents

Info

Publication number
CN106951530B
Authority
CN
China
Prior art keywords
corpus
word
words
candidate
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710169761.3A
Other languages
Chinese (zh)
Other versions
CN106951530A (en
Inventor
洪宇
杨雪蓉
姚建民
朱巧明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201710169761.3A
Publication of CN106951530A
Application granted
Publication of CN106951530B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Abstract

The application provides an event type extraction method and device. The method comprises the following steps: extracting candidate corpus words from a preset corpus; determining, based on the corpus, the relevance between each reference trigger word in a preset trigger word set and the candidate corpus words, wherein the reference trigger words are determined by an automatic content extraction technology; for any reference trigger word, determining the candidate corpus words whose relevance to the reference trigger word meets preset requirements as target trigger words, to obtain at least one target trigger word; determining the characteristics of the target trigger words; and clustering the target trigger words based on their characteristics to obtain cluster sets that belong to different event categories. The method and device make it possible to improve the accuracy of event extraction and to broaden its range of application.

Description

Event type extraction method and device
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an event type extraction method and apparatus.
Background
Event extraction is an important component of information extraction, with broad application prospects and considerable practical significance. Its purpose is to extract event information of interest accurately and efficiently from large volumes of disordered, unstructured information. According to the task definition of event extraction, an event is an objective fact in which specific people and objects interact at a specific time and place; an event consists of a trigger word and the elements that describe the event structure. Event extraction requires that structured information, including the event type, event elements, and event roles, be automatically identified and extracted from unstructured source text that contains event information.
Currently, existing event extraction work directly uses the annotation results of Automatic Content Extraction (ACE), so research on event extraction is limited to the event types defined in ACE, that is, to limited-domain event extraction. In the open domain, however, event types are far more numerous and varied, and the differences between event types are relatively small, which makes them hard to tell apart. If the ACE annotations are still used directly, event extraction cannot be carried out accurately and effectively.
Disclosure of Invention
In view of the above, the present application provides an event type extraction method and apparatus, so as to provide a possibility for improving the accuracy of event extraction and increasing the application range of event extraction.
In order to achieve the above purpose, the present application provides the following technical solutions:
an event type extraction method, comprising:
extracting a plurality of candidate corpus words from a preset corpus;
determining the relevance between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;
for any reference trigger word, determining candidate corpus words with the relevance meeting preset requirements with the reference trigger word as target trigger words to obtain at least one target trigger word corresponding to each reference trigger word;
respectively determining the characteristics of each target trigger word;
and based on the characteristics of the target trigger words, clustering all the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word.
Preferably, the extracting the candidate corpus words from the preset corpus includes:
determining to-be-determined corpus words contained in a plurality of corpus texts in the preset corpus;
and filtering out preset useless words contained in the to-be-determined corpus words to obtain the candidate corpus words, wherein the preset useless words comprise stop words and function words.
Preferably, the determining, based on the corpus, the relevance between the reference trigger word and the candidate corpus word in the preset trigger word set includes:
for each candidate corpus word, sequentially calculating the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus;
and for any pair of the reference trigger word and the candidate corpus word, summing the initial relevance of the reference trigger word and the candidate corpus word in each corpus text to obtain the relevance of the reference trigger word and the candidate corpus word in the corpus.
Preferably, the calculating an initial association between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:
for a corpus text, determining the initial relevance of the reference trigger word and the candidate corpus word in the corpus text according to the ratio of the first frequency of occurrence of the reference trigger word and the candidate corpus word in the same sentence in the corpus text to the minimum frequency of occurrence, wherein the minimum frequency of occurrence is the minimum value of the frequency of occurrence of the reference trigger word in the corpus text and the frequency of occurrence of the candidate corpus word in the corpus text.
Preferably, the calculating an initial association between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:
determining a plurality of preset connecting words;
for a corpus text, determining, from the corpus text, the first target sentences that contain both the reference trigger word and the candidate corpus word, with the two connected by a preset connecting word;
for each preset conjunction j_i, determining the ratio of a second number of occurrences to the minimum number of occurrences as the correlation Con(conj_i) of the reference trigger word and the candidate corpus word with respect to the conjunction j_i in the corpus text, wherein the second number of occurrences is the number of times the reference trigger word and the candidate corpus word appear together in first target sentences in which they are connected by the conjunction j_i;
calculating the initial correlation R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
[Formula image: Figure BDA0001250824780000031]
wherein i is a natural number from 1 to k, and k represents the total number of preset conjunctions that occur in all the first target sentences of the corpus text d_i.
Preferably, the calculating an initial association between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus includes:
determining a plurality of preset relationship types;
in any corpus text d_i, for any one of the relationship types j_i, determining the ratio of a third number of occurrences to the minimum number of occurrences as the correlation Rel(relj_i) of the reference trigger word and the candidate corpus word with respect to the relationship type j_i in the corpus text d_i, wherein the third number of occurrences is the number of times the reference trigger word and the candidate corpus word appear together in a second target sentence, the second target sentence is a sentence that contains a designated connecting word corresponding to the relationship type j_i and in which the reference trigger word and the candidate corpus word are connected by that designated connecting word, and the minimum number of occurrences is the smaller of the number of occurrences of the reference trigger word in the corpus text d_i and the number of occurrences of the candidate corpus word in the corpus text d_i;
calculating the initial correlation R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
[Formula image: Figure BDA0001250824780000032]
wherein i is a natural number from 1 to k, and k represents the maximum number of preset relationship types that the corpus text d_i has.
Preferably, the determining the characteristics of each target trigger word includes any one or more of the following:
acquiring attribute characteristics of the target trigger words;
acquiring relevant words of the target trigger words, wherein the relevant words comprise synonyms, antonyms and related words of the target trigger words;
searching in the corpus text to obtain a target corpus text containing the target trigger word, positioning a feature word meeting a preset position relation with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word;
and identifying the target trigger word and the frame type of the target trigger word from sentences in the corpus text of the corpus based on a frame network FrameNet tool.
Preferably, after the obtaining of the clustered plurality of cluster sets belonging to different event categories, the method further includes:
for any cluster set, determining at least one target trigger word in the cluster set that is suitable for use as a label of the cluster set according to the term frequency-inverse document frequency (TF-IDF) algorithm;
and taking the at least one target trigger word as a label of the cluster set, and labeling the cluster set.
In another aspect, the present application further provides an event type extraction device, including:
the word screening unit is used for extracting a plurality of candidate corpus words from a preset corpus;
the association determining unit is used for determining the association between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;
the word expansion unit is used for determining candidate corpus words, the relevance of which with the reference trigger words meets preset requirements, as target trigger words for any reference trigger word to obtain at least one target trigger word corresponding to each reference trigger word;
the characteristic determining unit is used for respectively determining the characteristics of each target trigger word;
and the type determining unit is used for clustering all the target trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word.
Preferably, the association determining unit includes:
the first association calculation unit is used for sequentially calculating, for each candidate corpus word, the initial relevance between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus;
and the second correlation calculation unit is used for summing the initial correlations of the reference trigger word and the candidate corpus words in each corpus text to obtain the correlation of the reference trigger word and the candidate corpus words in the corpus for any pair of the reference trigger word and the candidate corpus words.
According to the above technical solution, the target trigger words are obtained by expanding the reference trigger words produced by the existing automatic content extraction technology. The resulting trigger words therefore cover a wider range, which helps determine the core words that trigger events in event extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on the provided drawings without creative efforts.
FIG. 1 is a flow chart diagram illustrating an embodiment of an event type extraction method according to the present application;
FIG. 2 is a flow chart illustrating a further embodiment of an event type extraction method of the present application;
fig. 3 is a schematic structural diagram illustrating an example of an event type extraction apparatus according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To facilitate understanding of the event extraction process, some terms involved in the event extraction will be briefly described as follows:
Entity (Entity): an object or set of objects belonging to a certain semantic class.
Entity mention (Entity mention): a phrase (typically a noun phrase) that refers to an entity.
Event trigger word (Event trigger): the core word that triggers an event (triggers in ACE are mainly verbs or nouns).
Event elements (Event arguments): the participants of the event, which are the core parts that make up the event.
Event roles/element roles (Argument roles): the relationship of an event participant to the event.
Event mention (Event mention): a phrase or sentence that includes the event trigger word and the event participants.
An event type extraction method of the present application will be described below.
Referring to fig. 1, which shows a schematic flow chart of an embodiment of an event type extraction method according to the present application, the method of the embodiment may include:
101, extracting candidate corpus words from a preset corpus.
For example, the corpus may be obtained with TDT (Topic Detection and Tracking) technology and includes a plurality of corpus texts. The corpus texts may be news reports in text or speech form covering multiple languages. TDT mainly addresses tasks such as automatically identifying the boundaries of event reports, locking onto and collecting breaking news topics, tracking topic development, and cross-language detection and tracking. News texts collected with TDT technology describe a large number of events.
Candidate corpus words extracted from a preset corpus can be regarded as candidate trigger words, so that words which can be used as expanded trigger words are selected from the candidate trigger words. Specifically, word extraction may be performed on a corpus text in a preset corpus to obtain candidate trigger words.
102, determining, based on the corpus, the relevance between each reference trigger word in a preset trigger word set and the candidate corpus words.
The reference trigger words are determined by the Automatic Content Extraction (ACE) technology. A reference trigger word can be understood as a seed trigger word: the trigger words are expanded on the basis of the reference trigger words in combination with the candidate corpus words.
103, for any reference trigger word, determining the candidate corpus words with the relevance meeting the preset requirement with the reference trigger word as target trigger words, and obtaining at least one target trigger word corresponding to each reference trigger word.
Different from the prior art, the trigger words used for event extraction in the present application are not trigger words obtained by directly adopting the automatic content extraction technology, but trigger words obtained by the automatic content extraction technology are used as a reference to expand the trigger words.
104, determining the characteristics of each target trigger word.
The characteristics of the target trigger word are used for representing the self attribute of the target trigger word, the relevance of the target trigger word and the context in the corpus text and the like, and the characteristics of the target trigger word are the basis for determining the event category.
105, clustering all the target trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories.
Each cluster set corresponds to an event category, and each cluster set comprises at least one target trigger word.
In the present application, the target trigger words are obtained by expanding the reference trigger words produced by the existing automatic content extraction technology. The resulting trigger words therefore cover a wider range, which helps determine the core words that trigger events in event extraction.
Referring to fig. 2, which shows a schematic flow chart of another embodiment of the event type extraction method according to the present application, the method of the embodiment may include:
201, determining to-be-determined corpus words from a plurality of corpus texts in a preset corpus.
Step 201 amounts to extracting words from the corpus texts to determine the corpus words they contain. To distinguish these from the candidate corpus words later used to expand the trigger words, the initial corpus words extracted from the corpus texts are called to-be-determined corpus words.
202, filtering out preset useless words contained in the to-be-determined corpus words to obtain a corpus candidate word set containing a plurality of corpus candidate words.
The preset useless words can be set as needed; for example, they can include stop words and function words. Function words are words other than content words that cannot serve as sentence components on their own, whereas content words carry lexical and grammatical meaning and can serve as sentence components on their own.
Of course, besides filtering the preset useless words out of the to-be-determined corpus words, preprocessing such as lemmatization can also be applied to the remaining content words, and the to-be-determined corpus words that remain after preprocessing are used as candidate corpus words, thereby giving the candidate corpus word set.
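Steps 201 and 202 can be sketched as follows. This is a minimal illustration only: the tokenizer, the two tiny word lists, and the function name `extract_candidate_words` are assumptions made for the example and do not come from the application, and the optional lemmatization step is omitted.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would load full lexicons.
STOP_WORDS = {"the", "a", "an", "this", "that"}
FUNCTION_WORDS = {"of", "in", "on", "and", "but", "because", "to", "by", "was", "were", "is"}


def extract_candidate_words(corpus_texts):
    """Return the set of candidate corpus words found in the given corpus texts."""
    useless = STOP_WORDS | FUNCTION_WORDS
    candidates = set()
    for text in corpus_texts:
        # Step 201: naive word extraction (lowercased alphabetic tokens).
        pending_words = re.findall(r"[a-z]+", text.lower())
        # Step 202: drop the preset useless words.
        candidates.update(w for w in pending_words if w not in useless)
    return candidates
```

For instance, calling `extract_candidate_words` on two short news sentences keeps the content words and drops words such as "the" and "in".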
203, acquiring a preset trigger word set.
The trigger word set comprises a plurality of reference trigger words determined by an automatic content extraction technology.
The reference trigger word can be understood as a trigger word determined according to the prior art, and the application needs to expand the trigger word on the basis of the existing trigger word.
204, for each candidate corpus word, sequentially calculating the initial relevance between the candidate corpus word and each reference trigger word in the trigger word set in each corpus text.
The relevance reflects whether two words are related and how strongly. It may be the relevance of two words within a single corpus text, in which case it reflects only how strongly the two words are related in that text; for ease of distinction, the relevance of the reference trigger word and the candidate corpus word within one corpus text is called the initial relevance. Since there are multiple corpus texts, a reference trigger word and a candidate corpus word have one initial relevance for each corpus text.
The relevance may further include a comprehensive relevance of all documents in the corpus, where the comprehensive relevance may reflect a degree of relevance of two words in all text documents, and in this embodiment, the comprehensive relevance of the reference trigger word and the candidate corpus word in all documents in the corpus is referred to as the relevance in the corpus.
There are various ways to calculate the initial association between the candidate corpus word and the reference trigger word in a corpus text. Such as:
in one implementation of computing initial relevance:
the initial relevance of the reference trigger word and the candidate corpus word in the corpus text may be determined as the ratio of a first number of occurrences, namely the number of times the reference trigger word and the candidate corpus word appear in the same sentence in the corpus text, to a minimum number of occurrences. The minimum number of occurrences is the smaller of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text. That is, the initial relevance R_{d_i}(seed, c) may be expressed as:
$$R_{d_i}(seed, c) = \frac{count_{d_i}(seed, c)}{\min\left(count_{d_i}(seed),\ count_{d_i}(c)\right)}$$
where the numerator count_{d_i}(seed, c) is the number of times the reference trigger word seed and the candidate corpus word c co-occur in one sentence of the corpus text d_i, and the denominator is the smaller of count_{d_i}(seed) and count_{d_i}(c), the numbers of occurrences of seed and of c in the corpus text d_i.
In this implementation, words that appear in the same sentence are regarded as related words: the higher the ratio of the number of times two words appear in the same sentence to the smaller of their individual occurrence counts, the higher the correlation between the two words.
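The sentence co-occurrence ratio described above can be sketched in a few lines. The sentence splitter, the tokenizer, and the function names are assumptions for illustration; the sketch also assumes that a sentence containing both words counts as one co-occurrence, which the text does not spell out.

```python
import re


def split_sentences(text):
    """Split a text into lists of lowercased word tokens, one list per sentence."""
    sentences = re.split(r"[.!?]+", text)
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences if s.strip()]


def initial_relevance(text, seed, candidate):
    """Sentence co-occurrence count divided by the smaller of the two word counts."""
    sentences = split_sentences(text)
    # Numerator: sentences of this text that contain both words.
    co_occurrences = sum(1 for s in sentences if seed in s and candidate in s)
    # Denominator: the smaller of the two individual occurrence counts.
    seed_count = sum(s.count(seed) for s in sentences)
    candidate_count = sum(s.count(candidate) for s in sentences)
    minimum = min(seed_count, candidate_count)
    # Return 0.0 when either word is absent, to avoid dividing by zero.
    return co_occurrences / minimum if minimum else 0.0
```

With a three-sentence text in which "attack" and "bombing" each occur twice and share one sentence, the function yields 1 / 2.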
In yet another implementation of computing initial relevance:
all first target sentences are determined in the corpus text, a first target sentence being a sentence that contains both the reference trigger word and the candidate corpus word, with the two connected by a preset connecting word. For each connecting word j_i that connects the reference trigger word and the candidate corpus word, the correlation of the reference trigger word and the candidate corpus word with respect to the conjunction j_i in the corpus text is calculated. This correlation is the ratio of a second number of occurrences, namely the number of times the reference trigger word and the candidate corpus word appear together in first target sentences in which they are connected by j_i, to the minimum number of occurrences. The minimum number of occurrences is the smaller of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text.
That is, in the corpus text d_i, the relevance Con(conj_i) of the reference trigger word and the candidate corpus word with respect to the preset conjunction j_i can be expressed as follows:
$$Con(conj_i) = \frac{count_{d_i}(seed, c, conj_i)}{\min\left(count_{d_i}(seed),\ count_{d_i}(c)\right)}$$
where the numerator count_{d_i}(seed, c, conj_i) is the number of sentences in the corpus text d_i that contain both the reference trigger word seed and the candidate corpus word c connected by the conjunction j_i; it can also be regarded as the number of times the reference trigger word and the candidate corpus word are connected by that conjunction and appear together in one first target sentence. The denominator is the smaller of the number of occurrences of seed in the corpus text d_i and the number of occurrences of c in the corpus text d_i.
Correspondingly, the initial correlation R_{d_i}(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i is:
[Formula image: Figure BDA0001250824780000092]
where i is a natural number from 1 to k, and k represents the total number of preset conjunctions that occur in all the first target sentences of the corpus text d_i.
The preset connectives may be set as needed; optionally, the 182 connectives defined in the PDTB (Penn Discourse Treebank) may be used.
In this implementation, the variety of connectives that link the two words affects their relevance: if many different connectives link the two words, the relationship between them is considered less consistent, so their relevance is lower; if few different connectives link them, the relationship is considered stable, so their relevance is higher.
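The per-connective correlation Con(conj_i) can be sketched as follows. Several points are assumptions made for the example: two words are treated as "connected" when a connective appears between their first occurrences in the sentence, the connective list is a tiny hypothetical subset, and the function names are invented. The sketch stops at the per-connective scores and does not attempt the combined formula, which is not reproduced in this text.

```python
import re

# Hypothetical, tiny subset of connectives; the text suggests the PDTB inventory.
CONNECTIVES = {"because", "but", "then", "and"}


def connectives_between(tokens, seed, candidate):
    """Return the connectives located between the first seed and first candidate."""
    if seed not in tokens or candidate not in tokens:
        return set()
    lo, hi = sorted((tokens.index(seed), tokens.index(candidate)))
    return {t for t in tokens[lo + 1:hi] if t in CONNECTIVES}


def connective_relevance(text, seed, candidate):
    """Map each connective conj_i to Con(conj_i) for one corpus text."""
    sentences = [re.findall(r"[a-z]+", s.lower())
                 for s in re.split(r"[.!?]+", text) if s.strip()]
    seed_count = sum(s.count(seed) for s in sentences)
    candidate_count = sum(s.count(candidate) for s in sentences)
    minimum = min(seed_count, candidate_count)
    if minimum == 0:
        return {}
    counts = {}
    for tokens in sentences:
        # Each first target sentence contributes once per connecting connective.
        for conj in connectives_between(tokens, seed, candidate):
            counts[conj] = counts.get(conj, 0) + 1
    return {conj: n / minimum for conj, n in counts.items()}
```

A production system would more likely rely on a discourse parser to decide which arguments a connective links; the position-based test here is only a stand-in.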
In yet another implementation of computing initial relevance:
for each preset relationship type j_i, a third number of occurrences is determined, namely the number of times the reference trigger word and the candidate corpus word appear together in a second target sentence, and the correlation of the reference trigger word and the candidate corpus word with respect to the relationship type j_i in the corpus text is calculated. A second target sentence is a sentence that contains a designated connecting word corresponding to the relationship type and in which the reference trigger word and the candidate corpus word are connected by that designated connecting word. The correlation of the reference trigger word and the candidate corpus word with respect to the relationship type j_i in the corpus text is the ratio of the third number of occurrences for that relationship type to the minimum number of occurrences, where the minimum number of occurrences is the smaller of the number of occurrences of the reference trigger word in the corpus text and the number of occurrences of the candidate corpus word in the corpus text.
That is, in the corpus text d_i, the correlation Rel(relj_i) of the reference trigger word seed and the candidate corpus word c with respect to the preset relationship type j_i can be expressed as follows:
$$Rel(relj_i) = \frac{count_{d_i}(seed, c, relj_i)}{\min\left(count_{d_i}(seed),\ count_{d_i}(c)\right)}$$
where the numerator count_{d_i}(seed, c, relj_i) is the third number of occurrences, that is, the number of second target sentences in the corpus text d_i in which the reference trigger word seed and the candidate corpus word c are connected by a designated connecting word corresponding to the relationship type; it can also be regarded as the number of times the reference trigger word and the candidate corpus word are connected by a connecting word corresponding to the relationship type and appear together in one target sentence. The denominator is the smaller of the number of occurrences of seed in the corpus text d_i and the number of occurrences of c in the corpus text d_i.
After the correlation Rel(relj_i) of the reference trigger word seed and the candidate corpus word c with respect to each preset relationship type j_i has been obtained, the initial relevance R_{d_i}(seed, c) of the reference trigger word and the candidate corpus word c in the corpus text d_i can be calculated as:
[Formula image: Figure BDA0001250824780000102]
where i is a natural number from 1 to k, and k represents the maximum number of preset relationship types that the corpus text d_i has. If there are four preset relationship types, the value of k is 4.
Since 182 conjuncts are defined in PDTB, it is easy to make the number of instances of each conjunct rare, and the relationship is calculated using chapter-based relationship types in this implementation. Optionally, in this embodiment of the present application, the relationship types of the chapters may include four preset types of relationship types: contrast (contrast), causality (containment), Expansion (Expansion), and Temporal (Temporal).
Some of the conjunctions in the PDTB point to a particular relationship type, for example, having the conjunctions "because" preceding argument "and" following argument "in the conjunctions" can point to "cause" relationship; the partial conjunctions may point to a variety of relationship types, such as the conjunctions "and (and)". Therefore, the invention only selects specific conjunctions in the PDTB. A particular conjunctive is one that has a high probability of pointing to a certain type of relationship in the discourse. The invention aims at the distribution of the connective words in the PDTB and counts the probability that each connective word points to a certain relationship type. For example, the probability that the conjunction "optionally" points to the "Expansion" relationship type is 100%. In the application, only the connection words pointing to a certain relation type with a probability greater than 80% are selected as the designated connection words contained in the relation type.
Accordingly, in this implementation, the "bound range" within which two words are considered related is that the two words are in the same sentence and are connected by a designated connecting word. The relevance of the words seed and c can then be calculated separately for each of the four relation types.
Of course, in practical applications, there may be other ways to calculate the initial association between the reference trigger word and the candidate corpus word in the corpus text, and the method is not limited herein.
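A rough sketch of the relation-type-based calculation for one corpus text is given below. Several points are assumptions rather than statements of the patented method: "connected by a designated connecting word" is simplified to "the connecting word occurs in the same sentence as both words", only single-word connecting words are handled, and the per-relation values are combined by summation because the formula itself is only available as a figure. The names `relation_relevance` and `initial_relevance` are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase word tokenizer (simplification for the example)."""
    return re.findall(r"[a-z']+", sentence.lower())


def relation_relevance(sentences, seed, cand, connectives):
    """Rel for one relation type within one corpus text.

    Numerator: sentences containing seed, cand and one of the connecting words.
    Denominator: min(occurrences of seed, occurrences of cand) in the text.
    """
    seed_count = cand_count = linked = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        seed_count += tokens.count(seed)
        cand_count += tokens.count(cand)
        if seed in tokens and cand in tokens and any(c in tokens for c in connectives):
            linked += 1
    fewest = min(seed_count, cand_count)
    return linked / fewest if fewest else 0.0


def initial_relevance(sentences, seed, cand, designated):
    """Assumed combination: sum of Rel over all preset relation types."""
    return sum(relation_relevance(sentences, seed, cand, conns)
               for conns in designated.values())


text = ["The troops attacked because the rebels fired.",
        "The rebels fired again.",
        "Troops attacked the town."]
designated = {"Contingency": {"because"}, "Temporal": {"after", "before"}}
print(initial_relevance(text, "attacked", "fired", designated))
```

Here both words occur twice and share one sentence linked by "because", so the causal relation type contributes 1/2 and the temporal type contributes nothing.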
205, for any pair of a reference trigger word in the trigger word set and a candidate corpus word, computing the relevance of the reference trigger word and the candidate corpus word in the corpus according to their initial relevance in each corpus text.
For any pair of a reference trigger word and a candidate corpus word, the initial relevances of the pair in the individual corpus texts of the corpus are added together to obtain the relevance of the pair in the corpus, namely the final relevance of the reference trigger word and the candidate corpus word.
That is, the relevance R(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus is:
$$R(seed, c) = \sum_{i=1}^{n} R_{d_i}(seed, c)$$
wherein n represents the total number of corpus texts containing a sentence in which the reference trigger word seed and the candidate corpus word c both appear, i is a natural number from 1 to n, and d_i represents a corpus text in which the reference trigger word seed and the candidate corpus word c appear together in one sentence.
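The summation over corpus texts can be illustrated with a short sketch. For the per-text initial relevance it uses the simplest variant mentioned in this document (same-sentence co-occurrence count divided by the smaller of the two word counts); the function names and the pre-tokenized toy corpus are assumptions made for the example.

```python
def cooccurrence_relevance(sentences, seed, cand):
    """Initial relevance in one corpus text (list of token lists)."""
    both = sum(1 for tokens in sentences if seed in tokens and cand in tokens)
    fewest = min(sum(tokens.count(seed) for tokens in sentences),
                 sum(tokens.count(cand) for tokens in sentences))
    return both / fewest if fewest else 0.0


def corpus_relevance(corpus, seed, cand):
    """Relevance in the whole corpus: sum of the per-text initial relevances.

    Texts in which the two words never share a sentence contribute 0, so
    summing over all texts equals summing over the n co-occurrence texts.
    """
    return sum(cooccurrence_relevance(text, seed, cand) for text in corpus)


corpus = [
    [["attack", "killed"], ["attack"], ["killed"]],  # 1 shared sentence / min(2, 2)
    [["attack", "killed"]],                          # 1 shared sentence / min(1, 1)
    [["meeting", "held"]],                           # no co-occurrence
]
print(corpus_relevance(corpus, "attack", "killed"))
```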
206, for any reference trigger word, determining the candidate corpus words whose relevance to the reference trigger word meets a preset requirement as target trigger words, so as to obtain at least one target trigger word expanded from the reference trigger word.
One or more target trigger words may be expanded for each reference trigger word.
The preset requirement to be met by the relevance between a target trigger word and the reference trigger word can be set as needed. For example, the preset requirement may be that the value of the relevance is greater than a preset threshold. Optionally, for each reference trigger word, the candidate corpus words may be sorted in descending order of their relevance to the reference trigger word, and a specified number of top-ranked candidate corpus words are determined as target trigger words.
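Both selection options (threshold and top-ranked) can be sketched in a few lines. The function name `expand_trigger_words`, its parameters, and the toy scores are hypothetical and serve only to illustrate the step.

```python
def expand_trigger_words(relevance, threshold=None, top_n=None):
    """Pick target trigger words for one reference trigger word.

    relevance: dict mapping candidate corpus word -> relevance to the
    reference trigger word. Either criterion (or both) may be applied.
    """
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(word, score) for word, score in ranked if score > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for word, _ in ranked]


scores = {"bombing": 2.5, "meeting": 0.2, "assault": 1.7, "said": 0.9}
print(expand_trigger_words(scores, threshold=0.5))
print(expand_trigger_words(scores, top_n=2))
```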
207, the characteristics of each target trigger word are obtained respectively.
Wherein the characteristics of the target trigger word are used for describing the basic characteristics of the target trigger word.
For example, the characteristics of the target trigger word may include any one or more of the following:
attribute characteristics of the target trigger word;
relevant words of the target trigger word, such as synonyms, antonyms and related words of the target trigger word;
contextual characteristics of the target trigger word;
the frame type to which the target trigger word belongs.
The attribute features are features of the target trigger word itself and can be obtained by part-of-speech tagging and named entity recognition on the target trigger word.
The relevant words of the target trigger word can be obtained by calling a designated lexicon through a preset interface.
The context feature of the target trigger word can be obtained by searching in the corpus text of the corpus to obtain a target corpus text containing the target trigger word, locating a feature word meeting a preset position relationship with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word. For example, contextual characteristics may include the following:
the three words before and the three words after the target trigger word (excluding stop words);
word sequences of two or three words extracted, according to an N-Gram model, from the words within a distance of three words of the target trigger word in the corpus text;
and the word immediately preceding the target trigger word and the word immediately following it in the corpus text.
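The window-based and adjacency-based context features listed above might be extracted as in the following sketch. The function name, the dictionary keys, and the stop-word list are assumptions for the example, and the N-Gram feature is left out for brevity.

```python
def context_features(tokens, index, stop_words, window=3):
    """Context features of the trigger word at tokens[index].

    before/after: up to `window` non-stop words on each side.
    prev/next: the immediately adjacent words (None at sentence boundaries).
    """
    before = [t for t in tokens[:index] if t not in stop_words][-window:]
    after = [t for t in tokens[index + 1:] if t not in stop_words][:window]
    prev_word = tokens[index - 1] if index > 0 else None
    next_word = tokens[index + 1] if index + 1 < len(tokens) else None
    return {"before": before, "after": after, "prev": prev_word, "next": next_word}


tokens = "the rebels attacked the town in the north yesterday".split()
print(context_features(tokens, tokens.index("attacked"), {"the", "in"}))
```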
The frame type to which the target trigger word belongs is obtained by using a frame network (FrameNet) tool to identify, in each sentence of the corpus text, the target trigger word and its frame (Frame), so that the frame type of the target trigger word is obtained whenever the target trigger word has a frame. The frame type of the word before the target trigger word and the frame type of the word after the target trigger word can further be extracted. FrameNet is a corpus-based semantic network that applies the theory of frame semantics; it is organized around frames and links the meanings of the words associated with them.
208, clustering all the obtained target trigger words based on the characteristics of the target trigger words to obtain a plurality of cluster sets belonging to different event categories.
Each cluster set comprises a plurality of target trigger words.
The different cluster sets correspond to different event categories, and one cluster set of an event category comprises a plurality of target trigger words belonging to the event category.
In the embodiment of the present application, the target trigger words may be clustered with a preset clustering algorithm, for example the Affinity Propagation clustering algorithm, also referred to as the AP clustering algorithm for short. This clustering algorithm treats all data points as potential cluster centers and does not require the number of clusters to be specified. In the clustering process, the feature vectors constructed from the characteristics of the target trigger words obtained in the previous step are used as input data, so that trigger words of the same type are grouped into one class, and the target trigger words within one class of the clustering result share the same type or characteristics. A class can be regarded as a trigger word set.
Because the characteristics of the target trigger words determined in the present application differ markedly from the characteristics determined in the prior art, clustering all the target trigger words with the clustering algorithm yields event types that are not limited to, and differ from, the event types defined in the ACE corpus.
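For readers unfamiliar with AP clustering, the following is a compact pure-Python sketch of the standard responsibility and availability updates. It is an illustration under stated assumptions, not the implementation used in the application: the similarity is taken to be the negative squared Euclidean distance between feature vectors, the preference is set to the median off-diagonal similarity, and the feature vectors are invented toy values.

```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=300):
    """Return, for each point, the index of the exemplar it is assigned to."""
    n = len(points)
    # Similarity: negative squared Euclidean distance (assumed choice).
    S = [[-sum((x - y) ** 2 for x, y in zip(p, q)) for q in points] for p in points]
    # Preference (self-similarity): median of the off-diagonal similarities.
    preference = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        S[i][i] = preference
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            first = scores[best]
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                rival = second if k == best else first  # best competing candidate
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            pos[k] = R[k][k]  # the self-responsibility is not clipped
            total = sum(pos)
            for i in range(n):
                new = total - pos[i] if i == k else min(0.0, total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # fallback: every point forms its own cluster
        return list(range(n))
    return [i if i in exemplars else max(exemplars, key=S[i].__getitem__)
            for i in range(n)]


# Toy feature vectors for six trigger words forming two obvious groups.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
print(affinity_propagation(vectors))
```

The number of clusters is not passed in; it emerges from the preference value, which is why this family of algorithms suits an open set of event types.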
209, for each cluster set, selecting at least one target trigger word from the cluster set as a label of the cluster set, and labeling the cluster set with the obtained label.
Optionally, the target trigger words in a cluster set that are suitable as labels of the cluster set are determined according to the TF-IDF algorithm; specifically, for each event type generated by the clustering algorithm, a first specified number of target trigger words with the largest TF-IDF values are selected from the cluster set of the event type as the labels of the event type.
TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The main idea of TF-IDF is that if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, the word or phrase is considered to discriminate well between categories and to be suitable for classification. TF is the term frequency (Term Frequency) and represents how often a word or phrase appears in a document; IDF is the inverse document frequency (Inverse Document Frequency), obtained by dividing the total number of documents by the number of documents containing the word or phrase and taking the logarithm of the quotient, and it measures the general importance of the word or phrase. A word or phrase that has a high TF in one document and rarely appears in other documents is therefore suitable for use as the label of a category.
Assume that cluster sets of K event categories are obtained by the clustering algorithm and that, for each event category, the several target trigger words that best represent the event category are to be determined; the TF-IDF under each event category is then computed only within that event category. In the present invention, each event category includes several corpus texts, and for each target trigger word in an event category, the TF-IDF of the target trigger word in each corpus text (TF is the frequency with which the target trigger word appears in the corpus text d_i, and IDF is the reciprocal of the number of corpus texts including the target trigger word) is defined as follows:
wherein i represents a target trigger word, and n_ij represents the number of times the target trigger word i appears in the corpus text j of the event category;
Figure BDA0001250824780000142
represents the sum of the numbers of occurrences of all target trigger words in the corpus text j of the event category; M represents the number of all target trigger words of the event category; n represents the total number of corpus texts corresponding to the event category (that is, the total number of corpus texts containing any target trigger word of the event category); n_j represents the number of corpus texts in the event category that contain the target trigger word, and adding 1 provides smoothing.
Therefore, the invention denotes the K event categories generated by the AP clustering algorithm as C1, C2, …, CK; for all documents d in each category Ci (i = 1, 2, …, K), the TF-IDF value of each target trigger word in each document is calculated; and for each event category, the specified number (for example, 100) of target trigger words with the largest TF-IDF values in the event category are taken as the labels of the event category.
The invention uses several target trigger words with high TF-IDF values as the labels of an event category (the labels distinguish the event categories from one another). The method is thus freed from the limit of the 33 event types defined in the ACE corpus and instead takes all language phenomena into account to form an open-domain event type system.
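The labeling step can be sketched as below. Because the exact formula is only available as a figure, the form used here is an assumption: TF is the count of the word divided by the count of all target trigger words in the text, IDF is the logarithm of the number of texts divided by one plus the number of texts containing the word, and a word is ranked by its highest TF-IDF over the texts of the category. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def label_cluster(documents, trigger_words, top_n=2):
    """Pick label words for one event category.

    documents: list of token lists (the corpus texts of the category).
    trigger_words: set of target trigger words in the category's cluster set.
    """
    n_docs = len(documents)
    per_doc = []
    doc_freq = Counter()
    for tokens in documents:
        counts = Counter(t for t in tokens if t in trigger_words)
        per_doc.append(counts)
        doc_freq.update(counts.keys())  # one count per text containing the word

    best = {}
    for counts in per_doc:
        total = sum(counts.values())
        for word, n in counts.items():
            # Assumed form: tf * log(N / (df + 1)), with +1 as smoothing.
            score = (n / total) * math.log(n_docs / (doc_freq[word] + 1))
            best[word] = max(best.get(word, float("-inf")), score)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(best, key=lambda w: (-best[w], w))[:top_n]


docs = [["attack", "attack", "strike"], ["attack", "raid"], ["attack"]]
print(label_cluster(docs, {"attack", "strike", "raid"}))
```

With this smoothing, a word that occurs in every text of the category receives a negative IDF and ranks last, which may or may not match the intended behavior of the original formula.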
On the other hand, the embodiment of the application also provides an event type extraction device. Referring to fig. 3, which shows a schematic structural diagram of an embodiment of an event type extraction device according to the present application, the device of the present embodiment may include:
a word screening unit 301, configured to extract a plurality of candidate corpus words from a preset corpus;
an association determining unit 302, configured to determine, based on the corpus, an association between a reference trigger word in a preset trigger word set and the candidate corpus word, where the reference trigger word is determined by an automatic content extraction technique;
a word expansion unit 303, configured to, for any reference trigger word, determine the candidate corpus words whose relevance to the reference trigger word meets a preset requirement as target trigger words, so as to obtain at least one target trigger word corresponding to each reference trigger word;
a feature determining unit 304, configured to determine a feature of each target trigger word respectively;
a type determining unit 305, configured to cluster all the trigger words based on characteristics of the target trigger words to obtain a plurality of clustered sets belonging to different event categories, where each clustered set corresponds to one event category, and each clustered set includes at least one target trigger word.
Optionally, the word screening unit includes:
an undetermined word determining unit, configured to determine the undetermined corpus words contained in a plurality of corpus texts in the preset corpus;
and a word filtering unit, configured to filter out preset useless words contained in the undetermined corpus words to obtain the candidate corpus words, wherein the preset useless words comprise stop words and function words.
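The two sub-units above amount to collecting the vocabulary of the corpus texts and removing a preset list of useless words. A minimal sketch, with a hypothetical function name, a simple lowercase tokenizer, and a toy useless-word list, could look like this:

```python
import re


def extract_candidate_words(corpus_texts, useless_words):
    """Collect the words of all corpus texts and drop preset useless words.

    corpus_texts: iterable of raw text strings.
    useless_words: set of stop words and function words to filter out.
    """
    candidates = set()
    for text in corpus_texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in useless_words:
                candidates.add(token)
    return candidates


texts = ["The army attacked the city.", "Protesters marched in the city"]
print(sorted(extract_candidate_words(texts, {"the", "in"})))
```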
Optionally, the association determining unit includes:
a first association calculation unit, configured to sequentially calculate, for each candidate corpus word, the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set within each corpus text in the corpus;
and the second correlation calculation unit is used for summing the initial correlations of the reference trigger word and the candidate corpus words in each corpus text to obtain the correlation of the reference trigger word and the candidate corpus words in the corpus for any pair of the reference trigger word and the candidate corpus words.
Optionally, when the first association calculating unit calculates the initial association of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus, specifically:
for a corpus text, determining the initial relevance of the reference trigger word and the candidate corpus word in the corpus text according to the ratio of the first frequency of occurrence of the reference trigger word and the candidate corpus word in the same sentence in the corpus text to the minimum frequency of occurrence, wherein the minimum frequency of occurrence is the minimum value of the frequency of occurrence of the reference trigger word in the corpus text and the frequency of occurrence of the candidate corpus word in the corpus text.
Optionally, when the first association calculating unit calculates the initial association of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus, specifically:
determining a plurality of preset connecting words;
for a corpus text, determining, from the corpus text, first target sentences that contain both the reference trigger word and the candidate corpus word, the reference trigger word and the candidate corpus word being connected by a preset connecting word;
for each preset connecting word j_i, determining the ratio of the number of first target sentences in the corpus text that have the preset connecting word j_i to the minimum number of occurrences as the relevance Con(conj_i) of the reference trigger word and the candidate corpus word in the corpus text with respect to the connecting word j_i;
calculating the initial relevance R_di(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
Figure BDA0001250824780000161
wherein i is a natural number from 1 to k, and k represents the total number of preset connecting words in all the first target sentences of the corpus text d_i.
Optionally, when the first association calculating unit calculates the initial association of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus, specifically:
determining a plurality of preset relationship types;
in any corpus text d_i, for any relation type j_i, determining the ratio of the third count, namely the number of times the reference trigger word and the candidate corpus word appear in second target sentences, to the minimum number of occurrences as the relevance Rel(relj_i) of the reference trigger word and the candidate corpus word in the corpus text d_i with respect to the relation type j_i, wherein a second target sentence is a sentence that has a designated connecting word corresponding to the relation type j_i and in which the reference trigger word and the candidate corpus word are connected by the designated connecting word, and the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text d_i and the number of occurrences of the candidate corpus word in the corpus text d_i;
calculating the initial relevance R_di(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
Figure BDA0001250824780000171
wherein i is a natural number from 1 to k, and k represents the maximum number of preset relation types for the corpus text d_i.
Optionally, the manner of determining the feature of each target trigger word by the feature determining unit may include any one or more of the following:
acquiring attribute characteristics of the target trigger words;
acquiring relevant words of the target trigger words, wherein the relevant words comprise synonyms, antonyms and related words of the target trigger words;
searching in the corpus text to obtain a target corpus text containing the target trigger word, positioning a feature word meeting a preset position relation with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word;
and identifying the target trigger word and the frame type of the target trigger word from sentences in the corpus text of the corpus based on a frame network FrameNet tool.
Optionally, the apparatus further comprises:
a label word determining unit, configured to, after the type determining unit obtains the plurality of cluster sets belonging to different event categories, determine, for any cluster set and according to the term frequency-inverse document frequency (TF-IDF) algorithm, at least one target trigger word in the cluster set that is suitable as a label of the cluster set;
and the event labeling unit is used for taking the at least one target trigger word as a label of the cluster set and labeling the cluster set.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. An event type extraction method, comprising:
extracting a plurality of candidate corpus words from a preset corpus;
determining the relevance between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;
for any reference trigger word, determining candidate corpus words with the relevance meeting preset requirements with the reference trigger word as target trigger words to obtain at least one target trigger word corresponding to each reference trigger word;
respectively determining the characteristics of each target trigger word;
based on the characteristics of the target trigger words, clustering all the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category and comprises at least one target trigger word;
wherein, the determining the characteristics of each target trigger word comprises any one or more of the following steps:
acquiring attribute characteristics of the target trigger words;
acquiring relevant words of the target trigger words, wherein the relevant words comprise synonyms, antonyms and related words of the target trigger words;
searching in the corpus text to obtain a target corpus text containing the target trigger word, positioning a feature word meeting a preset position relation with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word;
and identifying the target trigger word and the frame type of the target trigger word from sentences in the corpus text of the corpus based on a frame network FrameNet tool.
2. The method according to claim 1, wherein said extracting the corpus candidate words from the preset corpus comprises:
determining undetermined corpus words contained in a plurality of corpus texts in the preset corpus;
and filtering out preset useless words contained in the undetermined corpus words to obtain the candidate corpus words, wherein the preset useless words comprise stop words and function words.
3. The method according to claim 1, wherein the determining the relevance of the reference trigger word in a preset set of trigger words to the corpus candidate words based on the corpus comprises:
for each candidate corpus word, sequentially calculating the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set in each corpus text in the corpus;
and for any pair of the reference trigger word and the candidate corpus word, summing the initial relevance of the reference trigger word and the candidate corpus word in each corpus text to obtain the relevance of the reference trigger word and the candidate corpus word in the corpus.
4. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:
for a corpus text, determining the initial relevance of the reference trigger word and the candidate corpus word in the corpus text according to the ratio of the first frequency of occurrence of the reference trigger word and the candidate corpus word in the same sentence in the corpus text to the minimum frequency of occurrence, wherein the minimum frequency of occurrence is the minimum value of the frequency of occurrence of the reference trigger word in the corpus text and the frequency of occurrence of the candidate corpus word in the corpus text.
5. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:
determining a plurality of preset connecting words;
for a corpus text, determining, from the corpus text, first target sentences that contain both the reference trigger word and the candidate corpus word, the reference trigger word and the candidate corpus word being connected by a preset connecting word;
for each preset connecting word j_i, determining the ratio of the number of first target sentences in the corpus text that have the preset connecting word j_i to the minimum number of occurrences as the relevance Con(conj_i) of the reference trigger word and the candidate corpus word in the corpus text with respect to the connecting word j_i;
calculating the initial relevance R_di(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
Figure FDA0002296362790000021
wherein i is a natural number from 1 to k, and k represents the total number of preset connecting words in all the first target sentences of the corpus text d_i.
6. The method according to claim 3, wherein said calculating an initial association of said candidate corpus word with each reference trigger word in said set of trigger words within each corpus text in said corpus comprises:
determining a plurality of preset relationship types;
in any corpus text d_i, for any relation type j_i, determining the ratio of the third count, namely the number of times the reference trigger word and the candidate corpus word appear in second target sentences, to the minimum number of occurrences as the relevance Rel(relj_i) of the reference trigger word and the candidate corpus word in the corpus text d_i with respect to the relation type j_i, wherein a second target sentence is a sentence that has a designated connecting word corresponding to the relation type j_i and in which the reference trigger word and the candidate corpus word are connected by the designated connecting word, and the minimum number of occurrences is the minimum of the number of occurrences of the reference trigger word in the corpus text d_i and the number of occurrences of the candidate corpus word in the corpus text d_i;
calculating the initial relevance R_di(seed, c) of the reference trigger word seed and the candidate corpus word c in the corpus text d_i by using the following formula:
wherein i is a natural number from 1 to k, and k represents the maximum number of preset relation types for the corpus text d_i.
7. The method according to any one of claims 1 to 3, wherein after obtaining the clustered plurality of sets of clusters belonging to different event categories, further comprising:
determining, for any cluster set and according to the term frequency-inverse document frequency (TF-IDF) algorithm, at least one target trigger word in the cluster set that is suitable as a label of the cluster set;
and taking the at least one target trigger word as a label of the cluster set, and labeling the cluster set.
8. An event type extraction device, comprising:
the word screening unit is used for extracting a plurality of candidate corpus words from a preset corpus;
the association determining unit is used for determining the association between a reference trigger word in a preset trigger word set and the candidate corpus words based on the corpus, wherein the reference trigger word is determined by an automatic content extraction technology;
the word expansion unit is used for determining candidate corpus words, the relevance of which with the reference trigger words meets preset requirements, as target trigger words for any reference trigger word to obtain at least one target trigger word corresponding to each reference trigger word;
the characteristic determining unit is used for respectively determining the characteristics of each target trigger word;
the type determining unit is used for clustering all the trigger words based on the characteristics of the target trigger words to obtain a plurality of clustered cluster sets belonging to different event categories, wherein each cluster set corresponds to one event category, and each cluster set comprises at least one target trigger word;
the feature determining unit determines the features of each target trigger word, and the features include any one or more of the following:
acquiring attribute characteristics of the target trigger words;
acquiring relevant words of the target trigger words, wherein the relevant words comprise synonyms, antonyms and related words of the target trigger words;
searching in the corpus text to obtain a target corpus text containing the target trigger word, positioning a feature word meeting a preset position relation with the target trigger word in the target corpus text, and taking the obtained feature word as the context feature of the target trigger word;
and identifying the target trigger word and the frame type of the target trigger word from sentences in the corpus text of the corpus based on a frame network FrameNet tool.
9. The apparatus of claim 8, wherein the association determining unit comprises:
the first association calculation unit is used for sequentially calculating, for each candidate corpus word, the initial relevance of the candidate corpus word and each reference trigger word in the trigger word set within each corpus text in the corpus;
and the second correlation calculation unit is used for summing the initial correlations of the reference trigger word and the candidate corpus words in each corpus text to obtain the correlation of the reference trigger word and the candidate corpus words in the corpus for any pair of the reference trigger word and the candidate corpus words.
CN201710169761.3A 2017-03-21 2017-03-21 Event type extraction method and device Active CN106951530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710169761.3A CN106951530B (en) 2017-03-21 2017-03-21 Event type extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710169761.3A CN106951530B (en) 2017-03-21 2017-03-21 Event type extraction method and device

Publications (2)

Publication Number Publication Date
CN106951530A CN106951530A (en) 2017-07-14
CN106951530B true CN106951530B (en) 2020-01-17

Family

ID=59472782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710169761.3A Active CN106951530B (en) 2017-03-21 2017-03-21 Event type extraction method and device

Country Status (1)

Country Link
CN (1) CN106951530B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319692B (en) * 2018-02-01 2021-03-19 云知声智能科技股份有限公司 Abnormal punctuation cleaning method, storage medium and server
CN110209807A (en) * 2018-07-03 2019-09-06 腾讯科技(深圳)有限公司 A kind of method of event recognition, the method for model training, equipment and storage medium
CN110032641B (en) * 2019-02-14 2024-02-13 创新先进技术有限公司 Method and device for extracting event by using neural network and executed by computer
CN111310461B (en) * 2020-01-15 2023-03-21 腾讯云计算(北京)有限责任公司 Event element extraction method, device, equipment and storage medium
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
CN111522915A (en) * 2020-04-20 2020-08-11 北大方正集团有限公司 Extraction method, device and equipment of Chinese event and storage medium
CN111985152B (en) * 2020-07-28 2022-09-13 浙江大学 Event classification method based on dichotomy hypersphere prototype network
CN112487171A (en) * 2020-12-15 2021-03-12 中国人民解放军国防科技大学 Event extraction system and method under open domain
CN116611514B (en) * 2023-07-19 2023-10-10 中国科学技术大学 Value orientation evaluation system construction method based on data driving

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462229A (en) * 2014-11-13 2015-03-25 苏州大学 Event classification method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2807534B1 (en) * 2000-04-05 2002-07-12 Inup COMPUTER FARM WITH PROCESSOR CARD HOT INSERTION / EXTRACTION SYSTEM

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462229A (en) * 2014-11-13 2015-03-25 苏州大学 Event classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
中文事件抽取中事件类别的自动识别;赵妍妍;《第三届学生计算语言学研讨会论文集》;20060801;第240-244页 *

Also Published As

Publication number Publication date
CN106951530A (en) 2017-07-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventors after: Hong Yu; Yang Xuerong; Yao Jianmin; Zhu Qiaoming
Inventors before: Yang Xuerong; Hong Yu; Yao Jianmin; Zhu Qiaoming
GR01 Patent grant