CN110929509A

CN110929509A - Louvain community discovery algorithm-based field event trigger word clustering method

Info

Publication number: CN110929509A
Application number: CN201910980755.5A
Authority: CN
Inventors: 骆祥峰; 黄敬; 马秀侠
Original assignee: Beijing Transpacific Technology Development Ltd
Current assignee: Beijing Transpacific Technology Development Ltd
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-03-27
Anticipated expiration: 2039-10-16
Also published as: CN110929509B

Abstract

The invention discloses a method for clustering domain event trigger words based on the Louvain community discovery algorithm, comprising the following steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words with the Louvain community discovery algorithm; and (5) outputting the clustering result of the event trigger words. The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual effort. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves considerable manual effort and facilitates event extraction.

Description

Louvain community discovery algorithm-based field event trigger word clustering method

Technical Field

The invention relates to event extraction within the field of information extraction, and in particular to a method for clustering domain event trigger words based on the Louvain community discovery algorithm.

Background

As defined by Automatic Content Extraction (ACE), event extraction mainly comprises two tasks: identification of event types and extraction of event elements. An event trigger word is a word that describes the occurrence of an event and plays an important role in determining the event type, so identifying the event type amounts to extracting the event trigger word. The traditional pattern-matching approach to trigger word extraction requires the trigger words of each event to be summarized manually, which consumes a large amount of labor. The machine-learning approach converts trigger word extraction into a classification problem; it relies on training with a large-scale corpus and is sensitive to corpus size, suffers from severe data sparsity, and has low accuracy, so it cannot meet industrial requirements.

Disclosure of Invention

The invention aims to provide a method for clustering domain event trigger words based on the Louvain community discovery algorithm, addressing the shortcomings of traditional event trigger word extraction methods.

To achieve the above object, the invention is conceived as follows: an event trigger word is a word that clearly indicates the occurrence of an event and is an important lexical unit for determining the event category; it is the main verb of a sentence, and many verbs with similar semantics can indicate the same type of event.

According to the above inventive idea, the invention adopts the following technical scheme:

A method for clustering domain event trigger words based on the Louvain community discovery algorithm comprises the following steps:

(1) inputting a text set from any domain;

(2) extracting event trigger words from the text titles;

(3) constructing a trigger word similarity network;

(4) clustering the trigger words based on the Louvain community discovery algorithm;

(5) outputting the clustering result of the event trigger words.

The event trigger word extraction in step (2) proceeds as follows:

(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting the core relation (V_HED, HED), the subject-verb relation (Sub, V_SBV), and the verb-object relation (V_VOB, Obj) in each sentence, and taking the predicate verbs V_HED, V_SBV, and V_VOB as candidate trigger words;

(2-2) keeping only the candidate trigger words whose part of speech is verb (v), intransitive verb (vi), or nominal verb (vn), and removing words of length 1.
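Steps (2-1) and (2-2) can be sketched as a filter over parser output. The sketch below does not call HanLP; it assumes a hypothetical `Token` record holding the fields a dependency parser typically returns (word, part-of-speech tag, head index, relation label), with the labels HED, SBV, and VOB as named in the text. Removing duplicate candidates is a convenience added here, not something the text states.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One parsed word; a hypothetical stand-in for a dependency parser's output."""
    word: str
    pos: str     # part-of-speech tag, e.g. "v", "vi", "vn", "n"
    head: int    # 1-based index of the governing token, 0 for the root
    deprel: str  # dependency label: "HED", "SBV", "VOB", ...


KEPT_POS = {"v", "vi", "vn"}  # verb, intransitive verb, nominal verb


def extract_triggers(tokens: List[Token]) -> List[str]:
    """Return candidate trigger words from one parsed headline."""
    candidates = []
    for tok in tokens:
        if tok.deprel == "HED":
            # Core relation: the token itself is the predicate (V_HED).
            candidates.append(tok)
        elif tok.deprel in ("SBV", "VOB") and tok.head > 0:
            # Subject-verb / verb-object: the governing token is the predicate.
            candidates.append(tokens[tok.head - 1])
    triggers = []
    for cand in candidates:
        # Step (2-2): keep verbal tags only and drop single-character words.
        # Skipping duplicates is an assumption made for this sketch.
        if cand.pos in KEPT_POS and len(cand.word) > 1 and cand.word not in triggers:
            triggers.append(cand.word)
    return triggers


# Hand-written parse of a made-up headline: "公司 收购 股权" (company acquires equity).
headline = [
    Token("公司", "n", 2, "SBV"),
    Token("收购", "v", 0, "HED"),
    Token("股权", "n", 2, "VOB"),
]
print(extract_triggers(headline))
```

All three relations in this toy parse point at the same predicate, so only one candidate survives the filter.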

The event trigger word similarity network of step (3) is expressed as follows:

$G = \langle W, E \rangle$

$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$

where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words, $n$ is the number of trigger words in the network, and $E$ is a symmetric $n \times n$ matrix whose entry $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$; similarities below 0.3 are set to 0. $\mathrm{vec}_i$ is the word vector obtained for $w_i$ with a word2vec model, and $\mathrm{vec}_j$ is the word vector for $w_j$.
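The construction of the matrix E can be sketched in plain Python. The toy two-dimensional vectors below stand in for word2vec embeddings and are made up for illustration; the function names are hypothetical. Leaving the diagonal at 0 (no self-similarity edges) is an assumption of this sketch, since the text does not say how the diagonal is treated.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_network(
    vectors: Dict[str, List[float]], threshold: float = 0.3
) -> Tuple[List[str], List[List[float]]]:
    """Build the symmetric matrix E; similarities below the threshold become 0."""
    words = list(vectors)
    n = len(words)
    E = [[0.0] * n for _ in range(n)]  # diagonal stays 0: an assumption of this sketch
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[words[i]], vectors[words[j]])
            if s >= threshold:
                E[i][j] = E[j][i] = s
    return words, E


# Made-up vectors standing in for word2vec output.
toy_vectors = {
    "收购": [1.0, 0.2],
    "并购": [0.9, 0.3],
    "辞职": [-0.1, 1.0],
}
words, E = similarity_network(toy_vectors)
for word, row in zip(words, E):
    print(word, [round(x, 3) for x in row])
```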

In step (4), the event trigger words are clustered with the Louvain community discovery algorithm, specifically as follows:

(4-1) initially, each node in the network G is in its own isolated community;

(4-2) randomly selecting a node i from all the nodes;

(4-3) for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor node j, where the modularity Q is calculated as follows:

where $\gamma$ is a resolution parameter used to flexibly control the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node $i$; $k_j$ is the sum of the weights of all edges connected to node $j$; $A_{i,j}$ is the weight of the edge between node $i$ and node $j$; and $\delta(u, v)$ indicates whether $u$ and $v$ are in the same community, taking the value 1 if they are and 0 otherwise;
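The symbols defined above match the widely used form of modularity with a resolution parameter. As a sketch under that assumption, and not necessarily the exact expression of the original filing, it can be written as follows. Here $m$ (the total edge weight, i.e. half the sum of all entries of $A$) is a symbol introduced for this sketch, and $\delta(C_i, C_j)$ is the same-community indicator applied to the communities of nodes $i$ and $j$.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j),
\qquad
m = \frac{1}{2} \sum_{i,j} A_{i,j},
\qquad
\Delta Q = Q_{\text{after the move}} - Q_{\text{before the move}}
```

Under this form, a larger $\gamma$ penalizes large communities more strongly and tends to yield more, smaller communities.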

(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max}$ is greater than 0, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;

(4-5) when no node can be moved, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node, a super node, in the new network; the edges inside a community are mapped to a self-loop on the super node, whose weight is the sum of the internal edge weights; and the weight of the edge between two super nodes in the new network is the sum of the weights of the edges between the corresponding communities;

(4-6) after the new network is built, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved during an iteration.
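The node-moving phase of steps (4-1) to (4-4) can be sketched as below. This is a deliberately simple illustration, not an efficient implementation: it recomputes the modularity from scratch for every trial move, and it assumes the standard resolution-parameterized modularity, since the formula itself is not reproduced in the text. Visiting the nodes in a shuffled sweep is one way of realizing the random selection in step (4-2). The function names and the toy graph are hypothetical, and the sketch covers the first level only (no self-loops).

```python
import random
from typing import List


def modularity(A: List[List[float]], comm: List[int], gamma: float = 1.0) -> float:
    """Modularity of a partition of a weighted, symmetric adjacency matrix A."""
    n = len(A)
    k = [sum(row) for row in A]  # weighted degree of each node
    two_m = sum(k)               # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if comm[i] == comm[j]:
                q += A[i][j] - gamma * k[i] * k[j] / two_m
    return q / two_m


def local_moving(A: List[List[float]], gamma: float = 1.0, seed: int = 0) -> List[int]:
    """Move nodes between neighboring communities until no move raises modularity."""
    rng = random.Random(seed)
    n = len(A)
    comm = list(range(n))  # (4-1) every node starts in its own community
    moved = True
    while moved:           # stop after a full sweep in which nothing moved
        moved = False
        order = list(range(n))
        rng.shuffle(order)  # (4-2) pick nodes in random order
        for i in order:
            base = modularity(A, comm, gamma)
            best_gain, best_comm = 0.0, comm[i]
            neighbor_comms = {comm[j] for j in range(n) if j != i and A[i][j] > 0}
            for c in sorted(neighbor_comms - {comm[i]}):
                old = comm[i]
                comm[i] = c  # (4-3) trial move into the neighbor's community
                gain = modularity(A, comm, gamma) - base
                comm[i] = old
                if gain > best_gain + 1e-12:  # small margin guards against rounding noise
                    best_gain, best_comm = gain, c
            if best_comm != comm[i]:  # (4-4) accept the best strictly positive gain
                comm[i] = best_comm
                moved = True
    return comm


# Toy graph: two triangles joined by one weak edge between nodes 2 and 3.
A = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    A[i][j] = A[j][i] = w
comm = local_moving(A)
print(comm, round(modularity(A, comm), 3))
```

A production implementation would compute each gain incrementally from the community totals rather than re-evaluating Q, but the accepted moves would be the same.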

Compared with the prior art, the invention has the following outstanding characteristics and advantages:

The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual effort. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves considerable manual effort and facilitates event extraction.

Drawings

FIG. 1 is a flow chart of the method for clustering domain event trigger words based on the Louvain community discovery algorithm according to the present invention.

FIG. 2 is a sample dependency parse according to the present invention.

Detailed Description

The embodiments of the present invention will be further described with reference to the accompanying drawings.

The domain event trigger word clustering method based on the Louvain community discovery algorithm is illustrated with financial-domain events: an arbitrary set of 10,000 news texts published between September 2018 and December 2018 is obtained from the Sina Finance news website, and the event trigger words are clustered. As shown in FIG. 1, the event trigger word clustering method of this embodiment includes the following steps:

S1. Input a financial-domain event text set, for example a set of 10,000 financial news texts.

S2. Extract event trigger words from the event text headlines. Dependency parsing is performed on the news headlines with the dependency parsing tool HanLP; the result is shown in FIG. 2. The core relation (V_HED, HED), the subject-verb relation (Sub, V_SBV), and the verb-object relation (V_VOB, Obj) are extracted from each sentence, and the predicate verbs V_HED, V_SBV, and V_VOB are taken as candidate trigger words. Only the candidates whose part of speech is verb (v), intransitive verb (vi), or nominal verb (vn) are kept, and words of length 1 are removed.

S3. Construct the event trigger word similarity network, which is expressed as follows:

$G = \langle W, E \rangle$

$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$

where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words, $n$ is the number of trigger words in the network, and $E$ is a symmetric $n \times n$ matrix whose entry $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$; similarities below 0.3 are set to 0. $\mathrm{vec}_i$ is the word vector obtained for $w_i$ with a word2vec model, and $\mathrm{vec}_j$ is the word vector for $w_j$.

S4. The event trigger word clustering process based on the Louvain community discovery algorithm is as follows:

S4.1 Initially, each node in the network G is in its own isolated community.

S4.2 Randomly select a node i from all nodes.

S4.3 For node i, find all of its neighbor nodes and, for each neighbor node j, calculate the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor node j. The modularity gain is calculated as follows:

where $\gamma$ is a resolution parameter used to flexibly control the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node $i$; $k_j$ is the sum of the weights of all edges connected to node $j$; and $A_{i,j}$ is the weight of the edge between node $i$ and node $j$.

S4.4 Find the neighbor node j' that yields the maximum modularity gain. If the maximum modularity gain $\Delta Q_{max}$ is greater than 0, let $C_i = C_{j'}$, that is, move node i to the community where node j' is located.

S4.5 When no node can be moved, the current community division is optimal, and the network is aggregated to generate a new network. All nodes in the same community are mapped to a single node, a super node, in the new network; the edges inside a community are mapped to a self-loop on the super node, whose weight is the sum of the internal edge weights; and the weight of the edge between two super nodes in the new network is the sum of the weights of the edges between the corresponding communities.
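The aggregation in S4.5 can be sketched as a sum over community pairs. The function name and the toy graph are hypothetical. One convention should be noted as an assumption of this sketch: because a symmetric adjacency matrix stores each undirected edge in both directions, the diagonal entry of the aggregated matrix comes out as twice the sum of the internal edge weights, which keeps each super node's row sum equal to the total weighted degree of its member nodes.

```python
from typing import Dict, List, Tuple


def aggregate(
    A: List[List[float]], comm: List[int]
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Collapse each community into one super node and sum the edge weights."""
    labels = sorted(set(comm))
    index = {label: s for s, label in enumerate(labels)}  # community label -> super node id
    size = len(labels)
    B = [[0.0] * size for _ in range(size)]
    n = len(A)
    for i in range(n):
        for j in range(n):
            # Off-diagonal entries: total weight between two communities.
            # Diagonal entries: internal weight, counted once per direction.
            B[index[comm[i]]][index[comm[j]]] += A[i][j]
    return B, index


# Toy graph: two triangles joined by one weak edge between nodes 2 and 3.
A = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    A[i][j] = A[j][i] = w
B, index = aggregate(A, [0, 0, 0, 3, 3, 3])
print(B, index)
```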

S4.6 After the new network is built, return to step S4.1 and iterate. The algorithm terminates when no node can be moved during an iteration.

S5. Output the event trigger word clustering result, in which trigger words with similar semantics are clustered into the same community.
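The output of S5 can be sketched as grouping the trigger words by their final community label. The words and labels below are made up for illustration and are not results from the embodiment; `group_clusters` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def group_clusters(words: List[str], comm: List[int]) -> List[List[str]]:
    """Group words that share a community label, in order of first appearance."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for word, label in zip(words, comm):
        clusters[label].append(word)
    return list(clusters.values())


# Made-up trigger words and community labels.
words = ["收购", "并购", "辞职", "离职"]
comm = [0, 0, 2, 2]
for cluster in group_clusters(words, comm):
    print(" ".join(cluster))
```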

Claims

1. A method for clustering domain event trigger words based on the Louvain community discovery algorithm, characterized by comprising the following steps:

(1) inputting a text set in any field;

(2) extracting event trigger words from the text titles;

(3) constructing a trigger word similarity network;

(4) clustering the trigger words based on the Louvain community discovery algorithm;

(5) outputting the clustering result of the event trigger words.

2. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the event trigger word extraction in step (2) proceeds as follows:

(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting the core relation (V_HED, HED), the subject-verb relation (Sub, V_SBV), and the verb-object relation (V_VOB, Obj) in each sentence, and taking the predicate verbs V_HED, V_SBV, and V_VOB as candidate trigger words;

3. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the event trigger word similarity network of step (3) is expressed as follows:

$G = \langle W, E \rangle$

$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$

4. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that in step (4) the event trigger words are clustered with the Louvain community discovery algorithm, specifically as follows:

(4-1) initially, each node in the network G is in its own isolated community;

(4-2) randomly selecting a node i from all the nodes;

where $\gamma$ is a resolution parameter used to flexibly control the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node $i$; $k_j$ is the sum of the weights of all edges connected to node $j$; and $A_{i,j}$ is the weight of the edge between node $i$ and node $j$,