CN110929509B

CN110929509B - Domain event trigger word clustering method based on louvain community discovery algorithm

Info

Publication number: CN110929509B
Application number: CN201910980755.5A
Authority: CN
Inventors: 骆祥峰; 黄敬; 马秀侠
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2023-09-15
Anticipated expiration: 2039-10-16
Also published as: CN110929509A

Abstract

The invention discloses a domain event trigger word clustering method based on a louvain community discovery algorithm, which comprises the following specific steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words based on a louvain community discovery algorithm; and (5) outputting a clustering result of the event trigger words. The method addresses the problem that traditional template-based event extraction requires the event trigger words to be summarized manually in advance, which consumes a large amount of labor time. It automatically groups a large number of semantically similar trigger words, requires neither manual participation nor an annotated corpus, saves a large amount of labor time, and facilitates event extraction.

Description

Domain event trigger word clustering method based on louvain community discovery algorithm

Technical Field

The invention relates to event extraction within information extraction, and in particular to a domain event trigger word clustering method based on a louvain community discovery algorithm.

Background

Automatic Content Extraction (ACE) defines event extraction as consisting mainly of two tasks: identification of event types and extraction of event elements. Because event trigger words are the words that describe the occurrence of an event, they play an important role in judging the event type, so identifying the event type amounts to extracting the event trigger words. The traditional pattern-matching approach to extracting event trigger words requires the trigger words of each event to be summarized manually, which consumes a great deal of labor time. The machine-learning approach converts trigger word extraction into a classification problem; it must be trained on a large-scale corpus, is affected by the scale of that corpus, suffers from serious data sparseness, has low accuracy, and cannot meet industrial requirements.

Disclosure of Invention

The invention aims to overcome the defects of the traditional event trigger word extraction methods, and provides a domain event trigger word clustering method based on a louvain community discovery algorithm.

In order to achieve the above object, the present invention is conceived as follows: an event trigger word is a word that clearly represents the occurrence of an event and is an important lexical unit for determining the category of the event; it is the main predicate verb of a sentence, and many verbs with similar semantics can indicate the same type of event.

According to the inventive idea, the invention adopts the following technical scheme:

A domain event trigger word clustering method based on a louvain community discovery algorithm comprises the following specific steps:

(1) Inputting a text set in any field;

(2) Extracting event trigger words from the text titles;

(3) Constructing a trigger word similarity network;

(4) Clustering trigger words based on a louvain community discovery algorithm;

(5) Outputting a clustering result of the event trigger words.

The event trigger word extraction in the step (2) comprises the following steps:

(2-1) performing dependency syntax parsing on the news headline using the dependency syntax parsing tool HanLP, extracting the core relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;

(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or verbal noun (vn), and removing words whose length is 1.
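Steps (2-1) and (2-2) can be sketched as follows. This is a minimal illustration, not the patented implementation: it does not call HanLP, and instead assumes the parser output has already been converted into a list of token dictionaries with the hypothetical keys `word`, `pos`, `head` and `rel`. The function name `candidate_triggers` and the hand-written parse of the sample headline are likewise assumptions made for the example.

```python
# Part-of-speech tags kept in step (2-2): verb, intransitive verb, verbal noun.
KEPT_POS = {"v", "vi", "vn"}

def candidate_triggers(tokens):
    """Return trigger words for one parsed title.

    tokens: list of dicts with keys
      word - surface form
      pos  - part-of-speech tag
      head - index of the governing token (-1 for the root)
      rel  - dependency relation to the head ("HED", "SBV", "VOB", ...)
    This token layout is a hypothetical stand-in for real parser output.
    """
    candidates = []
    for tok in tokens:
        if tok["rel"] == "HED":
            candidates.append(tok)                  # the core word itself
        elif tok["rel"] in ("SBV", "VOB"):
            candidates.append(tokens[tok["head"]])  # the governing predicate
    result, seen = [], set()
    for tok in candidates:
        word = tok["word"]
        # Step (2-2): keep v / vi / vn only and drop single-character words.
        if tok["pos"] in KEPT_POS and len(word) > 1 and word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Hand-written parse of an illustrative headline (assumed, not parser output).
title = [
    {"word": "央行", "pos": "n", "head": 1, "rel": "SBV"},
    {"word": "下调", "pos": "v", "head": -1, "rel": "HED"},
    {"word": "存款准备金率", "pos": "n", "head": 1, "rel": "VOB"},
]
triggers = candidate_triggers(title)
```

The predicate is reached three times here (as the core word and as the head of both the subject and the object), so the sketch deduplicates while preserving order.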

The event trigger word similarity network of the step (3) is expressed as follows:

$$G = \langle W, E \rangle$$

$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$

wherein $W = \{w_1, w_2, \ldots, w_n\}$ represents the trigger words in the network, n is the number of trigger words, E is an $n \times n$ symmetric matrix, and $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, a similarity of less than 0.3 being set to 0; $\mathrm{vec}_i$ is the word vector obtained by representing $w_i$ with a word2vec model, and $\mathrm{vec}_j$ is the word vector of $w_j$.
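The construction of the similarity matrix E can be sketched in a few lines. The example below assumes the word vectors are already available as plain lists of floats; the toy two-dimensional vectors and the function names `cosine` and `similarity_matrix` are illustrative assumptions, and a real run would take the vectors from a trained word2vec model. Leaving the diagonal at 0 (no self-similarity edges) is also an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors, threshold=0.3):
    """Build the symmetric n x n matrix E; similarities below threshold become 0."""
    n = len(vectors)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                E[i][j] = E[j][i] = s   # keep E symmetric
    return E

# Toy vectors standing in for word2vec output (assumed values).
vectors = [[1.0, 0.0], [0.8, 0.6], [0.2, 0.98]]
E = similarity_matrix(vectors)
```

In this toy case the first and third vectors have a cosine of roughly 0.2, so that entry is zeroed, while the other two pairs stay connected.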

The event trigger words are clustered by using a louvain community discovery algorithm in the step (4), and the specific steps are as follows:

(4-1) initially, each node in the network G being in an isolated community;

(4-2) randomly selecting a node i from all nodes;

(4-3) for node i, finding all its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ where neighbor node j is located, wherein the modularity Q is calculated as follows:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{i,j} - \gamma\,\frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

wherein $\gamma$ is a resolution parameter used for flexibly controlling the number and the scale of the communities; $k_i$ is the sum of the weights of all edges connected with node i; $k_j$ is the sum of the weights of all edges connected with node j; $A_{i,j}$ is the weight of the edge between node i and node j; m is the sum of the weights of all edges in the network; and $\delta(c_i, c_j)$ indicates whether nodes i and j are in the same community, its value being 1 if they are in the same community and 0 otherwise;

(4-4) finding the neighbor node j' that generates the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max} > 0$, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;

(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to one node of the new network, called a supernode; the edges inside a community are mapped to a self-loop of the corresponding supernode, whose weight is the sum of the weights of those intra-community edges; and the weight of the edge between two supernodes in the new network is the sum of the weights of the edges between the corresponding communities;

(4-6) after the new network is constructed, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved within an iteration.
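Steps (4-1) to (4-4) can be sketched as below, assuming the standard resolution-weighted modularity $Q = \frac{1}{2m}\sum_{i,j}[A_{i,j} - \gamma k_i k_j / 2m]\,\delta(c_i, c_j)$. Two simplifications are assumptions of this sketch rather than part of the method: nodes are visited in a fixed order instead of at random, so the run is reproducible, and the gain is obtained by recomputing Q for each trial move instead of using an incremental formula. The function names and the toy graph (two triangles joined by one edge) are illustrative.

```python
def modularity(A, comm, gamma=1.0):
    """Modularity of the partition comm (comm[i] = label of node i) on weight matrix A."""
    n = len(A)
    k = [sum(row) for row in A]   # k_i: total weight of edges at node i
    two_m = sum(k)                # 2m: each edge is counted from both ends
    if two_m == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if comm[i] == comm[j]:
                total += A[i][j] - gamma * k[i] * k[j] / two_m
    return total / two_m

def best_move(A, comm, i, gamma=1.0):
    """Steps (4-3)/(4-4): best neighboring community for node i and its gain."""
    base = modularity(A, comm, gamma)
    best_c, best_gain = comm[i], 0.0
    neighbor_comms = {comm[j] for j, w in enumerate(A[i]) if w > 0 and j != i}
    for c in sorted(neighbor_comms - {comm[i]}):
        trial = list(comm)
        trial[i] = c
        gain = modularity(A, trial, gamma) - base
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

def local_moving(A, gamma=1.0):
    """Repeat single-node moves until no move increases modularity."""
    comm = list(range(len(A)))    # step (4-1): every node in its own community
    moved = True
    while moved:
        moved = False
        for i in range(len(A)):   # fixed order (assumption; the text picks at random)
            c, gain = best_move(A, comm, i, gamma)
            if gain > 0:
                comm[i] = c
                moved = True
    return comm

# Toy network: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = [[0.0] * 6 for _ in range(6)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u][v] = A[v][u] = 1.0
labels = local_moving(A)
```

On this toy network the loop settles on the two triangles as communities. Recomputing Q for every trial move is quadratic in the number of nodes and is only meant to make the definition visible; practical implementations update the gain incrementally.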

Compared with the prior art, the invention has the following outstanding characteristics and advantages:

The method addresses the problem that traditional template-based event extraction requires the event trigger words to be summarized manually in advance, which consumes a large amount of labor time. It automatically groups a large number of semantically similar trigger words, requires neither manual participation nor an annotated corpus, saves a large amount of labor time, and facilitates event extraction.

Drawings

FIG. 1 is a flow chart of a domain event trigger word clustering method based on a louvain community discovery algorithm of the invention.

FIG. 2 is a dependency syntax parsing sample of the present invention.

Detailed Description

Embodiments of the present invention are further described below with reference to the accompanying drawings.

In this embodiment of the domain event trigger word clustering method based on the louvain community discovery algorithm, financial-domain events are taken as an example: a set of 10,000 news texts published between September 2018 and December 2018 is obtained from the Sina Finance news website, and the event trigger words are clustered. As shown in FIG. 1, the event trigger word clustering method based on the louvain community discovery algorithm in this embodiment includes the following steps:

S1, inputting a financial-domain event text set, for example a set of 10,000 news texts in the financial domain.

S2, extracting event trigger words from the event text titles: dependency syntax parsing is performed on the news titles using the dependency syntax parsing tool HanLP, with the effect shown in FIG. 2. The core relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj) are extracted, and the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ are taken as candidate trigger words. Only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or verbal noun (vn) are retained, and words of length 1 are removed.

S3, constructing the event trigger word similarity network, which is expressed as follows:

$$G = \langle W, E \rangle$$

$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$

S4, event trigger word clustering process based on a louvain community discovery algorithm is as follows:

S4.1, initially, each node in the network G is in an isolated community;

S4.2, randomly selecting a node i from all nodes;

S4.3, for node i, finding all its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ where neighbor node j is located. The modularity Q is calculated as follows:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{i,j} - \gamma\,\frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

wherein $\gamma$ is a resolution parameter used for flexibly controlling the number and the scale of the communities; $k_i$ is the sum of the weights of all edges connected with node i; $k_j$ is the sum of the weights of all edges connected with node j; $A_{i,j}$ is the weight of the edge between node i and node j; m is the sum of the weights of all edges in the network; and $\delta(c_i, c_j)$ indicates whether nodes i and j are in the same community, its value being 1 if they are in the same community and 0 otherwise;

S4.4, finding the neighbor node j' that generates the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max} > 0$, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;

S4.5, when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network. All nodes in the same community are mapped to one node of the new network, called a supernode; the edges inside a community are mapped to a self-loop of the corresponding supernode, whose weight is the sum of the weights of those intra-community edges; and the weight of the edge between two supernodes in the new network is the sum of the weights of the edges between the corresponding communities;

S4.6, after the new network is constructed, returning to step S4.1 and iterating. The algorithm terminates when no node can be moved within an iteration.

S5, outputting event trigger word clustering results, wherein trigger words with similar semantics are clustered into a community.
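The aggregation described in step S4.5 can be sketched as follows. The representation is an assumption of this example: the network is a dictionary that maps each undirected edge `(u, v)` to its weight, with every edge listed once, and `comm` maps each node to its community label. The function name `aggregate` and the toy data are illustrative.

```python
from collections import defaultdict

def aggregate(edges, comm):
    """Collapse each community into a supernode.

    edges: {(u, v): weight}, each undirected edge listed once.
    comm:  {node: community label}.
    Returns the edge dict of the new network, keyed by supernode indices;
    a key (c, c) is the self-loop holding the intra-community weight.
    """
    # Renumber communities 0, 1, 2, ... so that supernodes have compact ids.
    relabel = {c: idx for idx, c in enumerate(sorted(set(comm.values())))}
    coarse = defaultdict(float)
    for (u, v), w in edges.items():
        cu, cv = relabel[comm[u]], relabel[comm[v]]
        coarse[(min(cu, cv), max(cu, cv))] += w
    return dict(coarse)

# Toy network: two unit-weight triangles joined by one weaker edge.
edges = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
    (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
    (2, 3): 0.5,
}
comm = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
coarse = aggregate(edges, comm)
```

Each triangle becomes one supernode whose self-loop carries the summed weight of its three internal edges, and the single bridging edge becomes the edge between the two supernodes.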

Claims

1. A domain event trigger word clustering method based on a louvain community discovery algorithm, characterized in that the method comprises the following specific steps:

(1) Inputting a text set in any field;

(2) Extracting event trigger words from the text titles;

(3) Constructing a trigger word similarity network;

(4) Clustering trigger words based on a louvain community discovery algorithm;

(5) Outputting a clustering result of the event trigger words;

(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or verbal noun (vn), and removing words whose length is 1;

$$G = \langle W, E \rangle$$

$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$

wherein $W = \{w_1, w_2, \ldots, w_n\}$ represents the trigger words in the network, n is the number of trigger words, E is an $n \times n$ symmetric matrix, and $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, a similarity of less than 0.3 being set to 0; $\mathrm{vec}_i$ is the word vector obtained by representing $w_i$ with a word2vec model, and $\mathrm{vec}_j$ is the word vector of $w_j$;

(4-1) initially, each node in the network G being in an isolated community;

(4-2) randomly selecting a node i from all nodes;