CN110929509A - Louvain community discovery algorithm-based field event trigger word clustering method - Google Patents

Info

Publication number
CN110929509A
Authority
CN
China
Prior art keywords
node
community
trigger words
event
event trigger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910980755.5A
Other languages
Chinese (zh)
Other versions
CN110929509B (en)
Inventor
骆祥峰
黄敬
马秀侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Transpacific Technology Development Ltd
Original Assignee
Beijing Transpacific Technology Development Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Transpacific Technology Development Ltd
Priority to CN201910980755.5A
Publication of CN110929509A
Application granted
Publication of CN110929509B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for clustering domain event trigger words based on the Louvain community discovery algorithm, comprising the following steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words with the Louvain community discovery algorithm; and (5) outputting the clustering result of the event trigger words. The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual labor. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves labor and facilitates event extraction.

Description

Louvain community discovery algorithm-based field event trigger word clustering method
Technical Field
The invention relates to event extraction within the field of information extraction, and in particular to a method for clustering domain event trigger words based on the Louvain community discovery algorithm.
Background
As defined by the Automatic Content Extraction (ACE) program, event extraction mainly comprises two tasks: identifying event types and extracting event elements. An event trigger word is a word that describes the occurrence of an event and plays an important role in determining the event type, so identifying the event type amounts to extracting event trigger words. Traditional trigger word extraction based on pattern matching requires the trigger words of each event to be summarized manually, which consumes a large amount of labor. Machine-learning methods convert trigger word extraction into a classification problem; they rely on a large-scale corpus for training and are limited by corpus size, suffer from severe data sparsity, and have low accuracy, so they cannot meet industrial requirements.
Disclosure of Invention
The invention aims to provide a method for clustering domain event trigger words based on the Louvain community discovery algorithm, addressing the shortcomings of traditional event trigger word extraction methods.
To achieve this object, the invention is conceived as follows: an event trigger word is a word that clearly indicates the occurrence of an event and is an important lexical unit for determining the event category; it is the main verb of a sentence, and many verbs with similar semantics can indicate the same type of event.
According to the above inventive idea, the invention adopts the following technical scheme:
A method for clustering domain event trigger words based on the Louvain community discovery algorithm comprises the following steps:
(1) inputting a text set from any domain;
(2) extracting event trigger words from the text titles;
(3) constructing a trigger word similarity network;
(4) clustering the trigger words with the Louvain community discovery algorithm;
(5) outputting the clustering result of the event trigger words.
The extraction of event trigger words in step (2) proceeds as follows:
(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting the core (head) relation $(V_{HED}, HED)$, the subject-verb relation $(Sub, V_{SBV})$ and the verb-object relation $(V_{VOB}, Obj)$ in each sentence, and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1.
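The candidate selection and filtering in steps (2-1) and (2-2) can be sketched as follows. This is a minimal illustration that assumes the parser output has already been converted into tokens carrying a head index and a dependency label; the `Token` layout, the function name `extract_trigger_candidates`, the label strings "HED", "SBV" and "VOB", and the example parse are all hypothetical, and the HanLP call itself is not shown.

```python
from typing import NamedTuple

class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    word: str
    pos: str      # part-of-speech tag, e.g. "v", "vi", "vn", "n"
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency label (assumed abbreviations: HED, SBV, VOB)

VERB_TAGS = {"v", "vi", "vn"}

def extract_trigger_candidates(tokens):
    """Collect V_HED, V_SBV and V_VOB, then apply the step (2-2) filter."""
    by_idx = {t.idx: t for t in tokens}
    candidates = []
    for t in tokens:
        if t.deprel == "HED":
            candidates.append(t)               # V_HED: the head word itself
        elif t.deprel in ("SBV", "VOB") and t.head in by_idx:
            candidates.append(by_idx[t.head])  # V_SBV / V_VOB: governing predicate
    seen, result = set(), []
    for c in candidates:
        # Keep verbs only, drop single-character words, avoid duplicates.
        if c.pos in VERB_TAGS and len(c.word) > 1 and c.idx not in seen:
            seen.add(c.idx)
            result.append(c.word)
    return result

# Hypothetical parse of the headline "央行宣布下调利率"
tokens = [
    Token(1, "央行", "n", 2, "SBV"),
    Token(2, "宣布", "v", 0, "HED"),
    Token(3, "下调", "v", 2, "VOB"),
    Token(4, "利率", "n", 3, "VOB"),
]
print(extract_trigger_candidates(tokens))  # ['宣布', '下调']
```

Keeping the parser behind a plain token structure means the filter can be tested without the parsing tool installed.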
The event trigger word similarity network of the step (3) is expressed as follows:
$$G = \langle W, E \rangle$$
$$E = \left[ e_{i,j} \right]_{n \times n}$$
$$e_{i,j} = \cos(vec_i, vec_j)$$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words, n is the number of trigger words in the network, E is a symmetric n × n matrix, and $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities below 0.3 set to 0; $vec_i$ is the word vector of $w_i$ produced by a word2vec model, and $vec_j$ is the word vector of $w_j$.
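A minimal sketch of building the thresholded similarity matrix E from word vectors is shown below. The function names and the toy two-dimensional vectors are illustrative assumptions (real vectors would come from a trained word2vec model), and leaving the diagonal at 0 so that the network has no self-loops is also an assumption, since the text does not say how $e_{i,i}$ is treated.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_similarity_matrix(vectors, threshold=0.3):
    """Symmetric n x n matrix E; similarities below the threshold become 0."""
    n = len(vectors)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                E[i][j] = E[j][i] = s   # keep the matrix symmetric
    return E

# Toy vectors standing in for word2vec embeddings of three trigger words.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.1, 1.0]]
E = build_similarity_matrix(vectors)
for row in E:
    print([round(x, 3) for x in row])
```

Only the upper triangle is computed and then mirrored, which guarantees the symmetry that the definition of E requires.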
The clustering of event trigger words with the Louvain community discovery algorithm in step (4) comprises the following steps:
(4-1) initially, each node in the network G is in its own isolated community;
(4-2) randomly selecting a node i from all the nodes;
(4-3) for node i, finding all of its neighbor nodes and calculating, for each neighbor node j, the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor node j, where the modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$
where $\gamma$ is a resolution parameter that flexibly controls the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node i, $k_j$ is the sum of the weights of all edges connected to node j, $A_{i,j}$ is the weight of the edge between node i and node j, $m = \frac{1}{2} \sum_{i,j} A_{i,j}$, and $\delta(u, v)$ indicates whether u and v are in the same community: its value is 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max}$ is greater than 0, setting $C_i = C_{j'}$, that is, moving node i into the community of node j';
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node, called a super node, in the new network; the edges inside a community are mapped to a self-loop on the corresponding super node, whose weight is the sum of the weights of those internal edges; and the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network is built, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved during an entire iteration.
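As a hedged illustration of steps (4-3) and (4-4), assume the modularity takes the standard resolution-weighted form $Q = \frac{1}{2m}\sum_{i,j}[A_{i,j} - \gamma k_i k_j/(2m)]\,\delta(C_i, C_j)$ with $m = \frac{1}{2}\sum_{i,j} A_{i,j}$. Under that assumption, moving an isolated node i into a community C changes Q only through the pairs (i, j) and (j, i) with j in C, which gives a closed form for the gain. The symbols $k_{i,C}$ and $\Sigma_C$ and the toy graph are introduced here for illustration only.

```latex
\Delta Q = \frac{1}{m}\left[ k_{i,C} - \gamma\,\frac{k_i\,\Sigma_C}{2m} \right],
\qquad k_{i,C} = \sum_{j \in C} A_{i,j}, \quad \Sigma_C = \sum_{j \in C} k_j .

% Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3,
% unit weights, \gamma = 1, m = 7. Node 2 (k_2 = 3) is isolated and its
% candidate communities are {0,1} and {3}.
\Delta Q_{2 \to \{0,1\}} = \frac{1}{7}\left[ 2 - \frac{3 \cdot 4}{14} \right] = \frac{8}{49} \approx 0.163,
\qquad
\Delta Q_{2 \to \{3\}} = \frac{1}{7}\left[ 1 - \frac{3 \cdot 3}{14} \right] = \frac{5}{98} \approx 0.051 .
```

Both gains are positive, so the rule in step (4-4) would move node 2 into the community {0,1}, which has the larger gain.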
Compared with the prior art, the invention has the following outstanding characteristics and advantages:
The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual labor. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves labor and facilitates event extraction.
Drawings
FIG. 1 is a flow chart of the method for clustering domain event trigger words based on the Louvain community discovery algorithm according to the present invention.
FIG. 2 is a sample dependency parsing result of the present invention.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Taking events in the financial domain as an example, the method for clustering domain event trigger words based on the Louvain community discovery algorithm is applied to 10,000 arbitrarily selected news texts published between September 2018 and December 2018 on the Sina Finance news website. As shown in FIG. 1, the event trigger word clustering method of this embodiment includes the following steps:
S1. Input a text set of financial-domain events, for example 10,000 financial news texts.
S2. Extract event trigger words from the text headlines. Dependency parsing is performed on the news headlines with the dependency parsing tool HanLP; an example result is shown in FIG. 2. The core (head) relation $(V_{HED}, HED)$, the subject-verb relation $(Sub, V_{SBV})$ and the verb-object relation $(V_{VOB}, Obj)$ are extracted from each sentence, and the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ are taken as candidate trigger words. Only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn) are retained, and words of length 1 are removed.
S3. Construct the event trigger word similarity network, which is expressed as follows:
$$G = \langle W, E \rangle$$
$$E = \left[ e_{i,j} \right]_{n \times n}$$
$$e_{i,j} = \cos(vec_i, vec_j)$$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words, n is the number of trigger words in the network, E is a symmetric n × n matrix, and $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities below 0.3 set to 0; $vec_i$ is the word vector of $w_i$ produced by a word2vec model, and $vec_j$ is the word vector of $w_j$.
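A small worked example of the thresholding rule follows, using made-up three-dimensional vectors rather than real word2vec outputs; the numbers are purely illustrative and only the off-diagonal entries are shown.

```latex
vec_1 = (1, 2, 2), \quad vec_2 = (2, 1, 2), \quad vec_3 = (1, -2, 2),
\qquad \|vec_1\| = \|vec_2\| = \|vec_3\| = 3

\cos(vec_1, vec_2) = \frac{2 + 2 + 4}{3 \cdot 3} = \frac{8}{9} \approx 0.889 \ge 0.3
\;\Rightarrow\; e_{1,2} = e_{2,1} = \frac{8}{9}

\cos(vec_1, vec_3) = \frac{1 - 4 + 4}{3 \cdot 3} = \frac{1}{9} \approx 0.111 < 0.3
\;\Rightarrow\; e_{1,3} = e_{3,1} = 0

\cos(vec_2, vec_3) = \frac{2 - 2 + 4}{3 \cdot 3} = \frac{4}{9} \approx 0.444 \ge 0.3
\;\Rightarrow\; e_{2,3} = e_{3,2} = \frac{4}{9}
```

In the resulting network $w_1$ and $w_3$ are not directly connected, but both are linked to $w_2$.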
S4. Cluster the event trigger words with the Louvain community discovery algorithm as follows:
S4.1 Initially, each node in the network G is in its own isolated community;
S4.2 Randomly select a node i from all the nodes;
S4.3 For node i, find all of its neighbor nodes and calculate, for each neighbor node j, the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor node j, where the modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$
where $\gamma$ is a resolution parameter that flexibly controls the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node i, $k_j$ is the sum of the weights of all edges connected to node j, $A_{i,j}$ is the weight of the edge between node i and node j, $m = \frac{1}{2} \sum_{i,j} A_{i,j}$, and $\delta(u, v)$ indicates whether u and v are in the same community: its value is 1 if they are and 0 otherwise;
S4.4 Find the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max}$ is greater than 0, set $C_i = C_{j'}$, that is, move node i into the community of node j';
S4.5 When no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network. All nodes in the same community are mapped to a single node, called a super node, in the new network; the edges inside a community are mapped to a self-loop on the corresponding super node, whose weight is the sum of the weights of those internal edges; and the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
S4.6 After the new network is built, return to step S4.1 and iterate. The algorithm terminates when no node can be moved during an entire iteration.
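Steps S4.1 to S4.6 can be sketched in plain Python as below. This is an unoptimized illustration, not the patented implementation: it assumes the standard resolution-weighted modularity $Q = \frac{1}{2m}\sum_{i,j}[A_{i,j} - \gamma k_i k_j/(2m)]\,\delta(C_i, C_j)$, recomputes Q from scratch to obtain each gain, and visits the nodes in a shuffled order instead of drawing one node at a time. All function names and the toy graph are hypothetical.

```python
import random

def modularity(A, comm, gamma=1.0):
    """Assumed form: Q = (1/2m) * sum_ij [A_ij - gamma*k_i*k_j/(2m)] * delta."""
    k = [sum(row) for row in A]
    two_m = sum(k)
    if two_m == 0:
        return 0.0
    q = sum(A[i][j] - gamma * k[i] * k[j] / two_m
            for i in range(len(A)) for j in range(len(A))
            if comm[i] == comm[j])
    return q / two_m

def local_moving(A, gamma=1.0, seed=0):
    rng = random.Random(seed)
    n = len(A)
    comm = list(range(n))                  # S4.1: every node starts alone
    moved = True
    while moved:
        moved = False
        order = list(range(n))
        rng.shuffle(order)                 # S4.2: random visiting order
        for i in order:
            base = modularity(A, comm, gamma)
            best_gain, best_c, old = 0.0, comm[i], comm[i]
            for c in {comm[j] for j in range(n) if j != i and A[i][j] > 0}:
                if c == old:
                    continue
                comm[i] = c                # S4.3: try the neighbour's community
                gain = modularity(A, comm, gamma) - base
                if gain > best_gain + 1e-12:
                    best_gain, best_c = gain, c
            comm[i] = best_c               # S4.4: move only if the gain is positive
            moved = moved or best_c != old
    return comm

def aggregate(A, comm):
    """S4.5: one super node per community. In matrix form the diagonal holds
    each internal edge twice, so super node degrees and 2m are preserved."""
    index = {c: s for s, c in enumerate(sorted(set(comm)))}
    B = [[0.0] * len(index) for _ in index]
    for i, row in enumerate(A):
        for j, w in enumerate(row):
            B[index[comm[i]]][index[comm[j]]] += w
    return B, [index[c] for c in comm]

def louvain(A, gamma=1.0, seed=0):
    membership = list(range(len(A)))
    while True:
        comm = local_moving(A, gamma, seed)
        if len(set(comm)) == len(A):       # S4.6: no node moved, so stop
            return membership
        A, relabeled = aggregate(A, comm)
        membership = [relabeled[m] for m in membership]

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0
print(louvain(A))  # [0, 0, 0, 1, 1, 1]
```

Recomputing Q for every trial move costs O(n²) per trial, which is acceptable for a sketch; a practical implementation would update the gain incrementally from per-community totals.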
S5. Output the event trigger word clustering result, in which trigger words with similar semantics are clustered into the same community.

Claims (4)

1. A method for clustering domain event trigger words based on the Louvain community discovery algorithm, characterized by comprising the following steps:
(1) inputting a text set from any domain;
(2) extracting event trigger words from the text titles;
(3) constructing a trigger word similarity network;
(4) clustering the trigger words with the Louvain community discovery algorithm;
(5) outputting the clustering result of the event trigger words.
2. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the extraction of event trigger words in step (2) proceeds as follows:
(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting the core (head) relation $(V_{HED}, HED)$, the subject-verb relation $(Sub, V_{SBV})$ and the verb-object relation $(V_{VOB}, Obj)$ in each sentence, and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1.
3. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the event trigger word similarity network of step (3) is expressed as follows:
$$G = \langle W, E \rangle$$
$$E = \left[ e_{i,j} \right]_{n \times n}$$
$$e_{i,j} = \cos(vec_i, vec_j)$$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words, n is the number of trigger words in the network, E is a symmetric n × n matrix, and $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities below 0.3 set to 0; $vec_i$ is the word vector of $w_i$ produced by a word2vec model, and $vec_j$ is the word vector of $w_j$.
4. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the clustering of event trigger words with the Louvain community discovery algorithm in step (4) comprises the following steps:
(4-1) initially, each node in the network G is in its own isolated community;
(4-2) randomly selecting a node i from all the nodes;
(4-3) for node i, finding all of its neighbor nodes and calculating, for each neighbor node j, the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor node j, where the modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$$
where $\gamma$ is a resolution parameter that flexibly controls the number and scale of the communities; $k_i$ is the sum of the weights of all edges connected to node i, $k_j$ is the sum of the weights of all edges connected to node j, $A_{i,j}$ is the weight of the edge between node i and node j, $m = \frac{1}{2} \sum_{i,j} A_{i,j}$, and $\delta(u, v)$ indicates whether u and v are in the same community: its value is 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max}$ is greater than 0, setting $C_i = C_{j'}$, that is, moving node i into the community of node j';
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node, called a super node, in the new network; the edges inside a community are mapped to a self-loop on the corresponding super node, whose weight is the sum of the weights of those internal edges; and the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network is built, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved during an entire iteration.
CN201910980755.5A 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm Active CN110929509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910980755.5A CN110929509B (en) 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910980755.5A CN110929509B (en) 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm

Publications (2)

Publication Number Publication Date
CN110929509A (en) 2020-03-27
CN110929509B (en) 2023-09-15

Family

ID=69848930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910980755.5A Active CN110929509B (en) 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm

Country Status (1)

Country Link
CN (1) CN110929509B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745258A (en) * 2013-09-12 2014-04-23 北京工业大学 Minimal spanning tree-based clustering genetic algorithm complex web community mining method
CN108509551A (en) * 2018-03-19 2018-09-07 西北大学 A kind of micro blog network key user digging system under the environment based on Spark and method
CN108509607A (en) * 2018-04-03 2018-09-07 三盟科技股份有限公司 A kind of community discovery method and system based on Louvain algorithms
CN108681936A (en) * 2018-04-26 2018-10-19 浙江邦盛科技有限公司 A kind of fraud clique recognition methods propagated based on modularity and balance label

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
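Wang Hongbin (王红斌) et al.: "Research on event recognition based on word2vec and dependency analysis" (基于word2vec和依存分析的事件识别研究), Software (《软件》) *
Xiao Xue (肖雪) et al.: "Community division of citation networks based on sample weighting" (基于样本加权的引文网络的社团划分), Library and Information Service (《图书情报工作》) *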

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395860A (en) * 2020-11-27 2021-02-23 山东省计算中心(国家超级计算济南中心) Large-scale parallel policy data knowledge extraction method and system
CN112632280A (en) * 2020-12-28 2021-04-09 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium
WO2022142025A1 (en) * 2020-12-28 2022-07-07 平安科技(深圳)有限公司 Text classification method and apparatus, and terminal device and storage medium

Also Published As

Publication number Publication date
CN110929509B (en) 2023-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant