CN110929509B - Domain event trigger word clustering method based on louvain community discovery algorithm - Google Patents


Info

Publication number
CN110929509B
Authority
CN
China
Prior art keywords
node
community
event trigger
event
trigger word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910980755.5A
Other languages
Chinese (zh)
Other versions
CN110929509A (en)
Inventor
骆祥峰
黄敬
马秀侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a domain event trigger word clustering method based on the Louvain community discovery algorithm, comprising the following steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words with the Louvain community discovery algorithm; and (5) outputting the clustering result of the event trigger words. The method addresses the drawback of traditional template-based event extraction, in which event trigger words must be summarized manually in advance at a large cost in labor time. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves considerable labor time and facilitates event extraction.

Description

Domain event trigger word clustering method based on louvain community discovery algorithm
Technical Field
The invention relates to event extraction within the field of information extraction, and in particular to a domain event trigger word clustering method based on the Louvain community discovery algorithm.
Background
Automatic Content Extraction (ACE) defines event extraction as comprising two main tasks: identification of event types and extraction of event elements. Because event trigger words are the words that describe the occurrence of an event, they play an important role in determining the event type, so identifying the event type amounts to extracting event trigger words. Traditional pattern-matching extraction of event trigger words requires the trigger words that characterize events to be summarized manually, which consumes a great deal of labor time. Machine-learning methods convert trigger word extraction into a classification problem; they must be trained on a large-scale corpus, are limited by the size of that corpus, suffer from serious data sparseness, and have low accuracy, so they cannot meet industrial requirements.
Disclosure of Invention
The invention aims to overcome the shortcomings of traditional event trigger word extraction methods by providing a domain event trigger word clustering method based on the Louvain community discovery algorithm.
To achieve this object, the invention is conceived as follows: an event trigger word is a word that clearly indicates the occurrence of an event and is an important lexical unit for determining the event category; it is the main predicate verb of a sentence, and many verbs with similar semantics can indicate the same type of event.
According to the inventive idea, the invention adopts the following technical scheme:
A domain event trigger word clustering method based on the Louvain community discovery algorithm comprises the following steps:
(1) Inputting a text set from any domain;
(2) Extracting event trigger words from the text titles;
(3) Constructing a trigger word similarity network;
(4) Clustering the trigger words with the Louvain community discovery algorithm;
(5) Outputting the clustering result of the event trigger words.
The event trigger word extraction of step (2) comprises the following steps:
(2-1) performing dependency syntax parsing on the news headline with the dependency parsing tool HanLP, extracting the core (head) relation (HED, $V_{HED}$), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1.
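The filtering in steps (2-1) and (2-2) can be sketched as follows. This is a minimal illustration that assumes the headline has already been parsed; the `Token` record, the `candidate_triggers` function and the sample headline are hypothetical and do not reproduce HanLP's actual API or output format.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str    # surface form of the token
    pos: str     # part-of-speech tag, e.g. "v", "vi", "vn", "n"
    deprel: str  # dependency relation to the head, e.g. "HED", "SBV", "VOB"
    head: int    # index of the head token, -1 for the root

KEPT_POS = {"v", "vi", "vn"}  # tags retained in step (2-2)

def candidate_triggers(tokens: List[Token]) -> List[str]:
    """Collect V_HED, V_SBV and V_VOB, then apply the POS and length filters."""
    picked = []
    for idx, tok in enumerate(tokens):
        if tok.deprel == "HED":
            picked.append(idx)        # V_HED: the core word itself
        elif tok.deprel in ("SBV", "VOB") and tok.head >= 0:
            picked.append(tok.head)   # V_SBV / V_VOB: the governing predicate
    seen, result = set(), []
    for idx in picked:
        tok = tokens[idx]
        if tok.pos in KEPT_POS and len(tok.word) > 1 and tok.word not in seen:
            seen.add(tok.word)
            result.append(tok.word)
    return result

# Hypothetical parse of the headline "公司收购资产" (company acquires assets).
headline = [
    Token("公司", "n", "SBV", 1),
    Token("收购", "v", "HED", -1),
    Token("资产", "n", "VOB", 1),
]
triggers = candidate_triggers(headline)
```

In this toy parse all three relations point at the same predicate, so it is kept once; a single-character predicate would be dropped by the length filter.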
The event trigger word similarity network of the step (3) is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words and $n$ is the number of trigger words in the network; $E$ is an $n \times n$ symmetric matrix whose entry $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity below 0.3 set to 0; $vec_i$ and $vec_j$ are the word vectors of $w_i$ and $w_j$ obtained with a word2vec model.
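As an illustration of how such a network could be built, the sketch below computes the thresholded cosine similarity matrix in plain Python. The function names, the toy two-dimensional vectors, and the choice to leave the diagonal at 0 (no self-similarity edges) are assumptions made for the example; in practice the vectors would come from a trained word2vec model.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_network(vectors: Dict[str, Sequence[float]],
                       threshold: float = 0.3) -> Tuple[List[str], List[List[float]]]:
    """Return the word list W and the symmetric n x n matrix E."""
    words = list(vectors)
    n = len(words)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[words[i]], vectors[words[j]])
            if s >= threshold:  # similarities below the threshold stay 0
                E[i][j] = E[j][i] = s
    return words, E

# Toy vectors standing in for word2vec embeddings (assumed values).
toy_vectors = {
    "acquire": [1.0, 0.0],
    "purchase": [0.9, 0.1],
    "resign": [0.0, 1.0],
}
W, E = similarity_network(toy_vectors)
```

Only the pair ("acquire", "purchase") exceeds the 0.3 threshold here, so it is the only edge in the resulting network.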
In step (4) the event trigger words are clustered with the Louvain community discovery algorithm, with the following specific steps:
(4-1) initially, placing each node of the network G in its own isolated community;
(4-2) randomly selecting a node i from all nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ in which neighbor j is located, where the modularity Q is calculated as follows:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$
where $\gamma$ is a resolution parameter used to flexibly control the number and size of the communities; $m$ is the sum of the weights of all edges in the network; $k_i$ is the sum of the weights of all edges connected to node i, and $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(C_i, C_j)$ indicates whether nodes i and j are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain and, if the maximum modularity gain $\Delta Q_{max} > 0$, setting $C_i = C_{j'}$, that is, moving node i to the community in which node j' is located;
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes of the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the supernode, whose weight is the sum of the weights of those intra-community edges; and the weight of the edge between two supernodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network has been constructed, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved in an iteration.
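Steps (4-1) to (4-4) can be sketched as below. This is a deliberately naive illustration, not the patent's implementation: it assumes the standard resolution-weighted modularity, a symmetric dict-of-dicts adjacency without self-loops, and it recomputes Q from scratch for every trial move instead of using an incremental gain formula. The names `modularity` and `local_moving` and the toy graph are hypothetical.

```python
import random
from typing import Dict, Hashable

Adj = Dict[Hashable, Dict[Hashable, float]]

def modularity(adj: Adj, community: Dict[Hashable, Hashable],
               gamma: float = 1.0) -> float:
    """Q = 1/(2m) * sum_ij [A_ij - gamma*k_i*k_j/(2m)] * delta(C_i, C_j)."""
    degree = {i: sum(nbrs.values()) for i, nbrs in adj.items()}
    two_m = sum(degree.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                q += adj[i].get(j, 0.0) - gamma * degree[i] * degree[j] / two_m
    return q / two_m

def local_moving(adj: Adj, gamma: float = 1.0, seed: int = 0):
    """Move nodes to the neighboring community with the largest positive gain."""
    rng = random.Random(seed)
    community = {i: i for i in adj}          # (4-1) one community per node
    nodes = list(adj)
    moved = True
    while moved:                             # stop once a full pass moves nothing
        moved = False
        rng.shuffle(nodes)                   # (4-2) visit nodes in random order
        for i in nodes:
            original = community[i]
            base = modularity(adj, community, gamma)
            best_gain, best_c = 0.0, original
            # (4-3) try the community of every neighbor of i
            for c in sorted({community[j] for j in adj[i] if j != i}):
                if c == original:
                    continue
                community[i] = c
                gain = modularity(adj, community, gamma) - base
                if gain > best_gain + 1e-12:
                    best_gain, best_c = gain, c
            community[i] = best_c            # (4-4) keep the best move, if any
            if best_c != original:
                moved = True
    return community

# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj: Adj = {n: {} for n in range(6)}
for a, b in edges:
    toy_adj[a][b] = toy_adj[b][a] = 1.0
partition = local_moving(toy_adj)
```

A production implementation would compute the gain incrementally from the weights linking node i to each candidate community, which avoids the quadratic cost of re-evaluating Q.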
Compared with the prior art, the invention has the following outstanding characteristics and advantages:
The method addresses the drawback of traditional template-based event extraction, in which event trigger words must be summarized manually in advance at a large cost in labor time. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves considerable labor time and facilitates event extraction.
Drawings
FIG. 1 is a flow chart of a domain event trigger word clustering method based on a louvain community discovery algorithm of the invention.
FIG. 2 is a dependency syntax parsing sample of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
In this embodiment of the domain event trigger word clustering method based on the Louvain community discovery algorithm, events in the financial domain are taken as an example: an arbitrary set of 10000 news texts published between September 2018 and December 2018 is obtained from the Sina Finance news website, and the event trigger words are clustered. As shown in FIG. 1, the event trigger word clustering method based on the Louvain community discovery algorithm in this embodiment includes the following steps:
S1. Inputting a financial-domain event text set, for example a set of 10000 news texts from the financial domain.
S2. Extracting event trigger words from the event text titles. Dependency syntax parsing is performed on the news headlines with the dependency parsing tool HanLP; a sample result is shown in FIG. 2. The core (head) relation (HED, $V_{HED}$), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj) are extracted, and the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ are taken as candidate trigger words. Only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn) are retained, and words of length 1 are removed.
S3. Constructing the event trigger word similarity network, which is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words and $n$ is the number of trigger words in the network; $E$ is an $n \times n$ symmetric matrix whose entry $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity below 0.3 set to 0; $vec_i$ and $vec_j$ are the word vectors of $w_i$ and $w_j$ obtained with a word2vec model.
S4. Clustering the event trigger words with the Louvain community discovery algorithm, as follows:
S4.1. Initially, each node of the network G is placed in its own isolated community;
S4.2. A node i is randomly selected from all nodes;
S4.3. For node i, all of its neighbor nodes are found and, for each neighbor j, the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ in which neighbor j is located is calculated, where the modularity Q is calculated as follows:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$
where $\gamma$ is a resolution parameter used to flexibly control the number and size of the communities; $m$ is the sum of the weights of all edges in the network; $k_i$ is the sum of the weights of all edges connected to node i, and $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(C_i, C_j)$ indicates whether nodes i and j are in the same community, taking the value 1 if they are and 0 otherwise;
S4.4. The neighbor node j' that yields the maximum modularity gain is found and, if the maximum modularity gain $\Delta Q_{max} > 0$, $C_i = C_{j'}$ is set, that is, node i is moved to the community in which node j' is located;
S4.5. When no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network. All nodes of the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the supernode, whose weight is the sum of the weights of those intra-community edges; and the weight of the edge between two supernodes is the sum of the weights of the edges between the corresponding communities;
S4.6. After the new network has been constructed, the method returns to step S4.1 and iterates. The algorithm terminates when no node can be moved in an iteration.
S5, outputting event trigger word clustering results, wherein trigger words with similar semantics are clustered into a community.
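The aggregation of step S4.5 and the output of step S5 can be sketched as follows. The helper names `aggregate` and `clusters` and the toy partition are hypothetical. The sketch assumes a symmetric dict-of-dicts adjacency in which every undirected edge is stored in both directions, and it stores the self-loop weight as the plain sum of the intra-community edge weights, following the wording of S4.5; a degree computation on the aggregated network would then have to count each self-loop twice.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Adj = Dict[Hashable, Dict[Hashable, float]]

def aggregate(adj: Adj, community: Dict[Hashable, Hashable]) -> Adj:
    """Collapse every community into one supernode (step S4.5)."""
    new_adj = defaultdict(lambda: defaultdict(float))
    for i, nbrs in adj.items():
        ci = community[i]
        row = new_adj[ci]          # ensures edgeless communities still appear
        for j, w in nbrs.items():
            cj = community[j]
            if ci != cj:
                row[cj] += w       # edge between two different supernodes
            elif i == j:
                row[ci] += w       # an existing self-loop is stored once
            else:
                row[ci] += w / 2.0 # intra-community edge is seen from both ends
    return {c: dict(nbrs) for c, nbrs in new_adj.items()}

def clusters(community: Dict[Hashable, Hashable]) -> List[list]:
    """Group nodes (trigger words) by community label (step S5)."""
    groups = defaultdict(list)
    for node, label in community.items():
        groups[label].append(node)
    return list(groups.values())

# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj: Adj = {n: {} for n in range(6)}
for a, b in edges:
    toy_adj[a][b] = toy_adj[b][a] = 1.0

toy_partition = {0: 0, 1: 0, 2: 0, 3: 3, 4: 3, 5: 3}  # assumed result of S4.1 to S4.4
super_adj = aggregate(toy_adj, toy_partition)
groups = clusters(toy_partition)
```

Each triangle becomes a supernode with a self-loop of weight 3 (its three internal edges), and the single bridging edge becomes an edge of weight 1 between the two supernodes.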

Claims (1)

1. A domain event trigger word clustering method based on the Louvain community discovery algorithm, characterized in that the method comprises the following specific steps:
(1) inputting a text set from any domain;
(2) extracting event trigger words from the text titles;
(3) constructing a trigger word similarity network;
(4) clustering the trigger words with the Louvain community discovery algorithm;
(5) outputting the clustering result of the event trigger words;
the event trigger word extraction of step (2) comprises the following steps:
(2-1) performing dependency syntax parsing on the news headline with the dependency parsing tool HanLP, extracting the core (head) relation (HED, $V_{HED}$), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1;
the event trigger word similarity network of the step (3) is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
where $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words and $n$ is the number of trigger words in the network; $E$ is an $n \times n$ symmetric matrix whose entry $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity below 0.3 set to 0; $vec_i$ and $vec_j$ are the word vectors of $w_i$ and $w_j$ obtained with a word2vec model;
in step (4) the event trigger words are clustered with the Louvain community discovery algorithm, with the following specific steps:
(4-1) initially, placing each node of the network G in its own isolated community;
(4-2) randomly selecting a node i from all nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ in which neighbor j is located, where the modularity Q is calculated as follows:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)$
where $\gamma$ is a resolution parameter used to flexibly control the number and size of the communities; $m$ is the sum of the weights of all edges in the network; $k_i$ is the sum of the weights of all edges connected to node i, and $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(C_i, C_j)$ indicates whether nodes i and j are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain and, if the maximum modularity gain $\Delta Q_{max} > 0$, setting $C_i = C_{j'}$, that is, moving node i to the community in which node j' is located;
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes of the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the supernode, whose weight is the sum of the weights of those intra-community edges; and the weight of the edge between two supernodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network has been constructed, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved in an iteration.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910980755.5A CN110929509B (en) 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm

Publications (2)

Publication Number Publication Date
CN110929509A (en) 2020-03-27
CN110929509B (en) 2023-09-15

Family

ID=69848930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910980755.5A Active CN110929509B (en) 2019-10-16 2019-10-16 Domain event trigger word clustering method based on louvain community discovery algorithm

Country Status (1)

Country Link
CN (1) CN110929509B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395860A (en) * 2020-11-27 2021-02-23 山东省计算中心(国家超级计算济南中心) Large-scale parallel policy data knowledge extraction method and system
CN112632280B (en) * 2020-12-28 2022-05-24 平安科技(深圳)有限公司 Text classification method and device, terminal equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745258A (en) * 2013-09-12 2014-04-23 北京工业大学 Minimal spanning tree-based clustering genetic algorithm complex web community mining method
CN108509551A (en) * 2018-03-19 2018-09-07 西北大学 A kind of micro blog network key user digging system under the environment based on Spark and method
CN108509607A (en) * 2018-04-03 2018-09-07 三盟科技股份有限公司 A kind of community discovery method and system based on Louvain algorithms
CN108681936A (en) * 2018-04-26 2018-10-19 浙江邦盛科技有限公司 A kind of fraud clique recognition methods propagated based on modularity and balance label

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
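Research on event recognition based on word2vec and dependency analysis; Wang Hongbin et al.; Software (《软件》); 2017-06-15 (No. 06); full text *
Community division of citation networks based on sample weighting; Xiao Xue et al.; Library and Information Service (《图书情报工作》); 2016-10-20 (No. 20); full text *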

Also Published As

Publication number Publication date
CN110929509A (en) 2020-03-27

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant