CN110929509B - Domain event trigger word clustering method based on louvain community discovery algorithm - Google Patents
- Publication number
- CN110929509B
- Authority
- CN
- China
- Prior art keywords
- node
- community
- event trigger
- event
- trigger word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a domain event trigger word clustering method based on the Louvain community discovery algorithm, comprising the following specific steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words with the Louvain community discovery algorithm; and (5) outputting the clustering result of the event trigger words. The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized manually in advance, which consumes a large amount of labor time. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves labor time and facilitates event extraction.
Description
Technical Field
The invention relates to event extraction within information extraction, and in particular to a domain event trigger word clustering method based on the Louvain community discovery algorithm.
Background
Automatic Content Extraction (ACE) defines event extraction as comprising two main tasks: identifying event types and extracting event elements. Because event trigger words are the words that describe the occurrence of an event, they play an important role in determining the event type, so identifying the event type amounts to extracting the event trigger words. Traditional extraction of event trigger words based on pattern matching requires the trigger words that summarize events to be compiled manually, which consumes a great deal of labor time. Methods based on machine learning convert trigger word extraction into a classification problem; they must be trained on a large-scale corpus and are limited by the corpus size, so they suffer from severe data sparseness and low accuracy and cannot meet industrial requirements.
Disclosure of Invention
The invention aims to overcome the shortcomings of traditional event trigger word extraction methods, and provides a domain event trigger word clustering method based on the Louvain community discovery algorithm.
To achieve the above object, the invention is conceived as follows: an event trigger word is a word that clearly indicates the occurrence of an event and is an important lexical unit for determining the category of the event; it is the main predicate verb of a sentence, and many verbs with similar semantics can indicate the same type of event.
According to this inventive idea, the invention adopts the following technical scheme:
A domain event trigger word clustering method based on the Louvain community discovery algorithm comprises the following specific steps:
(1) Inputting a text set from any domain;
(2) Extracting event trigger words from the text titles;
(3) Constructing a trigger word similarity network;
(4) Clustering the trigger words with the Louvain community discovery algorithm;
(5) Outputting the clustering result of the event trigger words.
The event trigger word extraction in step (2) comprises the following steps:
(2-1) performing dependency syntax parsing on the news headline using the dependency parsing tool HanLP, extracting the core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$), and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$, and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi), or nominal verb (vn), and removing words of length 1.
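Steps (2-1) and (2-2) can be sketched in Python as below. This is a minimal illustration and not the patented implementation: it does not call HanLP, and the token layout (dicts with the keys `id`, `word`, `pos`, `head`, `deprel`) and the function name `extract_trigger_candidates` are hypothetical. Only the relation labels HED, SBV, VOB and the part-of-speech tags v, vi, vn come from the text above.

```python
from typing import Dict, List

# Part-of-speech tags kept by step (2-2): verb, intransitive verb, nominal verb.
KEPT_POS = {"v", "vi", "vn"}


def extract_trigger_candidates(tokens: List[Dict]) -> List[str]:
    """Return the trigger word candidates of one dependency-parsed headline.

    Each token is a dict with hypothetical keys:
    id (1-based), word, pos, head (id of the governor, 0 for the root), deprel.
    """
    by_id = {t["id"]: t for t in tokens}
    candidates = []
    for t in tokens:
        if t["deprel"] == "HED":
            # The sentence root is the core predicate V_HED.
            candidates.append(t)
        elif t["deprel"] in ("SBV", "VOB") and t["head"] in by_id:
            # The governor of a subject or object arc is a predicate verb
            # (V_SBV or V_VOB).
            candidates.append(by_id[t["head"]])

    seen, result = set(), []
    for t in candidates:
        # Step (2-2): keep v / vi / vn only and drop single-character words.
        if t["pos"] in KEPT_POS and len(t["word"]) > 1 and t["word"] not in seen:
            seen.add(t["word"])
            result.append(t["word"])
    return result


# Hand-written parse of a three-word headline, for illustration only.
headline = [
    {"id": 1, "word": "公司", "pos": "n", "head": 2, "deprel": "SBV"},
    {"id": 2, "word": "收购", "pos": "v", "head": 0, "deprel": "HED"},
    {"id": 3, "word": "股权", "pos": "n", "head": 2, "deprel": "VOB"},
]
print(extract_trigger_candidates(headline))
```

The same verb is often reached through all three relations, so the sketch deduplicates candidates while preserving their order.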
The event trigger word similarity network of the step (3) is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words in the network and n is the number of trigger words; E is an n×n symmetric matrix whose element $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity less than 0.3 set to 0; $vec_i$ is the word vector of $w_i$ obtained with a word2vec model, and $vec_j$ is the word vector of $w_j$.
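The construction of the matrix E can be sketched as follows, assuming the word vectors have already been looked up from a trained word2vec model and are passed in as plain lists of floats. The function names and the choice to leave the diagonal at 0 (no self-similarity edge) are assumptions of this sketch; the 0.3 threshold is the one stated above.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_matrix(
    vectors: List[Sequence[float]], threshold: float = 0.3
) -> List[List[float]]:
    """Return the symmetric n x n matrix E with e[i][j] = cos(vec_i, vec_j).

    Similarities below `threshold` are set to 0, so they produce no edge.
    The diagonal is left at 0 (an assumption: no self-loops in the input network).
    """
    n = len(vectors)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                E[i][j] = E[j][i] = s
    return E


# Toy 2-dimensional vectors standing in for word2vec embeddings.
E = build_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each nonzero entry of E is the weight of an edge between two trigger words, so the thresholded matrix can be used directly as the weighted adjacency matrix of G.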
The event trigger words are clustered by using a louvain community discovery algorithm in the step (4), and the specific steps are as follows:
(4-1) initially, each node in the network G being in an isolated community;
(4-2) randomly selecting a node i from all nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ where neighbor node j is located, wherein the modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
wherein γ is a resolution parameter used to flexibly control the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; m is the sum of the weights of all edges in the network; $c_i$ and $c_j$ are the communities of node i and node j; and $\delta(c_i, c_j)$ indicates whether the two nodes are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max} > 0$, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the corresponding supernode, whose weight is the sum of the weights of the intra-community edges; and the weight of the edge between two supernodes in the new network is the sum of the weights of the edges between the two corresponding communities;
(4-6) after the new network is constructed, returning to step (4-1) for iterative computation, until no node can be moved within an iteration, at which point the algorithm terminates.
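As an illustration of the quantity that steps (4-3) and (4-4) try to increase, the sketch below evaluates the modularity of a given partition. It assumes the standard resolution-weighted form $Q = \frac{1}{2m}\sum_{i,j}[A_{i,j} - \gamma k_i k_j / (2m)]\,\delta(c_i, c_j)$, which matches the symbols defined in step (4-3); the function name and the matrix-based input are assumptions of this sketch.

```python
from typing import List


def modularity(A: List[List[float]], community: List[int], gamma: float = 1.0) -> float:
    """Modularity Q of a partition of a weighted undirected network.

    A is a symmetric adjacency matrix, community[i] is the community label
    of node i, and gamma is the resolution parameter.
    """
    n = len(A)
    k = [sum(row) for row in A]   # k_i: total weight of edges at node i
    two_m = sum(k)                # 2m: every edge is counted from both ends
    if two_m == 0:
        return 0.0                # no edges, modularity is defined as 0 here
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - gamma * k[i] * k[j] / two_m
    return q / two_m


# Two disconnected edges: 0-1 and 2-3, each with weight 1.
A = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
print(modularity(A, [0, 0, 1, 1]))   # one community per edge
print(modularity(A, [0, 1, 2, 3]))   # every node isolated, as in step (4-1)
```

Grouping each connected pair gives a higher Q than leaving every node isolated, which is why the first moves in step (4-4) merge strongly connected trigger words. A larger gamma penalizes large communities more heavily and therefore tends to yield more, smaller clusters.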
Compared with the prior art, the invention has the following outstanding characteristics and advantages:
The method addresses the problem that traditional template-based event extraction requires event trigger words to be summarized manually in advance, which consumes a large amount of labor time. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, which saves labor time and facilitates event extraction.
Drawings
FIG. 1 is a flow chart of the domain event trigger word clustering method based on the Louvain community discovery algorithm of the invention.
FIG. 2 is an example of dependency syntax parsing in the invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
In this embodiment of the domain event trigger word clustering method based on the Louvain community discovery algorithm, financial-domain events are taken as an example: 10,000 arbitrarily chosen news texts published between September 2018 and December 2018 are obtained from the Sina Finance news website, and their event trigger words are clustered. As shown in FIG. 1, the event trigger word clustering method based on the Louvain community discovery algorithm in this embodiment includes the following steps:
S1, inputting a financial-domain event text set, for example, 10,000 news texts in the financial domain.
S2, extracting event trigger words from the event text titles. Dependency syntax parsing is performed on the news headlines using the dependency parsing tool HanLP; the result is illustrated in FIG. 2. The core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$), and the verb-object relation ($V_{VOB}$, Obj) are extracted, and the predicate verbs $V_{HED}$, $V_{SBV}$, and $V_{VOB}$ are taken as candidate trigger words. Only candidate trigger words whose part of speech is verb (v), intransitive verb (vi), or nominal verb (vn) are retained, and words of length 1 are removed.
S3, constructing the event trigger word similarity network, which is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words in the network and n is the number of trigger words; E is an n×n symmetric matrix whose element $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity less than 0.3 set to 0; $vec_i$ is the word vector of $w_i$ obtained with a word2vec model, and $vec_j$ is the word vector of $w_j$.
S4, event trigger word clustering process based on a louvain community discovery algorithm is as follows:
S4.1, initially, each node in the network G is in its own isolated community;
S4.2, randomly selecting a node i from all nodes;
S4.3, for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ where neighbor node j is located. The modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
wherein γ is a resolution parameter used to flexibly control the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; m is the sum of the weights of all edges in the network; $c_i$ and $c_j$ are the communities of node i and node j; and $\delta(c_i, c_j)$ indicates whether the two nodes are in the same community, taking the value 1 if they are and 0 otherwise;
S4.4, finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max} > 0$, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;
S4.5, when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network. All nodes in the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the corresponding supernode, whose weight is the sum of the weights of the intra-community edges; and the weight of the edge between two supernodes in the new network is the sum of the weights of the edges between the two corresponding communities;
S4.6, after the new network is constructed, returning to step S4.1 for iterative computation, until no node can be moved within an iteration, at which point the algorithm terminates.
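Steps S4.1 to S4.6 can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation. The function names are hypothetical. Nodes are visited in a seeded random order rather than drawn one at a time. The diagonal of an aggregated matrix stores twice the intra-community weight so that row sums remain the weighted degrees $k_i$. Candidate moves are compared with the quantity $k_{i,C} - \gamma\,\Sigma_{tot}(C)\,k_i / (2m)$, which is proportional to the modularity gain of inserting node i into community C.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def local_moving(A: Matrix, gamma: float, rng: random.Random) -> Tuple[List[int], bool]:
    """Steps S4.1 to S4.4: greedily move nodes while modularity increases."""
    n = len(A)
    k = [sum(row) for row in A]      # weighted degree of each node
    two_m = sum(k)
    comm = list(range(n))            # S4.1: every node starts in its own community
    if two_m == 0:
        return comm, False
    tot = k[:]                       # total degree of each community
    moved_any, improved = False, True
    while improved:
        improved = False
        order = list(range(n))
        rng.shuffle(order)           # S4.2: random visiting order
        for i in order:
            old = comm[i]
            links = {}               # edge weight from i to each neighbouring community
            for j in range(n):
                if j != i and A[i][j] > 0:
                    links[comm[j]] = links.get(comm[j], 0.0) + A[i][j]
            tot[old] -= k[i]         # take i out of its community
            best = old
            best_gain = links.get(old, 0.0) - gamma * tot[old] * k[i] / two_m
            for c, w in links.items():   # S4.3: gain for each neighbouring community
                gain = w - gamma * tot[c] * k[i] / two_m
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            tot[best] += k[i]        # S4.4: put i into the best community
            comm[i] = best
            if best != old:
                improved = moved_any = True
    return comm, moved_any


def aggregate(A: Matrix, comm: List[int]) -> Tuple[Matrix, List[int]]:
    """Step S4.5: collapse every community into one supernode."""
    index = {c: r for r, c in enumerate(sorted(set(comm)))}
    B = [[0.0] * len(index) for _ in index]
    for i, row in enumerate(A):
        for j, w in enumerate(row):
            B[index[comm[i]]][index[comm[j]]] += w
    return B, [index[c] for c in comm]


def louvain(A: Matrix, gamma: float = 1.0, seed: int = 0) -> List[int]:
    """Return a community label for every node of the original network."""
    rng = random.Random(seed)
    membership = list(range(len(A)))     # original node -> current supernode
    while True:
        comm, moved = local_moving(A, gamma, rng)
        if not moved:                    # S4.6: stop when no node moves in a pass
            return membership
        A, relabeled = aggregate(A, comm)
        membership = [relabeled[s] for s in membership]
```

Called on the thresholded similarity matrix E from step S3, `louvain` returns one label per trigger word; words that share a label form one cluster.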
S5, outputting the event trigger word clustering result, in which trigger words with similar semantics are clustered into the same community.
Claims (1)
1. A domain event trigger word clustering method based on the Louvain community discovery algorithm, characterized in that the method comprises the following specific steps:
(1) Inputting a text set from any domain;
(2) Extracting event trigger words from the text titles;
(3) Constructing a trigger word similarity network;
(4) Clustering trigger words based on a louvain community discovery algorithm;
(5) Outputting a clustering result of the event trigger words;
the event trigger word extraction in the step (2) comprises the following steps:
(2-1) performing dependency syntax parsing on the news headline using the dependency parsing tool HanLP, extracting the core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$), and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$, and $V_{VOB}$ as candidate trigger words;
(2-2) retaining only candidate trigger words whose part of speech is verb (v), intransitive verb (vi), or nominal verb (vn), and removing words of length 1;
the event trigger word similarity network of the step (3) is expressed as follows:
$G = \langle W, E \rangle$
$e_{i,j} = \cos(vec_i, vec_j)$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of trigger words in the network and n is the number of trigger words; E is an n×n symmetric matrix whose element $e_{i,j}$ is the calculated similarity between trigger word $w_i$ and trigger word $w_j$, with any similarity less than 0.3 set to 0; $vec_i$ is the word vector of $w_i$ obtained with a word2vec model, and $vec_j$ is the word vector of $w_j$;
the event trigger words are clustered by using a louvain community discovery algorithm in the step (4), and the specific steps are as follows:
(4-1) initially, each node in the network G being in an isolated community;
(4-2) randomly selecting a node i from all nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ where neighbor node j is located, wherein the modularity Q is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
wherein γ is a resolution parameter used to flexibly control the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; m is the sum of the weights of all edges in the network; $c_i$ and $c_j$ are the communities of node i and node j; and $\delta(c_i, c_j)$ indicates whether the two nodes are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum modularity gain $\Delta Q_{max} > 0$, letting $C_i = C_{j'}$, that is, moving node i to the community where node j' is located;
(4-5) when no node can be moved any further, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node of the new network, called a supernode; the edges inside a community are mapped to a self-loop on the corresponding supernode, whose weight is the sum of the weights of the intra-community edges; and the weight of the edge between two supernodes in the new network is the sum of the weights of the edges between the two corresponding communities;
(4-6) after the new network is constructed, returning to step (4-1) for iterative computation, until no node can be moved within an iteration, at which point the algorithm terminates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910980755.5A CN110929509B (en) | 2019-10-16 | 2019-10-16 | Domain event trigger word clustering method based on louvain community discovery algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110929509A (en) | 2020-03-27 |
CN110929509B (en) | 2023-09-15 |
Family
ID=69848930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910980755.5A Active CN110929509B (en) | 2019-10-16 | 2019-10-16 | Domain event trigger word clustering method based on louvain community discovery algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110929509B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112395860A (en) * | 2020-11-27 | 2021-02-23 | 山东省计算中心(国家超级计算济南中心) | Large-scale parallel policy data knowledge extraction method and system |
CN112632280B (en) * | 2020-12-28 | 2022-05-24 | 平安科技(深圳)有限公司 | Text classification method and device, terminal equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745258A (en) * | 2013-09-12 | 2014-04-23 | 北京工业大学 | Minimal spanning tree-based clustering genetic algorithm complex web community mining method |
CN108509551A (en) * | 2018-03-19 | 2018-09-07 | 西北大学 | A kind of micro blog network key user digging system under the environment based on Spark and method |
CN108509607A (en) * | 2018-04-03 | 2018-09-07 | 三盟科技股份有限公司 | A kind of community discovery method and system based on Louvain algorithms |
CN108681936A (en) * | 2018-04-26 | 2018-10-19 | 浙江邦盛科技有限公司 | A kind of fraud clique recognition methods propagated based on modularity and balance label |
Non-Patent Citations (2)
Title |
---|
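Research on event recognition based on word2vec and dependency analysis; Wang Hongbin et al.; Software (《软件》); 2017-06-15 (No. 06); full text * |
Community division of citation networks based on sample weighting; Xiao Xue et al.; Library and Information Service (《图书情报工作》); 2016-10-20 (No. 20); full text * |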
Also Published As
Publication number | Publication date |
---|---|
CN110929509A (en) | 2020-03-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108073568B (en) | Keyword extraction method and device | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN105022754B (en) | Object classification method and device based on social network | |
CN109815336B (en) | Text aggregation method and system | |
CN110619051B (en) | Question sentence classification method, device, electronic equipment and storage medium | |
CN108027814B (en) | Stop word recognition method and device | |
WO2020232898A1 (en) | Text classification method and apparatus, electronic device and computer non-volatile readable storage medium | |
CN107562919B (en) | Multi-index integrated software component retrieval method and system based on information retrieval | |
CN112163424A (en) | Data labeling method, device, equipment and medium | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
CN110633371A (en) | Log classification method and system | |
CN109522396B (en) | Knowledge processing method and system for national defense science and technology field | |
CN110909126A (en) | Information query method and device | |
CN108304382A (en) | Mass analysis method based on manufacturing process text data digging and system | |
CN110929509B (en) | Domain event trigger word clustering method based on louvain community discovery algorithm | |
WO2014002774A1 (en) | Synonym extraction system, method, and recording medium | |
CN113360647A (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN112989813A (en) | Scientific and technological resource relation extraction method and device based on pre-training language model | |
CN112632982A (en) | Dialogue text emotion analysis method capable of being used for supplier evaluation | |
CN104572632A (en) | Method for determining translation direction of word with proper noun translation | |
CN114186022A (en) | Scheduling instruction quality inspection method and system based on voice transcription and knowledge graph | |
CN106202033B (en) | A kind of adverbial word Word sense disambiguation method and device based on interdependent constraint and knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||