CN110929509A - Louvain community discovery algorithm-based field event trigger word clustering method - Google Patents
- Publication number
- CN110929509A (application CN201910980755.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- community
- trigger words
- event
- event trigger
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for clustering domain event trigger words based on the Louvain community discovery algorithm, comprising the following steps: (1) inputting a text set from any domain; (2) extracting event trigger words from the text titles; (3) constructing a trigger word similarity network; (4) clustering the trigger words with the Louvain community discovery algorithm; and (5) outputting the clustering result of the event trigger words. Traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual effort. The method instead groups large numbers of semantically similar trigger words automatically, without manual participation or corpus annotation, which saves labor and makes event extraction considerably more convenient.
Description
Technical Field
The invention relates to the field of event extraction within information extraction, and in particular to a method for clustering domain event trigger words based on the Louvain community discovery algorithm.
Background
As defined by Automatic Content Extraction (ACE), event extraction mainly comprises two tasks: identification of event types and extraction of event elements. An event trigger word is a word that describes the occurrence of an event and plays an important role in determining the event type, so identifying the event type amounts to extracting the event trigger word. Traditional trigger word extraction based on pattern matching requires the trigger words of each event to be summarized manually, which consumes a large amount of labor. Methods based on machine learning convert trigger word extraction into a classification problem; they rely on a large-scale training corpus and are limited by its size, so data sparsity is serious, accuracy is low, and industrial requirements cannot be met.
Disclosure of Invention
In view of the shortcomings of traditional event trigger word extraction methods, the invention aims to provide a method for clustering domain event trigger words based on the Louvain community discovery algorithm.
To achieve this aim, the invention is conceived as follows: an event trigger word is a word that clearly indicates the occurrence of an event and is an important lexical unit for determining the event category. It is a main verb in a sentence, and many verbs with similar semantics can indicate the same type of event.
According to the above inventive idea, the invention adopts the following technical scheme:
A method for clustering domain event trigger words based on the Louvain community discovery algorithm comprises the following specific steps:
(1) inputting a text set from any domain;
(2) extracting event trigger words from the text titles;
(3) constructing a trigger word similarity network;
(4) clustering the trigger words based on the Louvain community discovery algorithm;
(5) outputting the clustering result of the event trigger words.
The extraction of event trigger words in step (2) proceeds as follows:
(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting from each sentence the core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) keeping only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1.
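As a rough illustration of steps (2-1) and (2-2), the sketch below selects candidate trigger words from a headline that has already been parsed. It does not call HanLP: the `Token` record, the `candidate_triggers` function and the hand-written parse are hypothetical, and reading $V_{SBV}$ and $V_{VOB}$ as the governors of SBV and VOB arcs is an assumption about the intended rule.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    """Hypothetical record for one parsed word (not a HanLP type)."""
    word: str    # surface form
    pos: str     # part-of-speech tag, e.g. "v", "vi", "vn", "n"
    head: int    # 1-based index of the governing token, 0 for the root
    deprel: str  # dependency label, e.g. "HED", "SBV", "VOB"


VERB_TAGS = {"v", "vi", "vn"}  # tags kept in step (2-2)


def candidate_triggers(tokens: List[Token]) -> List[str]:
    """Pick the HED word and the governors of SBV/VOB arcs, then filter."""
    picked: Set[int] = set()
    for idx, tok in enumerate(tokens, start=1):
        if tok.deprel == "HED":
            picked.add(idx)        # core word of the sentence
        elif tok.deprel in ("SBV", "VOB") and tok.head > 0:
            picked.add(tok.head)   # predicate governing the subject or object
    result = []
    for idx in sorted(picked):
        tok = tokens[idx - 1]
        # Keep verbs only and drop single-character words.
        if tok.pos in VERB_TAGS and len(tok.word) > 1:
            result.append(tok.word)
    return result


# Hand-written parse of a made-up headline; an assumption, not HanLP output.
headline = [
    Token("公司", "n", 2, "SBV"),
    Token("收购", "v", 0, "HED"),
    Token("股权", "n", 2, "VOB"),
]
print(candidate_triggers(headline))
```

Collecting indices in a set keeps a verb from being listed twice when it is both the sentence head and the governor of a subject or object.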
The event trigger word similarity network of the step (3) is expressed as follows:
$$G = \langle W, E \rangle$$
$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words; $n$ is the number of trigger words in the network; $E$ is a symmetric $n \times n$ matrix; $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities less than 0.3 set to 0; $\mathrm{vec}_i$ is the word vector obtained for $w_i$ with the word2vec model, and $\mathrm{vec}_j$ is the word vector for $w_j$.
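The matrix $E$ can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function names and the toy two-dimensional vectors are hypothetical stand-ins for word2vec vectors, and leaving the diagonal at 0 (no self-similarity edges) is an assumption the text does not address.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors: List[Sequence[float]],
                      threshold: float = 0.3) -> List[List[float]]:
    """Symmetric n x n matrix; similarities below the threshold become 0."""
    n = len(vectors)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                E[i][j] = E[j][i] = s  # fill both halves to keep E symmetric
    return E


# Toy vectors standing in for word2vec output (made-up values).
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
E = similarity_matrix(toy_vectors)
for row in E:
    print(row)
```

Only the upper triangle is computed and then mirrored, which guarantees the symmetry required of $E$.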
The clustering of the event trigger words with the Louvain community discovery algorithm in step (4) specifically comprises:
(4-1) at the beginning, placing each node of the network G in its own isolated community;
(4-2) randomly selecting a node i from all the nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor j, wherein the modularity Q is calculated as follows:
[Equation image not preserved in this text.]
wherein $\gamma$ is a resolution parameter that controls the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(u, v)$ indicates whether u and v are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum gain $\Delta Q_{max}$ is greater than 0, setting $C_i = C_{j'}$, that is, moving node i to the community of node j'; otherwise leaving node i in its current community;
(4-5) when no node can be moved any more, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node of the new network, called a super node; the edges inside a community are mapped to a self-loop of the super node, whose weight is the sum of the weights of the internal edges; the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network is built, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved during an iteration.
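The modularity formula referred to in step (4-3) does not survive in this text. A commonly used form of modularity with a resolution parameter, which matches the symbols defined in that step, is given below as an assumption rather than as the patent's own equation. The symbols $m$ (total edge weight, $m = \tfrac{1}{2}\sum_{i,j} A_{i,j}$), $c_i$ (community of node i), $k_{i,C}$ (sum of the weights of edges from node i to nodes in community C) and $\Sigma_C$ (sum of $k_j$ over the nodes of C, with node i taken out of its own community first) are introduced here for illustration.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \gamma \, \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
\qquad
\Delta Q_{i \to C} = \frac{k_{i,C}}{m} - \gamma \, \frac{k_i \, \Sigma_C}{2 m^{2}}
```

The second expression is the gain from placing an isolated node i into community C. It follows from the first because the move adds the ordered pairs $(i, j)$ and $(j, i)$ for every $j \in C$, each contributing $\tfrac{1}{2m}\left[A_{i,j} - \gamma \tfrac{k_i k_j}{2m}\right]$.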
Compared with the prior art, the invention has the following outstanding characteristics and advantages:
The method solves the problem that traditional template-based event extraction requires event trigger words to be summarized in advance, which consumes a great deal of manual effort. It automatically groups large numbers of semantically similar trigger words without manual participation or corpus annotation, saving labor and making event extraction considerably more convenient.
Drawings
FIG. 1 is a flow chart of the method for clustering domain event trigger words based on the Louvain community discovery algorithm according to the present invention.
FIG. 2 is an example of a dependency parsing result according to the present invention.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Taking events in the financial domain as an example, the method clusters the event trigger words of 10,000 arbitrarily chosen news texts published between September 2018 and December 2018 and obtained from the Sina Finance news website. As shown in FIG. 1, the event trigger word clustering method based on the Louvain community discovery algorithm of this embodiment includes the following steps:
S1, inputting a text set of financial-domain events, for example a set of 10,000 financial news texts.
S2, extracting event trigger words from the headlines of the event texts. Dependency parsing is performed on the news headlines with the dependency parsing tool HanLP; an example result is shown in FIG. 2. The core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj) are extracted from each sentence, and the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ are taken as candidate trigger words. Only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn) are kept, and words of length 1 are removed.
S3, constructing the event trigger word similarity network, which is expressed as follows:
$$G = \langle W, E \rangle$$
$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words; $n$ is the number of trigger words in the network; $E$ is a symmetric $n \times n$ matrix; $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities less than 0.3 set to 0; $\mathrm{vec}_i$ is the word vector obtained for $w_i$ with the word2vec model, and $\mathrm{vec}_j$ is the word vector for $w_j$.
S4, clustering the event trigger words based on the Louvain community discovery algorithm, as follows:
S4.1, at the beginning, each node of the network G is placed in its own isolated community;
S4.2, a node i is randomly selected from all the nodes;
S4.3, for node i, all of its neighbor nodes are found and, for each neighbor node j, the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor j is calculated, wherein the modularity Q is calculated as follows:
[Equation image not preserved in this text.]
wherein $\gamma$ is a resolution parameter that controls the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(u, v)$ indicates whether u and v are in the same community, taking the value 1 if they are and 0 otherwise;
S4.4, the neighbor node j' that yields the maximum modularity gain is found; if the maximum gain $\Delta Q_{max}$ is greater than 0, $C_i = C_{j'}$ is set, that is, node i is moved to the community of node j'; otherwise node i stays in its current community;
S4.5, when no node can be moved any more, the current community division is optimal, and the network is aggregated to generate a new network. All nodes in the same community are mapped to a single node of the new network, called a super node; the edges inside a community are mapped to a self-loop of the super node, whose weight is the sum of the weights of the internal edges; the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
S4.6, after the new network is built, the process returns to step S4.1 and iterates. The algorithm terminates when no node can be moved during an iteration.
S5, outputting the event trigger word clustering result, in which trigger words with similar semantics are grouped into the same community.
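Steps S4.1 to S4.6 and the output of S5 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names, the dict-of-dicts graph, the default `gamma=1.0` and the toy edge weights are all hypothetical; nodes are visited in shuffled order rather than drawn one at a time; the gain used is the standard resolution-parameter modularity gain scaled by a constant; and self-loops of super nodes count each internal edge in both directions so that node degrees are preserved.

```python
import random
from collections import defaultdict


def local_moving(graph, gamma=1.0, seed=0):
    """One local-moving phase (S4.1 to S4.4); returns node -> community."""
    rng = random.Random(seed)
    degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    two_m = sum(degree.values())
    community = {u: u for u in graph}   # S4.1: every node starts alone
    tot = dict(degree)                  # total degree of each community
    if two_m == 0:
        return community
    nodes = list(graph)
    moved = True
    while moved:
        moved = False
        rng.shuffle(nodes)              # S4.2: random visiting order
        for i in nodes:
            old = community[i]
            links = defaultdict(float)  # weight from i to each community
            for j, w in graph[i].items():
                if j != i:
                    links[community[j]] += w
            tot[old] -= degree[i]       # take i out before comparing options
            best = old
            best_gain = links.get(old, 0.0) - gamma * tot[old] * degree[i] / two_m
            for c, w_in in links.items():   # S4.3: gain for each neighbor community
                g = w_in - gamma * tot[c] * degree[i] / two_m
                if g > best_gain + 1e-12:   # S4.4: move only if strictly better
                    best, best_gain = c, g
            tot[best] += degree[i]
            if best != old:
                community[i] = best
                moved = True
    return community


def aggregate(graph, community):
    """S4.5: collapse each community into a super node, summing weights."""
    new = defaultdict(lambda: defaultdict(float))
    for u, nbrs in graph.items():
        row = new[community[u]]
        for v, w in nbrs.items():
            row[community[v]] += w      # internal edges become self-loops
    return {c: dict(nbrs) for c, nbrs in new.items()}


def louvain(graph, gamma=1.0, seed=0):
    """S4.6 and S5: repeat both phases, then return clusters as sets."""
    membership = {u: u for u in graph}  # original node -> current super node
    current = graph
    while True:
        community = local_moving(current, gamma, seed)
        if len(set(community.values())) == len(current):
            break                       # no node moved in this pass
        membership = {u: community[c] for u, c in membership.items()}
        current = aggregate(current, community)
    clusters = defaultdict(set)
    for u, c in membership.items():
        clusters[c].add(u)
    return list(clusters.values())


def make_graph(edges):
    """Build a symmetric dict-of-dicts graph from (u, v, weight) triples."""
    g = defaultdict(dict)
    for u, v, w in edges:
        g[u][v] = w
        g[v][u] = w
    return dict(g)


# Made-up similarities: two tight triangles joined by one weak edge.
toy_edges = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.85),
             ("d", "e", 0.9), ("e", "f", 0.8), ("d", "f", 0.85),
             ("c", "d", 0.35)]
clusters = louvain(make_graph(toy_edges))
print(sorted(sorted(c) for c in clusters))
```

On this toy graph the weak edge between c and d never offers a positive gain, so the two triangles end up as separate communities. Larger values of `gamma` tend to produce more, smaller communities.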
Claims (4)
1. A method for clustering domain event trigger words based on the Louvain community discovery algorithm, characterized by comprising the following specific steps:
(1) inputting a text set from any domain;
(2) extracting event trigger words from the text titles;
(3) constructing a trigger word similarity network;
(4) clustering the trigger words based on the Louvain community discovery algorithm;
(5) outputting the clustering result of the event trigger words.
2. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the extraction of event trigger words in step (2) proceeds as follows:
(2-1) performing dependency parsing on the news headlines with the dependency parsing tool HanLP, extracting from each sentence the core (head) relation ($V_{HED}$, HED), the subject-verb relation (Sub, $V_{SBV}$) and the verb-object relation ($V_{VOB}$, Obj), and taking the predicate verbs $V_{HED}$, $V_{SBV}$ and $V_{VOB}$ as candidate trigger words;
(2-2) keeping only candidate trigger words whose part of speech is verb (v), intransitive verb (vi) or nominal verb (vn), and removing words of length 1.
3. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the event trigger word similarity network of step (3) is expressed as follows:
$$G = \langle W, E \rangle$$
$$e_{i,j} = \cos(\mathrm{vec}_i, \mathrm{vec}_j)$$
wherein $W = \{w_1, w_2, \ldots, w_n\}$ is the set of event trigger words; $n$ is the number of trigger words in the network; $E$ is a symmetric $n \times n$ matrix; $e_{i,j}$ is the computed similarity between trigger word $w_i$ and trigger word $w_j$, with similarities less than 0.3 set to 0; $\mathrm{vec}_i$ is the word vector obtained for $w_i$ with the word2vec model, and $\mathrm{vec}_j$ is the word vector for $w_j$.
4. The method for clustering domain event trigger words based on the Louvain community discovery algorithm according to claim 1, characterized in that the clustering of the event trigger words with the Louvain community discovery algorithm in step (4) specifically comprises:
(4-1) at the beginning, placing each node of the network G in its own isolated community;
(4-2) randomly selecting a node i from all the nodes;
(4-3) for node i, finding all of its neighbor nodes and, for each neighbor node j, calculating the modularity gain $\Delta Q$ that would result from moving node i from its current community to the community $C_j$ of neighbor j, wherein the modularity Q is calculated as follows:
[Equation image not preserved in this text.]
wherein $\gamma$ is a resolution parameter that controls the number and size of the communities; $k_i$ is the sum of the weights of all edges connected to node i; $k_j$ is the sum of the weights of all edges connected to node j; $A_{i,j}$ is the weight of the edge between node i and node j; and $\delta(u, v)$ indicates whether u and v are in the same community, taking the value 1 if they are and 0 otherwise;
(4-4) finding the neighbor node j' that yields the maximum modularity gain; if the maximum gain $\Delta Q_{max}$ is greater than 0, setting $C_i = C_{j'}$, that is, moving node i to the community of node j'; otherwise leaving node i in its current community;
(4-5) when no node can be moved any more, the current community division is optimal, and the network is aggregated to generate a new network: all nodes in the same community are mapped to a single node of the new network, called a super node; the edges inside a community are mapped to a self-loop of the super node, whose weight is the sum of the weights of the internal edges; the weight of the edge between two super nodes is the sum of the weights of the edges between the corresponding communities;
(4-6) after the new network is built, returning to step (4-1) and iterating; the algorithm terminates when no node can be moved during an iteration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910980755.5A CN110929509B (en) | 2019-10-16 | 2019-10-16 | Domain event trigger word clustering method based on louvain community discovery algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110929509A (en) | 2020-03-27 |
CN110929509B (en) | 2023-09-15 |
Family ID: 69848930
Family Applications (1)
Application Number | Status | Publication | Title | Priority Date | Filing Date |
---|---|---|---|---|---|
CN201910980755.5A | Active | CN110929509B (en) | Domain event trigger word clustering method based on louvain community discovery algorithm | 2019-10-16 | 2019-10-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110929509B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745258A (en) * | 2013-09-12 | 2014-04-23 | 北京工业大学 | Minimal spanning tree-based clustering genetic algorithm complex web community mining method |
CN108509607A (en) * | 2018-04-03 | 2018-09-07 | 三盟科技股份有限公司 | A kind of community discovery method and system based on Louvain algorithms |
CN108509551A (en) * | 2018-03-19 | 2018-09-07 | 西北大学 | A kind of micro blog network key user digging system under the environment based on Spark and method |
CN108681936A (en) * | 2018-04-26 | 2018-10-19 | 浙江邦盛科技有限公司 | A kind of fraud clique recognition methods propagated based on modularity and balance label |
Non-Patent Citations (2)
Title |
---|
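王红斌等 (Wang Hongbin et al.): "基于word2vec和依存分析的事件识别研究" (Research on event recognition based on word2vec and dependency analysis), 《软件》 (Software) * |
肖雪等 (Xiao Xue et al.): "基于样本加权的引文网络的社团划分" (Community division of citation networks based on sample weighting), 《图书情报工作》 (Library and Information Service) * |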
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112395860A (en) * | 2020-11-27 | 2021-02-23 | 山东省计算中心(国家超级计算济南中心) | Large-scale parallel policy data knowledge extraction method and system |
CN112632280A (en) * | 2020-12-28 | 2021-04-09 | 平安科技(深圳)有限公司 | Text classification method and device, terminal equipment and storage medium |
CN112632280B (en) * | 2020-12-28 | 2022-05-24 | 平安科技(深圳)有限公司 | Text classification method and device, terminal equipment and storage medium |
WO2022142025A1 (en) * | 2020-12-28 | 2022-07-07 | 平安科技(深圳)有限公司 | Text classification method and apparatus, and terminal device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110929509B (en) | 2023-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||