CN108132927B - Keyword extraction method for combining graph structure and node association - Google Patents

Keyword extraction method for combining graph structure and node association

Info

Publication number
CN108132927B
CN108132927B (application CN201711285868.0A)
Authority
CN
China
Prior art keywords
nodes
association
node
text
item set
Prior art date
Legal status
Expired - Fee Related
Application number
CN201711285868.0A
Other languages
Chinese (zh)
Other versions
CN108132927A (en)
Inventor
马慧芳
王双
李苗
赵卫中
李宁
Current Assignee
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN201711285868.0A priority Critical patent/CN108132927B/en
Publication of CN108132927A publication Critical patent/CN108132927A/en
Application granted granted Critical
Publication of CN108132927B publication Critical patent/CN108132927B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method fusing graph structure and node association. The method comprises the following steps: 1) mining the frequent closed item sets of a document and generating a strong association rule set; 2) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document; 3) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order; 4) clustering the top-ranked nodes and taking the class centre of each class as the keyword extraction result of the document. The method can find the keywords of a long text without an external corpus, avoids the influence of text length in a domain corpus, examines the importance of a word over a wider context, improves the precision of domain keyword-set extraction, and is applicable to extracting domain keyword sets from large-scale domain corpora.

Description

Keyword extraction method for combining graph structure and node association
Technical Field
The invention relates to a novel keyword extraction method, belongs to the field of text information processing, and particularly relates to a keyword extraction method fusing graph structure and node association.
Background
With the spread of network technology, web news and electronic documents of every kind have quickly permeated people's lives, and text keyword extraction technology is crucial to how users obtain valuable information from massive document collections. Keyword extraction is widely applied to web retrieval, text classification, topic lexicon construction and the like, and helps users quickly grasp the core content of a document. In most text mining tasks, keyword extraction ranks terms by their degree of relevance to the text content, and numerous single-document keyword extraction algorithms have arisen accordingly.
Long-text keyword extraction depends on the context in which terms occur, and keyword selection can be performed with supervised or unsupervised learning. Supervised keyword extraction takes two forms: binary classification and feature selection. The binary classification approach decides whether a candidate word is a keyword, typically with a classical classifier. The feature selection approach focuses on the internal statistical features of terms in the text and their external statistical features in a corpus. Unsupervised keyword extraction has three directions: extraction based on latent semantic analysis, graph-based keyword ranking, and topic-based keyword clustering. Latent semantic analysis depends heavily on a corpus for training, and in practice a corpus often cannot be obtained in time, which creates an opportunity for graph-based keyword extraction techniques that can operate without a corpus. The basic source of graph-based keyword ranking algorithms is the PageRank algorithm. GraphSum holds that association relations between nodes vary in strength, and lowers the PageRank score passed in votes between negatively associated nodes, making node scores more reasonable. Attribute-based approaches point out that node importance must consider not only the graph structure but also external information such as node attributes, and systematically propose 13 node attributes to measure node similarity, giving node ranking higher reliability.
Disclosure of Invention
The invention provides a keyword extraction method fusing Graph Structure and Node Association (GSNA), which comprehensively considers the graph structure attributes of nodes, semantic information, and the association characteristics among nodes, and can find the keywords of long texts without an external corpus.
In order to achieve the above object, the keyword extraction method fusing graph structure and node association according to the present invention comprises the following steps:
1) acquiring a text and performing text preprocessing;
2) mining a frequent closed item set of a document and generating a strong association rule set;
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
5) clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document.
The text preprocessing in step 1) comprises the following steps (a minimal code sketch follows the list):
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. duplicate words in each sentence are removed.
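As an illustration of steps a-e, a minimal Python sketch follows; the stop-word set, the sentence delimiters and the demo text are illustrative assumptions rather than part of the invention:

```python
import re

STOP_WORDS = {"the", "is", "at", "which", "on", "of", "and", "to", "a", "in"}

def preprocess(text):
    """Split into sentences, tokenize, drop stop words, dedupe per sentence."""
    sentences = re.split(r"[.!?;,\n]+", text)   # step a: sentence split
    transactions = []
    for s in sentences:
        words = s.lower().split()               # step c: space tokenization (English)
        words = [w for w in words if w and w not in STOP_WORDS]  # step d
        if words:
            transactions.append(sorted(set(words)))  # step e: dedupe in sentence
    return transactions

if __name__ == "__main__":
    demo = "Natural selection preserves favourable variations. Variations accumulate in organisms."
    for t in preprocess(demo):
        print(t)
```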
The method for generating the strong association rule set in 2) comprises the following steps:
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where g(X) is the set of transactions containing X, f returns the items common to a set of transactions, and S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of an item set; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) of A → B is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)

Since lift(A → B) = lift(B → A), the lift measure is symmetric. Analysed in terms of probability distributions: when lift(A → B) = 1, A and B follow independent distributions, indicating no association between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced exceeds the prior probability of B, indicating a positive association; when lift(A → B) < 1, the posterior probability of B after A is introduced is below the prior probability of B, indicating a negative association;
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X; the invention mines the frequent closed item sets with the CLOSE algorithm to generate the strong association rule set;
the CLOSE algorithm is realized by the following steps:
step (1): preprocess the text; after deduplication the words of the text form the candidate 1-item sets; scan the transaction set to compute the closure and support of each 1-item set, record the resulting closure-and-support set as FCC1, and then remove from FCC1 the 1-item sets whose support is below minsup or whose closure duplicates another, obtaining FC1 (FC1 is the closure-and-support set obtained from the 1-item sets by the CLOSE algorithm);
step (2): in generating the i-item sets (starting from i = 2), combine pairwise the distinct item sets among the generators of FC(i−1) as the generators of FCCi, compute the closure and support of the i-item sets by scanning the transaction set to obtain FCCi, and remove the item sets whose support is below minsup or whose closure repeats, obtaining FCi;
step (3): increment i and repeat step (2) until the generator set of FCCi is empty, at which point the CLOSE algorithm terminates;
step (4): finally, take out all closures of FC1 through FCi to obtain all closed frequent item sets FC of the transaction set;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules AR of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules.
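The mining pipeline of steps a-h can be sketched as follows. This is a compact closure-based search in the spirit of the CLOSE algorithm rather than its full level-wise generator bookkeeping, and the thresholds are illustrative:

```python
from itertools import combinations

def g(X, T):
    """Transactions (frozensets) containing item set X."""
    return [t for t in T if X <= t]

def closure(X, T):
    """S(X) = f(g(X)): items common to every transaction supporting X."""
    txs = g(X, T)
    return frozenset.intersection(*txs) if txs else frozenset()

def closed_frequent(T, minsup):
    """Enumerate frequent closed item sets with their supports."""
    items = frozenset().union(*T)
    closed = {}
    frontier = [frozenset([i]) for i in items]
    seen = set(frontier)
    while frontier:
        X = frontier.pop()
        sup = len(g(X, T)) / len(T)
        if sup < minsup:
            continue
        C = closure(X, T)
        closed.setdefault(C, sup)
        for i in items - C:          # expand the closure by one item
            Y = C | {i}
            if Y not in seen:
                seen.add(Y)
                frontier.append(Y)
    return closed

def strong_rules(closed, T, minsup, min_pos, max_neg):
    """Split each closed set into head -> body and keep the strong rules."""
    sup = lambda X: len(g(X, T)) / len(T)
    rules = []
    for C in closed:
        for m in range(1, len(C)):
            for head in map(frozenset, combinations(C, m)):
                body = frozenset(C) - head
                lift = sup(head | body) / (sup(head) * sup(body))
                if sup(head | body) >= minsup and (lift >= min_pos or lift <= max_neg):
                    rules.append((head, body, lift))
    return rules
```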
The association graph in 3) is generated as follows: take out all rule heads and rule bodies of the strong association rule set and put them into one set; the non-repeating item sets of this set become the nodes of the association graph, each strong association between two item sets becomes an edge, and the lift value serves as the edge weight.
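Continuing the sketch, the association graph can be assembled from the (head, body, lift) triples produced above; networkx is an assumed convenience, and any weighted adjacency structure would do:

```python
import networkx as nx

def build_association_graph(rules):
    """Nodes: distinct rule heads/bodies; an edge exists iff a strong rule
    links two item sets; the edge weight is the rule's lift value."""
    G = nx.Graph()
    for head, body, lift in rules:
        G.add_edge(head, body, weight=lift)  # add_edge also creates the nodes
    return G
```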
The GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of nodes; six indicators are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, atr4 is the ratio of atr1 to the average degree of the node's neighbours, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; for a node N_i of the association graph, N_i is formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}, the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, and that of N_j with I is Q = {q_1, q_2, ..., q_n}; the semantic distance between N_i and N_j is then the Jensen–Shannon divergence

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (2)
c. calculating the inter-node similarity combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of nodes, the similarity measure of N_i and N_j is shown in formula (3): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution, the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij;

[formula (3): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (4)
f. modifying the PageRank formula;
r_i represents the degree of similarity of node N_i to all nodes of the graph on average, and r = (r_1, r_2, ..., r_N) is the fixed probability distribution over the nodes of the association graph determined by graph structure and semantic attributes; using r_i, the PageRank formula is modified so that node jumps no longer follow equal probability, but nodes more similar to the rest of the association graph are selected with higher probability:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} PR(N_k) / C(N_k)    (5)
since the association graph is undirected, C(N_k) in formula (5) denotes the degree of node N_k, and e denotes the set of neighbours of N_i;
g. strengthening the PR voting score of positive associations and reducing that of negative associations; in the association graph, an edge exists between two nodes if and only if a strong association rule holds between them; to strengthen positively associated votes and weaken negatively associated votes, formula (5) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (6)
h. randomly walking on the association graph according to formula (6) and iteratively computing the PR value of each node until formula (7) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (7)
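A minimal sketch of the GSNA walk of formulas (6) and (7), assuming a networkx graph G whose edge weights are the lift values (as built above) and a precomputed normalized similarity distribution r:

```python
def gsna_scores(G, r, d=0.85, beta=1e-6, max_iter=100):
    """Iterate formula (6) until every PR change is below beta (formula (7));
    returns the nodes sorted by descending importance score."""
    nodes = list(G.nodes)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # each neighbour k votes PR(k)/C(k), scaled by the edge's lift
            vote = sum(G[i][k]["weight"] * pr[k] / G.degree(k) for k in G[i])
            new[i] = (1 - d) * r[i] + d * vote
        converged = max(abs(new[n] - pr[n]) for n in nodes) < beta
        pr = new
        if converged:
            break
    return sorted(pr.items(), key=lambda kv: -kv[1])
```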
And in 5) the nodes are clustered as follows: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; choose K (K ≤ P) of them as the K class centres; then compute the semantic distance between each remaining node and each class centre and assign the node to the nearest class; finally, after all nodes are assigned, find the class centre of each class anew and reassign the non-centre nodes; repeat the processes of centre adjustment and node assignment until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text. The basic technical flow is shown in FIG. 1.
The keyword extraction method combining graph structure and node association disclosed by the invention has the following beneficial effects. The invention uses the CLOSE algorithm to mine the strong association rules for constructing the association graph; the strong association rules obtained on the basis of frequent closed item sets are small in scale, fast to compute and lossless, which avoids the overly complex graph structure caused by extracting redundant frequent item sets and improves the overall execution efficiency of the algorithm. Node similarity in the association graph is computed by comprehensively considering the graph structure attributes of nodes and their semantic information, so that node jumps no longer follow an equal-probability distribution and jumps between similar nodes become more likely. During the random walk on the association graph, votes between nodes are scored as positive or negative associations: positively associated votes are strengthened and negatively associated votes are weakened, making node scores more reasonable. The method can find the keywords of a long text without an external corpus; it comprehensively considers the graph structure attributes of nodes, semantic information and the association characteristics between nodes, while avoiding the low execution efficiency that an overly complex association graph would impose on GSNA. First, the frequent closed item sets of the document are mined and the strong association rule set is generated; second, the non-repeating rule heads and rule bodies of the strong association rule set are taken as nodes, an edge is placed between two nodes if and only if a strong association rule exists between them, and the association strength of the rule is taken as the edge weight to construct the association graph of the document; then, the GSNA algorithm walks randomly on the association graph, iteratively computes the importance score of each node, and sorts the results in descending order; finally, to avoid semantic inclusion among the extracted keywords, the top-ranked nodes are clustered and the class centre of each class is taken as the keyword extraction result of the document.
Drawings
FIG. 1 is a basic technical flow diagram of the present invention;
FIG. 2 is a diagram illustrating the steps of generating a closed frequent itemset according to the present invention;
FIG. 3 is a diagram of strong association rule generation of the present invention;
FIG. 4 is a correlation diagram of a document according to the present invention.
Detailed Description
Example 1
As shown in fig. 1-4, the invention provides a keyword extraction method for combining graph structures and node associations, which comprises the following steps:
1) acquiring a text and performing text preprocessing;
a. acquiring a text: Darwin's "On the Origin of Species" is selected as the text, which consists of a number of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting the text into words: English text can be tokenized using spaces as natural delimiters; in Chinese, only characters, sentences and paragraphs have obvious delimiters, while words themselves have no formal delimiter, so suitable word segmentation software must be chosen for Chinese text; since the selected text is Chinese, it is segmented with the jieba segmenter; for example, a passage of "On the Origin of Species":
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
After word segmentation (which, in the original Chinese, separates adjacent words with delimiters) this becomes:
the role of natural selection lies entirely in the preservation and accumulation of the various variants which, for each organism, are favourable both under organic and inorganic conditions during their lifetime. The end result of this is that the relationship of various organisms to their external conditions is increasingly improving. Such improvements entail progressive development of the regime of most organisms worldwide. "
d. filtering the stop words of the text; stop words carry no practical meaning: for English text, "the", "is", "at", "which", "on" and the like can be treated as stop words, and for Chinese text words rendered as "a", "here", "where", "I", "though" and the like can be treated as stop words. After stop-word filtering, the text becomes a set of sentences St = {st_1, st_2, ..., st_n} separated by punctuation marks; the stop-word list can be user-defined, and the invention is not limited in this respect. The invention uses the Baidu stop-word list supplemented with some text-specific stop words. The passage given in c, after stop-word filtering, becomes:
"natural selection action is fully preserved and accumulated organic and inorganic conditions of various variant organisms in life period are favorable
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
e. removing the repeated words in each sentence; after deduplication, the stop-word-filtered passage becomes:
"natural selection action is fully preserved and accumulated in the life period of various variant organisms under the organic and inorganic conditions
The final result is that the biological external condition relation is improved day by day
Improvement incurs gradual progress in the world's biosystems ";
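Steps c-e can be reproduced with a short sketch; jieba is the segmenter named in step c, while the sample sentence and the stop-word set are illustrative stand-ins for the Baidu list mentioned in step d:

```python
import jieba

STOP_WORDS = {"的", "了", "是", "在", "这", "我", "虽然", "和"}  # placeholder list

sentence = "自然选择的作用完全在于保存和积累各种变异"  # illustrative sentence
tokens = jieba.lcut(sentence)                                      # step c: segmentation
tokens = [w for w in tokens if w.strip() and w not in STOP_WORDS]  # step d: stop words
print(sorted(set(tokens)))                                         # step e: deduplicate
```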
2) mining a frequent closed item set of a document and generating a strong association rule set;
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word. For "On the Origin of Species", the transaction set is the whole preprocessed text of the book and a transaction is one preprocessed sentence;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where S = f∘g is the closure operation. Assume the transaction set contains five transactions with IDs tr1, tr2, tr3, tr4, tr5 and item sets {a, c, d}, {b, c, e}, {a, b, c, e}, {b, e} and {a, b, c, e} respectively. One closed item set is S({b, e}) = f(g({b, e})) = f({tr2, tr3, tr4, tr5}) = {b, e}.
c. The length of the item set is calculated, the item set { a, c, d } is called a 3-item set, and the item set { a, b, c, e } is called a 4-item set;
d. solving the support of an item set; for an item set X, the support sup(X) is the ratio of the number of transactions containing X to the total number of transactions, i.e. sup(X) = |g(X)| / |T|. For an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T. In the example of b, {b, e} is contained in tr2, tr3, tr4 and tr5, so sup({b, e}) = 4/5 (these numbers are verified in the sketch after step h);
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)

Since lift(A → B) = lift(B → A), the lift measure is symmetric. Analysed in terms of probability distributions: when lift(A → B) = 1, A and B follow independent distributions, indicating no association between A and B; when lift(A → B) > 1, the posterior probability of B after A is introduced exceeds the prior probability of B, indicating a positive association; when lift(A → B) < 1, the posterior probability of B after A is introduced is below the prior probability of B, indicating a negative association;
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X, where minsup is the minimum support threshold over the transaction set.
The conventional association rule mining algorithm Apriori constructs association rules by mining frequent item sets of different lengths; when the transaction set is large, the repeated scanning of the transaction set during rule generation easily yields a large-scale redundant collection of frequent item sets, resulting in low efficiency and a complicated analysis process. In order to quickly extract effective strong association rules from the document transaction set, the invention applies the CLOSE algorithm to mine the frequent closed item sets and thereby generate the strong association rule set; the algorithm uses the closure operation combined with a breadth-first strategy to search the item set space, achieving fast pruning of infrequent and non-closed item sets in the search space.
Assuming minsup = 2/5, for the five-transaction example in b, the CLOSE algorithm first takes the 1-item sets as the generators of FCC1, scans the transaction set to compute the closure and support of each 1-item set to obtain FCC1, and then removes the 1-item sets whose support is below 2/5 or whose closure duplicates another, obtaining FC1. When generating the 2-item sets, only the distinct generators of FC1 are combined pairwise, yielding {a, b}, {a, c} and {b, c}; since {a, c} ⊆ S({a}) = {a, c}, the candidate {a, c} is pruned, leaving {a, b} and {b, c} as the generators of FCC2; scanning the transaction set gives the closures and supports of the 2-item sets, i.e. FCC2, and removing the item sets whose support is below 2/5 or whose closure repeats gives FC2. When generating FCC3 from FC2, the only 3-item set produced by the generators of FC2 is {a, b, c}, but {a, b, c} ⊆ S({a, b}) = {a, b, c, e}, so the pruned 3-item-set candidate list is empty and the CLOSE algorithm terminates. Finally, all closures of FC1 and FC2 are taken out to obtain all closed frequent item sets FC of the transaction set. FIG. 2 illustrates the closed frequent item set generation process of this transaction set in detail;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules. According to the five-transaction example in b, with minsup = 2/5, the 3-item set {b, c, e} can form rules whose head is a 1-item set and body a 2-item set, or whose head is a 2-item set and body a 1-item set; the lift measure is symmetric, as can also be seen from the figure; the 2-item set {a, c} has only 1-item-set heads and bodies, and lift(a → c) = lift(c → a); the strong association rule generation process is shown in FIG. 3;
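As a check on the worked example of steps b and d, a short sketch with the five transactions:

```python
T = [frozenset(t) for t in ({"a","c","d"}, {"b","c","e"}, {"a","b","c","e"},
                            {"b","e"}, {"a","b","c","e"})]   # tr1..tr5

def g(X):
    """Transactions containing item set X."""
    return [t for t in T if X <= t]

def closure(X):
    """S(X) = f(g(X)): items common to every supporting transaction."""
    txs = g(X)
    return frozenset.intersection(*txs) if txs else frozenset()

X = frozenset({"b", "e"})
print(sorted(closure(X)))   # ['b', 'e'] -> {b, e} is closed
print(len(g(X)) / len(T))   # 0.8 = 4/5 = sup{b, e}
```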
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document: take out all rule heads and rule bodies of the strong association rule set, put them into one set, take the non-repeating item sets of this set as the nodes of the association graph, take the strong associations between the points as the edges, and take the lift values as the edge weights. Continuing the example, assume minsup = 0.6, min⁺_lift = 1.2 and max⁻_lift = 0.95 for the preprocessed document transaction set; the CLOSE algorithm yields the frequent closed item sets FC = {{a, c}, {b, e}, {c}, {b, c, e}}, from which the strong association rule set AR is derived (the full rule list appears in the original formula image). Extracting the non-repeating rule heads and rule bodies of AR as the association graph nodes V = {{b, c}, {b, e}, {c, e}, {a}, {b}, {c}, {e}} and using the association strengths of the strong rules in AR as edge weights constructs the undirected weighted association graph shown in FIG. 4;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
text repetition by different nodes in association graphsThe importance is different, the invention considers the existing method for measuring the node importance in the graph, and provides a GSNA algorithm comprehensively considering the similarity of the nodes and the correlation strength between the nodes, and the algorithm is essentially an improvement on the PageRank algorithm. The PageRank algorithm considers that an edge between the nodes A and B can be regarded as voting of the node A to the node B, the higher the sum of the edge weights of the nodes is, the higher the importance of the node is, and the PageRank simulates an internet owner, can randomly walk between the points on the graph and can jump. After a limited number of excursions, each point reaches a steady state and the expected value of each point tends to a constant value. And generating node sequencing once after each walk iteration, wherein the sequencing value is the PageRank score. Suppose the graph contains N nodes in total, NiFor the i-th node in the dependency graph, e denotes NiNumber of edge entries of (2), PR (N)k) Represents NkPageRank score of (C) (N)k) Represents NkD is used to measure the attenuation coefficient of a node migrating to another node, the PageRank score of Ni is defined as follows:
Figure GDA0003427662760000132
In formula (2) the jump between nodes is an equiprobable event with probability 1/N, whereas in theory jumps between similar nodes should be more likely. In addition, nodes may be not only positively but also negatively associated with each other: a positive association means that one node tends to appear together with another, while a negative association means that after one node appears the probability of the other appearing decreases. For negative associations between nodes the PR voting score should be decreased, while for positive associations it should be increased.
The GSNA algorithm improves PageRank in two respects: the graph-structure similarity between nodes and the association between nodes. When measuring node similarity, GSNA considers both the structural attributes of nodes in the association graph and their implicit semantic information; when measuring node association, positive association transmission is strengthened and negative association transmission is weakened, and GSNA scales the PR scores accordingly, making the node ranking more accurate.
For the GSNA algorithm, the steps are as follows:
Step (1): quantifying the graph structure attributes of nodes. Six indicators are selected to quantify the graph structure attributes of a node. atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, intuitively expressing the node's capability to associate with other nodes; atr4 is the ratio of atr1 to the average degree of the node's neighbours, expressing how well the node matches its surrounding neighbours; atr5 is the ratio of atr2 to atr1 and atr6 the ratio of atr3 to atr2, quantifying the node's propagation capability in the association graph. For a node N_i of the association graph, N_i may be formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6).
Step (2): solving the semantic distance between two nodes. Suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}; the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, where p_k is the number of sentences in which N_i appears together with the k-th word of I divided by the total number of sentences of the text, so traversing I yields P; likewise the co-occurrence probability distribution of N_j with I is Q = {q_1, q_2, ..., q_n}. The semantic distance between N_i and N_j is then the Jensen–Shannon divergence shown in equation (3); the larger JSD(N_i, N_j), the larger the semantic distance between the two nodes.

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (3)
For the association graph of the document produced in FIG. 4, the word set is I = {a, b, c, d, e}; the co-occurrence probability distribution of node {b, e} with I is P = {2/5, 4/5, 3/5, 0, 4/5}, and that of node {a} with I is Q = {3/5, 2/5, 3/5, 1/5, 2/5}.
Step (3): calculating the inter-node similarity combining graph structure attributes and semantic information. Integrating the graph structure attributes N_i = (atr1, atr2, atr3, atr4, atr5, atr6) and the semantic information of nodes, the similarity measure of N_i and N_j is shown in equation (4): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution (the larger JSD(N_i, N_j)), the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij.

[equation (4): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
Step (4): normalizing the similarity of each node in the association graph. Assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as in equation (5):

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (5)
and (5) correcting the PageRank formula.
riRepresenting a node NiDegree of similarity to the average of all nodes in the graph, r ═ r1,r2,...,rN) I.e. a fixed probability distribution that all nodes in the dependency graph are related to graph structure and semantic attributes. Using riModifying the PageRank formula to ensure that the nodes do not obey equal probability jump any more, but the nodes with higher similarity degree with the association graph are selected with higher probability, wherein the formula is as follows:
Figure GDA0003427662760000162
Since the association graph is undirected, C(N_k) in formula (6) denotes the degree of node N_k, and e denotes the set of neighbours of N_i.
Step (6): strengthening the PR voting score of positive associations and reducing that of negative associations. In the association graph, an edge exists between two nodes if and only if a strong association rule holds between them. To strengthen positively associated votes and weaken negatively associated votes, formula (6) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (7)
Step (7): randomly walk on the association graph according to formula (7) and iteratively compute the PR value of each node until formula (8) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (8)
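A sketch of the building blocks of steps (1) and (2) — the six structure attributes and the Jensen–Shannon distance — using the worked distributions P and Q above; networkx and the natural-log base are assumptions:

```python
import math
import networkx as nx

def structure_attributes(G, node):
    """atr1..atr6 of step (1), from BFS shortest-path lengths up to 3."""
    dist = nx.single_source_shortest_path_length(G, node, cutoff=3)
    atr1 = sum(1 for v in dist.values() if v == 1)
    atr2 = sum(1 for v in dist.values() if v == 2)
    atr3 = sum(1 for v in dist.values() if v == 3)
    nbr_deg = [G.degree(n) for n in G[node]]
    atr4 = atr1 / (sum(nbr_deg) / len(nbr_deg)) if nbr_deg else 0.0
    atr5 = atr2 / atr1 if atr1 else 0.0
    atr6 = atr3 / atr2 if atr2 else 0.0
    return (atr1, atr2, atr3, atr4, atr5, atr6)

def jsd(P, Q):
    """Jensen-Shannon divergence of two co-occurrence distributions."""
    total = 0.0
    for p, q in zip(P, Q):
        m = (p + q) / 2
        if p:
            total += 0.5 * p * math.log(p / m)
        if q:
            total += 0.5 * q * math.log(q / m)
    return total

# Worked distributions of step (2): node {b, e} vs node {a}
P = [2/5, 4/5, 3/5, 0, 4/5]
Q = [3/5, 2/5, 3/5, 1/5, 2/5]
print(jsd(P, Q))
```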
5) Clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document;
the nodes of the association graph are formed according to the frequent closed item set taking subsets with different lengths, the subsets of the k-item set contain 2k-1 non-empty subsets in total, and the 1-item set is necessarily the subset of the (k-1) -item set, and the phenomenon that the (k-1) -item set has semantic inclusion relation with the 1-item set can be considered. In order to prevent semantic inclusion relation between important nodes obtained after the nodes are sorted in a descending order, the invention carries out K-Mediod clustering on the sorted nodes.
Because non-key terms could otherwise form clusters of their own, the top P nodes after descending sorting are selected first; K (K ≤ P) of them are chosen as the K class centres; then the semantic distance between each remaining node and each class centre is computed and the node is assigned to the nearest class; finally, after all nodes are assigned, the class centre of each class is found anew and the non-centre nodes are reassigned. The processes of centre adjustment and node assignment are repeated until the difference between two consecutive node assignments reaches the clustering termination threshold, and the class centre of each cluster is extracted as the final keywords of the text. The algorithm is concretely realized by the following steps (a minimal sketch follows the steps):
step (1): select the top P nodes after descending sorting;
step (2): select K (K ≤ P) of the P nodes as the K class centres;
step (3): compute the semantic distance between each remaining node and each class centre, and assign the node to the class with the closest distance;
step (4): for the nodes of each class other than the class centre, find the point minimizing the sum of semantic distances to all other points of the class and take it as the new class centre;
step (5): repeat step (4) until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text.
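A minimal K-Medoid sketch of steps (1)-(5); dist is the pairwise semantic distance (e.g. the JSD above), and seeding with the K highest-ranked nodes is an assumption, since the initialization is not fixed here:

```python
def k_medoid(nodes, dist, K, max_iter=100):
    """nodes: top-P nodes in descending GSNA order; returns final medoids."""
    medoids = nodes[:K]                                   # steps (1)-(2)
    for _ in range(max_iter):
        clusters = {m: [m] for m in medoids}
        for n in nodes:
            if n in medoids:
                continue
            nearest = min(medoids, key=lambda m: dist(n, m))  # step (3)
            clusters[nearest].append(n)
        new_medoids = [min(c, key=lambda x: sum(dist(x, y) for y in c))
                       for c in clusters.values()]        # step (4)
        if set(new_medoids) == set(medoids):              # step (5): stable
            break
        medoids = new_medoids
    return medoids                                        # cluster centres = keywords
```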
The invention denotes the extracted keyword sequence as N_ret and the labelled keyword sequence as N_rel; the evaluation indicators involved are MAP (Mean Average Precision), Recall and F_β. Since the invention expects the retrieved keywords to be accurate and ranked as early as possible, F_β is slightly modified by introducing MAP into its calculation. The formulas are defined as follows:

Recall = |N_ret ∩ N_rel| / |N_rel|

F_β = (1 + β²) · MAP · Recall / (β² · MAP + Recall)

[the MAP formula, computed from index(t_i, N_rel) and g(t_i, N_rel), is given in the original formula image]

wherein the function index(t_i, N_rel) returns the index position of term t_i in the sequence N_rel; g(t_i, N_rel) is an indicator function that returns 1 if term t_i appears in the labelled keyword sequence N_rel and 0 otherwise; compared with recall, the invention focuses more on precision, so β in the F_β indicator is set to 0.5.
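Under the stated convention (β = 0.5, MAP substituted for precision in F_β), the metrics can be sketched as follows; the MAP variant here is standard average precision, an assumption since the exact formula image is not reproduced:

```python
def average_precision(ret, rel):
    """Standard AP over the ranked extraction ret against the label set rel."""
    hits, score = 0, 0.0
    for i, t in enumerate(ret, start=1):
        if t in rel:
            hits += 1
            score += hits / i
    return score / len(rel) if rel else 0.0

def f_beta(map_score, recall, beta=0.5):
    """F_beta with MAP in place of precision, as described in the text."""
    if map_score == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * map_score * recall / (beta**2 * map_score + recall)

ret = ["natural", "selection", "variation", "organism"]  # illustrative output
rel = {"natural", "selection", "organism"}               # illustrative labels
m = average_precision(ret, rel)
r = len(set(ret) & rel) / len(rel)
print(m, r, f_beta(m, r))
```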
Design of experiments
In order to verify the feasibility of the algorithm, the experimental design comprises four parts: first, experimental data and suitable evaluation indicators are selected; second, the input parameter settings corresponding to the optimal association graph model are sought; then the number of keywords to extract under the optimal association graph is discussed; finally, four keyword extraction algorithms are selected for comparison with the algorithm of the invention, further demonstrating its superiority.
1. Influence of input parameters
Because the model has many input parameters, different parameter settings affect the algorithm result to different degrees. For example, if minsup in the CLOSE algorithm is set too high, a large number of meaningful words are discarded, while if it is set too low, a large number of redundant words are included; changes of min⁺_lift and max⁻_lift directly influence the shape of the association graph; and adjusting the damping coefficient d of the GSNA random walk also influences the keyword extraction effect. The invention sets the variation range of minsup to [0.5%, 10%]; min⁺_lift increases from a base of 0.1 in steps of 0.1; max⁻_lift has an initial value of 5 and increases in steps of 5; d is taken empirically, with its variation range set to [0.15, 0.85]. Suppose the sorted keyword sequence is N_v; to avoid the complexity of analysing the extraction effect under too many parameter variations, the number of extracted keywords is temporarily disregarded, and when tuning the model parameters the first |N_rel| terms of each obtained keyword sequence N_v are extracted as the keyword sequence N_ret and used to compute the MAP.
2. Selection of the number of extracted keywords
Keyword extraction focuses on selecting the words most relevant to the text, so the number of keywords selected matters. The invention introduces an extraction ratio r = |N_ret| / |N_v|; with the parameters found in 1, N_ret = r × N_v determines how many keywords are extracted from the ranked sequence N_v, and the optimal r is computed on the basis of the parameters obtained in 1.
3. Algorithm comparison and analysis
In order to verify the superiority and comparability of the algorithm, the keyword results of TextRank, GraphSum, a keyword extraction method considering only node association and not graph structure (Key Words Extraction Method by Node Association, NA), and a keyword extraction method considering only graph structure and not node association (Key Words Extraction Method by Graph Structure, GS) are compared with those of the proposed algorithm.
The four comparison methods are chosen for the following reasons: (1) TextRank is the traditional algorithm for automatic keyword generation and considers neither graph structure nor node association information; (2) comparing the NA algorithm, which considers only node association information, and the GS algorithm, which considers only graph structure information, with the GSNA algorithm, which considers both, shows the importance of graph structure and node association for keyword extraction; (3) GSNA improves on the GraphSum method, which is the most similar to the method of the invention; since GraphSum is designed to generate text summaries, the sentence-covering step of the original GraphSum algorithm is removed in the experiments so that it yields text keywords.

Claims (3)

1. A keyword extraction method for combining graph structure and node association is characterized in that:
1) acquiring a text and performing text preprocessing; the text preprocessing method comprises the following steps:
a. acquiring a text, wherein the text consists of a plurality of sentences;
b. removing all tables of contents, titles, figures and tables from the text, leaving only body text;
c. segmenting words of the text, and segmenting words of the English text based on simple spaces; for Chinese text, performing word segmentation by using a common word segmentation algorithm;
d. filtering stop words of the text;
e. removing repeated words in each sentence;
2) mining a frequent closed item set of a document and generating a strong association rule set; the method for generating the strong association rule set comprises the following steps:
a. establishing the transaction set and item sets; a transaction is a single sentence of the text T; the transaction set is the whole preprocessed text T; an item set is any subset of the words of T, the full word set being denoted I = {w_1, w_2, ..., w_m}, where w_i is the i-th word;
b. solving the closed item sets; an item set X is closed if and only if S(X) = f(g(X)) = X, where g(X) is the set of transactions containing X, f returns the items common to a set of transactions, and S = f∘g is the closure operation;
c. solving the length of the item set, wherein the number of items in one item set is called the length of the item set, and the item set with the length of k is called a k-item set;
d. solving the support of an item set; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the support sup(A → B) of A → B is the ratio of the number of transactions in T containing A ∪ B to the total number of transactions in T;
e. calculating the association strength; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, the association strength of A → B is

lift(A → B) = sup(A ∪ B) / (sup(A) · sup(B))    (1)
f. solving strong association rules; for an association rule A → B, where A, B ⊆ I and A ∩ B = ∅, A → B is a strong association rule if it satisfies sup(A → B) ≥ minsup and either lift(A → B) ≤ max⁻_lift or lift(A → B) ≥ min⁺_lift; minsup is the minimum support threshold of association rules, max⁻_lift is the maximum negative-association threshold, and min⁺_lift is the minimum positive-association threshold;
g. solving the frequent closed item sets; an item set X is a frequent closed item set if and only if sup(X) ≥ minsup and S(X) = X;
h. generating strong association rules; first, the frequent closure sets of transaction set T are L = {L_1, L_2, ..., L_k}, where L_i is the frequent closure set of the i-item sets; second, the closures of different lengths in L and their supports are extracted to form the frequent closed item set FC; then, from FC the frequent item sets of each length and their supports Fre = {f_1, f_2, ..., f_k} are obtained, where f_i is the set of i-frequent item sets with their supports; finally, for each f_i (2 ≤ i ≤ k) association rules AR of different forms are generated, each composed of a rule head that is an m-item set (1 ≤ m ≤ i−1) and a rule body that is the complementary (i−m)-item set, and lift(m-itemset → (i−m)-itemset) is computed from Fre; the rules satisfying lift ≤ max⁻_lift or lift ≥ min⁺_lift are extracted as strong association rules;
3) taking the non-repeating rule heads and rule bodies of the strong association rule set as nodes, placing an edge between two nodes if and only if a strong association rule exists between them, and taking the association strength of the rule as the edge weight to construct the association graph of the document;
4) randomly walking on the association graph with the GSNA algorithm, iteratively computing the importance score of each node, and sorting the results in descending order;
5) clustering the top-ranked nodes, and taking the class centre of each class as the keyword extraction result of the document.
2. The keyword extraction method combining graph structure and node association according to claim 1, characterized in that the GSNA algorithm in 4) comprises the following steps:
a. quantifying the graph structure attributes of nodes; six indicators are selected to quantify the graph structure attributes of a node: atr1, atr2 and atr3 are the numbers of nodes at distances 1, 2 and 3 from the node respectively, atr4 is the ratio of atr1 to the average degree of the node's neighbours, atr5 is the ratio of atr2 to atr1, and atr6 is the ratio of atr3 to atr2; for a node N_i of the association graph, N_i is formalized as N_i = (atr1, atr2, atr3, atr4, atr5, atr6);
b. solving the semantic distance between two nodes; suppose the word set of the preprocessed document is I = {w_1, w_2, ..., w_n}, the co-occurrence probability distribution of node N_i with I is P = {p_1, p_2, ..., p_n}, and that of N_j with I is Q = {q_1, q_2, ..., q_n}; the semantic distance between N_i and N_j is then the Jensen–Shannon divergence

JSD(N_i, N_j) = ½ Σ_{k=1}^{n} p_k log(2p_k / (p_k + q_k)) + ½ Σ_{k=1}^{n} q_k log(2q_k / (p_k + q_k))    (2)
c. calculating the inter-node similarity combining graph structure attributes and semantic information; integrating the graph structure attributes and semantic information of nodes, the similarity measure of N_i and N_j is shown in formula (3): the larger the difference between N_i and N_j in graph structure attributes or semantic distribution, the lower the similarity s_ij; conversely, the closer their structure attributes or semantic distributions, the higher s_ij;

[formula (3): the similarity s_ij of N_i and N_j, combining their graph-structure attribute difference with JSD(N_i, N_j); given in the original formula image]
e. normalizing the similarity of each node in the association graph; assuming the association graph contains N nodes in total, nodes linked by a strong association have a similarity value and nodes without a strong association have similarity 0, so a similarity matrix S_{N×N} of the association graph can be formed; to measure more formally the degree of similarity of one node to all nodes, the similarity of each node in the association graph is normalized as

r_i = Σ_{j=1}^{N} s_ij / Σ_{k=1}^{N} Σ_{j=1}^{N} s_kj    (4)
f. modifying the PageRank formula;
r_i represents the degree of similarity of node N_i to all nodes of the graph on average, and r = (r_1, r_2, ..., r_N) is the fixed probability distribution over the nodes of the association graph determined by graph structure and semantic attributes; using r_i, the PageRank formula is modified so that node jumps no longer follow equal probability, but nodes more similar to the rest of the association graph are selected with higher probability:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} PR(N_k) / C(N_k)    (5)
since the association graph is undirected, C(N_k) in formula (5) denotes the degree of node N_k, and e denotes the set of neighbours of N_i;
g. strengthening the PR voting score of positive associations and reducing that of negative associations; in the association graph, an edge exists between two nodes if and only if a strong association rule holds between them; to strengthen positively associated votes and weaken negatively associated votes, formula (5) is improved using the lift values as edge weights:

PR(N_i) = (1 − d) · r_i + d · Σ_{N_k ∈ e} lift(N_k → N_i) · PR(N_k) / C(N_k)    (6)
h. randomly walking on the association graph according to formula (6) and iteratively computing the PR value of each node until formula (7) is satisfied, at which point the node scores have converged, where β is the random-walk termination threshold:

max_i | PR^{(t+1)}(N_i) − PR^{(t)}(N_i) | < β    (7)
3. The keyword extraction method combining graph structure and node association according to claim 1, characterized in that the nodes in 5) are clustered as follows: perform K-Medoid clustering on the sorted nodes; first, select the top P nodes after descending sorting; choose K (K ≤ P) of them as the K class centres; then compute the semantic distance between each remaining node and each class centre and assign the node to the nearest class; finally, after all nodes are assigned, find the class centre of each class anew and reassign the non-centre nodes; repeat the processes of centre adjustment and node assignment until the difference between two consecutive node assignments reaches the clustering termination threshold, and extract the class centre of each cluster as the final keywords of the text.
CN201711285868.0A 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association Expired - Fee Related CN108132927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711285868.0A CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711285868.0A CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Publications (2)

Publication Number Publication Date
CN108132927A (en) 2018-06-08
CN108132927B (en) 2022-02-11

Family

ID=62389227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711285868.0A Expired - Fee Related CN108132927B (en) 2017-12-07 2017-12-07 Keyword extraction method for combining graph structure and node association

Country Status (1)

Country Link
CN (1) CN108132927B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920660B (en) * 2018-07-04 2020-11-20 中国银行股份有限公司 Keyword weight obtaining method and device, electronic equipment and readable storage medium
CN109614466A (en) * 2018-11-15 2019-04-12 阿里巴巴集团控股有限公司 The keyword rule generating method and its device excavated based on Frequent Set
CN109726402B (en) * 2019-01-11 2022-12-23 中国电子科技集团公司第七研究所 Automatic extraction method for document subject term
CN109783628B (en) * 2019-01-16 2022-06-21 福州大学 Method for searching KSAARM by combining time window and association rule mining
CN111062356B (en) * 2019-12-26 2024-03-26 沈阳理工大学 Method for automatically identifying abnormal human body actions from monitoring video
CN111310436B (en) * 2020-02-11 2022-02-15 腾讯科技(深圳)有限公司 Text processing method and device based on artificial intelligence and electronic equipment
CN111461109B (en) * 2020-02-27 2023-09-15 浙江工业大学 Method for identifying documents based on environment multi-class word stock
CN111401928B (en) * 2020-04-01 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for determining semantic similarity of text based on graph data
CN112016836B (en) * 2020-08-31 2023-11-03 中国银联股份有限公司 Method and device for determining similarity between objects
CN112561081B (en) * 2020-12-18 2022-05-03 北京百度网讯科技有限公司 Conversion method and device of deep learning model, electronic equipment and storage medium
CN113032551B (en) * 2021-05-24 2021-09-10 北京泽桥传媒科技股份有限公司 Delivery progress calculation method and system based on combination of big data and article title
CN116433837B (en) * 2023-03-31 2023-09-12 浙江大学 Three-dimensional CAD model difference analysis method based on key point matching

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488861A (en) * 2008-12-19 2009-07-22 中山大学 Keyword extracting method for network unknown application
US7577652B1 (en) * 2008-08-20 2009-08-18 Yahoo! Inc. Measuring topical coherence of keyword sets
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN104933032A (en) * 2015-06-29 2015-09-23 电子科技大学 Method for extracting keywords of blog based on complex network
CN105975488A (en) * 2016-04-25 2016-09-28 哈尔滨工程大学 Method for querying keyword based on topic cluster unit in relational database
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609528B (en) * 2012-02-14 2014-06-18 云南大学 Frequent mode association sorting method based on probabilistic graphical model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7577652B1 (en) * 2008-08-20 2009-08-18 Yahoo! Inc. Measuring topical coherence of keyword sets
CN101488861A (en) * 2008-12-19 2009-07-22 中山大学 Keyword extracting method for network unknown application
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN104933032A (en) * 2015-06-29 2015-09-23 电子科技大学 Method for extracting keywords of blog based on complex network
CN105975488A (en) * 2016-04-25 2016-09-28 哈尔滨工程大学 Method for querying keyword based on topic cluster unit in relational database
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
News keyword extraction algorithm based on PageRank (基于PageRank的新闻关键词提取算法); 顾亦然 et al.; Journal of University of Electronic Science and Technology of China (电子科技大学学报); 2017-09-30; vol. 46, no. 5; pp. 778-779 *
Chinese keyword extraction algorithm based on frequent pattern mining (基于频繁模式挖掘的中文关键词提取算法); 崔诚煜 et al.; Journal of Terahertz Science and Electronic Information Technology (太赫兹科学与电子信息学报); 2015-04-30; vol. 13, no. 2; pp. 280-282 *

Also Published As

Publication number Publication date
CN108132927A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN109408743B (en) Text link embedding method
CN112507699B (en) Remote supervision relation extraction method based on graph convolution network
CN110543564B (en) Domain label acquisition method based on topic model
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
KR20020049164A (en) The System and Method for Auto - Document - classification by Learning Category using Genetic algorithm and Term cluster
CN112989802A (en) Barrage keyword extraction method, device, equipment and medium
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN110851570A (en) Unsupervised keyword extraction method based on Embedding technology
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
Ghalehtaki et al. A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization
CN115329085A (en) Social robot classification method and system
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
Girsang et al. Extractive text summarization for indonesian news article using ant system algorithm
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN111563361B (en) Text label extraction method and device and storage medium
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN115544999A (en) Domain-oriented parallel large-scale text duplicate checking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220211