CN109635081B

CN109635081B - Text keyword weight calculation method based on word frequency power law distribution characteristics

Info

Publication number: CN109635081B
Application number: CN201811403149.9A
Authority: CN
Inventors: 陈雪; 郭峻材; 王小飞; 乐金雄; 王鹏; 骆祥峰; 魏晓; 张惠然
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2018-11-23
Filing date: 2018-11-23
Publication date: 2023-06-13
Anticipated expiration: 2038-11-23
Also published as: CN109635081A

Abstract

The invention discloses a text keyword weight calculation method based on word frequency power law distribution characteristics, which comprises the following specific steps: s1: opening a text for preprocessing, including removing stop words and segmentation words, and taking the rest words as candidate keywords; s2: using candidate keywords as nodes, word frequency as node weight, word co-occurrence frequency as edge weight, and constructing an undirected keyword network; s3: extracting a core network from the keyword network; s4: updating the weight of each node in the core network; s5: adding an edge to the core network and updating node weights; s6: judging whether the node weight distribution of the core network after the edge addition accords with the power law distribution, if not, turning to the step S5; if yes, go to step S7; s7: and outputting the weight corresponding to each candidate keyword. The method calculates the weight of the keyword aiming at a single text, does not depend on a field text set and a training set, is simple and easy to operate, and has better effect.

Description

Text keyword weight calculation method based on word frequency power law distribution characteristics

Technical Field

The invention relates to a text keyword weight calculation method based on word frequency power law distribution characteristics.

Background

The most widely used keyword extraction algorithms at present are TF-IDF, TEXTRANK, topic model and word2vec based on deep learning. TF-IDF is a typical bag of words method, ignoring the order, grammar and syntax of words, only considering text as a collection of several words, and each word is independent and appears independent of the appearance of other words. In addition, when calculating the Inverse Document Frequency (IDF) of a word, a domain text set needs to be relied on, and a single text cannot be targeted, and the quality and the scale of the domain text set have great influence on word weight calculation and keyword extraction. And (3) the text candidate keyword nodes and the co-occurrence relation links thereof form a directed graph, and node weights are iteratively propagated among the nodes by using a PAGERANK algorithm until convergence. Although the TEXTRANK does not need a training set and only needs to calculate the weight of a single text, the weight iteration process is based on word frequency and word co-occurrence frequency, and semantic information of the text is not considered. The topic model considers text to be made up of multiple topics, which in turn are probability distributions of words. The word-topic matrix and the topic-document matrix are inferred through a large-scale training set, namely the word-document matrix, so that text topic keywords are obtained. word2vec can convert words into multidimensional dense word vectors which can be understood by a machine, and is widely applied to the fields of machine translation, online automatic question-answering and the like, the accuracy is high, but a large-scale training set is needed for constructing the multidimensional dense word vectors, and the parameter scale of the training process is huge.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a text keyword weight calculation method based on word frequency power law distribution characteristics. The method fully utilizes the characteristics that the word frequency of the text presents power law distribution and the probability that the high-frequency word becomes a keyword is larger than that of the low-frequency word, utilizes a probability formula to convert the word weight initially expressed by the word frequency into the word weight expressed by the word importance, and only aims at a single text without the assistance of the field knowledge or the field text set of the text and without the recursion convergence process.

In order to achieve the above object, the present invention is conceived as follows:

after the text preprocessing, using candidate keywords as nodes, word frequency as initial weight of the nodes, word co-occurrence as edges, word co-occurrence frequency as initial weight of the edges, and constructing an undirected keyword network. Because the word frequency of the text presents the characteristic of power law distribution, the weight of the text word (the index representing the importance of the word) should also present the power law distribution, and the probability that the high-frequency word becomes a high-weight word (keyword) is larger than that of the low-frequency word. The word weight taking word frequency as an initial value is converted into the weight taking word importance as a word by utilizing a probability formula, so that the weight of all words can be obtained.

According to the inventive idea, the invention adopts the following technical scheme:

a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:

s1: opening a text for preprocessing, including removing stop words and segmentation words, and taking the rest words as candidate keywords;

s2: using candidate keywords as nodes, word frequency as node weight, word co-occurrence frequency as edge weight, and constructing an undirected keyword network;

s3: extracting a core network from the keyword network;

s4: updating the weight of each node in the core network;

s5: adding an edge to the core network and updating node weights;

s6: judging whether the node weight distribution of the core network after the edge addition accords with the power law distribution, if not, turning to the step S5; if yes, go to step S7;

s7: and outputting the weight corresponding to each candidate keyword.

In the step S3, a core network is extracted from the keyword network, and the method is as follows: two node sets A and B, and one edge set L are designed. A is a core network node set, B is a keyword network node set, and L is a core network edge set. In the initial state, all nodes belong to B. And selecting the node with the largest weight from the B as the current node to be added into the A, selecting the edge with the largest weight from the edges of the current node as the current edge to be added into the L, and updating the current node as the other end node corresponding to the current edge at the moment. And if the updated current node already belongs to A, selecting the node with the largest weight from B as the current node. Repeating the steps until the set B is empty.

The weight of each node in the core network is updated in the step S4, which is to update the weight of each node to the sum of the original weight of each node and the weights of all the current sides of the node.

In the step S5, an edge adding operation is performed on the core network and node weights are updated, and the method is that two nodes are selected and an edge is added, the probability of each node being selected is proportional to the weight of the node, the greater the weight is, the greater the probability of the node being selected is, and the weights of the two selected nodes are respectively added with 1; if the two nodes are connected by the edge, the edge weight is added with 1; otherwise, adding an edge between the two nodes, wherein the weight is 1.

The node weight distribution in the step S6 is defined as the number N of nodes with weight k _k Where k=0, 1,2 ….

The power law distribution function in the step S6 is f (n) =c×n ^-α Wherein c and α are both constants.

The node weight distribution after each edge addition is calculated in the step S6 is as follows:

wherein N is _k And N _k-1 The number of the nodes with the weights of k and k-1 respectively, and theta is a smoothing factor so as to avoid that the nodes with k=0 cannot carry out edge adding operation; solving the differential equation yields: />

Further get->

From this follows: n (N) _k ～k ^-3 The method comprises the steps of carrying out a first treatment on the surface of the The weight distribution will gradually follow the power law distribution when the number of edges is sufficiently large.

Compared with the prior art, the text keyword weight calculation method has the following outstanding advantages:

the method has no field text set and training set, can utilize word frequency and weight of the text to present power law distribution by scanning single text, and the word with larger word frequency has larger weight, and the word frequency is converted into word weight by utilizing a probability formula. The specific method is that edges representing word frequency are deleted by probability, then nodes are selected to be added with the edges according to the probability that the current weight is in direct proportion, and meanwhile node weights are added until the weight distribution presents power law distribution. Since the power law distribution is of the order of magnitude of scalability, it can be used as a function of TFIDF and is simpler than TFIDF.

Drawings

Fig. 1 is a flow chart of the method of the present invention.

Detailed Description

Embodiments of the present invention are further described below with reference to the accompanying drawings.

This example is given by way of example in the article "rating: A Structured P2P Overlay Based on Harmonic Series" in journal IEEE Transactions on Parallel and Distributed Systems.

As shown in fig. 1, a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:

s1: and opening the text for preprocessing, wherein stop words and segmentation words are removed, and the rest words are used as candidate keywords.

S2: and using the candidate keywords as nodes, using word frequency as node weight, using word co-occurrence frequency as edge weight, and constructing an undirected keyword network. The keyword network is represented in a table form, see table 1, where 1 (a) represents the weight of the node, and 1 (b) represents the edge weight.

TABLE 1 (a) node weights for keyword networks

TABLE 1 (b) edge weights for keyword networks

S3: extracting a core network from the keyword network; the method comprises the following steps: two node sets A and B, and one edge set L are designed. A is a core network node set, B is a keyword network node set, and L is a core network edge set. In the initial state, all nodes belong to B. And selecting the node with the largest weight from the B as the current node to be added into the A, selecting the edge with the largest weight from the edges of the current node as the current edge to be added into the L, and updating the current node as the other end node corresponding to the current edge at the moment. And if the updated current node already belongs to A, selecting the node with the largest weight from B as the current node. Repeating the steps until the set B is empty. The core network is represented in a table, see table 2,2 (a) for node weights and 2 (b) for edge weights.

2 (a) node weights of core network

Edge weight of core network

S4: updating the weight of each node in the core network; the method is to update the weight of each node to the sum of the original weight of each node and the weights of all the current sides of the node. The updated core network is represented by a table, and table 3 represents the updated node weight, and the edge weight at this time is still 2 (b).

TABLE 3 node weights for core networks

S5: adding an edge to the core network and updating node weights; selecting two nodes and adding an edge, wherein the probability of each node being selected is in direct proportion to the weight of the node, the greater the weight is, the greater the probability of the node being selected is, and the weight of each of the two selected nodes is added with 1; if the two nodes are connected by the edge, the edge weight is added with 1; otherwise, adding an edge between the two nodes, wherein the weight is 1. The weight of each node and its selected probability are shown in table 4 (a) below. Assuming that the selected nodes are P2P and overlay, the node weights and edge weights after the update operations are performed by the corresponding P2P and overlay in tables 4 (b) and 4 (c).

TABLE 4 (a) node weights and probabilities of being selected

Table 4 (b). Node weight update

Table 4 (c). Edge weight update

S6: judging whether the node weight distribution of the core network after the edge addition accords with the power law distribution, if not, turning to the step S5; if so, go to step S7. Wherein the node weight distribution is defined as the number N of nodes with weight k _k Where k=0, 1,2 …. The power law distribution function is f (n) =c×n- ^α Wherein c and α are both constants. The node weight distribution after each edge addition is calculated as follows:

wherein N is _k And N _k-1 The number of the nodes with the weights of k and k-1 respectively, and theta is a smoothing factor so as to avoid that the nodes with k=0 cannot carry out edge adding operation; solving the differential equation yields:

further get->

S7: and outputting the weight corresponding to each candidate keyword.

Claims

1. The text keyword weight calculation method based on word frequency power law distribution characteristics is characterized in that the method fully utilizes the characteristics that word frequency of a text presents power law distribution and probability that high-frequency words become keywords is larger than that of low-frequency words, and a probability formula is utilized to convert word weights initially represented by word frequency into word weights represented by word importance; the method comprises the following specific steps:

s3: extracting a core network from the keyword network;

s4: updating the weight of each node in the core network;

s5: adding an edge to the core network and updating node weights;

s7: and outputting the weight corresponding to each candidate keyword.

2. The text keyword weight calculation method based on the word frequency power law distribution characteristics according to claim 1, wherein the core network is extracted from the keyword network in step S3, and the method is as follows: designing two node sets A and B and an edge set L; a is a core network node set, B is a keyword network node set, and L is a core network edge set; in the initial state, all nodes belong to B, the node with the largest weight is selected from B as the current node to be added into A, the edge with the largest weight is selected from the edges of the current node to be added into L as the current edge, and at the moment, the current node is updated to be the other end node corresponding to the current edge; if the updated current node already belongs to A, the node with the largest weight is selected from B as the current node, and the steps are repeated until the B set is empty.

3. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the step S4 is to update the weight of each node in the core network by updating the weight of each node to the sum of the original weight of each node and the weights of all the current sides of the node.

4. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein in the step S5, an operation of adding one edge to a core network and updating node weights are performed, and the method is that two nodes are selected and added with one edge, wherein the probability of each node being selected is proportional to the weight, and the greater the weight is, the greater the probability of the node being selected is, and the two selected node weights are respectively added with 1; if the two nodes are connected by the edge, the edge weight is added with 1; otherwise, adding an edge between the two nodes, wherein the weight is 1.

5. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the node weight distribution in the step S6 is defined as the number N of nodes with weight k _k Where k=0, 1,2 ….

6. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the power law distribution function in the step S6 is f (n) =c×n ^-α Wherein c and α are both constants.

7. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the node weight distribution after each edge addition is calculated in step S6 as follows:

Further get->

From this follows: n (N) _k ～k ^-3 The method comprises the steps of carrying out a first treatment on the surface of the The weight distribution will gradually follow the power law distribution when the number of edges is sufficiently large. />