CN109635081B - Text keyword weight calculation method based on word frequency power law distribution characteristics - Google Patents
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention discloses a text keyword weight calculation method based on the power-law distribution characteristics of word frequency, comprising the following steps. S1: open a text and preprocess it, including word segmentation and stop-word removal, taking the remaining words as candidate keywords. S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights. S3: extract a core network from the keyword network. S4: update the weight of each node in the core network. S5: add an edge to the core network and update the node weights. S6: judge whether the node weight distribution of the core network after the edge addition follows a power-law distribution; if not, return to step S5; if so, go to step S7. S7: output the weight corresponding to each candidate keyword. The method computes keyword weights from a single text, does not depend on a domain text set or a training set, is simple to implement, and achieves good results.
Description
Technical Field
The invention relates to a text keyword weight calculation method based on word frequency power law distribution characteristics.
Background
The most widely used keyword extraction algorithms at present are TF-IDF, TextRank, topic models, and word2vec-based deep learning methods. TF-IDF is a typical bag-of-words method: it ignores word order, grammar, and syntax, treats the text as a collection of words, and assumes each word occurs independently of the others. Moreover, computing the inverse document frequency (IDF) of a word requires a domain text set, so the method cannot target a single text, and the quality and scale of the domain text set strongly affect word weight calculation and keyword extraction. TextRank builds a graph whose nodes are the candidate keywords of the text, linked by their co-occurrence relations, and iteratively propagates node weights with the PageRank algorithm until convergence. Although TextRank needs no training set and operates on a single text, its weight iteration is based only on word frequency and word co-occurrence frequency and does not consider the semantic information of the text. Topic models regard a text as a mixture of topics, each topic being a probability distribution over words; the word-topic and topic-document matrices are inferred from a large-scale training set (the word-document matrix) to obtain the topic keywords of the text. word2vec converts words into dense multidimensional word vectors that a machine can process, and is widely applied in machine translation, online question answering, and similar fields with high accuracy; however, constructing the word vectors requires a large-scale training set, and the training process has a huge parameter scale.
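For illustration, a minimal TF-IDF sketch (the toy corpus and whitespace-free token lists are assumptions for the example, not from the patent) makes the dependence on a background document set explicit:

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    """TF-IDF scores for one document; idf requires the whole corpus."""
    n_docs = len(corpus)
    df = Counter(w for d in corpus for w in set(d))   # document frequency
    tf = Counter(doc_tokens)
    return {w: (c / len(doc_tokens)) * math.log(n_docs / df[w])
            for w, c in tf.items()}

corpus = [["p2p", "overlay"], ["overlay", "routing"], ["p2p", "overlay", "ring"]]
scores = tfidf(corpus[2], corpus)   # words appearing in every document score zero
```

Because `df` is computed over `corpus`, the scores change whenever the background collection changes; this is exactly the dependence the proposed single-text method avoids.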
Disclosure of Invention
To address the defects of the prior art, the invention provides a text keyword weight calculation method based on the power-law distribution characteristics of word frequency. The method exploits the facts that the word frequencies of a text follow a power-law distribution and that a high-frequency word is more likely than a low-frequency word to be a keyword. Using a probabilistic procedure, it converts word weights initially expressed as word frequencies into weights expressing word importance. It operates on a single text only, without the assistance of domain knowledge or a domain text set, and without an iterative convergence process.
In order to achieve the above object, the present invention is conceived as follows:
After text preprocessing, an undirected keyword network is constructed with the candidate keywords as nodes, word frequencies as initial node weights, word co-occurrences as edges, and word co-occurrence frequencies as initial edge weights. Because the word frequencies of a text follow a power-law distribution, the word weights (the index of word importance) should also follow a power-law distribution, and a high-frequency word is more likely than a low-frequency word to become a high-weight word (keyword). The word weights, initialized with word frequencies, are then converted by a probabilistic procedure into weights expressing word importance, which yields the weights of all words.
According to the inventive idea, the invention adopts the following technical scheme:
a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:
S1: open a text and preprocess it, including word segmentation and stop-word removal, taking the remaining words as candidate keywords;
S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights;
S3: extract a core network from the keyword network;
S4: update the weight of each node in the core network;
S5: add an edge to the core network and update the node weights;
S6: judge whether the node weight distribution of the core network after the edge addition follows a power-law distribution; if not, return to step S5; if so, go to step S7;
S7: output the weight corresponding to each candidate keyword.
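Steps S1 and S2 above can be sketched as follows (the stop-word list, the pre-tokenized input, and the co-occurrence window size are illustrative assumptions; the patent does not fix them):

```python
from collections import Counter

def build_keyword_network(tokens, stopwords, window=2):
    """S1: drop stop words; S2: nodes weighted by term frequency,
    undirected edges weighted by windowed co-occurrence frequency."""
    words = [w for w in tokens if w not in stopwords]
    node_w = Counter(words)                     # node weight = word frequency
    edge_w = Counter()
    for i, w in enumerate(words):
        for v in words[i + 1:i + 1 + window]:   # co-occurrence within the window
            if v != w:
                edge_w[frozenset((w, v))] += 1  # undirected edge
    return node_w, edge_w

tokens = "the p2p overlay network the p2p overlay routing".split()
node_w, edge_w = build_keyword_network(tokens, stopwords={"the"})
```

Edges are stored as frozensets of their two endpoints so that the network is genuinely undirected.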
In step S3, a core network is extracted from the keyword network as follows. Design two node sets A and B and one edge set L, where A is the core-network node set, B is the keyword-network node set, and L is the core-network edge set. In the initial state, all nodes belong to B. Select the node with the largest weight from B as the current node and add it to A; select the edge with the largest weight among the edges of the current node as the current edge and add it to L; then update the current node to the node at the other end of the current edge. If the updated current node already belongs to A, select the node with the largest weight from B as the current node. Repeat these steps until set B is empty.
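One reading of this extraction procedure, sketched in Python (the data structures are assumptions; edges are frozensets of their two endpoints, and self-loops are assumed absent):

```python
def extract_core(node_w, edge_w):
    """S3: starting from the heaviest node, repeatedly follow the
    heaviest incident edge; restart from the heaviest remaining node
    whenever the walk reaches an already-visited node."""
    A, L = set(), set()          # core nodes, core edges
    B = set(node_w)              # not-yet-visited keyword-network nodes
    current = max(B, key=node_w.get)
    while B:
        if current in A:         # walk closed on itself: restart
            current = max(B, key=node_w.get)
        B.discard(current)
        A.add(current)
        incident = [e for e in edge_w if current in e]
        if incident:
            e = max(incident, key=edge_w.get)
            L.add(e)
            current, = set(e) - {current}   # other endpoint of the edge
    return A, L

node_w = {"a": 3, "b": 2, "c": 1}
edge_w = {frozenset("ab"): 2, frozenset("bc"): 1}
A, L = extract_core(node_w, edge_w)
```

Each loop iteration removes exactly one node from B, so the procedure terminates after one pass over the nodes.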
In step S4, the weight of each node in the core network is updated to the sum of its original weight and the weights of all its incident edges in the core network.
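In code, the S4 update is a single pass over the core edges (continuing the frozenset edge representation assumed above):

```python
def update_core_weights(node_w, core_edges, edge_w):
    """S4: each node's new weight = old weight + sum of the weights
    of its incident edges in the core network."""
    new_w = dict(node_w)
    for e in core_edges:
        for n in e:                 # an edge contributes to both endpoints
            new_w[n] += edge_w[e]
    return new_w

node_w = {"a": 3, "b": 2, "c": 1}
edge_w = {frozenset("ab"): 2, frozenset("bc"): 1}
updated = update_core_weights(node_w, edge_w.keys(), edge_w)
```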
In step S5, an edge is added to the core network and the node weights are updated as follows: select two nodes and add an edge between them, where the probability of a node being selected is proportional to its weight (the larger the weight, the larger the probability of being selected), and increase the weight of each of the two selected nodes by 1. If the two nodes are already connected by an edge, increase that edge's weight by 1; otherwise, add an edge with weight 1 between them.
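A sketch of one S5 edge-adding step (whether the two selected nodes must be distinct is not stated in the patent; this sketch resamples until they differ):

```python
import random

def add_edge_step(node_w, edge_w, rng=random):
    """S5: select two distinct nodes with probability proportional to
    weight, increment both node weights, and increment (or create
    with weight 1) the edge between them."""
    nodes = list(node_w)
    weights = [node_w[n] for n in nodes]
    u = rng.choices(nodes, weights=weights)[0]
    v = u
    while v == u:                        # assumption: distinct endpoints
        v = rng.choices(nodes, weights=weights)[0]
    node_w[u] += 1
    node_w[v] += 1
    e = frozenset((u, v))
    edge_w[e] = edge_w.get(e, 0) + 1     # existing edge +1, new edge starts at 1

node_w = {"a": 5, "b": 5, "c": 2}
edge_w = {frozenset("ab"): 2, frozenset("bc"): 1}
add_edge_step(node_w, edge_w, rng=random.Random(0))
```

Whichever pair is drawn, each step adds exactly 2 to the total node weight and 1 to the total edge weight, which is what drives the distribution toward a power law.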
The node weight distribution in step S6 is defined as the number N_k of nodes with weight k, where k = 0, 1, 2, ....
The power-law distribution function in step S6 is f(n) = c·n^(-α), where c and α are constants.
The node weight distribution after each edge addition in step S6 evolves according to

∂N_k/∂t = [(k-1+θ)·N_(k-1) - (k+θ)·N_k] / Σ_j (j+θ)·N_j,

where N_k and N_(k-1) are the numbers of nodes with weights k and k-1 respectively, and θ is a smoothing factor that allows nodes with k = 0 to take part in the edge-adding operation. Solving this differential equation yields the stationary ratio N_k/N_(k-1) = (k-1+θ)/(k+2+θ), and further N_k ∝ 1/[(k+θ)(k+1+θ)(k+2+θ)], from which N_k ~ k^(-3). When the number of added edges is sufficiently large, the weight distribution therefore gradually follows a power-law distribution.
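How conformity to the power law is judged in S6 is not spelled out in the patent; one common sketch is a least-squares fit of log N_k against log k, accepting the distribution when the log-log relation is close to linear:

```python
import math
from collections import Counter

def powerlaw_fit(node_w):
    """Fit log N_k = log c - alpha * log k by least squares.
    Returns (alpha, r_squared); r_squared near 1 indicates a power law."""
    dist = Counter(node_w.values())          # N_k: number of nodes with weight k
    pts = [(math.log(k), math.log(n)) for k, n in dist.items() if k > 0]
    m = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); syy = sum(y * y for _, y in pts)
    sxy = sum(x * y for x, y in pts)
    slope = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    r2 = (m * sxy - sx * sy) ** 2 / ((m * sxx - sx * sx) * (m * syy - sy * sy))
    return -slope, r2

# A synthetic network with exactly N_k = 64 * k**-3 at k = 1, 2, 4.
node_w = {}
for k, count in [(1, 64), (2, 8), (4, 1)]:
    for i in range(count):
        node_w[f"w{k}_{i}"] = k
alpha, r2 = powerlaw_fit(node_w)
```

On this synthetic input the fit recovers the exponent α = 3 stated in the derivation above.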
Compared with the prior art, the text keyword weight calculation method has the following outstanding advantages:
the method has no field text set and training set, can utilize word frequency and weight of the text to present power law distribution by scanning single text, and the word with larger word frequency has larger weight, and the word frequency is converted into word weight by utilizing a probability formula. The specific method is that edges representing word frequency are deleted by probability, then nodes are selected to be added with the edges according to the probability that the current weight is in direct proportion, and meanwhile node weights are added until the weight distribution presents power law distribution. Since the power law distribution is of the order of magnitude of scalability, it can be used as a function of TFIDF and is simpler than TFIDF.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
This embodiment takes as an example the article "HRing: A Structured P2P Overlay Based on Harmonic Series" from the journal IEEE Transactions on Parallel and Distributed Systems.
As shown in fig. 1, a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:
s1: and opening the text for preprocessing, wherein stop words and segmentation words are removed, and the rest words are used as candidate keywords.
S2: and using the candidate keywords as nodes, using word frequency as node weight, using word co-occurrence frequency as edge weight, and constructing an undirected keyword network. The keyword network is represented in a table form, see table 1, where 1 (a) represents the weight of the node, and 1 (b) represents the edge weight.
TABLE 1 (a) node weights for keyword networks
TABLE 1 (b) edge weights for keyword networks
S3: extracting a core network from the keyword network; the method comprises the following steps: two node sets A and B, and one edge set L are designed. A is a core network node set, B is a keyword network node set, and L is a core network edge set. In the initial state, all nodes belong to B. And selecting the node with the largest weight from the B as the current node to be added into the A, selecting the edge with the largest weight from the edges of the current node as the current edge to be added into the L, and updating the current node as the other end node corresponding to the current edge at the moment. And if the updated current node already belongs to A, selecting the node with the largest weight from B as the current node. Repeating the steps until the set B is empty. The core network is represented in a table, see table 2,2 (a) for node weights and 2 (b) for edge weights.
TABLE 2 (a) node weights of the core network
TABLE 2 (b) edge weights of the core network
S4: updating the weight of each node in the core network; the method is to update the weight of each node to the sum of the original weight of each node and the weights of all the current sides of the node. The updated core network is represented by a table, and table 3 represents the updated node weight, and the edge weight at this time is still 2 (b).
TABLE 3 node weights for core networks
S5: adding an edge to the core network and updating node weights; selecting two nodes and adding an edge, wherein the probability of each node being selected is in direct proportion to the weight of the node, the greater the weight is, the greater the probability of the node being selected is, and the weight of each of the two selected nodes is added with 1; if the two nodes are connected by the edge, the edge weight is added with 1; otherwise, adding an edge between the two nodes, wherein the weight is 1. The weight of each node and its selected probability are shown in table 4 (a) below. Assuming that the selected nodes are P2P and overlay, the node weights and edge weights after the update operations are performed by the corresponding P2P and overlay in tables 4 (b) and 4 (c).
TABLE 4 (a) node weights and probabilities of being selected
Table 4 (b). Node weight update
Table 4 (c). Edge weight update
S6: judging whether the node weight distribution of the core network after the edge addition accords with the power law distribution, if not, turning to the step S5; if so, go to step S7. Wherein the node weight distribution is defined as the number N of nodes with weight k k Where k=0, 1,2 …. The power law distribution function is f (n) =c×n- α Wherein c and α are both constants. The node weight distribution after each edge addition is calculated as follows:wherein N is k And N k-1 The number of the nodes with the weights of k and k-1 respectively, and theta is a smoothing factor so as to avoid that the nodes with k=0 cannot carry out edge adding operation; solving the differential equation yields:further get->From this follows: n (N) k ~k -3 The method comprises the steps of carrying out a first treatment on the surface of the The weight distribution will gradually follow the power law distribution when the number of edges is sufficiently large.
S7: and outputting the weight corresponding to each candidate keyword.
Claims (7)
1. A text keyword weight calculation method based on word frequency power-law distribution characteristics, characterized in that the method exploits the facts that the word frequencies of a text follow a power-law distribution and that a high-frequency word is more likely than a low-frequency word to be a keyword, and uses a probabilistic procedure to convert word weights initially expressed as word frequencies into weights expressing word importance; the method comprises the following steps:
S1: open a text and preprocess it, including word segmentation and stop-word removal, taking the remaining words as candidate keywords;
S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights;
S3: extract a core network from the keyword network;
S4: update the weight of each node in the core network;
S5: add an edge to the core network and update the node weights;
S6: judge whether the node weight distribution of the core network after the edge addition follows a power-law distribution; if not, return to step S5; if so, go to step S7;
S7: output the weight corresponding to each candidate keyword.
2. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that the core network is extracted from the keyword network in step S3 as follows: design two node sets A and B and one edge set L, where A is the core-network node set, B is the keyword-network node set, and L is the core-network edge set; in the initial state, all nodes belong to B; select the node with the largest weight from B as the current node and add it to A, select the edge with the largest weight among the edges of the current node as the current edge and add it to L, and update the current node to the node at the other end of the current edge; if the updated current node already belongs to A, select the node with the largest weight from B as the current node; repeat these steps until set B is empty.
3. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that in step S4 the weight of each node in the core network is updated to the sum of its original weight and the weights of all its incident edges in the core network.
4. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that in step S5 an edge is added to the core network and the node weights are updated as follows: select two nodes and add an edge between them, where the probability of a node being selected is proportional to its weight, the larger the weight the larger the probability of being selected, and increase the weight of each of the two selected nodes by 1; if the two nodes are already connected by an edge, increase that edge's weight by 1; otherwise add an edge with weight 1 between them.
5. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that the node weight distribution in step S6 is defined as the number N_k of nodes with weight k, where k = 0, 1, 2, ....
6. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that the power-law distribution function in step S6 is f(n) = c·n^(-α), where c and α are constants.
7. The text keyword weight calculation method based on word frequency power-law distribution characteristics according to claim 1, characterized in that the node weight distribution after each edge addition in step S6 evolves according to

∂N_k/∂t = [(k-1+θ)·N_(k-1) - (k+θ)·N_k] / Σ_j (j+θ)·N_j,

where N_k and N_(k-1) are the numbers of nodes with weights k and k-1 respectively, and θ is a smoothing factor that allows nodes with k = 0 to take part in the edge-adding operation; solving this differential equation yields the stationary ratio N_k/N_(k-1) = (k-1+θ)/(k+2+θ), and further N_k ∝ 1/[(k+θ)(k+1+θ)(k+2+θ)], from which N_k ~ k^(-3); when the number of added edges is sufficiently large, the weight distribution gradually follows a power-law distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201811403149.9A (CN109635081B) | 2018-11-23 | 2018-11-23 | Text keyword weight calculation method based on word frequency power law distribution characteristics
Publications (2)
Publication Number | Publication Date
---|---
CN109635081A | 2019-04-16
CN109635081B | 2023-06-13