CN109635081B - Text keyword weight calculation method based on word frequency power law distribution characteristics - Google Patents

Text keyword weight calculation method based on word frequency power law distribution characteristics

Info

Publication number
CN109635081B
CN109635081B, CN201811403149.9A
Authority
CN
China
Prior art keywords
weight
node
edge
keyword
power law
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811403149.9A
Other languages
Chinese (zh)
Other versions
CN109635081A (en)
Inventor
陈雪
郭峻材
王小飞
乐金雄
王鹏
骆祥峰
魏晓
张惠然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201811403149.9A priority Critical patent/CN109635081B/en
Publication of CN109635081A publication Critical patent/CN109635081A/en
Application granted granted Critical
Publication of CN109635081B publication Critical patent/CN109635081B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Abstract

The invention discloses a text keyword weight calculation method based on word frequency power law distribution characteristics, which comprises the following specific steps. S1: preprocess the text, including word segmentation and stop-word removal, and take the remaining words as candidate keywords. S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights. S3: extract a core network from the keyword network. S4: update the weight of each node in the core network. S5: add an edge to the core network and update the node weights. S6: judge whether the node weight distribution of the core network after the edge addition conforms to the power law distribution; if not, go to step S5; if so, go to step S7. S7: output the weight corresponding to each candidate keyword. The method calculates keyword weights for a single text, does not depend on a domain text set or a training set, is simple and easy to apply, and achieves good results.

Description

Text keyword weight calculation method based on word frequency power law distribution characteristics
Technical Field
The invention relates to a text keyword weight calculation method based on word frequency power law distribution characteristics.
Background
The most widely used keyword extraction algorithms at present are TF-IDF, TEXTRANK, topic models, and deep-learning-based word2vec. TF-IDF is a typical bag-of-words method: it ignores word order, grammar, and syntax, treats a text merely as a collection of words, and assumes each word appears independently of the others. In addition, computing the inverse document frequency (IDF) of a word requires a domain text set and cannot target a single text, and the quality and scale of that domain text set strongly affect word weight calculation and keyword extraction. TEXTRANK links candidate keyword nodes of a text by their co-occurrence relations to form a directed graph, and node weights are iteratively propagated among the nodes with the PAGERANK algorithm until convergence. Although TEXTRANK needs no training set and computes weights for a single text, its weight iteration is based only on word frequency and word co-occurrence frequency and does not consider the semantic information of the text. Topic models regard a text as composed of multiple topics, which in turn are probability distributions over words; the word-topic and topic-document matrices are inferred from a large-scale training set, i.e., the word-document matrix, to obtain topic keywords of the text. word2vec converts words into multidimensional dense word vectors that a machine can process, and is widely applied in machine translation, online automatic question answering, and other fields with high accuracy; however, constructing the word vectors requires a large-scale training set, and the parameter scale of the training process is huge.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a text keyword weight calculation method based on word frequency power law distribution characteristics. The method exploits two facts: the word frequencies of a text follow a power law distribution, and a high-frequency word is more likely than a low-frequency word to become a keyword. Using a probability formula, it converts word weights initialized with word frequency into word weights expressing word importance. It operates on a single text only, without assistance from domain knowledge or a domain text set, and without an iterative convergence process.
In order to achieve the above object, the present invention is conceived as follows:
After text preprocessing, an undirected keyword network is constructed with the candidate keywords as nodes, word frequencies as initial node weights, word co-occurrences as edges, and word co-occurrence frequencies as initial edge weights. Because the word frequencies of a text follow a power law distribution, the text's word weights (the index of word importance) should also follow a power law distribution, and a high-frequency word is more likely than a low-frequency word to become a high-weight word (keyword). Using a probability formula, the word weights initialized with word frequency are converted into weights expressing word importance, yielding the weight of every word.
According to the inventive idea, the invention adopts the following technical scheme:
a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:
S1: preprocess the text, including word segmentation and stop-word removal, and take the remaining words as candidate keywords;
S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights;
S3: extract a core network from the keyword network;
S4: update the weight of each node in the core network;
S5: add an edge to the core network and update the node weights;
S6: judge whether the node weight distribution of the core network after the edge addition conforms to the power law distribution; if not, go to step S5; if so, go to step S7;
S7: output the weight corresponding to each candidate keyword.
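For concreteness, steps S1 and S2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the whitespace tokenizer, the toy stop-word list, and the two-word co-occurrence window are all assumptions that the method itself leaves open.

    # Minimal sketch of steps S1-S2: preprocessing and keyword-network construction.
    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "of", "on", "and", "is", "in", "based"}  # toy list

    def build_keyword_network(text, window=2):
        tokens = [w.strip(".,:;\"'").lower() for w in text.split()]
        candidates = [w for w in tokens if w and w not in STOP_WORDS]  # S1
        node_weight = Counter(candidates)   # S2: word frequency as node weight
        edge_weight = Counter()             # S2: co-occurrence frequency as edge weight
        for i, w in enumerate(candidates):
            for u in candidates[i + 1:i + window]:
                if u != w:
                    edge_weight[frozenset((w, u))] += 1  # undirected edge
        return dict(node_weight), dict(edge_weight)

    nodes, edges = build_keyword_network(
        "HRing: A Structured P2P Overlay Based on Harmonic Series")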
In step S3, a core network is extracted from the keyword network as follows: design two node sets A and B and one edge set L, where A is the core-network node set, B is the keyword-network node set, and L is the core-network edge set. In the initial state, all nodes belong to B. Select the node with the largest weight in B as the current node and add it to A; select the edge with the largest weight among the current node's edges as the current edge and add it to L; then update the current node to the node at the other end of the current edge. If the updated current node already belongs to A, select the node with the largest weight in B as the current node. Repeat these steps until set B is empty.
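A sketch of this extraction in Python, under the assumption (implied but not stated above) that a node leaves B as soon as it is selected, and that a current node with no incident edges simply triggers a fresh selection from B:

    # Greedy, Prim-style walk that extracts the core network of step S3.
    def extract_core_network(node_weight, edge_weight):
        B = set(node_weight)            # keyword-network nodes not yet in the core
        A, L = set(), set()             # core-network nodes and edges
        current = None
        while B:
            if current is None or current in A:
                current = max(B, key=node_weight.get)   # heaviest node left in B
            A.add(current)
            B.discard(current)
            incident = [e for e in edge_weight if current in e]
            if not incident:            # isolated node: restart from B
                current = None
                continue
            heaviest = max(incident, key=edge_weight.get)
            L.add(heaviest)
            (current,) = set(heaviest) - {current}      # walk to the other endpoint
        return A, L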
In step S4, the weight of each node in the core network is updated to the sum of its original weight and the weights of all its current edges.
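With edges represented as two-element frozensets, as in the sketches above, this update is a few lines:

    # Step S4: each node's weight becomes its original weight plus the
    # weights of all core-network edges incident to it.
    def update_node_weights(node_weight, core_edges, edge_weight):
        updated = dict(node_weight)
        for edge in core_edges:
            for node in edge:
                updated[node] = updated.get(node, 0) + edge_weight[edge]
        return updated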
In step S5, an edge is added to the core network and the node weights are updated as follows: select two nodes and add an edge between them, where the probability of a node being selected is proportional to its weight (the larger the weight, the larger the selection probability), and add 1 to the weight of each selected node. If the two nodes are already connected by an edge, add 1 to that edge's weight; otherwise, add an edge between the two nodes with weight 1.
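One iteration of the edge-adding operation might look as follows. Drawing the two nodes independently and redrawing on a collision is an assumption the description leaves open, as is exposing the smoothing factor θ of step S6 so that zero-weight nodes remain selectable.

    # One step-S5 iteration: preferential selection of two distinct nodes,
    # then node-weight and edge-weight updates.
    import random

    def add_edge_step(node_weight, edge_weight, theta=1.0):
        nodes = list(node_weight)
        if len(nodes) < 2:
            return
        probs = [node_weight[n] + theta for n in nodes]  # theta keeps k = 0 selectable
        u = random.choices(nodes, weights=probs)[0]
        v = random.choices(nodes, weights=probs)[0]
        while v == u:                                    # redraw on collision
            v = random.choices(nodes, weights=probs)[0]
        node_weight[u] += 1                              # each selected node gains 1
        node_weight[v] += 1
        edge = frozenset((u, v))
        edge_weight[edge] = edge_weight.get(edge, 0) + 1  # new edges start at weight 1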
The node weight distribution in step S6 is defined as the number N_k of nodes with weight k, where k = 0, 1, 2, ….
The power law distribution function in step S6 is f(n) = c × n^(−α), where c and α are both constants.
The node weight distribution after each edge addition in step S6 evolves according to the rate equation

dN_k/dt = [(k − 1 + θ)·N_{k−1} − (k + θ)·N_k] / Σ_j (j + θ)·N_j

where N_k and N_{k−1} are the numbers of nodes with weights k and k − 1 respectively, and θ is a smoothing factor that keeps nodes with weight k = 0 eligible for the edge-adding operation. Solving this differential equation yields N_k/N_{k−1} = (k − 1 + θ)/(k + 2 + θ), and further N_k ∝ Γ(k + θ)/Γ(k + 3 + θ). From this it follows that N_k ~ k^(−3): when the number of added edges is sufficiently large, the weight distribution gradually follows the power law distribution.
Compared with the prior art, the text keyword weight calculation method has the following outstanding advantages:
the method has no field text set and training set, can utilize word frequency and weight of the text to present power law distribution by scanning single text, and the word with larger word frequency has larger weight, and the word frequency is converted into word weight by utilizing a probability formula. The specific method is that edges representing word frequency are deleted by probability, then nodes are selected to be added with the edges according to the probability that the current weight is in direct proportion, and meanwhile node weights are added until the weight distribution presents power law distribution. Since the power law distribution is of the order of magnitude of scalability, it can be used as a function of TFIDF and is simpler than TFIDF.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
This embodiment takes as its example text the article "HRing: A Structured P2P Overlay Based on Harmonic Series" from the journal IEEE Transactions on Parallel and Distributed Systems.
As shown in fig. 1, a text keyword weight calculation method based on word frequency power law distribution characteristics comprises the following specific steps:
s1: and opening the text for preprocessing, wherein stop words and segmentation words are removed, and the rest words are used as candidate keywords.
S2: and using the candidate keywords as nodes, using word frequency as node weight, using word co-occurrence frequency as edge weight, and constructing an undirected keyword network. The keyword network is represented in a table form, see table 1, where 1 (a) represents the weight of the node, and 1 (b) represents the edge weight.
Table 1(a). Node weights of the keyword network
Table 1(b). Edge weights of the keyword network
S3: extract a core network from the keyword network, as follows: design two node sets A and B and one edge set L, where A is the core-network node set, B is the keyword-network node set, and L is the core-network edge set. In the initial state, all nodes belong to B. Select the node with the largest weight in B as the current node and add it to A; select the edge with the largest weight among the current node's edges as the current edge and add it to L; then update the current node to the node at the other end of the current edge. If the updated current node already belongs to A, select the node with the largest weight in B as the current node. Repeat these steps until set B is empty. The core network is represented in table form: see Table 2, where Table 2(a) lists the node weights and Table 2(b) lists the edge weights.
Table 2(a). Node weights of the core network
Table 2(b). Edge weights of the core network
S4: update the weight of each node in the core network, i.e., update each node's weight to the sum of its original weight and the weights of all its current edges. The updated core network is represented in table form: Table 3 lists the updated node weights; the edge weights at this point are still those of Table 2(b).
Table 3. Node weights of the core network after the update
S5: add an edge to the core network and update the node weights: select two nodes and add an edge between them, where the probability of a node being selected is proportional to its weight (the larger the weight, the larger the selection probability), and add 1 to the weight of each selected node; if the two nodes are already connected by an edge, add 1 to that edge's weight; otherwise add an edge between them with weight 1. The weight of each node and its selection probability are shown in Table 4(a). Assuming the selected nodes are "P2P" and "overlay", Tables 4(b) and 4(c) show the node weights and edge weights after the corresponding update operations.
Table 4(a). Node weights and selection probabilities
Table 4(b). Node weight update
Table 4(c). Edge weight update
S6: judge whether the node weight distribution of the core network after the edge addition conforms to the power law distribution; if not, go to step S5; if so, go to step S7. The node weight distribution is defined as the number N_k of nodes with weight k, where k = 0, 1, 2, …. The power law distribution function is f(n) = c × n^(−α), where c and α are both constants. The node weight distribution after each edge addition evolves according to the rate equation

dN_k/dt = [(k − 1 + θ)·N_{k−1} − (k + θ)·N_k] / Σ_j (j + θ)·N_j

where N_k and N_{k−1} are the numbers of nodes with weights k and k − 1 respectively, and θ is a smoothing factor that keeps nodes with weight k = 0 eligible for the edge-adding operation. Solving this differential equation yields N_k/N_{k−1} = (k − 1 + θ)/(k + 2 + θ), and further N_k ∝ Γ(k + θ)/Γ(k + 3 + θ). From this it follows that N_k ~ k^(−3): when the number of added edges is sufficiently large, the weight distribution gradually follows the power law distribution.
S7: output the weight corresponding to each candidate keyword.

Claims (7)

1. A text keyword weight calculation method based on word frequency power law distribution characteristics, characterized in that the method exploits the facts that the word frequencies of a text follow a power law distribution and that a high-frequency word is more likely than a low-frequency word to become a keyword, and uses a probability formula to convert word weights initially represented by word frequency into word weights representing word importance; the method comprises the following specific steps:
S1: preprocess the text, including word segmentation and stop-word removal, and take the remaining words as candidate keywords;
S2: construct an undirected keyword network with the candidate keywords as nodes, word frequencies as node weights, and word co-occurrence frequencies as edge weights;
S3: extract a core network from the keyword network;
S4: update the weight of each node in the core network;
S5: add an edge to the core network and update the node weights;
S6: judge whether the node weight distribution of the core network after the edge addition conforms to the power law distribution; if not, go to step S5; if so, go to step S7;
S7: output the weight corresponding to each candidate keyword.
2. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the core network is extracted from the keyword network in step S3 as follows: design two node sets A and B and one edge set L, where A is the core-network node set, B is the keyword-network node set, and L is the core-network edge set; in the initial state all nodes belong to B; select the node with the largest weight in B as the current node and add it to A, select the edge with the largest weight among the current node's edges as the current edge and add it to L, and update the current node to the node at the other end of the current edge; if the updated current node already belongs to A, select the node with the largest weight in B as the current node; repeat these steps until set B is empty.
3. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein step S4 updates the weight of each node in the core network to the sum of the node's original weight and the weights of all its current edges.
4. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein in step S5 an edge is added to the core network and the node weights are updated as follows: select two nodes and add an edge between them, where the probability of a node being selected is proportional to its weight (the larger the weight, the larger the selection probability), and add 1 to the weight of each selected node; if the two nodes are already connected by an edge, add 1 to that edge's weight; otherwise add an edge between the two nodes with weight 1.
5. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the node weight distribution in step S6 is defined as the number N_k of nodes with weight k, where k = 0, 1, 2, ….
6. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the power law distribution function in step S6 is f(n) = c × n^(−α), where c and α are both constants.
7. The text keyword weight calculation method based on word frequency power law distribution characteristics according to claim 1, wherein the node weight distribution after each edge addition in step S6 evolves according to the rate equation dN_k/dt = [(k − 1 + θ)·N_{k−1} − (k + θ)·N_k] / Σ_j (j + θ)·N_j, where N_k and N_{k−1} are the numbers of nodes with weights k and k − 1 respectively, and θ is a smoothing factor that keeps nodes with weight k = 0 eligible for the edge-adding operation; solving this differential equation yields N_k/N_{k−1} = (k − 1 + θ)/(k + 2 + θ), and further N_k ∝ Γ(k + θ)/Γ(k + 3 + θ); from this it follows that N_k ~ k^(−3): when the number of added edges is sufficiently large, the weight distribution gradually follows the power law distribution.
CN201811403149.9A 2018-11-23 2018-11-23 Text keyword weight calculation method based on word frequency power law distribution characteristics Active CN109635081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811403149.9A CN109635081B (en) 2018-11-23 2018-11-23 Text keyword weight calculation method based on word frequency power law distribution characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811403149.9A CN109635081B (en) 2018-11-23 2018-11-23 Text keyword weight calculation method based on word frequency power law distribution characteristics

Publications (2)

Publication Number Publication Date
CN109635081A CN109635081A (en) 2019-04-16
CN109635081B 2023-06-13

Family

ID=66068954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811403149.9A Active CN109635081B (en) 2018-11-23 2018-11-23 Text keyword weight calculation method based on word frequency power law distribution characteristics

Country Status (1)

Country Link
CN (1) CN109635081B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104793A (en) * 2019-12-19 2020-05-05 浙江工商大学 Short text theme determination method
CN112489646B (en) * 2020-11-18 2024-04-02 北京华宇信息技术有限公司 Speech recognition method and device thereof
CN113010740B (en) * 2021-03-09 2023-05-30 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN113255883B (en) * 2021-05-07 2023-07-25 青岛大学 Weight initialization method based on power law distribution

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544242A (en) * 2013-09-29 2014-01-29 广东工业大学 Microblog-oriented emotion entity searching system
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN105302881A (en) * 2015-10-14 2016-02-03 上海大学 Literature search system-oriented search prompt word generation method
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106776881A (en) * 2016-11-28 2017-05-31 中国科学院软件研究所 A kind of realm information commending system and method based on microblog


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Optimization of Information Flow Transmission Based on Complex Networks; 陈留情; China Masters' Theses Full-text Database (Information Science and Technology); 2018-04-15; full text *
Knowledge Discovery and Analysis Methods for User Innovation Communities Based on Multiple Knowledge Network Modeling; 廖晓; China Doctoral Dissertations Full-text Database (Economics and Management Science); 2016-05-15; full text *

Also Published As

Publication number Publication date
CN109635081A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109635081B (en) Text keyword weight calculation method based on word frequency power law distribution characteristics
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN110705260B (en) Text vector generation method based on unsupervised graph neural network structure
CN106599029A (en) Chinese short text clustering method
CN108959461B (en) Entity linking method based on graph model
Wen et al. Research on keyword extraction based on word2vec weighted textrank
CN104615687B (en) A kind of entity fine grit classification method and system towards knowledge base update
CN105512245A (en) Enterprise figure building method based on regression model
CN104778204B (en) More document subject matters based on two layers of cluster find method
CN103092828A (en) Text similarity measuring method based on semantic analysis and semantic relation network
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN109885693B (en) Method and system for rapid knowledge comparison based on knowledge graph
CN103646029A (en) Similarity calculation method for blog articles
CN105243083B (en) Document subject matter method for digging and device
CN113051397A (en) Academic paper homonymy disambiguation method based on heterogeneous information network representation learning and word vector representation
CN107622047B (en) Design decision knowledge extraction and expression method
CN107402919B (en) Machine translation data selection method and machine translation data selection system based on graph
CN111831910A (en) Citation recommendation algorithm based on heterogeneous network
CN104166712A (en) Method and system for scientific and technical literature retrieval
CN115409124B (en) Small sample sensitive information identification method based on fine tuning prototype network
Yang et al. A topic-specific web crawler with web page hierarchy based on HTML Dom-Tree
CN111046191B (en) Semantic enhancement method and device in power field
Tsuboi et al. An algorithm for extracting shape expression schemas from graphs
CN111046181B (en) Actor-critic method for automatic classification induction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant