CN114912446A - Keyword extraction method and device and storage medium - Google Patents
Keyword extraction method and device and storage medium
- Publication number
- CN114912446A (application CN202210473957.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- participle
- participles
- semantic
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities; G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/205—Parsing; G06F40/216—Parsing using statistical methods
- G06F40/237—Lexical tools; G06F40/242—Dictionaries
- G06F40/30—Semantic analysis
Abstract
The invention discloses a keyword extraction method, a keyword extraction device, and a storage medium. The method comprises the following steps: performing word segmentation on the text to be extracted; constructing a participle word graph; generating a word vector for each participle according to its sememes; calculating the word sense similarity between adjacent participles in the word graph from these word vectors, and calculating an initial score for each participle from the similarities, from which candidate keywords are screened; and processing the initial score according to the word frequency-inverse document frequency value of each candidate keyword to obtain a final score, from which the keywords are screened. On the basis of a word graph model, sememe information is fused into the word senses of the participles, so that the word vectors of polysemous participles are distinguished across different contexts; the score of each participle is then calculated by combining the co-occurrence relations among participles with their word sense information, and corrected according to word frequency and inverse document frequency, thereby improving the keyword extraction effect.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a keyword extraction method, device and storage medium.
Background
In recent years, text keyword extraction methods have mainly been divided into two types, unsupervised and supervised, according to how the model is trained. Supervised methods convert keyword extraction into a binary classification problem or a sequence labeling problem that judges whether each word in the text is a keyword. With the rapid development of deep learning technology, supervised methods that extract keywords with deep learning models have emerged in large numbers and achieve high accuracy and recall. However, training such models relies on large-scale corpora and high-quality manual labeling, and consumes a large amount of resources. In contrast, unsupervised methods do not depend on large-scale corpora or manual labeling, and are convenient and fast. Existing unsupervised keyword extraction methods mainly fall into four categories: statistics-based, topic-based, clustering-based, and graph-model-based. Compared with the other categories, keyword extraction based on a graph model fully considers the structural characteristics of a text and the association characteristics among words, achieves a good extraction effect, and is therefore widely applied.
Disclosure of Invention
The inventor finds that existing unsupervised methods for extracting text keywords achieve limited accuracy and recall, leaving considerable room for improvement. To at least partially solve these technical problems in the prior art, the inventor made the present invention and provides the following technical solutions through specific embodiments:
in a first aspect, an embodiment of the present invention provides a keyword extraction method, including the following steps:
performing word segmentation on a text to be extracted to obtain a word segmentation set;
constructing a participle word graph corresponding to the participle set according to a preset word graph model;
respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;
calculating word meaning similarity between adjacent participles in the participle word graph according to word vectors of the participles, and calculating initial scores of the participles in the participle word graph according to the word meaning similarity;
screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;
determining a word frequency-inverse document frequency value of each candidate keyword, and processing the word frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword;
and screening the at least one candidate keyword according to the final score to obtain at least one keyword.
Further, the generating word vectors corresponding to the participles according to the sememes of the participles in the participle set includes:
determining a meaning item corresponding to each participle in the participle set and a sememe corresponding to the meaning item;
generating a meaning item vector of each meaning item according to a meaning source vector of a meaning source corresponding to the meaning item;
and according to the attention mechanism, respectively carrying out weighted summation on the semantic item vectors of the semantic items corresponding to the participles to obtain the word vectors corresponding to the participles.
Further, the generating a meaning item vector of each meaning item according to the sememe vector of the sememe corresponding to the meaning item specifically includes:
calculating the average value of the sememe vectors of all the sememes corresponding to the meaning item, to obtain the meaning item vector corresponding to the meaning item.
Further, according to the attention mechanism, the weighted summation of the meaning item vectors of the meaning items corresponding to each participle adopts the following calculation formula:

$$e_w = \sum_j att(s_j^{(w)}) \cdot s_j^{(w)}$$

wherein $e_w$ represents the word vector of the participle $w$, $s_j^{(w)}$ represents the meaning item vector of the j-th meaning item of the participle $w$, and $att(s_j^{(w)})$ represents the weight of the j-th meaning item of the participle $w$;

the weight of the j-th meaning item of the participle $w$ is calculated by the following formula:

$$att(s_j^{(w)}) = \frac{\exp(s_j^{(w)} \cdot w_c')}{\sum_k \exp(s_k^{(w)} \cdot w_c')}$$

wherein $s_j^{(w)}$ and $s_k^{(w)}$ represent the meaning item vectors of the j-th and k-th meaning items of the participle $w$, and $w_c'$ denotes the average value of the word vectors of a predetermined number of participles before and after the participle $w$.
Further, the initial score of each participle in the participle word graph calculated according to the word sense similarity adopts the following calculation formula:

$$S(w_i) = (1 - d) + d \sum_{w_j \in In(w_i)} \frac{Sim(w_i, w_j)}{\sum_{w_k \in Out(w_j)} Sim(w_k, w_j)} S(w_j)$$

wherein $w_i$, $w_j$, $w_k$ respectively represent the i-th, j-th, and k-th participles in the participle word graph; $S(w_i)$ and $S(w_j)$ respectively represent the initial scores of the participles $w_i$ and $w_j$; $In(w_i)$ denotes the set of participles pointing to the participle $w_i$ in the participle word graph; $Out(w_j)$ denotes the set of participles pointed to by the participle $w_j$ in the participle word graph; $d$ is a smoothing factor; $Sim(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$, and $Sim(w_k, w_j)$ represents the word sense similarity between participles $w_k$ and $w_j$.
Further, the word sense similarity between adjacent participles in the participle word graph calculated according to the word vector of each participle adopts the following calculation formula:

$$Sim(w_i, w_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}$$

wherein $Sim(w_i, w_j)$ represents the word sense similarity between the participles $w_i$ and $w_j$, and $e_i$, $e_j$ respectively represent the word vectors of the participles $w_i$ and $w_j$.
Further, the determining a word frequency-inverse document frequency value of each candidate keyword, and processing the word frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword includes:
respectively calculating the word frequency-inverse document frequency value of each candidate keyword according to the word frequency of the candidate keyword in the text to be extracted and its inverse document frequency in a preset corpus;
and for each candidate keyword, normalizing the word frequency-inverse document frequency value and the initial score, and performing weighted summation according to a preset weighting coefficient to obtain the final score of the candidate keyword.
Further, the word segmentation is performed on the text to be extracted to obtain a word segmentation set, which includes:
and according to the knowledge field to which the text to be extracted belongs, segmenting the text to be extracted by using the dictionary of the corresponding field to obtain the word segmentation set.
In a second aspect, an embodiment of the present invention provides a keyword extraction apparatus, including:
the text preprocessing module is used for segmenting words of a text to be extracted to obtain a word segmentation set;
the word graph construction module is used for constructing a word graph corresponding to the word set according to a preset word graph model;
the word vector generation module is used for respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;
the score calculation module is used for calculating word meaning similarity between adjacent participles in the participle word graph according to word vectors of the participles and calculating initial scores of the participles in the participle word graph according to the word meaning similarity;
the candidate keyword screening module is used for screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;
the score correction module is used for determining a word frequency-inverse document frequency value of each candidate keyword, and processing the word frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword;
and the keyword screening module is used for screening the at least one candidate keyword according to the final score to obtain at least one keyword.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the keyword extraction method according to any one of the above schemes.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the method comprises the steps of performing word segmentation on a text to be extracted to obtain a word segmentation set, constructing a word segmentation word graph according to the word segmentation set, and generating word vectors of the word segmentation according to the sememes corresponding to the word segmentation; then, calculating word meaning similarity between adjacent participles in the participle word graph according to the word vector of each participle, and further calculating to obtain an initial score of each participle; and processing the initial score according to the word frequency-reverse file frequency value of the word segmentation to obtain a final score of the word segmentation, and determining the text keywords according to the final score. On the basis of a word graph model, word vectors containing more semantic information are obtained for the word senses of the participles by fusing the semantic information with the word senses of the participles, so that the word vectors of the participles with multiple senses are distinguished in different contexts, and the scores of the participles are calculated by combining the co-occurrence relation among the participles and the word sense information of the participles, so that the score calculation result of the participles is more accurate, and the keyword extraction effect is improved; on the basis, the scores of the participles are corrected according to the word frequency and the reverse file frequency, so that the final scores of the low-frequency keywords are improved, the influence of high-frequency irrelevant words on the keyword extraction result is reduced, and the keyword extraction effect is further improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic flowchart of a keyword extraction method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating another keyword extraction method according to an embodiment of the present invention;
FIG. 3 is a topological structure diagram of a word segmentation graph according to a first embodiment of the present invention;
FIG. 4 is a diagram illustrating semantic items and semantic information of a word segmentation according to a first embodiment of the present invention;
fig. 5 is a block diagram illustrating a structure of a keyword extraction apparatus according to a second embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
Existing keyword extraction methods based on graph models are mainly represented by the TextRank algorithm. Following the basic idea of the word graph model, TextRank constructs a word graph from the constituent words of a text according to the contextual co-occurrence relations among the words, calculates the importance of each word in the word graph through a random walk algorithm, and determines the keywords by ranking the words by importance.
In calculating word importance, the TextRank algorithm considers only the structural information of the text: it treats every adjacent word in the word graph as influencing the central word to the same degree, and ignores the word sense information among words. Keywords are a group of words that express the subject matter of a text, and words associated with a keyword tend to appear in contexts close to it. Therefore, if the co-occurrence relations between words in the text can be combined with word sense information, the importance of the words in the word graph can be calculated more accurately, and the keywords of the text can be extracted more effectively.
Based on this, as shown in fig. 1, an embodiment of the present invention provides a keyword extraction method, including the following steps:
and S1, performing word segmentation on the text to be extracted to obtain a word segmentation set.
Specifically, a word segmentation tool is used to segment the text to be extracted; the tool may be a preset word segmentation model, such as the open-source jieba segmenter. A word segmentation set is obtained after segmentation, in which the participles are arranged according to their contextual order in the text to be extracted, and which can be expressed as:
W = [w_1, w_2, w_3, …, w_n]
wherein W represents the participle set and w_n represents the n-th participle in the set.
Preferably, a dictionary of the corresponding field is used for word segmentation according to the knowledge field of the text to be extracted, which improves the accuracy of the segmentation result. For example, when segmenting a financial text, a preset financial domain dictionary is loaded into the custom dictionary of the jieba segmenter, so that financial terms in the text are not split by mistake. The preset financial domain dictionary may be formed by combining an English-Chinese dictionary of financial vocabulary with popular finance-domain terms collected from the Internet.
Preferably, stop word removal is also performed on the participle set obtained by segmentation, so as to obtain a participle set with stop words removed. Stop words generally include punctuation marks and irrelevant words such as common function words and modal particles; specifically, an open-source Chinese stop word list from the Internet may be used for the stop word processing.
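Step S1 above can be sketched as follows. This is a minimal pure-Python stand-in using forward maximum matching over a toy domain dictionary, followed by stop word removal; a real implementation would use a segmenter such as jieba with a loaded user dictionary, and the dictionary, text, and stop word list here are hypothetical.

```python
# Hypothetical stand-in for step S1: forward-maximum-matching segmentation
# with a domain dictionary, then stop-word filtering. Illustrates why a
# domain dictionary keeps multi-character terms intact.

def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching; unmatched single chars pass through."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in dictionary:   # single char is the fallback
                tokens.append(cand)
                i += l
                break
    return tokens

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

# Toy "financial domain" dictionary and stop-word list (assumptions).
domain_dict = {"苹果公司", "发布", "手机", "新"}
stopwords = {"的"}

tokens = segment("苹果公司发布的新手机", domain_dict)
print(tokens)                               # ['苹果公司', '发布', '的', '新', '手机']
print(remove_stopwords(tokens, stopwords))  # ['苹果公司', '发布', '新', '手机']
```

Because "苹果公司" is in the domain dictionary, it is kept as one term instead of being split into single characters.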
And S2, constructing a participle word graph corresponding to the participle set according to a preset word graph model.
It should be noted that the word graph model used in this embodiment may be a general word graph model in the field of keyword extraction; reference may be made to word graph models in the prior art. Specifically, according to the co-occurrence relations of the participles in the participle set within a co-occurrence window of preset length, a participle word graph G = (V, E) is constructed under the preset word graph model, with the participles as nodes and the co-occurrence relations as edges, wherein V is the node set and E is the edge set. In the keyword extraction task, each node v represents a participle w. When the participle w_i and the participle w_j appear in the same co-occurrence window, two directed edges are added to the participle word graph, namely v_i → v_j and v_j → v_i. The topology of the participle word graph can refer to FIG. 3.
In this embodiment, the length of the co-occurrence window refers to the size of the word-taking window, that is, the range of context words taken with the central word as the reference point; for example, when the length of the co-occurrence window is A, the number of context words obtained is 2A, where A is a positive integer. It should be noted that the length of the co-occurrence window affects the extraction effect of the final keywords, and the specific length may be determined from experimental results; for example, in one embodiment, the length of the co-occurrence window is set to 3.
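Step S2 can be sketched as follows: a minimal adjacency-set construction of the directed participle word graph from a token list and a co-occurrence window, assuming the window length counts words on each side of the central word as described above.

```python
# Sketch of step S2: build a directed word graph from co-occurrence windows.
# Edges are stored as node -> set of successor nodes; both directions are
# added, as in the description (v_i -> v_j and v_j -> v_i).
from collections import defaultdict

def build_word_graph(tokens, window=3):
    edges = defaultdict(set)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != w:
                edges[w].add(tokens[j])    # w -> neighbour
                edges[tokens[j]].add(w)    # neighbour -> w
    return edges

g = build_word_graph(["a", "b", "c", "d", "e"], window=1)
print(sorted(g["c"]))   # ['b', 'd']
```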
And S3, respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set.
Specifically, each participle in the participle set and its sememe information are input into a preset word vector training model for training. The model utilizes the co-occurrence relations among context participles and, by maximizing the conditional probability of a central word generating its surrounding words, produces word vectors that represent different word senses according to the different sememes of the participle. In addition, the model adopts an attention mechanism to give each sense of a participle a different weight, so that the participle obtains different word vector representations according to its word sense in different contexts. For example, 'apple' has two senses, a brand and a fruit; in the context of 'I own an iPhone', the weight of the brand sense is higher. Regarding the hyper-parameters of the preset word vector training model, the length of the word vector training window may be set to 3, and the dimension of the word vectors may be set to 200.
In an embodiment, as shown in fig. 2, the step S3 specifically includes:
s31, determining a meaning item (Sense) corresponding to each Word (Word) in the Word segmentation set and a Sememe (Sememe) corresponding to the meaning item.
Specifically, each meaning item corresponding to a participle and the sememes corresponding to each meaning item are determined according to the meaning item and sememe information in a preset knowledge base. The preset knowledge base specifies the meaning items and sememes corresponding to each word, such as the HowNet knowledge base; taking 'apple' as an example, the meaning items and sememe information corresponding to 'apple' in the HowNet knowledge base are shown in FIG. 4.
S32, generating the meaning item vector of each meaning item according to the meaning original vector of the meaning item corresponding to the meaning item.
Specifically, the participles and their meaning item and sememe information are input into the preset word vector training model for training; a sememe vector is generated for each sememe, and a meaning item vector corresponding to each meaning item is then generated from the sememe vectors. When a meaning item corresponds to multiple sememes, in one embodiment the meaning item vector is obtained by calculating the average value of the sememe vectors of the sememes corresponding to the meaning item. In another embodiment, the meaning item vector may also be obtained by a weighted summation of the sememe vectors of the sememes corresponding to the meaning item.
And S33, respectively carrying out weighted summation on the semantic item vectors of the semantic items corresponding to the participles according to the attention mechanism to obtain the word vectors corresponding to the participles.
Specifically, the weighted summation of the meaning item vectors of the meaning items corresponding to a participle may adopt the following calculation formula:

$$e_w = \sum_j att(s_j^{(w)}) \cdot s_j^{(w)}$$

wherein $e_w$ represents the word vector of the participle $w$, $s_j^{(w)}$ represents the meaning item vector of the j-th meaning item of the participle $w$, and $att(s_j^{(w)})$ represents the weight of the j-th meaning item of the participle $w$;

the weight of the j-th meaning item of the participle $w$ is calculated by the following formula:

$$att(s_j^{(w)}) = \frac{\exp(s_j^{(w)} \cdot w_c')}{\sum_k \exp(s_k^{(w)} \cdot w_c')}$$

wherein $s_j^{(w)}$ and $s_k^{(w)}$ respectively represent the meaning item vectors of the j-th and k-th meaning items of the participle $w$, and $w_c'$ represents the average value of the word vectors of a preset number of participles before and after the participle $w$. The value of the preset number is related to the training window length of the word vectors, and an optimal value is obtained from experimental results; for example, in one embodiment the preset number is set to 2.
Here, a meaning item vector may be the average value of the sememe vectors of the sememes corresponding to the meaning item, the sememe vectors being obtained by training the preset word vector training model.
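The sense-vector averaging and attention-weighted word vector described above can be sketched as follows, assuming pre-trained sememe vectors are available; the toy two-dimensional vectors below stand in for trained embeddings and are purely illustrative.

```python
# Sketch of step S3 under stated assumptions: a sense (meaning item) vector
# is the average of its sememe vectors, and the word vector is an
# attention-weighted sum of sense vectors, with weights given by a softmax
# over the dot product of each sense vector and the averaged context
# vector w_c'. The numbers are toy values, not trained embeddings.
import math

def sense_vector(sememe_vecs):
    """Average the sememe vectors belonging to one sense."""
    n = len(sememe_vecs)
    return [sum(v[k] for v in sememe_vecs) / n for k in range(len(sememe_vecs[0]))]

def word_vector(sense_vecs, context_vec):
    """Attention-weighted sum of sense vectors (softmax over s_j . w_c')."""
    scores = [sum(a * b for a, b in zip(s, context_vec)) for s in sense_vecs]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    att = [x / total for x in exps]
    dim = len(sense_vecs[0])
    vec = [sum(att[j] * sense_vecs[j][k] for j in range(len(sense_vecs)))
           for k in range(dim)]
    return vec, att

# Toy example: a word with two senses (e.g. brand vs. fruit).
brand = sense_vector([[1.0, 0.0], [1.0, 0.2]])
fruit = sense_vector([[0.0, 1.0]])
context = [1.0, 0.0]                  # context vector close to the brand sense
vec, att = word_vector([brand, fruit], context)
print(att[0] > att[1])                # True: the brand sense dominates here
```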
S4, according to the word vector of each participle, calculating to obtain the word meaning similarity between adjacent participles in the participle word graph, and according to the word meaning similarity, calculating to obtain the initial score of each participle in the participle word graph.
It is easy to see that, using the basic TextRank algorithm, the calculation formula of each node score in the word graph is as follows:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

wherein $S(V_i)$ and $S(V_j)$ respectively represent the scores of nodes $V_i$ and $V_j$; $In(V_i)$ is the set of nodes pointing to $V_i$; $Out(V_j)$ is the set of nodes pointed to by $V_j$, and $|Out(V_j)|$ is the number of nodes in that set; $d$ is a smoothing factor, typically set to 0.85.
When solving the scores of the participles in the participle word graph, this algorithm considers only the structural information of the text to be extracted: it treats every adjacent participle in the word graph as influencing the central participle to the same degree, and ignores the word sense relations among participles. On the basis of the TextRank algorithm, the embodiment of the present invention uses word sense similarity in place of uniform weights as the weights of the edges in the participle word graph, and calculates the initial score S of each participle with the following formula:

$$S(w_i) = (1 - d) + d \sum_{w_j \in In(w_i)} \frac{Sim(w_i, w_j)}{\sum_{w_k \in Out(w_j)} Sim(w_k, w_j)} S(w_j)$$

wherein $w_i$, $w_j$, $w_k$ respectively represent the i-th, j-th, and k-th participles in the participle word graph; $S(w_i)$ and $S(w_j)$ respectively represent the initial scores of the participles $w_i$ and $w_j$; $In(w_i)$ denotes the set of participles pointing to the participle $w_i$; $Out(w_j)$ denotes the set of participles pointed to by the participle $w_j$; $d$ is a smoothing factor, typically set to 0.85; $Sim(w_i, w_j)$ represents the word sense similarity between participles $w_i$ and $w_j$, and $Sim(w_k, w_j)$ represents the word sense similarity between participles $w_k$ and $w_j$.
In this embodiment, the word sense similarity is calculated from the participle word vectors trained by the preset word vector training model, and its value may be the cosine similarity between the word vectors, with the following formula:

$$Sim(w_i, w_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}$$

wherein $e_i$ and $e_j$ are respectively the word vectors of the participles $w_i$ and $w_j$, trained by the preset word vector training model.
By fusing sememes in the word vector training process, the embodiment of the invention generates word vectors containing richer word sense information; it then calculates the word sense similarity between participles from these word vectors and uses that similarity to measure the relation between adjacent participles and the central participle in the word graph. That is, the word sense similarity between participles serves as the weight of the edges in the participle word graph when calculating the score of each participle, which better reflects the importance of each participle in the semantics of the text.
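The score computation of step S4 can be sketched as follows: a minimal iterative implementation that uses cosine similarity between word vectors as edge weights in place of the uniform TextRank weight. The toy graph and vectors are illustrative assumptions, not trained data.

```python
# Sketch of step S4: TextRank-style iteration with word sense (cosine)
# similarity as edge weights. edges maps each node to the set of nodes it
# points to; since edges are added in both directions, w_j is a predecessor
# of w_i exactly when w_i is in edges[w_j].
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def weighted_textrank(edges, vectors, d=0.85, iters=50):
    nodes = list(edges)
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        new = {}
        for wi in nodes:
            s = 0.0
            for wj in nodes:
                if wi in edges[wj]:                       # edge w_j -> w_i
                    denom = sum(cosine(vectors[wk], vectors[wj])
                                for wk in edges[wj])      # sum over Out(w_j)
                    s += cosine(vectors[wi], vectors[wj]) / denom * score[wj]
            new[wi] = (1 - d) + d * s
        score = new
    return score

# Toy graph: "a" and "b" are near-synonyms, "c" is unrelated.
edges = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
scores = weighted_textrank(edges, vectors)
print(scores["a"] > scores["c"])   # True: similar words reinforce each other
```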
S5, screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword.
Specifically, the participles in the participle set are screened according to their initial scores and a preset condition. The preset condition can be set according to the application scenario; for example, it may be the top N participles ranked by initial score in descending order, so that N candidate keywords are obtained after screening. N is a positive integer whose value can be preset according to the requirements of the application scenario. In one embodiment, N may also be the number of all participles in the participle set, that is, all participles are taken as candidate keywords, so as to avoid missing keywords.
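The screening of step S5 reduces to a top-N selection by initial score, sketched here with hypothetical scores:

```python
# Sketch of step S5: keep the N highest-scoring participles as candidates.
def screen_candidates(scores, n):
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]

# Hypothetical initial scores.
print(screen_candidates({"bank": 1.4, "the": 0.3, "loan": 1.1}, 2))
# ['bank', 'loan']
```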
S6, determining a word frequency-inverse document frequency value (TF-IDF) of each candidate keyword, and processing the word frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword.
It is understood that the score calculation of step S4 tends to give higher initial scores to words that occur frequently in the text, even though some common high-frequency words are not keywords of the text, while keywords that occur rarely in the text may receive relatively low initial scores. To raise the scores of low-frequency keywords and lower the scores of irrelevant high-frequency words, this embodiment introduces the term frequency-inverse document frequency value to correct the initial score.
In this embodiment, the TF-IDF value of a candidate keyword equals its term frequency (TF) multiplied by its inverse document frequency (IDF). TF is the frequency with which the candidate keyword appears in the text to be extracted; IDF is obtained by dividing the total number of texts in the preset corpus by the number of texts containing the candidate keyword, and taking the logarithm of the quotient. The smaller the proportion of texts in the preset corpus that contain the candidate keyword, the larger its IDF and the stronger its ability to discriminate.
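A minimal sketch of the TF-IDF computation just described; the add-one in the IDF denominator is a common smoothing convention for words absent from the corpus, not something stated in the patent:

```python
import math

def tf_idf(word, doc_tokens, corpus_docs):
    # TF: relative frequency of the word in the text to be extracted.
    tf = doc_tokens.count(word) / len(doc_tokens)
    # IDF: log(total texts / texts containing the word); the +1 guards
    # against words that appear in no corpus text (a smoothing assumption).
    containing = sum(1 for doc in corpus_docs if word in doc)
    idf = math.log(len(corpus_docs) / (1 + containing))
    return tf * idf
```

A rare word in the corpus gets a positive IDF, while a word present in most corpus texts gets an IDF near (or below) zero, which is exactly the correction the embodiment wants.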
After the TF-IDF value of each candidate keyword is calculated, the TF-IDF value and the initial score S of each candidate keyword are normalized so that both lie in the interval [0, 1] and are therefore on the same order of magnitude. The two values are then weighted and summed according to a preset weighting coefficient α to obtain the final score of each candidate keyword. The calculation formula of the final score is as follows:
C(w_i) = α × S(w_i)′ + (1 − α) × TF-IDF(w_i)′
wherein C(w_i) represents the final score of the word w_i; α represents the weighting coefficient; S(w_i)′ and TF-IDF(w_i)′ represent the normalized initial score and the normalized TF-IDF value of w_i, respectively.
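The normalization and weighted sum can be sketched as follows. Min-max scaling is an assumption here: the patent only says the values are mapped into [0, 1] without naming the method.

```python
def final_scores(initial: dict, tfidf: dict, alpha: float = 0.32) -> dict:
    # Min-max scale each score set into [0, 1], then combine per word:
    # C(w) = alpha * S'(w) + (1 - alpha) * TF-IDF'(w).
    def minmax(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all values equal
        return {k: (v - lo) / span for k, v in d.items()}
    s, t = minmax(initial), minmax(tfidf)
    return {w: alpha * s[w] + (1.0 - alpha) * t[w] for w in initial}
```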
In the present embodiment, the preset weighting coefficient is determined experimentally; in one embodiment, the extraction results obtained with different preset weighting coefficients are shown in Table 1.
TABLE 1
As can be seen from Table 1, when the preset weighting coefficient α is 0.32, the Precision, Recall and F-Measure of keyword extraction are all highest, i.e., the extraction effect is best.
S7, screening the at least one candidate keyword according to the final score to obtain at least one keyword.
Specifically, the candidate keywords are screened according to their final scores and a preset condition. The preset condition can be set for the application scenario; for example, it may keep the M candidate keywords with the highest final scores, yielding M keywords after screening. M is a positive integer whose value can be preset according to the requirements of the application scenario. The at least one keyword thus screened out serves as the keywords of the text to be extracted.
In summary, the method first segments the text to be extracted to obtain a word segmentation set, constructs a word graph from that set, and generates a word vector for each word from its corresponding sememes; it then calculates the word sense similarity between adjacent words in the word graph from the word vectors and, from the similarity, the initial score of each word; finally, it processes the initial score with the TF-IDF value of each word to obtain a final score, and determines the keywords of the text from the final scores.
By fusing sememe information into the word vectors on top of the word graph model, word vectors carrying richer semantic information are obtained, so that the vectors of polysemous words are distinguished across different contexts. The scores of the words are computed by combining their co-occurrence relations with their word sense information, which makes the score calculation more accurate and improves the keyword extraction effect. On this basis, the scores are corrected with term frequency and inverse document frequency, which raises the final scores of low-frequency keywords, reduces the influence of irrelevant high-frequency words on the extraction result, and further improves the extraction effect.
Example two
Based on the inventive concept of the first embodiment, as shown in fig. 5, an embodiment of the present invention further provides a keyword extraction apparatus, including:
the text preprocessing module 100 is configured to perform word segmentation on a text to be extracted to obtain a word segmentation set;
the word graph building module 200 is configured to build a participle word graph corresponding to the word segmentation set according to a preset word graph model;
a word vector generating module 300, configured to generate word vectors corresponding to the participles according to the sememes of the participles in the participle set;
the score calculation module 400 is configured to calculate word sense similarity between adjacent segmented words in the segmented word graph according to word vectors of the segmented words, and calculate initial scores of the segmented words in the segmented word graph according to the word sense similarity;
a candidate keyword screening module 500, configured to screen the participles in the participle set according to the initial score to obtain at least one candidate keyword;
a score correction module 600, configured to determine a term frequency-inverse document frequency value of each candidate keyword, and process the term frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword;
and a keyword screening module 700, configured to screen the at least one candidate keyword according to the final score to obtain at least one keyword.
Because the keyword extraction apparatus solves its problem on a principle similar to that of the keyword extraction method of the first embodiment, its implementation can refer to the implementation of the method in the first embodiment; repeated details are omitted.
According to an embodiment of the present invention, there is also provided a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing any one of the keyword extraction methods of the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A keyword extraction method is characterized by comprising the following steps:
performing word segmentation on a text to be extracted to obtain a word segmentation set;
constructing a participle word graph corresponding to the participle set according to a preset word graph model;
respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;
calculating word meaning similarity between adjacent participles in the participle word graph according to word vectors of the participles, and calculating initial scores of the participles in the participle word graph according to the word meaning similarity;
screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;
determining a term frequency-inverse document frequency value of each candidate keyword, and processing the term frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword;
and screening the at least one candidate keyword according to the final score to obtain at least one keyword.
2. The method for extracting keywords according to claim 1, wherein the generating word vectors corresponding to the participles according to the sememes of the participles in the participle set comprises:
determining a meaning item corresponding to each participle in the participle set and a sememe corresponding to the meaning item;
generating a sense item vector of each sense item according to a sense element vector of a sense element corresponding to the sense item;
and according to the attention mechanism, respectively carrying out weighted summation on the semantic item vectors of the semantic items corresponding to the participles to obtain the word vectors corresponding to the participles.
3. The method for extracting keywords according to claim 2, wherein the generating a semantic item vector of each semantic item according to the sememe vectors of the sememes corresponding to the semantic item specifically comprises:
calculating the average of the sememe vectors of all the sememes corresponding to the semantic item to obtain the semantic item vector of the semantic item.
4. The method for extracting keywords according to claim 3, wherein the weighted summation of the semantic item vectors of the semantic items corresponding to the participles according to the attention mechanism adopts the following calculation formula:
e = Σ_j att(s_j^w) × s_j^w
wherein e represents the word vector of the participle w, s_j^w represents the semantic item vector of the jth semantic item of the participle w, and att(s_j^w) represents the weight of the jth semantic item of the participle w;
the weight of the jth semantic item of the participle w is calculated by the following formula:
att(s_j^w) = exp(c · s_j^w) / Σ_k exp(c · s_k^w)
wherein c represents a context vector of the participle w.
5. The method for extracting keywords according to claim 1, wherein the initial score of each participle in the participle word graph is calculated from the word sense similarity by the following formula:
S(w_i) = (1 − d) + d × Σ_{w_j ∈ In(w_i)} [ Sim(w_i, w_j) / Σ_{w_k ∈ Out(w_j)} Sim(w_k, w_j) ] × S(w_j)
wherein w_i, w_j and w_k respectively represent the ith, jth and kth participles in the participle word graph; S(w_i) and S(w_j) respectively represent the initial scores of the participles w_i and w_j; In(w_i) represents the set of participles pointing to the participle w_i in the participle word graph; Out(w_j) represents the set of participles pointed to by the participle w_j in the participle word graph; d is a smoothing factor; and Sim(w_i, w_j) and Sim(w_k, w_j) represent the word sense similarities between the corresponding participles.
6. The method for extracting keywords according to claim 5, wherein the word sense similarity between adjacent participles in the participle word graph is calculated from the word vectors of the participles by the following formula:
Sim(w_i, w_j) = (e_i · e_j) / (‖e_i‖ × ‖e_j‖)
wherein Sim(w_i, w_j) represents the word sense similarity between the participles w_i and w_j, and e_i and e_j respectively represent the word vectors of the participles w_i and w_j.
7. The method of claim 1, wherein the determining a term frequency-inverse document frequency value of each candidate keyword, and processing the term frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword comprises:
calculating the term frequency-inverse document frequency value of each candidate keyword from its term frequency in the text to be extracted and its inverse document frequency in a preset corpus;
and for each candidate keyword, normalizing the term frequency-inverse document frequency value and the initial score, and performing weighted summation according to a preset weighting coefficient to obtain the final score of each candidate keyword.
8. The method for extracting keywords according to claim 1, wherein the segmenting words of the text to be extracted to obtain a segmentation set comprises:
and according to the knowledge field to which the text to be extracted belongs, performing word segmentation on the text to be extracted by using a dictionary of the corresponding field to obtain a word segmentation set.
9. A keyword extraction device is characterized by comprising:
the text preprocessing module is used for segmenting words of the text to be extracted to obtain a word segmentation set;
the word graph construction module is used for constructing a participle word graph corresponding to the word segmentation set according to a preset word graph model;
the word vector generation module is used for respectively generating word vectors corresponding to the participles according to the sememes of the participles in the participle set;
the score calculation module is used for calculating word sense similarity between adjacent participles in the participle word graph according to the word vector of each participle and calculating initial scores of each participle in the participle word graph according to the word sense similarity;
the candidate keyword screening module is used for screening the participles in the participle set according to the initial scores to obtain at least one candidate keyword;
the score correction module is used for determining a term frequency-inverse document frequency value of each candidate keyword, and processing the term frequency-inverse document frequency value and the initial score to obtain a final score of each candidate keyword;
and the keyword screening module is used for screening the at least one candidate keyword according to the final score to obtain at least one keyword.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the keyword extraction method according to any one of claims 1 to 8.
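Claims 2 to 4 can be sketched end to end as follows. The mean-of-sememes step follows claim 3; the softmax attention over a context vector is an assumption about the unspecified attention mechanism, and all names are illustrative:

```python
import numpy as np

def word_vector_from_sememes(sense_sememe_vecs, context_vec):
    # sense_sememe_vecs: one list of sememe vectors per sense of the word.
    # 1) Each sense vector = mean of its sememe vectors (claim 3).
    sense_vecs = [np.mean(np.stack(sememes), axis=0) for sememes in sense_sememe_vecs]
    # 2) Attention weight per sense: softmax of dot products with a context
    #    vector (the softmax form is assumed, not given in the claims).
    logits = np.array([float(np.dot(context_vec, s)) for s in sense_vecs])
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    # 3) Word vector = attention-weighted sum of sense vectors (claims 2 and 4).
    return sum(w * s for w, s in zip(weights, sense_vecs))
```

For a word with a single sense, the attention weight is 1 and the word vector reduces to the mean of that sense's sememe vectors, consistent with claim 3.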
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210473957.2A CN114912446A (en) | 2022-04-29 | 2022-04-29 | Keyword extraction method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114912446A true CN114912446A (en) | 2022-08-16 |
Family
ID=82765224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210473957.2A Pending CN114912446A (en) | 2022-04-29 | 2022-04-29 | Keyword extraction method and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114912446A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115712700A (en) * | 2022-11-18 | 2023-02-24 | 生态环境部环境规划院 | Hot word extraction method, system, computer device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255118A (en) * | 2017-07-11 | 2019-01-22 | 普天信息技术有限公司 | A kind of keyword extracting method and device |
CN113342928A (en) * | 2021-05-07 | 2021-09-03 | 上海大学 | Method and system for extracting process information from steel material patent text based on improved TextRank algorithm |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255118A (en) * | 2017-07-11 | 2019-01-22 | 普天信息技术有限公司 | A kind of keyword extracting method and device |
CN113342928A (en) * | 2021-05-07 | 2021-09-03 | 上海大学 | Method and system for extracting process information from steel material patent text based on improved TextRank algorithm |
Non-Patent Citations (2)
Title |
---|
XIONG AO ET AL.: "News keywords extraction algorithm based on TextRank and classified TF-IDF", 2020 International Wireless Communications and Mobile Computing (IWCMC), 27 July 2020 |
YILIN NIU ET AL.: "Improved Word Representation Learning with Sememes", Association for Computational Linguistics, pages 2049-2058, 4 August 2017 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115712700A (en) * | 2022-11-18 | 2023-02-24 | 生态环境部环境规划院 | Hot word extraction method, system, computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||