CN106970910B

CN106970910B - Keyword extraction method and device based on graph model

Info

Publication number: CN106970910B
Application number: CN201710208956.4A
Authority: CN
Inventors: 王亮
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2020-03-27
Anticipated expiration: 2037-03-31
Also published as: CN106970910A

Abstract

The embodiment of the invention provides a keyword extraction method and device based on a graph model, wherein the method comprises the following steps: acquiring a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed; searching a word vector corresponding to the candidate keyword in a word vector model, wherein the word vector model comprises word vectors of the candidate keyword; constructing a word similarity matrix of the candidate keywords according to the word vectors; and sequencing the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed. By applying the embodiment of the invention, the accuracy of keyword extraction is effectively improved.

Description

Keyword extraction method and device based on graph model

Technical Field

The invention relates to the technical field of keyword extraction, in particular to a keyword extraction method and device based on a graph model.

Background

Keywords are representative words of a text and are widely used in information retrieval, text classification, and similar tasks. Graph-model-based methods have been widely applied to search ranking, citation analysis, social networks, and natural language processing (for example, keyword extraction and extraction of topic sentences). A graph model is a general term for techniques that represent a probability distribution with a graph; a text can be mapped to a network graph in which words are nodes and the association relations between words are edges. Graph-model-based keyword extraction rests on two basic assumptions: 1. the quantity assumption: the more links a node has with other nodes, the more important the node is; 2. the quality assumption: the nodes linked to a node A differ in quality, and a high-quality node transfers more weight to other nodes through its links, so the more high-quality nodes link to node A, the more important node A is. The key to graph-model-based keyword extraction is therefore the calculation of link weights, where the link weight between two nodes is the similarity between the corresponding words.

In the existing graph-model-based keyword extraction method, a text is divided into constituent units (words or sentences), a graph model is built over them, the units are ranked by a voting mechanism, and the top-ranked units are selected as keywords. Specifically, the given text is first split into complete sentences. Each sentence is then segmented into words and tagged with parts of speech, yielding the words and their part-of-speech tags. Using these tags, stop words such as prepositions, auxiliary words, conjunctions, and interjections are filtered out, and words with specified parts of speech, such as nouns, verbs, and adjectives, are retained as candidate keywords. A candidate keyword graph model is then constructed: the candidate keywords are the nodes, and the association relations between candidate keywords, obtained by calculating the similarity between them, are the edges. In this method, the similarity between words is built with a sliding window: the words in each window vote for the adjacent windows, and the weight of a vote depends on the number of votes the word itself has received. Because each window shares co-occurring words with its adjacent window, the similarity between words can be derived from word co-occurrence. Finally, the votes of the candidate keywords are propagated iteratively over the graph to obtain a vote ranking, and the candidate keywords with the most votes are selected as keywords.

However, in the existing graph-model-based keyword extraction method, the similarity between words can be obtained only from word co-occurrence, so repeated words are weighted too heavily. For example, frequently repeated words such as "content", "calculation", "processing", and "resolution" receive the highest scores even though they are not suitable keywords, which results in low keyword extraction accuracy. In addition, the extraction result is sensitive to the window size, which must be set manually. For example, suppose a sentence consists of the words w1, w2, w3, w4, w5 … wn in sequence and the window size is set to k. Then (w1, w2, w3 … wk), (w2, w3, w4 … wk+1), (w3, w4, w5 … wk+2), and so on are all windows, and an undirected, unweighted edge exists between the nodes corresponding to any two words in the same window. Choosing windows of different sizes can therefore produce markedly different results, which also lowers keyword extraction accuracy.
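The window sensitivity described above can be illustrated with a minimal sketch of the prior-art edge construction. The function name `cooccurrence_edges` and the sample word list are hypothetical, introduced only for illustration; the rule itself (an undirected, unweighted edge between any two words in one window of size k) follows the description above.

```python
from itertools import combinations


def cooccurrence_edges(words, k):
    """Return the undirected, unweighted edges produced by sliding a window of size k."""
    edges = set()
    # If the sentence is shorter than k, treat the whole sentence as one window.
    for start in range(max(len(words) - k + 1, 1)):
        window = words[start:start + k]
        for a, b in combinations(window, 2):
            if a != b:
                edges.add(frozenset((a, b)))  # frozenset makes the edge undirected
    return edges


sentence = ["w1", "w2", "w3", "w4", "w5"]
edges_k2 = cooccurrence_edges(sentence, 2)
edges_k3 = cooccurrence_edges(sentence, 3)
print(len(edges_k2), len(edges_k3))
```

With the same five words, k = 2 and k = 3 give graphs with different edge sets, so any ranking computed on the graph depends on the manually chosen k.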

Disclosure of Invention

The embodiment of the invention aims to provide a keyword extraction method and device based on a graph model, so that the keyword extraction accuracy is improved. The specific technical scheme is as follows:

the embodiment of the invention discloses a keyword extraction method based on a graph model, which comprises the following steps:

acquiring a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed;

searching a word vector corresponding to the candidate keyword in a word vector model, wherein the word vector model comprises word vectors of the candidate keyword;

constructing a word similarity matrix of the candidate keywords according to the word vectors;

and sequencing the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed.

Optionally, the constructing a word similarity matrix of the candidate keyword according to the word vector includes:

according to the formula:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

calculating cosine values of the included angles between the word vectors of the candidate keywords, wherein $\theta$ represents the included angle between the word vectors of two candidate keywords, $x_{1k}$ represents the feature value of the vector of one candidate keyword in n-dimensional space, $x_{2k}$ represents the feature value of the vector of the other candidate keyword in n-dimensional space, and n represents the dimension of the vector space;

and constructing the similarity matrix of the candidate keywords according to the cosine value of the included angle of the word vectors.

Optionally, the sorting the candidate keywords according to the word similarity matrix of the candidate keywords includes:

calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;

sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;

and extracting the keywords of the text to be processed according to the importance degree.

Optionally, the calculating a word similarity matrix of the candidate keyword according to the PageRank algorithm includes:

determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;

calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;

according to the formula:

$$p_t = M^{T} p_{t-1}$$

calculating a feature vector value of the candidate keywords, wherein, when t = 1, $p_1$ represents said initial feature vector value and $p_0$ represents said initial value; $p_t$ represents the feature vector value of the word similarity matrix at step t; $p_{t-1}$ represents the feature vector value of the word similarity matrix at step t-1; M represents the word similarity matrix of the candidate keywords; $M^{T}$ represents the transpose of the word similarity matrix; and t represents the step number of the calculation, with t greater than or equal to 1;

and when the norm of the difference between the feature vector value of step t and the feature vector value of step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value of step t is the corresponding PageRank value of the candidate keywords.

Optionally, the obtaining a text to be processed and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed includes:

obtaining a text to be processed, and segmenting the text to be processed to obtain stop words and words with specified parts of speech, wherein the stop words at least comprise prepositions, auxiliary words, conjunctions and interjections, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;

and filtering the stop words to obtain the words with the specified part of speech, wherein the words with the specified part of speech are candidate keywords corresponding to the text to be processed.

Optionally, the word vector is obtained by word2vec training.

The embodiment of the invention also discloses a keyword extraction device based on the graph model, which comprises:

the acquisition module is used for acquiring a text to be processed and segmenting the text to be processed to obtain candidate keywords corresponding to the text to be processed;

the searching module is used for searching word vectors corresponding to the candidate keywords in a word vector model, and the word vector model comprises the word vectors of the candidate keywords;

the processing module is used for constructing a word similarity matrix of the candidate keywords according to the word vectors;

and the extraction module is used for sequencing the candidate keywords according to the word similarity matrix of the candidate keywords and extracting the keywords of the text to be processed.

Optionally, the processing module includes:

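a first calculation unit for calculating, according to the formula:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

cosine values of the included angles between the word vectors of the candidate keywords, wherein $\theta$ represents the included angle between the word vectors of two candidate keywords, $x_{1k}$ and $x_{2k}$ represent the feature values of the vectors of the two candidate keywords in n-dimensional space, and n represents the dimension of the vector space;

and the constructing unit is used for constructing the candidate keyword similarity matrix according to the cosine values of the included angles of the word vectors.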

Optionally, the extracting module includes:

the second calculation unit is used for calculating a word similarity matrix of the candidate keywords according to a PageRank algorithm to obtain corresponding PageRank values of the candidate keywords;

the sorting unit is used for sorting the candidate keywords according to the PageRank value to obtain the importance degree of the candidate keywords;

and the extraction unit is used for extracting the keywords of the text to be processed according to the importance degree.

Optionally, the second computing unit includes:

the first determining subunit is used for determining an initial value of the PageRank algorithm according to the order number of the word similarity matrix;

the first calculating subunit is used for calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;

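a second calculation subunit configured to calculate, according to the formula:

$$p_t = M^{T} p_{t-1}$$

a feature vector value of the candidate keywords, wherein $p_t$ represents the feature vector value of the word similarity matrix at step t, $p_{t-1}$ represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, $M^{T}$ represents the transpose of the word similarity matrix, and t represents the step number of the calculation, with t greater than or equal to 1;

and a second determining subunit, configured to, when the norm of the difference between the feature vector value of step t and the feature vector value of step t-1 is smaller than the error tolerance of the PageRank algorithm, determine that the feature vector value of step t is the corresponding PageRank value of the candidate keywords.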

Optionally, the obtaining module includes:

the acquisition unit is used for acquiring a text to be processed and segmenting the text to be processed to obtain stop words and words with specified parts of speech, wherein the stop words at least comprise prepositions, auxiliary words, conjunctions and interjections, and the words with the specified parts of speech at least comprise nouns, verbs and adjectives;

and the processing unit is used for filtering the stop words to obtain the words with the specified parts of speech, wherein the words with the specified parts of speech are the candidate keywords corresponding to the text to be processed.

Optionally, the word vector is obtained by word2vec training.

According to the method and the device for extracting the keywords based on the graph model, provided by the embodiment of the invention, the similarity between words in the text is calculated through the word vector, and the similarity matrix is constructed, so that the extracted keywords reflect the semantic importance of the keywords in the current text to a certain extent. When the similarity matrix is constructed, the similarity between words is obtained based on word vector calculation instead of the co-occurrence between words, so that the problem of overlarge word weighting caused by the co-occurrence between words in the keyword extraction process is solved, the size of a window does not need to be set artificially, keywords which are more consistent with the theme of a document are selected through the semantic similarity, and the accuracy of keyword extraction is improved. Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic structural diagram of a graph model in a conventional graph model-based keyword extraction method;

FIG. 2 is a flowchart of a keyword extraction method based on a graph model according to an embodiment of the present invention;

FIG. 3 is a structural diagram of a keyword extraction apparatus based on a graph model according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Graph-model-based keyword extraction is an effective keyword extraction method. A graph model is a general term for techniques that represent a probability distribution with a graph, and a text can be mapped to a network graph in which words are nodes and the association relations between words are edges. FIG. 1 is a schematic structural diagram of the graph model in a conventional graph-model-based keyword extraction method. In FIG. 1, w1, w2, w3 … w10, and w11 are candidate keywords and are also the nodes of the graph model; each line between two nodes is an edge representing the association relation between the two candidate keywords, and a thicker line indicates a larger edge weight, that is, a stronger association between the two keywords the edge connects. The present invention extracts keywords based on such a graph model.

Referring to fig. 2, fig. 2 is a flowchart of a keyword extraction method based on a graph model according to an embodiment of the present invention, including the following steps:

S201, obtaining a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed.

Specifically, a text to be processed is obtained and segmented into words. The purpose of word segmentation is to split the text according to certain rules so that candidate keywords can be extracted. Chinese is expressed through words, phrases, colloquial expressions, and the like, so Chinese word segmentation involves considerable uncertainty. The main current word segmentation methods are as follows: methods based on character-string matching, namely mechanical segmentation, whose algorithms are mature and widely used, which segment by matching the text against dictionary entries, and whose quality depends on the completeness of the dictionary; methods based on understanding, namely artificial-intelligence methods, which offer high segmentation precision but use complex algorithms; and methods based on statistics, which can recognize unknown words and proper nouns but require a large amount of training text. These methods all achieve high segmentation accuracy and fast segmentation. Here, an existing word segmentation method is used to segment the text to be processed; stop words such as prepositions, auxiliary words, conjunctions, and interjections are automatically filtered out; and words with specified parts of speech, such as nouns, verbs, and adjectives, are retained as candidate keywords. The candidate keywords corresponding to the text to be processed are thus obtained.
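The filtering part of step S201 can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of (word, part-of-speech) pairs already produced by some existing segmenter, the tag names in `STOP_POS` and `KEEP_POS` are hypothetical labels rather than the tag set of any particular tool, and removing duplicate words (so that each candidate keyword becomes a single graph node) is an assumption not stated in the text.

```python
# Hypothetical part-of-speech labels; a real segmenter would use its own tag set.
STOP_POS = {"preposition", "auxiliary", "conjunction", "interjection"}
KEEP_POS = {"noun", "verb", "adjective"}


def candidate_keywords(tagged_words):
    """Filter stop words and keep words with the specified parts of speech."""
    seen = set()
    candidates = []
    for word, pos in tagged_words:
        if pos in STOP_POS:
            continue  # stop words are filtered out
        if pos in KEEP_POS and word not in seen:
            seen.add(word)  # assumption: one graph node per distinct word
            candidates.append(word)
    return candidates


tagged = [
    ("graph", "noun"), ("of", "preposition"), ("model", "noun"),
    ("extract", "verb"), ("and", "conjunction"), ("graph", "noun"),
    ("quickly", "adverb"),
]
candidates = candidate_keywords(tagged)
print(candidates)
```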

S202, searching word vectors corresponding to the candidate keywords in the word vector model, wherein the word vector model comprises the word vectors of the candidate keywords.

Generally, a neural network takes a word from the vocabulary as input, outputs a low-dimensional vector representing the word, and then continuously optimizes its parameters by back-propagation. The output low-dimensional vector is a parameter of the first layer of the neural network. Neural network models that generate word vectors fall into two types: in the first, such as a word vector model trained with word2vec or GloVe (Global Vectors for Word Representation), generating word vectors is the purpose of the model; in the second, word vectors are generated as a by-product. One difference between the two types is the amount of computation. Another is the training goal: word2vec and GloVe aim to train word vectors that capture semantic relationships usable in subsequent tasks, so if a subsequent task does not need semantic relationships, word vectors generated this way are of little use. The second type of model trains word vectors according to the requirements of a specific task. Of course, if that specific task is language modeling, the word vectors generated by the two types of model are very similar.

Specifically, the natural language understanding problem must first be transformed mathematically into a machine learning problem. Word vectors have good semantic properties and are a common way to represent word features. A word vector is a multidimensional real-valued vector that encodes semantic and grammatical relations in natural language. The value of each dimension represents a feature with a certain semantic and grammatical interpretation, so each dimension of a word vector may be called a word feature. Word vectors here use a Distributed Representation, that is, a dense, low-dimensional real-valued vector; word vector computation maps each word in the vocabulary to a fixed-length vector through training. Each dimension of a Distributed Representation represents a latent feature of the word that captures useful syntactic and semantic properties, and the representation is characterized by spreading the different syntactic and semantic features of a word across its dimensions. This low-dimensional representation avoids the curse of dimensionality and exposes the associations between words: the similarity between two words can be obtained by calculating the distance between their word vectors, which improves the accuracy of the semantic representation.

The word vector model contains the word vectors corresponding to the candidate keywords. These vectors are looked up in the word vector model mainly so that the distance between candidate keywords can be calculated and the similarity between them obtained. Introducing word vectors into the existing graph-model-based keyword extraction method, and calculating the similarity between candidate keywords from them, solves the problem in the existing method that similarity between words is built with a sliding window whose size must be set manually, which lowers the accuracy of keyword extraction.
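The lookup in step S202 can be pictured as a simple table lookup. The sketch below assumes the trained word vector model is available as a plain mapping from word to vector; the name `lookup_vectors`, the toy vectors, and the choice to skip words missing from the model are all illustrative assumptions, not details given in the text.

```python
def lookup_vectors(candidates, word_vectors):
    """Return {word: vector} for the candidate keywords found in the model."""
    found = {}
    for word in candidates:
        vector = word_vectors.get(word)
        if vector is not None:  # assumption: words absent from the model are skipped
            found[word] = vector
    return found


# Toy stand-in for a trained word vector model (values are made up).
toy_model = {
    "graph": [0.9, 0.1, 0.0],
    "model": [0.7, 0.3, 0.1],
    "keyword": [0.2, 0.8, 0.4],
}
vectors = lookup_vectors(["graph", "keyword", "unseen"], toy_model)
print(sorted(vectors))
```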

S203, constructing a word similarity matrix of the candidate keywords according to the word vectors.

Specifically, the cosine distance between two word vectors indicates how closely the corresponding words are related; that is, the similarity between candidate keywords is obtained by calculating the cosine distance between their word vectors. The similarities obtained are expressed as numerical values, and these values form the elements of the word similarity matrix, which is a square matrix of order N. As shown in Table 1, A, B, C, D, E, F, G, and H in the table represent the word vectors corresponding to the candidate keywords, and each value in the table is the cosine distance between two word vectors, i.e., the similarity between the two candidate keywords.

TABLE 1

Then according to the similarity between these candidate keywords, a similarity matrix of the candidate keywords is constructed, which is denoted by M, that is

S204, ranking the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed.

Specifically, the keyword ranking algorithm of the graph-model-based keyword extraction method is applied to the word similarity matrix of the candidate keywords to obtain a ranking score for each candidate keyword. The candidate keywords are then sorted by ranking score. Finally, the top-ranked candidate keywords are selected as the keywords of the text to be processed. The number of top-ranked candidate keywords selected depends on actual needs.

Therefore, according to the keyword extraction method based on the graph model, provided by the embodiment of the invention, the similarity between words in the text is calculated through the word vector, and the similarity matrix is constructed, so that the extracted keywords reflect the semantic importance of the keywords in the current text to a certain extent. When the similarity matrix is constructed, the similarity between words is obtained based on word vector calculation instead of the co-occurrence between words, so that the problem of overlarge word weighting caused by the co-occurrence between words in the keyword extraction process is solved, the size of a window does not need to be set artificially, keywords which are more consistent with the theme of a document are selected through the semantic similarity, and the accuracy of keyword extraction is improved.

In an optional embodiment of the present invention, constructing a word similarity matrix of the candidate keyword according to the word vector includes:

according to the formula:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

calculating cosine values of the included angles between the word vectors of the candidate keywords, wherein $\theta$ represents the included angle between the word vectors of two candidate keywords, $x_{1k}$ represents the feature value of the vector of one candidate keyword in n-dimensional space, $x_{2k}$ represents the feature value of the vector of the other candidate keyword in n-dimensional space, and n represents the dimension of the vector space.

And constructing a candidate keyword similarity matrix according to the cosine value of the word vector included angle.

Specifically, the similarity between words is obtained by calculating the distance between word vectors. The distance between the word vectors is calculated through the cosine value of the included angle between the word vectors, so the method constructs the similarity matrix of the candidate keywords by calculating the cosine value of the included angle of the word vectors corresponding to the candidate keywords and then according to the cosine value of the included angle of the word vectors.

The cosine values of the included angles between the word vectors of the candidate keywords are obtained from the formula for the cosine of the included angle between vectors in n-dimensional space. In n-dimensional space, for example, given two vectors $A = (x_{11}, x_{12}, \dots, x_{1n})$ and $B = (x_{21}, x_{22}, \dots, x_{2n})$, the cosine of the angle between vector A and vector B is calculated as:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

where $\theta$ represents the angle between vector A and vector B, $x_{1k}$ represents the feature values of vector A, $x_{2k}$ represents the feature values of vector B, and n represents the dimension of the vector space.

In two-dimensional space, for example, given two vectors $A = (x_{11}, x_{12})$ and $B = (x_{21}, x_{22})$, the cosine of the angle between vector A and vector B is calculated as:

$$\cos\theta = \frac{x_{11}x_{21} + x_{12}x_{22}}{\sqrt{x_{11}^{2} + x_{12}^{2}}\,\sqrt{x_{21}^{2} + x_{22}^{2}}}$$

where $\theta$ represents the angle between vector A and vector B, $x_{11}$ and $x_{12}$ represent the feature values of vector A, and $x_{21}$ and $x_{22}$ represent the feature values of vector B.

In three-dimensional space, for example, given two vectors $A = (x_{11}, x_{12}, x_{13})$ and $B = (x_{21}, x_{22}, x_{23})$, the cosine of the angle between vector A and vector B is calculated as:

$$\cos\theta = \frac{x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23}}{\sqrt{x_{11}^{2} + x_{12}^{2} + x_{13}^{2}}\,\sqrt{x_{21}^{2} + x_{22}^{2} + x_{23}^{2}}}$$

where $\theta$ represents the angle between vector A and vector B, $x_{11}$, $x_{12}$, and $x_{13}$ represent the feature values of vector A, and $x_{21}$, $x_{22}$, and $x_{23}$ represent the feature values of vector B.

The cosine of the included angle between two vectors in higher-dimensional spaces is not enumerated here; any calculation that conforms to the formula for the cosine of the included angle between vectors in n-dimensional space falls within the protection scope of the invention.
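The construction of the word similarity matrix from cosine values can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names and the sample two-dimensional vectors are hypothetical, and returning 0.0 for a zero-length vector is an assumption added to avoid division by zero.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Square matrix whose (i, j) entry is the cosine between vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


word_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
M = similarity_matrix(word_vectors)
for row in M:
    print([round(value, 4) for value in row])
```

The resulting matrix is symmetric with ones (up to rounding) on the diagonal, since each vector makes a zero angle with itself.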

In the embodiment of the present invention, ranking the candidate keywords according to the word similarity matrix of the candidate keywords includes:

Specifically, the PageRank algorithm is part of Google's ranking algorithm (ranking formula); it is a method Google uses to assess the rank or importance of a web page and a criterion Google uses to measure the quality of a website. The present invention ranks keywords using the principle of the PageRank algorithm: the PageRank algorithm is applied to the word similarity matrix of the candidate keywords, and the corresponding PageRank values of the candidate keywords are finally obtained through iteration.

Here, the candidate keyword with the largest PageRank value is taken to be the keyword of greatest interest to a user searching by keyword, and the remaining keywords follow in decreasing order. For example, if the ranked candidate keywords are B: 1.47, H: 1.41, E: 1.39, A: 1.30, F: 1.14, G: 1.12, D: 1.09, C: 1.08, then candidate keyword B is the most important, and the importance of the other candidate keywords decreases in the order listed.

Here, the top ranked (top N) candidate keywords are extracted as keywords of the text to be processed, as actually required.
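Sorting by PageRank value and taking the top N can be sketched with the example values given above. The function name `top_keywords` is hypothetical, and N = 3 is chosen only for illustration.

```python
def top_keywords(scores, n):
    """Return the n candidate keywords with the largest PageRank values."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# PageRank values taken from the example in the text.
pagerank_values = {
    "A": 1.30, "B": 1.47, "C": 1.08, "D": 1.09,
    "E": 1.39, "F": 1.14, "G": 1.12, "H": 1.41,
}
top3 = top_keywords(pagerank_values, 3)
print(top3)
```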

In the embodiment of the invention, the word similarity matrix of the candidate keywords is calculated according to the PageRank algorithm, and the method comprises the following steps:

determining an initial value of a PageRank algorithm according to the order number of the word similarity matrix;

Specifically, the initial value of the PageRank algorithm is determined according to the size N of the matrix, namely

$p_0$ represents the initial value of the PageRank algorithm. Here, since the PageRank algorithm assumes that every web page has equal probability, each candidate keyword is likewise assumed to occur with equal probability, i.e.

And will be

as the initial value for the PageRank algorithm.

Calculating an initial feature vector value of the candidate keywords according to the initial value and the word similarity matrix;

Specifically, according to the formula

$$p_1 = M^{T} p_0$$

the initial feature vector value of the candidate keywords is calculated, wherein $p_1$ represents the initial feature vector value of the PageRank algorithm, $p_0$ represents the initial value of the PageRank algorithm, M represents the word similarity matrix of the candidate keywords, and $M^{T}$ represents the transpose of the word similarity matrix.

According to the formula:

$$p_t = M^{T} p_{t-1}$$

calculating a feature vector value of the candidate keywords, wherein, when t = 1, $p_1$ represents said initial feature vector value and $p_0$ represents said initial value; $p_t$ represents the feature vector value of the word similarity matrix at step t; $p_{t-1}$ represents the feature vector value of the word similarity matrix at step t-1; M represents the word similarity matrix of the candidate keywords; $M^{T}$ represents the transpose of the word similarity matrix; and t represents the step number of the calculation, with t greater than or equal to 1;

Specifically, the PageRank algorithm is an iterative algorithm: the final PageRank value of each candidate keyword is obtained by repeatedly iterating over the word similarity matrix of the candidate keywords, which makes the extracted keywords more accurate.

And when the norm of the difference between the feature vector value of step t and the feature vector value of step t-1 is smaller than the error tolerance of the PageRank algorithm, the feature vector value of step t is the corresponding PageRank value of the candidate keywords.

Here, because an error exists in the vector calculation process, an error tolerance e is preset in the PageRank algorithm, and when the norm of the characteristic vector value in the t-th step and the norm of the characteristic vector value in the t-1-th step is smaller than the error tolerance of the PageRank algorithm, the PageRank value corresponding to the obtained candidate keyword is more accurate, which is beneficial to improving the keyword extraction accuracy. The specific algorithm is as follows:

the specific process comprises the following steps:

first, the inputs of the PageRank algorithm are a stochastic, irreducible, aperiodic matrix M, the size N of the matrix, and the error tolerance e. Here, the matrix M is constructed from word vectors, i.e. it is the word similarity matrix of the present invention, and the size N of the matrix is its order. In addition, because the vector calculation involves errors, the PageRank algorithm presets an error tolerance e.

Then, the PageRank algorithm calculates the feature vector values of the candidate keywords through the following steps:

Step 1: determine the initial value of the PageRank algorithm according to the size N of the matrix, namely

p₀ = (1/N, 1/N, …, 1/N)^T

where p₀ represents the initial value of the PageRank algorithm. Because the PageRank algorithm assumes that every web page is equally probable, each candidate keyword is likewise assumed to occur with equal probability 1/N, and this vector is taken as the initial value of the PageRank algorithm.

Step 2: set t = 0, where t represents the number of steps calculated by the PageRank algorithm; t = 0 means that no calculation with the similarity matrix M has been performed yet.

Steps 3 and 4: repeat the following calculation, setting t = t + 1 on each pass.

Step 5: according to the formula

p_t = M^T p_{t-1}

calculate the feature vector value of the word similarity matrix, where p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, and t represents the number of calculation steps. Here, since the PageRank algorithm is an iterative algorithm, the feature vector value of the word similarity matrix can be obtained accurately only by iterating repeatedly over the word similarity matrix M.

Step 6: δ = ||p_t - p_{t-1}||

Step 7: until δ < e, that is, the calculation stops only when the norm of the difference between the feature vector value of the word similarity matrix at step t and that at step t-1 is less than the error tolerance e.

Step 8: return p_t, which is the final feature vector value of the word similarity matrix.

Finally, the eigenvector p_t, i.e. the final feature vector value of the word similarity matrix, is output.
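The eight steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names, the `max_steps` safeguard, the use of the L1 norm for δ, and the `row_normalize` helper (which makes a non-negative similarity matrix row-stochastic) are all assumptions, since the text does not specify the norm or how M is made stochastic.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(S: Matrix) -> Matrix:
    """Scale each row to sum to 1 (assumption: entries are non-negative)."""
    out = []
    for row in S:
        total = sum(row)
        if total > 0:
            out.append([x / total for x in row])
        else:
            # An all-zero row becomes uniform so the matrix stays stochastic.
            out.append([1.0 / len(row)] * len(row))
    return out


def pagerank_power_iteration(M: Matrix, e: float = 1e-8,
                             max_steps: int = 1000) -> List[float]:
    """Iterate p_t = M^T p_{t-1} until ||p_t - p_{t-1}|| < e."""
    n = len(M)
    p = [1.0 / n] * n                        # step 1: p_0 = (1/N, ..., 1/N)^T
    for _ in range(max_steps):               # steps 2-4: repeat, t = t + 1
        # Step 5: entry j of M^T p is sum_i M[i][j] * p[i].
        p_next = [sum(M[i][j] * p[i] for i in range(n)) for j in range(n)]
        # Step 6: delta as the L1 norm of the difference (assumed norm).
        delta = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if delta < e:                        # step 7: stop once delta < e
            break
    return p                                 # step 8: return p_t


# Hypothetical 2x2 similarity scores, not taken from the patent.
M = row_normalize([[1.0, 1.0], [1.0, 4.0]])
scores = pagerank_power_iteration(M)
print(scores)
```

For this toy matrix the iteration approaches (2/7, 5/7), the vector that satisfies p = M^T p. Candidate keywords would then be ranked by their entries in the returned vector.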

In the embodiment of the present invention, obtaining a text to be processed, and performing word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed, includes:

acquiring the text to be processed and performing word segmentation on it to obtain stop words and words of specified parts of speech, wherein the stop words at least include prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech at least include nouns, verbs and adjectives.

Specifically, the words obtained after the text to be processed is segmented fall into two types: stop words and words of specified parts of speech. In information retrieval, in order to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed; these characters or words are called stop words. Stop words are words that appear in large numbers in a text but contribute little to characterizing it, such as "I'm", "then", "yes" and "additionally". To filter stop words, a stop word list is constructed, which mainly includes adverbs, conjunctions, prepositions and auxiliary words. After the stop words are filtered out, the words of specified parts of speech remain, and these are the candidate keywords corresponding to the text to be processed. Stop words must therefore be filtered after Chinese word segmentation; doing so effectively raises the density of keywords, greatly reduces the dimensionality of the text, and avoids the curse of dimensionality.
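The filtering step can be sketched as follows, assuming the segmenter has already produced (word, part-of-speech) pairs. The tag names, the `candidate_keywords` function and the removal of repeated words are hypothetical choices for illustration; a real word segmenter defines its own tag set.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical part-of-speech labels; real segmenters use their own tags.
KEPT_POS = {"noun", "verb", "adjective"}
STOP_POS = {"preposition", "auxiliary", "conjunction", "interjection"}


def candidate_keywords(tagged_tokens: Iterable[Tuple[str, str]],
                       stop_words: Set[str]) -> List[str]:
    """Drop stop words and keep words of the specified parts of speech."""
    seen: Set[str] = set()
    candidates: List[str] = []
    for word, pos in tagged_tokens:
        # Filter by the stop word list and by stop-word parts of speech.
        if word in stop_words or pos in STOP_POS:
            continue
        # Keep each remaining noun, verb or adjective once (assumption:
        # each distinct word becomes a single node of the graph).
        if pos in KEPT_POS and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates


tokens = [("the", "article"), ("model", "noun"), ("and", "conjunction"),
          ("extracts", "verb"), ("model", "noun"), ("of", "preposition"),
          ("useful", "adjective")]
result = candidate_keywords(tokens, stop_words={"the"})
print(result)
```

Keeping first-occurrence order makes the row and column order of the later similarity matrix reproducible.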

In the embodiment of the invention, the word vectors are trained with word2vec, which expresses words in vector form.

Specifically, Word2vec is an efficient tool, open-sourced by Google in 2013, that represents words as real-valued vectors. Using the idea of deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent similarity in text semantics. Word2vec uses the Distributed Representation form of word vectors. Distributed Representation was first proposed by Hinton in 1986. The basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model), and to judge the semantic similarity between words by the distance between them (such as cosine similarity or Euclidean distance). It adopts a three-layer neural network: input layer, hidden layer and output layer. Huffman coding based on word frequency is used, so that words with similar frequencies activate essentially the same hidden-layer content and words that occur more frequently activate fewer hidden-layer nodes, which effectively reduces the computational complexity. In this way words can be converted into vectors, and synonyms can be searched for.

Compared with the existing keyword extraction method, the keyword extraction method based on the graph model provided by the invention achieves better results. Table 2 compares the keyword ranking obtained by the keyword extraction method proposed in the present invention with the keyword ranking obtained by the existing keyword extraction method.

TABLE 2

As can be seen from Table 2, texts 1 and 2 are short texts in which each candidate keyword appears only once, so every candidate keyword is equally likely to be extracted as a keyword by the existing keyword extraction method. That method therefore cannot accurately extract keywords from texts 1 and 2, whereas the keyword extraction method provided by the present invention can rank each candidate keyword and thereby extract the keywords. Text 3 is a long text in which the candidate keywords appear repeatedly. In the keyword ranking obtained by the existing keyword extraction method, words that have no practical significance are nevertheless ranked as keywords merely because they occur many times in the text; the keyword ranking obtained by the keyword extraction method provided by the present invention yields higher keyword extraction accuracy.

Referring to fig. 3, fig. 3 is a structural diagram of a keyword extraction apparatus based on a graph model according to an embodiment of the present invention, where the apparatus includes the following modules:

the acquiring module 301 is configured to acquire a text to be processed, and perform word segmentation on the text to be processed to obtain candidate keywords corresponding to the text to be processed;

a searching module 302, configured to search a word vector corresponding to the candidate keyword in a word vector model, where the word vector model includes word vectors of the candidate keyword;

the processing module 303 is configured to construct a word similarity matrix of the candidate keyword according to the word vector;

and the extraction module 304 is configured to sort the candidate keywords according to the word similarity matrix of the candidate keywords, and extract keywords of the text to be processed.

Further, the processing module 303 includes:

a first calculation unit, configured to calculate, according to the formula:

cos θ = (Σ_{k=1}^{n} x_{1k} x_{2k}) / (sqrt(Σ_{k=1}^{n} x_{1k}^2) · sqrt(Σ_{k=1}^{n} x_{2k}^2))

the cosine of the angle between the word vectors of each pair of candidate keywords, where θ represents the angle between the vectors of the two candidate keywords, x_{1k} represents the k-th component of the vector of one candidate keyword in n-dimensional space, x_{2k} represents the k-th component of the vector of the other candidate keyword in n-dimensional space, and n represents the dimension of the vector space;

and a construction unit, configured to construct the candidate keyword similarity matrix according to the cosines of the angles between the word vectors.
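The two units can be sketched together as follows: one function computes the cosine of the angle between two word vectors, and another fills the similarity matrix pairwise. The function names and the toy 2-dimensional vectors are hypothetical; in practice the vectors would be looked up in the trained word vector model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """cos(theta) = sum_k u_k v_k / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(words: List[str],
                      vectors: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the cosine between the vectors of words i and j."""
    return [[cosine(vectors[a], vectors[b]) for b in words] for a in words]


# Toy vectors for illustration only; real word2vec vectors have K dimensions.
toy_vectors = {"film": [1.0, 0.0], "actor": [1.0, 1.0], "river": [0.0, 1.0]}
words = ["film", "actor", "river"]
S = similarity_matrix(words, toy_vectors)
for row in S:
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetric with ones on the diagonal. Cosines of real word vectors can be negative, and the rows do not sum to 1, so some rescaling would be needed before the matrix is used as a stochastic matrix; the text does not say how this is done.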

Further, the extraction module 304 includes:

Further, the second calculation unit includes:

the first calculation subunit is used for calculating an initial characteristic vector value of the candidate keyword according to the initial value and the word similarity matrix;

a second calculation subunit, configured to calculate the feature vector value of the candidate keywords according to the formula:

p_t = M^T p_{t-1}

and a second determining subunit, configured to determine the feature vector value at step t as the PageRank value of the corresponding candidate keywords when the norm of the difference between the feature vector value at step t and the feature vector value at step t-1 is smaller than the error tolerance of the PageRank algorithm.

Further, the obtaining module 301 includes:

an acquisition unit, configured to acquire a text to be processed and segment the text to be processed to obtain stop words and words of specified parts of speech, wherein the stop words at least include prepositions, auxiliary words, conjunctions and interjections, and the words of specified parts of speech at least include nouns, verbs and adjectives;

and a processing unit, configured to filter out the stop words to obtain the words of specified parts of speech, the words of specified parts of speech being the candidate keywords corresponding to the text to be processed.

Further, the word vector is obtained by word2vec training.

Therefore, in the keyword extraction device based on the graph model provided by the embodiment of the invention, the processing module calculates the similarity between words in the text from their word vectors and constructs the similarity matrix, so that the extracted keywords reflect, to a certain extent, their semantic importance in the current text. Because the similarity between words is calculated from word vectors rather than from the co-occurrence of words when the similarity matrix is constructed, the problem of words being over-weighted owing to co-occurrence during keyword extraction is avoided, no window size needs to be set manually, keywords that better match the theme of the document are selected through semantic similarity, and the accuracy of keyword extraction is improved.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present specification are described in a related manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for the relevant points reference may be made to the corresponding description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A keyword extraction method based on a graph model is characterized by comprising the following steps:

sorting the candidate keywords according to the word similarity matrix of the candidate keywords, and extracting the keywords of the text to be processed;

the sorting the candidate keywords according to the word similarity matrix of the candidate keywords comprises:

extracting keywords of the text to be processed according to the importance degree;

the calculating of the word similarity matrix of the candidate keywords according to the PageRank algorithm includes:

according to the formula:

p_t = M^T p_{t-1}

calculating a feature vector value of the candidate keywords, wherein, when t = 1, p₁ represents said initial feature vector value and p₀ represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the number of calculation steps, the value of t being greater than or equal to 1;

2. The method of claim 1, wherein said constructing a word similarity matrix for the candidate keywords from the word vectors comprises:

according to the formula:

3. The method according to any one of claims 1 to 2, wherein the obtaining of the text to be processed and the word segmentation of the text to be processed to obtain the candidate keywords corresponding to the text to be processed comprises:

4. The method of any of claims 1-2, wherein the word vector is obtained by word2vec training.

5. A keyword extraction apparatus based on a graph model, the apparatus comprising:

the extraction module is used for sequencing the candidate keywords according to the word similarity matrix of the candidate keywords and extracting the keywords of the text to be processed;

the extraction module comprises:

the extraction unit is used for extracting the keywords of the text to be processed according to the importance degree;

the second calculation unit includes:

a second calculation subunit configured to, according to the formula:

p_t = M^T p_{t-1}

calculate a feature vector value of the candidate keywords, wherein, when t = 1, p₁ represents said initial feature vector value and p₀ represents the initial value; p_t represents the feature vector value of the word similarity matrix at step t, p_{t-1} represents the feature vector value of the word similarity matrix at step t-1, M represents the word similarity matrix of the candidate keywords, M^T represents the transpose of the word similarity matrix, and t represents the number of calculation steps, the value of t being greater than or equal to 1;

6. The apparatus of claim 5, wherein the processing module comprises:

a first calculation unit for, according to the formula:

7. The apparatus of any one of claims 5 to 6, wherein the obtaining module comprises:

8. The apparatus of any of claims 5 to 6, wherein the word vector is obtained by word2vec training.