CN108763206B - Method for quickly sequencing keywords of single text - Google Patents

Method for quickly ranking keywords of a single text

Info

Publication number
CN108763206B
Authority
CN
China
Prior art keywords
value
iteration
vector
sequence
single text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810491735.7A
Other languages
Chinese (zh)
Other versions
CN108763206A (en)
Inventor
徐小龙
柳林青
孙雁飞
李云
李洋
徐佳
王俊昌
朱洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810491735.7A priority Critical patent/CN108763206B/en
Publication of CN108763206A publication Critical patent/CN108763206A/en
Application granted granted Critical
Publication of CN108763206B publication Critical patent/CN108763206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quickly ranking the keywords of a single text, characterized by comprising the following steps. S1: select a single text, convert it into a corresponding graph model, and generate a candidate-word adjacency matrix from the graph model. S2: iterate with the power method to generate an approximation of the eigenvector corresponding to the eigenvalue 1 of the candidate-word adjacency matrix. S3: in step S2, perform a qualitative analysis of the vector generated by each power-method iteration to produce a partial ranking vector. S4: set a decision threshold, compute the inversion value between the ranking vectors generated by two adjacent iterations, compare it with the inversion value of the previous iteration, and compare the previous iteration's inversion value with the decision threshold. The method makes the iterative process converge quickly, effectively reduces the time complexity of the computation, and offers high extraction precision and high ranking accuracy.

Description

Method for quickly ranking keywords of a single text
Technical Field
The invention belongs to the field of natural language processing, concerns the extraction and ranking of keywords from single texts, and in particular relates to a method for quickly ranking the keywords of a single text.
Background
One goal of natural language processing is to simplify the presentation of, and effectively organize, large volumes of documents for human search and study: how to represent a document by a few words or sentences, and how to respond quickly and accurately once a search demand is posed. The basic task behind this goal is keyword extraction: an algorithm processes a document and generates a keyword sequence whose members, in order, express and depend on the corresponding document. Two kinds of data are associated with this task: the keyword ranking results and the vector representation of the keywords.
Several kinds of methods are currently mainstream in single-text keyword ranking: methods based on term frequency, methods based on candidate-word position, methods based on a text node-graph model, and combinations of the above.
In the text node-graph model method, the theory and implementation follow the ranking of node importance in a very-large-scale network such as the Internet: the text is converted into a node-graph model, iterative computation is carried out with the power method, the change in the values of the iteration vector is measured to decide whether the iteration has finished, and the corresponding keyword ranking is output.
The power-method iteration over a node-graph model is theoretically complete. Current research on node-graph models does not focus much on the model, its parameters, or the iterative process itself; it focuses instead on applying the method to the various natural-language-processing subtasks and on combining it with other algorithms. In this respect, the power-method iteration over a node-graph model is treated more as an "atomic method" applied within natural language processing tasks.
In a practical keyword extraction task, the output of the text node-graph computation is a vector space model whose elements and values represent the corresponding words and their weights. After sorting, the m words with the highest weights are the keywords of the document, and their differing weights indicate how strongly each represents the document. As an atomic method, this quantitative output can conveniently be combined with other algorithms through vector calculations to produce more reasonable and effective results. Other tasks, such as document clustering, partitioning, and sentiment analysis, likewise need a single document's keyword set and a quantitative representation of its weights as input for their next stage of work.
In tasks that require only a keyword sequence, or even only a keyword set, without keyword vector values, the high time complexity of the iterative process makes the method less suitable.
Disclosure of Invention
The main aim of the invention is to provide a method for quickly ranking the keywords of a single text that completes keyword extraction and ranking with lower time complexity while still guaranteeing high extraction precision. The specific technical scheme is as follows:
A method for quickly ranking the keywords of a single text, the method comprising the steps of:
S1: selecting a single text, converting it into a corresponding graph model, and generating a candidate-word adjacency matrix from the graph model;
S2: iterating with the power method to generate an approximation of the eigenvector corresponding to the eigenvalue 1 of the candidate-word adjacency matrix;
S3: in step S2, performing a qualitative analysis of the vector generated by each power-method iteration to produce a partial ranking vector;
S4: setting a decision threshold, computing the inversion value between the ranking vectors generated by two adjacent iterations, comparing it with the inversion value of the previous iteration, and comparing the previous iteration's inversion value with the decision threshold.
Further, in step S1, the single text is treated as a bag of words, the graph model is constructed from the bag of words, and the graph model is converted to generate the candidate-word adjacency matrix corresponding to the single text.
Further, the generation of the candidate-word adjacency matrix includes the following steps. First, assume a single text T and let T = {C_1, C_2, ..., C_m}, where C_i is a sentence in T and C_i = {s_i1, s_i2, ...}, with s_ij a candidate word. Let the node graph be G = (P, V); for every word s ∈ P there is some C_i with s ∈ C_i. After the single text T is segmented into terms, the co-occurrence probabilities between the candidate words are taken as the weights of the edges in the node graph G; this yields the adjacency matrix of the text graph model, denoted M = (w_ij) ∈ R^{n×n}.
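The construction above can be sketched in Python. This is a minimal illustration, not the patent's exact specification: the window-based co-occurrence counting and the row-wise normalization into probabilities below are assumptions, since the text does not fix a window size or normalization.

```python
def build_adjacency(sentences, window=2):
    """Build a candidate-word adjacency matrix M = (w_ij) from a tokenized text.

    `sentences` is a list of token lists (the bag-of-words matrix S).
    Edge weights are co-occurrence probabilities; the window size and the
    row normalization are illustrative assumptions.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    counts = [[0.0] * n for _ in range(n)]
    for sent in sentences:
        for i, w in enumerate(sent):
            # count symmetric co-occurrences within the sliding window
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                a, b = index[w], index[sent[j]]
                if a != b:
                    counts[a][b] += 1
                    counts[b][a] += 1
    # normalize each row to sum to 1, turning counts into probabilities
    for row in counts:
        total = sum(row)
        if total:
            for k in range(n):
                row[k] /= total
    return vocab, counts
```

The returned matrix is row-stochastic wherever a word co-occurs with anything, which is what the power iteration in the next step expects.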
Further, in step S4, if the inversion value is smaller than the inversion value of the previous iteration and both are smaller than the decision threshold, the iteration stops and the ranking vector is output; otherwise, steps S2 to S4 are repeated.
Further, the power-method iteration includes the following steps. First, set a vector sequence {P(t)}. Take the unit vector P(0) = e^T = (1, 1, ..., 1)^T ∈ R^n as the initial value and assign the sequence by P(t+1) = M^T × P(t), t = 0, 1, 2, .... Then, in each iteration, perform m rounds of simple selection sort over the current elements of P(t) to obtain the indices of the m largest elements, generating the ranking vector sequence Q(t). Compute the inversion value K(Q(t), Q(t-1)) of the vector pair (Q(t), Q(t-1)) and compare it with a given threshold ε. Finally, judge from the comparison K(Q(t), Q(t-1)) ≤ K(Q(t-1), Q(t-2)) < ε whether the iteration ends: if the relation holds, output the vector Q(t); otherwise, repeat steps S2 to S4.
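The iteration and its qualitative stopping rule can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the helper `inversions` uses a simple n(n−1)/2 normalization rather than the MP term defined later, and ties in the ranking are broken by index order — both are simplifying assumptions.

```python
def top_m_ranking(p, m):
    """Indices of the m largest entries of p, in descending order of value
    (ties broken by index order; a simplifying assumption)."""
    return sorted(range(len(p)), key=lambda i: -p[i])[:m]

def inversions(q1, q2):
    """Fraction of discordant index pairs between two top-m lists, using a
    plain n(n-1)/2 normalization (an assumption; the text uses MP instead)."""
    pos1 = {v: k for k, v in enumerate(q1)}
    pos2 = {v: k for k, v in enumerate(q2)}
    common = [v for v in q1 if v in pos2]
    disc = 0
    for a in range(len(common)):
        for b in range(a + 1, len(common)):
            i, j = common[a], common[b]
            if (pos1[i] - pos1[j]) * (pos2[i] - pos2[j]) < 0:
                disc += 1
    denom = max(len(q1) * (len(q1) - 1) // 2, 1)
    return disc / denom

def power_rank(M, m, eps=0.05, max_iter=200):
    """Power iteration P(t+1) = M^T x P(t) from P(0) = (1,...,1)^T, halting
    by the qualitative rule K(Q(t),Q(t-1)) <= K(Q(t-1),Q(t-2)) < eps."""
    n = len(M)
    p = [1.0] * n                      # P(0) = (1, 1, ..., 1)^T
    q_hist = []
    for _ in range(max_iter):
        # p_new[j] = sum_i M[i][j] * p[i], i.e. M^T x P(t)
        p = [sum(M[i][j] * p[i] for i in range(n)) for j in range(n)]
        q_hist.append(top_m_ranking(p, m))
        if len(q_hist) >= 3:
            k1 = inversions(q_hist[-1], q_hist[-2])
            k2 = inversions(q_hist[-2], q_hist[-3])
            if k1 <= k2 < eps:         # ordering has stabilized: stop early
                break
    return q_hist[-1]
```

Note that the loop returns only the ranking Q(t), never the converged vector values — which is exactly the situation in which the qualitative stop rule saves iterations.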
Further, for the calculation of the inversion value K(Q(t), Q(t-1)), set a sequence σ and a sequence τ, and let the rank of the node numbered i in the node graph G be σ(i) in the sequence σ and τ(i) in the sequence τ. This gives the function
K(σ, τ) = Σ_{(i,j)} D_{ij}(σ, τ) / MP,
where the sum runs over all node pairs (i, j) present in σ and τ, and D_{ij}(σ, τ) = 1 if the pair is ordered differently by the two sequences (that is, (σ(i) − σ(j))(τ(i) − τ(j)) < 0) and 0 otherwise. When the modulus |σ| of σ and τ is odd, MP = |σ|² − 1; when it is even, MP = |σ|². The value of this function lies in [0, 1].
In the method of the invention for quickly ranking the keywords of a single text, a candidate-word adjacency matrix is first generated from the graph model formed by the text; power-method iteration then generates an approximation of the eigenvector corresponding to the eigenvalue 1 of that matrix; each generated vector is analyzed qualitatively to produce a partial ranking vector; the inversion value between the ranking vectors of two adjacent iterations is computed and compared with the inversion value of the previous iteration, while the previous iteration's inversion value is compared with the set decision threshold. If the current inversion value is smaller than the previous one and the previous one is smaller than the decision threshold, the iteration stops and the ranking vector is output; otherwise, the iteration steps are repeated until the comparison condition is met. Compared with the prior art, the invention has the following beneficial effects: it has lower time complexity; it greatly reduces the number of iterations relative to the existing power-method iteration based on the node-graph model, reducing the number of operations on the experimental texts by 10 to 25 percent overall; and it guarantees the stability of the keyword sequence without reducing keyword ranking and extraction precision.
Drawings
FIG. 1 is a flow chart of the single-text keyword ranking method of the present invention;
FIG. 2 is a diagram illustrating the data structures required to generate the adjacency matrix from a single text according to the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
In the existing power method, ||P(t) − P(t−1)||₁ < ε is used as the stopping condition, comparing the quantized values of the iteration vectors; this may be called a quantitative analysis of the sequence P(t). The iteration is judged finished only when the sum of the absolute changes of all elements of the vector is smaller than a threshold ε.
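The classical quantitative criterion amounts to a one-line check:

```python
def l1_converged(p_new, p_old, eps):
    """Classical quantitative stop rule: the sum of absolute element-wise
    changes between consecutive iteration vectors falls below eps."""
    return sum(abs(a - b) for a, b in zip(p_new, p_old)) < eps
```

This is the baseline the invention replaces with a qualitative, ordering-based check.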
For the vector P(t), the order relationships between its elements are likely to be settled well before the quantitative analysis passes. Once these order relationships, i.e. the ordering of the vector, stop changing, or change only slightly and locally as the iteration continues (for example, in a vector over 500 distinct words, the 299th and 300th words after sorting swap places after some iteration), the vector enters an "order-stable" stage. At this stage the algorithm still needs many iterations before the quantitative analysis of P(t) passes, yet these subsequent iterations are useless for tasks that do not need the actual values of the result vector. The present invention therefore provides a method for quickly ranking the keywords of a single text that introduces qualitative analysis into the power-method iteration and redefines the convergence criterion of the iterative process, as shown in FIGS. 1 and 2.
Referring to FIG. 1, the flow of the method of the present invention for quickly ranking the keywords of a single text comprises the following steps. S1: select a single text, convert it into a corresponding graph model, and generate a candidate-word adjacency matrix from the graph model. S2: iterate with the power method to generate an approximation of the eigenvector corresponding to the eigenvalue 1 of the candidate-word adjacency matrix. S3: in step S2, perform a qualitative analysis of the vector generated by each power-method iteration to produce a partial ranking vector. S4: set a decision threshold, compute the inversion value between the ranking vectors generated by two adjacent iterations, compare it with the inversion value of the previous iteration, and compare the previous iteration's inversion value with the decision threshold.
In addition, in step S4, if the inversion value is smaller than the inversion value of the previous iteration and both are smaller than the decision threshold, the iteration stops and the ranking vector is output; otherwise, steps S2 to S4 are repeated.
Specifically, in step S1, the single text is first treated as a bag of words, a graph model is then constructed from the bag of words, and finally the graph model is converted to generate the candidate-word adjacency matrix. With reference to FIG. 2, the process may be described as follows:
First, set a single text T with T = {C_1, C_2, ..., C_m}, where C_i is a sentence in T and C_i = {s_i1, s_i2, ...}, with s_ij a candidate word. Let the graph be G = (P, V); for every word s ∈ P there is some C_i with s ∈ C_i. The text is then converted, through word segmentation, term segmentation, and similar processing, into a bag-of-words model, namely a matrix S, and de-duplication, counting, and conditional-probability calculation over S finally generate the adjacency matrix M = (w_ij) ∈ R^{n×n}. Subsequently, a vector sequence {P(t)} is created with P(0) = e^T = (1, 1, ..., 1)^T ∈ R^n, and the iterative computation P(t+1) = M^T × P(t), t = 0, 1, 2, ... is performed. The first m items of P(t) are obtained by simple selection sort to generate Q(t), and it is checked whether K(Q(t), Q(t-1)) ≤ K(Q(t-1), Q(t-2)) < ε holds. If it holds, the iteration stops and Q(t) is output; otherwise, the iterative computation P(t+1) = M^T × P(t), the selection of the first m items of P(t) to generate Q(t), and the check of K(Q(t), Q(t-1)) ≤ K(Q(t-1), Q(t-2)) < ε are repeated.
Each iteration of the single-text keyword ranking method of the invention produces a vector P(t) = (p_i) ∈ R^n. The m elements of largest value are sorted in descending order, giving the vector Q(t) = (q_i), where q_i = f(p_i) is the rank of the i-th element of P(t) in the descending order of Q(t); that is, f: P → R is an injection with p_i > p_j implying f(p_i) > f(p_j). As t increases, the vector sequence P(t), t = 0, 1, 2, ... thus also yields the vector sequence Q(t), t = 0, 1, 2, ....
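The m rounds of simple selection sort that produce Q(t) from P(t) can be sketched as follows; tie-breaking by lowest index is an assumption, since the text does not specify it.

```python
def select_top_m(p, m):
    """m rounds of selection sort over P(t): each round finds the largest
    remaining value, yielding Q(t), the indices of the m largest elements of
    p in descending order of value (about m*n comparisons in total)."""
    remaining = list(range(len(p)))
    q = []
    for _ in range(min(m, len(p))):
        best = remaining[0]
        for i in remaining[1:]:
            if p[i] > p[best]:     # strict comparison: ties keep lowest index
                best = i
        remaining.remove(best)
        q.append(best)
    return q
```

Only m of the n rounds of a full selection sort are run, which is why the extra cost counted in the complexity analysis below is m × n rather than n².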
The concrete calculation of the inversion value and its function K(σ, τ) is described as follows. Assume that the node numbered i in the node graph G has rank σ(i) in the sequence σ and rank τ(i) in the sequence τ. Then there is the function
K(σ, τ) = Σ_{(i,j)} D_{ij}(σ, τ) / MP,
where the sum runs over all node pairs (i, j) present in σ and τ, and D_{ij}(σ, τ) = 1 if (σ(i) − σ(j))(τ(i) − τ(j)) < 0 and 0 otherwise. When the modulus |σ| of σ and τ is odd, MP = |σ|² − 1; when it is even, MP = |σ|². The value of the function lies between 0 and 1.
The complexity of the invention's single-text keyword ranking is analyzed as follows. If the first m keywords of a single text must be ranked so that some of them serve as the document's keywords, each iteration needs m rounds of selection sort, requiring an extra m × n computations, and the qualitative inversion-value function K(Q(t), Q(t-1)) requires an extra m·log(m) computations. Assuming the iteration finishes after t₁ rounds, and counting constant-level data processing, running the algorithm to the end of the iteration requires a total of
t₁[n² + m·n + m·log(m) + const₁] + const₂
computations, where const₁ and const₂ are two constants with different meanings.
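The total count can be evaluated numerically. The base of the logarithm is not specified above, so the natural logarithm used here is an assumption; any fixed base only changes the constant factor.

```python
import math

def total_ops(n, m, t1, const1=0, const2=0):
    """Operation count t1*[n^2 + m*n + m*log(m) + const1] + const2 from the
    complexity analysis (natural log assumed for the m*log(m) term)."""
    return t1 * (n * n + m * n + m * math.log(m) + const1) + const2
```

For instance, with n = 10 candidate words, m = 2 keywords, and t₁ = 1 iteration, the dominant n² term already accounts for 100 of the roughly 121 counted operations.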
The invention relates to a method for quickly ranking the keywords of a single text in which a candidate-word adjacency matrix is first generated from the graph model formed by the text; power-method iteration generates an approximation of the eigenvector corresponding to the eigenvalue 1 of that matrix; each generated eigenvector is analyzed qualitatively to produce a partial ranking vector; and finally the inversion value between the ranking vectors of two adjacent iterations is computed, the inversion values of the two adjacent iterations are compared with each other and with the decision threshold, and if the current inversion value is smaller than the previous one and both are smaller than the decision threshold, the iteration stops and the ranking vector is output. Compared with the prior art, the invention has the following beneficial effects: it has lower time complexity; it greatly reduces the number of iterations relative to the existing power-method iteration based on the node-graph model, reducing the number of operations on the experimental texts by 10 to 25 percent overall; and it guarantees the stability of the keyword sequence without reducing keyword ranking and extraction precision.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing detailed description, or equivalent changes may be made in some of the features of the embodiments described above. All equivalent structures made by using the contents of the specification and the attached drawings of the invention can be directly or indirectly applied to other related technical fields, and are also within the protection scope of the patent of the invention.

Claims (5)

1. A method for quickly ranking the keywords of a single text, the method comprising the steps of:
S1: selecting a single text, converting it into a corresponding graph model, and generating a candidate-word adjacency matrix from the graph model;
S2: iterating with the power method to generate an approximation of the eigenvector corresponding to the eigenvalue 1 of the candidate-word adjacency matrix;
S3: in step S2, performing a qualitative analysis of the vector generated by each power-method iteration to produce a partial ranking vector;
S4: setting a decision threshold, computing the inversion value between the ranking vectors generated by two adjacent iterations, comparing it with the inversion value of the previous iteration, and comparing the previous iteration's inversion value with the decision threshold;
wherein the power-method iteration comprises the steps of: first, setting a vector sequence {P(t)}, taking the unit vector P(0) = e^T = (1, 1, ..., 1)^T ∈ R^n as the initial value, and assigning the sequence by P(t+1) = M^T × P(t), t = 0, 1, 2, ...; then, in each iteration, performing m rounds of simple selection sort over the current elements of P(t) to obtain the indices of the m largest elements, generating the ranking vector sequence Q(t); computing the inversion value K(Q(t), Q(t-1)) of the vector pair (Q(t), Q(t-1)) and comparing it with a given threshold ε; and finally, judging from the comparison K(Q(t), Q(t-1)) ≤ K(Q(t-1), Q(t-2)) < ε whether the iteration ends: if the relation holds, outputting the vector Q(t); otherwise, repeating steps S2 to S4.
2. The method according to claim 1, wherein in step S1 the single text is first treated as a bag of words, the graph model is then constructed from the bag of words, and the graph model is finally converted to generate the candidate-word adjacency matrix corresponding to the single text.
3. The method of claim 2, wherein the generation of the candidate-word adjacency matrix comprises the steps of: first, assuming a single text T with T = {C_1, C_2, ..., C_m}, where C_i is a sentence in T and C_i = {s_i1, s_i2, ...}, s_ij being a candidate word; letting the node graph be G = (P, V), so that for every word s ∈ P there is some C_i with s ∈ C_i; and, after the single text T is segmented into terms, taking the co-occurrence probabilities between the candidate words as the weights of the edges in the node graph G, thereby generating the adjacency matrix of the text graph model, denoted M = (w_ij) ∈ R^{n×n}.
4. The method according to claim 3, wherein in step S4, if the inversion value is smaller than the inversion value of the previous iteration and both are smaller than the decision threshold, the iteration is stopped and the ranking vector is output; otherwise, steps S2 to S4 are repeated.
5. The method according to claim 4, wherein, in the formula K(Q(t), Q(t-1)) for the inversion value, a sequence σ and a sequence τ are set, the node numbered i in the node graph G having rank σ(i) in the sequence σ and rank τ(i) in the sequence τ, thereby obtaining the function
K(σ, τ) = Σ_{(i,j)} D_{ij}(σ, τ) / MP,
where the sum runs over all node pairs (i, j) present in σ and τ, and D_{ij}(σ, τ) = 1 if (σ(i) − σ(j))(τ(i) − τ(j)) < 0 and 0 otherwise; when the modulus |σ| of σ and τ is odd, MP = |σ|² − 1, and when it is even, MP = |σ|²; the value of this function lies in [0, 1].
CN201810491735.7A 2018-05-22 2018-05-22 Method for quickly sequencing keywords of single text Active CN108763206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810491735.7A CN108763206B (en) 2018-05-22 2018-05-22 Method for quickly sequencing keywords of single text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810491735.7A CN108763206B (en) 2018-05-22 2018-05-22 Method for quickly sequencing keywords of single text

Publications (2)

Publication Number Publication Date
CN108763206A CN108763206A (en) 2018-11-06
CN108763206B true CN108763206B (en) 2022-04-05

Family

ID=64007703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810491735.7A Active CN108763206B (en) 2018-05-22 2018-05-22 Method for quickly sequencing keywords of single text

Country Status (1)

Country Link
CN (1) CN108763206B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN101751425A (en) * 2008-12-10 2010-06-23 北京大学 Method for acquiring document set abstracts and device
CN103530402A (en) * 2013-10-23 2014-01-22 北京航空航天大学 Method for identifying microblog key users based on improved Page Rank
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN104537835A (en) * 2015-01-30 2015-04-22 北京航空航天大学 Macroscopic view and microscopic view combined circular transportation reliability simulation method and system
CN104835181A (en) * 2015-05-23 2015-08-12 温州大学 Object tracking method based on ordering fusion learning
CN107464032A (en) * 2017-06-23 2017-12-12 昆明理工大学 A kind of online service measures of reputation method based on KendallTau distances


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cynthia Dwork et al.; "Rank Aggregation Methods for the Web"; WWW10; 2001-05-05; pp. 613-622 *
Liu Linqing et al.; "A TextRank-Based Single-Text Keyword Extraction Algorithm"; Application Research of Computers (《计算机应用研究》); 2017-03-21; Vol. 35, No. 3; pp. 705-710 *

Also Published As

Publication number Publication date
CN108763206A (en) 2018-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant