CN110263343B - Phrase vector-based keyword extraction method and system

Info

Publication number
CN110263343B
CN110263343B
Authority
CN
China
Prior art keywords
candidate
weight
term
encoder
terms
Prior art date
Legal status
Active
Application number
CN201910548261.XA
Other languages
Chinese (zh)
Other versions
CN110263343A (en)
Inventor
孙新
赵永妍
申长虹
杨凯歌
张颖捷
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201910548261.XA priority Critical patent/CN110263343B/en
Publication of CN110263343A publication Critical patent/CN110263343A/en
Application granted granted Critical
Publication of CN110263343B publication Critical patent/CN110263343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/205 — Parsing
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing and deep learning, and in particular to a phrase vector-based keyword extraction method and system. The main technical scheme of the invention comprises the following steps: segmenting the original text into words, labeling parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set; constructing vector representations for the phrases contained in the candidate term set; calculating the topic weight of each candidate term; and constructing a graph with the candidate terms as vertices and their co-occurrence information as edges, calculating edge weights from the semantic similarity between candidate terms together with the co-occurrence information, and iteratively calculating and ranking the score of each candidate term. The keyword extraction method and system provided by the invention introduce the topic information of the document and, through the semantic similarity between phrases, the context information, so that the key phrases of the whole text can be captured; the semantic precision is high and the range of application is wide.

Description

Phrase vector-based keyword extraction method and system
Technical Field
The invention relates to the technical field of natural language processing and deep learning, in particular to a phrase vector-based keyword extraction method and system.
Background
In recent years, massive data has brought great convenience to people, but it also poses great challenges to the analysis and search of data. Against the background of big data, how to quickly obtain the required key information from massive data has become an urgent problem. Keyword extraction refers to automatically extracting, by means of an algorithm, the important words or phrases that carry the topic of a document. In scientific literature, keywords or key phrases can help users quickly learn the content of a paper. They can also be used as index entries in information retrieval, natural language processing and text mining. In the keyword extraction task, word vectors containing word semantics have been applied with good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these nouns are often not single words but phrases, so word vectors alone are not sufficient for the keyword extraction task, and vector representations need to be constructed for phrases.
Researchers have proposed building phrase vectors on the basis of word vectors by combining them with a self-encoder (autoencoder). A self-encoder consists of only two parts, an encoder and a decoder. When a self-encoder combines word vectors into a phrase vector, the encoder takes the representation of each word in the phrase as input and compresses them into an intermediate hidden-layer vector, and the decoder reconstructs the input phrase from this hidden-layer vector, so the intermediate vector can be regarded as a phrase vector representation containing semantic information. However, a conventional self-encoder encodes and decodes directly with a basic fully connected network, in which adjacent layers are fully connected while the nodes within a layer are not connected to one another, so an ordinary self-encoding network cannot process the sequence information present in a structure such as a phrase.
In addition, existing algorithms calculate the semantic similarity of words only through word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to form a graph from the candidate terms of a document, construct edges from the co-occurrence relations of the candidate terms in the document, iteratively compute weights by having the candidate terms vote for one another, and finally rank the candidate terms by score to determine the keywords that are extracted. In conventional TextRank, the initial weight of every vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of every edge is also set to 1, i.e. the votes of each vertex are applied uniformly to every vertex it is connected to. Such an approach, while simple and convenient, ignores the topic of the document and does not take the semantic relations between vertices into account.
In a Recurrent Neural Network (RNN), the nodes of the hidden layer are no longer unconnected: the input of the hidden layer contains not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suitable for encoding sequence data. However, the forgetting of historical information and the accumulation of errors during RNN propagation are important problems, and Long Short-Term Memory (LSTM) networks are now usually used as an improvement.
LSTM is a special type of RNN that uses a cell state to record information; the cell state undergoes only a small number of linear interactions as the sequence is processed and can therefore better retain historical information. LSTM then uses gating mechanisms to protect and control the cell state. A gate is an abstract concept: in a concrete implementation it consists of a sigmoid function and a pointwise multiplication, and it controls the flow of information by outputting a value between 0 and 1, where the closer the output is to 0, the less information is allowed through, and the closer it is to 1, the more information is allowed through.
In an LSTM unit, the information passed from the previous step is processed first: LSTM controls the forgetting and keeping of historical information through a forget gate. The forget gate f_t decides, according to the current information, whether the previous information needs to be forgotten, with the formula:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where σ denotes the sigmoid function, and W_f and b_f denote the weight matrix and the bias of the forget gate, respectively.
LSTM then processes the current input: the input gate i_t controls which part of the current input information is to be retained, and a candidate cell state C̃_t is then created with the tanh function so that the information of the current time step can be added to the cell state:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
By means of the forget gate and the input gate, LSTM can determine which past information needs to be kept and which current information needs to be stored, and thus compute the current cell state C_t:
C_t = f_t * C_{t-1} + i_t * C̃_t
Finally, LSTM uses the sigmoid function in an output gate to determine, according to the historical information and the current input, which information should be output at the current time step; similarly to the input, the cell state is filtered with a tanh function:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Through this gating mechanism, the long short-term memory network can remember earlier information while avoiding the vanishing gradient problem.
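As a concrete illustration of the gate equations above, the following is a minimal NumPy sketch of a single LSTM time step; the function name lstm_step and the weight shapes are illustrative assumptions, not part of the invention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations in the text.
    Each W_* matrix multiplies the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # new hidden state
    return h_t, C_t
```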
Disclosure of Invention
The invention provides a phrase vector-based keyword extraction method and system to solve two problems: word vectors alone cannot meet the requirements of the keyword extraction task, and existing algorithms ignore the topic information of the text.
In order to achieve the above object, in a first aspect, the present invention provides a keyword extraction method based on a phrase vector, where the method includes:
S1, segmenting the text into words, labeling parts of speech, and retaining n-tuples to obtain a candidate term set;
S2, constructing phrase vectors for the candidate terms through a self-encoder;
S3, determining the topic of the text, calculating the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term;
and S4, acquiring keywords from the candidate term set through a TextRank algorithm.
Further, the self-encoder in step S2 comprises an encoder and a decoder; the encoder consists of a bidirectional LSTM layer and a fully connected layer, and the decoder consists of a unidirectional LSTM layer and a softmax layer.
Further, the self-encoder in step S2 includes an encoder and a decoder, and the training method includes the following steps:
S21, selecting a training sample to obtain candidate terms;
S22, for a candidate term c_j = (x_1, x_2, ..., x_T), computing in the encoder with a bidirectional LSTM from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, x_t is the word of the candidate term input at time t, and T denotes the number of words in the candidate term;
S23, in the encoder, obtaining ES_T by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T;
S24, in the decoder, decoding with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where W_s · z_t + b_s scores each possible output word and softmax is the normalization function;
S26, when the loss function L is continuously reduced and finally tends to be stable in the training process, obtaining the parameter W of the encoderh、bh、Wc、bcAnd W in the decoders、bsThereby determining a self-encoder; the calculation formula of the loss function L is as follows:
Figure GDA0002959546450000047
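To make the encoder-decoder structure above concrete, the following is a minimal PyTorch sketch of such a self-encoder; the class name PhraseAutoencoder, the teacher-forcing decoder input, and the layer sizes are illustrative assumptions rather than the patented implementation, and the softmax of step S25 is left to the loss function.

```python
import torch
import torch.nn as nn

class PhraseAutoencoder(nn.Module):
    """BiLSTM encoder + fully connected compression (ES_T) + unidirectional LSTM decoder."""

    def __init__(self, word_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.LSTM(word_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Fully connected layers map the concatenated final states to a fixed size.
        self.fc_h = nn.Linear(2 * hidden_dim, hidden_dim)
        self.fc_c = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores W_s z_t + b_s

    def encode(self, word_vecs):
        # word_vecs: (batch, T, word_dim), the word vectors of one candidate term per row
        _, (h_n, c_n) = self.encoder(word_vecs)
        h_T = torch.cat([h_n[0], h_n[1]], dim=-1)          # h_T = forward ⊕ backward
        c_T = torch.cat([c_n[0], c_n[1]], dim=-1)          # C_T = forward ⊕ backward
        return torch.relu(self.fc_h(h_T)), torch.relu(self.fc_c(c_T))   # ES_T = (h'_T, C'_T)

    def forward(self, word_vecs):
        h_prime, c_prime = self.encode(word_vecs)
        # Decoder initialised with ES_T; here the input words are fed back (teacher forcing).
        init = (h_prime.unsqueeze(0), c_prime.unsqueeze(0))
        z, _ = self.decoder(word_vecs, init)
        return self.out(z)                                  # (batch, T, vocab_size) word scores
```

Training such a model with a cross-entropy loss over the word scores corresponds to minimising the negative log-likelihood L above.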
further, in step S2, the candidate term is input from the encoder, and the ES output from the encoderTIs the phrase vector of the candidate term.
Further, the topic vector v(d_i) in step S3 is calculated as:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the vector representation corresponding to subject term t_i, and v(d_i) is the topic vector representation of text d_i.
Further, in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window, similarity(c_j, c_k) is the similarity between c_j and c_k, and w_jk denotes the weight of the edge between c_j and c_k.
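As a purely illustrative numerical example of these formulas, if the cosine similarity between the vectors of two candidate terms is 0.8 and the two terms co-occur 3 times within the window, the edge weight is w_jk = 0.8 × 3 = 2.4.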
Further, the step of iteratively calculating vertex weights in the TextRank algorithm of step S4 comprises:
iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight score WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient, preferably d = 0.85; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of it, and, likewise, adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of it.
In a second aspect, the invention provides a phrase vector-based keyword extraction system, which comprises a text preprocessing module, a phrase vector construction module, a topic weight calculation module and a candidate word ordering module, wherein the text preprocessing module is used for segmenting the original text into words, labeling parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_i = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term;
and the candidate word ordering module is used for calculating a weight score for the candidate terms and taking the TopK candidate terms as keywords.
Further, the system also comprises a self-encoder training module, which is used for obtaining the parameters of the self-encoder through sample training, thereby determining the self-encoder.
Compared with existing keyword extraction methods and systems, the phrase vector-based keyword extraction method and system provided by the invention have the following beneficial effects:
1. The keyword extraction method and system introduce the topic information of the document and, through the semantic similarity between words, the context information, so that the key phrases of the whole text can be captured and the extracted keywords are more accurate.
2. The keyword extraction method and system acquire keywords using phrase vectors, so the calculation process is simple and efficient.
3. The phrase vector calculation method creatively introduces an LSTM-based self-encoder to compress word vectors, which can better represent the semantic information of phrases, with higher semantic precision and a wider range of application.
4. The invention improves the TextRank algorithm, creatively using phrase vectors to calculate the topic weight of each candidate term and calculating the edge weights from the semantic similarity between candidate terms together with their co-occurrence information, so that both the topic of the whole document and the semantic information between vertices are taken into account, making the ranking algorithm more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of an auto-encoder according to an embodiment of the present invention;
FIG. 2 is a flowchart of a keyword extraction method based on phrase vectors according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is further described with reference to the following figures and detailed description.
In order to make the technical solutions and advantages in the examples of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and not an exhaustive list of all embodiments. It should be noted that, in the present application, the examples may be combined with each other without conflict.
The invention provides a phrase vector-based keyword extraction method, as shown in fig. 2, the method comprises the following steps:
S1, the original text d_i is segmented into words and part-of-speech tagged, and n-tuples are retained according to part of speech to obtain the candidate term set {c_1, c_2, ..., c_m} of d_i.
S2, for each candidate term cj=(x1,x2,…,xT) A phrase vector representation of the candidate term is obtained by the self-encoder. Wherein x isiIs a candidate term cjThe word vector of the ith word in (a) and T represents the number of words in the candidate term.
S3, calculating each candidate term cjAnd topic vector
Figure GDA0002959546450000072
Is taken as the subject weight
Figure GDA0002959546450000073
Wherein d isiThe ith document is represented. The self-encoder comprises an encoder and a decoder, the encoder part is composed of a bidirectional LSTM layerAnd a full connection layer, and the decoding part is composed of a unidirectional LSTM layer and a softmax layer.
And S4, acquiring keywords from the candidate term set through a modified TextRank algorithm.
In step S2, in the encoder, each input candidate term c_j is processed with a bidirectional LSTM from the front and back directions respectively; the hidden-layer state h_T and cell state C_T at the last time step are taken as the final states and concatenated, and the output ES_T of the encoding layer is finally obtained through a fully connected layer.
In the decoder, with ES_T as the initial input, decoding is carried out with a unidirectional LSTM structure; the probability distribution of each decoding step is obtained through a softmax layer, and the probability of decoding the correct word at each step is maximized through a loss function L.
The aim of training is to optimize parameters of the self-encoder, so that the decoder can take the output of the encoder as input, and restore semantic information of candidate terms input by the encoder to the maximum extent.
The specific training method comprises the following steps:
(1) A training sample is selected, and word segmentation and the other operations of S1 are then performed on the sample to obtain a candidate term set.
A candidate term is represented as c_j = (x_1, x_2, ..., x_T), where x_i is the word vector of the i-th word in candidate term c_j and T denotes the number of words in the candidate term. Taking the candidate term "Beijing Institute of Technology" as an example (segmented in Chinese into the three words "Beijing", "Ligong" and "University"), x_1 is the word vector corresponding to "Beijing", x_2 is the word vector corresponding to "Ligong" ("institute of technology"), and x_3 is the word vector corresponding to "University".
(2) The model is trained with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" as an example, the word vector representations corresponding to "Beijing", "Ligong" and "University" are input; the phrase vector representation of "Beijing Institute of Technology" is obtained through encoding; the probability values corresponding to "Beijing", "Ligong" and "University" are obtained in sequence by decoding the phrase vector; and these probability values are maximized through training.
For each candidate term c_j = (x_1, x_2, ..., x_T), the encoder uses a bidirectional LSTM to compute from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, and x_t is the word of the candidate term input at time t. At each time step, the current hidden-layer state h_t and cell state C_t depend on the hidden-layer state h_{t-1} and cell state C_{t-1} of the previous time step and on the current input x_t.
The hidden-layer state h_T and cell state C_T at the last time step are taken as the final states, and the states of the two directions are directly concatenated. In addition, in order to provide a fixed-size input for the decoding layer, the concatenated states are processed through a fully connected layer. The fixed-size input ES_T of the decoder is obtained by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T that is eventually provided to the decoder.
In the decoder, decoding is performed with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1.
The probability of the current word is estimated from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where W_s is a parameter matrix, W_s and b_s denote the weight and bias of the softmax layer respectively, z_t is the hidden-layer state of the decoder at time t, and W_s · z_t + b_s scores each possible output word; normalizing with softmax yields the probability p(y_t) of each word y_t.
The training goal of the self-encoder is to maximize the probability of outputting the correct phrase: the output of the decoder is a probability for each word, and the training objective is to maximize the probability of outputting the correct word, i.e. the parameters of the self-encoder (including the parameters of the LSTMs, W_h, b_h, W_c, b_c in the encoder and W_s, b_s in the decoder) are adjusted on the basis of the loss function L. When the loss function keeps decreasing during training and finally becomes stable, the intermediate vector can be considered to represent the phrase semantics well and can be used as the phrase vector. The loss function L is calculated as follows:
L = - Σ_{t=1}^{T} log p(y_t = x_t)
after the training of the self-encoder is finished, the loss function value of the self-encoder tends to be stable. At this point, the training of the self-encoder is completed, and the candidate terms are input into the encoder of the self-encoder, ESTThe value in (1) is the phrase vector. Through the self-encoder constructed above, word vectors are compressed by using information on the candidate term sequence, and phrase vector representation of the candidate terms is obtained.
After the training of the self-encoder is finished, when the phrase vector representation of the candidate term is required to be obtained, the phrase vector representation ES of the candidate term can be obtained only by utilizing the calculation of the encoding partTObtained ESTCan be divided into oneThe entirety of the candidate term takes into account semantic information of the candidate term.
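As an illustration only, the following sketch shows how such a self-encoder could be trained with a cross-entropy loss and how the encoder alone is then used to obtain phrase vectors; it assumes the hypothetical PhraseAutoencoder class from the earlier sketch, illustrative layer sizes, and that the phrase vector is the concatenation of h'_T and C'_T, none of which is specified by the patent.

```python
import torch
import torch.nn as nn

# Assumes the hypothetical PhraseAutoencoder sketched earlier; sizes are illustrative.
model = PhraseAutoencoder(word_dim=100, hidden_dim=200, vocab_size=50000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of the correct words

def train_step(word_vecs, word_ids):
    # word_vecs: (batch, T, word_dim) word vectors of the candidate terms
    # word_ids:  (batch, T) vocabulary indices of the same words
    scores = model(word_vecs)                                   # (batch, T, vocab_size)
    loss = loss_fn(scores.reshape(-1, scores.size(-1)), word_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def phrase_vector(word_vecs):
    # After training, only the encoder is needed: ES_T = (h'_T, C'_T).
    with torch.no_grad():
        h_prime, c_prime = model.encode(word_vecs)
    return torch.cat([h_prime, c_prime], dim=-1)                # one vector per candidate term
```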
In step S3, the theme weight calculation process is as follows:
(1) Determining the subject term set: a topic sentence or paragraph that is highly representative of the text, such as the title or abstract of a paper, is taken as representative; the subject terms of the text are determined from this topic sentence and added to the subject term set {t_1, t_2, ..., t_n} of text d_i, where d_i denotes the i-th document and n is the number of elements in the subject term set. For example, for the title "Analysis of development-idea examples for the mining design industry under the new situation", the subject term set may be {"mining design", "development idea", "example analysis"}.
(2) Calculating the topic vector: the average of the word or phrase vectors corresponding to all terms in the subject term set is computed and used as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the vector representation corresponding to subject term t_i, and v(d_i) is the topic vector representation of document d_i.
(3) Calculating the topic weight: for each candidate term c_j, the cosine similarity between c_j and the topic vector v(d_i) of document d_i is calculated and taken as its topic weight:
tw(c_j) = cos(v(c_j), v(d_i))
where tw(c_j) is the topic weight of candidate term c_j in document d_i, v(c_j) is the vector representation of candidate term c_j, and cos denotes the cosine similarity.
Through the steps (1) to (3), a topic weight between 0 and 1 can be assigned to each candidate term. It should be noted that a topic weight of 1 indicates that the candidate term is closest to the topic of the text, and a topic weight of 0 indicates that the candidate term is farther from the topic of the text.
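Purely as an illustration of steps (1) to (3), the following sketch computes a topic vector as the mean of the subject terms' phrase vectors and assigns each candidate term a cosine-similarity topic weight; the function names and the assumption that phrase vectors are supplied as a dict are illustrative, not taken from the patent.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def topic_weights(candidate_vecs: dict, subject_terms: list) -> dict:
    """candidate_vecs maps each candidate term to its phrase vector;
    subject_terms lists the terms taken from the title or abstract."""
    # Topic vector: mean of the vectors of the subject terms.
    topic_vec = np.mean([candidate_vecs[t] for t in subject_terms], axis=0)
    # Topic weight of each candidate term: cosine similarity to the topic vector.
    return {c: cosine(v, topic_vec) for c, v in candidate_vecs.items()}
```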
In step S4, an undirected graph is constructed with the candidate term set {c_1, c_2, ..., c_m} of document d_i as its vertices, the weight score WS(c_j) of each candidate term c_j is computed, and the TopK (top K) candidate terms are taken as keywords. This is realized by improving the TextRank algorithm, and the specific process is as follows:
(1) Constructing the undirected graph: an undirected graph is constructed with all elements of the candidate term set of document d_i as vertices. If candidate terms c_j and c_k appear within a co-occurrence window of length n, there is an edge between c_j and c_k.
(2) Computing the edge weights: the weighting of the edges is an improvement of the present invention, and it likewise depends on the phrase vectors constructed by the self-encoder. According to the cosine similarity between the vector representations of two candidate terms c_j and c_k and their number of co-occurrences occur_count(c_j, c_k), a weight w_jk is assigned to each edge in the graph:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, cos denotes the cosine similarity of the vectors, and occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window; multiplying by the number of co-occurrences strengthens the semantic relation between the two terms. w_jk denotes the weight of the edge between c_j and c_k.
(3) Iteratively calculating the vertex weights: the vertex weighting is also an improvement of the present invention. The weight of every vertex in the graph is calculated iteratively until the maximum number of iterations is reached, the weight score WS(c_j) being calculated as follows (a code sketch of the whole ranking procedure is given after step (4)):
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j in document d_i; d is the damping coefficient, which gives each vertex a certain probability of voting for every other vertex so that every vertex obtains a non-zero score and the algorithm is guaranteed to converge after several iterations, and it is usually set to 0.85; tw(c_j) is the topic weight of candidate term c_j in document d_i; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k an element of it, and, likewise, adj(c_k) denotes the set of candidate terms connected to c_k, with c_p an element of it; WS(c_k) denotes the weight of candidate term c_k in document d_i. The second term on the right-hand side of the equation represents the sum of the votes given to c_j by the vertices connected to it.
(4) Candidate term ordering: after several iterations, every vertex in the graph obtains a stable score; the candidate terms in the set are sorted by their weight scores WS(c_j) from largest to smallest, and the top TopK candidate terms are retained as the keywords of the document.
Through the four steps of S1 to S4, the keywords of the document can be extracted.
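The following is a minimal sketch of the ranking described in steps (1) to (4), assuming the phrase vectors and topic weights have already been computed as above; the window size, iteration count and function names are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidates(sequence, vecs, topic_weight, window=3, d=0.85, iters=30, topk=10):
    """sequence: candidate terms in document order; vecs: term -> phrase vector;
    topic_weight: term -> topic weight tw(c)."""
    # (1) + (2) Build the undirected graph: co-occurrence counts within the window,
    # edge weight = cosine similarity x number of co-occurrences.
    cooc = defaultdict(int)
    for i, cj in enumerate(sequence):
        for ck in sequence[i + 1:i + window]:
            if cj != ck:
                cooc[frozenset((cj, ck))] += 1
    w, adj = {}, defaultdict(set)
    for pair, count in cooc.items():
        a, b = tuple(pair)
        w[pair] = cosine(vecs[a], vecs[b]) * count
        adj[a].add(b)
        adj[b].add(a)
    # (3) Iterate the topic-weighted scores WS(c_j).
    score = {c: 1.0 for c in vecs}
    for _ in range(iters):
        new = {}
        for cj in vecs:
            vote = sum(w[frozenset((cj, ck))]
                       / sum(w[frozenset((ck, cp))] for cp in adj[ck])
                       * score[ck]
                       for ck in adj[cj])
            new[cj] = (1 - d) * topic_weight[cj] + d * vote
        score = new
    # (4) Return the TopK candidate terms by score.
    return sorted(score, key=score.get, reverse=True)[:topk]
```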
The invention also provides a phrase vector-based keyword extraction system, which comprises:
the text preprocessing module is used for segmenting the original text, marking the part of speech and reserving n-tuple according to the part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term; the specific calculation method is as described above.
And the candidate word ordering module is used for calculating a weight score for the candidate terms and taking TopK candidate terms as keywords. The specific selection method is as described above.
Further, the system further includes a self-encoder training module, configured to process sequence information in the phrase structure, and obtain phrase vector representations of the candidate terms, where the training method is as described above.
In the following, an example of enterprise paper data in an enterprise paper database is taken to describe a specific keyword extraction method based on phrase vectors.
The enterprise paper database contains enterprise paper data from various fields, with fields such as 'title', 'year', 'abstract', 'keywords', 'English keywords' and 'classification number'. In the keyword extraction process, the 'title' and 'abstract' fields in the database are used as the text content, and the 'keywords' field is used as labeled data to verify the extraction results.
When training the self-encoder, the 'keywords' field in the database is taken as the training data; some of the parameters used in training are shown in Table 1.
TABLE 1: Training parameter settings
Before extracting keywords, the labeled data are analyzed to determine some parameters of the algorithm. There are 59913 papers in the dataset, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e. the number of words contained in each keyword, is counted; the results are shown in Table 2. As can be seen from Table 2, the average length of all keywords is 1.98, and most keywords have a length between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254376 keywords. Therefore, 1-tuples, 2-tuples and 3-tuples in the text are retained when selecting candidate terms.
Then the parts of speech of all words in the keywords are counted; the statistics are shown in Table 3. Part-of-speech tagging is performed with the Jieba segmentation tool, whose part-of-speech tags are described in Table 4. According to Table 3, the part-of-speech distribution of the words in the keywords is not as concentrated as the length distribution, but it is still dominated by nouns, verbs and verbs with a noun function, which together account for 73.1% of all word occurrences. Therefore, nouns, verbs, noun-verbs, and combinations thereof in the text are taken as candidate terms when candidate term selection is performed.
TABLE 2: Keyword length distribution
TABLE 3: Keyword part-of-speech distribution
TABLE 4: Jieba part-of-speech tags
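As an illustration of the candidate-term selection just described, the following sketch keeps 1- to 3-grams whose words are all tagged as nouns, verbs or noun-verbs by the Jieba part-of-speech tagger; the function name and the exact set of tags kept are illustrative assumptions.

```python
import jieba.posseg as pseg

KEEP_FLAGS = {"n", "v", "vn"}   # nouns, verbs, noun-verbs (illustrative subset of Jieba tags)

def candidate_terms(text: str, max_len: int = 3) -> set:
    """Segment the text, POS-tag it with Jieba, and keep 1- to 3-grams
    consisting only of nouns / verbs / noun-verbs as candidate terms."""
    tokens = [(w.word, w.flag) for w in pseg.cut(text)]
    candidates = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[i:i + n]
            if len(gram) == n and all(flag in KEEP_FLAGS for _, flag in gram):
                candidates.add("".join(word for word, _ in gram))
    return candidates
```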
Since the text content comprises only the title and the abstract of each paper, the title is taken as representative of the topic of the full text when calculating the topic weight, and candidate terms extracted from the title are used to compute the topic vector of the text. In addition, the size of the co-occurrence window in the candidate word ranking is initially set to 3, and the number of candidate words finally retained is 10; some results are shown in Table 5.
TABLE 5: Keyword extraction results (partial)
In the following, one paper from the enterprise paper database is taken as an example to describe the specific keyword extraction process.
The data content is: 'Analysis of development-idea examples for the mining design industry under the new situation. The paper reviews the ten-year period of rapid development of the coal industry and its profound influence on the mining design market. Against the background of the current rapid downturn of the coal-industry economy and fierce competition in the coal design market, and taking the development of the mining specialty of a design institute as an example, the characteristics of the human resources and business changes of the mining specialty are analysed, development ideas and implementation measures for the mining specialty are put forward, and a reference is provided for the development of the mining specialty of other design enterprises.'
Here, 'Analysis of development-idea examples for the mining design industry under the new situation' is the title of the paper, and the remaining content is its abstract.
Candidate terms are selected through n-tuples and part-of-speech tagging, and the candidate terms selected from the title of the paper are used as the subject term set of the text; the selected candidate terms are shown in Table 6.
TABLE 6: Candidate term results
The phrase vector representations corresponding to all terms in the subject term set are obtained with the self-encoder, and their average is computed as the topic vector of the text; the topic vector of the document has a size of 400, and some of its values are shown in Table 7.
TABLE 7: Topic weight results (partial)
For each candidate term, the cosine similarity between it and the topic vector of the text is calculated as its topic weight; some of the values are shown in Table 8.
TABLE 8: Topic weight results (partial)
An undirected graph is constructed with the candidate terms as vertices and their co-occurrence information as edges; a weight is assigned to each edge in the graph according to the cosine similarity between the vector representations of the two candidate terms and their number of co-occurrences, and the vertex weights are obtained by iterating several times over the topic weights and the edge weights. After several iterations, every vertex in the graph obtains a stable score; some of the scores are shown in Table 9.
TABLE 9: Weight score results (partial)
The resulting scores are sorted, and the Top10 candidate terms with the highest scores are taken as the final keywords, as shown in Table 10.
TABLE 10: Keyword extraction results (partial)
It should be noted that "first" and "second" are only used herein to distinguish the same-named entities or operations, and do not imply an order or relationship between the entities or operations.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (7)

1. A keyword extraction method based on phrase vectors is characterized by comprising the following steps:
S1, segmenting the text into words, labeling parts of speech, and retaining n-tuples to obtain a candidate term set;
S2, constructing phrase vectors for the candidate terms through a self-encoder;
S3, determining the topic of the text, calculating the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term; wherein the average of the phrase vectors corresponding to all terms in the subject term set {t_1, t_2, ..., t_n} is taken as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the phrase vector representation corresponding to subject term t_i;
S4, obtaining keywords from the candidate term set through a TextRank algorithm;
wherein the TextRank algorithm in step S4 further comprises iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of adj(c_j); adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of adj(c_k); WS(c_k) denotes the weight of candidate term c_k;
the topic weight is calculated as follows: for each candidate term c_j, the cosine similarity between the candidate term and the topic vector v(d_i) of document d_i is computed and taken as the topic weight.
2. The method of claim 1, wherein the self-encoder in step S2 comprises an encoder and a decoder, the encoder being composed of a bidirectional LSTM layer and a fully connected layer, and the decoder being composed of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, wherein the training method of the self-encoder in step S2 comprises the following steps:
S21, selecting a training sample to obtain candidate terms;
S22, for a candidate term c_j = (x_1, x_2, ..., x_T), computing in the encoder with a bidirectional LSTM from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, x_t is the word of the candidate term input at time t, and T denotes the number of words in the candidate term;
S23, in the encoder, obtaining ES_T by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T;
S24, in the decoder, decoding with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where softmax is the normalization function, W_s · z_t + b_s scores each possible output word, and W_s and b_s denote the weight and bias of the softmax layer respectively;
S26, when the loss function L keeps decreasing during training and finally becomes stable, obtaining the encoder parameters W_h, b_h, W_c, b_c and the decoder parameters W_s, b_s, thereby determining the self-encoder; the loss function L is calculated as:
L = - Σ_{t=1}^{T} log p(y_t = x_t)
4. The method of claim 3, wherein in step S2, the candidate term is input into the encoder, and the ES_T output by the encoder is the phrase vector of the candidate term.
5. The method of claim 1, wherein in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window, similarity(c_j, c_k) is the similarity between c_j and c_k, and w_jk denotes the weight of the edge between c_j and c_k.
6. A phrase vector based keyword extraction system, the system comprising:
the text preprocessing module is used for segmenting the original text, marking the part of speech and reserving n-tuple according to the part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term through the topic vector; wherein the average of the phrase vectors corresponding to all terms in the subject term set {t_1, t_2, ..., t_n} is taken as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the phrase vector representation corresponding to subject term t_i;
the candidate word ordering module is used for calculating a weight score for the candidate terms and taking the TopK candidate terms as keywords; the weight score is calculated by iteratively computing the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of adj(c_j); adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of adj(c_k); WS(c_k) denotes the weight of candidate term c_k; the topic weight is calculated as follows: for each candidate term c_j, the cosine similarity between the candidate term and the topic vector v(d_i) of document d_i is computed and taken as the topic weight.
7. The system of claim 6, further comprising a self-encoder training module, which is used for obtaining the parameters of the self-encoder through sample training, thereby determining the self-encoder.
CN201910548261.XA 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system Active CN110263343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Publications (2)

Publication Number Publication Date
CN110263343A CN110263343A (en) 2019-09-20
CN110263343B true CN110263343B (en) 2021-06-15

Family

ID=67920847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910548261.XA Active CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Country Status (1)

Country Link
CN (1) CN110263343B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274428B (en) * 2019-12-19 2023-06-30 北京创鑫旅程网络技术有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111222333A (en) * 2020-04-22 2020-06-02 成都索贝数码科技股份有限公司 Keyword extraction method based on fusion of network high-order structure and topic model
CN111785254B (en) * 2020-07-24 2023-04-07 四川大学华西医院 Self-service BLS training and checking system based on anthropomorphic dummy
CN112818686B (en) * 2021-03-23 2023-10-31 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN113312532B (en) * 2021-06-01 2022-10-21 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019708B2 (en) * 2007-12-05 2011-09-13 Yahoo! Inc. Methods and apparatus for computing graph similarity via signature similarity
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
KR101656245B1 (en) * 2015-09-09 2016-09-09 주식회사 위버플 Method and system for extracting sentences
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101249183B1 (en) * 2006-08-22 2013-04-03 에스케이커뮤니케이션즈 주식회사 Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded
CN106997382B (en) * 2017-03-22 2020-12-01 山东大学 Innovative creative tag automatic labeling method and system based on big data
CN106970910B (en) * 2017-03-31 2020-03-27 北京奇艺世纪科技有限公司 Keyword extraction method and device based on graph model
CN107193803B (en) * 2017-05-26 2020-07-10 北京东方科诺科技发展有限公司 Semantic-based specific task text keyword extraction method
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109918660B (en) * 2019-03-04 2021-03-02 北京邮电大学 Keyword extraction method and device based on TextRank
CN109918510B (en) * 2019-03-26 2022-10-28 中国科学技术大学 Cross-domain keyword extraction method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Bidirectional lstm recurrent neural network for keyphrase extraction;Basaldella Marco 等;《Italian Research Conference on Digital Libraries》;20180131;180-187 *
Research on automatic text summarization technology based on LSTM; Hong Dongmei; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 12); I138-1872 *
Keyword extraction algorithm based on improved TextRank; Zhang Lijing et al.; Journal of Beijing Institute of Graphic Communication; 20160831; Vol. 24, No. 4; 51-55 *
Application of a Chinese extractive summarization method based on deep learning; Qi Yichen et al.; The Guide of Science & Education; 20190515 (No. 14); 69-70 *
A TextRank keyword extraction method fusing multiple features; Li Hang et al.; Journal of Intelligence; 20170831; Vol. 36, No. 8; 183-187 *

Also Published As

Publication number Publication date
CN110263343A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263343B (en) Phrase vector-based keyword extraction method and system
CN110717047B (en) Web service classification method based on graph convolution neural network
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN112579778B (en) Aspect-level emotion classification method based on multi-level feature attention
CN111310471A (en) Travel named entity identification method based on BBLC model
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN113239148B (en) Scientific and technological resource retrieval method based on machine reading understanding
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN111259147B (en) Sentence-level emotion prediction method and system based on self-adaptive attention mechanism
CN113821635A (en) Text abstract generation method and system for financial field
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
Khalid et al. Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method
CN113326374B (en) Short text emotion classification method and system based on feature enhancement
CN111859955A (en) Public opinion data analysis model based on deep learning
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Bhargava et al. Deep paraphrase detection in indian languages
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN113688633A (en) Outline determination method and device
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Ribeiro et al. UA. PT Bioinformatics at ImageCLEF 2019: Lifelog Moment Retrieval based on Image Annotation and Natural Language Processing.
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant