CN110263343B - Phrase vector-based keyword extraction method and system - Google Patents
Phrase vector-based keyword extraction method and system
- Publication number
- CN110263343B (application CN201910548261A)
- Authority
- CN
- China
- Prior art keywords
- candidate
- weight
- term
- encoder
- terms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention relates to the technical field of natural language processing and deep learning, and in particular to a phrase vector-based keyword extraction method and system. The main technical scheme of the invention comprises the following steps: segment the original text into words, tag parts of speech, and retain n-tuples according to part of speech to obtain a candidate term set; construct vector representations for the phrases contained in the candidate term set; calculate the topic weight of each candidate term; and build a graph with the candidate terms as vertices and their co-occurrence relations as edges, compute the edge weights from the semantic similarity and co-occurrence counts of the candidate terms, and iteratively compute and rank the score of each candidate term. By introducing the topic information of the document and the context information carried by the semantic similarity between phrases, the keyword extraction method and system provided by the invention can capture key words across the whole text, with high semantic precision and a wide range of application.
Description
Technical Field
The invention relates to the technical field of natural language processing and deep learning, in particular to a phrase vector-based keyword extraction method and system.
Background
In recent years, massive data has brought great convenience to people, but also great challenges for data analysis and search. In the context of big data, quickly obtaining the required key information from massive data has become an urgent problem. Keyword extraction refers to automatically extracting words or phrases that carry the topic of a document by means of an algorithm. In scientific literature, keywords or key phrases help users quickly grasp the content of a paper; they can also serve as index entries in information retrieval, natural language processing, and text mining. Word vectors that encode word semantics have been applied to the keyword extraction task with good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these are often not single words but phrases, so word vectors alone cannot satisfy the requirements of the keyword extraction task; vector representations must also be constructed for phrases.
Researchers have proposed combining word vectors into phrase vectors using an auto-encoder. An auto-encoder has only two parts, an encoder and a decoder. When an auto-encoder combines word vectors into a phrase vector, the encoder takes the representation of each word in the phrase as input and compresses them into an intermediate hidden-layer vector, and the decoder reconstructs the input phrase from that hidden-layer vector, so the intermediate vector can be regarded as a phrase vector representation containing semantic information. However, the conventional auto-encoder encodes and decodes directly with a basic fully connected network, in which adjacent layers are fully connected but the nodes within a layer are unconnected, so an ordinary auto-encoding network cannot process the sequence information in a structure such as a phrase.
In addition, existing algorithms compute the semantic similarity of words only through word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to form a graph from the candidate terms in a document, build edges from the co-occurrence relations of the candidate terms in the document, compute weights iteratively by voting among the candidate terms, and finally rank the candidate terms by score to determine the extracted keywords. In conventional TextRank, the initial weight of each vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of each edge is also set to 1; that is, each vertex distributes its votes uniformly over the vertices connected to it. This approach, while simple and convenient, ignores the topic of the document and does not take the semantic relations between vertices into account.
In a Recurrent Neural Network (RNN), the nodes between hidden layers are connected rather than unconnected, and the input of a hidden layer contains not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suitable for encoding sequence data. However, during RNN propagation, the forgetting of historical information and the accumulation of errors are important problems, and Long Short-Term Memory (LSTM) networks are now commonly used as an improvement.
LSTM is a special type of RNN that records information in a cell state, which undergoes only a small number of linear interactions as the sequence is transmitted and can therefore retain historical information well. LSTM uses gating mechanisms to protect and control the cell state. A gate is an abstract concept that is implemented concretely as a sigmoid function followed by an elementwise product; it controls information flow by outputting a value between 0 and 1: the closer the output is to 0, the less information passes, and the closer to 1, the more information passes.
In an LSTM unit, the information passed from the previous step is processed first: LSTM controls the forgetting and keeping of historical information through a forget gate. The forget gate f_t decides, according to the current information, whether earlier information should be forgotten:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where σ denotes the sigmoid function, and W_f and b_f denote the weight matrix and bias of the forget gate, respectively.
LSTM then processes the current input. The input gate i_t controls which part of the current input information is retained, and a candidate cell state C̃_t is created with the tanh function so that the information of the current time step can be added to the cell state:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
By means of the forget gate and the input gate, the LSTM can determine which past information to keep and which current information to store, and thereby computes the current cell state C_t:

C_t = f_t * C_{t-1} + i_t * C̃_t
Finally, the LSTM uses the sigmoid function to determine, through an output gate, the information to output at the current time step according to the historical information and the current input; as with the input state, the cell state is filtered with a tanh function:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Through this gating mechanism, the LSTM network can remember earlier information while mitigating the vanishing-gradient problem.
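As an illustration (not part of the patent text), the gate equations above can be sketched in pure Python. The scalar case is shown for readability (the vector case applies the same operations elementwise), and the parameter values are illustrative toys, not trained weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, C_prev, x, W, b):
    """One LSTM time step on scalar states.

    W and b hold the parameters of the four gates; names follow the
    equations above (forget gate f_t, input gate i_t, output gate o_t).
    """
    concat = [h_prev, x]  # [h_{t-1}, x_t]

    def linear(key):
        return sum(w * v for w, v in zip(W[key], concat)) + b[key]

    f_t = sigmoid(linear("f"))            # forget gate
    i_t = sigmoid(linear("i"))            # input gate
    C_tilde = math.tanh(linear("c"))      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # cell state update
    o_t = sigmoid(linear("o"))            # output gate
    h_t = o_t * math.tanh(C_t)            # hidden state
    return h_t, C_t

# Toy parameters (hypothetical values, scalar case for illustration)
W = {"f": [0.5, 0.5], "i": [0.5, 0.5], "c": [1.0, 1.0], "o": [0.5, 0.5]}
b = {"f": 0.0, "i": 0.0, "c": 0.0, "o": 0.0}
h, C = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:  # a short input sequence
    h, C = lstm_step(h, C, x, W, b)
print(-1.0 < h < 1.0)  # True: the hidden state stays bounded by tanh
```

Because the cell state C_t is updated additively (f_t · C_{t-1} + i_t · C̃_t) rather than by repeated multiplication, gradients along the cell state decay far more slowly than in a plain RNN, which is the property the surrounding text relies on.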
Disclosure of Invention
The invention provides a phrase vector-based keyword extraction method and system, aiming to solve two problems: word vectors alone are insufficient for the keyword extraction task, and existing algorithms ignore the topic information of the text.
In order to achieve the above object, in a first aspect, the present invention provides a keyword extraction method based on a phrase vector, where the method includes:
s1, segmenting the text, marking the part of speech, and reserving n-tuple to obtain a candidate term set;
s2, constructing phrase vectors for the candidate terms through a self-encoder;
s3, determining the theme of the text, calculating the similarity between the candidate term and the theme vector, and taking the similarity as the theme weight of the candidate term;
and S4, acquiring keywords from the candidate term set through a TextRank algorithm.
Further, the self-encoder in step S2 includes an encoder and a decoder; the encoder consists of a bidirectional LSTM layer and a fully connected layer, and the decoder consists of a unidirectional LSTM layer and a softmax layer.
Further, the self-encoder in step S2 includes an encoder and a decoder, and the training method includes the following steps:
s21, selecting a training sample to obtain candidate terms;
s22, candidate term cj=(x1,x2,...,xT) In the encoder, bi-directional LSTM is used to perform calculations from both front and back directions:
wherein the content of the first and second substances,andthe hidden layer state and the cell state in the left-to-right and right-to-left directions at time T (T ═ 1, 2.., T), respectively,andhidden layer state and cell state in the left-to-right and right-to-left directions, respectively, at time t-1, xtWords in the candidate terms input at the time t; t represents the number of words in the candidate term;
s23, in the encoder, ES is obtained through formula calculationT:
h′T=f(WhhT+bh)
C′T=f(WcCT+bc)
Wherein the content of the first and second substances,is a connector, Wh、bh、Wc、bcRepresenting parameter matrices and offsets in a fully connected network, f represents the activation function ReLU, ES in a fully connected networkTIs h'TAnd C'TA tuple is formed;
s24, in the decoder part, with ESTDecoding using unidirectional LSTM for initial state:
wherein z istIs the hidden layer state of the decoder at time t, zt-1Hidden layer state at time t-1, ESTIn order to be the state of the encoder,words in the candidate terms output at the time t-1;
s25, according to ztEstimating the probability of the current word:
wherein, Wszt+bsScoring is performed for each possible output word, softmax being a normalization function.
S26, when the loss function L decreases steadily during training and finally stabilizes, the encoder parameters W_h, b_h, W_c, b_c and the decoder parameters W_s, b_s are obtained, thereby determining the self-encoder. The loss function L is the negative log-likelihood of reconstructing the input words:

L = −Σ_{t=1}^{T} log p(y_t = x_t | z_t)
further, in step S2, the candidate term is input from the encoder, and the ES output from the encoderTIs the phrase vector of the candidate term.
Further, in step S3, the topic vector is the average of the vectors of the topic terms:

v_{d_i} = (1/n) · Σ_{i=1}^{n} v_{t_i}

where v_{t_i} is the vector representation corresponding to topic term t_i, and v_{d_i} is the topic vector representation of text d_i.
Further, in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, whose weight is computed as:

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
similarity(c_j, c_k) = cos(v_{c_j}, v_{c_k})

where v_{c_j} and v_{c_k} are the vector representations of candidate terms c_j and c_k respectively, occur_count(c_j, c_k) denotes the number of co-occurrences of c_j and c_k in the co-occurrence window, similarity(c_j, c_k) is the similarity between c_j and c_k, and w_jk represents the weight of the edge between c_j and c_k.
Further, iteratively calculating the vertex weights in the TextRank algorithm of step S4 comprises the following step:

iteratively calculate the weight of each candidate term until the maximum number of iterations is reached, with the weight score WS(c_j) computed as:

WS(c_j) = (1 − d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} ( w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ) · WS(c_k)

where WS(c_j) represents the weight score of candidate term c_j; d is the damping coefficient, preferably d = 0.85; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) is the set of candidate terms connected to c_j, with c_k among its elements, and, likewise, adj(c_k) is the set of candidate terms connected to c_k, with c_p among its elements.
In a second aspect, the invention provides a phrase vector-based keyword extraction system, which comprises: a text preprocessing module for segmenting the original text into words, tagging parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set;
a phrase vector construction module for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through the self-encoder;
a topic weight calculation module for calculating the topic weight of each candidate term;
and a candidate word ordering module for calculating a weight score for each candidate term and taking the top K candidate terms as keywords.
Further, the system further comprises a self-encoder training module for obtaining the self-encoder's parameters through sample training, thereby determining the self-encoder.
Compared with the existing keyword extraction method and system, the phrase vector-based keyword extraction method and system provided by the invention have the following beneficial effects:
1. according to the keyword extraction method and system provided by the invention, the topic information in the document is introduced, the context information is introduced through the semantic similarity between words, the key words in the whole text can be captured, and the extracted keywords are more accurate.
2. According to the keyword extraction method and system provided by the invention, the phrase vectors are used for acquiring the keywords, so that the calculation process is simple and efficient.
3. The phrase vector calculation method provided by the invention creatively introduces the self-encoder based on the LSTM to compress the word vector, can better represent the semantic information of the phrase, and has higher semantic precision and wider application range.
4. The invention improves the TextRank algorithm, creatively utilizes the phrase vectors to calculate the theme weight of each candidate term, and calculates the side weight by the semantic similarity between the candidate terms and the co-occurrence information, thereby not only considering the theme of the whole document, but also introducing the semantic information between the vertexes, and leading the accuracy of the ranking algorithm to be higher.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of an auto-encoder according to an embodiment of the present invention;
FIG. 2 is a flowchart of a keyword extraction method based on phrase vectors according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is further described with reference to the following figures and detailed description.
In order to make the technical solutions and advantages in the examples of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and not an exhaustive list of all embodiments. It should be noted that, in the present application, the examples may be combined with each other without conflict.
The invention provides a phrase vector-based keyword extraction method, as shown in fig. 2, the method comprises the following steps:
s1, for original text diPerforming word segmentation and part-of-speech tagging, and reserving n-tuple according to the part-of-speech to obtain a candidate term set
S2, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector representation of the candidate term is obtained through the self-encoder, where x_i is the word vector of the i-th word in candidate term c_j and T represents the number of words in the candidate term.
S3, for each candidate term c_j, the cosine similarity between its phrase vector and the topic vector v_{d_i} is calculated and taken as the topic weight tw(c_j), where d_i denotes the i-th document. The self-encoder comprises an encoder and a decoder; the encoder part consists of a bidirectional LSTM layer and a fully connected layer, and the decoder part consists of a unidirectional LSTM layer and a softmax layer.
And S4, acquiring keywords from the candidate term set through a modified TextRank algorithm.
In step S2, the encoder processes each input candidate term c_j with a bidirectional LSTM from the forward and backward directions, takes the hidden-layer state h_T and cell state C_T at the last time step as the final states, concatenates them, and finally obtains the output ES_T of the encoding layer through a fully connected layer.
In the decoder, with ES_T as the initial input, decoding is carried out with a unidirectional LSTM structure; the probability distribution of each decoding step is obtained through a softmax layer, and the probability of decoding the correct word at each step is maximized through the loss function L.
The aim of training is to optimize parameters of the self-encoder, so that the decoder can take the output of the encoder as input, and restore semantic information of candidate terms input by the encoder to the maximum extent.
The specific training method comprises the following steps:
(1) Select training samples, then perform word segmentation and the other operations of S1 on the samples to obtain a candidate term set.
A candidate term is represented as c_j = (x_1, x_2, ..., x_T), where x_i is the word vector of the i-th word in candidate term c_j and T represents the number of words in the candidate term. Taking the candidate term "Beijing Institute of Technology" (北京理工大学) as an example, it is segmented into three words: x_1 is the word vector corresponding to "北京" (Beijing), x_2 is the word vector corresponding to "理工" (institute of technology), and x_3 is the word vector corresponding to "大学" (university).
(2) Train the model with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" again as an example, the word vectors corresponding to "北京", "理工", and "大学" are input; encoding yields the phrase vector representation of "Beijing Institute of Technology", decoding that phrase vector yields the probability values corresponding to "北京", "理工", and "大学" in sequence, and training maximizes these probability values.
For each candidate term c_j = (x_1, x_2, ..., x_T), in the encoder part, the encoder uses a bidirectional LSTM to perform calculations from the forward and backward directions:

(h_t^fw, C_t^fw) = LSTM(h_{t-1}^fw, C_{t-1}^fw, x_t)
(h_t^bw, C_t^bw) = LSTM(h_{t+1}^bw, C_{t+1}^bw, x_t)

where h_t^fw and C_t^fw are the hidden-layer state and cell state in the left-to-right direction at time t (t = 1, 2, ..., T), h_t^bw and C_t^bw are those in the right-to-left direction, and x_t is the word of the candidate term input at time t. At each time step, the current hidden-layer state h_t and cell state C_t depend on the hidden-layer state h_{t-1} and cell state C_{t-1} at the previous time step and on the current input x_t.
The hidden-layer state h_T and cell state C_T at the last time step are taken as the final states, and the states of the two directions are concatenated directly. In addition, to provide a fixed-size input for the decoding layer, the concatenated state is processed through a fully connected layer. The following formulas yield a fixed-size input ES_T for the decoder:

h_T = [h_T^fw ; h_T^bw]
C_T = [C_T^fw ; C_T^bw]
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)

where [· ; ·] denotes concatenation, W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully connected network, f represents the ReLU activation function, and ES_T is the tuple formed by h'_T and C'_T that is finally provided to the decoder.
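As an illustration (not part of the patent text), the concatenate-then-project step can be sketched in pure Python. The state values and the 2×4 parameter matrix below are hypothetical; the point is that the fully connected layer maps the variable concatenated state to a fixed size for the decoder:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, W, b):
    """Fully connected layer with ReLU: f(W v + b), as in h'_T and C'_T."""
    return relu([sum(w * x for w, x in zip(row, v)) + bi
                 for row, bi in zip(W, b)])

# Hypothetical final hidden states of the forward and backward LSTM passes
h_fwd, h_bwd = [0.2, -0.1], [0.4, 0.3]
h_T = h_fwd + h_bwd          # concatenation of the two directions

# A 2x4 parameter matrix compresses the concatenated state to size 2
W_h = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
b_h = [0.0, 0.0]
h_prime = dense(h_T, W_h, b_h)
print(len(h_prime))  # 2: fixed-size output regardless of the LSTM state size
```

The same projection is applied to the concatenated cell states to obtain C'_T; together (h'_T, C'_T) form the fixed-size tuple ES_T described above.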
In the decoder part, decoding is performed with a unidirectional LSTM, with ES_T as the initial state:

z_t = LSTM(z_{t-1}, y_{t-1})

where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1.
The probability of the current word is estimated according to z_t:

p(y_t) = softmax(W_s · z_t + b_s)

where W_s is a parameter matrix; W_s and b_s represent the weight and bias of the softmax layer, z_t is the hidden-layer state of the decoder at time t, W_s · z_t + b_s scores each possible output word, and softmax normalization yields the probability of each word.
The training goal of the auto-encoder is to maximize the probability of outputting the correct phrase. The output of the decoder is a probability for each word, and the training objective is to maximize the probability of outputting the correct word at each step, i.e., to adjust the parameters of the self-encoder (including the parameters of the LSTMs, W_h, b_h, W_c, b_c in the encoder, and W_s, b_s in the decoder) on the basis of the loss function L. When the loss function decreases steadily during training and finally stabilizes, the intermediate vector can be taken to represent the phrase semantics well and can serve as the phrase vector. The loss function L is the negative log-likelihood of reconstructing the input words:

L = −Σ_{t=1}^{T} log p(y_t = x_t | z_t)
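As an illustration (not part of the patent text), the softmax normalization and the negative log-likelihood loss above can be computed in pure Python. The decoder scores below are hypothetical toy values over a four-word vocabulary; the sketch only shows that scoring the correct word highly makes L small:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def reconstruction_loss(score_seq, target_ids):
    """L = -sum_t log p(x_t | z_t), with scores = W_s z_t + b_s per step."""
    loss = 0.0
    for scores, target in zip(score_seq, target_ids):
        p = softmax(scores)[target]
        loss -= math.log(p)
    return loss

# Hypothetical decoder scores at 3 time steps over a 4-word vocabulary
good = [[5.0, 0.0, 0.0, 0.0],   # scores the correct word (id 0) highly
        [0.0, 5.0, 0.0, 0.0],
        [0.0, 0.0, 5.0, 0.0]]
bad = [[0.0, 0.0, 0.0, 5.0]] * 3  # always favours the wrong word
targets = [0, 1, 2]
print(reconstruction_loss(good, targets) < reconstruction_loss(bad, targets))  # True
```

Gradient descent on L therefore pushes the decoder scores toward the pattern of `good`, which is exactly the "maximize the probability of the correct word" objective stated above.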
after the training of the self-encoder is finished, the loss function value of the self-encoder tends to be stable. At this point, the training of the self-encoder is completed, and the candidate terms are input into the encoder of the self-encoder, ESTThe value in (1) is the phrase vector. Through the self-encoder constructed above, word vectors are compressed by using information on the candidate term sequence, and phrase vector representation of the candidate terms is obtained.
After the training of the self-encoder is finished, when the phrase vector representation of the candidate term is required to be obtained, the phrase vector representation ES of the candidate term can be obtained only by utilizing the calculation of the encoding partTObtained ESTCan be divided into oneThe entirety of the candidate term takes into account semantic information of the candidate term.
In step S3, the topic weight is calculated as follows:
(1) Determine the topic term set: a highly general topic sentence or paragraph of the text, such as the title or abstract of a paper, is taken as representative; the topic terms of the text are determined from the topic sentence and added to the topic term set of the text T_{d_i} = {t_1, t_2, ..., t_n}, where d_i represents the i-th document and n is the number of elements in the topic term set. For example, for "Analysis of development idea examples for the mining design industry under the new situation", the topic term set may be {"mining design", "development idea", "example analysis"}.
(2) Calculate the topic vector: the average of the word or phrase vectors corresponding to all terms in the topic term set T_{d_i} is computed and used as the topic vector of the document, representing the topic of the entire document:

v_{d_i} = (1/n) · Σ_{i=1}^{n} v_{t_i}

where v_{t_i} is the vector representation corresponding to topic term t_i, and v_{d_i} is the topic vector representation of document d_i.
(3) Calculate the topic weight: for each candidate term c_j, the cosine distance between its vector and the topic vector v_{d_i} of document d_i is computed as its topic weight:

tw(c_j) = cos(v_{c_j}, v_{d_i})

where tw(c_j) is the topic weight of candidate term c_j in document d_i, v_{c_j} is the vector representation of candidate term c_j, and cos denotes the cosine distance.
Through steps (1) to (3), each candidate term is assigned a topic weight between 0 and 1. It should be noted that a topic weight close to 1 indicates that the candidate term is close to the topic of the text, while a topic weight close to 0 indicates that it is far from the topic.
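As an illustration (not part of the patent text), steps (2) and (3) can be sketched in pure Python. The 3-dimensional phrase vectors below are hypothetical stand-ins for the self-encoder's output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def topic_vector(term_vectors):
    """Average of the topic-term vectors (step 2 above)."""
    n = len(term_vectors)
    return [sum(col) / n for col in zip(*term_vectors)]

def topic_weight(candidate_vec, doc_topic_vec):
    """Cosine similarity between a candidate term and the document topic (step 3)."""
    return cosine(candidate_vec, doc_topic_vec)

# Hypothetical 3-d phrase vectors for two topic terms
topic_terms = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]]
v_topic = topic_vector(topic_terms)          # [0.9, 0.1, 0.1]
on_topic = topic_weight([0.9, 0.1, 0.1], v_topic)
off_topic = topic_weight([0.0, 1.0, 0.0], v_topic)
print(on_topic > off_topic)  # True: on-topic terms get larger weights
```

With unit-free cosine similarity, the weight is automatically bounded as the text above describes, provided the phrase vectors point in non-opposing directions.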
In step S4, an undirected graph is constructed with the candidate term set of document d_i as vertices, the weight score WS(c_j) of each candidate term c_j is computed, and the top K candidate terms are taken as keywords. This is realized by improving the TextRank algorithm, with the following specific process:
(1) Construct the undirected graph: all elements of the candidate term set of document d_i form the vertices of an undirected graph. If candidate terms c_j and c_k appear within a co-occurrence window of length n, there is an edge between c_j and c_k.
(2) Compute the edge weights: weighting the edges is an improvement of the present invention, and also depends on the phrase vectors constructed by the self-encoder. A weight w_jk is assigned to each edge in the graph according to the cosine distance between the vector representations of the two candidate terms c_j and c_k and their number of co-occurrences occur_count(c_j, c_k):

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
similarity(c_j, c_k) = cos(v_{c_j}, v_{c_k})

where v_{c_j} and v_{c_k} are the vector representations of candidate terms c_j and c_k respectively, cos denotes the cosine distance of the vectors, and occur_count(c_j, c_k) denotes the number of co-occurrences of c_j and c_k in the co-occurrence window; multiplying by the co-occurrence count strengthens the semantic relation between two frequently co-occurring terms, and w_jk represents the weight of the edge between c_j and c_k.
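As an illustration (not part of the patent text), the edge-weight formula can be sketched in pure Python. The phrase vectors and the toy term sequence below are hypothetical:

```python
from collections import Counter
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cooccurrence_counts(term_seq, window=3):
    """Count co-occurrences of candidate terms within a sliding window."""
    counts = Counter()
    for i, cj in enumerate(term_seq):
        for ck in term_seq[i + 1: i + window]:
            if cj != ck:
                counts[frozenset((cj, ck))] += 1
    return counts

def edge_weight(cj, ck, vectors, counts):
    """w_jk = similarity(c_j, c_k) x occur_count(c_j, c_k)."""
    return cosine(vectors[cj], vectors[ck]) * counts[frozenset((cj, ck))]

# Hypothetical 2-d phrase vectors and a toy candidate-term sequence
vectors = {"mining design": [1.0, 0.1],
           "development idea": [0.9, 0.3],
           "reference": [0.0, 1.0]}
seq = ["mining design", "development idea", "mining design", "development idea"]
counts = cooccurrence_counts(seq, window=3)
w = edge_weight("mining design", "development idea", vectors, counts)
print(w > 0)  # True: semantically similar, frequently co-occurring terms
```

Terms that are both semantically close (high cosine) and frequently co-occurring (high count) thus receive the heaviest edges, which is the behaviour the formula above is designed to produce.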
(3) Iteratively compute the vertex weights: the vertex weighting is also an improvement of the present invention. The weight of each vertex in the graph is computed iteratively until the maximum number of iterations is reached, with the weight score WS(c_j) computed as:

WS(c_j) = (1 − d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} ( w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ) · WS(c_k)

where WS(c_j) represents the weight score of candidate term c_j in document d_i; d is the damping coefficient, which gives each vertex a certain probability of voting for every other vertex so that no vertex's score is zero and the algorithm is guaranteed to converge after multiple iterations, and is usually set to 0.85; tw(c_j) is the topic weight of candidate term c_j in document d_i; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) is the set of candidate terms connected to c_j, with c_k among its elements, and, similarly, adj(c_k) is the set of candidate terms connected to c_k, with c_p among its elements; WS(c_k) represents the weight score of candidate term c_k. The second term on the right-hand side represents the sum of the votes that the vertices connected to c_j give to c_j.
(4) Rank the candidate terms: after multiple iterations, each vertex in the graph obtains a stable score. The candidate terms are sorted by weight score WS in descending order, and the top K candidate terms are retained as the keywords of the document.
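As an illustration (not part of the patent text), the topic-weighted iteration of steps (3) and (4) can be sketched in pure Python. The graph, edge weights, and topic weights below are hypothetical toys:

```python
def textrank(adj, topic_w, d=0.85, iters=50):
    """Topic-weighted TextRank over a weighted undirected graph.

    adj[j][k] is the edge weight w_jk; topic_w[j] is the topic weight of
    candidate term j; the update follows the scoring formula above.
    """
    scores = {j: 1.0 for j in adj}
    for _ in range(iters):
        new = {}
        for j in adj:
            vote = 0.0
            for k, w_jk in adj[j].items():
                out_k = sum(adj[k].values())  # sum of w_kp over k's neighbours
                vote += w_jk / out_k * scores[k]
            new[j] = (1 - d) * topic_w[j] + d * vote
        scores = new
    return scores

# Toy graph: "a" is on-topic and well connected (hypothetical weights)
adj = {"a": {"b": 2.0, "c": 1.0},
       "b": {"a": 2.0},
       "c": {"a": 1.0}}
topic_w = {"a": 0.9, "b": 0.5, "c": 0.2}
scores = textrank(adj, topic_w)
top = max(scores, key=scores.get)
print(top)  # a
```

Sorting `scores` in descending order and keeping the first K entries then yields the extracted keywords, mirroring step (4) above.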
Through the four steps of S1 to S4, the keywords of the document can be extracted.
The invention also provides a phrase vector-based keyword extraction system, which comprises:
a text preprocessing module for segmenting the original text into words, tagging parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set;
a phrase vector construction module for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through the self-encoder;
a topic weight calculation module for calculating the topic weight of each candidate term, with the specific calculation method as described above;
and a candidate word ordering module for calculating a weight score for each candidate term and taking the top K candidate terms as keywords, with the specific selection method as described above.
Further, the system further includes a self-encoder training module for processing the sequence information in the phrase structure and obtaining the phrase vector representations of the candidate terms, with the training method as described above.
In the following, enterprise paper data from an enterprise paper database is taken as an example to describe the phrase vector-based keyword extraction method in detail.
The enterprise paper database stores enterprise paper data in various fields, with record fields such as "title", "year", "abstract", "keyword", "English keyword", and "classification number". In the keyword extraction process, the "title" and "abstract" in the database are used as the text content, and the "keyword" field is used as labeled data to verify the extraction results.
When training the self-encoder, the "keyword" field in the database is taken as training data; some parameters of the training process are shown in Table 1.
TABLE 1 training parameter settings
Before extracting keywords, the labeled data is analyzed to determine some parameters of the algorithm. There are 59913 papers in the dataset, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e., the number of words contained in each keyword, is counted; the results are shown in Table 2. As can be seen from Table 2, the average length of all keywords is 1.98, and most keywords have a length between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254376 keywords. Therefore, the 1-tuples, 2-tuples, and 3-tuples in the text are retained when selecting candidate terms.
Then, the parts of speech of all words in the keyword are counted, and the statistical result is shown in table 3. Part-of-speech tagging is accomplished using a Jieba segmentation tool, with part-of-speech descriptions as shown in table 4. According to table 3, the distribution of parts of speech of words in the keyword has no concentration of length distribution, but is also mainly concentrated on nouns, verbs, and verbs having a noun function, which occupy 73.1% of the whole word parts of speech. Therefore, nouns, verbs, noun verbs, and combinations thereof in the text are taken as candidate terms when candidate term selection is performed.
TABLE 2 keyword length distribution
TABLE 3 word part-of-speech distribution
TABLE 4 Jieba part-of-speech tags
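Based on the statistics above, candidate-term selection keeps 1- to 3-tuples whose words are all nouns, verbs, or noun-verbs. A minimal sketch of this filtering step follows; the `(word, tag)` pairs and the example sentence are hypothetical stand-ins for the output of a POS tagger such as Jieba's `posseg`, and only the kept tag set (`n`, `v`, `vn`) comes from the text above:

```python
# Sketch of the candidate-term selection step (S1): keep 1- to 3-tuples
# whose words are all nouns ('n'), verbs ('v'), or noun-verbs ('vn').
KEPT_TAGS = {"n", "v", "vn"}

def candidate_terms(tagged_words, max_n=3):
    """Return all n-tuples (n = 1..max_n) whose words all carry kept tags."""
    words = [w for w, _ in tagged_words]
    tags = [t for _, t in tagged_words]
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            if all(t in KEPT_TAGS for t in tags[i:i + n]):
                candidates.append(" ".join(words[i:i + n]))
    return candidates

# Hypothetical tagged sentence: every word here carries a kept tag,
# so all 1-, 2-, and 3-grams survive the filter.
tagged = [("keyword", "n"), ("extraction", "vn"),
          ("improves", "v"), ("retrieval", "n")]
print(candidate_terms(tagged))
```

Any n-gram containing a word with a dropped tag (e.g. an adverb) would be excluded, which is how the 73.1% part-of-speech observation is put to use.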
Since the text content only comprises the title and the abstract of each paper, the title is taken as representative of the full-text topic when calculating the topic weight, and candidate terms are extracted from the title to calculate the topic vector of the text. In addition, the size of the co-occurrence window in candidate word ranking is initially set to 3, and the number of finally retained candidate words is 10; the results are shown in Table 5.
TABLE 5 keyword extraction results (partial)
Preferably, a single paper from the enterprise paper database is taken as an example below to illustrate the specific keyword extraction process.
The data content is as follows: "Analysis of development ideas for the mining design industry under the new situation: this paper reviews the ten-year period of high-speed development of the coal industry and its profound influence on the mining design market. Against the background of the current rapid downturn of the coal industry economy and fierce competition in the coal design market, and taking the development of the mining specialty of the design institute of natural and terrestrial science as an example, the characteristics of the human resources and business changes of the mining specialty are analyzed, development ideas and implementation measures for the mining specialty are proposed, and a reference is provided for the development of the mining specialty of other design enterprises."
Wherein "Analysis of development ideas for the mining design industry under the new situation" is the title of the paper, and the remaining content is its abstract.
Candidate terms are selected through n-tuples and part-of-speech tagging, and the candidate terms selected from the title of the paper are used as the topic term set of the text; the selected candidate terms are shown in Table 6.
TABLE 6 candidate term results
Phrase vector representations corresponding to all terms in the topic term set are obtained using the self-encoder, and the average of these phrase vectors is calculated as the topic vector of the text. The dimension of the document's topic vector is 400; partial values are shown in Table 7.
TABLE 7 topic vector results (partial)
For each candidate term, the cosine distance between the candidate term and the topic vector of the text is calculated as its topic weight; partial values are shown in Table 8.
TABLE 8 topic weight results (partial)
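The topic-vector and topic-weight computations described above reduce to a mean and a cosine similarity. Below is a minimal numpy sketch; the toy 4-dimensional vectors are illustrative only (the patent's self-encoder produces 400-dimensional phrase vectors), and the function names are assumptions of this sketch:

```python
import numpy as np

def topic_vector(phrase_vectors):
    """Topic vector of a document: the mean of the phrase vectors of the
    terms in its topic term set (here, terms drawn from the title)."""
    return np.mean(phrase_vectors, axis=0)

def topic_weight(candidate_vec, topic_vec):
    """Topic weight of a candidate term: cosine similarity between its
    phrase vector and the document's topic vector."""
    return float(np.dot(candidate_vec, topic_vec) /
                 (np.linalg.norm(candidate_vec) * np.linalg.norm(topic_vec)))

# Toy 4-dimensional phrase vectors for two title terms.
title_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0]])
tv = topic_vector(title_vecs)          # -> [0.5, 0.5, 0.0, 0.0]
w = topic_weight(np.array([1.0, 1.0, 0.0, 0.0]), tv)
print(round(w, 4))                     # close to 1.0 for an on-topic term
```

A candidate term whose phrase vector points away from the title terms would receive a topic weight near 0, giving it a weak prior in the ranking step that follows.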
Taking the candidate terms as vertices and the co-occurrence information of the candidate terms as edges, an undirected graph is constructed. Each edge in the graph is assigned a weight according to the cosine distance between the vector representations of the two candidate terms and their number of co-occurrences, and the vertex weights are computed by iterating over the topic weights and edge weights. After several iterations, each vertex in the graph obtains a stable score; some scores are shown in Table 9.
TABLE 9 weight score results (partial)
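The iterative scoring described above can be sketched as a topic-biased TextRank: each vertex's score combines its topic weight (weighted by 1 − d) with damped, edge-weight-normalized contributions from its neighbors. The 3-term graph, the edge weights, and d = 0.85 below are hypothetical values for illustration:

```python
# Topic-biased TextRank over an undirected co-occurrence graph.
# Edge weights stand for similarity x co-occurrence count.
def textrank(edge_w, topic_w, d=0.85, iters=50):
    """edge_w: dict {(i, j): weight} for undirected edges;
    topic_w: list of topic weights, one per vertex."""
    n = len(topic_w)
    adj = {i: {} for i in range(n)}
    for (i, j), w in edge_w.items():
        adj[i][j] = w
        adj[j][i] = w
    scores = list(topic_w)
    for _ in range(iters):
        new = []
        for j in range(n):
            # Each neighbor k passes on its score, scaled by the share
            # of k's total edge weight that the edge (j, k) represents.
            s = sum(w / sum(adj[k].values()) * scores[k]
                    for k, w in adj[j].items())
            new.append((1 - d) * topic_w[j] + d * s)
        scores = new
    return scores

edges = {(0, 1): 2.0, (1, 2): 1.0}
topic = [0.9, 0.6, 0.3]
scores = textrank(edges, topic)
ranked = sorted(range(3), key=lambda j: -scores[j])
print(ranked)   # -> [1, 0, 2]: the best-connected term outranks the rest
```

Note how vertex 1, though not the most on-topic, ends up first because both of its edges feed score into it on every iteration; this is the "stable score after several iterations" behavior the text describes.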
The obtained scores are sorted, and the Top-10 highest-scoring candidate terms are taken as the final keywords, as shown in Table 10.
TABLE 10 keyword extraction results (partial)
It should be noted that "first" and "second" are used herein only to distinguish entities or operations with the same name, and do not imply any order of or relationship between those entities or operations.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (7)
1. A keyword extraction method based on phrase vectors is characterized by comprising the following steps:
S1, performing word segmentation and part-of-speech tagging on the text, and retaining n-tuples to obtain a candidate term set;
S2, constructing a phrase vector for each candidate term through a self-encoder;
S3, determining the topic of the text, calculating the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term; wherein, given the topic term set T = {t_1, t_2, …, t_n}, the average of the phrase vectors corresponding to all terms in the set is taken as the topic vector v_d of the document, representing the topic of the entire document:
v_d = (1/n) · Σ_{i=1}^{n} v(t_i)
wherein v(t_i) is the phrase vector representation corresponding to the topic term t_i;
S4, obtaining keywords from the candidate term set through the TextRank algorithm;
wherein the TextRank algorithm in step S4 further comprises iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 − d) · tw(c_j) + d · Σ_{c_k ∈ Adj(c_j)} ( w_{jk} / Σ_{c_p ∈ Adj(c_k)} w_{kp} ) · WS(c_k)
wherein WS(c_j) represents the weight of candidate term c_j and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_{jk} is the weight of the edge between candidate terms c_j and c_k, and w_{kp} is the weight of the edge between candidate terms c_k and c_p; Adj(c_j) denotes the set of candidate terms connected to c_j, with c_k an element of Adj(c_j); Adj(c_k) denotes the set of candidate terms connected to c_k, with c_p an element of Adj(c_k); and WS(c_k) represents the weight of candidate term c_k;
2. The method of claim 1, wherein the self-encoder in step S2 comprises an encoder and a decoder, the encoder being composed of a bidirectional LSTM layer and a fully connected layer, and the decoder being composed of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, wherein the training method of the self-encoder in step S2 comprises the following steps:
S21, selecting training samples to obtain candidate terms;
S22, a candidate term c_j = (x_1, x_2, …, x_T) is input into the encoder, where a bidirectional LSTM computes in both the forward (left-to-right) and backward (right-to-left) directions:
(\overrightarrow{h}_t, \overrightarrow{C}_t) = LSTM(x_t, \overrightarrow{h}_{t−1}, \overrightarrow{C}_{t−1})
(\overleftarrow{h}_t, \overleftarrow{C}_t) = LSTM(x_t, \overleftarrow{h}_{t+1}, \overleftarrow{C}_{t+1})
wherein \overrightarrow{h}_t, \overrightarrow{C}_t and \overleftarrow{h}_t, \overleftarrow{C}_t are the hidden states and cell states in the left-to-right and right-to-left directions at time t (t = 1, 2, …, T), respectively; \overrightarrow{h}_{t−1}, \overrightarrow{C}_{t−1} and \overleftarrow{h}_{t+1}, \overleftarrow{C}_{t+1} are the hidden states and cell states at the previous time step in the corresponding direction; x_t is the word of the candidate term input at time t, and T is the number of words in the candidate term;
S23, in the encoder, ES_T is obtained through the following formulas:
h_T = [\overrightarrow{h}_T ; \overleftarrow{h}_T],  C_T = [\overrightarrow{C}_T ; \overleftarrow{C}_T]
h'_T = f(W_h h_T + b_h)
C'_T = f(W_c C_T + b_c)
wherein [ ; ] is the concatenation operator; W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully connected network; f represents the ReLU activation function of the fully connected network; and ES_T is the tuple formed by h'_T and C'_T, i.e., ES_T = (h'_T, C'_T);
S24, in the decoder, with ES_T as the initial state, decoding is performed using a unidirectional LSTM:
z_t = LSTM(y_{t−1}, z_{t−1})
wherein z_t is the hidden state of the decoder at time t, z_{t−1} is the hidden state at time t−1, ES_T is the encoder state, and y_{t−1} is the word of the candidate term output at time t−1;
S25, the word output at time t is obtained through the softmax layer:
y_t = softmax(W_s z_t + b_s)
wherein softmax is the normalization function, W_s z_t + b_s scores each possible output word, and W_s and b_s represent the weight and bias of the softmax function, respectively;
S26, when the loss function L continuously decreases during training and finally becomes stable, the parameters W_h, b_h, W_c, b_c of the encoder and the parameters W_s, b_s of the decoder are obtained, thereby determining the self-encoder; the loss function L is the negative log-likelihood of reconstructing the input candidate term:
L = − Σ_{t=1}^{T} log p(y_t = x_t | z_t)
4. The method of claim 3, wherein in step S2, the candidate term is input into the encoder, and the ES_T output by the encoder is taken as the phrase vector of the candidate term.
5. The method of claim 1, wherein in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:
w_{jk} = similarity(c_j, c_k) × occur_count(c_j, c_k)
wherein v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k, respectively; occur_count(c_j, c_k) denotes the number of co-occurrences of c_j and c_k within the co-occurrence window; similarity(c_j, c_k) is the similarity between c_j and c_k; and w_{jk} represents the weight of the edge between c_j and c_k.
6. A phrase vector based keyword extraction system, the system comprising:
the text preprocessing module is used for performing word segmentation and part-of-speech tagging on the original text, and retaining n-tuples according to the parts of speech to obtain a candidate term set;
the phrase vector construction module is used for inputting a candidate term c_j = (x_1, x_2, …, x_T) into the self-encoder to obtain a phrase vector with a semantic representation;
the topic weight calculation module is used for calculating the topic weight of each candidate term through the topic vector; wherein, given the topic term set T = {t_1, t_2, …, t_n}, the average of the phrase vectors corresponding to all terms in the set is taken as the topic vector v_d of the document, representing the topic of the entire document:
v_d = (1/n) · Σ_{i=1}^{n} v(t_i)
wherein v(t_i) is the phrase vector representation corresponding to the topic term t_i;
the candidate word ranking module is used for calculating a weight score for each candidate term and taking the Top-K candidate terms as keywords; calculating the weight score comprises iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 − d) · tw(c_j) + d · Σ_{c_k ∈ Adj(c_j)} ( w_{jk} / Σ_{c_p ∈ Adj(c_k)} w_{kp} ) · WS(c_k)
wherein WS(c_j) represents the weight of candidate term c_j and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_{jk} is the weight of the edge between candidate terms c_j and c_k, and w_{kp} is the weight of the edge between candidate terms c_k and c_p; Adj(c_j) denotes the set of candidate terms connected to c_j, with c_k an element of Adj(c_j); Adj(c_k) denotes the set of candidate terms connected to c_k, with c_p an element of Adj(c_k); and WS(c_k) represents the weight of candidate term c_k; the topic weight is calculated as follows: for each candidate term c_j, the cosine distance between the candidate term and the topic vector of the document d_i is taken as the topic weight.
7. The system of claim 6, further comprising a self-encoder training module, configured to determine the self-encoder by obtaining the self-encoder parameters through sample training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910548261.XA CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263343A CN110263343A (en) | 2019-09-20 |
CN110263343B true CN110263343B (en) | 2021-06-15 |
Family
ID=67920847
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910548261.XA Active CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263343B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111274428B (en) * | 2019-12-19 | 2023-06-30 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111222333A (en) * | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | Keyword extraction method based on fusion of network high-order structure and topic model |
CN111785254B (en) * | 2020-07-24 | 2023-04-07 | 四川大学华西医院 | Self-service BLS training and checking system based on anthropomorphic dummy |
CN112818686B (en) * | 2021-03-23 | 2023-10-31 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment |
CN113312532B (en) * | 2021-06-01 | 2022-10-21 | 哈尔滨工业大学 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8019708B2 (en) * | 2007-12-05 | 2011-09-13 | Yahoo! Inc. | Methods and apparatus for computing graph similarity via signature similarity |
CN103744835A (en) * | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extracting method based on subject model |
KR101656245B1 (en) * | 2015-09-09 | 2016-09-09 | 주식회사 위버플 | Method and system for extracting sentences |
CN106372064A (en) * | 2016-11-18 | 2017-02-01 | 北京工业大学 | Characteristic word weight calculating method for text mining |
CN107122413A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107133213A (en) * | 2017-05-06 | 2017-09-05 | 广东药科大学 | A kind of text snippet extraction method and system based on algorithm |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN108460019A (en) * | 2018-02-28 | 2018-08-28 | 福州大学 | A kind of emerging much-talked-about topic detecting system based on attention mechanism |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101249183B1 (en) * | 2006-08-22 | 2013-04-03 | 에스케이커뮤니케이션즈 주식회사 | Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded |
CN106997382B (en) * | 2017-03-22 | 2020-12-01 | 山东大学 | Innovative creative tag automatic labeling method and system based on big data |
CN106970910B (en) * | 2017-03-31 | 2020-03-27 | 北京奇艺世纪科技有限公司 | Keyword extraction method and device based on graph model |
CN107193803B (en) * | 2017-05-26 | 2020-07-10 | 北京东方科诺科技发展有限公司 | Semantic-based specific task text keyword extraction method |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN108710611B (en) * | 2018-05-17 | 2021-08-03 | 南京大学 | Short text topic model generation method based on word network and word vector |
CN109726394A (en) * | 2018-12-18 | 2019-05-07 | 电子科技大学 | Short text Subject Clustering method based on fusion BTM model |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109918660B (en) * | 2019-03-04 | 2021-03-02 | 北京邮电大学 | Keyword extraction method and device based on TextRank |
CN109918510B (en) * | 2019-03-26 | 2022-10-28 | 中国科学技术大学 | Cross-domain keyword extraction method |
Non-Patent Citations (5)
Title |
---|
Bidirectional lstm recurrent neural network for keyphrase extraction;Basaldella Marco 等;《Italian Research Conference on Digital Libraries》;20180131;180-187 * |
Research on automatic text summarization technology based on LSTM; Hong Dongmei; China Master's Theses Full-text Database, Information Science and Technology Series; 20190115 (No. 12); I138-1872 *
Keyword extraction algorithm based on improved TextRank; Zhang Lijing et al.; Journal of Beijing Institute of Graphic Communication; 20160831; vol. 24, no. 4; 51-55 *
Application of a deep-learning-based Chinese extractive summarization method; Qi Yichen et al.; The Guide of Science & Education; 20190515, no. 14; 69-70 *
A TextRank keyword extraction method fusing multiple features; Li Hang et al.; Journal of Intelligence; 20170831; vol. 36, no. 8; 183-187 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263343B (en) | Phrase vector-based keyword extraction method and system | |
CN110717047B (en) | Web service classification method based on graph convolution neural network | |
CN109829104B (en) | Semantic similarity based pseudo-correlation feedback model information retrieval method and system | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN112579778B (en) | Aspect-level emotion classification method based on multi-level feature attention | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
CN111191002A (en) | Neural code searching method and device based on hierarchical embedding | |
CN111368088A (en) | Text emotion classification method based on deep learning | |
CN113239148B (en) | Scientific and technological resource retrieval method based on machine reading understanding | |
CN112784602B (en) | News emotion entity extraction method based on remote supervision | |
CN113821635A (en) | Text abstract generation method and system for financial field | |
CN115357719A (en) | Power audit text classification method and device based on improved BERT model | |
CN111581943A (en) | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph | |
CN111259147B (en) | Sentence-level emotion prediction method and system based on self-adaptive attention mechanism | |
CN113326374A (en) | Short text emotion classification method and system based on feature enhancement | |
CN114048354B (en) | Test question retrieval method, device and medium based on multi-element characterization and metric learning | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
Khalid et al. | Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method | |
CN116756347B (en) | Semantic information retrieval method based on big data | |
CN111859955A (en) | Public opinion data analysis model based on deep learning | |
Bhargava et al. | Deep paraphrase detection in indian languages | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
CN113342964B (en) | Recommendation type determination method and system based on mobile service | |
CN113688633A (en) | Outline determination method and device | |
Ribeiro et al. | UA. PT Bioinformatics at ImageCLEF 2019: Lifelog Moment Retrieval based on Image Annotation and Natural Language Processing. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||