CN110263343B - Phrase vector-based keyword extraction method and system

Info

Publication number
CN110263343B
CN110263343B
Authority
CN
China
Prior art keywords
candidate
weight
term
encoder
terms
Prior art date
Legal status
Active
Application number
CN201910548261.XA
Other languages
Chinese (zh)
Other versions
CN110263343A (en)
Inventor
孙新
赵永妍
申长虹
杨凯歌
张颖捷
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201910548261.XA priority Critical patent/CN110263343B/en
Publication of CN110263343A publication Critical patent/CN110263343A/en
Application granted granted Critical
Publication of CN110263343B publication Critical patent/CN110263343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/205 — Parsing
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing and deep learning, and in particular to a phrase vector-based keyword extraction method and system. The main technical scheme of the invention comprises the following steps: segmenting the original text into words, labeling parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set; constructing vector representations for the phrases contained in the candidate term set; calculating the topic weight of each candidate term; and constructing a graph with the candidate terms as vertices and their co-occurrence information as edges, calculating edge weights from the semantic similarity between candidate terms together with the co-occurrence information, and iteratively calculating and ranking the score of each candidate term. The keyword extraction method and system provided by the invention introduce the topic information of the document and, through the semantic similarity between phrases, the context information, so that the key phrases of the whole text can be captured; the semantic precision is high and the range of application is wide.

Description

Phrase vector-based keyword extraction method and system
Technical Field
The invention relates to the technical field of natural language processing and deep learning, in particular to a phrase vector-based keyword extraction method and system.
Background
In recent years, massive data has brought great convenience to people, but it also poses great challenges to the analysis and search of data. Against the background of big data, how to quickly obtain the required key information from massive data has become an urgent problem. Keyword extraction refers to automatically extracting, by means of an algorithm, the important words or phrases that carry the topic of a document. In scientific literature, keywords or key phrases can help users quickly learn the content of a paper. They can also be used as index entries in information retrieval, natural language processing and text mining. In the keyword extraction task, word vectors containing word semantics have been applied with good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these nouns are often not single words but phrases, so word vectors alone are not sufficient for the keyword extraction task, and vector representations need to be constructed for phrases.
Researchers have proposed building phrase vectors on the basis of word vectors by combining them with a self-encoder (autoencoder). A self-encoder consists of only two parts, an encoder and a decoder. When a self-encoder combines word vectors into a phrase vector, the encoder takes the representation of each word in the phrase as input and compresses them into an intermediate hidden-layer vector, and the decoder reconstructs the input phrase from this hidden-layer vector, so the intermediate vector can be regarded as a phrase vector representation containing semantic information. However, a conventional self-encoder encodes and decodes directly with a basic fully connected network, in which adjacent layers are fully connected while the nodes within a layer are not connected to one another, so an ordinary self-encoding network cannot process the sequence information present in a structure such as a phrase.
In addition, existing algorithms calculate the semantic similarity of words only through word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to form a graph from the candidate terms of a document, construct edges from the co-occurrence relations of the candidate terms in the document, iteratively compute weights by having the candidate terms vote for one another, and finally rank the candidate terms by score to determine the keywords that are extracted. In conventional TextRank, the initial weight of every vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of every edge is also set to 1, i.e. the votes of each vertex are applied uniformly to every vertex it is connected to. Such an approach, while simple and convenient, ignores the topic of the document and does not take the semantic relations between vertices into account.
In a Recurrent Neural Network (RNN), the nodes of the hidden layer are no longer unconnected: the input of the hidden layer contains not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suitable for encoding sequence data. However, the forgetting of historical information and the accumulation of errors during RNN propagation are important problems, and Long Short-Term Memory (LSTM) networks are now usually used as an improvement.
LSTM is a special type of RNN that uses a cell state to record information; the cell state undergoes only a small number of linear interactions as the sequence is processed and can therefore better retain historical information. LSTM then uses gating mechanisms to protect and control the cell state. A gate is an abstract concept: in a concrete implementation it consists of a sigmoid function and a pointwise multiplication, and it controls the flow of information by outputting a value between 0 and 1, where the closer the output is to 0, the less information is allowed through, and the closer it is to 1, the more information is allowed through.
In an LSTM unit, the information passed from the previous step is processed first: LSTM controls the forgetting and keeping of historical information through a forget gate. The forget gate f_t decides, according to the current information, whether the previous information needs to be forgotten, with the formula:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where σ denotes the sigmoid function, and W_f and b_f denote the weight matrix and the bias of the forget gate, respectively.
LSTM then processes the current input: the input gate i_t controls which part of the current input information is to be retained, and a candidate cell state C̃_t is then created with the tanh function so that the information of the current time step can be added to the cell state:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
By means of the forget gate and the input gate, LSTM can determine which past information needs to be kept and which current information needs to be stored, and thus compute the current cell state C_t:
C_t = f_t * C_{t-1} + i_t * C̃_t
Finally, LSTM uses the sigmoid function in an output gate to determine, according to the historical information and the current input, which information should be output at the current time step; similarly to the input, the cell state is filtered with a tanh function:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Through this gating mechanism, the long short-term memory network can remember earlier information while avoiding the vanishing gradient problem.
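As a concrete illustration of the gate equations above, the following is a minimal NumPy sketch of a single LSTM time step; the function name lstm_step and the weight shapes are illustrative assumptions, not part of the invention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations in the text.
    Each W_* matrix multiplies the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # new hidden state
    return h_t, C_t
```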
Disclosure of Invention
The invention provides a phrase vector-based keyword extraction method and system to solve two problems: word vectors alone cannot meet the requirements of the keyword extraction task, and existing algorithms ignore the topic information of the text.
In order to achieve the above object, in a first aspect, the present invention provides a keyword extraction method based on a phrase vector, where the method includes:
S1, segmenting the text into words, labeling parts of speech, and retaining n-tuples to obtain a candidate term set;
S2, constructing phrase vectors for the candidate terms through a self-encoder;
S3, determining the topic of the text, calculating the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term;
and S4, acquiring keywords from the candidate term set through a TextRank algorithm.
Further, the self-encoder in step S2 comprises an encoder and a decoder; the encoder consists of a bidirectional LSTM layer and a fully connected layer, and the decoder consists of a unidirectional LSTM layer and a softmax layer.
Further, the self-encoder in step S2 includes an encoder and a decoder, and the training method includes the following steps:
S21, selecting a training sample to obtain candidate terms;
S22, for a candidate term c_j = (x_1, x_2, ..., x_T), computing in the encoder with a bidirectional LSTM from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, x_t is the word of the candidate term input at time t, and T denotes the number of words in the candidate term;
S23, in the encoder, obtaining ES_T by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T;
S24, in the decoder, decoding with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where W_s · z_t + b_s scores each possible output word and softmax is the normalization function;
S26, when the loss function L is continuously reduced and finally tends to be stable in the training process, obtaining the parameter W of the encoderh、bh、Wc、bcAnd W in the decoders、bsThereby determining a self-encoder; the calculation formula of the loss function L is as follows:
Figure GDA0002959546450000047
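To make the encoder-decoder structure above concrete, the following is a minimal PyTorch sketch of such a self-encoder; the class name PhraseAutoencoder, the teacher-forcing decoder input, and the layer sizes are illustrative assumptions rather than the patented implementation, and the softmax of step S25 is left to the loss function.

```python
import torch
import torch.nn as nn

class PhraseAutoencoder(nn.Module):
    """BiLSTM encoder + fully connected compression (ES_T) + unidirectional LSTM decoder."""

    def __init__(self, word_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.LSTM(word_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Fully connected layers map the concatenated final states to a fixed size.
        self.fc_h = nn.Linear(2 * hidden_dim, hidden_dim)
        self.fc_c = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores W_s z_t + b_s

    def encode(self, word_vecs):
        # word_vecs: (batch, T, word_dim), the word vectors of one candidate term per row
        _, (h_n, c_n) = self.encoder(word_vecs)
        h_T = torch.cat([h_n[0], h_n[1]], dim=-1)          # h_T = forward ⊕ backward
        c_T = torch.cat([c_n[0], c_n[1]], dim=-1)          # C_T = forward ⊕ backward
        return torch.relu(self.fc_h(h_T)), torch.relu(self.fc_c(c_T))   # ES_T = (h'_T, C'_T)

    def forward(self, word_vecs):
        h_prime, c_prime = self.encode(word_vecs)
        # Decoder initialised with ES_T; here the input words are fed back (teacher forcing).
        init = (h_prime.unsqueeze(0), c_prime.unsqueeze(0))
        z, _ = self.decoder(word_vecs, init)
        return self.out(z)                                  # (batch, T, vocab_size) word scores
```

Training such a model with a cross-entropy loss over the word scores corresponds to minimising the negative log-likelihood L above.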
further, in step S2, the candidate term is input from the encoder, and the ES output from the encoderTIs the phrase vector of the candidate term.
Further, the topic vector v(d_i) in step S3 is calculated as:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the vector representation corresponding to subject term t_i, and v(d_i) is the topic vector representation of text d_i.
Further, in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window, similarity(c_j, c_k) is the similarity between c_j and c_k, and w_jk denotes the weight of the edge between c_j and c_k.
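As a purely illustrative numerical example of these formulas, if the cosine similarity between the vectors of two candidate terms is 0.8 and the two terms co-occur 3 times within the window, the edge weight is w_jk = 0.8 × 3 = 2.4.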
Further, the step of iteratively calculating vertex weights in the TextRank algorithm of step S4 comprises:
iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight score WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient, preferably d = 0.85; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of it, and, likewise, adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of it.
In a second aspect, the invention provides a phrase vector-based keyword extraction system, which comprises a text preprocessing module, a phrase vector construction module, a topic weight calculation module and a candidate word ordering module, wherein the text preprocessing module is used for segmenting the original text into words, labeling parts of speech, and retaining n-tuples according to part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_i = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term;
and the candidate word ordering module is used for calculating a weight score for the candidate terms and taking the TopK candidate terms as keywords.
Further, the system also comprises a self-encoder training module, which is used for obtaining the parameters of the self-encoder through sample training, thereby determining the self-encoder.
Compared with existing keyword extraction methods and systems, the phrase vector-based keyword extraction method and system provided by the invention have the following beneficial effects:
1. The keyword extraction method and system introduce the topic information of the document and, through the semantic similarity between words, the context information, so that the key phrases of the whole text can be captured and the extracted keywords are more accurate.
2. The keyword extraction method and system acquire keywords using phrase vectors, so the calculation process is simple and efficient.
3. The phrase vector calculation method creatively introduces an LSTM-based self-encoder to compress word vectors, which can better represent the semantic information of phrases, with higher semantic precision and a wider range of application.
4. The invention improves the TextRank algorithm, creatively using phrase vectors to calculate the topic weight of each candidate term and calculating the edge weights from the semantic similarity between candidate terms together with their co-occurrence information, so that both the topic of the whole document and the semantic information between vertices are taken into account, making the ranking algorithm more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic structural diagram of an auto-encoder according to an embodiment of the present invention;
FIG. 2 is a flowchart of a keyword extraction method based on phrase vectors according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is further described with reference to the following figures and detailed description.
In order to make the technical solutions and advantages in the examples of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and not an exhaustive list of all embodiments. It should be noted that, in the present application, the examples may be combined with each other without conflict.
The invention provides a phrase vector-based keyword extraction method, as shown in fig. 2, the method comprises the following steps:
S1, the original text d_i is segmented into words and part-of-speech tagged, and n-tuples are retained according to part of speech to obtain the candidate term set {c_1, c_2, ..., c_m} of d_i.
S2, for each candidate term cj=(x1,x2,…,xT) A phrase vector representation of the candidate term is obtained by the self-encoder. Wherein x isiIs a candidate term cjThe word vector of the ith word in (a) and T represents the number of words in the candidate term.
S3, calculating each candidate term cjAnd topic vector
Figure GDA0002959546450000072
Is taken as the subject weight
Figure GDA0002959546450000073
Wherein d isiThe ith document is represented. The self-encoder comprises an encoder and a decoder, the encoder part is composed of a bidirectional LSTM layerAnd a full connection layer, and the decoding part is composed of a unidirectional LSTM layer and a softmax layer.
And S4, acquiring keywords from the candidate term set through a modified TextRank algorithm.
In step S2, in the encoder, each input candidate term c_j is processed with a bidirectional LSTM from the front and back directions respectively; the hidden-layer state h_T and cell state C_T at the last time step are taken as the final states and concatenated, and the output ES_T of the encoding layer is finally obtained through a fully connected layer.
In the decoder, with ES_T as the initial input, decoding is carried out with a unidirectional LSTM structure; the probability distribution of each decoding step is obtained through a softmax layer, and the probability of decoding the correct word at each step is maximized through a loss function L.
The aim of training is to optimize parameters of the self-encoder, so that the decoder can take the output of the encoder as input, and restore semantic information of candidate terms input by the encoder to the maximum extent.
The specific training method comprises the following steps:
(1) A training sample is selected, and word segmentation and the other operations of S1 are then performed on the sample to obtain a candidate term set.
A candidate term is represented as c_j = (x_1, x_2, ..., x_T), where x_i is the word vector of the i-th word in candidate term c_j and T denotes the number of words in the candidate term. Taking the candidate term "Beijing Institute of Technology" as an example (segmented in Chinese into the three words "Beijing", "Ligong" and "University"), x_1 is the word vector corresponding to "Beijing", x_2 is the word vector corresponding to "Ligong" ("institute of technology"), and x_3 is the word vector corresponding to "University".
(2) The model is trained with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" as an example, the word vector representations corresponding to "Beijing", "Ligong" and "University" are input; the phrase vector representation of "Beijing Institute of Technology" is obtained through encoding; the probability values corresponding to "Beijing", "Ligong" and "University" are obtained in sequence by decoding the phrase vector; and these probability values are maximized through training.
For each candidate term c_j = (x_1, x_2, ..., x_T), the encoder uses a bidirectional LSTM to compute from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, and x_t is the word of the candidate term input at time t. At each time step, the current hidden-layer state h_t and cell state C_t depend on the hidden-layer state h_{t-1} and cell state C_{t-1} of the previous time step and on the current input x_t.
The hidden-layer state h_T and cell state C_T at the last time step are taken as the final states, and the states of the two directions are directly concatenated. In addition, in order to provide a fixed-size input for the decoding layer, the concatenated states are processed through a fully connected layer. The fixed-size input ES_T of the decoder is obtained by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T that is eventually provided to the decoder.
In the decoder, decoding is performed with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1.
The probability of the current word is estimated from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where W_s is a parameter matrix, W_s and b_s denote the weight and bias of the softmax layer respectively, z_t is the hidden-layer state of the decoder at time t, and W_s · z_t + b_s scores each possible output word; normalizing with softmax yields the probability p(y_t) of each word y_t.
The training goal of the self-encoder is to maximize the probability of outputting the correct phrase: the output of the decoder is a probability for each word, and the training objective is to maximize the probability of outputting the correct word, i.e. the parameters of the self-encoder (including the parameters of the LSTMs, W_h, b_h, W_c, b_c in the encoder and W_s, b_s in the decoder) are adjusted on the basis of the loss function L. When the loss function keeps decreasing during training and finally becomes stable, the intermediate vector can be considered to represent the phrase semantics well and can be used as the phrase vector. The loss function L is calculated as follows:
L = - Σ_{t=1}^{T} log p(y_t = x_t)
after the training of the self-encoder is finished, the loss function value of the self-encoder tends to be stable. At this point, the training of the self-encoder is completed, and the candidate terms are input into the encoder of the self-encoder, ESTThe value in (1) is the phrase vector. Through the self-encoder constructed above, word vectors are compressed by using information on the candidate term sequence, and phrase vector representation of the candidate terms is obtained.
After the training of the self-encoder is finished, when the phrase vector representation of the candidate term is required to be obtained, the phrase vector representation ES of the candidate term can be obtained only by utilizing the calculation of the encoding partTObtained ESTCan be divided into oneThe entirety of the candidate term takes into account semantic information of the candidate term.
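As an illustration only, the following sketch shows how such a self-encoder could be trained with a cross-entropy loss and how the encoder alone is then used to obtain phrase vectors; it assumes the hypothetical PhraseAutoencoder class from the earlier sketch, illustrative layer sizes, and that the phrase vector is the concatenation of h'_T and C'_T, none of which is specified by the patent.

```python
import torch
import torch.nn as nn

# Assumes the hypothetical PhraseAutoencoder sketched earlier; sizes are illustrative.
model = PhraseAutoencoder(word_dim=100, hidden_dim=200, vocab_size=50000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of the correct words

def train_step(word_vecs, word_ids):
    # word_vecs: (batch, T, word_dim) word vectors of the candidate terms
    # word_ids:  (batch, T) vocabulary indices of the same words
    scores = model(word_vecs)                                   # (batch, T, vocab_size)
    loss = loss_fn(scores.reshape(-1, scores.size(-1)), word_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def phrase_vector(word_vecs):
    # After training, only the encoder is needed: ES_T = (h'_T, C'_T).
    with torch.no_grad():
        h_prime, c_prime = model.encode(word_vecs)
    return torch.cat([h_prime, c_prime], dim=-1)                # one vector per candidate term
```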
In step S3, the theme weight calculation process is as follows:
(1) Determining the subject term set: a topic sentence or paragraph that is highly representative of the text, such as the title or abstract of a paper, is taken as representative; the subject terms of the text are determined from this topic sentence and added to the subject term set {t_1, t_2, ..., t_n} of text d_i, where d_i denotes the i-th document and n is the number of elements in the subject term set. For example, for the title "Analysis of development-idea examples for the mining design industry under the new situation", the subject term set may be {"mining design", "development idea", "example analysis"}.
(2) Calculating the topic vector: the average of the word or phrase vectors corresponding to all terms in the subject term set is computed and used as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the vector representation corresponding to subject term t_i, and v(d_i) is the topic vector representation of document d_i.
(3) Calculating the topic weight: for each candidate term c_j, the cosine similarity between c_j and the topic vector v(d_i) of document d_i is calculated and taken as its topic weight:
tw(c_j) = cos(v(c_j), v(d_i))
where tw(c_j) is the topic weight of candidate term c_j in document d_i, v(c_j) is the vector representation of candidate term c_j, and cos denotes the cosine similarity.
Through the steps (1) to (3), a topic weight between 0 and 1 can be assigned to each candidate term. It should be noted that a topic weight of 1 indicates that the candidate term is closest to the topic of the text, and a topic weight of 0 indicates that the candidate term is farther from the topic of the text.
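Purely as an illustration of steps (1) to (3), the following sketch computes a topic vector as the mean of the subject terms' phrase vectors and assigns each candidate term a cosine-similarity topic weight; the function names and the assumption that phrase vectors are supplied as a dict are illustrative, not taken from the patent.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def topic_weights(candidate_vecs: dict, subject_terms: list) -> dict:
    """candidate_vecs maps each candidate term to its phrase vector;
    subject_terms lists the terms taken from the title or abstract."""
    # Topic vector: mean of the vectors of the subject terms.
    topic_vec = np.mean([candidate_vecs[t] for t in subject_terms], axis=0)
    # Topic weight of each candidate term: cosine similarity to the topic vector.
    return {c: cosine(v, topic_vec) for c, v in candidate_vecs.items()}
```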
In step S4, an undirected graph is constructed with the candidate term set {c_1, c_2, ..., c_m} of document d_i as its vertices, the weight score WS(c_j) of each candidate term c_j is computed, and the TopK (top K) candidate terms are taken as keywords. This is realized by improving the TextRank algorithm, and the specific process is as follows:
(1) Constructing the undirected graph: an undirected graph is constructed with all elements of the candidate term set of document d_i as vertices. If candidate terms c_j and c_k appear within a co-occurrence window of length n, there is an edge between c_j and c_k.
(2) Computing the edge weights: the weighting of the edges is an improvement of the present invention, and it likewise depends on the phrase vectors constructed by the self-encoder. According to the cosine similarity between the vector representations of two candidate terms c_j and c_k and their number of co-occurrences occur_count(c_j, c_k), a weight w_jk is assigned to each edge in the graph:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, cos denotes the cosine similarity of the vectors, and occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window; multiplying by the number of co-occurrences strengthens the semantic relation between the two terms. w_jk denotes the weight of the edge between c_j and c_k.
(3) Iteratively calculating the vertex weights: the vertex weighting is also an improvement of the present invention. The weight of every vertex in the graph is calculated iteratively until the maximum number of iterations is reached, the weight score WS(c_j) being calculated as follows (a code sketch of the whole ranking procedure is given after step (4)):
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j in document d_i; d is the damping coefficient, which gives each vertex a certain probability of voting for every other vertex so that every vertex obtains a non-zero score and the algorithm is guaranteed to converge after several iterations, and it is usually set to 0.85; tw(c_j) is the topic weight of candidate term c_j in document d_i; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k an element of it, and, likewise, adj(c_k) denotes the set of candidate terms connected to c_k, with c_p an element of it; WS(c_k) denotes the weight of candidate term c_k in document d_i. The second term on the right-hand side of the equation represents the sum of the votes given to c_j by the vertices connected to it.
(4) Candidate term ordering: after several iterations, every vertex in the graph obtains a stable score; the candidate terms in the set are sorted by their weight scores WS(c_j) from largest to smallest, and the top TopK candidate terms are retained as the keywords of the document.
Through the four steps of S1 to S4, the keywords of the document can be extracted.
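The following is a minimal sketch of the ranking described in steps (1) to (4), assuming the phrase vectors and topic weights have already been computed as above; the window size, iteration count and function names are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidates(sequence, vecs, topic_weight, window=3, d=0.85, iters=30, topk=10):
    """sequence: candidate terms in document order; vecs: term -> phrase vector;
    topic_weight: term -> topic weight tw(c)."""
    # (1) + (2) Build the undirected graph: co-occurrence counts within the window,
    # edge weight = cosine similarity x number of co-occurrences.
    cooc = defaultdict(int)
    for i, cj in enumerate(sequence):
        for ck in sequence[i + 1:i + window]:
            if cj != ck:
                cooc[frozenset((cj, ck))] += 1
    w, adj = {}, defaultdict(set)
    for pair, count in cooc.items():
        a, b = tuple(pair)
        w[pair] = cosine(vecs[a], vecs[b]) * count
        adj[a].add(b)
        adj[b].add(a)
    # (3) Iterate the topic-weighted scores WS(c_j).
    score = {c: 1.0 for c in vecs}
    for _ in range(iters):
        new = {}
        for cj in vecs:
            vote = sum(w[frozenset((cj, ck))]
                       / sum(w[frozenset((ck, cp))] for cp in adj[ck])
                       * score[ck]
                       for ck in adj[cj])
            new[cj] = (1 - d) * topic_weight[cj] + d * vote
        score = new
    # (4) Return the TopK candidate terms by score.
    return sorted(score, key=score.get, reverse=True)[:topk]
```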
The invention also provides a phrase vector-based keyword extraction system, which comprises:
the text preprocessing module is used for segmenting the original text, marking the part of speech and reserving n-tuple according to the part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term; the specific calculation method is as described above.
And the candidate word ordering module is used for calculating a weight score for the candidate terms and taking TopK candidate terms as keywords. The specific selection method is as described above.
Further, the system further includes a self-encoder training module, configured to process sequence information in the phrase structure, and obtain phrase vector representations of the candidate terms, where the training method is as described above.
In the following, an example of enterprise paper data in an enterprise paper database is taken to describe a specific keyword extraction method based on phrase vectors.
The enterprise paper database contains enterprise paper data from various fields, with fields such as 'title', 'year', 'abstract', 'keywords', 'English keywords' and 'classification number'. In the keyword extraction process, the 'title' and 'abstract' fields in the database are used as the text content, and the 'keywords' field is used as labeled data to verify the extraction results.
When training the self-encoder, the 'keywords' field in the database is taken as the training data; some of the parameters used in training are shown in Table 1.
TABLE 1: Training parameter settings
Before extracting keywords, the labeled data are analyzed to determine some parameters of the algorithm. There are 59913 papers in the dataset, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e. the number of words contained in each keyword, is counted; the results are shown in Table 2. As can be seen from Table 2, the average length of all keywords is 1.98, and most keywords have a length between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254376 keywords. Therefore, 1-tuples, 2-tuples and 3-tuples in the text are retained when selecting candidate terms.
Then the parts of speech of all words in the keywords are counted; the statistics are shown in Table 3. Part-of-speech tagging is performed with the Jieba segmentation tool, whose part-of-speech tags are described in Table 4. According to Table 3, the part-of-speech distribution of the words in the keywords is not as concentrated as the length distribution, but it is still dominated by nouns, verbs and verbs with a noun function, which together account for 73.1% of all word occurrences. Therefore, nouns, verbs, noun-verbs, and combinations thereof in the text are taken as candidate terms when candidate term selection is performed.
TABLE 2: Keyword length distribution
TABLE 3: Keyword part-of-speech distribution
TABLE 4: Jieba part-of-speech tags
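As an illustration of the candidate-term selection just described, the following sketch keeps 1- to 3-grams whose words are all tagged as nouns, verbs or noun-verbs by the Jieba part-of-speech tagger; the function name and the exact set of tags kept are illustrative assumptions.

```python
import jieba.posseg as pseg

KEEP_FLAGS = {"n", "v", "vn"}   # nouns, verbs, noun-verbs (illustrative subset of Jieba tags)

def candidate_terms(text: str, max_len: int = 3) -> set:
    """Segment the text, POS-tag it with Jieba, and keep 1- to 3-grams
    consisting only of nouns / verbs / noun-verbs as candidate terms."""
    tokens = [(w.word, w.flag) for w in pseg.cut(text)]
    candidates = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[i:i + n]
            if len(gram) == n and all(flag in KEEP_FLAGS for _, flag in gram):
                candidates.add("".join(word for word, _ in gram))
    return candidates
```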
Since the text content comprises only the title and the abstract of each paper, the title is taken as representative of the topic of the full text when calculating the topic weight, and candidate terms extracted from the title are used to compute the topic vector of the text. In addition, the size of the co-occurrence window in the candidate word ranking is initially set to 3, and the number of candidate words finally retained is 10; some results are shown in Table 5.
TABLE 5: Keyword extraction results (partial)
In the following, one paper from the enterprise paper database is taken as an example to describe the specific keyword extraction process.
The data content is: 'Analysis of development-idea examples for the mining design industry under the new situation. The paper reviews the ten-year period of rapid development of the coal industry and its profound influence on the mining design market. Against the background of the current rapid downturn of the coal-industry economy and fierce competition in the coal design market, and taking the development of the mining specialty of a design institute as an example, the characteristics of the human resources and business changes of the mining specialty are analysed, development ideas and implementation measures for the mining specialty are put forward, and a reference is provided for the development of the mining specialty of other design enterprises.'
Here, 'Analysis of development-idea examples for the mining design industry under the new situation' is the title of the paper, and the remaining content is its abstract.
Candidate terms are selected through n-tuples and part-of-speech tagging, and the candidate terms selected from the title of the paper are used as the subject term set of the text; the selected candidate terms are shown in Table 6.
TABLE 6: Candidate term results
The phrase vector representations corresponding to all terms in the subject term set are obtained with the self-encoder, and their average is computed as the topic vector of the text; the topic vector of the document has a size of 400, and some of its values are shown in Table 7.
TABLE 7: Topic weight results (partial)
For each candidate term, the cosine similarity between it and the topic vector of the text is calculated as its topic weight; some of the values are shown in Table 8.
TABLE 8: Topic weight results (partial)
An undirected graph is constructed with the candidate terms as vertices and their co-occurrence information as edges; a weight is assigned to each edge in the graph according to the cosine similarity between the vector representations of the two candidate terms and their number of co-occurrences, and the vertex weights are obtained by iterating several times over the topic weights and the edge weights. After several iterations, every vertex in the graph obtains a stable score; some of the scores are shown in Table 9.
TABLE 9: Weight score results (partial)
The resulting scores are sorted, and the Top10 candidate terms with the highest scores are taken as the final keywords, as shown in Table 10.
TABLE 10: Keyword extraction results (partial)
It should be noted that "first" and "second" are only used herein to distinguish the same-named entities or operations, and do not imply an order or relationship between the entities or operations.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (7)

1. A keyword extraction method based on phrase vectors is characterized by comprising the following steps:
S1, segmenting the text into words, labeling parts of speech, and retaining n-tuples to obtain a candidate term set;
S2, constructing phrase vectors for the candidate terms through a self-encoder;
S3, determining the topic of the text, calculating the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term; wherein the average of the phrase vectors corresponding to all terms in the subject term set {t_1, t_2, ..., t_n} is taken as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the phrase vector representation corresponding to subject term t_i;
S4, obtaining keywords from the candidate term set through a TextRank algorithm;
wherein the TextRank algorithm in step S4 further comprises iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of adj(c_j); adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of adj(c_k); WS(c_k) denotes the weight of candidate term c_k;
the topic weight is calculated as follows: for each candidate term c_j, the cosine similarity between the candidate term and the topic vector v(d_i) of document d_i is computed and taken as the topic weight.
2. The method of claim 1, wherein the self-encoder in step S2 comprises an encoder and a decoder, the encoder being composed of a bidirectional LSTM layer and a fully connected layer, and the decoder being composed of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, wherein the training method of the self-encoder in step S2 comprises the following steps:
S21, selecting a training sample to obtain candidate terms;
S22, for a candidate term c_j = (x_1, x_2, ..., x_T), computing in the encoder with a bidirectional LSTM from the front and back directions respectively:
→h_t, →C_t = LSTM(→h_{t-1}, →C_{t-1}, x_t)
←h_t, ←C_t = LSTM(←h_{t-1}, ←C_{t-1}, x_t)
where →h_t, →C_t and ←h_t, ←C_t are the hidden-layer state and cell state in the left-to-right and right-to-left directions at time t (t = 1, 2, ..., T), →h_{t-1}, →C_{t-1}, ←h_{t-1}, ←C_{t-1} are the corresponding states at time t-1, x_t is the word of the candidate term input at time t, and T denotes the number of words in the candidate term;
S23, in the encoder, obtaining ES_T by the following formulas:
h_T = →h_T ⊕ ←h_T
C_T = →C_T ⊕ ←C_T
h'_T = f(W_h · h_T + b_h)
C'_T = f(W_c · C_T + b_c)
where ⊕ denotes concatenation, W_h, b_h, W_c and b_c denote the parameter matrices and biases of the fully connected network, f denotes the ReLU activation function of the fully connected network, and ES_T is the tuple formed by h'_T and C'_T;
S24, in the decoder, decoding with a unidirectional LSTM, taking ES_T as the initial state:
z_t = LSTM(z_{t-1}, ES_T, y_{t-1})
where z_t is the hidden-layer state of the decoder at time t, z_{t-1} is the hidden-layer state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word from z_t:
p(y_t) = softmax(W_s · z_t + b_s)
where softmax is the normalization function, W_s · z_t + b_s scores each possible output word, and W_s and b_s denote the weight and bias of the softmax layer respectively;
S26, when the loss function L keeps decreasing during training and finally becomes stable, obtaining the encoder parameters W_h, b_h, W_c, b_c and the decoder parameters W_s, b_s, thereby determining the self-encoder; the loss function L is calculated as:
L = - Σ_{t=1}^{T} log p(y_t = x_t)
4. The method of claim 3, wherein in step S2, the candidate term is input into the encoder, and the ES_T output by the encoder is the phrase vector of the candidate term.
5. The method of claim 1, wherein in the TextRank algorithm of step S4, if candidate terms c_j and c_k occur in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:
similarity(c_j, c_k) = cos(v(c_j), v(c_k))
w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)
where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, occur_count(c_j, c_k) denotes the number of times c_j and c_k co-occur in the co-occurrence window, similarity(c_j, c_k) is the similarity between c_j and c_k, and w_jk denotes the weight of the edge between c_j and c_k.
6. A phrase vector based keyword extraction system, the system comprising:
the text preprocessing module is used for segmenting the original text, marking the part of speech and reserving n-tuple according to the part of speech to obtain a candidate term set;
the phrase vector construction module is used for obtaining, for each candidate term c_j = (x_1, x_2, ..., x_T), a phrase vector with a semantic representation through a self-encoder;
the topic weight calculation module is used for calculating the topic weight of each candidate term through the topic vector; wherein the average of the phrase vectors corresponding to all terms in the subject term set {t_1, t_2, ..., t_n} is taken as the topic vector v(d_i) of the document, representing the topic of the entire document:
v(d_i) = (1/n) · Σ_{i=1}^{n} v(t_i)
where v(t_i) is the phrase vector representation corresponding to subject term t_i;
the candidate word ordering module is used for calculating a weight score for the candidate terms and taking the TopK candidate terms as keywords; the weight score is calculated by iteratively computing the weight of each candidate term until the maximum number of iterations is reached, the weight WS(c_j) being calculated as:
WS(c_j) = (1 - d) · tw(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)
where WS(c_j) denotes the weight of candidate term c_j, and d is the damping coefficient; tw(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, with c_k being an element of adj(c_j); adj(c_k) denotes the set of candidate terms connected to c_k, with c_p being an element of adj(c_k); WS(c_k) denotes the weight of candidate term c_k; the topic weight is calculated as follows: for each candidate term c_j, the cosine similarity between the candidate term and the topic vector v(d_i) of document d_i is computed and taken as the topic weight.
7. The system of claim 6, further comprising a self-encoder training module, which is used for obtaining the parameters of the self-encoder through sample training, thereby determining the self-encoder.
CN201910548261.XA 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system Active CN110263343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Publications (2)

Publication Number Publication Date
CN110263343A CN110263343A (en) 2019-09-20
CN110263343B true CN110263343B (en) 2021-06-15

Family

ID=67920847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910548261.XA Active CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Country Status (1)

Country Link
CN (1) CN110263343B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274428B (en) * 2019-12-19 2023-06-30 北京创鑫旅程网络技术有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111222333A (en) * 2020-04-22 2020-06-02 成都索贝数码科技股份有限公司 Keyword extraction method based on fusion of network high-order structure and topic model
CN111785254B (en) * 2020-07-24 2023-04-07 四川大学华西医院 Self-service BLS training and checking system based on anthropomorphic dummy
CN112818686B (en) * 2021-03-23 2023-10-31 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN113312532B (en) * 2021-06-01 2022-10-21 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019708B2 (en) * 2007-12-05 2011-09-13 Yahoo! Inc. Methods and apparatus for computing graph similarity via signature similarity
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
KR101656245B1 (en) * 2015-09-09 2016-09-09 주식회사 위버플 Method and system for extracting sentences
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101249183B1 (en) * 2006-08-22 2013-04-03 에스케이커뮤니케이션즈 주식회사 Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded
CN106997382B (en) * 2017-03-22 2020-12-01 山东大学 Innovative creative tag automatic labeling method and system based on big data
CN106970910B (en) * 2017-03-31 2020-03-27 北京奇艺世纪科技有限公司 Keyword extraction method and device based on graph model
CN107193803B (en) * 2017-05-26 2020-07-10 北京东方科诺科技发展有限公司 Semantic-based specific task text keyword extraction method
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109918660B (en) * 2019-03-04 2021-03-02 北京邮电大学 Keyword extraction method and device based on TextRank
CN109918510B (en) * 2019-03-26 2022-10-28 中国科学技术大学 Cross-domain keyword extraction method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Bidirectional lstm recurrent neural network for keyphrase extraction;Basaldella Marco 等;《Italian Research Conference on Digital Libraries》;20180131;180-187 *
Research on automatic text summarization technology based on LSTM; Hong Dongmei; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 12); I138-1872 *
Keyword extraction algorithm based on improved TextRank; Zhang Lijing et al.; Journal of Beijing Institute of Graphic Communication; 20160831; Vol. 24, No. 4; 51-55 *
Application of a Chinese extractive summarization method based on deep learning; Qi Yichen et al.; The Guide of Science & Education; 20190515 (No. 14); 69-70 *
A TextRank keyword extraction method fusing multiple features; Li Hang et al.; Journal of Intelligence; 20170831; Vol. 36, No. 8; 183-187 *

Also Published As

Publication number Publication date
CN110263343A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263343B (en) Phrase vector-based keyword extraction method and system
CN110717047B (en) Web service classification method based on graph convolution neural network
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN112579778B (en) Aspect-level emotion classification method based on multi-level feature attention
CN111310471A (en) Travel named entity identification method based on BBLC model
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN113239148B (en) Scientific and technological resource retrieval method based on machine reading understanding
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN111259147B (en) Sentence-level emotion prediction method and system based on self-adaptive attention mechanism
CN113821635A (en) Text abstract generation method and system for financial field
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
Khalid et al. Topic detection from conversational dialogue corpus with parallel dirichlet allocation model and elbow method
CN113326374B (en) Short text emotion classification method and system based on feature enhancement
CN111859955A (en) Public opinion data analysis model based on deep learning
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Bhargava et al. Deep paraphrase detection in indian languages
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN113688633A (en) Outline determination method and device
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Ribeiro et al. UA. PT Bioinformatics at ImageCLEF 2019: Lifelog Moment Retrieval based on Image Annotation and Natural Language Processing.
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant