CN110263343A - Keyword extraction method and system based on phrase vectors - Google Patents
Keyword extraction method and system based on phrase vectors
- Publication number
- CN110263343A CN110263343A CN201910548261.XA CN201910548261A CN110263343A CN 110263343 A CN110263343 A CN 110263343A CN 201910548261 A CN201910548261 A CN 201910548261A CN 110263343 A CN110263343 A CN 110263343A
- Authority
- CN
- China
- Prior art keywords
- term
- candidate
- candidate term
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors. The main technical scheme of the invention includes: segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set; building vector representations for the many phrases contained in the candidate keyword set; calculating the topic weight of each candidate term; constructing a graph with candidate terms as vertices and the co-occurrence of candidate terms as edges, computing edge weights from the semantic similarity and co-occurrence information between candidate terms, and iteratively calculating and ranking the score of each candidate term. The keyword extraction method and system provided by the invention both introduce the topic information of the document and, through the semantic similarity between phrases, introduce contextual information, so they better capture the important words of the full text; semantic precision is high and the range of application is wide.
Description
Technical field
The present invention relates to the field of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors.
Background art
In recent years, massive data has brought great convenience, but it has likewise brought huge challenges to the analysis and retrieval of data. Against the background of big data, how to quickly obtain the key information needed from massive data has become an urgent problem. Keyword extraction refers to automatically extracting important, topical words or phrases from a document by an algorithm. In scientific literature, keywords or key phrases can help users quickly understand the content of a paper. Meanwhile, keywords or phrases also serve as search entries in information retrieval, natural language processing and text mining. In the keyword extraction task, word vectors, which encode the semantics of words, have been applied and have achieved good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these nouns are often not single words but phrases. Word vectors alone are therefore insufficient for the keyword extraction task, and vector representations need to be constructed for phrases.
Some scholars have proposed combining word vectors with an autoencoder to construct phrase vectors. An autoencoder (Auto Encoder) structurally has only two parts, an encoder and a decoder. When an autoencoder is used to combine word vectors into a phrase vector, the representation of each word in the phrase is fed into the encoder, which compresses them into an intermediate hidden-layer vector; the decoder then reconstructs the input phrase from this hidden vector, and the intermediate vector can be regarded as a phrase vector representation containing semantic information. However, a traditional autoencoder encodes and decodes directly with a basic fully-connected network, in which adjacent layers are fully connected but the nodes within a layer are not connected to one another; such an ordinary autoencoder network cannot handle the sequence information in a structure like a phrase.
In addition, existing algorithms compute only the semantic similarity of words via word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to build a graph from the candidate terms in a document, construct edges from the co-occurrence relations of the candidate terms in the document, iteratively compute weights through mutual voting between candidate terms, and finally rank the candidate terms by score to determine the keywords to extract. In traditional TextRank, the initial weight of each vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of each edge is also set to 1; that is, the votes of each vertex are distributed evenly among the vertices connected to it. Although this method is simple and convenient, it both ignores the topicality of the document and fails to consider the semantic relations between vertices.
In a recurrent neural network (Recurrent Neural Network, RNN), the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suited to encoding sequence data. However, during propagation in an RNN, the forgetting of historical information and the accumulation of error are major problems, which nowadays are usually improved with the long short-term memory network (Long Short-Term Memory, LSTM).
The LSTM is a specific type of RNN. It records information in a cell state, which undergoes only a small amount of linear interaction during sequence propagation and can therefore better retain historical information. The LSTM then protects and controls the cell state with a gating mechanism. The gating mechanism is an abstract concept; in concrete implementation it consists of a sigmoid function and pointwise multiplication. It controls the transmission of information by outputting a value between 0 and 1: an output value closer to 0 means less information is allowed through, and a value closer to 1 means more information is allowed through.
In an LSTM unit, the first thing to be processed is the information passed over from the previous step. The LSTM controls the forgetting and retention of historical information through a forget gate. The forget gate f_t decides, according to the current information, whether to forget the previous information; the specific formula is as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where σ denotes the sigmoid function, and W_f and b_f denote the weight matrix and bias of the forget gate respectively.
The LSTM then processes the current input. An input gate first controls which part of the current input information will be retained; afterwards, a candidate cell state C̃_t is created with a tanh function, and the information of the current node is added to the cell state:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Through the forget gate and the input gate, the LSTM decides which past information should be kept and which current information should be stored, and computes the current cell state C_t:

C_t = f_t * C_{t-1} + i_t * C̃_t
Finally, according to the historical information and the current input, the LSTM determines through an output gate, using a sigmoid function, the information to output at the current time step; similar to the input, the output state is also filtered with a tanh function:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Through this clever gating mechanism, the long short-term memory network can remember earlier information while also avoiding the "vanishing gradient" problem.
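As a concrete illustration of the gate equations above, the following is a minimal single-step LSTM cell in pure Python with scalar states; the parameter layout (dicts W and b keyed by gate name) is purely illustrative and does not come from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with scalar states, following the gate
    equations above. W and b are dicts keyed by gate name ('f', 'i',
    'C', 'o'); each weight is a pair acting on [h_{t-1}, x_t]."""
    z = (h_prev, x_t)                          # concatenated input [h_{t-1}, x_t]
    dot = lambda w: w[0] * z[0] + w[1] * z[1]

    f_t = sigmoid(dot(W['f']) + b['f'])        # forget gate: how much past to keep
    i_t = sigmoid(dot(W['i']) + b['i'])        # input gate: how much input to keep
    c_tilde = math.tanh(dot(W['C']) + b['C'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # new cell state C_t
    o_t = sigmoid(dot(W['o']) + b['o'])        # output gate
    h_t = o_t * math.tanh(c_t)                 # new hidden state h_t
    return h_t, c_t
```

Since every gate passes through a sigmoid, its value lies strictly between 0 and 1, which is exactly the "how much information is allowed through" behaviour described above.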
Summary of the invention
In order to solve the two problems that word vectors are insufficient for the keyword extraction task and that existing algorithms ignore the topic information of the text, the present invention provides a keyword extraction method and system based on phrase vectors.
To achieve the above object, in a first aspect, the present invention provides a keyword extraction method based on phrase vectors, the method comprising:
S1, segmenting the text and tagging parts of speech, retaining n-grams to obtain a candidate term set;
S2, constructing a phrase vector for each candidate term through an autoencoder;
S3, determining the topic of the text and calculating the similarity between each candidate term and the topic vector, taking the similarity as the topic weight of the candidate term;
S4, obtaining keywords from the candidate term set through the TextRank algorithm.
Further, the autoencoder in step S2 includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder is composed of a unidirectional LSTM layer and a softmax layer.
Further, the autoencoder in step S2 includes an encoder and a decoder, and its training method includes the following steps:
S21, choosing training samples and obtaining candidate terms;
S22, for a candidate term c_j = (x_1, x_2, …, x_T), computing in the encoder with a bidirectional LSTM from both directions:

h_t^f, C_t^f = LSTM(x_t, h_{t-1}^f, C_{t-1}^f)   (left to right)
h_t^b, C_t^b = LSTM(x_t, h_{t-1}^b, C_{t-1}^b)   (right to left)

where h_t^f, C_t^f and h_t^b, C_t^b are the hidden state and cell state at time t (t = 1, 2, …, T) in the left-to-right and right-to-left directions respectively, h_{t-1}^f, C_{t-1}^f and h_{t-1}^b, C_{t-1}^b are the hidden states and cell states in the two directions at time t-1, and x_t is the word of the candidate term input at time t; T denotes the number of words in the candidate term;
S23, in the encoder, computing ES_T by the formulas:

h_T = h_T^f ⊕ h_T^b,  C_T = C_T^f ⊕ C_T^b
h′_T = f(W_h · h_T + b_h)
C′_T = f(W_c · C_T + b_c)
ES_T = (h′_T, C′_T)

where ⊕ is the concatenation operator, W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully-connected network, f denotes the ReLU activation function in the fully-connected network, and ES_T is the tuple formed by h′_T and C′_T;
S24, in the decoder, decoding with a unidirectional LSTM using ES_T as the initial state:

z_t = LSTM(z_{t-1}, y_{t-1}),  z_0 = ES_T

where z_t is the hidden state of the decoder at time t, z_{t-1} is the hidden state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word according to z_t:

p(y_t) = softmax(W_s · z_t + b_s)

where W_s · z_t + b_s scores each possible output word, and softmax is the normalization function;
S26, when the loss function L keeps decreasing during training and finally stabilizes, obtaining the parameters W_h, b_h, W_c, b_c of the encoder and W_s, b_s of the decoder, thereby determining the autoencoder; wherein the loss function L is the negative log-likelihood of reconstructing the input words:

L = -Σ_{t=1..T} log p(y_t = x_t)
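The decoding probability of S25 and the loss of S26 can be sketched as follows in pure Python. The helper names softmax and reconstruction_loss are illustrative; the loss is written as the standard cross-entropy reconstruction objective, since the printed formula is illegible in this copy and the text only states the objective (maximize the probability of the correct words).

```python
import math

def softmax(scores):
    """Normalize the scores W_s*z_t + b_s into a probability
    distribution over the vocabulary (step S25)."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruction_loss(step_probs, target_ids):
    """Loss L of step S26, reconstructed as the negative log-likelihood
    of the correct word at each decoding step; step_probs holds one
    softmax distribution per step, target_ids the correct word index."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
```

Driving this loss down raises the probability the decoder assigns to the correct word at every step, which is exactly the training behaviour S26 describes.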
Further, in step S2, a candidate term is input into the autoencoder, and the value of ES_T output by the encoder is the phrase vector of the candidate term.
Further, the topic vector v(d_i) in step S3 is calculated as:

v(d_i) = (1/n) Σ_{k=1..n} v(t_k)

where v(t_k) is the vector representation corresponding to topic term t_k, and v(d_i) is the topic vector representation of text d_i.
Further, in the TextRank algorithm of step S4, if candidate terms c_j and c_k appear in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)

where similarity(c_j, c_k) is the similarity between the vector representations v(c_j) and v(c_k) of candidate terms c_j and c_k, occur_count(c_j, c_k) is the number of times c_j and c_k appear together in a co-occurrence window, and w_jk represents the weight of the edge between c_j and c_k.
Further, the TextRank algorithm of step S4 also includes iteratively calculating vertex weights, comprising the following steps:
iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weighted score WS(c_j) being calculated as:

WS(c_j) = (1 − d) · w_topic(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)

where WS(c_j) denotes the score of candidate term c_j; d is the damping coefficient, preferably d = 0.85; w_topic(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, of which c_k is an element, and similarly adj(c_k) denotes the set of candidate terms connected to c_k, of which c_p is an element.
In a second aspect, the present invention provides a keyword extraction system based on phrase vectors, the system comprising:
a text preprocessing module, for segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set;
a phrase vector construction module, for obtaining, for a candidate term c_j = (x_1, x_2, …, x_T), a semantically expressive phrase vector through the autoencoder;
a topic weight calculation module, for calculating the topic weights of candidate terms;
a candidate term ranking module, for calculating weighted scores for candidate terms and taking the top K candidate terms as keywords.
Further, the system also includes an autoencoder training module, for obtaining the parameters of the autoencoder through sample training, thereby determining the autoencoder.
Compared with existing keyword extraction methods and systems, the keyword extraction method and system based on phrase vectors provided by the invention have the following beneficial effects:
1. The keyword extraction method and system provided by the invention both introduce the topic information of the document and introduce contextual information through the semantic similarity between phrases, so they better capture the important words of the full text and make the extracted keywords more accurate.
2. The keyword extraction method and system provided by the invention obtain keywords using phrase vectors, which makes the calculation process concise and efficient.
3. The phrase vector calculation method provided by the invention innovatively introduces an LSTM-based autoencoder to compress word vectors; it can better express the semantic information of phrases, with higher semantic precision and a wider range of application.
4. The invention improves the TextRank algorithm, innovatively calculating a topic weight for each candidate term using phrase vectors and calculating edge weights jointly from the semantic similarity and co-occurrence information between candidate terms; it can thereby consider the topic of the entire document and introduce semantic information between vertices, making the ranking algorithm more accurate.
Description of the drawings
In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the disclosure; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a structural schematic diagram of the autoencoder of one embodiment of the invention;
Fig. 2 is a flow chart of the keyword extraction method based on phrase vectors of one embodiment of the invention.
Specific embodiments
In order to make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments of the invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the invention.
The present invention will be further explained below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the absence of conflict, the examples in this application can be combined with one another.
The present invention provides a keyword extraction method based on phrase vectors. As shown in Fig. 2, the method includes the following steps:
S1, segmenting the original text d_i and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set.
S2, for each candidate term c_j = (x_1, x_2, …, x_T), obtaining the phrase vector representation of the candidate term through the autoencoder, where x_i is the word vector representation of the i-th word in candidate term c_j and T is the number of words in the candidate term. The autoencoder includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder is composed of a unidirectional LSTM layer and a softmax layer.
S3, calculating the similarity between each candidate term c_j and the topic vector v(d_i) as its topic weight, where d_i denotes the i-th document.
S4, obtaining keywords from the candidate term set through an improved TextRank algorithm.
In step S2, in the encoder, each candidate term c_j to be input is processed with a bidirectional LSTM from both directions; the hidden state h_T and cell state C_T of the last time step are taken as the final state and concatenated, and the output ES_T of the encoding layer is finally obtained through a fully-connected layer.
In the decoder, with ES_T as the initial input, decoding is performed with a unidirectional LSTM structure; the probability distribution of each decoding step is obtained through a softmax layer, and the probability of decoding the correct word at each step is maximized through the loss function L. The purpose of training is to optimize the parameters of the autoencoder so that the decoder, taking the output of the encoder as input, restores the semantic information of the input candidate term to the greatest possible degree.
The specific training method is as follows:
(1) Choose training samples, then perform operations such as segmentation on the samples as in S1 to obtain a candidate term set. A candidate term is denoted c_j = (x_1, x_2, …, x_T), where x_i is the word vector representation of the i-th word in candidate term c_j and T is the number of words in the candidate term. Taking the candidate term c_j "Beijing Institute of Technology" as an example, x_1 is the word vector corresponding to "Beijing", x_2 is the word vector corresponding to "science and engineering", and x_3 is the word vector corresponding to "university".
(2) Train the model with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" as an example, the input is the word vector representations corresponding to "Beijing", "science and engineering" and "university"; encoding yields the phrase vector representation of "Beijing Institute of Technology", and decoding that phrase vector yields, in order, the probability values corresponding to "Beijing", "science and engineering" and "university", which training maximizes.
For each candidate term c_j = (x_1, x_2, …, x_T), in the encoder part, the encoder computes with a bidirectional LSTM from both directions:

h_t^f, C_t^f = LSTM(x_t, h_{t-1}^f, C_{t-1}^f)   (left to right)
h_t^b, C_t^b = LSTM(x_t, h_{t-1}^b, C_{t-1}^b)   (right to left)

where h_t^f, C_t^f and h_t^b, C_t^b are the hidden state and cell state at time t (t = 1, 2, …, T) in the left-to-right and right-to-left directions respectively, h_{t-1}^f, C_{t-1}^f and h_{t-1}^b, C_{t-1}^b are the hidden states and cell states in the two directions at time t-1, and x_t is the word of the candidate term input at time t. At each time step, the current hidden state h_t and cell state C_t depend on the hidden state h_{t-1} and cell state C_{t-1} of the previous step and the current input x_t.
The hidden state h_T and cell state C_T of the last time step are taken as the final state, and the states of the two directions are directly concatenated. Besides providing an input of fixed size to the decoding layer, the concatenated states also need to be processed by a fully-connected layer. The following formulas give the fixed-size input ES_T of the decoder:

h_T = h_T^f ⊕ h_T^b,  C_T = C_T^f ⊕ C_T^b
h′_T = f(W_h · h_T + b_h)
C′_T = f(W_c · C_T + b_c)
ES_T = (h′_T, C′_T)

where ⊕ is the concatenation operator, W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully-connected network, f denotes the ReLU activation function in the fully-connected network, and ES_T is the tuple formed by h′_T and C′_T that is finally provided to the decoder.
In the decoder part, with ES_T as the initial state, decoding is performed with a unidirectional LSTM:

z_t = LSTM(z_{t-1}, y_{t-1}),  z_0 = ES_T

where z_t is the hidden state of the decoder at time t, z_{t-1} is the hidden state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1.
The probability of the current word is estimated according to z_t:

p(y_t) = softmax(W_s · z_t + b_s)

where W_s is a parameter matrix and z_t is the hidden state of the decoder at time t; W_s · z_t + b_s scores each possible output word, and softmax normalizes these scores into the probability of each word y_t.
The training objective of the autoencoder is to maximize the probability of outputting the correct phrase: the autoencoder outputs the probability of each word, and the training objective is to maximize the probability of outputting the correct words, i.e. to train according to the loss function L. Training adjusts the parameters of the autoencoder (including the parameters in the LSTMs, W_h, b_h, W_c, b_c in the encoder and W_s, b_s in the decoder). When the loss function keeps decreasing during training and finally stabilizes, the intermediate vector can be considered to represent the phrase semantics well, and we can take the intermediate vector as the phrase vector. The loss function L is the negative log-likelihood of reconstructing the input words:

L = -Σ_{t=1..T} log p(y_t = x_t)

After the autoencoder is trained, the loss function value stabilizes. At this point autoencoder training is complete; a candidate term is input into the encoder of the autoencoder, and the value of ES_T is the phrase vector. With the autoencoder constructed above, the information in the candidate term sequence is used to compress the word vectors, and the phrase vector representation of the candidate term is obtained.
After autoencoder training is complete, when the phrase vector representation of a candidate term is needed, only the encoding part needs to be computed to obtain the phrase vector representation ES_T of the candidate term; the resulting ES_T considers the candidate term as a whole and captures its semantic information.
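How the final states of the two LSTM directions are concatenated and passed through the ReLU fully-connected layer to produce ES_T can be sketched as follows, using plain Python lists in place of tensors; all function names here are illustrative, not from the patent.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encoder_output(h_fwd, h_bwd, c_fwd, c_bwd, Wh, bh, Wc, bc):
    """Build ES_T = (h'_T, C'_T) from the final states of both LSTM
    directions: concatenate them, then apply the fully-connected layer
    h'_T = ReLU(W_h * h_T + b_h), C'_T = ReLU(W_c * C_T + b_c)."""
    h_T = list(h_fwd) + list(h_bwd)          # concatenation of both directions
    C_T = list(c_fwd) + list(c_bwd)
    h_prime = relu([x + o for x, o in zip(matvec(Wh, h_T), bh)])
    c_prime = relu([x + o for x, o in zip(matvec(Wc, C_T), bc)])
    return h_prime, c_prime                  # ES_T, used as the phrase vector
```

In a real implementation the per-direction states would come from a trained bidirectional LSTM; here they are simply passed in, since only the combination step is being illustrated.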
In step S3, the topic weight calculation process is as follows:
(1) Determine the topic term set: take a sentence or paragraph that concisely summarizes the text as representative, such as the title or abstract of a paper, determine the topic terms of the text from it, and add them to the topic term set of the text: T(d_i) = {t_1, t_2, …, t_n}, where d_i denotes the i-th document and n is the number of elements in the topic term set. For example, for "Case analysis of development strategies for the mining design industry under the new situation", the topic term set can be "mining design", "development strategy", "case analysis".
(2) Calculate the topic vector: compute the average of the vectors corresponding to all the words or phrases in the topic term set T(d_i) as the topic vector v(d_i) of the document, used to represent the topic of the entire document:

v(d_i) = (1/n) Σ_{k=1..n} v(t_k)

where v(t_k) is the vector representation corresponding to topic term t_k, and v(d_i) is the topic vector representation of document d_i.
(3) Calculate the topic weight: for each candidate term c_j, compute the cosine distance between it and the topic vector v(d_i) of document d_i as its topic weight:

w_topic(c_j) = cos(v(c_j), v(d_i))

where w_topic(c_j) is the topic weight of candidate term c_j in document d_i, v(c_j) is the vector representation of candidate term c_j, and cos denotes cosine distance.
Through steps (1)–(3) above, each candidate term can be assigned a topic weight between 0 and 1. Note that a topic weight of 1 indicates that the candidate term is closest to the topic of the text, while 0 indicates that the candidate term is far from the topic of the text.
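Steps (1)–(3) can be sketched in pure Python as follows; the function names are illustrative, and toy lists stand in for the phrase vectors produced by the autoencoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_vector(topic_term_vectors):
    """Step (2): average the vectors v(t_k) of the topic terms to get
    the document topic vector v(d_i)."""
    n = len(topic_term_vectors)
    dim = len(topic_term_vectors[0])
    return [sum(vec[d] for vec in topic_term_vectors) / n for d in range(dim)]

def topic_weight(candidate_vec, doc_topic_vec):
    """Step (3): the topic weight of a candidate term is the cosine
    similarity between its phrase vector and the topic vector."""
    return cosine(candidate_vec, doc_topic_vec)
```

A candidate term whose vector points in the same direction as the topic vector gets a weight near 1; an orthogonal one gets a weight near 0, matching the interpretation given above.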
In step S4, an undirected graph is constructed with the candidate term set of document d_i as vertices, the weighted score WS(c_j) of each candidate term c_j is calculated, and the top K candidate terms are taken as keywords. This is realized by improving the TextRank algorithm; the specific process is as follows:
(1) Construct the undirected graph: construct an undirected graph with all elements of the candidate term set of document d_i as vertices, where, if candidate terms c_j and c_k appear within a co-occurrence window of length n, there is an edge between c_j and c_k.
(2) Calculate edge weights: the edge weight is an improvement of the invention. The calculation likewise relies on the phrase vectors constructed by the autoencoder. Each edge in the graph is assigned a weight w_jk according to the cosine distance similarity(c_j, c_k) between the vector representations of the two candidate terms c_j and c_k and their co-occurrence count occur_count(c_j, c_k):

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)

where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, cos denotes the cosine distance of the vectors, and occur_count(c_j, c_k) is the number of times c_j and c_k appear together in a co-occurrence window; multiplying by the number of times the two words appear together strengthens their semantic relation. w_jk represents the weight of the edge between c_j and c_k.
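Counting co-occurrences within a sliding window and combining them with similarity into the edge weight w_jk can be sketched as follows; the window size defaults to 3 as in the embodiment, and the helper names are illustrative.

```python
from collections import Counter

def cooccurrence_counts(terms, window=3):
    """Count occur_count(c_j, c_k): how often two candidate terms fall
    inside the same sliding co-occurrence window over the candidate-term
    sequence of the document. Pairs are stored in sorted order since the
    graph is undirected."""
    counts = Counter()
    for i in range(len(terms)):
        for j in range(i + 1, min(i + window, len(terms))):
            if terms[i] != terms[j]:
                counts[tuple(sorted((terms[i], terms[j])))] += 1
    return counts

def edge_weight(similarity, occur_count):
    """w_jk = similarity(c_j, c_k) x occur_count(c_j, c_k)."""
    return similarity * occur_count
```

Terms that are both semantically close and frequently co-occurring thus get the heaviest edges, which is the stated intent of the weighting.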
(3) Iteratively calculate vertex weights: the vertex weight is also an improvement of the invention. The weight of each vertex in the graph is calculated iteratively until the maximum number of iterations is reached; the weighted score WS(c_j) is calculated as follows:

WS(c_j) = (1 − d) · w_topic(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)

where WS(c_j) denotes the weight of candidate term c_j of document d_i; d is the damping coefficient, whose role is to give each vertex a certain probability of voting for other vertices, so that every vertex can obtain a non-zero score and the algorithm is guaranteed to converge after several iterations; its usual value is 0.85. w_topic(c_j) is the topic weight of candidate term c_j of document d_i, w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, of which c_k is an element, and similarly adj(c_k) denotes the set of candidate terms connected to c_k, of which c_p is an element; WS(c_k) denotes the weight of candidate term c_k of document d_i. The latter part of the right-hand side of the equation represents the votes cast for c_j by the vertices connected to it.
(4) Rank candidate terms: after several iterations, each vertex in the graph obtains a stable score; the candidate term set is sorted in descending order of weighted score WS(c_j), and the top K candidate terms are retained as the keywords of the document.
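The iterative update of step (3) can be sketched as a weighted PageRank-style loop; representing the graph as a dict of unordered edge pairs and running a fixed number of iterations are illustrative assumptions.

```python
def textrank_scores(edges, topic_w, d=0.85, iterations=50):
    """Iterate the improved TextRank update: each candidate term keeps a
    (1 - d) * topic-weight base score and receives a d-damped share of
    its neighbours' scores, proportional to the edge weight w_jk.
    `edges` maps unordered (c_j, c_k) pairs to weights; the graph is
    undirected, as in step (1)."""
    neighbours = {c: {} for c in topic_w}
    for (j, k), w in edges.items():
        neighbours[j][k] = w
        neighbours[k][j] = w
    scores = {c: 1.0 for c in topic_w}       # arbitrary initial scores
    for _ in range(iterations):
        updated = {}
        for j in topic_w:
            vote = 0.0
            for k, w_jk in neighbours[j].items():
                total_out = sum(neighbours[k].values())  # sum of w_kp over p
                if total_out:
                    vote += w_jk / total_out * scores[k]
            updated[j] = (1.0 - d) * topic_w[j] + d * vote
        scores = updated
    return scores
```

Sorting the returned scores in descending order and keeping the top K gives the keyword list of step (4).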
Through the four steps S1–S4 above, the keywords of a document can be extracted.
The present invention also provides a keyword extraction system based on phrase vectors, comprising:
a text preprocessing module, for segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set;
a phrase vector construction module, for obtaining, for a candidate term c_j = (x_1, x_2, …, x_T), a semantically expressive phrase vector through the autoencoder;
a topic weight calculation module, for calculating the topic weights of candidate terms; the specific calculation method is as described above;
a candidate term ranking module, for calculating weighted scores for candidate terms and taking the top K candidate terms as keywords; the specific selection method is as described above.
Further, the system also includes an autoencoder training module, for handling the sequence information in the phrase structure and obtaining the phrase vector representations of candidate terms; the training method is as described above.
Below, taking enterprise paper data from an enterprise paper database as an example, the keyword extraction method based on phrase vectors is illustrated concretely.
The enterprise paper database contains enterprise paper data on environmental protection and several other fields; the data include fields such as "title", "time", "abstract", "keywords", "English keywords" and "classification number". During keyword extraction, the "title" and "abstract" in the database serve as the text content, and "keywords" serve as labeled data to verify the extraction results.
When training the autoencoder, the "keywords" field in the database is taken as training data; some of the parameters in the training process are shown in Table 1.
Table 1: Training parameter settings
Before keyword extraction, the labeled data are analyzed to determine some of the parameters in the algorithm. The data set contains 59913 papers, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e. the number of words each keyword contains, is counted; the results are shown in Table 2. From Table 2 it can be found that the average length of all keywords is 1.98, and the length of most keywords lies between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254376 keywords. Therefore the 1-grams, 2-grams and 3-grams in the text are retained when selecting candidate terms.
Then the parts of speech of all the words in the keywords are counted; the statistical results are shown in Table 3. Part-of-speech tagging is completed with the Jieba segmentation tool, and some part-of-speech explanations are shown in Table 4. According to Table 3, the part-of-speech distribution of the words in keywords is not as concentrated as the length distribution, but it is still mainly concentrated on nouns, verbs and verbs with a noun function (nominal verbs); these three parts of speech account for 73.1% of all word parts of speech. Therefore, when selecting candidate terms, the nouns, verbs and nominal verbs in the text and their combinations are taken as candidate terms.
Table 2: Distribution of keyword lengths
Table 3: Distribution of keyword parts of speech
Table 4: Jieba part-of-speech tag explanations
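The part-of-speech filter described above can be sketched as follows. The patent uses the Jieba tagger; to stay self-contained, this sketch operates on pre-tagged `(word, flag)` pairs such as `jieba.posseg.cut` would yield, and the helper name and sample data are illustrative:

```python
KEPT_POS = {"n", "v", "vn"}  # noun, verb, verb-with-noun-function (Jieba tag set)

def filter_by_pos(tagged_tokens):
    """Keep only words whose part-of-speech flag is in the retained set.

    tagged_tokens: list of (word, pos_flag) pairs, e.g. produced by
    jieba.posseg.cut() via [(p.word, p.flag) for p in ...].
    """
    return [word for word, flag in tagged_tokens if flag in KEPT_POS]

tagged = [("mining", "n"), ("of", "u"), ("design", "vn"), ("rapidly", "d")]
print(filter_by_pos(tagged))  # → ['mining', 'design']
```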
Since the text content includes only the title and abstract of each paper, the title is mainly used as representative of the full text when computing topic weights: candidate lexical items are extracted from the title to compute the theme vector of the text. In addition, the co-occurrence window over the candidate word sequence is initially set to 3, and the 10 highest-ranked candidate words are finally retained, as shown in Table 5.
Table 5: Keyword extraction results (partial)
As a preferred illustration, a single paper record from the enterprise paper database is taken as an example to show the keyword extraction process in detail.
The data content is: "An instance analysis of mining design industry development thinking under the new situation: This paper reviews the ten-year period of rapid growth of the coal industry and its profound influence on the mining design market. Against the background of the current rapid downturn of the coal industry economy and the intensified competition in the coal design market, and taking the development of the mining specialty of World Tech Design Institute as an example, it analyzes the human-resource and business variation features of the mining specialty, proposes development thinking and implementation measures for the mining specialty, and provides a reference for the development of the mining specialties of other design companies."
Here, "An instance analysis of mining design industry development thinking under the new situation" is the title of the paper, and the remaining content is its abstract.
Candidate lexical items are chosen via n-gram extraction and part-of-speech tagging; the candidate lexical items selected from the title of the paper serve as the theme lexical item set of the text. The selected candidate lexical items are shown in Table 6.
Table 6: Candidate lexical item results
The phrase vector representations of all lexical items in the theme lexical item set are obtained with the autoencoder, and the average of these phrase vectors is computed as the theme vector of the text. The theme vector of the document has dimension 400; some of its values are shown in Table 7.
Table 7: Theme vector of the document (partial values)
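The theme-vector computation above (an element-wise average of the theme lexical items' phrase vectors) can be sketched with NumPy; the toy 4-dimensional vectors stand in for the 400-dimensional autoencoder outputs:

```python
import numpy as np

def theme_vector(phrase_vectors):
    """Document theme vector: element-wise mean of the phrase vectors
    of all lexical items in the theme lexical item set."""
    return np.mean(np.stack(phrase_vectors), axis=0)

# toy 4-dimensional stand-ins for the 400-dimensional autoencoder outputs
vecs = [np.array([1.0, 0.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0, 0.0])]
print(theme_vector(vecs))  # → [2. 1. 0. 1.]
```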
For each candidate lexical item, the cosine similarity between its phrase vector and the theme vector of the text is computed and taken as its topic weight; some of the values are shown in Table 8.
Table 8: Topic weight results (partial)
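The topic-weight step, read here as a cosine similarity between a candidate's phrase vector and the theme vector, can be sketched as follows (function name illustrative):

```python
import numpy as np

def topic_weight(candidate_vec, theme_vec):
    """Topic weight of a candidate lexical item: cosine similarity between
    its phrase vector and the document theme vector."""
    den = float(np.linalg.norm(candidate_vec) * np.linalg.norm(theme_vec))
    return float(np.dot(candidate_vec, theme_vec)) / den if den else 0.0

v = np.array([1.0, 0.0])
t = np.array([1.0, 1.0])
print(round(topic_weight(v, t), 4))  # → 0.7071
```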
Taking the candidate lexical items as vertices and their co-occurrence information as edges, an undirected graph is constructed. Each edge in the graph is assigned a weight according to the cosine similarity between the vector representations of the two candidate lexical items and their co-occurrence count. Vertex weights are then computed through repeated iteration using the topic weights and the edge weights. After several iterations, each vertex in the graph converges to a stable score; some of the scores are shown in Table 9.
Table 9: Weighted score results (partial)
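The graph construction and iterative scoring described above can be sketched as a small weighted-TextRank routine. All names and the toy inputs are illustrative; the edge weight follows the text (cosine similarity of the two phrase vectors times the co-occurrence count), and the iteration combines topic weights with edge weights:

```python
import numpy as np

def rank_candidates(vectors, cooccur, topic_w, d=0.85, iters=50):
    """Weighted TextRank over candidate lexical items.

    vectors: {item: phrase vector}      (e.g. autoencoder outputs)
    cooccur: {(item_j, item_k): count}  (undirected co-occurrence counts)
    topic_w: {item: topic weight}       (similarity to the theme vector)
    Returns {item: score}.
    """
    items = list(vectors)
    w = {}
    for (a, b), cnt in cooccur.items():
        va, vb = vectors[a], vectors[b]
        den = float(np.linalg.norm(va) * np.linalg.norm(vb))
        sim = float(np.dot(va, vb)) / den if den else 0.0
        w[(a, b)] = w[(b, a)] = sim * cnt          # undirected edge weight
    neighbors = {i: [j for j in items if (i, j) in w] for i in items}
    score = {i: 1.0 for i in items}                # uniform initial scores
    for _ in range(iters):
        score = {
            j: (1 - d) * topic_w.get(j, 1.0)
               + d * sum(w[(j, k)] / sum(w[(k, p)] for p in neighbors[k]) * score[k]
                         for k in neighbors[j])
            for j in items
        }
    return score

vecs = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 1.0]), "c": np.array([0.0, 1.0])}
counts = {("a", "b"): 2, ("b", "c"): 1}
tw = {"a": 0.5, "b": 0.9, "c": 0.4}
scores = rank_candidates(vecs, counts, tw)
```

The final keywords would then be the top-scoring items, e.g. `sorted(scores, key=scores.get, reverse=True)[:10]`.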
The resulting scores are sorted, and the 10 highest-scoring (Top-10) candidate lexical items are taken as the final keywords, as shown in Table 10.
Table 10: Keyword extraction results (partial)
It should be noted that the terms "first" and "second" herein are used merely to distinguish entities or operations with identical names, and do not imply any order of, or relationship between, these entities or operations.
Those of ordinary skill in the art will appreciate that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.
Claims (9)
1. A keyword extraction method based on phrase vectors, characterized in that the method comprises:
S1: segmenting the text and tagging parts of speech, and retaining n-grams to obtain a candidate lexical item set;
S2: constructing a phrase vector for each candidate lexical item by means of an autoencoder;
S3: determining the theme of the text and computing the similarity between each candidate lexical item and the theme vector, taking the similarity as the topic weight of the candidate lexical item;
S4: obtaining keywords from the candidate lexical item set by the TextRank algorithm.
2. The method according to claim 1, characterized in that the autoencoder in step S2 comprises an encoder and a decoder, the encoder consisting of a bidirectional LSTM layer and a fully connected layer, and the decoder consisting of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, characterized in that the training method of the autoencoder in step S2 comprises the following steps:
S21: selecting training samples and obtaining candidate lexical items;
S22: for a candidate lexical item $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder, with a bidirectional LSTM, in the forward and backward directions respectively:
$$(\overrightarrow{h_t}, \overrightarrow{C_t}) = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}})$$
$$(\overleftarrow{h_t}, \overleftarrow{C_t}) = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}})$$
wherein $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden layer state and cell state at time $t$ $(t = 1, 2, \ldots, T)$ in the left-to-right and right-to-left directions respectively; $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the hidden layer states and cell states at the previous step in the two directions; $x_t$ is the word of the candidate lexical item input at time $t$; and $T$ is the number of words in the candidate lexical item;
S23: computing $ES_T$ in the encoder by the formulas:
$$h'_T = f(W_h h_T + b_h)$$
$$C'_T = f(W_c C_T + b_c)$$
wherein $h_T = \overrightarrow{h_T} \oplus \overleftarrow{h_T}$ and $C_T = \overrightarrow{C_T} \oplus \overleftarrow{C_T}$, $\oplus$ being the concatenation operator; $W_h, b_h, W_c, b_c$ are the parameter matrices and biases of the fully connected network; $f$ is the ReLU activation function of the fully connected network; and $ES_T$ is the tuple formed by $h'_T$ and $C'_T$;
S24: in the decoder, decoding with a unidirectional LSTM using $ES_T$ as the initial state:
$$z_t = \mathrm{LSTM}(y_{t-1}, z_{t-1}), \qquad z_0 = ES_T$$
wherein $z_t$ is the hidden layer state of the decoder at time $t$, $z_{t-1}$ is the hidden layer state at time $t-1$, $ES_T$ is the encoder state, and $y_{t-1}$ is the word of the candidate lexical item output at time $t-1$;
S25: estimating the probability of the current word according to $z_t$:
$$P(y_t) = \mathrm{softmax}(W_s z_t + b_s)$$
wherein $W_s z_t + b_s$ scores each possible output word, and softmax is the normalizing function;
S26: when the loss function $L$ keeps decreasing during training and finally stabilizes, obtaining the parameters $W_h, b_h, W_c, b_c$ of the encoder and the parameters $W_s, b_s$ of the decoder, thereby determining the autoencoder; wherein the loss function $L$ is computed as:
$$L = -\sum_{t=1}^{T} \log P(y_t = x_t)$$
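Step S23 can be illustrated with a NumPy sketch, assuming (as the claim suggests) that $h_T$ and $C_T$ are the concatenations of the final forward and backward states; function names, shapes, and values are illustrative:

```python
import numpy as np

def relu(x):
    """The ReLU activation f used in the fully connected layers."""
    return np.maximum(x, 0.0)

def encoder_state(h_fwd, h_bwd, C_fwd, C_bwd, Wh, bh, Wc, bc):
    """S23: concatenate the final forward/backward hidden and cell states
    (the connector in the claim) and project each through a fully connected
    layer with ReLU, giving the tuple ES_T = (h'_T, C'_T)."""
    h_T = np.concatenate([h_fwd, h_bwd])
    C_T = np.concatenate([C_fwd, C_bwd])
    return relu(Wh @ h_T + bh), relu(Wc @ C_T + bc)

# toy shapes: per-direction state size 2, projected size 2
rng = np.random.default_rng(0)
Wh, Wc = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
bh = bc = np.zeros(2)
h_p, C_p = encoder_state(np.ones(2), np.ones(2), np.ones(2), np.ones(2), Wh, bh, Wc, bc)
```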
4. The method according to claim 3, characterized in that, in step S2, the candidate lexical item is input into the autoencoder, and the value of $ES_T$ output by the encoder is the phrase vector of the candidate lexical item.
5. The method according to claim 1, characterized in that the theme vector in step S3 is computed as:
$$\vec{d_i} = \frac{1}{|T_{d_i}|} \sum_{t_j \in T_{d_i}} \vec{t_j}$$
wherein $\vec{t_j}$ is the vector representation of theme lexical item $t_j$, $T_{d_i}$ is the theme lexical item set of text $d_i$, and $\vec{d_i}$ is the theme vector representation of text $d_i$.
6. The method according to claim 1, characterized in that, in the TextRank algorithm of step S4, if candidate lexical items $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, whose weight is computed as:
$$w_{jk} = \mathrm{similarity}(c_j, c_k) \times \mathrm{occur\_count}(c_j, c_k)$$
wherein $\vec{c_j}$ and $\vec{c_k}$ are the vector representations of candidate lexical items $c_j$ and $c_k$; $\mathrm{occur\_count}(c_j, c_k)$ is the number of times $c_j$ and $c_k$ appear together in a co-occurrence window; $\mathrm{similarity}(c_j, c_k)$ is the similarity between $\vec{c_j}$ and $\vec{c_k}$; and $w_{jk}$ is the weight of the edge between $c_j$ and $c_k$.
7. The method according to claim 6, characterized in that the TextRank algorithm of step S4 further comprises iteratively computing the weight of each candidate lexical item until a maximum number of iterations is reached, the weight $WS(c_j)$ being computed as:
$$WS(c_j) = (1 - d)\, tw(c_j) + d \sum_{c_k \in adj(c_j)} \frac{w_{jk}}{\sum_{c_p \in adj(c_k)} w_{kp}}\, WS(c_k)$$
wherein $WS(c_j)$ is the weight of candidate lexical item $c_j$; $d$ is the damping coefficient, preferably $d = 0.85$; $tw(c_j)$ is the topic weight of candidate lexical item $c_j$; $w_{jk}$ is the weight of the edge between candidate lexical items $c_j$ and $c_k$; $w_{kp}$ is the weight of the edge between candidate lexical items $c_k$ and $c_p$; $adj(c_j)$ is the set of candidate lexical items connected to $c_j$, of which $c_k$ is an element; $adj(c_k)$ is the set of candidate lexical items connected to $c_k$, of which $c_p$ is an element; and $WS(c_k)$ is the weight of candidate lexical item $c_k$.
8. A keyword extraction system based on phrase vectors, characterized in that the system comprises:
a text preprocessing module, for segmenting the original text, tagging parts of speech, and retaining n-grams according to part of speech to obtain a candidate lexical item set;
a phrase vector construction module, for obtaining, for a candidate lexical item $c_j = (x_1, x_2, \ldots, x_T)$, a phrase vector with semantic representation by means of an autoencoder;
a topic weight computation module, for computing the topic weight of each candidate lexical item;
a candidate word ranking module, for computing a weighted score for each candidate lexical item and taking the Top-K candidate lexical items as keywords.
9. The system according to claim 8, characterized in that the system further comprises an autoencoder training module, for obtaining the parameters of the autoencoder through sample training, thereby determining the autoencoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910548261.XA CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263343A true CN110263343A (en) | 2019-09-20 |
CN110263343B CN110263343B (en) | 2021-06-15 |
Family
ID=67920847
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910548261.XA Active CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263343B (en) |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080017686A (en) * | 2006-08-22 | 2008-02-27 | 에스케이커뮤니케이션즈 주식회사 | Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded |
US8019708B2 (en) * | 2007-12-05 | 2011-09-13 | Yahoo! Inc. | Methods and apparatus for computing graph similarity via signature similarity |
CN103744835A (en) * | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extracting method based on subject model |
KR101656245B1 (en) * | 2015-09-09 | 2016-09-09 | 주식회사 위버플 | Method and system for extracting sentences |
CN106372064A (en) * | 2016-11-18 | 2017-02-01 | 北京工业大学 | Characteristic word weight calculating method for text mining |
CN106970910A (en) * | 2017-03-31 | 2017-07-21 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN106997382A (en) * | 2017-03-22 | 2017-08-01 | 山东大学 | Innovation intention label automatic marking method and system based on big data |
CN107122413A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107133213A (en) * | 2017-05-06 | 2017-09-05 | 广东药科大学 | A kind of text snippet extraction method and system based on algorithm |
CN107193803A (en) * | 2017-05-26 | 2017-09-22 | 北京东方科诺科技发展有限公司 | A kind of particular task text key word extracting method based on semanteme |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN108460019A (en) * | 2018-02-28 | 2018-08-28 | 福州大学 | A kind of emerging much-talked-about topic detecting system based on attention mechanism |
CN108710611A (en) * | 2018-05-17 | 2018-10-26 | 南京大学 | A kind of short text topic model generation method of word-based network and term vector |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109726394A (en) * | 2018-12-18 | 2019-05-07 | 电子科技大学 | Short text Subject Clustering method based on fusion BTM model |
CN109918660A (en) * | 2019-03-04 | 2019-06-21 | 北京邮电大学 | A kind of keyword extracting method and device based on TextRank |
CN109918510A (en) * | 2019-03-26 | 2019-06-21 | 中国科学技术大学 | Cross-cutting keyword extracting method |
Non-Patent Citations (5)
Title |
---|
BASALDELLA MARCO et al.: "Bidirectional LSTM recurrent neural network for keyphrase extraction", Italian Research Conference on Digital Libraries *
张莉婧 et al.: "Keyword extraction algorithm based on improved TextRank", 《北京印刷学院学报》 (Journal of Beijing Institute of Graphic Communication) *
李航 et al.: "A TextRank keyword extraction method fusing multiple features", 《情报杂志》 (Journal of Intelligence) *
洪冬梅: "Research on automatic text summarization technology based on LSTM", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *
齐翌辰 et al.: "Application of Chinese extractive summarization methods based on deep learning", 《科教导刊》 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111274428A (en) * | 2019-12-19 | 2020-06-12 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111274428B (en) * | 2019-12-19 | 2023-06-30 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111222333A (en) * | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | Keyword extraction method based on fusion of network high-order structure and topic model |
CN111785254A (en) * | 2020-07-24 | 2020-10-16 | 四川大学华西医院 | Self-service BLS training and checking system based on anthropomorphic dummy |
CN111785254B (en) * | 2020-07-24 | 2023-04-07 | 四川大学华西医院 | Self-service BLS training and checking system based on anthropomorphic dummy |
CN112818686A (en) * | 2021-03-23 | 2021-05-18 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment |
CN112818686B (en) * | 2021-03-23 | 2023-10-31 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment |
CN113312532A (en) * | 2021-06-01 | 2021-08-27 | 哈尔滨工业大学 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
Also Published As
Publication number | Publication date |
---|---|
CN110263343B (en) | 2021-06-15 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |