CN110263343A - Phrase vector-based keyword extraction method and system - Google Patents

Phrase vector-based keyword extraction method and system

Info

Publication number
CN110263343A
CN110263343A (application CN201910548261.XA)
Authority
CN
China
Prior art keywords
lexical item
candidate
candidate lexical
vector
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910548261.XA
Other languages
Chinese (zh)
Other versions
CN110263343B (en)
Inventor
孙新
赵永妍
申长虹
杨凯歌
张颖捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201910548261.XA priority Critical patent/CN110263343B/en
Publication of CN110263343A publication Critical patent/CN110263343A/en
Application granted granted Critical
Publication of CN110263343B publication Critical patent/CN110263343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the fields of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors. The main technical scheme of the invention includes: segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set; building vector representations for the large number of phrases contained in the candidate keyword set; computing the topic weight of each candidate term; constructing a graph with candidate terms as vertices and the co-occurrence information of candidate terms as edges, computing the edge weights from the semantic similarity and co-occurrence information between candidate terms, and iteratively computing the score of each candidate term and ranking the terms. The keyword extraction method and system provided by the invention both introduce the topic information of the document and introduce contextual information through the semantic similarity between phrases, so they better capture the words that carry the main content of the full text; semantic precision is high and the range of application is wide.

Description

Phrase vector-based keyword extraction method and system
Technical field
The present invention relates to the technical fields of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors.
Background art
In recent years, while massive data has brought great convenience, it has likewise brought enormous challenges to the analysis and retrieval of data. Against the background of big data, how to rapidly obtain the key information needed from massive data has become a problem people urgently need to solve. Keyword extraction refers to automatically extracting, by algorithm, the important and topical words or phrases from a document. In scientific literature, keywords or key phrases can help users quickly understand the content of a paper. At the same time, keywords or key phrases can also be used as search terms in information retrieval, natural language processing and text mining. In the keyword extraction task, word vectors, which encode the semantics of words, have been applied and have achieved good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these nouns are often not single words but phrases; word vectors alone therefore cannot meet the needs of the keyword extraction task, and vector representations must be built for phrases.
At present, some scholars have proposed building phrase vectors from word vectors using an autoencoder. An autoencoder (Auto Encoder) structurally consists of only two parts, an encoder and a decoder. When an autoencoder is used to combine word vectors into a phrase vector, the representation of each word in the phrase is fed into the encoder part, which compresses them into an intermediate hidden-layer vector; the decoder part then parses this hidden vector back into the input phrase, and the intermediate vector can be regarded as a phrase-vector representation containing semantic information. However, a traditional autoencoder encodes and decodes with a basic fully-connected network, in which adjacent layers are fully connected and the nodes within a layer are not connected; such a plain autoencoder network cannot handle the sequence information present in structures such as phrases.
In addition, existing algorithms compute only the semantic similarity of words via word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to build a graph from the candidate terms in a document, construct edges from the co-occurrence relations of candidate terms in the document, iteratively compute weights through mutual voting between candidate terms, and finally rank the candidate terms by score to determine the extracted keywords. In traditional TextRank, the initial weight of each vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of each edge is also set to 1; that is, the votes of each vertex are distributed evenly among the vertices connected to it. Although this method is simple and convenient, it not only ignores the topicality of the document but also fails to consider the semantic relations between vertices.
In a recurrent neural network (Recurrent Neural Network, RNN), the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suited to encoding sequence data. However, in the propagation process of an RNN, the forgetting of historical information and the accumulation of error are major problems, which nowadays are usually mitigated with a long short-term memory network (Long Short-Term Memory, LSTM).
The LSTM is a specific type of RNN. It records information with a cell state; during sequence propagation the cell state undergoes only a small amount of linear interaction, so historical information is better preserved. The LSTM then protects and controls the cell state through a gating mechanism. The gating mechanism is an abstract concept; in a concrete implementation it consists of a sigmoid function and pointwise multiplication. The gating mechanism controls the transmission of information by outputting a value between 0 and 1: an output value closer to 0 means less information is allowed through, and a value closer to 1 means more information is allowed through.
In an LSTM unit, the first thing to be processed is the information passed over from the previous step; the LSTM controls the forgetting and retention of historical information through the forget gate. The forget gate $f_t$ decides, according to the current information, whether to forget the earlier information, with the following formula:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $\sigma$ denotes the sigmoid function, and $W_f$ and $b_f$ denote the weight matrix and bias of the forget gate, respectively.
What the LSTM processes next is the current input: the input gate first controls which part of the current input information will be retained, after which a candidate cell state $\tilde{C}_t$ is created with a tanh function and the information of the current node is added to the cell state:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Through the forget gate and the input gate, the LSTM can decide which past information should be kept and which current information should be stored, so as to compute the current cell state $C_t$:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
Finally, using the historical information and the current input, the LSTM determines through the output gate, using a sigmoid function, the information that needs to be output at the current time step; similar to the input state, the output state is also filtered with a tanh function:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$
Through this clever gating mechanism, a long short-term memory network can remember earlier information while also avoiding the "vanishing gradient" problem.
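By way of illustration only, the gate computations above can be sketched as a single LSTM time step in Python with NumPy; the dimensions, the random initialization and the dictionary layout of the parameters are illustrative assumptions, not part of the invention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following the gate formulas above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate f_t
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate i_t
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state C_t
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate o_t
    h_t = o_t * np.tanh(C_t)                 # new hidden state h_t
    return h_t, C_t

# Illustrative sizes: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
W = {g: rng.normal(size=(dim_h, dim_h + dim_x)) for g in "fiCo"}
b = {g: np.zeros(dim_h) for g in "fiCo"}
h, C = np.zeros(dim_h), np.zeros(dim_h)
for x_t in rng.normal(size=(5, dim_x)):      # a length-5 input sequence
    h, C = lstm_step(x_t, h, C, W, b)
```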
Summary of the invention
In order to solve the two problems that word vectors cannot meet the needs of the keyword extraction task and that existing algorithms ignore the topic information of the text, the present invention provides a keyword extraction method and system based on phrase vectors.
To achieve the above object, in a first aspect, the present invention provides a phrase vector-based keyword extraction method, the method comprising:
S1, segmenting the text and tagging parts of speech, and retaining n-grams to obtain a candidate term set;
S2, constructing a phrase vector for each candidate term by means of an autoencoder;
S3, determining the topic of the text, computing the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term;
S4, obtaining keywords from the candidate term set by means of a TextRank algorithm.
Further, the autoencoder in step S2 includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder part is composed of a unidirectional LSTM layer and a softmax layer.
Further, the autoencoder in step S2 includes an encoder and a decoder, and its training method includes the following steps:
S21, selecting training samples and obtaining candidate terms;
S22, for a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder with a bidirectional LSTM from the forward and backward directions respectively:

$$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}, x_t)$$
$$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}, x_t)$$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden state and cell state at time $t$ ($t = 1, 2, \ldots, T$) in the left-to-right and right-to-left directions respectively; $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the hidden state and cell state at time $t-1$ in the two directions; $x_t$ is the word of the candidate term input at time $t$; and $T$ denotes the number of words in the candidate term;
S23, in the encoder, computing $ES_T$ by the formulas:

$$h_T = \overrightarrow{h_T} \oplus \overleftarrow{h_T}, \qquad C_T = \overrightarrow{C_T} \oplus \overleftarrow{C_T}$$
$$h'_T = f(W_h h_T + b_h)$$
$$C'_T = f(W_c C_T + b_c)$$
$$ES_T = (h'_T, C'_T)$$

where $\oplus$ is the concatenation operator; $W_h$, $b_h$, $W_c$, $b_c$ represent the parameter matrices and biases of the fully-connected network; $f$ denotes the ReLU activation function of the fully-connected network; and $ES_T$ is the tuple composed of $h'_T$ and $C'_T$;
S24, in the decoder part, decoding with a unidirectional LSTM using $ES_T$ as the initial state:

$$z_t = \mathrm{LSTM}(z_{t-1}, y_{t-1}), \qquad z_0 = ES_T$$

where $z_t$ is the hidden state of the decoder at time $t$; $z_{t-1}$ is the hidden state at time $t-1$; $ES_T$ is the encoder state; and $y_{t-1}$ is the word of the candidate term output at time $t-1$;
S25, estimating the probability of the current word from $z_t$:

$$p(y_t) = \mathrm{softmax}(W_s z_t + b_s)$$

where $W_s z_t + b_s$ scores each possible output word and softmax is the normalization function;
S26, when the loss function L keeps decreasing during training and finally stabilizes, obtaining the encoder parameters $W_h$, $b_h$, $W_c$, $b_c$ and the decoder parameters $W_s$, $b_s$, thereby determining the autoencoder; wherein the loss function L is computed as:

$$L = -\sum_{t=1}^{T} \log p(y_t)$$
Further, in step S2, the candidate term is input into the autoencoder, and the value in $ES_T$ output by the encoder is the phrase vector of the candidate term.
Further, the topic vector $\vec{v}_{d_i}$ in step S3 is computed as:

$$\vec{v}_{d_i} = \frac{1}{n} \sum_{k=1}^{n} \vec{v}_{t_k}$$

where $\vec{v}_{t_k}$ is the vector representation corresponding to topic term $t_k$, $\vec{v}_{d_i}$ is the topic-vector representation of text $d_i$, and $n$ is the number of topic terms.
Further, in the TextRank algorithm of step S4, if candidate terms $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, and the weight of the edge is computed as:

$$w_{jk} = \mathrm{similarity}(c_j, c_k) \times \mathrm{occur}_{count}(c_j, c_k)$$

where $\mathrm{similarity}(c_j, c_k)$ is the cosine similarity between the vector representations $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ of candidate terms $c_j$ and $c_k$, $\mathrm{occur}_{count}(c_j, c_k)$ denotes the number of times $c_j$ and $c_k$ appear together in the co-occurrence window, and $w_{jk}$ represents the weight of the edge between $c_j$ and $c_k$.
Further, the TextRank algorithm of step S4 also includes iteratively computing vertex weights, comprising the following step:
iteratively computing the weight of each candidate term until the maximum number of iterations is reached, the weighted score $WS(c_j)$ being computed as:

$$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in \mathrm{adj}(c_j)} \frac{w_{jk}}{\sum_{c_p \in \mathrm{adj}(c_k)} w_{kp}}\, WS(c_k)$$

where $WS(c_j)$ denotes the score of candidate term $c_j$; $d$ is the damping coefficient, preferably $d = 0.85$; $\theta_{c_j}$ is the topic weight of candidate term $c_j$; $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$; $w_{kp}$ is the weight of the edge between candidate terms $c_k$ and $c_p$; $\mathrm{adj}(c_j)$ denotes the set of candidate terms connected to $c_j$, of which $c_k$ is an element; similarly, $\mathrm{adj}(c_k)$ denotes the set of candidate terms connected to $c_k$, of which $c_p$ is an element.
In a second aspect, the present invention provides a phrase vector-based keyword extraction system, the system comprising: a text preprocessing module, configured to segment the original text and tag parts of speech, retain n-grams according to part of speech, and obtain a candidate term set;
a phrase vector construction module, configured to obtain, for a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a semantically expressive phrase vector by means of the autoencoder;
a topic weight computation module, configured to compute the topic weight of each candidate term;
a candidate word ranking module, configured to compute a weighted score for each candidate term and take the top-K candidate terms as keywords.
Further, the system also includes an autoencoder training module, configured to obtain the parameters of the autoencoder through sample training, thereby determining the autoencoder.
Compared with existing keyword extraction methods and systems, the phrase vector-based keyword extraction method and system provided by the invention have the following beneficial effects:
1. The keyword extraction method and system provided by the invention both introduce the topic information of the document and introduce contextual information through the semantic similarity between phrases; they can thus better capture the words that carry the main content of the full text and make the extracted keywords more accurate.
2. The keyword extraction method and system provided by the invention obtain keywords using phrase vectors, making the computation process concise and efficient.
3. The phrase vector computation method provided by the invention innovatively introduces an LSTM-based autoencoder to compress word vectors; it can better represent the semantic information of phrases, with higher semantic precision and a wider range of application.
4. The present invention improves the TextRank algorithm, innovatively using phrase vectors to compute a topic weight for each candidate term and computing the edge weights jointly from the semantic similarity and co-occurrence information between candidate terms; it can thus take the topic of the entire document into account and introduce semantic information between vertices, making the ranking algorithm more accurate.
Detailed description of the invention
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a structural schematic diagram of the autoencoder of one embodiment of the invention;
Fig. 2 is a flowchart of the phrase vector-based keyword extraction method of one embodiment of the invention.
Specific embodiment
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The present invention will be further explained below with reference to the accompanying drawings and specific embodiments.
In order to more clearly understand the technical solutions and advantages in the examples of the present application, exemplary embodiments of the application are described in more detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the application, not an exhaustion of all embodiments. It should be noted that, in the absence of conflict, the examples in the application may be combined with each other.
The present invention provides a phrase vector-based keyword extraction method; as shown in Fig. 2, the method includes the following steps:
S1, segmenting the original text $d_i$ and tagging parts of speech, and retaining n-grams according to part of speech to obtain a candidate term set.
S2, for each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, obtaining the phrase-vector representation of the candidate term through the autoencoder, where $x_i$ is the word-vector representation of the i-th word in candidate term $c_j$ and $T$ denotes the number of words in the candidate term.
S3, computing the similarity between each candidate term $c_j$ and the topic vector $\vec{v}_{d_i}$ as its topic weight $\theta_{c_j}$, where $d_i$ denotes the i-th document. The autoencoder includes an encoder and a decoder; the encoder part is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder part is composed of a unidirectional LSTM layer and a softmax layer.
S4, obtaining keywords from the candidate term set through the improved TextRank algorithm.
In step S2, in the encoder, each input candidate term $c_j$ is processed with a bidirectional LSTM from the forward and backward directions separately; the hidden state $h_T$ and cell state $C_T$ of the last time step are taken as the final states and concatenated, and the output $ES_T$ of the encoding layer is finally obtained through a fully-connected layer.
In the decoder, with $ES_T$ as the initial input, decoding is performed with a unidirectional LSTM structure, and the probability distribution of each decoding step is obtained through the softmax layer; finally, the probability of decoding the correct word at each step is maximized through the loss function L.
The purpose of training is to optimize the parameters of the autoencoder so that the decoder, taking the output of the encoder as its input, restores to the greatest possible extent the semantic information of the candidate term fed to the encoder.
The specific training method is as follows:
(1) Select training samples; then, as in S1, segment the samples and perform the related operations to obtain a candidate term set.
A candidate term is denoted $c_j = (x_1, x_2, \ldots, x_T)$, where $x_i$ is the word-vector representation of the i-th word in candidate term $c_j$ and $T$ denotes the number of words in the candidate term. Taking the candidate term $c_j$ = "Beijing Institute of Technology" as an example, $x_1$ is the word vector corresponding to "Beijing", $x_2$ is the word vector corresponding to "science and engineering", and $x_3$ is the word vector corresponding to "university".
(2) Train the model with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" as an example, the input is the word-vector representations corresponding to "Beijing", "science and engineering" and "university"; encoding yields the phrase-vector representation of "Beijing Institute of Technology", and decoding that phrase vector yields in turn the probability values of the decoded sequence "Beijing", "science and engineering", "university", which training maximizes.
For each candidate term $c_j = (x_1, x_2, \ldots, x_T)$, in the encoder part, the encoder computes with a bidirectional LSTM from the forward and backward directions respectively:

$$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}, x_t)$$
$$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}, x_t)$$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden state and cell state at time $t$ ($t = 1, 2, \ldots, T$) in the left-to-right and right-to-left directions respectively, $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the hidden state and cell state at time $t-1$ in the two directions, and $x_t$ is the word of the candidate term input at time $t$. At each time step, the computation of the current hidden state $h_t$ and cell state $C_t$ depends on the hidden state $h_{t-1}$ and cell state $C_{t-1}$ of the previous time step and on the current input $x_t$.
The hidden state $h_T$ and cell state $C_T$ of the last time step are taken as the final states, and the states in the two directions are concatenated directly. Besides providing an input of fixed size to the decoding layer, the concatenated states also need to be processed through a fully-connected layer. The following formulas yield the fixed-size decoder input $ES_T$:

$$h_T = \overrightarrow{h_T} \oplus \overleftarrow{h_T}, \qquad C_T = \overrightarrow{C_T} \oplus \overleftarrow{C_T}$$
$$h'_T = f(W_h h_T + b_h)$$
$$C'_T = f(W_c C_T + b_c)$$
$$ES_T = (h'_T, C'_T)$$

where $\oplus$ is the concatenation operator, $W_h$, $b_h$, $W_c$, $b_c$ represent the parameter matrices and biases of the fully-connected network, $f$ denotes the ReLU activation function of the fully-connected network, and $ES_T$ is the tuple composed of $h'_T$ and $C'_T$ that is finally provided to the decoder.
In the decoder part, decoding is performed with a unidirectional LSTM using $ES_T$ as the initial state:

$$z_t = \mathrm{LSTM}(z_{t-1}, y_{t-1}), \qquad z_0 = ES_T$$

where $z_t$ is the hidden state of the decoder at time $t$, $z_{t-1}$ is the hidden state at time $t-1$, $ES_T$ is the encoder state, and $y_{t-1}$ is the word of the candidate term output at time $t-1$.
The probability of the current word is estimated from $z_t$:

$$p(y_t) = \mathrm{softmax}(W_s z_t + b_s)$$

where $W_s$ is a parameter matrix and $z_t$ is the hidden state of the decoder at time $t$; $W_s z_t + b_s$ scores each possible output word, and softmax normalizes the scores to obtain the probability $p(y_t)$ of each word $y_t$.
The training objective of the autoencoder is to maximize the probability of outputting the correct phrase: the autoencoder outputs the probability corresponding to each word, and the training objective is to maximize the probability of outputting the correct word, i.e., training proceeds according to the loss function L. Training adjusts the parameters of the autoencoder (including the parameters inside the LSTMs, the encoder parameters $W_h$, $b_h$, $W_c$, $b_c$, and the decoder parameters $W_s$, $b_s$). When the loss function keeps decreasing during training and finally stabilizes, the intermediate vector can be said to represent the phrase semantics well, and we can take the intermediate vector as the phrase vector. The loss function L is computed as follows:

$$L = -\sum_{t=1}^{T} \log p(y_t)$$
After the autoencoder is trained, the value of the loss function stabilizes. At this point training of the autoencoder is complete: a candidate term is fed into the encoder of the autoencoder, and the value in $ES_T$ is the phrase vector. Through the autoencoder constructed above, the word vectors are compressed using the information in the candidate term sequence, yielding the phrase-vector representation of the candidate term.
After training of the autoencoder is complete, whenever the phrase-vector representation of a candidate term is needed, only the encoder part need be used for computation to obtain the phrase-vector representation $ES_T$ of the candidate term; the resulting $ES_T$ considers the candidate term as a whole and captures its semantic information.
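For concreteness, the encoder-decoder described above can be sketched in PyTorch as follows. The vocabulary size, the dimensions, the teacher-forced decoding and the single-phrase training step are illustrative assumptions, not the exact embodiment:

```python
import torch
import torch.nn as nn

class PhraseAutoencoder(nn.Module):
    """Sketch of the phrase autoencoder above: a bidirectional-LSTM encoder
    whose final states pass through fully-connected ReLU layers to form
    ES_T = (h'_T, C'_T), and a unidirectional-LSTM decoder scored by a
    softmax output layer (W_s z_t + b_s)."""

    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.fc_h = nn.Linear(2 * hid_dim, hid_dim)   # W_h, b_h
        self.fc_c = nn.Linear(2 * hid_dim, hid_dim)   # W_c, b_c
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)     # W_s, b_s

    def encode(self, word_ids):
        emb = self.embed(word_ids)                    # (batch, T, emb_dim)
        _, (h_T, C_T) = self.encoder(emb)             # final states, both directions
        h_T = torch.cat([h_T[0], h_T[1]], dim=-1)     # concatenate directions
        C_T = torch.cat([C_T[0], C_T[1]], dim=-1)
        h_p = torch.relu(self.fc_h(h_T))              # h'_T = ReLU(W_h h_T + b_h)
        C_p = torch.relu(self.fc_c(C_T))              # C'_T = ReLU(W_c C_T + b_c)
        return h_p, C_p                               # ES_T, the phrase vector

    def forward(self, word_ids):
        h_p, C_p = self.encode(word_ids)
        # Teacher forcing (a simplification): the decoder sees the previous
        # gold word y_{t-1} at each step instead of its own sampled output.
        dec_in = self.embed(word_ids[:, :-1])
        z, _ = self.decoder(dec_in, (h_p.unsqueeze(0), C_p.unsqueeze(0)))
        return self.out(z)                            # scores W_s z_t + b_s

# One training step under L = -sum_t log p(y_t) (cross entropy on logits).
model = PhraseAutoencoder(vocab_size=5000)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
phrase = torch.tensor([[11, 42, 7]])                  # e.g. ids of a 3-word term
logits = model(phrase)
loss = loss_fn(logits.reshape(-1, 5000), phrase[:, 1:].reshape(-1))
loss.backward(); opt.step()
```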
In step S3, the topic weight computation process is as follows:
(1) Determine the topic term set: taking a topic sentence or paragraph of the text with concise generalizing power as representative, such as the title or abstract of a paper, determine the topic terms of the text and add them to the topic term set of the text, where $d_i$ denotes the i-th document and $n$ is the number of elements in the topic term set. For example, for "An instance analysis of development thinking for the mining design industry under the new situation", the topic term set could be "mining design", "development thinking", "instance analysis".
(2) Compute the topic vector: compute the average of the vectors corresponding to all words or phrases in the topic term set, and use it as the topic vector $\vec{v}_{d_i}$ of the document, representing the topic of the entire document:

$$\vec{v}_{d_i} = \frac{1}{n} \sum_{k=1}^{n} \vec{v}_{t_k}$$

where $\vec{v}_{t_k}$ is the vector representation corresponding to topic term $t_k$ and $\vec{v}_{d_i}$ is the topic-vector representation of document $d_i$.
(3) Compute the topic weights: for each candidate term $c_j$, compute the cosine distance between it and the topic vector $\vec{v}_{d_i}$ of document $d_i$ as its topic weight:

$$\theta_{c_j} = \cos(\vec{v}_{c_j}, \vec{v}_{d_i})$$

where $\theta_{c_j}$ is the topic weight of candidate term $c_j$ in document $d_i$, $\vec{v}_{c_j}$ is the vector representation of candidate term $c_j$, and cos denotes the cosine distance.
Through steps (1)-(3) above, each candidate term can be assigned a topic weight between 0 and 1. It should be noted that a topic weight of 1 indicates that the candidate term is closest to the topic of the text, while 0 indicates that the candidate term is far from the topic of the text.
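A minimal sketch of this three-step computation in Python follows; the data layout (phrase vectors as NumPy arrays keyed by term) is an illustrative assumption:

```python
import numpy as np

def topic_weights(candidate_vecs, topic_term_vecs):
    """Steps (1)-(3) above: average the topic-term vectors into the topic
    vector v_{d_i}, then score each candidate term by the cosine between
    its phrase vector and the topic vector (theta_{c_j})."""
    topic_vec = np.mean(topic_term_vecs, axis=0)       # v_{d_i}
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {term: cos(vec, topic_vec) for term, vec in candidate_vecs.items()}
```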
In step S4, an undirected graph is constructed with the candidate term set of document $d_i$ as vertices, the weighted score $WS(c_j)$ of each candidate term $c_j$ is computed, and the top-K (first K) candidate terms are taken as keywords. This is achieved by improving the TextRank algorithm, with the following specific process:
(1) Construct the undirected graph: an undirected graph is constructed with all elements of the candidate term set of document $d_i$ as vertices, where there is an edge between candidate terms $c_j$ and $c_k$ if they appear within a co-occurrence window of length n.
(2) Compute the edge weights: the edge weights are an improvement of the present invention. The computation relies on the same phrase vectors constructed by the autoencoder. According to the cosine distance $\mathrm{similarity}(c_j, c_k)$ between the vector representations of two candidate terms $c_j$ and $c_k$ and their co-occurrence count $\mathrm{occur}_{count}(c_j, c_k)$, each edge in the graph is assigned a weight $w_{jk}$:

$$w_{jk} = \mathrm{similarity}(c_j, c_k) \times \mathrm{occur}_{count}(c_j, c_k)$$

where $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ are the vector representations of candidate terms $c_j$ and $c_k$ respectively, cos denotes the cosine distance between vectors, and $\mathrm{occur}_{count}(c_j, c_k)$ denotes the number of times $c_j$ and $c_k$ appear together in the co-occurrence window; multiplying by the number of times the two words appear together reinforces their semantic relation, and $w_{jk}$ represents the weight of the edge between $c_j$ and $c_k$.
(3) Iteratively compute the vertex weights: the vertex weights are also an improvement of the present invention. The weight of each vertex in the graph is computed iteratively until the maximum number of iterations is reached; the weighted score $WS(c_j)$ is computed as follows:

$$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in \mathrm{adj}(c_j)} \frac{w_{jk}}{\sum_{c_p \in \mathrm{adj}(c_k)} w_{kp}}\, WS(c_k)$$

where $WS(c_j)$ denotes the weight of candidate term $c_j$ in document $d_i$, and $d$ is the damping coefficient, whose role is to give each vertex a certain probability of voting for other vertices, so that every vertex can have a non-zero score and the algorithm is guaranteed to converge after several iterations; its usual value is 0.85. $\theta_{c_j}$ is the topic weight of candidate term $c_j$ in document $d_i$, $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$, $w_{kp}$ is the weight of the edge between candidate terms $c_k$ and $c_p$, $\mathrm{adj}(c_j)$ denotes the set of candidate terms connected to $c_j$, of which $c_k$ is an element; similarly, $\mathrm{adj}(c_k)$ denotes the set of candidate terms connected to $c_k$, of which $c_p$ is an element. $WS(c_k)$ denotes the weight of candidate term $c_k$ in document $d_i$, and the latter half of the right side of the equation represents the votes cast for $c_j$ by the vertices connected to it.
(4) Rank the candidate terms: after several iterations, each vertex in the graph obtains a stable score; the candidate term set is sorted in descending order of weighted score $WS(c_j)$, and the top-K candidate terms are retained as the keywords of the document.
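Steps (3) and (4) can be sketched as follows; the graph representation, the fixed iteration count and the tiny example values are illustrative assumptions:

```python
def improved_textrank(edges, topic_w, d=0.85, iters=50):
    """Sketch of the improved TextRank above: scores biased by the topic
    weights theta_{c_j}, with votes weighted by the edge weights w_jk.
    `edges[(j, k)]` holds w_jk for an undirected edge; `topic_w[j]` holds
    theta_{c_j}. Both come from the earlier steps."""
    nodes = sorted(topic_w)
    adj = {j: {} for j in nodes}
    for (j, k), w in edges.items():                   # undirected: store both ways
        adj[j][k] = w
        adj[k][j] = w
    ws = {j: 1.0 for j in nodes}                      # initial scores
    for _ in range(iters):
        new_ws = {}
        for j in nodes:
            vote = sum(adj[j][k] / sum(adj[k].values()) * ws[k] for k in adj[j])
            new_ws[j] = (1 - d) * topic_w[j] + d * vote
        ws = new_ws
    return ws

# Tiny usage example with three candidate terms.
edges = {("mining design", "development thinking"): 1.2,
         ("mining design", "instance analysis"): 0.8}
topic_w = {"mining design": 0.9, "development thinking": 0.7,
           "instance analysis": 0.5}
scores = improved_textrank(edges, topic_w)
top_k = sorted(scores, key=scores.get, reverse=True)[:2]   # step (4): keep top-K
```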
Through the four steps S1 to S4 above, the keywords of a document can be extracted.
The present invention also provides a phrase vector-based keyword extraction system, comprising:
a text preprocessing module, configured to segment the original text and tag parts of speech, retain n-grams according to part of speech, and obtain a candidate term set;
a phrase vector construction module, configured to obtain, for a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a semantically expressive phrase vector by means of the autoencoder;
a topic weight computation module, configured to compute the topic weight of each candidate term, with the specific calculation method as described above;
a candidate word ranking module, configured to compute a weighted score for each candidate term and take the top-K candidate terms as keywords, with the specific selection method as described above.
Further, the system also includes an autoencoder training module, configured to handle the sequence information in the phrase structure and obtain the phrase-vector representations of candidate terms, with the training method as described above.
Taking the enterprise paper data in an enterprise paper database as an example, the specific phrase vector-based keyword extraction method is illustrated below.
The enterprise paper database contains enterprise papers on environmental protection and several other fields; the data include fields such as "title", "time", "abstract", "keywords", "English keywords" and "classification number". In the keyword extraction process, the "title" and "abstract" in the database are used as the text content, and the "keywords" serve as the labeled data for verifying the extraction results.
When training the autoencoder, the "keywords" field in the database is taken as the training data; some of the parameters used in the training process are shown in Table 1.
Table 1: Training parameter settings
Before keyword extraction, the labeled data are analyzed to determine some of the parameters of the algorithm. The data set contains 59,913 paper records, and each paper has on average 4.2 labeled keywords. First, the lengths of the labeled keywords, i.e., the number of words each keyword contains, are counted; the results are shown in Table 2. From Table 2 it can be found that the average length over all keywords is 1.98 and the lengths of most keywords lie between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254,376 keywords. Therefore, 1-grams, 2-grams and 3-grams are retained when selecting candidate terms from the text.
Then, the parts of speech of all words in the keywords are counted; the statistical results are shown in Table 3. Part-of-speech tagging is completed with the Jieba segmentation tool, and some of the part-of-speech tags are explained in Table 4. According to Table 3, the part-of-speech distribution of the words in keywords is not as concentrated as the length distribution, but it is still mainly concentrated on nouns, verbs and nominal verbs (verbs with noun function); these three parts of speech account for 73.1% of all word parts of speech. Therefore, nouns, verbs and nominal verbs in the text and their combinations are taken as candidate terms when selecting candidate terms, as sketched below.
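A minimal sketch of this candidate-term selection with the Jieba part-of-speech tagger follows; the prefix matching of tags and the function interface are illustrative assumptions:

```python
import jieba.posseg as pseg

KEPT_POS = ("n", "v", "vn")   # nouns, verbs, nominal verbs (Tables 3 and 4)

def candidate_terms(text, max_n=3):
    """Keep 1-, 2- and 3-grams (per Table 2) whose words all carry a kept tag.
    Tag prefixes are matched so that subtypes such as 'ns' also count as nouns."""
    words = [(p.word, p.flag) for p in pseg.cut(text)]
    candidates = set()
    for i in range(len(words)):
        for n in range(1, max_n + 1):
            gram = words[i:i + n]
            if len(gram) == n and all(tag.startswith(KEPT_POS) for _, tag in gram):
                candidates.add("".join(w for w, _ in gram))
    return candidates
```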
Table 2: Distribution of keyword lengths
Table 3: Distribution of word parts of speech
Table 4: Explanation of Jieba part-of-speech tags
Since the text content includes only the title and abstract of each paper, when computing the topic weights the title is taken as representative of the full-text topic, and candidate terms are extracted from the title to compute the topic vector of the text. In addition, the co-occurrence window size in candidate word ranking is initially set to 3, and the number of candidate words finally retained is 10, as shown in Table 5.
Table 5: Keyword extraction results (partial)
Preferably, the present invention takes one paper record from the enterprise paper database as an example and gives the specific keyword extraction process.
The data content is: "An instance analysis of development thinking for the mining design industry under the new situation. This paper reviews the ten-year period of rapid development of the coal industry and the profound influence it exerted on the mining design market. Against the background of the current rapid downturn of the coal industry economy and the competition in the coal design market, taking the development of the mining speciality of World Tech Design Institute as an example, it analyses the human-resource and business variation features of the mining speciality, proposes development thinking and implementation measures for the mining speciality, and provides a reference for the development of the mining specialities of other design companies."
Here, "An instance analysis of development thinking for the mining design industry under the new situation" is the title of the paper, and the remaining content is the abstract of the paper.
Candidate terms are selected via n-gram terms and part-of-speech tagging; the candidate terms selected from the title of the paper serve as the topic term set of the text, and the selected candidate terms are shown in Table 6.
Table 6: Candidate term results
The phrase-vector representations corresponding to all terms in the topic term set are obtained with the autoencoder, and the average of the phrase vectors corresponding to all terms in the topic term set is computed as the topic vector of the text; the topic vector computed for the document has size 400, and some of its values are shown in Table 7.
Table 7: Topic vector results (partial)
For each candidate term, the cosine distance between it and the topic vector of the text is computed to obtain its topic weight; some of the values are shown in Table 8.
Table 8: Topic weight results (partial)
With candidate terms as vertices and the co-occurrence information of candidate terms as edges, an undirected graph is constructed; each edge in the graph is assigned a weight according to the cosine distance between the vector representations of the two candidate terms and their co-occurrence count, and the vertex weights are computed over several iterations from the topic weights and the edge weights. After several iterations, each vertex in the graph obtains a stable score; some of the scores are shown in Table 9.
Table 9: Weighted score results (partial)
The resulting scores are sorted, and the Top-10 candidate terms with the highest scores are taken as the final keywords, as shown in Table 10.
Table 10: Keyword extraction results (partial)
It should be noted that "first" and "second" herein are used merely to distinguish entities or operations with identical names and do not imply any order or relationship between those entities or operations.
Those of ordinary skill in the art will appreciate that the above embodiments are only used to illustrate, not to limit, the technical solution of the present invention; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and such modifications or replacements do not remove the essence of the corresponding technical solutions from the scope defined by the claims of the present invention.

Claims (9)

1. A phrase vector-based keyword extraction method, characterized in that the method comprises:
S1, segmenting the text and tagging parts of speech, and retaining n-grams to obtain a candidate term set;
S2, constructing a phrase vector for each candidate term by means of an autoencoder;
S3, determining the topic of the text, computing the similarity between each candidate term and the topic vector, and taking the similarity as the topic weight of the candidate term;
S4, obtaining keywords from the candidate term set by means of a TextRank algorithm.
2. The method according to claim 1, characterized in that the autoencoder in step S2 includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder part is composed of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, characterized in that the training method of the autoencoder in step S2 includes the following steps:
S21, selecting training samples and obtaining candidate terms;
S22, for a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder with a bidirectional LSTM from the forward and backward directions respectively:

$$\overrightarrow{h_t}, \overrightarrow{C_t} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}, x_t)$$
$$\overleftarrow{h_t}, \overleftarrow{C_t} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}, x_t)$$

where $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden state and cell state at time $t$ ($t = 1, 2, \ldots, T$) in the left-to-right and right-to-left directions respectively; $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the hidden state and cell state at time $t-1$ in the two directions; $x_t$ is the word of the candidate term input at time $t$; and $T$ denotes the number of words in the candidate term;
S23, in the encoder, computing $ES_T$ by the formulas:

$$h_T = \overrightarrow{h_T} \oplus \overleftarrow{h_T}, \qquad C_T = \overrightarrow{C_T} \oplus \overleftarrow{C_T}$$
$$h'_T = f(W_h h_T + b_h)$$
$$C'_T = f(W_c C_T + b_c)$$
$$ES_T = (h'_T, C'_T)$$

where $\oplus$ is the concatenation operator; $W_h$, $b_h$, $W_c$, $b_c$ represent the parameter matrices and biases of the fully-connected network; $f$ denotes the ReLU activation function of the fully-connected network; and $ES_T$ is the tuple composed of $h'_T$ and $C'_T$;
S24, in the decoder part, decoding with a unidirectional LSTM using $ES_T$ as the initial state:

$$z_t = \mathrm{LSTM}(z_{t-1}, y_{t-1}), \qquad z_0 = ES_T$$

where $z_t$ is the hidden state of the decoder at time $t$; $z_{t-1}$ is the hidden state at time $t-1$; $ES_T$ is the encoder state; and $y_{t-1}$ is the word of the candidate term output at time $t-1$;
S25, estimating the probability $p(y_t)$ of the current word from $z_t$:

$$p(y_t) = \mathrm{softmax}(W_s z_t + b_s)$$

where $W_s z_t + b_s$ scores each possible output word and softmax is the normalization function;
S26, when the loss function L keeps decreasing during training and finally stabilizes, obtaining the encoder parameters $W_h$, $b_h$, $W_c$, $b_c$ and the decoder parameters $W_s$, $b_s$, thereby determining the autoencoder; wherein the loss function L is computed as:

$$L = -\sum_{t=1}^{T} \log p(y_t)$$
4. The method according to claim 3, characterized in that in step S2 the candidate term is input into the autoencoder, and the value in $ES_T$ output by the encoder is the phrase vector of the candidate term.
5. The method according to claim 1, characterized in that the topic vector $\vec{v}_{d_i}$ in step S3 is computed as:

$$\vec{v}_{d_i} = \frac{1}{n} \sum_{k=1}^{n} \vec{v}_{t_k}$$

where $\vec{v}_{t_k}$ is the vector representation corresponding to topic term $t_k$, $\vec{v}_{d_i}$ is the topic-vector representation of text $d_i$, and $n$ is the number of topic terms.
6. The method according to claim 1, characterized in that, in the TextRank algorithm of step S4, if candidate terms $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, and the weight of the edge is computed as:

$$w_{jk} = \mathrm{similarity}(c_j, c_k) \times \mathrm{occur}_{count}(c_j, c_k)$$

where $\mathrm{similarity}(c_j, c_k)$ is the cosine similarity between the vector representations $\vec{v}_{c_j}$ and $\vec{v}_{c_k}$ of candidate terms $c_j$ and $c_k$, $\mathrm{occur}_{count}(c_j, c_k)$ denotes the number of times $c_j$ and $c_k$ appear together in the co-occurrence window, and $w_{jk}$ represents the weight of the edge between $c_j$ and $c_k$.
7. The method according to claim 6, characterized in that the TextRank algorithm of step S4 further includes iteratively computing the weight of each candidate term until the maximum number of iterations is reached, the weight $WS(c_j)$ being computed as:

$$WS(c_j) = (1 - d)\,\theta_{c_j} + d \sum_{c_k \in \mathrm{adj}(c_j)} \frac{w_{jk}}{\sum_{c_p \in \mathrm{adj}(c_k)} w_{kp}}\, WS(c_k)$$

where $WS(c_j)$ denotes the weight of candidate term $c_j$; $d$ is the damping coefficient, preferably $d = 0.85$; $\theta_{c_j}$ is the topic weight of candidate term $c_j$; $w_{jk}$ is the weight of the edge between candidate terms $c_j$ and $c_k$; $w_{kp}$ is the weight of the edge between candidate terms $c_k$ and $c_p$; $\mathrm{adj}(c_j)$ denotes the set of candidate terms connected to $c_j$, of which $c_k$ is an element; $\mathrm{adj}(c_k)$ denotes the set of candidate terms connected to $c_k$, of which $c_p$ is an element; and $WS(c_k)$ denotes the weight of candidate term $c_k$.
8. A phrase vector-based keyword extraction system, characterized in that the system comprises:
a text preprocessing module, configured to segment the original text and tag parts of speech, retain n-grams according to part of speech, and obtain a candidate term set;
a phrase vector construction module, configured to obtain, for a candidate term $c_j = (x_1, x_2, \ldots, x_T)$, a semantically expressive phrase vector by means of an autoencoder;
a topic weight computation module, configured to compute the topic weight of each candidate term;
a candidate word ranking module, configured to compute a weighted score for each candidate term and take the top-K candidate terms as keywords.
9. The system according to claim 8, characterized in that the system further comprises an autoencoder training module, configured to obtain the parameters of the autoencoder through sample training, thereby determining the autoencoder.
CN201910548261.XA 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system Active CN110263343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910548261.XA CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Publications (2)

Publication Number Publication Date
CN110263343A true CN110263343A (en) 2019-09-20
CN110263343B CN110263343B (en) 2021-06-15

Family

ID=67920847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910548261.XA Active CN110263343B (en) 2019-06-24 2019-06-24 Phrase vector-based keyword extraction method and system

Country Status (1)

Country Link
CN (1) CN110263343B (en)


Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080017686A (en) * 2006-08-22 2008-02-27 에스케이커뮤니케이션즈 주식회사 Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded
US8019708B2 (en) * 2007-12-05 2011-09-13 Yahoo! Inc. Methods and apparatus for computing graph similarity via signature similarity
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
KR101656245B1 (en) * 2015-09-09 2016-09-09 주식회사 위버플 Method and system for extracting sentences
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN106970910A (en) * 2017-03-31 2017-07-21 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108710611A (en) * 2018-05-17 2018-10-26 南京大学 A kind of short text topic model generation method of word-based network and term vector
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN109918510A (en) * 2019-03-26 2019-06-21 中国科学技术大学 Cross-cutting keyword extracting method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BASALDELLA MARCO et al.: "Bidirectional LSTM recurrent neural network for keyphrase extraction", Italian Research Conference on Digital Libraries *
张莉婧 et al.: "Keyword extraction algorithm based on improved TextRank" (基于改进TextRank的关键词抽取算法), 《北京印刷学院学报》 (Journal of Beijing Institute of Graphic Communication) *
李航 et al.: "TextRank keyword extraction method fusing multiple features" (融合多特征的TextRank关键词抽取方法), 《情报杂志》 (Journal of Intelligence) *
洪冬梅: "Research on automatic text summarization technology based on LSTM" (基于LSTM的自动文本摘要技术研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *
齐翌辰 et al.: "Application of deep-learning-based Chinese extractive summarization methods" (基于深度学习的中文抽取式摘要方法应用), 《科教导刊》 (The Guide of Science & Education) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274428A (en) * 2019-12-19 2020-06-12 北京创鑫旅程网络技术有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111274428B (en) * 2019-12-19 2023-06-30 北京创鑫旅程网络技术有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111222333A (en) * 2020-04-22 2020-06-02 成都索贝数码科技股份有限公司 Keyword extraction method based on fusion of network high-order structure and topic model
CN111785254A (en) * 2020-07-24 2020-10-16 四川大学华西医院 Self-service BLS training and checking system based on anthropomorphic dummy
CN111785254B (en) * 2020-07-24 2023-04-07 四川大学华西医院 Self-service BLS training and checking system based on anthropomorphic dummy
CN112818686A (en) * 2021-03-23 2021-05-18 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN112818686B (en) * 2021-03-23 2023-10-31 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN113312532A (en) * 2021-06-01 2021-08-27 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Also Published As

Publication number Publication date
CN110263343B (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN110263343A (en) The keyword abstraction method and system of phrase-based vector
CN111310471B (en) Travel named entity identification method based on BBLC model
CN113010693A (en) Intelligent knowledge graph question-answering method fusing pointer to generate network
CN106991085B (en) Entity abbreviation generation method and device
CN111191002B (en) Neural code searching method and device based on hierarchical embedding
CN112328900A (en) Deep learning recommendation method integrating scoring matrix and comment text
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN110955776A (en) Construction method of government affair text classification model
CN110347796A (en) Short text similarity calculating method under vector semantic tensor space
CN113239148B (en) Scientific and technological resource retrieval method based on machine reading understanding
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
CN112989761A (en) Text classification method and device
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN113821635A (en) Text abstract generation method and system for financial field
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN113806528A (en) Topic detection method and device based on BERT model and storage medium
CN109325243A (en) Mongolian word cutting method and its word cutting system of the character level based on series model
CN112434512A (en) New word determining method and device in combination with context
CN111859955A (en) Public opinion data analysis model based on deep learning
CN116432755A (en) Weight network reasoning method based on dynamic entity prototype
CN115840815A (en) Automatic abstract generation method based on pointer key information
Weijie et al. Long text classification based on BERT
CN114741515A (en) Social network user attribute prediction method and system based on graph generation
CN113935308A (en) Method and system for automatically generating text abstract facing field of geoscience

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant