CN110263343A - Keyword extraction method and system based on phrase vectors - Google Patents
Keyword extraction method and system based on phrase vectors
- Publication number
- CN110263343A CN110263343A CN201910548261.XA CN201910548261A CN110263343A CN 110263343 A CN110263343 A CN 110263343A CN 201910548261 A CN201910548261 A CN 201910548261A CN 110263343 A CN110263343 A CN 110263343A
- Authority
- CN
- China
- Prior art keywords
- term
- candidate
- candidate term
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors. The main technical scheme of the invention includes: segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set; building vector representations for the many phrases contained in the candidate keyword set; calculating the topic weight of each candidate term; constructing a graph with candidate terms as vertices and the co-occurrence of candidate terms as edges, computing edge weights from the semantic similarity and co-occurrence information between candidate terms, and iteratively calculating and ranking the score of each candidate term. The keyword extraction method and system provided by the invention both introduce the topic information of the document and, through the semantic similarity between phrases, introduce contextual information, so they better capture the important words of the full text; semantic precision is high and the range of application is wide.
Description
Technical field
The present invention relates to the field of natural language processing and deep learning, and in particular to a keyword extraction method and system based on phrase vectors.
Background art
In recent years, massive data has brought great convenience, but it has likewise brought huge challenges to the analysis and retrieval of data. Against the background of big data, how to quickly obtain the key information needed from massive data has become an urgent problem. Keyword extraction refers to automatically extracting important, topical words or phrases from a document by an algorithm. In scientific literature, keywords or key phrases can help users quickly understand the content of a paper. Meanwhile, keywords or phrases also serve as search entries in information retrieval, natural language processing and text mining. In the keyword extraction task, word vectors, which encode the semantics of words, have been applied and have achieved good results. However, many professional papers, including enterprise papers, contain a large number of proper nouns, and these nouns are often not single words but phrases. Word vectors alone are therefore insufficient for the keyword extraction task, and vector representations need to be constructed for phrases.
Some scholars have proposed combining word vectors with an autoencoder to construct phrase vectors. An autoencoder (Auto Encoder) structurally has only two parts, an encoder and a decoder. When an autoencoder is used to combine word vectors into a phrase vector, the representation of each word in the phrase is fed into the encoder, which compresses them into an intermediate hidden-layer vector; the decoder then reconstructs the input phrase from this hidden vector, and the intermediate vector can be regarded as a phrase vector representation containing semantic information. However, a traditional autoencoder encodes and decodes directly with a basic fully-connected network, in which adjacent layers are fully connected but the nodes within a layer are not connected to one another; such an ordinary autoencoder network cannot handle the sequence information in a structure like a phrase.
In addition, existing algorithms compute only the semantic similarity of words via word vectors and ignore the topic information of the text. TextRank is a graph-based keyword extraction algorithm. Its basic idea is to build a graph from the candidate terms in a document, construct edges from the co-occurrence relations of the candidate terms in the document, iteratively compute weights through mutual voting between candidate terms, and finally rank the candidate terms by score to determine the keywords to extract. In traditional TextRank, the initial weight of each vertex in the graph is 1 (or 1/n, where n is the number of vertices), and the weight of each edge is also set to 1; that is, the votes of each vertex are distributed evenly among the vertices connected to it. Although this method is simple and convenient, it both ignores the topicality of the document and fails to consider the semantic relations between vertices.
In a recurrent neural network (Recurrent Neural Network, RNN), the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. RNNs are therefore suited to encoding sequence data. However, during propagation in an RNN, the forgetting of historical information and the accumulation of error are major problems, which nowadays are usually improved with the long short-term memory network (Long Short-Term Memory, LSTM).
The LSTM is a specific type of RNN. It records information in a cell state, which undergoes only a small amount of linear interaction during sequence propagation and can therefore better retain historical information. The LSTM then protects and controls the cell state with a gating mechanism. The gating mechanism is an abstract concept; in concrete implementation it consists of a sigmoid function and pointwise multiplication. It controls the transmission of information by outputting a value between 0 and 1: an output value closer to 0 means less information is allowed through, and a value closer to 1 means more information is allowed through.
In an LSTM unit, the first thing to be processed is the information passed over from the previous step. The LSTM controls the forgetting and retention of historical information through a forget gate. The forget gate f_t decides, according to the current information, whether to forget the previous information; the specific formula is as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where σ denotes the sigmoid function, and W_f and b_f denote the weight matrix and bias of the forget gate respectively.
The LSTM then processes the current input. An input gate first controls which part of the current input information will be retained; afterwards, a candidate cell state C̃_t is created with a tanh function, and the information of the current node is added to the cell state:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Through the forget gate and the input gate, the LSTM decides which past information should be kept and which current information should be stored, and computes the current cell state C_t:

C_t = f_t * C_{t-1} + i_t * C̃_t
Finally, according to the historical information and the current input, the LSTM determines through an output gate, using a sigmoid function, the information to output at the current time step; similar to the input, the output state is also filtered with a tanh function:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Through this clever gating mechanism, the long short-term memory network can remember earlier information while also avoiding the "vanishing gradient" problem.
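As a concrete illustration of the gate equations above, the following is a minimal single-step LSTM cell in pure Python with scalar states; the parameter layout (dicts W and b keyed by gate name) is purely illustrative and does not come from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with scalar states, following the gate
    equations above. W and b are dicts keyed by gate name ('f', 'i',
    'C', 'o'); each weight is a pair acting on [h_{t-1}, x_t]."""
    z = (h_prev, x_t)                          # concatenated input [h_{t-1}, x_t]
    dot = lambda w: w[0] * z[0] + w[1] * z[1]

    f_t = sigmoid(dot(W['f']) + b['f'])        # forget gate: how much past to keep
    i_t = sigmoid(dot(W['i']) + b['i'])        # input gate: how much input to keep
    c_tilde = math.tanh(dot(W['C']) + b['C'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # new cell state C_t
    o_t = sigmoid(dot(W['o']) + b['o'])        # output gate
    h_t = o_t * math.tanh(c_t)                 # new hidden state h_t
    return h_t, c_t
```

Since every gate passes through a sigmoid, its value lies strictly between 0 and 1, which is exactly the "how much information is allowed through" behaviour described above.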
Summary of the invention
In order to solve the two problems that word vectors are insufficient for the keyword extraction task and that existing algorithms ignore the topic information of the text, the present invention provides a keyword extraction method and system based on phrase vectors.
To achieve the above object, in a first aspect, the present invention provides a keyword extraction method based on phrase vectors, the method comprising:
S1, segmenting the text and tagging parts of speech, retaining n-grams to obtain a candidate term set;
S2, constructing a phrase vector for each candidate term through an autoencoder;
S3, determining the topic of the text and calculating the similarity between each candidate term and the topic vector, taking the similarity as the topic weight of the candidate term;
S4, obtaining keywords from the candidate term set through the TextRank algorithm.
Further, the autoencoder in step S2 includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder is composed of a unidirectional LSTM layer and a softmax layer.
Further, the autoencoder in step S2 includes an encoder and a decoder, and its training method includes the following steps:
S21, choosing training samples and obtaining candidate terms;
S22, for a candidate term c_j = (x_1, x_2, …, x_T), computing in the encoder with a bidirectional LSTM from both directions:

h_t^f, C_t^f = LSTM(x_t, h_{t-1}^f, C_{t-1}^f)   (left to right)
h_t^b, C_t^b = LSTM(x_t, h_{t-1}^b, C_{t-1}^b)   (right to left)

where h_t^f, C_t^f and h_t^b, C_t^b are the hidden state and cell state at time t (t = 1, 2, …, T) in the left-to-right and right-to-left directions respectively, h_{t-1}^f, C_{t-1}^f and h_{t-1}^b, C_{t-1}^b are the hidden states and cell states in the two directions at time t-1, and x_t is the word of the candidate term input at time t; T denotes the number of words in the candidate term;
S23, in the encoder, computing ES_T by the formulas:

h_T = h_T^f ⊕ h_T^b,  C_T = C_T^f ⊕ C_T^b
h′_T = f(W_h · h_T + b_h)
C′_T = f(W_c · C_T + b_c)
ES_T = (h′_T, C′_T)

where ⊕ is the concatenation operator, W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully-connected network, f denotes the ReLU activation function in the fully-connected network, and ES_T is the tuple formed by h′_T and C′_T;
S24, in the decoder, decoding with a unidirectional LSTM using ES_T as the initial state:

z_t = LSTM(z_{t-1}, y_{t-1}),  z_0 = ES_T

where z_t is the hidden state of the decoder at time t, z_{t-1} is the hidden state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1;
S25, estimating the probability of the current word according to z_t:

p(y_t) = softmax(W_s · z_t + b_s)

where W_s · z_t + b_s scores each possible output word, and softmax is the normalization function;
S26, when the loss function L keeps decreasing during training and finally stabilizes, obtaining the parameters W_h, b_h, W_c, b_c of the encoder and W_s, b_s of the decoder, thereby determining the autoencoder; wherein the loss function L is the negative log-likelihood of reconstructing the input words:

L = -Σ_{t=1..T} log p(y_t = x_t)
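The decoding probability of S25 and the loss of S26 can be sketched as follows in pure Python. The helper names softmax and reconstruction_loss are illustrative; the loss is written as the standard cross-entropy reconstruction objective, since the printed formula is illegible in this copy and the text only states the objective (maximize the probability of the correct words).

```python
import math

def softmax(scores):
    """Normalize the scores W_s*z_t + b_s into a probability
    distribution over the vocabulary (step S25)."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruction_loss(step_probs, target_ids):
    """Loss L of step S26, reconstructed as the negative log-likelihood
    of the correct word at each decoding step; step_probs holds one
    softmax distribution per step, target_ids the correct word index."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
```

Driving this loss down raises the probability the decoder assigns to the correct word at every step, which is exactly the training behaviour S26 describes.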
Further, in step S2, a candidate term is input into the autoencoder, and the value of ES_T output by the encoder is the phrase vector of the candidate term.
Further, the topic vector v(d_i) in step S3 is calculated as:

v(d_i) = (1/n) Σ_{k=1..n} v(t_k)

where v(t_k) is the vector representation corresponding to topic term t_k, and v(d_i) is the topic vector representation of text d_i.
Further, in the TextRank algorithm of step S4, if candidate terms c_j and c_k appear in the same co-occurrence window, there is an edge between c_j and c_k, and the weight of the edge is calculated as:

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)

where similarity(c_j, c_k) is the similarity between the vector representations v(c_j) and v(c_k) of candidate terms c_j and c_k, occur_count(c_j, c_k) is the number of times c_j and c_k appear together in a co-occurrence window, and w_jk represents the weight of the edge between c_j and c_k.
Further, the TextRank algorithm of step S4 also includes iteratively calculating vertex weights, comprising the following steps:
iteratively calculating the weight of each candidate term until the maximum number of iterations is reached, the weighted score WS(c_j) being calculated as:

WS(c_j) = (1 − d) · w_topic(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)

where WS(c_j) denotes the score of candidate term c_j; d is the damping coefficient, preferably d = 0.85; w_topic(c_j) is the topic weight of candidate term c_j; w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, of which c_k is an element, and similarly adj(c_k) denotes the set of candidate terms connected to c_k, of which c_p is an element.
In a second aspect, the present invention provides a keyword extraction system based on phrase vectors, the system comprising:
a text preprocessing module, for segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set;
a phrase vector construction module, for obtaining, for a candidate term c_j = (x_1, x_2, …, x_T), a semantically expressive phrase vector through the autoencoder;
a topic weight calculation module, for calculating the topic weights of candidate terms;
a candidate term ranking module, for calculating weighted scores for candidate terms and taking the top K candidate terms as keywords.
Further, the system also includes an autoencoder training module, for obtaining the parameters of the autoencoder through sample training, thereby determining the autoencoder.
Compared with existing keyword extraction methods and systems, the keyword extraction method and system based on phrase vectors provided by the invention have the following beneficial effects:
1. The keyword extraction method and system provided by the invention both introduce the topic information of the document and introduce contextual information through the semantic similarity between phrases, so they better capture the important words of the full text and make the extracted keywords more accurate.
2. The keyword extraction method and system provided by the invention obtain keywords using phrase vectors, which makes the calculation process concise and efficient.
3. The phrase vector calculation method provided by the invention innovatively introduces an LSTM-based autoencoder to compress word vectors; it can better express the semantic information of phrases, with higher semantic precision and a wider range of application.
4. The invention improves the TextRank algorithm, innovatively calculating a topic weight for each candidate term using phrase vectors and calculating edge weights jointly from the semantic similarity and co-occurrence information between candidate terms; it can thereby consider the topic of the entire document and introduce semantic information between vertices, making the ranking algorithm more accurate.
Description of the drawings
In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the disclosure; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a structural schematic diagram of the autoencoder of one embodiment of the invention;
Fig. 2 is a flow chart of the keyword extraction method based on phrase vectors of one embodiment of the invention.
Specific embodiments
In order to make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments of the invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the invention.
The present invention will be further explained below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the absence of conflict, the examples in this application can be combined with one another.
The present invention provides a keyword extraction method based on phrase vectors. As shown in Fig. 2, the method includes the following steps:
S1, segmenting the original text d_i and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set.
S2, for each candidate term c_j = (x_1, x_2, …, x_T), obtaining the phrase vector representation of the candidate term through the autoencoder, where x_i is the word vector representation of the i-th word in candidate term c_j and T is the number of words in the candidate term. The autoencoder includes an encoder and a decoder; the encoder is composed of a bidirectional LSTM layer and a fully-connected layer, and the decoder is composed of a unidirectional LSTM layer and a softmax layer.
S3, calculating the similarity between each candidate term c_j and the topic vector v(d_i) as its topic weight, where d_i denotes the i-th document.
S4, obtaining keywords from the candidate term set through an improved TextRank algorithm.
In step S2, in the encoder, each candidate term c_j to be input is processed with a bidirectional LSTM from both directions; the hidden state h_T and cell state C_T of the last time step are taken as the final state and concatenated, and the output ES_T of the encoding layer is finally obtained through a fully-connected layer.
In the decoder, with ES_T as the initial input, decoding is performed with a unidirectional LSTM structure; the probability distribution of each decoding step is obtained through a softmax layer, and the probability of decoding the correct word at each step is maximized through the loss function L. The purpose of training is to optimize the parameters of the autoencoder so that the decoder, taking the output of the encoder as input, restores the semantic information of the input candidate term to the greatest possible degree.
The specific training method is as follows:
(1) Choose training samples, then perform operations such as segmentation on the samples as in S1 to obtain a candidate term set. A candidate term is denoted c_j = (x_1, x_2, …, x_T), where x_i is the word vector representation of the i-th word in candidate term c_j and T is the number of words in the candidate term. Taking the candidate term c_j "Beijing Institute of Technology" as an example, x_1 is the word vector corresponding to "Beijing", x_2 is the word vector corresponding to "science and engineering", and x_3 is the word vector corresponding to "university".
(2) Train the model with a large number of candidate terms. Taking the candidate term "Beijing Institute of Technology" as an example, the input is the word vector representations corresponding to "Beijing", "science and engineering" and "university"; encoding yields the phrase vector representation of "Beijing Institute of Technology", and decoding that phrase vector yields, in order, the probability values corresponding to "Beijing", "science and engineering" and "university", which training maximizes.
For each candidate term c_j = (x_1, x_2, …, x_T), in the encoder part, the encoder computes with a bidirectional LSTM from both directions:

h_t^f, C_t^f = LSTM(x_t, h_{t-1}^f, C_{t-1}^f)   (left to right)
h_t^b, C_t^b = LSTM(x_t, h_{t-1}^b, C_{t-1}^b)   (right to left)

where h_t^f, C_t^f and h_t^b, C_t^b are the hidden state and cell state at time t (t = 1, 2, …, T) in the left-to-right and right-to-left directions respectively, h_{t-1}^f, C_{t-1}^f and h_{t-1}^b, C_{t-1}^b are the hidden states and cell states in the two directions at time t-1, and x_t is the word of the candidate term input at time t. At each time step, the current hidden state h_t and cell state C_t depend on the hidden state h_{t-1} and cell state C_{t-1} of the previous step and the current input x_t.
The hidden state h_T and cell state C_T of the last time step are taken as the final state, and the states of the two directions are directly concatenated. Besides providing an input of fixed size to the decoding layer, the concatenated states also need to be processed by a fully-connected layer. The following formulas give the fixed-size input ES_T of the decoder:

h_T = h_T^f ⊕ h_T^b,  C_T = C_T^f ⊕ C_T^b
h′_T = f(W_h · h_T + b_h)
C′_T = f(W_c · C_T + b_c)
ES_T = (h′_T, C′_T)

where ⊕ is the concatenation operator, W_h, b_h, W_c, b_c represent the parameter matrices and biases of the fully-connected network, f denotes the ReLU activation function in the fully-connected network, and ES_T is the tuple formed by h′_T and C′_T that is finally provided to the decoder.
In the decoder part, with ES_T as the initial state, decoding is performed with a unidirectional LSTM:

z_t = LSTM(z_{t-1}, y_{t-1}),  z_0 = ES_T

where z_t is the hidden state of the decoder at time t, z_{t-1} is the hidden state at time t-1, ES_T is the encoder state, and y_{t-1} is the word of the candidate term output at time t-1.
The probability of the current word is estimated according to z_t:

p(y_t) = softmax(W_s · z_t + b_s)

where W_s is a parameter matrix and z_t is the hidden state of the decoder at time t; W_s · z_t + b_s scores each possible output word, and softmax normalizes these scores into the probability of each word y_t.
The training objective of the autoencoder is to maximize the probability of outputting the correct phrase: the autoencoder outputs the probability of each word, and the training objective is to maximize the probability of outputting the correct words, i.e. to train according to the loss function L. Training adjusts the parameters of the autoencoder (including the parameters in the LSTMs, W_h, b_h, W_c, b_c in the encoder and W_s, b_s in the decoder). When the loss function keeps decreasing during training and finally stabilizes, the intermediate vector can be considered to represent the phrase semantics well, and we can take the intermediate vector as the phrase vector. The loss function L is the negative log-likelihood of reconstructing the input words:

L = -Σ_{t=1..T} log p(y_t = x_t)

After the autoencoder is trained, the loss function value stabilizes. At this point autoencoder training is complete; a candidate term is input into the encoder of the autoencoder, and the value of ES_T is the phrase vector. With the autoencoder constructed above, the information in the candidate term sequence is used to compress the word vectors, and the phrase vector representation of the candidate term is obtained.
After autoencoder training is complete, when the phrase vector representation of a candidate term is needed, only the encoding part needs to be computed to obtain the phrase vector representation ES_T of the candidate term; the resulting ES_T considers the candidate term as a whole and captures its semantic information.
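How the final states of the two LSTM directions are concatenated and passed through the ReLU fully-connected layer to produce ES_T can be sketched as follows, using plain Python lists in place of tensors; all function names here are illustrative, not from the patent.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encoder_output(h_fwd, h_bwd, c_fwd, c_bwd, Wh, bh, Wc, bc):
    """Build ES_T = (h'_T, C'_T) from the final states of both LSTM
    directions: concatenate them, then apply the fully-connected layer
    h'_T = ReLU(W_h * h_T + b_h), C'_T = ReLU(W_c * C_T + b_c)."""
    h_T = list(h_fwd) + list(h_bwd)          # concatenation of both directions
    C_T = list(c_fwd) + list(c_bwd)
    h_prime = relu([x + o for x, o in zip(matvec(Wh, h_T), bh)])
    c_prime = relu([x + o for x, o in zip(matvec(Wc, C_T), bc)])
    return h_prime, c_prime                  # ES_T, used as the phrase vector
```

In a real implementation the per-direction states would come from a trained bidirectional LSTM; here they are simply passed in, since only the combination step is being illustrated.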
In step S3, the topic weight calculation process is as follows:
(1) Determine the topic term set: take a sentence or paragraph that concisely summarizes the text as representative, such as the title or abstract of a paper, determine the topic terms of the text from it, and add them to the topic term set of the text: T(d_i) = {t_1, t_2, …, t_n}, where d_i denotes the i-th document and n is the number of elements in the topic term set. For example, for "Case analysis of development strategies for the mining design industry under the new situation", the topic term set can be "mining design", "development strategy", "case analysis".
(2) Calculate the topic vector: compute the average of the vectors corresponding to all the words or phrases in the topic term set T(d_i) as the topic vector v(d_i) of the document, used to represent the topic of the entire document:

v(d_i) = (1/n) Σ_{k=1..n} v(t_k)

where v(t_k) is the vector representation corresponding to topic term t_k, and v(d_i) is the topic vector representation of document d_i.
(3) Calculate the topic weight: for each candidate term c_j, compute the cosine distance between it and the topic vector v(d_i) of document d_i as its topic weight:

w_topic(c_j) = cos(v(c_j), v(d_i))

where w_topic(c_j) is the topic weight of candidate term c_j in document d_i, v(c_j) is the vector representation of candidate term c_j, and cos denotes cosine distance.
Through steps (1)–(3) above, each candidate term can be assigned a topic weight between 0 and 1. Note that a topic weight of 1 indicates that the candidate term is closest to the topic of the text, while 0 indicates that the candidate term is far from the topic of the text.
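Steps (1)–(3) can be sketched in pure Python as follows; the function names are illustrative, and toy lists stand in for the phrase vectors produced by the autoencoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_vector(topic_term_vectors):
    """Step (2): average the vectors v(t_k) of the topic terms to get
    the document topic vector v(d_i)."""
    n = len(topic_term_vectors)
    dim = len(topic_term_vectors[0])
    return [sum(vec[d] for vec in topic_term_vectors) / n for d in range(dim)]

def topic_weight(candidate_vec, doc_topic_vec):
    """Step (3): the topic weight of a candidate term is the cosine
    similarity between its phrase vector and the topic vector."""
    return cosine(candidate_vec, doc_topic_vec)
```

A candidate term whose vector points in the same direction as the topic vector gets a weight near 1; an orthogonal one gets a weight near 0, matching the interpretation given above.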
In step S4, an undirected graph is constructed with the candidate term set of document d_i as vertices, the weighted score WS(c_j) of each candidate term c_j is calculated, and the top K candidate terms are taken as keywords. This is realized by improving the TextRank algorithm; the specific process is as follows:
(1) Construct the undirected graph: construct an undirected graph with all elements of the candidate term set of document d_i as vertices, where, if candidate terms c_j and c_k appear within a co-occurrence window of length n, there is an edge between c_j and c_k.
(2) Calculate edge weights: the edge weight is an improvement of the invention. The calculation likewise relies on the phrase vectors constructed by the autoencoder. Each edge in the graph is assigned a weight w_jk according to the cosine distance similarity(c_j, c_k) between the vector representations of the two candidate terms c_j and c_k and their co-occurrence count occur_count(c_j, c_k):

w_jk = similarity(c_j, c_k) × occur_count(c_j, c_k)

where v(c_j) and v(c_k) are the vector representations of candidate terms c_j and c_k respectively, cos denotes the cosine distance of the vectors, and occur_count(c_j, c_k) is the number of times c_j and c_k appear together in a co-occurrence window; multiplying by the number of times the two words appear together strengthens their semantic relation. w_jk represents the weight of the edge between c_j and c_k.
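Counting co-occurrences within a sliding window and combining them with similarity into the edge weight w_jk can be sketched as follows; the window size defaults to 3 as in the embodiment, and the helper names are illustrative.

```python
from collections import Counter

def cooccurrence_counts(terms, window=3):
    """Count occur_count(c_j, c_k): how often two candidate terms fall
    inside the same sliding co-occurrence window over the candidate-term
    sequence of the document. Pairs are stored in sorted order since the
    graph is undirected."""
    counts = Counter()
    for i in range(len(terms)):
        for j in range(i + 1, min(i + window, len(terms))):
            if terms[i] != terms[j]:
                counts[tuple(sorted((terms[i], terms[j])))] += 1
    return counts

def edge_weight(similarity, occur_count):
    """w_jk = similarity(c_j, c_k) x occur_count(c_j, c_k)."""
    return similarity * occur_count
```

Terms that are both semantically close and frequently co-occurring thus get the heaviest edges, which is the stated intent of the weighting.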
(3) Iteratively calculate vertex weights: the vertex weight is also an improvement of the invention. The weight of each vertex in the graph is calculated iteratively until the maximum number of iterations is reached; the weighted score WS(c_j) is calculated as follows:

WS(c_j) = (1 − d) · w_topic(c_j) + d · Σ_{c_k ∈ adj(c_j)} [ w_jk / Σ_{c_p ∈ adj(c_k)} w_kp ] · WS(c_k)

where WS(c_j) denotes the weight of candidate term c_j of document d_i; d is the damping coefficient, whose role is to give each vertex a certain probability of voting for other vertices, so that every vertex can obtain a non-zero score and the algorithm is guaranteed to converge after several iterations; its usual value is 0.85. w_topic(c_j) is the topic weight of candidate term c_j of document d_i, w_jk is the weight of the edge between candidate terms c_j and c_k, and w_kp is the weight of the edge between candidate terms c_k and c_p; adj(c_j) denotes the set of candidate terms connected to c_j, of which c_k is an element, and similarly adj(c_k) denotes the set of candidate terms connected to c_k, of which c_p is an element; WS(c_k) denotes the weight of candidate term c_k of document d_i. The latter part of the right-hand side of the equation represents the votes cast for c_j by the vertices connected to it.
(4) Rank candidate terms: after several iterations, each vertex in the graph obtains a stable score; the candidate term set is sorted in descending order of weighted score WS(c_j), and the top K candidate terms are retained as the keywords of the document.
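The iterative update of step (3) can be sketched as a weighted PageRank-style loop; representing the graph as a dict of unordered edge pairs and running a fixed number of iterations are illustrative assumptions.

```python
def textrank_scores(edges, topic_w, d=0.85, iterations=50):
    """Iterate the improved TextRank update: each candidate term keeps a
    (1 - d) * topic-weight base score and receives a d-damped share of
    its neighbours' scores, proportional to the edge weight w_jk.
    `edges` maps unordered (c_j, c_k) pairs to weights; the graph is
    undirected, as in step (1)."""
    neighbours = {c: {} for c in topic_w}
    for (j, k), w in edges.items():
        neighbours[j][k] = w
        neighbours[k][j] = w
    scores = {c: 1.0 for c in topic_w}       # arbitrary initial scores
    for _ in range(iterations):
        updated = {}
        for j in topic_w:
            vote = 0.0
            for k, w_jk in neighbours[j].items():
                total_out = sum(neighbours[k].values())  # sum of w_kp over p
                if total_out:
                    vote += w_jk / total_out * scores[k]
            updated[j] = (1.0 - d) * topic_w[j] + d * vote
        scores = updated
    return scores
```

Sorting the returned scores in descending order and keeping the top K gives the keyword list of step (4).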
Through the four steps S1–S4 above, the keywords of a document can be extracted.
The present invention also provides a keyword extraction system based on phrase vectors, comprising:
a text preprocessing module, for segmenting the original text and tagging parts of speech, retaining n-grams according to part of speech to obtain a candidate term set;
a phrase vector construction module, for obtaining, for a candidate term c_j = (x_1, x_2, …, x_T), a semantically expressive phrase vector through the autoencoder;
a topic weight calculation module, for calculating the topic weights of candidate terms; the specific calculation method is as described above;
a candidate term ranking module, for calculating weighted scores for candidate terms and taking the top K candidate terms as keywords; the specific selection method is as described above.
Further, the system also includes an autoencoder training module, for handling the sequence information in the phrase structure and obtaining the phrase vector representations of candidate terms; the training method is as described above.
Below, taking enterprise paper data from an enterprise paper database as an example, the keyword extraction method based on phrase vectors is illustrated concretely.
The enterprise paper database contains enterprise paper data on environmental protection and several other fields; the data include fields such as "title", "time", "abstract", "keywords", "English keywords" and "classification number". During keyword extraction, the "title" and "abstract" in the database serve as the text content, and "keywords" serve as labeled data to verify the extraction results.
When training the autoencoder, the "keywords" field in the database is taken as training data; some of the parameters in the training process are shown in Table 1.
Table 1: Training parameter settings
Before keyword extraction, the labeled data are analyzed to determine some of the parameters in the algorithm. The data set contains 59913 papers, with an average of 4.2 labeled keywords per paper. First, the length of the labeled keywords, i.e. the number of words each keyword contains, is counted; the results are shown in Table 2. From Table 2 it can be found that the average length of all keywords is 1.98, and the length of most keywords lies between 1 and 3; keywords of length 1 to 3 account for 93.9% of all 254376 keywords. Therefore the 1-grams, 2-grams and 3-grams in the text are retained when selecting candidate terms.
Then the parts of speech of all the words in the keywords are counted; the statistical results are shown in Table 3. Part-of-speech tagging is completed with the Jieba segmentation tool, and some part-of-speech explanations are shown in Table 4. According to Table 3, the part-of-speech distribution of the words in keywords is not as concentrated as the length distribution, but it is still mainly concentrated on nouns, verbs and verbs with a noun function (nominal verbs); these three parts of speech account for 73.1% of all word parts of speech. Therefore, when selecting candidate terms, the nouns, verbs and nominal verbs in the text and their combinations are taken as candidate terms.
Table 2: Distribution of keyword lengths
Table 3: Distribution of keyword parts of speech
Table 4: Jieba part-of-speech tag explanations
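The part-of-speech filter described above can be sketched as follows. The patent uses the Jieba tagger; to stay self-contained, this sketch operates on pre-tagged `(word, flag)` pairs such as `jieba.posseg.cut` would yield, and the helper name and sample data are illustrative:

```python
KEPT_POS = {"n", "v", "vn"}  # noun, verb, verb-with-noun-function (Jieba tag set)

def filter_by_pos(tagged_tokens):
    """Keep only words whose part-of-speech flag is in the retained set.

    tagged_tokens: list of (word, pos_flag) pairs, e.g. produced by
    jieba.posseg.cut() via [(p.word, p.flag) for p in ...].
    """
    return [word for word, flag in tagged_tokens if flag in KEPT_POS]

tagged = [("mining", "n"), ("of", "u"), ("design", "vn"), ("rapidly", "d")]
print(filter_by_pos(tagged))  # → ['mining', 'design']
```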
Since the text content includes only the title and abstract of each paper, the title is mainly used as representative of the full text when computing topic weights: candidate lexical items are extracted from the title to compute the theme vector of the text. In addition, the co-occurrence window over the candidate word sequence is initially set to 3, and the 10 highest-ranked candidate words are finally retained, as shown in Table 5.
Table 5: Keyword extraction results (partial)
As a preferred illustration, a single paper record from the enterprise paper database is taken as an example to show the keyword extraction process in detail.
The data content is: "An instance analysis of mining design industry development thinking under the new situation: This paper reviews the ten-year period of rapid growth of the coal industry and its profound influence on the mining design market. Against the background of the current rapid downturn of the coal industry economy and the intensified competition in the coal design market, and taking the development of the mining specialty of World Tech Design Institute as an example, it analyzes the human-resource and business variation features of the mining specialty, proposes development thinking and implementation measures for the mining specialty, and provides a reference for the development of the mining specialties of other design companies."
Here, "An instance analysis of mining design industry development thinking under the new situation" is the title of the paper, and the remaining content is its abstract.
Candidate lexical items are chosen via n-gram extraction and part-of-speech tagging; the candidate lexical items selected from the title of the paper serve as the theme lexical item set of the text. The selected candidate lexical items are shown in Table 6.
Table 6: Candidate lexical item results
The phrase vector representations of all lexical items in the theme lexical item set are obtained with the autoencoder, and the average of these phrase vectors is computed as the theme vector of the text. The theme vector of the document has dimension 400; some of its values are shown in Table 7.
Table 7: Theme vector of the document (partial values)
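The theme-vector computation above (an element-wise average of the theme lexical items' phrase vectors) can be sketched with NumPy; the toy 4-dimensional vectors stand in for the 400-dimensional autoencoder outputs:

```python
import numpy as np

def theme_vector(phrase_vectors):
    """Document theme vector: element-wise mean of the phrase vectors
    of all lexical items in the theme lexical item set."""
    return np.mean(np.stack(phrase_vectors), axis=0)

# toy 4-dimensional stand-ins for the 400-dimensional autoencoder outputs
vecs = [np.array([1.0, 0.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0, 0.0])]
print(theme_vector(vecs))  # → [2. 1. 0. 1.]
```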
For each candidate lexical item, the cosine similarity between its phrase vector and the theme vector of the text is computed and taken as its topic weight; some of the values are shown in Table 8.
Table 8: Topic weight results (partial)
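The topic-weight step, read here as a cosine similarity between a candidate's phrase vector and the theme vector, can be sketched as follows (function name illustrative):

```python
import numpy as np

def topic_weight(candidate_vec, theme_vec):
    """Topic weight of a candidate lexical item: cosine similarity between
    its phrase vector and the document theme vector."""
    den = float(np.linalg.norm(candidate_vec) * np.linalg.norm(theme_vec))
    return float(np.dot(candidate_vec, theme_vec)) / den if den else 0.0

v = np.array([1.0, 0.0])
t = np.array([1.0, 1.0])
print(round(topic_weight(v, t), 4))  # → 0.7071
```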
Taking the candidate lexical items as vertices and their co-occurrence information as edges, an undirected graph is constructed. Each edge in the graph is assigned a weight according to the cosine similarity between the vector representations of the two candidate lexical items and their co-occurrence count. Vertex weights are then computed through repeated iteration using the topic weights and the edge weights. After several iterations, each vertex in the graph converges to a stable score; some of the scores are shown in Table 9.
Table 9: Weighted score results (partial)
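The graph construction and iterative scoring described above can be sketched as a small weighted-TextRank routine. All names and the toy inputs are illustrative; the edge weight follows the text (cosine similarity of the two phrase vectors times the co-occurrence count), and the iteration combines topic weights with edge weights:

```python
import numpy as np

def rank_candidates(vectors, cooccur, topic_w, d=0.85, iters=50):
    """Weighted TextRank over candidate lexical items.

    vectors: {item: phrase vector}      (e.g. autoencoder outputs)
    cooccur: {(item_j, item_k): count}  (undirected co-occurrence counts)
    topic_w: {item: topic weight}       (similarity to the theme vector)
    Returns {item: score}.
    """
    items = list(vectors)
    w = {}
    for (a, b), cnt in cooccur.items():
        va, vb = vectors[a], vectors[b]
        den = float(np.linalg.norm(va) * np.linalg.norm(vb))
        sim = float(np.dot(va, vb)) / den if den else 0.0
        w[(a, b)] = w[(b, a)] = sim * cnt          # undirected edge weight
    neighbors = {i: [j for j in items if (i, j) in w] for i in items}
    score = {i: 1.0 for i in items}                # uniform initial scores
    for _ in range(iters):
        score = {
            j: (1 - d) * topic_w.get(j, 1.0)
               + d * sum(w[(j, k)] / sum(w[(k, p)] for p in neighbors[k]) * score[k]
                         for k in neighbors[j])
            for j in items
        }
    return score

vecs = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 1.0]), "c": np.array([0.0, 1.0])}
counts = {("a", "b"): 2, ("b", "c"): 1}
tw = {"a": 0.5, "b": 0.9, "c": 0.4}
scores = rank_candidates(vecs, counts, tw)
```

The final keywords would then be the top-scoring items, e.g. `sorted(scores, key=scores.get, reverse=True)[:10]`.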
The resulting scores are sorted, and the 10 highest-scoring (Top-10) candidate lexical items are taken as the final keywords, as shown in Table 10.
Table 10: Keyword extraction results (partial)
It should be noted that the terms "first" and "second" herein are used merely to distinguish entities or operations with identical names, and do not imply any order of, or relationship between, these entities or operations.
Those of ordinary skill in the art will appreciate that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.
Claims (9)
1. A keyword extraction method based on phrase vectors, characterized in that the method comprises:
S1: segmenting the text and tagging parts of speech, and retaining n-grams to obtain a candidate lexical item set;
S2: constructing a phrase vector for each candidate lexical item by means of an autoencoder;
S3: determining the theme of the text and computing the similarity between each candidate lexical item and the theme vector, taking the similarity as the topic weight of the candidate lexical item;
S4: obtaining keywords from the candidate lexical item set by the TextRank algorithm.
2. The method according to claim 1, characterized in that the autoencoder in step S2 comprises an encoder and a decoder, the encoder consisting of a bidirectional LSTM layer and a fully connected layer, and the decoder consisting of a unidirectional LSTM layer and a softmax layer.
3. The method according to claim 2, characterized in that the training method of the autoencoder in step S2 comprises the following steps:
S21: selecting training samples and obtaining candidate lexical items;
S22: for a candidate lexical item $c_j = (x_1, x_2, \ldots, x_T)$, computing in the encoder, with a bidirectional LSTM, in the forward and backward directions respectively:
$$(\overrightarrow{h_t}, \overrightarrow{C_t}) = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}})$$
$$(\overleftarrow{h_t}, \overleftarrow{C_t}) = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}})$$
wherein $\overrightarrow{h_t}, \overrightarrow{C_t}$ and $\overleftarrow{h_t}, \overleftarrow{C_t}$ are the hidden layer state and cell state at time $t$ $(t = 1, 2, \ldots, T)$ in the left-to-right and right-to-left directions respectively; $\overrightarrow{h_{t-1}}, \overrightarrow{C_{t-1}}$ and $\overleftarrow{h_{t-1}}, \overleftarrow{C_{t-1}}$ are the hidden layer states and cell states at the previous step in the two directions; $x_t$ is the word of the candidate lexical item input at time $t$; and $T$ is the number of words in the candidate lexical item;
S23: computing $ES_T$ in the encoder by the formulas:
$$h'_T = f(W_h h_T + b_h)$$
$$C'_T = f(W_c C_T + b_c)$$
wherein $h_T = \overrightarrow{h_T} \oplus \overleftarrow{h_T}$ and $C_T = \overrightarrow{C_T} \oplus \overleftarrow{C_T}$, $\oplus$ being the concatenation operator; $W_h, b_h, W_c, b_c$ are the parameter matrices and biases of the fully connected network; $f$ is the ReLU activation function of the fully connected network; and $ES_T$ is the tuple formed by $h'_T$ and $C'_T$;
S24: in the decoder, decoding with a unidirectional LSTM using $ES_T$ as the initial state:
$$z_t = \mathrm{LSTM}(y_{t-1}, z_{t-1}), \qquad z_0 = ES_T$$
wherein $z_t$ is the hidden layer state of the decoder at time $t$, $z_{t-1}$ is the hidden layer state at time $t-1$, $ES_T$ is the encoder state, and $y_{t-1}$ is the word of the candidate lexical item output at time $t-1$;
S25: estimating the probability of the current word according to $z_t$:
$$P(y_t) = \mathrm{softmax}(W_s z_t + b_s)$$
wherein $W_s z_t + b_s$ scores each possible output word, and softmax is the normalizing function;
S26: when the loss function $L$ keeps decreasing during training and finally stabilizes, obtaining the parameters $W_h, b_h, W_c, b_c$ of the encoder and the parameters $W_s, b_s$ of the decoder, thereby determining the autoencoder; wherein the loss function $L$ is computed as:
$$L = -\sum_{t=1}^{T} \log P(y_t = x_t)$$
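Step S23 can be illustrated with a NumPy sketch, assuming (as the claim suggests) that $h_T$ and $C_T$ are the concatenations of the final forward and backward states; function names, shapes, and values are illustrative:

```python
import numpy as np

def relu(x):
    """The ReLU activation f used in the fully connected layers."""
    return np.maximum(x, 0.0)

def encoder_state(h_fwd, h_bwd, C_fwd, C_bwd, Wh, bh, Wc, bc):
    """S23: concatenate the final forward/backward hidden and cell states
    (the connector in the claim) and project each through a fully connected
    layer with ReLU, giving the tuple ES_T = (h'_T, C'_T)."""
    h_T = np.concatenate([h_fwd, h_bwd])
    C_T = np.concatenate([C_fwd, C_bwd])
    return relu(Wh @ h_T + bh), relu(Wc @ C_T + bc)

# toy shapes: per-direction state size 2, projected size 2
rng = np.random.default_rng(0)
Wh, Wc = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
bh = bc = np.zeros(2)
h_p, C_p = encoder_state(np.ones(2), np.ones(2), np.ones(2), np.ones(2), Wh, bh, Wc, bc)
```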
4. The method according to claim 3, characterized in that, in step S2, the candidate lexical item is input into the autoencoder, and the value of $ES_T$ output by the encoder is the phrase vector of the candidate lexical item.
5. The method according to claim 1, characterized in that the theme vector in step S3 is computed as:
$$\vec{d_i} = \frac{1}{|T_{d_i}|} \sum_{t_j \in T_{d_i}} \vec{t_j}$$
wherein $\vec{t_j}$ is the vector representation of theme lexical item $t_j$, $T_{d_i}$ is the theme lexical item set of text $d_i$, and $\vec{d_i}$ is the theme vector representation of text $d_i$.
6. The method according to claim 1, characterized in that, in the TextRank algorithm of step S4, if candidate lexical items $c_j$ and $c_k$ appear in the same co-occurrence window, there is an edge between $c_j$ and $c_k$, whose weight is computed as:
$$w_{jk} = \mathrm{similarity}(c_j, c_k) \times \mathrm{occur\_count}(c_j, c_k)$$
wherein $\vec{c_j}$ and $\vec{c_k}$ are the vector representations of candidate lexical items $c_j$ and $c_k$; $\mathrm{occur\_count}(c_j, c_k)$ is the number of times $c_j$ and $c_k$ appear together in a co-occurrence window; $\mathrm{similarity}(c_j, c_k)$ is the similarity between $\vec{c_j}$ and $\vec{c_k}$; and $w_{jk}$ is the weight of the edge between $c_j$ and $c_k$.
7. The method according to claim 6, characterized in that the TextRank algorithm of step S4 further comprises iteratively computing the weight of each candidate lexical item until a maximum number of iterations is reached, the weight $WS(c_j)$ being computed as:
$$WS(c_j) = (1 - d)\, tw(c_j) + d \sum_{c_k \in adj(c_j)} \frac{w_{jk}}{\sum_{c_p \in adj(c_k)} w_{kp}}\, WS(c_k)$$
wherein $WS(c_j)$ is the weight of candidate lexical item $c_j$; $d$ is the damping coefficient, preferably $d = 0.85$; $tw(c_j)$ is the topic weight of candidate lexical item $c_j$; $w_{jk}$ is the weight of the edge between candidate lexical items $c_j$ and $c_k$; $w_{kp}$ is the weight of the edge between candidate lexical items $c_k$ and $c_p$; $adj(c_j)$ is the set of candidate lexical items connected to $c_j$, of which $c_k$ is an element; $adj(c_k)$ is the set of candidate lexical items connected to $c_k$, of which $c_p$ is an element; and $WS(c_k)$ is the weight of candidate lexical item $c_k$.
8. A keyword extraction system based on phrase vectors, characterized in that the system comprises:
a text preprocessing module, for segmenting the original text, tagging parts of speech, and retaining n-grams according to part of speech to obtain a candidate lexical item set;
a phrase vector construction module, for obtaining, for a candidate lexical item $c_j = (x_1, x_2, \ldots, x_T)$, a phrase vector with semantic representation by means of an autoencoder;
a topic weight computation module, for computing the topic weight of each candidate lexical item;
a candidate word ranking module, for computing a weighted score for each candidate lexical item and taking the Top-K candidate lexical items as keywords.
9. The system according to claim 8, characterized in that the system further comprises an autoencoder training module, for obtaining the parameters of the autoencoder through sample training, thereby determining the autoencoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910548261.XA CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263343A true CN110263343A (en) | 2019-09-20 |
CN110263343B CN110263343B (en) | 2021-06-15 |
Family
ID=67920847
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910548261.XA Active CN110263343B (en) | 2019-06-24 | 2019-06-24 | Phrase vector-based keyword extraction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263343B (en) |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080017686A (en) * | 2006-08-22 | 2008-02-27 | 에스케이커뮤니케이션즈 주식회사 | Method for extracting subject and sorting document of searching engine, computer readable record medium on which program for executing method is recorded |
US8019708B2 (en) * | 2007-12-05 | 2011-09-13 | Yahoo! Inc. | Methods and apparatus for computing graph similarity via signature similarity |
CN103744835A (en) * | 2014-01-02 | 2014-04-23 | 上海大学 | Text keyword extracting method based on subject model |
KR101656245B1 (en) * | 2015-09-09 | 2016-09-09 | 주식회사 위버플 | Method and system for extracting sentences |
CN106372064A (en) * | 2016-11-18 | 2017-02-01 | 北京工业大学 | Characteristic word weight calculating method for text mining |
CN106970910A (en) * | 2017-03-31 | 2017-07-21 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN106997382A (en) * | 2017-03-22 | 2017-08-01 | 山东大学 | Innovation intention label automatic marking method and system based on big data |
CN107122413A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107133213A (en) * | 2017-05-06 | 2017-09-05 | 广东药科大学 | A kind of text snippet extraction method and system based on algorithm |
CN107193803A (en) * | 2017-05-26 | 2017-09-22 | 北京东方科诺科技发展有限公司 | A kind of particular task text key word extracting method based on semanteme |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN108460019A (en) * | 2018-02-28 | 2018-08-28 | 福州大学 | A kind of emerging much-talked-about topic detecting system based on attention mechanism |
CN108710611A (en) * | 2018-05-17 | 2018-10-26 | 南京大学 | A kind of short text topic model generation method of word-based network and term vector |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109726394A (en) * | 2018-12-18 | 2019-05-07 | 电子科技大学 | Short text Subject Clustering method based on fusion BTM model |
CN109918660A (en) * | 2019-03-04 | 2019-06-21 | 北京邮电大学 | A kind of keyword extracting method and device based on TextRank |
CN109918510A (en) * | 2019-03-26 | 2019-06-21 | 中国科学技术大学 | Cross-cutting keyword extracting method |
Non-Patent Citations (5)
Title |
---|
BASALDELLA MARCO et al.: "Bidirectional LSTM recurrent neural network for keyphrase extraction", Italian Research Conference on Digital Libraries *
张莉婧 et al.: "Keyword extraction algorithm based on improved TextRank", 《北京印刷学院学报》 (Journal of Beijing Institute of Graphic Communication) *
李航 et al.: "A TextRank keyword extraction method fusing multiple features", 《情报杂志》 (Journal of Intelligence) *
洪冬梅: "Research on automatic text summarization technology based on LSTM", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *
齐翌辰 et al.: "Application of Chinese extractive summarization methods based on deep learning", 《科教导刊》 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111274428A (en) * | 2019-12-19 | 2020-06-12 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111274428B (en) * | 2019-12-19 | 2023-06-30 | 北京创鑫旅程网络技术有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
CN111222333A (en) * | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | Keyword extraction method based on fusion of network high-order structure and topic model |
CN111785254A (en) * | 2020-07-24 | 2020-10-16 | 四川大学华西医院 | Self-service BLS training and checking system based on anthropomorphic dummy |
CN111785254B (en) * | 2020-07-24 | 2023-04-07 | 四川大学华西医院 | Self-service BLS training and checking system based on anthropomorphic dummy |
CN112818686A (en) * | 2021-03-23 | 2021-05-18 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment |
CN112818686B (en) * | 2021-03-23 | 2023-10-31 | 北京百度网讯科技有限公司 | Domain phrase mining method and device and electronic equipment |
CN113312532A (en) * | 2021-06-01 | 2021-08-27 | 哈尔滨工业大学 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
Also Published As
Publication number | Publication date |
---|---|
CN110263343B (en) | 2021-06-15 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |