CN109255118A

CN109255118A - Keyword extraction method and device

Info

Publication number: CN109255118A
Application number: CN201710604469.XA
Authority: CN
Inventors: 张春荣
Original assignee: Putian Information Technology Co Ltd
Current assignee: Putian Information Technology Co Ltd
Priority date: 2017-07-11
Filing date: 2017-07-11
Publication date: 2019-01-22
Anticipated expiration: 2037-07-11
Also published as: CN109255118B

Abstract

Embodiments of the present invention provide a keyword extraction method and device. The method includes: obtaining web page text information and preprocessing it to obtain a sequence of candidate keywords; constructing a candidate keyword graph from the sequence of candidate keywords, computing from the graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and using the similarity value as the initial weight value of each candidate keyword; and computing, from the initial weight value of each candidate keyword, the corresponding convergence weight value, sorting the convergence weight values by magnitude, and extracting the target keywords of the web page text information from the candidate keywords according to that ordering. The embodiments improve the initial-weight calculation of the TextRank algorithm and thereby extract keywords from web page text information more efficiently.

Description

Keyword extraction method and device

Technical field

The present invention relates to the field of computer technology, and in particular to a keyword extraction method and device.

Background

The purpose of text keyword extraction is to condense the topic of a text and quickly capture its core content. Keyword extraction plays an important role in fields such as automatic summarization of news and academic papers, social tagging, and text topic extraction.

Depending on whether the corpus is annotated, keyword extraction can be divided into supervised and unsupervised approaches. Typical supervised keyword extraction treats the task as a binary classification problem: each word in a text is judged to be either a keyword or a non-keyword. This approach requires the keywords in the document corpus to be labeled manually in advance so that a classification model can be trained and then used for extraction; it needs substantial manual intervention and is costly.

Unsupervised methods require no manual labeling and, because there is no training process, are more convenient to apply. There are currently three mainstream unsupervised keyword extraction methods: keyword extraction with the TF-IDF model based on word-frequency statistics, keyword extraction based on topic models, and keyword extraction based on word graph models. Many related optimizations have been built on these three lines of research. Keyword extraction based on word graph models does not need an additional document set for training; it relies only on the word structure of the text itself, is simple and effective, and is therefore widely used, with the TextRank algorithm as its typical representative.

The attention mechanism in neural networks is essentially modeled on the attention mechanism found in human vision and was first applied in the image domain. Its basic idea is that when people look at an image they do not take in every pixel of the whole image at once; they mostly focus on particular parts of the image as needed. Moreover, humans learn from previously observed images where attention should be focused in future observations. With attention, a model learns which part of an image to process: each current state learns the position l to attend to from the previous state and the current input image, and processes only the pixels in the attended region rather than all pixels of the image. The benefit is that fewer pixels need to be processed, which reduces the complexity of the task. Attention as applied to images is thus very similar to human attention. Attention in natural language processing (NLP) is used as an extension of the recurrent neural network (RNN) and comes in two forms: a global mechanism and a local mechanism. The global mechanism attends over all words on the source-language side. The local mechanism reduces the cost of computing attention: instead of considering all source-side words, it first uses a prediction function to predict the source-side position to be aligned with the current decoding step, and then considers only the words inside a context window around that position.

The task of keyword extraction is to automatically extract several meaningful words or phrases from a given text. The core idea of the TextRank algorithm derives from the well-known page ranking algorithm PageRank; on the basis of the PageRank formula it introduces edge weights, which represent the similarity of two sentences. The TextRank formula is as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where, for a given node V_i, In(V_i) is the set of nodes pointing to it and Out(V_i) is the set of nodes that V_i points to. d is the damping factor, with a value in the range 0 to 1, representing the probability of moving from a given node in the graph to any other node; it is usually set to 0.85. The weight of the edge between any two nodes V_i and V_j in the graph is w_ji.
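The classical weighted TextRank iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented method: the function name `textrank`, the nested-dictionary graph layout, the starting score of 1.0, and the stopping tolerance are all assumptions made for the example.

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=100):
    """Iterate WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j).

    weights[j][i] holds w_ji, the weight of the edge from node j to node i
    (hypothetical layout chosen for this sketch).
    """
    nodes = set(weights)
    for targets in weights.values():
        nodes.update(targets)
    scores = {v: 1.0 for v in nodes}  # identical starting scores (assumption)
    out_sum = {j: sum(weights.get(j, {}).values()) for j in nodes}
    for _ in range(max_iter):
        new_scores = {}
        for i in nodes:
            rank = 0.0
            for j in nodes:
                w_ji = weights.get(j, {}).get(i, 0.0)
                if w_ji > 0 and out_sum[j] > 0:
                    rank += w_ji / out_sum[j] * scores[j]
            new_scores[i] = (1 - d) + d * rank
        delta = max(abs(new_scores[v] - scores[v]) for v in nodes)
        scores = new_scores
        if delta < tol:  # stop once no score moves by more than tol
            break
    return scores


# Toy undirected chain a - b - c, stored as two directed edges per link.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
scores = textrank(graph)
```

In this toy chain the middle node `b` receives weight from both neighbors and therefore ends up with the highest score, which is the behavior the formula is meant to capture.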

The TextRank algorithm ranks candidate keywords using the co-occurrence relations between local words and extracts them directly from the text itself. TextRank splits the text into its smallest units, namely words, which serve as network nodes and form a word-network graph model. Like PageRank, TextRank should in theory compute edge weights when iteratively computing word weights; to simplify the computation, however, it usually assigns identical initial weights and divides weight equally among adjacent words. It therefore ignores the relationships between words, that is, the influence of external documents on those relationships.

In addition, another factor affecting how weight is distributed among word nodes is the importance of a word node itself, which reflects the influence of the document's internal structure and is usually adjusted through adjacent nodes. Classical TextRank, however, does not rely on other training corpora and does not consider the structural relations among words inside the text when building the graph model for keyword extraction, so it cannot adequately reflect the influence of neighboring words.

Therefore, how to improve the initial-weight calculation of the TextRank algorithm so as to extract document keywords more efficiently has become an urgent problem.

Summary of the invention

In view of the defects in the prior art, embodiments of the present invention provide a keyword extraction method and device.

In a first aspect, an embodiment of the present invention provides a keyword extraction method, the method comprising:

obtaining web page text information, and preprocessing the web page text information to obtain a sequence of candidate keywords;

constructing a candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and using the similarity value as the initial weight value of each candidate keyword;

computing, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sorting the convergence weight values by magnitude, and extracting the target keywords of the web page text information from the candidate keywords according to that ordering.

Optionally, preprocessing the web page text information specifically includes:

splitting the web page text information into complete sentences, performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words and filtering by part of speech, and retaining the candidate keywords.

Optionally, constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and using the similarity value as the initial weight value of each candidate keyword specifically includes:

representing each candidate keyword as a K-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2vec algorithm, and computing from the word vectors the similarity value, i.e., the cosine of the angle, between each candidate keyword and the other candidate keywords in the sequence, thereby obtaining the initial weight value of each candidate keyword;

where the value of k is an element of the transfer matrix R in the candidate keyword graph.

Optionally, computing, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword specifically includes:

obtaining the convergence weight value of each candidate keyword by iterative calculation with the following formula, based on the attention mechanism;

The convergence weight value is calculated as follows:

where,

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

d is the damping factor, with a value in the range 0 to 1, representing the probability that a particular candidate keyword in the candidate keyword sequence points to other candidate keywords; it is usually set to 0.85;

In(V_i) is the set of candidate keywords that point to the i-th candidate keyword;

Out(V_i) is the set of candidate keywords that the i-th candidate keyword points to;

ω_ji is the similarity value Sim(e_i, f_j) between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;

e_i is the word-vector representation of the i-th candidate keyword;

f_j is the word-vector representation of the j-th candidate keyword;

ω_jk is the similarity value Sim(e_k, f_j) between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;

the value of k_w,i is an element of the transfer matrix R^(|V|×2b) in the candidate keyword graph;

2b is the window length, i.e., the window size: at most 2b candidate keywords co-occur in one window;

|V| is the number of candidate keywords;

α_ji is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with α_ji = α_ij;

The attention value α_ij is calculated as follows:

where,

k_w,i is the value of the element in row w, column i of the transfer matrix R^(|V|×2b);

exp k_w,i is the exponential function with the constant e as its base, where e is approximately 2.718282;

S_i is a bias term, obtained automatically once the window is fixed;

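The initial weight value ω_ji = Sim(e_i, f_j) is the cosine of the angle between the two word vectors:

$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$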

In a second aspect, an embodiment of the present invention provides a keyword extraction device, the device comprising:

a candidate keyword obtaining module, configured to obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords;

an initial weight value obtaining module, configured to construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and use the similarity value as the initial weight value of each candidate keyword;

a target keyword obtaining module, configured to compute, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sort the convergence weight values by magnitude, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.

Optionally, the candidate keyword obtaining module is specifically configured to:

Optionally, the initial weight value obtaining module is specifically configured to:

Optionally, the target keyword obtaining module is specifically configured to:

The convergence weight value is calculated as follows:

where,

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word-vector representation of the i-th candidate keyword;

f_j is the word-vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention value α_ij is calculated as follows:

where,

k_w,i is the value of the element in row w, column i of the transfer matrix R^(|V|×2b);

S_i is a bias term, obtained automatically once the window is fixed;

In a third aspect, an embodiment of the present invention provides an electronic device, the electronic device comprising:

at least one processor; and

at least one memory communicatively connected to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform any of the corresponding methods described above.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, the computer program causing a computer to perform any of the corresponding methods described above.

The keyword extraction method and device provided by the embodiments of the present invention preprocess the text information, thoroughly examine the relationships among candidate keywords, and construct a candidate keyword graph using the text itself together with external information provided by a text collection. They then compute the influence of the text's internal structure based on the attention mechanism and iteratively compute weights to extract the target keywords. The embodiments thereby improve the initial-weight calculation of the TextRank algorithm and extract keywords from web page text information more efficiently.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of the keyword extraction method in an embodiment of the present invention;

Fig. 2 is a flow diagram of the attention-based TextRank keyword extraction method in an embodiment of the present invention;

Fig. 3 is a schematic diagram of the attention-based weight update in an embodiment of the present invention;

Fig. 4 is a structural diagram of the keyword extraction device in an embodiment of the present invention;

Fig. 5 is a logic block diagram of an electronic device provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a keyword extraction method. Fig. 1 is a flow diagram of the keyword extraction method in an embodiment of the present invention. As shown in Fig. 1, the method includes:

Step S101, obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords;

Obtaining the web page text information specifically means parsing the web page with a parser, converting the HTML page into text, and retaining only the useful text information.
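The parsing step can be illustrated with Python's standard-library `html.parser`. The patent does not name a parser or define what counts as "useful" text, so the class `TextExtractor`, the helper `html_to_text`, and the choice to drop only `<script>` and `<style>` content are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIP = {"script", "style"}  # tags treated as non-useful (assumption)

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>"
text = html_to_text(page)
```

A production pipeline would likely also remove navigation, advertisements, and other page furniture; that filtering is outside what the text specifies.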

Preprocessing the web page text information means splitting the given text T into complete sentences; within each sentence, performing word segmentation and part-of-speech tagging, filtering out stop words, and filtering by part of speech so that only words with specified parts of speech, such as nouns, verbs, and adjectives, are retained as candidate keywords.
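The filtering part of this step can be sketched as follows, assuming segmentation and part-of-speech tagging have already produced `(word, tag)` pairs. The function name `filter_candidates`, the tag prefixes `n`, `v`, and `a`, the stop-word set, and the English sample tokens are all hypothetical; the patent does not specify a tagger or a tag set.

```python
def filter_candidates(tagged_tokens, stop_words, keep_prefixes=("n", "v", "a")):
    """Keep words that are not stop words and whose tag starts with a kept prefix.

    tagged_tokens: list of (word, pos_tag) pairs from an upstream tagger.
    keep_prefixes: noun, verb, and adjective tag prefixes (assumed tag scheme).
    """
    candidates = []
    for word, tag in tagged_tokens:
        if word in stop_words:
            continue
        if tag.startswith(keep_prefixes):
            candidates.append(word)
    return candidates


# Hypothetical pre-tagged sentence; the tags follow no particular standard.
tokens = [
    ("future", "nt"), ("20", "m"), ("years", "q"), (",", "wp"),
    ("artificial intelligence", "n"), ("on", "p"), ("society", "n"),
    ("produce", "v"), ("impact", "n"), ("large", "a"),
]
candidates = filter_candidates(tokens, stop_words={"on"})
```

Keeping the tagger outside the function makes the filter independent of any particular segmentation library.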

The sequence of candidate keywords refers to the multiple candidate keywords obtained from the web page text after preprocessing.

Step S102, construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and use the similarity value as the initial weight value of each candidate keyword;

The candidate keyword graph is built from co-occurrence relations: the word nodes of any two candidate keywords that fall within the same window are joined by an edge. An edge exists between two nodes only when their corresponding words co-occur in a window of length 2b, where 2b is the window size, i.e., at most 2b words co-occur.
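The co-occurrence rule can be illustrated with a small Python sketch that slides a fixed-size window over the candidate sequence and links every pair of distinct words inside it. The function name `cooccurrence_edges` and the use of `frozenset` pairs for undirected edges are choices made for this example only.

```python
from itertools import combinations


def cooccurrence_edges(words, window):
    """Return undirected edges between distinct words sharing a sliding window.

    words:  candidate keywords in text order.
    window: window size (2b in the description above).
    """
    edges = set()
    last_start = max(len(words) - window, 0)
    for start in range(last_start + 1):
        span = words[start:start + window]
        for u, v in combinations(span, 2):
            if u != v:  # no self-loops for repeated words
                edges.add(frozenset((u, v)))
    return edges


sequence = ["w1", "w2", "w3", "w4"]
edges_small = cooccurrence_edges(sequence, window=2)  # adjacent pairs only
edges_large = cooccurrence_edges(sequence, window=3)  # also links words two apart
```

Widening the window adds edges between words that are farther apart, so the window size directly controls how dense the candidate keyword graph becomes.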

The initial weight value, denoted ω, is the similarity value Sim(e_i, f_j) between two candidate keywords, computed as the cosine of the angle between their word vectors:

$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$

where e_i is the word-vector representation of the i-th candidate keyword and f_j is the word-vector representation of the j-th candidate keyword.
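A cosine similarity between two word vectors can be computed as in the sketch below. The two-dimensional vectors are invented purely for illustration; in the described method they would come from a trained word-embedding model, and the zero-vector fallback is an assumption added to keep the function total.

```python
import math


def cosine_similarity(e, f):
    """Cosine of the angle between vectors e and f (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(e, f))
    norm_e = math.sqrt(sum(x * x for x in e))
    norm_f = math.sqrt(sum(y * y for y in f))
    if norm_e == 0.0 or norm_f == 0.0:
        return 0.0
    return dot / (norm_e * norm_f)


# Made-up 2-dimensional vectors; real embeddings would have K dimensions.
vectors = {
    "artificial intelligence": [0.9, 0.1],
    "machine learning": [0.8, 0.2],
    "pineapple": [0.1, 0.9],
}
sim_related = cosine_similarity(vectors["artificial intelligence"], vectors["machine learning"])
sim_unrelated = cosine_similarity(vectors["artificial intelligence"], vectors["pineapple"])
```

With these toy vectors, `sim_related` is larger than `sim_unrelated`, so the edge between the two related terms would start with the larger initial weight.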

Step S103, compute, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sort the convergence weight values by magnitude, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.

The convergence weight value, denoted WS, is obtained as follows: to account for the influence of the text's internal structure, the importance of a candidate keyword's word node is adjusted through its adjacent word nodes; the weight transfer matrix is computed with the following formula based on the attention mechanism, and the weight of each word node is iteratively propagated until the values converge;

The convergence weight value is calculated as follows:

where,

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word-vector representation of the i-th candidate keyword;

f_j is the word-vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention value α_ij is calculated as follows:

where,

k_w,i is the value of the element in row w, column i of the transfer matrix R^(|V|×2b);

S_i is a bias term, obtained automatically once the window is fixed;

The keyword extraction method provided by the embodiment of the present invention constructs a candidate keyword graph from the sequence of candidate keywords obtained by preprocessing the web page text information, computes the similarity value between each candidate keyword and the other candidate keywords in the sequence, and uses the similarity value as the initial weight value of each candidate keyword. From the initial weight values it obtains the convergence weight value of each candidate keyword and, according to the ordering of the convergence weight values by magnitude, extracts the target keywords from the web page text information. The embodiment thereby improves the initial-weight calculation of the TextRank algorithm and extracts keywords from web page text information more efficiently.

On the basis of the above embodiment, preprocessing the web page text information specifically includes:

On the basis of the above embodiment, constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and using the similarity value as the initial weight value of each candidate keyword specifically includes:

where the transfer matrix R in the candidate keyword graph refers to the matrix R^(|V|×2b); 2b is the window length, i.e., the window size, which is set manually, meaning that at most 2b candidate keywords co-occur; and |V| is the number of candidate keywords.

On the basis of the above embodiment, computing, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword specifically includes:

The convergence weight value is calculated as follows:

where,

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word-vector representation of the i-th candidate keyword;

f_j is the word-vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention value α_ij is calculated as follows:

where,

k_w,i is the value of the element in row w, column i of the transfer matrix R^(|V|×2b);

S_i is a bias term, obtained automatically once the window is fixed;

A specific implementation of the embodiment of the present invention is as follows:

Fig. 2 is a flow diagram of the attention-based TextRank keyword extraction method in an embodiment of the present invention. As shown in Fig. 2, the method includes:

Step S201, obtain the web page body text: parse the web page with a parser, convert the HTML page into text, and retain only the useful text information;

Step S202, preprocess the web page body text: split the given text T into complete sentences; within each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and filter by part of speech so that only words with specified parts of speech, such as nouns, verbs, and adjectives, are retained as candidate keywords. For example, the text "In the next 20 years, the impact of artificial intelligence on the whole of society will be too large to imagine." is first segmented into words; after part-of-speech tagging it becomes "future/nt 20/m years/q within/nd ,/wp artificial intelligence/n on/p whole/a society/n produce/v of/u impact/v will/d large/a to/v hard to/d imagine/v ./wp"; and after part-of-speech filtering, "future, artificial intelligence, whole, society, produce, impact, large, to, hard to, imagine" is obtained as the initial candidate keyword sequence;

Step S203, construct the transfer matrix R^(|V|×2b) of the candidate keyword graph according to the word2vec algorithm, where V is the set of candidate keyword word nodes. Edges between any two nodes are then constructed from co-occurrence relations: an edge exists between two nodes only when their corresponding words co-occur in a window of length 2b, where 2b is the window size, i.e., at most 2b words co-occur. Suppose a sentence consists of the following words in order: w1, w2, w3, w4, w5, ..., wn. Then [w1, w2, ..., w2b], [w2, w3, ..., w2b+1], [w3, w4, ..., w2b+2], and so on are each a window. An undirected, unweighted edge exists between the nodes corresponding to any two words in the same window. Using the text itself together with external information provided by a text collection, the embodiment trains the CBOW model of word2vec on a sample text collection, represents each word in the dictionary D as a K-dimensional word vector, computes the cosine of the angle between vectors to obtain the similarity between each word in D and the other words, and uses the similarity as the initial weight value of the word node. Similarity reflects the degree of association between words; for example, the similarity between "artificial intelligence" and "machine learning" is higher than that between "artificial intelligence" and "pineapple". The similarity between the i-th word and the j-th word in the document set is Sim(e_i, f_j), where e_i and f_j are word-vector representations; the similarity formula is as follows:

$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$

Step S204, compute the weights based on the attention mechanism and iteratively propagate the weight of each node: to account for the influence of the text's internal structure, the importance of a word node itself is adjusted through its adjacent nodes. In the earlier example, "future, artificial intelligence, whole, society, produce, impact, large, to, hard to, imagine", the word "artificial intelligence" clearly carries more influence and should be given a larger weight. Based on the attention mechanism, the present invention computes the weight transfer matrix with the following formula and iteratively propagates the weight of each node until convergence:

where,

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word-vector representation of the i-th candidate keyword;

f_j is the word-vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention value α_ij is calculated as follows:

where,

k_w,i is the value of the element in row w, column i of the transfer matrix R^(|V|×2b), indicating the importance of the relative position of each word;

S_i is a bias term, obtained automatically once the window is fixed;

Therefore, the update of WS(V_i) is influenced not only by ω_ji but also by α_ij: reflecting the influence of the text's internal structure, the importance of a word node itself is adjusted through its adjacent nodes. The nodes and edges related to candidate keyword V_i in the candidate keyword graph are shown schematically in Fig. 3,

where a solid arrow indicates the transition probability from node V_i to the node V_j it points to, with line thickness indicating the magnitude of the weight transition probability, and a dashed line indicates the transition probability of jumping from node V_j to node V_i.
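The attention formula itself is not reproduced in this text, so the following is not the patented calculation. It only illustrates, under stated assumptions, how exponentials of position scores plus bias terms can be normalized into attention values with a generic softmax. The function name `softmax_attention`, the one-score-per-position layout, and the sample numbers are all hypothetical.

```python
import math


def softmax_attention(scores, biases):
    """Normalize exp(score + bias) over window positions (generic softmax).

    scores: one position score per window slot (stand-in for k_w,i).
    biases: one bias per window slot (stand-in for S_i).
    """
    shifted = [k + s for k, s in zip(scores, biases)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical scores for a window of four positions, with zero biases.
alphas = softmax_attention([2.0, 1.0, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0])
```

The resulting values are positive and sum to 1, and positions with higher scores receive more attention, which matches the qualitative role the description assigns to the attention value.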

Step S205, obtain the target keyword sequence: sort the node weights in descending order to obtain the T most important words as candidate keywords. The T most important words are then marked in the original text, and any that form adjacent phrases are combined into multi-word keywords. For example, the text above contains the sequence "future, artificial intelligence, whole, society, produce, impact, large, to, hard to, imagine". If both "hard to" and "imagine" are candidate keywords, they are combined into "hard to imagine" and added to the target keyword sequence.

The keyword extraction method provided by the embodiment of the present invention preprocesses the text information, thoroughly examines the relationships among candidate keywords, and constructs a candidate keyword graph using the text itself together with external information provided by a text collection. It then computes the influence of the text's internal structure based on the attention mechanism and iteratively computes weights to extract the target keywords. The embodiment thereby improves the initial-weight calculation of the TextRank algorithm and extracts keywords from web page text information more efficiently.
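The ranking and phrase-merging step can be sketched as below. The function names `top_keywords` and `merge_adjacent`, the space-joined phrase format, and the sample scores are assumptions for illustration; the text does not say how ties are broken or how merged phrases are scored.

```python
def top_keywords(scores, t):
    """Return the t highest-scoring words, in descending score order."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:t]


def merge_adjacent(tokens, keywords):
    """Join runs of selected keywords that are adjacent in the original text."""
    selected = set(keywords)
    phrases, current = [], []
    for token in tokens:
        if token in selected:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    result, seen = [], set()
    for phrase in phrases:  # drop repeats while keeping first-seen order
        if phrase not in seen:
            seen.add(phrase)
            result.append(phrase)
    return result


# Hypothetical converged scores and token order, for illustration only.
tokens = ["artificial intelligence", "impact", "will", "be", "hard to", "imagine"]
scores = {"artificial intelligence": 1.5, "hard to": 1.1, "imagine": 1.0, "impact": 0.4}
targets = merge_adjacent(tokens, top_keywords(scores, 3))
```

Here the three selected words yield two target keywords, because "hard to" and "imagine" are adjacent in the token sequence and are merged into one phrase.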

An embodiment of the present invention provides a keyword extraction device. Fig. 4 is a structural diagram of the keyword extraction device in an embodiment of the present invention. As shown in Fig. 4, the device includes: a candidate keyword obtaining module 401, an initial weight value obtaining module 402, and a target keyword obtaining module 403;

The candidate keyword obtaining module 401 is configured to obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords; the initial weight value obtaining module 402 is configured to construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword and the other candidate keywords in the sequence, and use the similarity value as the initial weight value of each candidate keyword; the target keyword obtaining module 403 is configured to compute, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sort the convergence weight values by magnitude, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.

On the basis of the above embodiment, the candidate keyword obtaining module is specifically configured to:

On the basis of the above embodiment, the initial weight value obtaining module is specifically configured to:

On the basis of the above embodiment, the target keyword obtaining module is specifically configured to:

The convergence weight value is calculated by the following formula:

where:

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word vector representation of the i-th candidate keyword;

f_j is the word vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention α_ij is calculated by the following formula:

where:

k_w,i is the element in row w, column i of the transfer matrix in R^(|V|×2b);

S_i is the bias term, obtained automatically once the window is fixed;
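As an illustration of the iterative weight computation, the following minimal sketch assumes the standard weighted TextRank update with each edge weight ω_ji additionally scaled by the attention value α_ji and with the damping term divided by |V|. This update rule, the function name converge_weights, the uniform starting value, and the tolerance and iteration limit are all assumptions made for the example; they are not the formula of the embodiment.

```python
from typing import List


def converge_weights(omega: List[List[float]],
                     alpha: List[List[float]],
                     d: float = 0.85,
                     tol: float = 1e-6,
                     max_iter: int = 200) -> List[float]:
    """Iterate an assumed attention-scaled TextRank update until it stabilises.

    omega[j][i] is the similarity (initial weight) between keywords j and i,
    alpha[j][i] is the attention value for the same pair, d is the damping
    coefficient.
    """
    n = len(omega)
    ws = [1.0 / n] * n
    # Total similarity leaving each node j, used to normalise omega[j][i].
    out_sum = [sum(omega[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(max_iter):
        new_ws = []
        for i in range(n):
            incoming = sum(
                alpha[j][i] * omega[j][i] / out_sum[j] * ws[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            new_ws.append((1 - d) / n + d * incoming)
        delta = max(abs(a - b) for a, b in zip(new_ws, ws))
        ws = new_ws
        if delta < tol:
            break
    return ws


# Toy symmetric similarity matrix for three candidate keywords;
# attention values of 1.0 reduce the update to plain weighted TextRank.
omega = [[0.0, 0.9, 0.1],
         [0.9, 0.0, 0.8],
         [0.1, 0.8, 0.0]]
alpha = [[1.0] * 3 for _ in range(3)]
scores = converge_weights(omega, alpha)
```

In this toy input the second keyword is the most strongly connected, so it receives the largest converged weight.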

The keyword extraction device provided by this embodiment of the present invention is used to implement the keyword extraction method provided by the embodiments of the present invention. Its specific implementation is described in detail in the above method embodiments and is not repeated here.

The keyword extraction device provided by this embodiment of the present invention pre-processes the text information, thoroughly examines the relationships among the candidate keywords, and constructs the candidate keyword graph from the external information provided by the text itself and by the text collection. It then calculates the influence of the internal structure of the text based on an attention mechanism, iteratively computes the weights, and extracts the target keywords. The embodiment thereby improves the initial weight algorithm of the TextRank algorithm and extracts the keywords in web page text information more efficiently.

Fig. 5 is a logic block diagram of an electronic device provided by an embodiment of the present invention. As shown in Fig. 5, the electronic device includes: a processor 501, a memory 502 and a bus 503;

The processor 501 and the memory 502 communicate with each other through the bus 503; the processor 501 is configured to call the program instructions in the memory 502 to perform the methods provided by the above method embodiments.

This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments.

This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the methods provided by the above method embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A keyword extraction method, characterized in that the method comprises:

constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence of candidate keywords and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword;

computing, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sorting the convergence weight values of the candidate keywords by magnitude, and extracting, according to that ordering, the target keywords of the web page text information from among the candidate keywords.

2. The method according to claim 1, characterized in that pre-processing the web page text information specifically comprises:

splitting the web page text information into complete sentences, performing word segmentation and part-of-speech tagging on the complete sentences, and filtering by stop words and part of speech to retain the candidate keywords.
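The pre-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word list, the set of retained parts of speech (nouns, verbs and adjectives) and the tiny part-of-speech lexicon TOY_POS are hypothetical, and splitting on word characters stands in for a real word segmenter and tagger, which Chinese text would require.

```python
import re
from typing import Dict, List

# Hypothetical resources; a real system would use a full stop word list
# and a trained word segmenter and part-of-speech tagger.
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}
KEEP_POS = {"NOUN", "VERB", "ADJ"}
TOY_POS: Dict[str, str] = {
    "keyword": "NOUN",
    "extraction": "NOUN",
    "condenses": "VERB",
    "theme": "NOUN",
    "text": "NOUN",
    "quickly": "ADV",
}


def preprocess(page_text: str, pos_lexicon: Dict[str, str] = TOY_POS) -> List[str]:
    """Return the candidate keyword sequence for a piece of text."""
    candidates: List[str] = []
    # Step 1: split into complete sentences on Western and Chinese end marks.
    for sentence in re.split(r"[.!?。！？]+", page_text):
        # Step 2: segment into tokens (here simply runs of word characters).
        for token in re.findall(r"\w+", sentence.lower()):
            # Step 3: drop stop words.
            if token in STOP_WORDS:
                continue
            # Step 4: keep only tokens whose tagged part of speech is retained.
            if pos_lexicon.get(token) in KEEP_POS:
                candidates.append(token)
    return candidates


sequence = preprocess("Keyword extraction condenses the theme of a text. It works quickly!")
```

Tokens missing from the toy lexicon are discarded, which is a simplification chosen only to keep the example self-contained.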

3. The method according to claim 1, characterized in that constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence of candidate keywords and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword specifically comprises:

representing each candidate keyword as a K-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2vec word vector algorithm, and computing from the word vectors the similarity value, i.e. the cosine of the included angle, between each candidate keyword in the sequence of candidate keywords and the other candidate keywords, to obtain the initial weight value of each candidate keyword.
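The cosine similarity step can be sketched as follows. The word vectors in the example are hand-written two-dimensional toys, not vectors trained with CBOW, and the function names cosine and initial_weights are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two word vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def initial_weights(vectors: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Similarity between every ordered pair of distinct candidate keywords."""
    words = list(vectors)
    return {
        (wi, wj): cosine(vectors[wi], vectors[wj])
        for wi in words
        for wj in words
        if wi != wj
    }


# Toy vectors standing in for K-dimensional CBOW embeddings.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
weights = initial_weights(toy_vectors)
```

Each resulting value would serve as the initial weight of the edge between the corresponding pair of candidate keywords in the graph.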

4. The method according to claim 1, characterized in that computing, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword specifically comprises:

The convergence weight value is calculated by the following formula:

where:

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

d is the damping coefficient, with a value range of 0 to 1, representing the probability that a given candidate keyword in the candidate keyword sequence points to other candidate keywords; it is generally set to 0.85;

ω_ji is the similarity value Sim(e_i, f_j) between the i-th candidate keyword and the j-th candidate keyword in the sequence of candidate keywords, and the similarity value is used as the initial weight value between the i-th candidate keyword and the j-th candidate keyword;

e_i is the word vector representation of the i-th candidate keyword;

f_j is the word vector representation of the j-th candidate keyword;

ω_jk is the similarity value Sim(e_k, f_j) between the j-th candidate keyword and the k-th candidate keyword in the sequence of candidate keywords, and the similarity value is used as the initial weight value between the j-th candidate keyword and the k-th candidate keyword;

|V| is the number of candidate keywords;

α_ji is the attention value between the i-th candidate keyword and the j-th candidate keyword in the sequence of candidate keywords, with α_ji = α_ij;

The attention α_ij is calculated by the following formula:

where:

k_w,i is the element in row w, column i of the transfer matrix in R^(|V|×2b);

S_i is the bias term, obtained automatically once the window is fixed;

5. A keyword extraction device, characterized in that the device comprises:

a candidate keyword acquisition module, configured to obtain web page text information and pre-process the web page text information to obtain a sequence of candidate keywords;

an initial weight value acquisition module, configured to construct the candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword in the sequence of candidate keywords and the other candidate keywords, and use the similarity value as the initial weight value of each candidate keyword;

a target keyword acquisition module, configured to compute, from the initial weight value of each candidate keyword, the convergence weight value corresponding to each candidate keyword, sort the convergence weight values of the candidate keywords by magnitude, and extract, according to that ordering, the target keywords of the web page text information from among the candidate keywords.

6. The device according to claim 5, characterized in that the candidate keyword acquisition module is specifically configured to:

7. The device according to claim 5, characterized in that the initial weight value acquisition module is specifically configured to:

8. The device according to claim 5, characterized in that the target keyword acquisition module is specifically configured to:

The convergence weight value is calculated by the following formula:

where:

V_i is the i-th candidate keyword;

V_j is the j-th candidate keyword;

WS(V_i) is the convergence weight value of the i-th candidate keyword;

e_i is the word vector representation of the i-th candidate keyword;

f_j is the word vector representation of the j-th candidate keyword;

|V| is the number of candidate keywords;

The attention α_ij is calculated by the following formula:

where:

k_w,i is the element in row w, column i of the transfer matrix in R^(|V|×2b);

S_i is the bias term, obtained automatically once the window is fixed;

9. An electronic device, characterized by comprising:

at least one processor; and

the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform the method according to any one of claims 1 to 4.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores a computer program, and the computer program causes a computer to perform the method according to any one of claims 1 to 4.