CN109255118A - Keyword extraction method and device - Google Patents

Info

Publication number
CN109255118A
Authority
CN
China
Prior art keywords
candidate keywords
value
candidate
keyword
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710604469.XA
Other languages
Chinese (zh)
Other versions
CN109255118B (en)
Inventor
张春荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Putian Information Technology Co Ltd
Original Assignee
Putian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian Information Technology Co Ltd
Priority to CN201710604469.XA
Publication of CN109255118A
Application granted
Publication of CN109255118B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a keyword extraction method and device. The method comprises: obtaining web page text information and preprocessing it to obtain a sequence of candidate keywords; constructing a candidate keyword graph from the sequence of candidate keywords, computing from the graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword; computing, from the initial weight value of each candidate keyword, the corresponding converged weight value, sorting the candidate keywords by converged weight value, and extracting the target keywords of the web page text information from the candidate keywords according to that ordering. Embodiments of the present invention improve the initial-weight calculation of the TextRank algorithm and thereby extract keywords from web page text information more efficiently.

Description

Keyword extraction method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a keyword extraction method and device.
Background
The purpose of text keyword extraction is to condense the theme of a text and to obtain its core content quickly. Keyword extraction plays an important role in fields such as automatic summarization of news and academic papers, social tagging, and text topic extraction.
Depending on whether the corpus is annotated, keyword extraction can be divided into supervised and unsupervised methods. Typical supervised keyword extraction treats the task as a binary classification problem: each word in a text is judged to be either a keyword or a non-keyword. This approach requires the keywords in the document corpus to be labeled manually in advance so that a classification model can be trained and then used for keyword extraction; it needs extensive manual intervention and is costly.
Unsupervised methods require no manual labeling and, because there is no training process, are more convenient to apply. There are currently three mainstream unsupervised keyword extraction approaches: keyword extraction with the TF-IDF model based on word-frequency statistics, keyword extraction based on topic models, and keyword extraction based on word graph models. Many related optimized algorithms have been built on these three approaches. Keyword extraction based on a word graph model needs no additional document set for training; it relies only on the word structure information of the text itself and is simple and effective, so it is widely used. The TextRank algorithm is its typical representative.
The attention mechanism in neural networks is essentially modeled on the attention mechanism found in human vision and was first applied in the image domain. Its basic idea is that people observing an image do not take in every pixel of the whole image at once; instead, they mostly focus attention on particular parts of the image as needed. Humans also learn, from images observed previously, where attention should be concentrated when observing future images. With attention, a model learns which part of an image to process: at each step, the current state uses the position l learned from the previous state, together with the current input image, to process only the attended pixels rather than all pixels of the image. The benefit is that fewer pixels need to be processed, which reduces the complexity of the task. Attention as applied to images is thus very similar to human attention. The attention used in natural language processing (NLP) is an extension to recurrent neural networks (RNNs) and comes in two forms: a global mechanism and a local mechanism. The global mechanism attends over all words of the source language. The local mechanism reduces the cost of computing attention: rather than considering all words on the source side, it first predicts, using a prediction function, the source-side position to be aligned with the current decoding step, and then considers only the words inside a context window around that position.
The task of keyword extraction is to automatically extract several meaningful words or phrases from a given piece of text. The core idea of the TextRank algorithm comes from the well-known web page ranking algorithm PageRank: on top of the PageRank formula it introduces the concept of edge weights, which represent the similarity of two sentences. The TextRank formula is as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where, for a given point $V_i$, $In(V_i)$ is the set of points that point to it and $Out(V_i)$ is the set of points that $V_i$ points to. $d$ is the damping coefficient, with a value between 0 and 1; it represents the probability of jumping from a given point in the graph to any other point and is typically set to 0.85. The weight of the edge between any two points $V_i$ and $V_j$ in the graph is $w_{ji}$.
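As an illustration of the classical weighted TextRank iteration described above, the following is a minimal Python sketch. The function name `textrank`, the nested-dictionary graph layout, the initial score of 1.0, the stopping tolerance and the toy graph are all assumptions made for this example; they do not come from the patent.

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j).

    weights[j][i] is the weight w_ji of the edge from node j to node i
    (hypothetical layout chosen for this sketch).
    """
    nodes = sorted(set(weights) | {i for out in weights.values() for i in out})
    # Total outgoing weight of each node: the denominator sum_k w_jk.
    out_sum = {j: sum(weights.get(j, {}).values()) for j in nodes}
    scores = {v: 1.0 for v in nodes}  # assumed uniform starting score
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in nodes
                if i in weights.get(j, {}) and out_sum[j] > 0
            )
            new[i] = (1 - d) + d * incoming
        delta = max(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        if delta < tol:  # stop once the scores no longer change noticeably
            break
    return scores


# Toy undirected graph written as two directed edges per pair.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0},
    "c": {"a": 1.0},
}
scores = textrank(graph)
```

In this toy graph node `a` is linked to both other nodes, so it ends up with the highest score, while `b` and `c` receive equal scores.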
The TextRank algorithm ranks keywords using the co-occurrence relations between nearby words and extracts them directly from the text itself. It splits the text into its smallest units, i.e. words, which serve as network nodes and form a word network graph model. Like PageRank, TextRank in theory needs to compute edge weights when iteratively computing word weights, but to simplify the computation it usually assigns identical initial weights by default and divides a node's weight equally among adjacent words. It therefore does not consider the influence of relations between words, that is, the influence of the relations between words in external documents.
In addition, another factor affecting the weight distribution of word nodes is the importance of the word node itself, which represents the influence of the document's internal structure and is usually adjusted through adjacent nodes. However, the classical TextRank algorithm does not rely on any other training corpus and does not consider the structural relations between words inside the text when building the graph model for keyword extraction, so it cannot reflect the influence of neighboring words well.
Therefore, how to improve the initial-weight calculation of the TextRank algorithm so as to extract document keywords more efficiently has become a problem in urgent need of a solution.
Summary of the invention
In view of the defects in the prior art, embodiments of the present invention provide a keyword extraction method and device.
In a first aspect, an embodiment of the present invention provides a keyword extraction method, the method comprising:
obtaining web page text information and preprocessing the web page text information to obtain a sequence of candidate keywords;
constructing a candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword;
computing, from the initial weight value of each candidate keyword, the converged weight value corresponding to each candidate keyword, sorting the candidate keywords by converged weight value, and extracting the target keywords of the web page text information from the candidate keywords according to that ordering.
Optionally, preprocessing the web page text information specifically comprises:
splitting the web page text information into complete sentences, performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words and filtering by part of speech, and retaining the candidate keywords.
Optionally, constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword specifically comprises:
representing each candidate keyword as a k-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2vec algorithm, and computing from these word vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, i.e. the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
where the value of k is an element of the transition matrix R of the candidate keyword graph.
Optionally, computing, from the initial weight value of each candidate keyword, the converged weight value corresponding to each candidate keyword specifically comprises:
iteratively computing the converged weight value of each candidate keyword with the following formula, based on the attention mechanism;
The converged weight value is calculated as:
where
$V_i$ is the i-th candidate keyword;
$V_j$ is the j-th candidate keyword;
$WS(V_i)$ is the converged weight value of the i-th candidate keyword;
$d$ is the damping coefficient, with a value between 0 and 1; it represents the probability that a given candidate keyword in the candidate keyword sequence points to any other candidate keyword, and is typically set to 0.85;
$In(V_i)$ is the set of candidate keywords that point to the i-th candidate keyword;
$Out(V_i)$ is the set of candidate keywords that the i-th candidate keyword points to;
$\omega_{ji}$ is the similarity value $\mathrm{Sim}(e_i, f_j)$ between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
$e_i$ is the word vector of the i-th candidate keyword;
$f_j$ is the word vector of the j-th candidate keyword;
$\omega_{jk}$ is the similarity value $\mathrm{Sim}(e_k, f_j)$ between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
$k_{w,i}$ is an element of the transition matrix $R^{|V| \times 2b}$ of the candidate keyword graph;
$2b$ is the window size, i.e. at most $2b$ candidate keywords co-occur within one window;
$|V|$ is the number of candidate keywords;
$\alpha_{ji}$ is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with $\alpha_{ji} = \alpha_{ij}$;
The attention value $\alpha_{ij}$ is calculated as:
where
$k_{w,i}$ is the value of the element in row $w$, column $i$ of the transition matrix $R^{|V| \times 2b}$;
$\exp(k_{w,i})$ is the exponential function with base $e$, where the constant $e$ is approximately 2.718282;
$S_i$ is a bias term, obtained automatically once the window is fixed;
The initial weight value $\omega_{ji} = \mathrm{Sim}(e_i, f_j)$, the cosine of the angle between the two word vectors, is calculated as:
$$\mathrm{Sim}(e_i, f_j) = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
In a second aspect, an embodiment of the present invention provides a keyword extraction device, the device comprising:
a candidate keyword acquisition module, configured to obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords;
an initial weight value acquisition module, configured to construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and use the similarity value as the initial weight value of each candidate keyword;
a target keyword acquisition module, configured to compute, from the initial weight value of each candidate keyword, the converged weight value corresponding to each candidate keyword, sort the candidate keywords by converged weight value, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.
Optionally, the candidate keyword acquisition module is specifically configured to:
split the web page text information into complete sentences, perform word segmentation and part-of-speech tagging on each sentence, filter out stop words and filter by part of speech, and retain the candidate keywords.
Optionally, the initial weight value acquisition module is specifically configured to:
represent each candidate keyword as a k-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2vec algorithm, and compute from these word vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, i.e. the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
where the value of k is an element of the transition matrix R of the candidate keyword graph.
Optionally, the target keyword acquisition module is specifically configured to:
iteratively compute the converged weight value of each candidate keyword with the following formula, based on the attention mechanism;
The converged weight value is calculated as:
where
$V_i$ is the i-th candidate keyword;
$V_j$ is the j-th candidate keyword;
$WS(V_i)$ is the converged weight value of the i-th candidate keyword;
$d$ is the damping coefficient, with a value between 0 and 1; it represents the probability that a given candidate keyword in the candidate keyword sequence points to any other candidate keyword, and is typically set to 0.85;
$In(V_i)$ is the set of candidate keywords that point to the i-th candidate keyword;
$Out(V_i)$ is the set of candidate keywords that the i-th candidate keyword points to;
$\omega_{ji}$ is the similarity value $\mathrm{Sim}(e_i, f_j)$ between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
$e_i$ is the word vector of the i-th candidate keyword;
$f_j$ is the word vector of the j-th candidate keyword;
$\omega_{jk}$ is the similarity value $\mathrm{Sim}(e_k, f_j)$ between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
$k_{w,i}$ is an element of the transition matrix $R^{|V| \times 2b}$ of the candidate keyword graph;
$2b$ is the window size, i.e. at most $2b$ candidate keywords co-occur within one window;
$|V|$ is the number of candidate keywords;
$\alpha_{ji}$ is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with $\alpha_{ji} = \alpha_{ij}$;
The attention value $\alpha_{ij}$ is calculated as:
where
$k_{w,i}$ is the value of the element in row $w$, column $i$ of the transition matrix $R^{|V| \times 2b}$;
$\exp(k_{w,i})$ is the exponential function with base $e$, where the constant $e$ is approximately 2.718282;
$S_i$ is a bias term, obtained automatically once the window is fixed;
The initial weight value $\omega_{ji} = \mathrm{Sim}(e_i, f_j)$, the cosine of the angle between the two word vectors, is calculated as:
$$\mathrm{Sim}(e_i, f_j) = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
In a third aspect, an embodiment of the present invention provides an electronic device, the electronic device comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform any of the corresponding methods described above.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program that causes a computer to perform any of the corresponding methods described above.
The keyword extraction method and device provided by embodiments of the present invention preprocess the text information, fully examine the relations between candidate keywords, and construct the candidate keyword graph using the text itself and the external information provided by the text collection; then, based on the attention mechanism, they compute the influence of the text's internal structure, iteratively compute the weights, and extract the target keywords. Embodiments of the present invention thereby improve the initial-weight calculation of the TextRank algorithm and extract keywords from web page text information more efficiently.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the keyword extraction method in an embodiment of the present invention;
Fig. 2 is a flow diagram of the attention-based TextRank keyword extraction method in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the attention-based weight update in an embodiment of the present invention;
Fig. 4 is a structural diagram of the keyword extraction device in an embodiment of the present invention;
Fig. 5 is a logic block diagram of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a keyword extraction method. Fig. 1 is a flow diagram of the keyword extraction method in an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S101: obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords;
Here, obtaining the web page text information specifically means parsing the web page with a parser, converting the HTML page into text, and keeping only the useful text information.
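The patent does not name a particular parser. As one possible sketch of this step, the following uses Python's standard-library `html.parser` module to reduce an HTML page to its visible text. The class name `TextExtractor`, the helper `html_to_text`, the choice to skip only `<script>` and `<style>` contents, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}  # assumed definition of "not useful" text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Parse an HTML string and return its visible text joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Body text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
```

A real page would usually need further cleaning (navigation bars, advertisements, comments), which this sketch does not attempt.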
Preprocessing the web page text information means splitting the given text T into complete sentences; within each sentence, word segmentation and part-of-speech tagging are performed, stop words are filtered out, and part-of-speech filtering is applied so that only words of specified parts of speech, such as nouns, verbs and adjectives, are retained as candidate keywords.
The sequence of candidate keywords refers to the multiple candidate keywords obtained from the web page text after this preprocessing.
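The preprocessing steps can be sketched as follows. Word segmentation and part-of-speech tagging normally come from an external tool, which the patent does not name, so this sketch takes already tagged `(word, pos)` pairs as input. The stop-word list, the tag set `n`/`v`/`a`, the function names and the sample data are hypothetical.

```python
import re

STOP_WORDS = {"the", "of", "on", "in", "will", "be"}  # hypothetical stop-word list
KEEP_POS = {"n", "v", "a"}  # assumed tags for noun, verb, adjective


def split_sentences(text):
    """Split text into complete sentences on sentence-ending punctuation."""
    pieces = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in pieces if p.strip()]


def candidate_keywords(tagged_sentences):
    """Drop stop words and keep only words whose POS tag is in KEEP_POS.

    tagged_sentences is a list of sentences, each a list of (word, pos)
    pairs as produced by some segmenter/tagger (not shown here).
    """
    sequence = []
    for sentence in tagged_sentences:
        for word, pos in sentence:
            if word.lower() in STOP_WORDS:
                continue
            if pos in KEEP_POS:
                sequence.append(word)
    return sequence


sentences = split_sentences("It rains. It pours!")
tagged = [[
    ("artificial", "a"), ("intelligence", "n"), ("will", "d"),
    ("influence", "v"), ("the", "u"), ("society", "n"), (".", "wp"),
]]
keywords = candidate_keywords(tagged)
```

The resulting list preserves the original word order, which matters later because the co-occurrence window slides over this sequence.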
Step S102: construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and use the similarity value as the initial weight value of each candidate keyword;
Here, the candidate keyword graph is built from co-occurrence relations: each candidate keyword is a word node, and an edge is constructed between any two candidate keywords that co-occur within a window. An edge exists between two nodes only when their corresponding words co-occur in a window of length 2b, where 2b denotes the window size, i.e. at most 2b words co-occur.
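A minimal sketch of this graph construction is shown below, assuming the graph is undirected and stored as an adjacency map. The function name `build_cooccurrence_graph`, the parameter `window` (standing in for 2b) and the sample sequence are assumptions for illustration.

```python
from collections import defaultdict


def build_cooccurrence_graph(sequence, window=4):
    """Return an undirected adjacency map: word -> set of co-occurring words.

    Two words are linked when they appear together in some run of `window`
    consecutive candidate keywords (`window` plays the role of 2b).
    """
    graph = defaultdict(set)
    for i, word in enumerate(sequence):
        neighbours = graph[word]  # creates the node even if it stays isolated
        # The other members of the window that starts at position i.
        for other in sequence[i + 1 : i + window]:
            if other != word:
                neighbours.add(other)
                graph[other].add(word)
    return dict(graph)


seq = ["future", "artificial_intelligence", "society", "influence", "imagine"]
graph = build_cooccurrence_graph(seq, window=2)
```

With `window=2` only directly adjacent candidate keywords are linked, so the example produces a simple chain; a larger window adds edges between words that are further apart.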
The initial weight value, denoted $\omega$, is the similarity value $\mathrm{Sim}(e_i, f_j)$ between two candidate keywords, i.e. the cosine of the angle between their word vectors, calculated as:
$$\mathrm{Sim}(e_i, f_j) = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
where $e_i$ is the word vector of the i-th candidate keyword and $f_j$ is the word vector of the j-th candidate keyword.
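The description elsewhere characterizes this similarity as the cosine of the angle between word vectors. Under that reading, the initial edge weights could be sketched as below. The toy two-dimensional vectors are made up for the example and are not the output of a trained word2vec model; the function names are hypothetical.

```python
import math


def cosine_similarity(e, f):
    """Cosine of the angle between vectors e and f (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(e, f))
    norm_e = math.sqrt(sum(x * x for x in e))
    norm_f = math.sqrt(sum(y * y for y in f))
    if norm_e == 0.0 or norm_f == 0.0:
        return 0.0
    return dot / (norm_e * norm_f)


def initial_weights(graph, vectors):
    """Attach Sim(e_i, f_j) to every edge j -> i of the candidate keyword graph."""
    return {
        j: {i: cosine_similarity(vectors[i], vectors[j]) for i in neighbours}
        for j, neighbours in graph.items()
    }


# Hypothetical toy vectors; real ones would come from a trained embedding model.
vectors = {"ai": [1.0, 1.0], "society": [1.0, 0.0], "influence": [0.0, 1.0]}
graph = {"ai": {"society", "influence"}, "society": {"ai"}, "influence": {"ai"}}
omega = initial_weights(graph, vectors)
```

Because cosine similarity is symmetric, the weight on the edge from `ai` to `society` equals the weight in the opposite direction.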
Step S103: compute, from the initial weight value of each candidate keyword, the converged weight value corresponding to each candidate keyword, sort the candidate keywords by converged weight value, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.
Here, the converged weight value, denoted WS, takes into account the influence of the text's internal structure: the importance of a candidate keyword's word node is adjusted through its adjacent word nodes. The weight transition matrix is computed with the following formulas based on the attention mechanism, and the weight of each word node is propagated iteratively until convergence; the value obtained after convergence is the converged weight value;
The converged weight value is calculated as:
where
$V_i$ is the i-th candidate keyword;
$V_j$ is the j-th candidate keyword;
$WS(V_i)$ is the converged weight value of the i-th candidate keyword;
$d$ is the damping coefficient, with a value between 0 and 1; it represents the probability that a given candidate keyword in the candidate keyword sequence points to any other candidate keyword, and is typically set to 0.85;
$In(V_i)$ is the set of candidate keywords that point to the i-th candidate keyword;
$Out(V_i)$ is the set of candidate keywords that the i-th candidate keyword points to;
$\omega_{ji}$ is the similarity value $\mathrm{Sim}(e_i, f_j)$ between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
$e_i$ is the word vector of the i-th candidate keyword;
$f_j$ is the word vector of the j-th candidate keyword;
$\omega_{jk}$ is the similarity value $\mathrm{Sim}(e_k, f_j)$ between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
$k_{w,i}$ is an element of the transition matrix $R^{|V| \times 2b}$ of the candidate keyword graph;
$2b$ is the window size, i.e. at most $2b$ candidate keywords co-occur within one window;
$|V|$ is the number of candidate keywords;
$\alpha_{ji}$ is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with $\alpha_{ji} = \alpha_{ij}$;
The attention value $\alpha_{ij}$ is calculated as:
where
$k_{w,i}$ is the value of the element in row $w$, column $i$ of the transition matrix $R^{|V| \times 2b}$;
$\exp(k_{w,i})$ is the exponential function with base $e$, where the constant $e$ is approximately 2.718282;
$S_i$ is a bias term, obtained automatically once the window is fixed;
The initial weight value $\omega_{ji} = \mathrm{Sim}(e_i, f_j)$, the cosine of the angle between the two word vectors, is calculated as:
$$\mathrm{Sim}(e_i, f_j) = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
The keyword extraction method provided by this embodiment of the present invention constructs a candidate keyword graph from the sequence of candidate keywords obtained by preprocessing the web page text information, computes the similarity value between each candidate keyword in the sequence and the other candidate keywords, and uses the similarity value as the initial weight value of each candidate keyword. From the initial weight value of each candidate keyword, the converged weight value of each candidate keyword is obtained, and the target keywords in the web page text information are extracted according to the ordering of the converged weight values. The embodiment thereby improves the initial-weight calculation of the TextRank algorithm and extracts keywords from web page text information more efficiently.
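The final stage, iterating to convergence and then ranking, might look like the sketch below. The exact update formula is not reproduced in this text, so the code assumes a TextRank-style update in which each incoming contribution $\omega_{ji} / \sum_k \omega_{jk} \cdot WS(V_j)$ is additionally scaled by an attention value $\alpha_{ji}$. That form, the function names, the default of 1.0 for missing attention values, and the sample weights are all assumptions, not the patent's stated formula. The sketch also assumes positive edge weights and attention values no greater than 1.

```python
def converge_weights(omega, alpha=None, d=0.85, tol=1e-6, max_iter=200):
    """Iterate an ASSUMED attention-scaled TextRank update until it settles.

    omega[j][i] is the initial weight of edge j -> i; alpha[j][i] is an
    attention value for the same edge (treated as 1.0 when not supplied).
    """
    nodes = sorted(set(omega) | {i for out in omega.values() for i in out})
    out_sum = {j: sum(omega.get(j, {}).values()) for j in nodes}
    ws = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            total = 0.0
            for j in nodes:
                w_ji = omega.get(j, {}).get(i)
                if w_ji is None or out_sum[j] <= 0:
                    continue  # no edge j -> i, or unusable normaliser
                a_ji = 1.0 if alpha is None else alpha.get(j, {}).get(i, 1.0)
                total += a_ji * w_ji / out_sum[j] * ws[j]
            new[i] = (1 - d) + d * total
        delta = max(abs(new[v] - ws[v]) for v in nodes)
        ws = new
        if delta < tol:
            break
    return ws


def top_keywords(ws, n=3):
    """Sort by converged weight (descending) and return the first n words."""
    ranked = sorted(ws.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


omega = {
    "ai": {"society": 0.8, "influence": 0.6},
    "society": {"ai": 0.8},
    "influence": {"ai": 0.6},
}
ws = converge_weights(omega)
best = top_keywords(ws, n=2)
```

Ties are broken alphabetically here only to make the ordering deterministic; the patent does not specify a tie-breaking rule.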
On the basis of the above embodiment, preprocessing the web page text information specifically comprises:
splitting the web page text information into complete sentences, performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words and filtering by part of speech, and retaining the candidate keywords.
On the basis of the above embodiment, constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword specifically comprises:
representing each candidate keyword as a k-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2vec algorithm, and computing from these word vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, i.e. the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
where the value of k is an element of the transition matrix R of the candidate keyword graph.
Here, the transition matrix R of the candidate keyword graph refers to the matrix $R^{|V| \times 2b}$, where $2b$ is the window size, set manually, i.e. at most $2b$ candidate keywords co-occur within one window, and $|V|$ is the number of candidate keywords.
On the basis of the above embodiment, computing, from the initial weight value of each candidate keyword, the converged weight value corresponding to each candidate keyword specifically comprises:
iteratively computing the converged weight value of each candidate keyword with the following formula, based on the attention mechanism;
The converged weight value is calculated as:
where
$V_i$ is the i-th candidate keyword;
$V_j$ is the j-th candidate keyword;
$WS(V_i)$ is the converged weight value of the i-th candidate keyword;
$d$ is the damping coefficient, with a value between 0 and 1; it represents the probability that a given candidate keyword in the candidate keyword sequence points to any other candidate keyword, and is typically set to 0.85;
$In(V_i)$ is the set of candidate keywords that point to the i-th candidate keyword;
$Out(V_i)$ is the set of candidate keywords that the i-th candidate keyword points to;
$\omega_{ji}$ is the similarity value $\mathrm{Sim}(e_i, f_j)$ between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
$e_i$ is the word vector of the i-th candidate keyword;
$f_j$ is the word vector of the j-th candidate keyword;
$\omega_{jk}$ is the similarity value $\mathrm{Sim}(e_k, f_j)$ between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
$k_{w,i}$ is an element of the transition matrix $R^{|V| \times 2b}$ of the candidate keyword graph;
$2b$ is the window size, i.e. at most $2b$ candidate keywords co-occur within one window;
$|V|$ is the number of candidate keywords;
$\alpha_{ji}$ is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with $\alpha_{ji} = \alpha_{ij}$;
The attention value $\alpha_{ij}$ is calculated as:
where
$k_{w,i}$ is the value of the element in row $w$, column $i$ of the transition matrix $R^{|V| \times 2b}$;
$\exp(k_{w,i})$ is the exponential function with base $e$, where the constant $e$ is approximately 2.718282;
$S_i$ is a bias term, obtained automatically once the window is fixed;
The initial weight value $\omega_{ji} = \mathrm{Sim}(e_i, f_j)$, the cosine of the angle between the two word vectors, is calculated as:
$$\mathrm{Sim}(e_i, f_j) = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
A specific implementation of the embodiment of the present invention is as follows:
Fig. 2 is a flow diagram of the attention-based TextRank keyword extraction method in an embodiment of the present invention. As shown in Fig. 2, the method comprises:
Step S201, obtain the web page body text: the web page is parsed with a parser that converts the HTML page into plain text, retaining only the useful text information.
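Step S201 can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the parser used in the embodiment, which the text does not name. The class name `TextOnlyParser`, the helper `html_to_text`, and the choice to skip `<script>` and `<style>` content are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects visible text and skips <script> and <style> content (an assumed rule)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Hypothetical helper: reduce an HTML page to its plain text."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A real pipeline would probably also drop navigation, advertisements and other page furniture before passing the text to step S202.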
Step S202, preprocess the web page body text: the given text T is split into complete sentences; each sentence is segmented into words and tagged with parts of speech; stop words are filtered out; and part-of-speech filtering retains only words of the specified parts of speech, such as nouns, verbs and adjectives. The retained words are the candidate keywords. For example, the text "In the next 20 years, the impact of artificial intelligence on the whole society will be too great to imagine." is first segmented into words; after part-of-speech tagging it becomes "in/p future/nt 20/m years/q within/nd ,/wp artificial intelligence/n on/p whole/a society/n produce/v of/u impact/v will/d great/a to/v hard to/d imagine/v ./wp"; and after part-of-speech filtering, "future artificial intelligence whole society produce impact great hard to imagine" is obtained as the initial candidate keyword sequence.
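The filtering part of step S202 can be sketched as follows, assuming segmentation and part-of-speech tagging have already produced `(word, tag)` pairs. The tag prefixes `n`, `v` and `a` mirror the tags visible in the example above; the stop-word set and the function name `candidate_keywords` are illustrative assumptions, not part of the embodiment.

```python
# Assumed tag scheme: tags starting with "n" (noun), "v" (verb) or "a" (adjective) are kept.
KEPT_TAG_PREFIXES = ("n", "v", "a")
# Illustrative stop-word list; a real system would load a full list.
STOP_WORDS = {"of", "the", "to"}


def candidate_keywords(tagged_tokens, stop_words=STOP_WORDS):
    """Return words that survive stop-word and part-of-speech filtering, in order."""
    kept = []
    for word, tag in tagged_tokens:
        if word in stop_words:
            continue
        if tag.startswith(KEPT_TAG_PREFIXES):
            kept.append(word)
    return kept
```

Note that a prefix match also keeps subtypes such as `nt` and `nd`; a stricter filter would list the allowed tags explicitly.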
Step S203, construct the transfer matrix R_{|V|×2b} of the candidate keyword graph according to the word2VEC algorithm: V is the set of candidate keyword vocabulary nodes, and the edge between any two nodes is constructed from their co-occurrence relation. An edge exists between two nodes only when the corresponding words co-occur within a window of length 2b, where 2b is the window size, i.e., at most 2b words co-occur. Suppose a sentence consists of the following words in order: w1, w2, w3, w4, w5, ..., wn. Then [w1, w2, ..., w2b], [w2, w3, ..., w2b+1], [w3, w4, ..., w2b+2] and so on are each a window. An undirected, unweighted edge exists between the nodes corresponding to any two words in the same window. Using the external information provided by the text itself and by the text collection, the embodiment of the present invention trains the CBOW model of word2VEC on a sample text collection, represents each word in the dictionary D as a K-dimensional word vector, then computes the cosine of the angle between word vectors to obtain the similarity between each word in D and the other words, and uses the similarity as the initial weight value of the vocabulary nodes. Similarity reflects the degree of association between words; for example, the similarity between "artificial intelligence" and "machine learning" is higher than that between "artificial intelligence" and "pineapple". The similarity between the i-th word and the j-th word in the document collection is Sim(e_i, f_j), where e_i and f_j are their word vectors. The similarity formula is as follows:
$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
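The two ingredients of step S203, sliding-window co-occurrence edges and cosine similarity between word vectors, can be sketched in plain Python. The word vectors are assumed to come from a separately trained CBOW model, which is not shown here; the function names and the representation of edges as `frozenset` pairs are choices made for this example.

```python
import math
from itertools import combinations


def cooccurrence_edges(words, window):
    """Undirected, unweighted edges between words sharing a window of the given size."""
    edges = set()
    # Slide the window one word at a time; a short sentence yields a single window.
    for start in range(max(len(words) - window + 1, 1)):
        for a, b in combinations(words[start:start + window], 2):
            if a != b:
                edges.add(frozenset((a, b)))
    return edges


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Here `window` plays the role of 2b, and each edge found by `cooccurrence_edges` would receive `cosine_similarity` of its two word vectors as its initial weight.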
Step S204, compute the weights based on the attention mechanism and iteratively propagate the weight of each node: considering the influence of the internal structure of the text, the importance of a vocabulary node is adjusted by its adjacent nodes. In the previously described sequence "future artificial intelligence whole society produce impact great hard to imagine", for example, "artificial intelligence" clearly has greater influence and should be given a larger weight. The present invention computes the weight transfer matrix based on the attention mechanism and iteratively propagates the weight of each node using the following formula until convergence:
(formula not reproduced in this text)
Wherein,
V_i is the i-th candidate keyword;
V_j is the j-th candidate keyword;
WS(V_i) is the convergence weight value of the i-th candidate keyword;
d is the damping coefficient, with a value range of 0 to 1, representing the probability that a given candidate keyword in the candidate keyword sequence points to other candidate keywords; it is generally set to 0.85;
In(V_i) is the set of candidate keywords that point to the i-th candidate keyword;
Out(V_i) is the set of candidate keywords that the i-th candidate keyword points to;
ω_ji is the similarity value Sim(e_i, f_j) between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
e_i is the word vector of the i-th candidate keyword;
f_j is the word vector of the j-th candidate keyword;
ω_jk is the similarity value Sim(e_k, f_j) between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
k_{w,i} is an element of the transfer matrix R_{|V|×2b} of the candidate keyword graph;
2b is the window size, i.e., at most 2b candidate keywords co-occur in one window;
|V| is the number of candidate keywords;
α_ji is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with α_ji = α_ij;
The attention value α_ij is calculated as:
(formula not reproduced in this text)
Wherein,
k_{w,i} is the value of the element in row w, column i of the transfer matrix R_{|V|×2b}, indicating the importance of the relative position of each word;
exp(k_{w,i}) is the exponential function with base e, where e is approximately 2.718282;
S_i is the bias term, obtained automatically once the window is fixed;
The initial weight value ω_ji = Sim(e_i, f_j) is calculated as the cosine of the angle between the two word vectors:
$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
Therefore, the update of WS(V_i) is influenced not only by ω_ji but also by α_ij: under the influence of the internal structure of the text, the importance of a vocabulary node is adjusted by its adjacent nodes. The nodes and edges related to candidate keyword V_i in the candidate keyword graph are shown schematically in Fig. 3,
wherein a solid arrow represents the transition probability from node V_i to the node V_j that it points to, with the line thickness indicating the magnitude of the weight transition probability; a dotted line represents the transition probability of jumping from node V_j to node V_i.
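The iterative propagation in step S204 can be sketched as below. Because the update formula itself is not reproduced in this text, the sketch assumes the standard weighted TextRank update with the attention value applied as a multiplier on each incoming edge, i.e. WS(V_i) = (1 - d) + d · Σ_j α_ji · ω_ji / Σ_k ω_jk · WS(V_j). That placement of α_ji is an assumption, and so are the function name, the matrix layout and the stopping rule.

```python
def propagate_weights(sim, attention, d=0.85, tol=1e-6, max_iter=200):
    """Iterate an assumed attention-weighted TextRank update until it stabilises.

    sim[j][i]       similarity weight on the edge between nodes j and i (0 means no edge)
    attention[j][i] attention multiplier for that edge (assumed to be given)
    """
    n = len(sim)
    out_sum = [sum(row) for row in sim]  # total outgoing weight of each node
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = []
        for i in range(n):
            incoming = sum(
                attention[j][i] * sim[j][i] / out_sum[j] * ws[j]
                for j in range(n)
                if j != i and sim[j][i] > 0 and out_sum[j] > 0
            )
            new_ws.append((1 - d) + d * incoming)
        # Stop once no node weight moved by more than the tolerance.
        if max(abs(a - b) for a, b in zip(new_ws, ws)) < tol:
            return new_ws
        ws = new_ws
    return ws
```

With all attention values equal to 1 this reduces to ordinary weighted TextRank. Attention values that inflate the total weight can prevent convergence, which is why the sketch caps the number of iterations with `max_iter`.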
Step S205, obtain the target keyword sequence: the node weights are sorted in descending order to obtain the T most important words as candidate keywords. These T words are marked in the original text, and those that form adjacent phrases are merged into multi-word keywords. For example, the text above contains the sequence "future artificial intelligence whole society produce impact great hard to imagine". If "hard to" and "imagine" are both candidate keywords, they are merged into "hard to imagine", which is added to the target keyword sequence.
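Step S205 can be sketched as a descending sort followed by a merge of neighbouring keywords. The function names and the use of a space to join merged words are assumptions for this example; the weights in the usage note are made up for illustration and are not results from the embodiment.

```python
def top_keywords(weights, t):
    """Return the t words with the largest converged weights as a set."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:t])


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keyword tokens in the original order into phrases."""
    phrases, current = [], []
    for token in tokens:
        if token in keywords:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases
```

For instance, with illustrative weights in which "artificial intelligence", "hard to" and "imagine" rank highest, `merge_adjacent` over the original token order yields "artificial intelligence" and the merged phrase "hard to imagine".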
The keyword extraction method provided by the embodiment of the present invention preprocesses the text information, fully examines the relationships among the candidate keywords, and constructs the candidate keyword graph from the external information provided by the text itself and the text collection. It then computes the influence of the internal structure of the text based on the attention mechanism, iteratively computes the weights, and extracts the target keywords. The embodiment of the present invention thereby improves the initial-weight computation of the TextRank algorithm and extracts the keywords in web page text information more efficiently.
An embodiment of the present invention provides a keyword extraction device. Fig. 4 is a structural schematic diagram of the keyword extraction device in the embodiment of the present invention. As shown in Fig. 4, the device comprises a candidate keyword acquisition module 401, an initial weight value acquisition module 402 and a target keyword acquisition module 403;
The candidate keyword acquisition module 401 is configured to obtain web page text information and preprocess it to obtain the sequence of candidate keywords. The initial weight value acquisition module 402 is configured to construct the candidate keyword graph from the sequence of candidate keywords, compute from the graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and use the similarity value as the initial weight value of each candidate keyword. The target keyword acquisition module 403 is configured to compute the convergence weight value of each candidate keyword from the initial weight values, sort the convergence weight values by magnitude, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.
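The division of labour among modules 401, 402 and 403 can be pictured as three pluggable stages chained in order. The sketch below only illustrates that wiring; the class name `KeywordExtractor`, the field names and the callable signatures are assumptions, and the stages themselves are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KeywordExtractor:
    """Chains three stages that stand in for modules 401, 402 and 403."""

    get_candidates: Callable[[str], List[str]]            # role of module 401
    get_initial_weights: Callable[[List[str]], Dict]      # role of module 402
    get_targets: Callable[[List[str], Dict], List[str]]   # role of module 403

    def extract(self, page_text):
        candidates = self.get_candidates(page_text)
        weights = self.get_initial_weights(candidates)
        return self.get_targets(candidates, weights)
```

Keeping the stages as separate callables makes it easy to swap, say, the weighting stage without touching preprocessing or ranking.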
On the basis of the above embodiments, the candidate keyword acquisition module is specifically configured to:
Split the web page text information into complete sentences, perform word segmentation and part-of-speech tagging on each sentence, filter by stop words and part of speech, and retain the candidate keywords.
On the basis of the above embodiments, the initial weight value acquisition module is specifically configured to:
Represent each candidate keyword as a K-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2VEC word-vector algorithm, and compute from these vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, namely the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
Wherein, the value of k is an element of the transfer matrix R in the candidate keyword graph.
On the basis of the above embodiments, the target keyword acquisition module is specifically configured to:
Obtain the convergence weight value of each candidate keyword by iterating the following formula according to an attention mechanism;
The convergence weight value is calculated as:
(formula not reproduced in this text)
Wherein,
V_i is the i-th candidate keyword;
V_j is the j-th candidate keyword;
WS(V_i) is the convergence weight value of the i-th candidate keyword;
d is the damping coefficient, with a value range of 0 to 1, representing the probability that a given candidate keyword in the candidate keyword sequence points to other candidate keywords; it is generally set to 0.85;
In(V_i) is the set of candidate keywords that point to the i-th candidate keyword;
Out(V_i) is the set of candidate keywords that the i-th candidate keyword points to;
ω_ji is the similarity value Sim(e_i, f_j) between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
e_i is the word vector of the i-th candidate keyword;
f_j is the word vector of the j-th candidate keyword;
ω_jk is the similarity value Sim(e_k, f_j) between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
k_{w,i} is an element of the transfer matrix R_{|V|×2b} of the candidate keyword graph;
2b is the window size, i.e., at most 2b candidate keywords co-occur in one window;
|V| is the number of candidate keywords;
α_ji is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with α_ji = α_ij;
The attention value α_ij is calculated as:
(formula not reproduced in this text)
Wherein,
k_{w,i} is the value of the element in row w, column i of the transfer matrix R_{|V|×2b};
exp(k_{w,i}) is the exponential function with base e, where e is approximately 2.718282;
S_i is the bias term, obtained automatically once the window is fixed;
The initial weight value ω_ji = Sim(e_i, f_j) is calculated as the cosine of the angle between the two word vectors:
$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
The keyword extraction device provided by the embodiment of the present invention is used to implement the keyword extraction method provided by the embodiment of the present invention. Its specific implementation is described in detail in the method embodiments above and is not repeated here.
The keyword extraction device provided by the embodiment of the present invention preprocesses the text information, fully examines the relationships among the candidate keywords, and constructs the candidate keyword graph from the external information provided by the text itself and the text collection. It then computes the influence of the internal structure of the text based on the attention mechanism, iteratively computes the weights, and extracts the target keywords. The embodiment of the present invention thereby improves the initial-weight computation of the TextRank algorithm and extracts the keywords in web page text information more efficiently.
Fig. 5 is a logic block diagram of an electronic device provided by an embodiment of the present invention. As shown in Fig. 5, the electronic device comprises a processor 501, a memory 502 and a bus 503;
Wherein, the processor 501 and the memory 502 communicate with each other through the bus 503; the processor 501 is configured to call the program instructions in the memory 502 to execute the methods provided by the above method embodiments.
This embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments.
This embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the methods provided by the above method embodiments.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the embodiments of the present invention and do not limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A keyword extraction method, characterized in that the method comprises:
Obtaining web page text information and preprocessing the web page text information to obtain a sequence of candidate keywords;
Constructing a candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword;
Computing the convergence weight value of each candidate keyword from the initial weight values of the candidate keywords, sorting the convergence weight values of the candidate keywords by magnitude, and extracting the target keywords of the web page text information from the candidate keywords according to that ordering.
2. The method according to claim 1, characterized in that preprocessing the web page text information specifically comprises:
Splitting the web page text information into complete sentences, performing word segmentation and part-of-speech tagging on each sentence, filtering by stop words and part of speech, and retaining the candidate keywords.
3. The method according to claim 1, characterized in that constructing the candidate keyword graph from the sequence of candidate keywords, computing from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and using the similarity value as the initial weight value of each candidate keyword specifically comprises:
Representing each candidate keyword as a K-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2VEC word-vector algorithm, and computing from these vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, namely the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
Wherein, the value of k is an element of the transfer matrix R in the candidate keyword graph.
4. The method according to claim 1, characterized in that computing the convergence weight value of each candidate keyword from the initial weight values of the candidate keywords specifically comprises:
Obtaining the convergence weight value of each candidate keyword by iterating the following formula according to an attention mechanism;
The convergence weight value is calculated as:
(formula not reproduced in this text)
Wherein,
V_i is the i-th candidate keyword;
V_j is the j-th candidate keyword;
WS(V_i) is the convergence weight value of the i-th candidate keyword;
d is the damping coefficient, with a value range of 0 to 1, representing the probability that a given candidate keyword in the candidate keyword sequence points to other candidate keywords; it is generally set to 0.85;
In(V_i) is the set of candidate keywords that point to the i-th candidate keyword;
Out(V_i) is the set of candidate keywords that the i-th candidate keyword points to;
ω_ji is the similarity value Sim(e_i, f_j) between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
e_i is the word vector of the i-th candidate keyword;
f_j is the word vector of the j-th candidate keyword;
ω_jk is the similarity value Sim(e_k, f_j) between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
k_{w,i} is an element of the transfer matrix R_{|V|×2b} of the candidate keyword graph;
2b is the window size, i.e., at most 2b candidate keywords co-occur in one window;
|V| is the number of candidate keywords;
α_ji is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with α_ji = α_ij;
The attention value α_ij is calculated as:
(formula not reproduced in this text)
Wherein,
k_{w,i} is the value of the element in row w, column i of the transfer matrix R_{|V|×2b};
exp(k_{w,i}) is the exponential function with base e, where e is approximately 2.718282;
S_i is the bias term, obtained automatically once the window is fixed;
The initial weight value ω_ji = Sim(e_i, f_j) is calculated as the cosine of the angle between the two word vectors:
$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
5. A keyword extraction device, characterized in that the device comprises:
A candidate keyword acquisition module, configured to obtain web page text information and preprocess the web page text information to obtain a sequence of candidate keywords;
An initial weight value acquisition module, configured to construct a candidate keyword graph from the sequence of candidate keywords, compute from the candidate keyword graph the similarity value between each candidate keyword in the sequence and the other candidate keywords, and use the similarity value as the initial weight value of each candidate keyword;
A target keyword acquisition module, configured to compute the convergence weight value of each candidate keyword from the initial weight values of the candidate keywords, sort the convergence weight values of the candidate keywords by magnitude, and extract the target keywords of the web page text information from the candidate keywords according to that ordering.
6. The device according to claim 5, characterized in that the candidate keyword acquisition module is specifically configured to:
Split the web page text information into complete sentences, perform word segmentation and part-of-speech tagging on each sentence, filter by stop words and part of speech, and retain the candidate keywords.
7. The device according to claim 5, characterized in that the initial weight value acquisition module is specifically configured to:
Represent each candidate keyword as a K-dimensional word vector using the continuous bag-of-words (CBOW) model of the word2VEC word-vector algorithm, and compute from these vectors the similarity value between each candidate keyword and the other candidate keywords in the sequence, namely the cosine of the angle between their word vectors, to obtain the initial weight value of each candidate keyword;
Wherein, the value of k is an element of the transfer matrix R in the candidate keyword graph.
8. The device according to claim 5, characterized in that the target keyword acquisition module is specifically configured to:
Obtain the convergence weight value of each candidate keyword by iterating the following formula according to an attention mechanism;
The convergence weight value is calculated as:
(formula not reproduced in this text)
Wherein,
V_i is the i-th candidate keyword;
V_j is the j-th candidate keyword;
WS(V_i) is the convergence weight value of the i-th candidate keyword;
d is the damping coefficient, with a value range of 0 to 1, representing the probability that a given candidate keyword in the candidate keyword sequence points to other candidate keywords; it is generally set to 0.85;
In(V_i) is the set of candidate keywords that point to the i-th candidate keyword;
Out(V_i) is the set of candidate keywords that the i-th candidate keyword points to;
ω_ji is the similarity value Sim(e_i, f_j) between the i-th and j-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the i-th and j-th candidate keywords;
e_i is the word vector of the i-th candidate keyword;
f_j is the word vector of the j-th candidate keyword;
ω_jk is the similarity value Sim(e_k, f_j) between the j-th and k-th candidate keywords in the sequence of candidate keywords, used as the initial weight value between the j-th and k-th candidate keywords;
k_{w,i} is an element of the transfer matrix R_{|V|×2b} of the candidate keyword graph;
2b is the window size, i.e., at most 2b candidate keywords co-occur in one window;
|V| is the number of candidate keywords;
α_ji is the attention value between the i-th and j-th candidate keywords in the sequence of candidate keywords, with α_ji = α_ij;
The attention value α_ij is calculated as:
(formula not reproduced in this text)
Wherein,
k_{w,i} is the value of the element in row w, column i of the transfer matrix R_{|V|×2b};
exp(k_{w,i}) is the exponential function with base e, where e is approximately 2.718282;
S_i is the bias term, obtained automatically once the window is fixed;
The initial weight value ω_ji = Sim(e_i, f_j) is calculated as the cosine of the angle between the two word vectors:
$$\mathrm{Sim}(e_i, f_j) = \cos\theta = \frac{e_i \cdot f_j}{\lVert e_i \rVert \, \lVert f_j \rVert}$$
9. An electronic device, characterized by comprising:
At least one processor; and
At least one memory communicatively connected to the processor, wherein:
The memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 4.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores a computer program that causes a computer to execute the method according to any one of claims 1 to 4.
CN201710604469.XA 2017-07-11 2017-07-11 Keyword extraction method and device Active CN109255118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710604469.XA CN109255118B (en) 2017-07-11 2017-07-11 Keyword extraction method and device

Publications (2)

Publication Number    Publication Date
CN109255118A (en)     2019-01-22
CN109255118B (en)     2023-08-08

Family

ID=65051893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710604469.XA Active CN109255118B (en) 2017-07-11 2017-07-11 Keyword extraction method and device

Country Status (1)

Country Link
CN (1) CN109255118B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147425A (en) * 2019-05-22 2019-08-20 华泰期货有限公司 A kind of keyword extracting method, device, computer equipment and storage medium
CN110263122A (en) * 2019-05-08 2019-09-20 北京奇艺世纪科技有限公司 A kind of keyword acquisition methods, device and computer readable storage medium
CN110377725A (en) * 2019-07-12 2019-10-25 深圳新度博望科技有限公司 Data creation method, device, computer equipment and storage medium
CN110489758A (en) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 The values calculation method and device of application program
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused textrank keyword extraction algorithm
CN110795553A (en) * 2019-09-09 2020-02-14 腾讯科技(深圳)有限公司 Abstract generation method and device
CN110837601A (en) * 2019-10-25 2020-02-25 杭州叙简科技股份有限公司 Automatic classification and prediction method for alarm condition
CN111291165A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for embedding training word vector into model
CN111368038A (en) * 2020-03-09 2020-07-03 广州市百果园信息技术有限公司 Keyword extraction method and device, computer equipment and storage medium
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111553156A (en) * 2020-05-25 2020-08-18 支付宝(杭州)信息技术有限公司 Keyword extraction method, device and equipment
CN111680128A (en) * 2020-06-16 2020-09-18 杭州安恒信息技术股份有限公司 Method and system for detecting web page sensitive words and related devices
CN111859940A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN112347150A (en) * 2020-11-23 2021-02-09 北京智源人工智能研究院 Method and device for labeling academic label of student and electronic equipment
CN112434158A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Enterprise label acquisition method and device, storage medium and computer equipment
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113111897A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Alarm receiving and warning condition type determining method and device based on support vector machine
CN113114986A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Early warning method based on picture and sound synchronization and related equipment
CN113139956A (en) * 2021-05-12 2021-07-20 深圳大学 Generation method and identification method of section identification model based on language knowledge guidance
CN113836307A (en) * 2021-10-15 2021-12-24 国网北京市电力公司 Power supply service work order hotspot discovery method, system and device and storage medium
CN114239553A (en) * 2021-12-23 2022-03-25 佳源科技股份有限公司 Log auditing method, device, equipment and medium based on artificial intelligence
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN114912446A (en) * 2022-04-29 2022-08-16 中证信用增进股份有限公司 Keyword extraction method and device and storage medium
CN116936135A (en) * 2023-09-19 2023-10-24 北京珺安惠尔健康科技有限公司 Medical big health data acquisition and analysis method based on NLP technology
CN116975246A (en) * 2023-08-03 2023-10-31 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050278325A1 (en) * 2004-06-14 2005-12-15 Rada Mihalcea Graph-based ranking algorithms for text processing
CN102314448A (en) * 2010-07-06 2012-01-11 株式会社理光 Equipment for acquiring one or more key elements from document and method
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN106202042A (en) * 2016-07-06 2016-12-07 中央民族大学 A kind of keyword abstraction method based on figure

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020155769A1 (en) * 2019-01-30 2020-08-06 平安科技(深圳)有限公司 Method and device for establishing keyword generation model
CN111859940B (en) * 2019-04-23 2024-05-14 北京嘀嘀无限科技发展有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111859940A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN110263122A (en) * 2019-05-08 2019-09-20 北京奇艺世纪科技有限公司 Keyword acquisition method, device and computer readable storage medium
CN110263122B (en) * 2019-05-08 2022-05-17 北京奇艺世纪科技有限公司 Keyword acquisition method and device and computer readable storage medium
CN110147425A (en) * 2019-05-22 2019-08-20 华泰期货有限公司 Keyword extraction method, device, computer equipment and storage medium
CN110377725A (en) * 2019-07-12 2019-10-25 深圳新度博望科技有限公司 Data creation method, device, computer equipment and storage medium
CN110377725B (en) * 2019-07-12 2021-09-24 深圳新度博望科技有限公司 Data generation method and device, computer equipment and storage medium
CN110598209A (en) * 2019-08-21 2019-12-20 合肥工业大学 Method, system and storage medium for extracting keywords
CN110795553A (en) * 2019-09-09 2020-02-14 腾讯科技(深圳)有限公司 Abstract generation method and device
CN110795553B (en) * 2019-09-09 2024-04-23 腾讯科技(深圳)有限公司 Digest generation method and device
CN110489758B (en) * 2019-09-10 2023-04-18 深圳市和讯华谷信息技术有限公司 Value view calculation method and device for application program
CN110489758A (en) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 Value view calculation method and device for application program
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused TextRank keyword extraction algorithm
CN110837601A (en) * 2019-10-25 2020-02-25 杭州叙简科技股份有限公司 Automatic classification and prediction method for alarm condition
CN113111897A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Alarm receiving and warning condition type determining method and device based on support vector machine
CN111368038B (en) * 2020-03-09 2023-04-11 广州市百果园信息技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN111368038A (en) * 2020-03-09 2020-07-03 广州市百果园信息技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN111291165B (en) * 2020-05-09 2020-08-14 支付宝(杭州)信息技术有限公司 Method and device for embedding training word vector into model
CN111291165A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for embedding training word vector into model
CN111553156B (en) * 2020-05-25 2023-08-04 支付宝(杭州)信息技术有限公司 Keyword extraction method, device and equipment
CN111553156A (en) * 2020-05-25 2020-08-18 支付宝(杭州)信息技术有限公司 Keyword extraction method, device and equipment
CN111680128A (en) * 2020-06-16 2020-09-18 杭州安恒信息技术股份有限公司 Method and system for detecting web page sensitive words and related devices
CN112434158B (en) * 2020-11-13 2024-05-28 海创汇科技创业发展股份有限公司 Enterprise tag acquisition method, enterprise tag acquisition device, storage medium and computer equipment
CN112434158A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Enterprise label acquisition method and device, storage medium and computer equipment
CN112347150B (en) * 2020-11-23 2021-08-31 北京智谱华章科技有限公司 Method and device for labeling academic label of student and electronic equipment
CN112347150A (en) * 2020-11-23 2021-02-09 北京智源人工智能研究院 Method and device for labeling academic label of student and electronic equipment
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113114986B (en) * 2021-03-30 2023-04-28 深圳市冠标科技发展有限公司 Early warning method based on picture and sound synchronization and related equipment
CN113114986A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Early warning method based on picture and sound synchronization and related equipment
CN113139956A (en) * 2021-05-12 2021-07-20 深圳大学 Generation method and identification method of section identification model based on language knowledge guidance
CN113836307B (en) * 2021-10-15 2024-02-20 国网北京市电力公司 Power supply service work order hot spot discovery method, system, device and storage medium
CN113836307A (en) * 2021-10-15 2021-12-24 国网北京市电力公司 Power supply service work order hotspot discovery method, system and device and storage medium
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN114328826B (en) * 2021-12-20 2024-06-11 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN114239553A (en) * 2021-12-23 2022-03-25 佳源科技股份有限公司 Log auditing method, device, equipment and medium based on artificial intelligence
CN114912446A (en) * 2022-04-29 2022-08-16 中证信用增进股份有限公司 Keyword extraction method and device and storage medium
CN116975246A (en) * 2023-08-03 2023-10-31 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal
CN116975246B (en) * 2023-08-03 2024-04-26 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal
CN116936135A (en) * 2023-09-19 2023-10-24 北京珺安惠尔健康科技有限公司 Medical big health data acquisition and analysis method based on NLP technology
CN116936135B (en) * 2023-09-19 2023-11-24 北京珺安惠尔健康科技有限公司 Medical big health data acquisition and analysis method based on NLP technology

Also Published As

Publication number Publication date
CN109255118B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109255118A (en) A kind of keyword extracting method and device
Khan et al. A survey on the state-of-the-art machine learning models in the context of NLP
Alwehaibi et al. Comparison of pre-trained word vectors for Arabic text classification using deep learning approach
Stojanovski et al. Twitter sentiment analysis using deep convolutional neural network
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN109325231A (en) Method for generating word vectors with a multi-task model
Santur Sentiment analysis based on gated recurrent unit
CN108874896B (en) Humor identification method based on neural network and humor characteristics
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
CN106126619A (en) Video retrieval method and system based on video content
CN109214006A (en) Natural language inference method based on image-enhanced hierarchical semantic representation
CN113515632B (en) Text classification method based on graph path knowledge extraction
Gridach et al. Empirical evaluation of word representations on Arabic sentiment analysis
CN107908698A (en) Topic-focused web crawler method, electronic equipment, storage medium and system
CN109684449A (en) Natural language semantic representation method based on attention mechanism
CN110008463A (en) Method, apparatus and computer-readable medium for event extraction
CN108875065A (en) Content-based Indonesian news web page recommendation method
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN109408632A (en) Information security recognition method
CN110298041A (en) Spam text filtering method, device, electronic equipment and storage medium
Al Omari et al. Hybrid CNNs-LSTM deep analyzer for Arabic opinion mining
Anandika et al. A study on machine learning approaches for named entity recognition
Fu et al. Improving distributed word representation and topic model by word-topic mixture model
Kang et al. A short texts matching method using shallow features and deep features
Kaur et al. Automatic Punjabi poetry classification using machine learning algorithms with reduced feature set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant