CN105843795A - Topic model based document keyword extraction method and system - Google Patents

Info

Publication number
CN105843795A
CN105843795A (application CN201610162410.5A); granted publication CN105843795B
Authority
CN
China
Prior art keywords: word, document, theme, pagerank, value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610162410.5A
Other languages
Chinese (zh)
Other versions
CN105843795B (en)
Inventor
蔡毅
杨楷
闵华清
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201610162410.5A
Publication of CN105843795A
Application granted
Publication of CN105843795B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic model based document keyword extraction method and system. The extraction method comprises the following steps: document information preprocessing, document structure graph construction, document topic distribution extraction, word weight extraction, and keyword generation. The extraction system comprises the corresponding modules: a document information preprocessing module, a document structure graph construction module, a document topic distribution extraction module, a word weight extraction module, and a keyword generation module. With the method and system, the extracted keywords are more reasonable and more closely related to the topic of the document; some current deficiencies in the keyword extraction field are overcome, the document is summarized better, and users can quickly and conveniently grasp the gist of the document.

Description

Topic model based document keyword extraction method and system
Technical field
The present invention relates to data mining technology, and in particular to a topic model based document keyword extraction method and system.
Background technology
Keywords summarize the main content of a document and are an important means of quickly understanding its subject. Keywords can be seen everywhere: on news websites every article carries labels, and when browsing technical papers we can see the keywords they discuss. Keywords reduce the difficulty of finding information in a sea of content and are now used in many fields. In information retrieval, keyword applications are widespread: search engine companies such as Baidu and Google retrieve results based on the keywords of web page text, and the results retrieved by document keywords are often exactly what users want. In social networks, many features and related studies are built on the tags annotated by users. User tags make it convenient for users to manage, collect, and retrieve the tagged objects, and can also be used to deliver personalized recommendations to users. By offering tagging functions for objects such as pictures, articles, and videos, and exploiting collective intelligence, we can obtain large numbers of annotated documents that provide data support for research work.
Keywords are widely used in many fields, and there are three general ways of producing them: 1) spontaneous generation by users, who annotate the content they are interested in; 2) manual annotation of documents by experts; 3) automatic keyword extraction techniques. Spontaneous user tagging suits only limited scenarios: users annotate only the specific objects they care about, and there is no effective way to motivate them to tag other content. Moreover, with information technology developing rapidly and the amount of Internet content growing explosively, new content is produced all the time; manual annotation by experts is expensive, the annotated documents can only be used for search, and commercial use is difficult. The demand for automatic document keyword extraction techniques is therefore urgent, and research on the related problems is a current focus.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction method that makes the extracted keywords more reasonable and more representative.
Another object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction system that addresses some weaknesses of current keyword extraction, so that the keywords summarize the document better and help users understand the document quickly.
The primary object of the present invention is achieved through the following technical solution: a topic model based document keyword extraction method, comprising:
S1, document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
S2, document structure graph construction: the document structure graph describes the position of each word in the document. Each node of the graph represents one word, and an edge linking two nodes indicates that the words represented by the two nodes occur close to each other in the document. The present invention proposes a document structure graph construction method.
S3, document topic distribution extraction: each document has the topics it emphasizes. The method extracts, through topic model techniques, the topic distribution of the document and of each word in it. The method also proposes a background-word based topic model that improves the effect of the topic model. The closer the topics of two documents are, the closer the things they describe; and for each topic, a set of words related to that topic can be extracted from the document collection.
S4, word weight extraction: the weight of each word represents its importance in the document. The more important a word is in the document, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
S5, keyword generation: based on the steps above, the method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and, combining the topic model and the word weights, computes a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The document information preprocessing comprises the following steps:
S1a, for Chinese text, a word segmentation tool is used to split the text into words with parts of speech; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes;
S1b, a part-of-speech tagging tool is used to annotate the part of speech of each word segmented in step S1a;
S1c, using the part-of-speech tagging results produced in step S1b, function words and stop words are deleted from the document.
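As an illustration of the preprocessing steps above for English text, the following sketch assumes regex tokenization on word characters, a tiny stand-in stop-word list, and a naive suffix-stripping stemmer in place of the segmentation, tagging, and stemming tools named above; the function names and the stop-word set are illustrative, not part of the invention.

```python
import re

# Minimal illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word: str) -> str:
    # Crude stand-in for a stemming tool: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [naive_stem(t) for t in tokens]               # reduce to prototypes

print(preprocess("The dogs are barking in the garden."))
# → ['dog', 'bark', 'garden']
```

The output is the semi-structured word list that the later steps operate on.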
The document structure graph construction comprises the following steps:
S2a, select a sliding window length, denoted W;
S2b, construct a sliding window of length W; for each word that appears in the window, build a graph node, and if two words appear in the sliding window at the same time, add an edge between the two nodes that represent them;
S2c, move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph as the window moves.
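The steps S2a to S2c can be sketched as follows, assuming an undirected graph whose edge weights count how many windows two words co-occur in; the symmetric treatment and the weight convention are assumptions for illustration, not specified above.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(words: list[str], w: int = 3) -> dict[tuple[str, str], int]:
    """Slide a window of length w over the word list and count co-occurrences."""
    edges: dict[tuple[str, str], int] = defaultdict(int)
    for start in range(max(1, len(words) - w + 1)):  # move toward the tail
        window = words[start : start + w]
        # Every pair of distinct words in the same window gets an edge.
        for a, b in combinations(sorted(set(window)), 2):
            edges[(a, b)] += 1
    return dict(edges)

g = build_graph(["topic", "model", "keyword", "topic", "graph"], w=3)
print(g[("model", "topic")])  # → 2
```

"model" and "topic" share two windows, so their edge carries weight 2.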
The document topic distribution extraction comprises the following steps:
S3a, document background word annotation: background words are words that carry no specific semantic meaning in a document, and they often confuse the topic model. Because the number of such words is limited, we annotate them semi-automatically: we compute the TF-IDF value of each word, choose a threshold, and treat the words below the threshold as carrying little information; these words are then browsed manually, and the background words are selected among them.
The method uses TF-IDF (Term Frequency-Inverse Document Frequency) to compute the weight values. TF-IDF measures how important a word is to a document collection. Its main idea is: if a word occurs frequently in one document (high TF) but rarely in the other documents (low IDF), the word has strong discriminating power. TF-IDF is computed as follows:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the formulas above, i denotes the i-th word: n_i is the number of times word t_i occurs in the document; TF_i is the term frequency of t_i in the document; Σ_k n_k is the total number of word occurrences in the document; IDF_i is the inverse document frequency of t_i; D is the number of documents in the system, and D_w is the number of documents in which t_i occurs;
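A minimal sketch of the TF-IDF computation defined by these formulas, under the assumption that a document is a list of word tokens and the collection is a list of such documents:

```python
import math

def tf_idf(doc: list[str], docs: list[list[str]]) -> dict[str, float]:
    """TF_i = n_i / sum_k n_k within one document; IDF_i = log(D / D_w)."""
    total = len(doc)
    scores = {}
    for word in set(doc):
        tf = doc.count(word) / total
        d_w = sum(1 for d in docs if word in d)  # documents containing the word
        idf = math.log(len(docs) / d_w)
        scores[word] = tf * idf
    return scores

docs = [["cat", "sits", "cat"], ["dog", "runs"], ["cat", "dog"]]
s = tf_idf(docs[0], docs)
print(round(s["sits"], 3))  # → 0.366 ("sits" occurs in only one document)
```

Words like "sits" that concentrate in one document score high; words below a chosen threshold would be reviewed manually as background-word candidates.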
S3b, a variant of LDA (Latent Dirichlet Allocation) called bLDA (background-based Latent Dirichlet Allocation) is used to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all the latent topics in certain proportions, and each latent topic consists of a set of related feature words. In bLDA, the first topic is designated the background topic, and all words unrelated to any topic are gathered into it. Because Gibbs sampling can extract topics effectively from large-scale document sets, the method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e. the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z.
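A full Gibbs sampler for bLDA is beyond a short sketch, but once a sampler has produced topic-word assignment counts, the probability p(z | w) above can be recovered by normalizing each word's counts over the topics. The count layout below is an assumption for illustration, with topic 0 playing the background role:

```python
def topic_given_word(counts: list[dict[str, int]]) -> dict[str, list[float]]:
    """counts[z][w]: how often word w was assigned to topic z by the sampler.
    Returns, for each word, the distribution p(z | w) over topics."""
    words = {w for topic in counts for w in topic}
    dist = {}
    for w in words:
        col = [topic.get(w, 0) for topic in counts]
        total = sum(col)
        dist[w] = [c / total for c in col]  # normalize over topics
    return dist

# Hypothetical sampler output: topic 0 is the background topic.
counts = [{"the": 9, "model": 1}, {"model": 4, "topic": 5}]
p = topic_given_word(counts)
print(p["model"])  # → [0.2, 0.8]
```

Here "model" belongs mostly to topic 1, while "the" would fall almost entirely into the background topic.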
The word weight extraction comprises the following steps:
S4a, step S3b yields the topic distribution of each word in the document, and we compute the information entropy of that distribution. The larger a word's entropy, the more uniform its topic distribution, i.e. the higher its topic ambiguity; the smaller the entropy, the more concentrated the distribution and the lower the ambiguity. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
S4b, the weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
When the latent topic distribution of w_i is nearly uniform (i.e. the word is not representative), H(w_i) is large, so δ^{H(w_i)} becomes small (δ being a factor in (0, 1)), i.e. the weight of the word becomes small.
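Steps S4a and S4b can be sketched as follows, assuming δ lies in (0, 1) so that a more uniform (less representative) topic distribution yields a smaller weight; the value δ = 0.5 is illustrative only.

```python
import math

def entropy(p: list[float]) -> float:
    """Information entropy of a topic distribution (0 log 0 treated as 0)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def weight(p: list[float], delta: float = 0.5) -> float:
    """weight(w_i) = delta ** H(w_i); delta in (0, 1) penalizes uniform distributions."""
    return delta ** entropy(p)

focused = [0.97, 0.01, 0.01, 0.01]  # concentrated on one topic
uniform = [0.25, 0.25, 0.25, 0.25]  # spread evenly over all topics
print(weight(focused) > weight(uniform))  # → True
```

The concentrated distribution has low entropy and therefore keeps a weight close to 1, while the uniform one is discounted.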
The keyword generation comprises the following steps:
S5a, first randomly initialize the PageRank value of each word.
S5b, for every document and every topic, the method uses PageRank to score the importance of each word. PageRank, an algorithm first proposed and used by Google, builds a web page graph from the in-links and out-links of pages and computes the influence of each page from the resulting topology. It is generally used in the field of web page evaluation to judge how important a page is, and it rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking into a page are, the more important that page is. The method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph.
For each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it.
Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
S5c, the method is iterative: compute the difference between the weight value of each word in the current iteration and in the previous one. If the difference is less than 0.001, terminate and go to the next step; if the number of iterations reaches the termination threshold of 300, also terminate and go to the next step; otherwise return to S5b for another round of iteration. As the steps are repeated, the PageRank value of each page tends toward a stable value, because the algorithm eventually converges, and the convergence value is the PageRank value.
S5d, for each word, the method merges its K PageRank values under the K topics into a single PageRank value. The concrete processing is as follows: with p(z | d) the probability that document d belongs to topic z, the K PageRank values of each word are merged, adding the influence of the entropy when computing the weight of each word, using the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1..K} TR_z(w_i) × p(z | d)
δ is a smoothing factor that controls how strongly the entropy value affects the final PageRank value.
S5e, according to the final PageRank value of each word in the document, the method sorts the words by value and designates the top N words as the keywords of the document.
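The merging and ranking of S5d and S5e can be sketched as follows, again assuming δ in (0, 1); tr_per_topic[z] is the per-topic PageRank table from the previous step, and all names and numbers are illustrative.

```python
def final_scores(tr_per_topic, p_z_d, entropies, delta=0.5):
    """TR(w_i) = delta ** H(w_i) * sum_z TR_z(w_i) * p(z | d)."""
    words = tr_per_topic[0].keys()
    return {w: delta ** entropies[w]
               * sum(tr_per_topic[z][w] * p_z_d[z] for z in range(len(p_z_d)))
            for w in words}

def top_n(scores, n=2):
    # S5e: sort by final PageRank value and keep the N largest.
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]

tr = [{"cat": 0.6, "dog": 0.2, "the": 0.2},   # topic 0
      {"cat": 0.1, "dog": 0.7, "the": 0.2}]   # topic 1
scores = final_scores(tr, p_z_d=[0.5, 0.5],
                      entropies={"cat": 0.3, "dog": 0.3, "the": 1.0})
print(top_n(scores))  # → ['dog', 'cat']
```

The high-entropy word "the" is discounted by the δ^H term, so the topic-focused words win the top-N slots.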
Another object of the present invention is achieved through the following technical solution: a topic model based document keyword extraction system, comprising:
a document information preprocessing module: performs word segmentation and part-of-speech tagging of the input document, removes function words and stop words, extracts stems, and establishes semi-structured data;
a document structure graph construction module: builds the document structure graph;
a document topic distribution extraction module: uses the topic model to extract the topic distribution of the document;
a word weight extraction module: assigns a weight to each word in the document;
a keyword generation module: generates the document keywords.
The keyword generation module is specifically configured as follows:
First, PageRank, an algorithm originally proposed and used by Google, builds a web page graph from the in-links and out-links of pages and computes the influence of each page from the resulting topology. It is generally used in the field of web page evaluation to judge how important a page is, and it rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking into a page are, the more important that page is. The method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph. PageRank is widely used in the keyword extraction field.
In the present invention, for each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it. Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
The present invention first randomly initializes the PageRank value of each word, then computes the PageRank value TR_t(w_i) of each word w_i for topic t, iterating until the number of iterations exceeds 300 or the difference between two successive iterations is less than 0.001.
Then, for each word, the present invention merges its K PageRank values under the K topics into a single PageRank value, using the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1..K} TR_z(w_i) × p(z | d)
p(z | d) is the probability that document d belongs to topic z; when the K PageRank values of each word are merged, the influence of the entropy is added to the weight of each word. δ is a smoothing factor that controls how strongly the entropy value affects the final PageRank value.
According to the final PageRank value of each word in the document, the module sorts the words by value, designates the top N words as the keywords of the document, and then presents the keyword extraction results to the user.
The document topic distribution extraction module is specifically configured as follows:
First, LDA (Latent Dirichlet Allocation) is a latent topic model with many variants; the present invention proposes another variant, bLDA (background-based Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all the latent topics in certain proportions, and each latent topic consists of a set of related feature words. In bLDA, the first topic is designated the background topic, and all words unrelated to any topic are gathered into it. Because Gibbs sampling can extract topics effectively from large-scale document sets, the method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e. the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z. The document topic distribution extraction module generates the topic distribution of each word from the document.
The word weight extraction module is specifically configured as follows:
The present invention computes the information entropy of each word according to its topic distribution. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
Then, the weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
The larger weight(w_i) is, the fewer topics the word is spread over and the more important it is; conversely, the smaller weight(w_i) is, the higher the word's topic ambiguity, the weaker its representativeness, and the lower its probability of becoming a keyword of the document.
Brief description of the drawings
Fig. 1 is the overall flowchart of the topic model based document keyword extraction method disclosed in the present invention.
Detailed description of the invention
The present invention is described in further detail below with reference to the embodiment and the accompanying drawing, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the overall flowchart of the topic model based document keyword extraction method, the method comprises the following steps:
Document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
Document structure graph construction: the document structure graph describes the position of each word in the document. Each node of the graph represents one word, and an edge linking two nodes indicates that the words represented by the two nodes occur close to each other in the document. The present invention proposes a document structure graph construction method.
Document topic distribution extraction: each document has the topics it emphasizes. The method extracts, through topic model techniques, the topic distribution of the document and of each word in it. The method also proposes a background-word based topic model that improves the effect of the topic model. The closer the topics of two documents are, the closer the things they describe; and for each topic, a set of words related to that topic can be extracted from the document collection.
Word weight extraction: the weight of each word represents its importance in the document. The more important a word is in the document, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
Keyword generation: based on the steps above, the method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and, combining the topic model and the word weights, computes a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The topic model based document keyword extraction method provided by the present invention is elaborated below:
The document information preprocessing module: for Chinese text, a word segmentation tool is used to split the text into words with parts of speech; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes. A part-of-speech tagging tool is then used to obtain the part of speech of each word. Finally, function words and stop words are deleted from the document, leaving only nouns, adjectives, and verbs, which reduces noise when the topic words are built. Stem extraction removes the influence of morphology on the words in a sentence; for example, 'dog' and 'dogs' should be treated as the same word.
The document structure graph construction module: to convert the text-processing problem into a graph problem, the document must be converted into a graph. To preserve the information of the original document as far as possible, the present invention builds a sliding window of length W; for each word that appears in the window, a graph node is built, and if two words appear in the sliding window at the same time, an edge is added between the two nodes that represent them. The window moves from the head of the document toward its tail, continually adding nodes and edges to the graph as it moves.
The document topic distribution extraction module: first, the document background words are annotated. The present invention annotates them semi-automatically: we compute the TF-IDF value of each word, choose a threshold, and treat the words below the threshold as carrying little information; these words are then browsed manually, and the background words are selected among them. Then, the present invention uses bLDA (background-based Latent Dirichlet Allocation), a variant of LDA (Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. On the basis of LDA, the present invention adds a background topic, into which all words unrelated to any topic are gathered. Gibbs sampling is then used to extract the topics from the document set.
The word weight extraction module: the present invention computes the information entropy of each word according to its topic distribution. The larger a word's entropy, the more uniform its topic distribution, i.e. the higher its topic ambiguity; the smaller the entropy, the more concentrated the distribution and the lower the ambiguity. The entropy of the topic distribution of word w_i in the document is computed as follows:
H(w_i) = −Σ_{t=1..K} p(k = t | w_i) log p(k = t | w_i)
The weight of each word is computed from the entropy obtained above, as follows:
weight(w_i) = δ^{H(w_i)}
When the latent topic distribution of w_i is nearly uniform (i.e. the word is not representative), H(w_i) is large, so δ^{H(w_i)} becomes small (δ being a factor in (0, 1)), i.e. the weight of the word becomes small.
The keyword generation module: the present invention first randomly initializes the PageRank value of each word. Then, for every document and every topic, the method uses PageRank to score the importance of each word. Finally, for each word, the scores under its different topics are weighted and accumulated to obtain the word's final score. For each topic in each document, the method computes the PageRank value of every word, using the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} [ e(w_j, w_i) / O(w_j) ] · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the formula above, λ is the damping factor, with a value in [0, 1]. A node does not necessarily have out-links; a node whose out-degree is 0 is isolated from the other pages, so to make such a page reachable, PageRank is modified by the damping factor: (1 − λ) is the probability that a node jumps to an arbitrary other node. O(w_j) is the total out-degree of node w_j, and e(w_j, w_i) is the number of edges from w_j to w_i. The main idea of the formula is that the PageRank value of each page equals the sum of the values distributed to it by the pages whose links enter it.
Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z given w, which reflects to what degree w belongs to z; this probability is produced by step S3b. For each document, we can draw K word graphs with different weights according to the K topics, and by applying the formula above compute K PageRank values for each word in the document.
We compute the PageRank value of each word iteratively until a termination condition is met. The termination conditions of the present invention are: 1) the difference between the weight value of each word in the current iteration and in the previous one is less than 0.001; 2) the number of iterations reaches the termination threshold of 300. Over the iterations, the PageRank value of each page tends toward a stable value, because the algorithm eventually converges, and the convergence value is the PageRank value.
For each word, the method merges its K PageRank values under the K different topics into a single PageRank value. The specific procedure is as follows: let p(z | d) be the probability that document d belongs to topic z; for the K PageRank values of each word, the influence of information entropy is added when computing the word's weight, and the values are merged with the following formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value.
According to the final PageRank value of each word in the document, the method sorts the words by value and designates the top N words with the largest values as the keywords of the document.
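The merge-and-rank step can be sketched as follows. The per-topic scores, the document-topic probabilities, the entropies, and δ below are illustrative values chosen for the example, not outputs of the method.

```python
def merge_and_rank(tr_per_topic, p_z_given_d, H, delta=0.5, n=2):
    """tr_per_topic: K dicts word -> TR_z(wi); p_z_given_d: K values p(z|d);
    H: word -> entropy H(wi). Returns the top-n words as keywords."""
    # TR(wi) = delta ** H(wi) * sum_z TR_z(wi) * p(z|d)
    final = {w: (delta ** H[w]) * sum(tr[w] * p
                                      for tr, p in zip(tr_per_topic, p_z_given_d))
             for w in tr_per_topic[0]}
    return sorted(final, key=final.get, reverse=True)[:n]

keywords = merge_and_rank(
    tr_per_topic=[{"a": 0.6, "b": 0.3, "c": 0.1}, {"a": 0.1, "b": 0.5, "c": 0.4}],
    p_z_given_d=[0.7, 0.3],
    H={"a": 0.2, "b": 1.0, "c": 0.9},
)
```

Note how the entropy term rewards "a", which is concentrated in the dominant topic, over "b", which is spread across topics.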
The present embodiment also discloses a topic model based document keyword extraction system, which includes the following modules:
a document information preprocessing module, for performing word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
a document structure graph construction module, for building the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
a document topic distribution extraction module, for extracting, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
a word weight extraction module, for extracting the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
a keyword generation module, for converting the keyword extraction problem into the problem of extracting key nodes in a graph algorithm: the PageRank algorithm is applied according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and the highest-scoring words are taken as the keywords of the document.
(1) The working principle of the keyword generation module is as follows:
First, randomly initialize the PageRank value of each word;
For each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
Iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise perform the next round of iteration. By repeating the above steps, the convergence value of each page's PageRank value is taken as its final PageRank value;
For each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
Sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
(2) The working principle of the word weight extraction module is as follows:
Compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
Compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
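The entropy-based weight can be sketched as follows; the topic distributions and the value of δ are illustrative.

```python
import math

def topic_entropy(p_topics):
    """H(wi) = -sum_t p(k=t|wi) * log p(k=t|wi), skipping zero-probability topics."""
    return -sum(p * math.log(p) for p in p_topics if p > 0)

def word_weight(p_topics, delta=0.5):
    """weight(wi) = delta ** H(wi); delta is a smoothing factor in (0, 1)."""
    return delta ** topic_entropy(p_topics)

# A word concentrated on a single topic has zero entropy and weight 1;
# a word spread evenly over topics has high entropy and is down-weighted.
focused = word_weight([1.0, 0.0, 0.0])
diffuse = word_weight([1 / 3, 1 / 3, 1 / 3])
```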
(3) The working principle of the document topic distribution extraction module is as follows:
Mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection;
The TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs.
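The background-word marking step can be sketched as follows; the toy corpus and the threshold value are illustrative, and the final selection would still be confirmed manually as the text describes.

```python
import math
from collections import Counter

def tfidf_per_doc(docs):
    """docs: list of token lists; returns one dict word -> TF-IDF per document."""
    D = len(docs)
    df = Counter(w for doc in docs for w in set(doc))        # D_w per word
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())                         # sum_k n_k
        result.append({w: (c / total) * math.log(D / df[w])  # TF_i * IDF_i
                       for w, c in counts.items()})
    return result

docs = [["topic", "model", "keyword"],
        ["topic", "graph", "keyword"],
        ["topic", "rank"]]
scores = tfidf_per_doc(docs)
# "topic" appears in every document, so IDF = log(3/3) = 0 and its score falls
# below any positive threshold, marking it as a background-word candidate.
background = {w for w, s in scores[0].items() if s < 0.05}
```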
Obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w.
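The full bLDA Gibbs sampler is beyond a short sketch, but once sampling has produced topic-assignment counts, p(z|w) can be read off by normalizing each word's counts over topics. The count matrix below is a made-up example (not sampler output), with topic 0 playing the role of the background topic.

```python
def p_topic_given_word(counts):
    """counts[z][w]: number of times word w was assigned to topic z.
    Returns word -> [p(z=0|w), ..., p(z=K-1|w)] by normalizing over topics."""
    K = len(counts)
    return {w: [counts[z][w] / sum(counts[z2][w] for z2 in range(K))
                for z in range(K)]
            for w in counts[0]}

# The function word "the" is almost always assigned to the background topic,
# while "model" belongs mostly to a content topic.
counts = [{"the": 90, "model": 5}, {"the": 10, "model": 95}]
p = p_topic_given_word(counts)
```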
(4) The working principle of the document structure graph construction module is as follows:
Select a sliding window length, denoted W;
Construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
Move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement.
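The sliding-window construction above can be sketched as follows; the token sequence is illustrative, and counting repeated co-occurrences as edge weights is one common choice for the edge weight e(w_i, w_j).

```python
from collections import Counter
from itertools import combinations

def build_word_graph(tokens, W=3):
    """Slide a window of length W over the tokens; words co-occurring inside
    a window are linked, and edge weights count the co-occurrences."""
    nodes = set(tokens)
    edges = Counter()
    for start in range(len(tokens) - W + 1):
        window = set(tokens[start:start + W])
        for wi, wj in combinations(window, 2):
            edges[tuple(sorted((wi, wj)))] += 1   # undirected edge (wi, wj)
    return nodes, edges

nodes, edges = build_word_graph(
    ["topic", "model", "keyword", "topic", "graph"], W=3)
```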
(5) The working principle of the document information preprocessing module is as follows:
For Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
A part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
According to the tagging results, function words and stop words are deleted from the document.
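The preprocessing stage for English text can be sketched as follows. The stop-word list and the suffix-stripping rules are illustrative stand-ins for the segmentation, tagging, and stemming tools named in the text.

```python
STOPWORDS = {"the", "a", "of", "is", "and", "in"}   # illustrative stop-word list

def crude_stem(word):
    """A toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Whitespace tokenization, stop-word removal, then stemming."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPWORDS]

tokens = preprocess("the model is extracting keywords in documents")
```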
In each method embodiment of the present invention, the sequence numbers of the steps are not intended to limit their order of execution; for those of ordinary skill in the art, changing the order of the steps without creative effort also falls within the protection scope of the present invention.
For the specific working processes of the modules or units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A topic model based document keyword extraction method, characterized in that the method specifically includes the following steps:
S1, document information preprocessing: perform word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
S2, document structure graph construction: build the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
S3, document topic distribution extraction: extract, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
S4, word weight extraction: extract the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
S5, keyword generation: convert the keyword extraction problem into the problem of extracting key nodes in a graph algorithm; apply the PageRank algorithm according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and take the highest-scoring words as the keywords of the document.
2. The topic model based document keyword extraction method according to claim 1, characterized in that step S5, keyword generation, is specifically as follows:
S5a, first randomly initialize the PageRank value of each word;
S5b, for each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
S5c, iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise return to S5b for the next round of iteration; by repeating the above steps, the convergence value of each page's PageRank value is taken as the final PageRank value;
S5d, for each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
S5e, sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
3. The topic model based document keyword extraction method according to claim 1, characterized in that step S4, word weight extraction, is specifically as follows:
S4a, compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
S4b, compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
4. The topic model based document keyword extraction method according to claim 1, characterized in that step S3, document topic distribution extraction, is specifically as follows:
S3a, mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection; the TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
S3b, obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w.
5. The topic model based document keyword extraction method according to claim 1, characterized in that step S2, document structure graph construction, is specifically as follows:
S2a, select a sliding window length, denoted W;
S2b, construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
S2c, move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement.
6. The topic model based document keyword extraction method according to claim 1, characterized in that step S1, document information preprocessing, is specifically as follows:
S1a, for Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
S1b, a part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
S1c, according to the tagging results, function words and stop words are deleted from the document.
7. A topic model based document keyword extraction system, characterized in that the system includes the following modules:
a document information preprocessing module, for performing word and part-of-speech segmentation, function-word/stop-word removal, and stemming on the input document to establish semi-structured data;
a document structure graph construction module, for building the document structure graph, wherein the document structure graph describes the positional information of each word in the document; each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words represented by the two nodes are close to each other in position;
a document topic distribution extraction module, for extracting, with a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
a word weight extraction module, for extracting the weight of each word in the document, wherein the weight of each word represents the word's degree of importance in the document;
a keyword generation module, for converting the keyword extraction problem into the problem of extracting key nodes in a graph algorithm: the PageRank algorithm is applied according to the document structure graph, combined with the topic model and the word weights, to compute a score for each word, and the highest-scoring words are taken as the keywords of the document.
8. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the keyword generation module is as follows:
First, randomly initialize the PageRank value of each word;
For each document and each topic, score the importance of each word using the PageRank method, where the PageRank value of each word is computed with the following formula:
TR_t(w_i) = λ · Σ_{j: w_j → w_i} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In the above formula, λ is the damping factor, with value range [0, 1]; PageRank is corrected by the damping factor, and (1 − λ) is the probability that a node jumps to another node; O(w_j) is the total outgoing link weight of page w_j, and e(w_j, w_i) is the weight of the link from w_j to w_i; W_t(w_i) = p(z = t | w_i), i.e., the probability that the word is assigned to topic t given word w_i;
Iterate: compute the difference between each word's value in the current iteration and in the previous iteration; if it is less than 0.001, terminate the method and proceed to the next step; if the number of iterations reaches the threshold of 300, also terminate the method and proceed to the next step; otherwise perform the next round of iteration; by repeating the above steps, the convergence value of each page's PageRank value is taken as the final PageRank value;
For each word, consolidate its K PageRank values under the K different topics into one PageRank value, with the following merging formula:
TR(w_i) = δ^{H(w_i)} · Σ_{z=1}^{K} TR_z(w_i) × p(z | d)
where p(z | d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the information entropy affects the final PageRank value;
Sort the words in the document by their final PageRank values, and designate the top N words with the largest values as the keywords of the document.
9. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the word weight extraction module is as follows:
Compute the information entropy of each word from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed with the following formula:
H(w_i) = − Σ_{t=1}^{K} p(k = t | w_i) · log p(k = t | w_i)
Compute the weight of each word from the entropy obtained above, with the following formula:
weight(w_i) = δ^{H(w_i)}.
10. The topic model based document keyword extraction system according to claim 7, characterized in that the working principle of the document topic distribution extraction module is as follows:
Mark the background words of the document: compute the TF-IDF value of each word and select a threshold; words whose value falls below the threshold are regarded as carrying little information, and the background words are then selected from them by manual inspection;
The TF-IDF value is computed with the following formulas:
TF_i = n_i / Σ_k n_k
IDF_i = log(D / D_w)
TF-IDF_i = TF_i × IDF_i
In the above formulas, i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the term frequency of keyword t_i in the document, Σ_k n_k is the total number of word occurrences in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
Obtain the latent topic distribution of the document using bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, into which all topic-irrelevant words are gathered; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z | w)
i.e., the probability that the word is assigned to topic z given word w;
The working principle of the document structure graph construction module is as follows:
Select a sliding window length, denoted W;
Construct a sliding window of length W; for each word appearing in the window, build a graph node for it; if two words appear in the sliding window at the same time, add an edge between the two nodes represented by these words;
Move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph during the movement;
The working principle of the document information preprocessing module is as follows:
For Chinese text, segmentation is performed with a word segmentation tool; for English text, the document is tokenized on whitespace and stemming is applied to obtain word prototypes;
A part-of-speech tagging tool is used to annotate the part of speech of each segmented word;
According to the tagging results, function words and stop words are deleted from the document.
CN201610162410.5A 2016-03-21 2016-03-21 Document keyword abstraction method and its system based on topic model Active CN105843795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610162410.5A CN105843795B (en) 2016-03-21 2016-03-21 Document keyword abstraction method and its system based on topic model


Publications (2)

Publication Number Publication Date
CN105843795A true CN105843795A (en) 2016-08-10
CN105843795B CN105843795B (en) 2019-05-14


Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407316A (en) * 2016-08-30 2017-02-15 北京航空航天大学 Topic model-based software question and answer recommendation method and device
CN106484783A (en) * 2016-09-19 2017-03-08 济南浪潮高新科技投资发展有限公司 A kind of graphical representation method of report data
CN106599076A (en) * 2016-11-16 2017-04-26 深圳市异度信息产业有限公司 Forum induced graph generation method and apparatus
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107092595A (en) * 2017-04-23 2017-08-25 四川用联信息技术有限公司 New keyword extraction techniques
CN107102985A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in improved document
CN107102986A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in document
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
CN107797990A (en) * 2017-10-18 2018-03-13 渡鸦科技(北京)有限责任公司 Method and apparatus for determining text core sentence
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN108415900A (en) * 2018-02-05 2018-08-17 中国科学院信息工程研究所 A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure
CN108763390A (en) * 2018-05-18 2018-11-06 浙江新能量科技股份有限公司 Fine granularity subject distillation method based on sliding window technique
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108920456A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of keyword Automatic method
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110472005A (en) * 2019-06-27 2019-11-19 中山大学 A kind of unsupervised keyword extracting method
CN110493019A (en) * 2019-07-05 2019-11-22 深圳壹账通智能科技有限公司 Automatic generation method, device, equipment and the storage medium of meeting summary
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused textrank keyword extraction algorithm
CN112883171A (en) * 2021-02-02 2021-06-01 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN113094573A (en) * 2020-01-09 2021-07-09 中移(上海)信息通信科技有限公司 Multi-keyword sequencing searchable encryption method, device, equipment and storage medium
CN114020901A (en) * 2021-09-27 2022-02-08 南京云创大数据科技股份有限公司 Financial public opinion analysis method combining topic mining and emotion analysis
CN114332872A (en) * 2022-03-14 2022-04-12 四川国路安数据技术有限公司 Contract document fault-tolerant information extraction method based on graph attention network
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN114510565A (en) * 2020-11-16 2022-05-17 威联通科技股份有限公司 Method for automatically extracting, classifying and keyword-searching short texts and device adopting same
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292685A1 (en) * 2008-05-22 2009-11-26 Microsoft Corporation Video search re-ranking via multi-graph propagation
CN103955489A (en) * 2014-04-15 2014-07-30 华南理工大学 Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 Automatic extracting method and system of event evolving relationship based on news


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN JIN ET AL.: "LDA based Related Word Detection in Advertising", 《2010 SEVENTH WEB INFORMATION SYSTEMS AND APPLICATIONS CONFERENCE》 *
ZHIYUAN LIU ET AL.: "Automatic Keyphrase Extraction via Topic Decomposition", 《PROCEEDINGS OF THE 2010 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
刁倩 等: "VSM中词权重的信息熵算法", 《情报学报》 *
江雨燕 等: "基于共享背景主题的Labeled LDA模型", 《电子学报》 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407316A (en) * 2016-08-30 2017-02-15 北京航空航天大学 Topic model-based software question and answer recommendation method and device
CN106407316B (en) * 2016-08-30 2020-05-15 北京航空航天大学 Software question and answer recommendation method and device based on topic model
CN106484783A (en) * 2016-09-19 2017-03-08 济南浪潮高新科技投资发展有限公司 A kind of graphical representation method of report data
CN106599076A (en) * 2016-11-16 2017-04-26 深圳市异度信息产业有限公司 Forum induced graph generation method and apparatus
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107102986A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in document
CN107102985A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in improved document
CN107092595A (en) * 2017-04-23 2017-08-25 四川用联信息技术有限公司 New keyword extraction techniques
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107193803B (en) * 2017-05-26 2020-07-10 北京东方科诺科技发展有限公司 Semantic-based specific task text keyword extraction method
CN107665189A (en) * 2017-06-16 2018-02-06 平安科技(深圳)有限公司 A kind of method, terminal and equipment for extracting centre word
CN107665189B (en) * 2017-06-16 2019-12-13 平安科技(深圳)有限公司 method, terminal and equipment for extracting central word
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device
CN110019639A (en) * 2017-07-18 2019-07-16 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN110019639B (en) * 2017-07-18 2023-04-18 腾讯科技(北京)有限公司 Data processing method, device and storage medium
CN107797990A (en) * 2017-10-18 2018-03-13 渡鸦科技(北京)有限责任公司 Method and apparatus for determining text core sentence
CN108197117B (en) * 2018-01-31 2020-05-26 厦门大学 Chinese text keyword extraction method based on document theme structure and semantics
CN108197117A (en) * 2018-01-31 2018-06-22 厦门大学 A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN108415900A (en) * 2018-02-05 2018-08-17 中国科学院信息工程研究所 A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure
CN108763390A (en) * 2018-05-18 2018-11-06 浙江新能量科技股份有限公司 Fine granularity subject distillation method based on sliding window technique
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN108920456A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of keyword Automatic method
CN108920456B (en) * 2018-06-13 2022-08-30 北京信息科技大学 Automatic keyword extraction method
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN109635081B (en) * 2018-11-23 2023-06-13 上海大学 Text keyword weight calculation method based on word frequency power law distribution characteristics
CN109918660B (en) * 2019-03-04 2021-03-02 北京邮电大学 Keyword extraction method and device based on TextRank
CN109918660A (en) * 2019-03-04 2019-06-21 北京邮电大学 A kind of keyword extracting method and device based on TextRank
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110472005B (en) * 2019-06-27 2023-09-15 中山大学 Unsupervised keyword extraction method
CN110472005A (en) * 2019-06-27 2019-11-19 中山大学 An unsupervised keyword extraction method
CN110493019A (en) * 2019-07-05 2019-11-22 深圳壹账通智能科技有限公司 Automatic generation method, device, equipment, and storage medium for meeting minutes
CN110728136A (en) * 2019-10-14 2020-01-24 延安大学 Multi-factor fused TextRank keyword extraction algorithm
CN113094573A (en) * 2020-01-09 2021-07-09 中移(上海)信息通信科技有限公司 Multi-keyword ranked searchable encryption method, device, equipment, and storage medium
CN114510565A (en) * 2020-11-16 2022-05-17 威联通科技股份有限公司 Method for automatic extraction, classification, and keyword search of short texts, and device using the same
CN112883171A (en) * 2021-02-02 2021-06-01 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN112883171B (en) * 2021-02-02 2023-02-03 中国科学院计算技术研究所 Document keyword extraction method and device based on BERT model
CN114020901A (en) * 2021-09-27 2022-02-08 南京云创大数据科技股份有限公司 Financial public opinion analysis method combining topic mining and sentiment analysis
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN114328826B (en) * 2021-12-20 2024-06-11 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN114332872A (en) * 2022-03-14 2022-04-12 四川国路安数据技术有限公司 Contract document fault-tolerant information extraction method based on graph attention network
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN105843795B (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN105843795A (en) Topic model based document keyword extraction method and system
Thakkar et al. Graph-based algorithms for text summarization
CN103324665B (en) Hot topic information extraction method and device based on microblogs
US9183281B2 (en) Context-based document unit recommendation for sensemaking tasks
CN103678412B (en) A file retrieval method and device
CN107239512B (en) A microblog comment spam recognition method combining a comment relation network graph
CN102411638A (en) Method for generating multimedia summary of news search result
Lahiri et al. Keyword extraction from emails
CN104298732B (en) A personalized text ranking and recommendation method for network users
CN104182504A (en) Algorithm for dynamically tracking and summarizing news events
Li et al. Eos: expertise oriented search using social networks
Nicoletti et al. Mining interests for user profiling in electronic conversations
CN106874419B (en) A multi-granularity real-time hot topic aggregation method
Chatterjee et al. RENT: Regular expression and NLP-based term extraction scheme for agricultural domain
Marujo et al. Hourly traffic prediction of news stories
CN107066585A (en) A public opinion monitoring method and system based on probabilistic topic computation and matching
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction
Konagala et al. Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method
Kanakaraj et al. NLP based intelligent news search engine using information extraction from e-newspapers
You Automatic summarization and keyword extraction from web page or text file
EP3040932A1 (en) A method for tracking discussion in social media
Kannan et al. Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm
Yang et al. An Opinion-aware Approach to Contextual Suggestion.
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
Raj et al. Malayalam text summarization: Minimum spanning tree based graph reduction approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant