CN105843795A - Topic model based document keyword extraction method and system - Google Patents
- Publication number
- CN105843795A (publication CN 105843795 A; application CN201610162410.5A / CN201610162410A)
- Authority
- CN
- China
- Prior art keywords
- word
- document
- theme
- pagerank
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a topic model based document keyword extraction method and system. The extraction method comprises the following steps: document information preprocessing, document structure graph construction, document topic distribution extraction, word weight extraction, and keyword generation. The extraction system comprises the corresponding modules: a document information preprocessing module, a document structure graph construction module, a document topic distribution extraction module, a word weight extraction module, and a keyword generation module. With the method and system, the extracted keywords are more reasonable and more closely related to the topic of the document; some of the current shortcomings in the keyword extraction field are overcome, the keywords summarize the document better, and a user can quickly and conveniently grasp the gist of the document.
Description
Technical field
The present invention relates to data mining technology, and in particular to a topic model based document keyword extraction method and system.
Background art
Keywords summarize the main content of a document and are an important means of quickly understanding its subject. Keywords can be seen everywhere: on news websites every article carries tags, and when browsing a technical paper we can see the keywords it discusses. They reduce the difficulty of finding information in a sea of data. Keywords are now used in many fields. In information retrieval they are applied very widely: search engine companies such as Baidu and Google retrieve pages based on the keywords of web page text, and the documents returned by keyword search are often exactly what the user wants. In social networks, many features and related studies are built on the tags annotated by users. User tags make it convenient for users to manage, collect, and retrieve the tagged objects, and can also be used to provide personalized recommendations. By offering tagging functions on objects such as pictures, articles, and videos, and exploiting collective intelligence, we can obtain large numbers of annotated documents that supply data for our research work.
Keywords are widely used in many fields, and there are three general ways of producing them: 1) spontaneous annotation by users, who tag the content they are interested in; 2) manual annotation of documents by experts; 3) automatic keyword extraction techniques. Spontaneous user tagging suits only a limited range of scenarios: users tag only the specific objects they care about, and there is no effective way to motivate them to tag other content. Meanwhile, with the rapid development of information technology, the amount of information on the Internet grows explosively and new content is produced all the time; manual expert annotation of documents is expensive, the annotated documents can only be used for research, and commercial use is difficult. The demand for automatic document keyword extraction techniques is therefore urgent, and research on the related problems is a current focus.
Summary of the invention
The primary purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction method that makes the extracted keywords more reasonable and more representative.
Another purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a topic model based document keyword extraction system that addresses some weaknesses of current keyword extraction, so that the keywords summarize the document better and help the user understand the document quickly.
The primary purpose of the present invention is achieved through the following technical solution: a topic model based document keyword extraction method, comprising:
S1, document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
S2, document structure graph construction: the document structure graph describes the positional information of each word in the document. Each node of the graph represents one word; an edge linking two nodes indicates that the two words appear close together in the document. The present invention proposes a document structure graph construction method.
S3, document topic distribution extraction: each document has topics it emphasizes. This method uses topic model techniques to extract the topic distribution of the document and the topic distribution of each word in it. This method also proposes a background-word based topic model, which improves the effect of the topic model. Documents with similar topics describe similar things, and for each topic a set of related words can be extracted from the document collection.
S4, word weight extraction: the weight of a word represents its importance in the document. The more important a word is, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
S5, keyword generation: following the above steps, this method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and combines the topic model and word weights to compute a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The document information preprocessing comprises the following steps:
S1a, for Chinese text, a word segmentation tool is used to split the text; for English text, the document is tokenized on whitespace and word stemming is applied to obtain word prototypes;
S1b, a part-of-speech tagging tool is used to annotate the part of speech of each word obtained in step S1a;
S1c, using the tagging result of step S1b, function words and stop words are deleted from the document.
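The English branch of steps S1a and S1c can be sketched as follows. This is a minimal sketch: the patent names no specific tools, so the tiny stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins, and the part-of-speech filtering of S1b is omitted.

```python
import re

# Illustrative stop-word list; a real system would use a full list
# and a proper part-of-speech tagger (S1b), which the patent does not name.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to"}

def crude_stem(word):
    # Very rough stand-in for real stemming: strip a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # S1a: tokenize (whitespace/punctuation) and lowercase.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # S1c: drop stop words / function words, then stem the remainder.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

result = preprocess("The dogs are running in the parks")
```

The output of `preprocess` is the semi-structured word list that the later steps consume.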
The document structure graph construction comprises the following steps:
S2a, select a sliding window length, denoted W.
S2b, construct a sliding window of length W; for every word appearing in the window, build a graph node, and if two words appear in the sliding window simultaneously, add an edge between the two corresponding nodes.
S2c, move the sliding window from the head of the document toward its tail, continually adding nodes and edges to the graph as it moves.
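Steps S2a–S2c can be sketched as follows. A minimal sketch, assuming (as the later PageRank formula suggests) that the edge value e(u, v) counts how often u and v co-occur inside the window; the graph is treated as undirected by storing both directions.

```python
from collections import defaultdict

def build_structure_graph(words, W=3):
    """Build the document structure graph of steps S2a-S2c.

    Nodes are words; edge weight e(u, v) counts how often u and v
    co-occur inside a sliding window of length W.
    """
    edges = defaultdict(int)  # (u, v) -> co-occurrence count
    nodes = set(words)
    for start in range(max(1, len(words) - W + 1)):
        window = words[start:start + W]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                u, v = window[i], window[j]
                if u != v:
                    # Undirected: record the edge in both directions.
                    edges[(u, v)] += 1
                    edges[(v, u)] += 1
    return nodes, dict(edges)

nodes, edges = build_structure_graph(["a", "b", "c", "b"], W=2)
```

With W=2 the windows are [a,b], [b,c], [c,b], so b–c co-occurs twice while a–b co-occurs once.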
The document topic distribution extraction comprises the following steps:
S3a, document background word annotation: background words are words that carry no particular semantic meaning in the document; such words often confuse the topic model. Because their number is limited, we annotate them semi-automatically: we compute the TF-IDF value of every word, choose a threshold, and treat words below the threshold as carrying little information. These words are then browsed manually and the background words are selected from them.
This method uses the TF-IDF (Term Frequency-Inverse Document Frequency) weighting technique to compute the values. TF-IDF measures the importance of a word with respect to a document collection. Its main idea is: if a word occurs frequently in a document (high TF) but rarely in other documents (relatively low IDF), the word has strong discriminative power. The TF-IDF formula is:
TF-IDF_i = TF_i × IDF_i, with TF_i = n_i / Σ_k n_k and IDF_i = log(D / D_{w_i}).
Here i indexes the i-th word; n_i is the number of times word t_i occurs in the document; TF_i is the frequency of t_i in the document; Σ_k n_k is the total number of word occurrences in the document; IDF_i is the inverse document frequency of t_i; D is the number of documents in the system; and D_{w_i} is the number of documents in which t_i occurs.
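The TF-IDF computation of step S3a can be sketched as follows. A minimal sketch of the formulas above; the natural-logarithm base is an assumption, since the patent does not fix it.

```python
import math

def tf_idf(documents):
    """TF-IDF_i = TF_i * IDF_i for every word of every document,
    with TF_i = n_i / sum_k n_k and IDF_i = log(D / D_w)."""
    D = len(documents)
    # D_w: number of documents containing each word.
    doc_freq = {}
    for doc in documents:
        for w in set(doc):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    scores = []
    for doc in documents:
        total = len(doc)
        counts = {}
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
        scores.append({w: (n / total) * math.log(D / doc_freq[w])
                       for w, n in counts.items()})
    return scores

docs = [["topic", "model", "topic"], ["model", "graph"]]
scores = tf_idf(docs)
```

A word occurring in every document ("model") gets IDF = log(1) = 0, so it is a natural candidate background word under the thresholding of S3a.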
S3b, a variant of LDA (Latent Dirichlet Allocation) called bLDA (background based Latent Dirichlet Allocation) is used to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all latent topics in certain proportions, and each latent topic is characterized by a set of correlated feature words. In bLDA the first topic is designated as the background topic, and all topic-irrelevant words are gathered into it. Because Gibbs sampling can efficiently extract topics from a large document collection, this method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e., the probability that word w is assigned to topic z, which reflects the degree to which w belongs to topic z.
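A full Gibbs sampler for bLDA is beyond a sketch, but once a sampler's topic-word assignment counts n(z, w) are available, the per-word topic probability W_z(w) = p(z | w) is a simple normalization. The count matrix below is a made-up illustration (topic 0 playing the role of the bLDA background topic), not output of the patent's sampler.

```python
def word_topic_distribution(topic_word_counts):
    """Normalize sampler counts n(z, w) into W_z(w) = p(z | w).

    topic_word_counts: dict word -> list of counts, one per topic.
    """
    dist = {}
    for word, counts in topic_word_counts.items():
        total = sum(counts)
        dist[word] = [c / total for c in counts]
    return dist

# Hypothetical counts for 3 topics; topic 0 = background topic.
counts = {"the": [90, 5, 5], "neural": [0, 48, 2]}
p = word_topic_distribution(counts)
```

A semantically empty word like "the" ends up concentrated in the background topic, while a content word concentrates in a regular topic.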
The word weight extraction comprises the following steps:
S4a, step S3b yields the topic distribution of each word in the document. From this distribution we compute the word's information entropy: the larger the entropy, the more uniform the topic distribution, i.e. the higher the mixing degree; the smaller the entropy, the more concentrated the distribution and the lower the mixing degree. The entropy of the topic distribution of word w_i in the document is:
H(w_i) = −Σ_{z=1}^{K} p(z | w_i) log p(z | w_i)
S4b, the weight of each word is computed from its entropy:
weight(w_i) = 1 / H(w_i)
When the latent topic distribution of w_i is nearly uniform (the word is not representative), H(w_i) is large, so 1/H(w_i), i.e. the weight of the word, becomes small.
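Steps S4a–S4b can be sketched as follows. A minimal sketch, assuming weight(w) = 1/H(w), which is one consistent reading of the description (the patent states only that the weight shrinks as the entropy grows).

```python
import math

def topic_entropy(p_z_given_w):
    """H(w) = -sum_z p(z|w) log p(z|w), skipping zero-probability topics."""
    return -sum(p * math.log(p) for p in p_z_given_w if p > 0)

def word_weight(p_z_given_w):
    """weight(w) = 1 / H(w): concentrated distributions get high weight."""
    h = topic_entropy(p_z_given_w)
    return 1.0 / h if h > 0 else float("inf")

uniform = [0.25, 0.25, 0.25, 0.25]  # spread over all topics -> low weight
peaked = [0.97, 0.01, 0.01, 0.01]   # concentrated on one topic -> high weight
```

A word spread evenly over the K topics (high mixing degree) is penalized; a word concentrated on one topic is rewarded.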
The keyword generation comprises the following steps:
S5a, the PageRank value of each word is first randomly initialized.
S5b, for each document and each topic, this method uses PageRank to score the importance of each word. PageRank is an algorithm first proposed and used by Google: a web page graph is built from the in-links and out-links of the pages, and the influence of each page is computed from the resulting topology. It is usually applied in web page ranking to evaluate how important a page is. PageRank rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking to a page are, the more important that page is. This method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph.
For each topic in each document, this method computes the PageRank value of each word with the formula:
TR_t(w_i) = λ Σ_{w_j ∈ In(w_i)} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In this formula, λ is the damping factor, with range [0, 1]. A node does not necessarily have out-links; some nodes have zero out-degree and are thus isolated from the other pages. To ensure every page can still be reached, PageRank is corrected with the damping factor: (1 − λ) is the probability that a node jumps to another node. O(w_j) = Σ_k e(w_j, w_k) is the total out-link count of page w_j, and e(w_i, w_j) is the number of links from page w_i to w_j. The main idea of the formula is that the PageRank value of a page equals the sum of the values distributed to it by the pages linking to it.
Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z, which reflects the degree to which w belongs to topic z; this probability is provided by step S3b. For each document, K word graphs with different weights can be drawn according to the K topics; applying the above formula then yields K PageRank values for each word in the document.
S5c, the method is iterative: the difference between the weight values of each word in the current and the previous iteration is computed; if it is below 0.001 the iteration terminates and the method proceeds to the next step. If the iteration count reaches the termination threshold of 300, the method also terminates and proceeds to the next step; otherwise it returns to S5b for another round. As the steps repeat, the PageRank value of each page tends to a stable value, because the algorithm eventually converges, and this convergence value is the PageRank value.
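The per-topic iteration of S5b–S5c can be sketched as follows. A minimal sketch under the topical-PageRank form TR_t(w_i) = λ Σ_j (e(w_j, w_i)/O(w_j)) TR_t(w_j) + (1 − λ) W_t(w_i), which is one consistent reading of the description; uniform rather than random initialization is used here for determinism, and nodes with zero out-degree are assumed absent from the example graph.

```python
def topical_pagerank(edges, out_total, w_t, lam=0.85, tol=1e-3, max_iter=300):
    """One topic's PageRank iteration (steps S5b-S5c).

    edges:     dict (u, v) -> e(u, v), link count from u to v
    out_total: dict u -> O(u), total out-link count of u
    w_t:       dict word -> W_t(word) = p(z | word) for this topic
    """
    words = list(w_t)
    tr = {w: 1.0 / len(words) for w in words}  # deterministic initialization
    for _ in range(max_iter):
        new = {}
        for wi in words:
            inflow = sum(edges.get((wj, wi), 0) / out_total[wj] * tr[wj]
                         for wj in words if edges.get((wj, wi), 0))
            new[wi] = lam * inflow + (1 - lam) * w_t[wi]
        # S5c: stop when per-word change falls below 0.001.
        if max(abs(new[w] - tr[w]) for w in words) < tol:
            tr = new
            break
        tr = new
    return tr

edges = {("a", "b"): 1, ("b", "a"): 1, ("b", "c"): 1, ("c", "b"): 1}
out_total = {"a": 1, "b": 2, "c": 1}
tr = topical_pagerank(edges, out_total, {"a": 0.2, "b": 0.6, "c": 0.2})
```

The central, topic-relevant word "b" accumulates rank from both neighbors; the symmetric words "a" and "c" end up equal.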
S5d, for each word, this method merges its K PageRank values under the K topics into a single PageRank value. Concretely, with p(z | d) the probability that document d belongs to topic z, and adding the influence of the entropy when computing the weight of each word, the K values can be merged by:
TR(w_i) = (1 / (H(w_i) + δ)) Σ_{z=1}^{K} p(z | d) · TR_z(w_i)
where δ is a smoothing factor that controls how strongly the entropy affects the final PageRank value.
S5e, the words of the document are sorted by their final PageRank values, and the N words with the largest values are designated as the keywords of the document.
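Steps S5d–S5e can be sketched as follows. A minimal sketch, assuming the merge form TR(w) = (1/(H(w) + δ)) Σ_z p(z|d) TR_z(w) — one consistent reading of the description, since the patent states only that p(z|d) weights the K values and that δ smooths the entropy's influence. All numbers below are made up for illustration.

```python
def merge_topic_ranks(topic_ranks, p_z_given_d, entropies, delta=0.5):
    """Merge the K per-topic PageRank values of each word (step S5d)."""
    final = {}
    for w, ranks in topic_ranks.items():
        mixed = sum(p * r for p, r in zip(p_z_given_d, ranks))
        final[w] = mixed / (entropies[w] + delta)  # entropy penalty + smoothing
    return final

def top_n_keywords(final_ranks, n):
    """Step S5e: the N words with the largest final PageRank values."""
    return [w for w, _ in sorted(final_ranks.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

# Hypothetical per-topic ranks (K=2), document-topic mix, and entropies.
topic_ranks = {"graph": [0.6, 0.1], "the": [0.3, 0.3], "model": [0.1, 0.6]}
entropies = {"graph": 0.2, "the": 0.69, "model": 0.2}
final = merge_topic_ranks(topic_ranks, [0.7, 0.3], entropies)
```

The high-entropy word "the" is suppressed even though its raw ranks are moderate, matching the intent of the entropy weighting.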
Another purpose of the present invention is achieved through the following technical solution: a topic model based document keyword extraction system, comprising:
a document information preprocessing module: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established;
a document structure graph construction module: builds the document structure graph;
a document topic distribution extraction module: extracts the topic distribution of the document with the topic model;
a word weight extraction module: assigns a weight to each word in the document;
a keyword generation module: generates the document keywords.
The keyword generation module is specifically configured as follows.
First, PageRank is an algorithm first proposed and used by Google: a web page graph is built from the in-links and out-links of the pages, and the influence of each page is computed from the resulting topology. It is usually applied in web page ranking to evaluate how important a page is. PageRank rests on two assumptions: the quantity assumption, that the more in-links a page has, the more important it is; and the quality assumption, that the more important the pages linking to a page are, the more important that page is. This method converts the keyword extraction problem into the graph problem of extracting key nodes and applies the PageRank algorithm to the document structure graph. PageRank is widely used in the keyword extraction field.
In the present invention, for each topic in each document, this method computes the PageRank value of each word with the formula:
TR_t(w_i) = λ Σ_{w_j ∈ In(w_i)} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In this formula, λ is the damping factor, with range [0, 1]. A node does not necessarily have out-links; some nodes have zero out-degree and are thus isolated from the other pages. To ensure every page can still be reached, PageRank is corrected with the damping factor: (1 − λ) is the probability that a node jumps to another node. O(w_j) = Σ_k e(w_j, w_k) is the total out-link count of page w_j, and e(w_i, w_j) is the number of links from page w_i to w_j. The main idea of the formula is that the PageRank value of a page equals the sum of the values distributed to it by the pages linking to it. Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z, which reflects the degree to which w belongs to topic z; this probability is provided by step S3b. For each document, K word graphs with different weights can be drawn according to the K topics; applying the above formula then yields K PageRank values for each word in the document.
The present invention first randomly initializes the PageRank value of each word, then computes the PageRank value TR_t(w_i) of each word w_i for topic t, iterating until the iteration count exceeds 300 or the difference between two successive iterations is below 0.001.
Then, for each word, the present invention merges its K PageRank values under the K topics into a single PageRank value:
TR(w_i) = (1 / (H(w_i) + δ)) Σ_{z=1}^{K} p(z | d) · TR_z(w_i)
where p(z | d) is the probability that document d belongs to topic z; the influence of the entropy is added when computing the weight of each word, and δ is a smoothing factor that controls how strongly the entropy affects the final PageRank value.
The words of the document are sorted by their final PageRank values, the N words with the largest values are designated as the keywords of the document, and the keyword extraction result is presented to the user.
The document topic distribution extraction module is specifically configured as follows.
First, LDA (Latent Dirichlet Allocation) is a latent topic model with many variants; the present invention proposes another variant, bLDA (background based Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. In bLDA, all documents in the collection share all latent topics in certain proportions, and each latent topic is characterized by a set of correlated feature words. The first topic is designated as the background topic, and all topic-irrelevant words are gathered into it. Because Gibbs sampling can efficiently extract topics from a large document collection, this method uses Gibbs sampling to solve bLDA. Through bLDA, the probability of each topic for each word in the document can be obtained:
W_z(w_i) = p(z | w)
i.e., the probability that word w is assigned to topic z, which reflects the degree to which w belongs to topic z. The document topic distribution extraction module produces the topic distribution of each word of the document.
The word weight extraction module is specifically configured as follows.
The present invention computes the information entropy of each word from its topic distribution. The entropy of the topic distribution of word w_i in the document is:
H(w_i) = −Σ_{z=1}^{K} p(z | w_i) log p(z | w_i)
Then the weight of each word is computed from its entropy:
weight(w_i) = 1 / H(w_i)
The larger weight(w_i) is, the fewer topics the word is spread over and the more important it is. Conversely, the smaller weight(w_i) is, the higher the word's mixing degree, the weaker its representativeness, and the lower its probability of becoming a document keyword.
Brief description of the drawings
Fig. 1 is the overall flowchart of the topic model based document keyword extraction method disclosed in the present invention.
Detailed description of the invention
The present invention is described in further detail below in conjunction with the embodiment and the accompanying drawing, but embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the overall flow of the topic model based document keyword extraction method comprises the following steps:
Document information preprocessing: the input document is segmented and part-of-speech tagged, function words and stop words are removed, stems are extracted, and semi-structured data is established.
Document structure graph construction: the document structure graph describes the positional information of each word in the document. Each node of the graph represents one word; an edge linking two nodes indicates that the two words appear close together in the document. The present invention proposes a document structure graph construction method.
Document topic distribution extraction: each document has topics it emphasizes. This method uses topic model techniques to extract the topic distribution of the document and the topic distribution of each word in it. This method also proposes a background-word based topic model, which improves the effect of the topic model. Documents with similar topics describe similar things, and for each topic a set of related words can be extracted from the document collection.
Word weight extraction: the weight of a word represents its importance in the document. The more important a word is, the higher its weight; conversely, a low-weight word is of low importance in the document. The present invention proposes a weight extraction method.
Keyword generation: following the above steps, this method converts the keyword extraction problem into the graph problem of extracting key nodes. It applies the PageRank algorithm to the document structure graph and combines the topic model and word weights to compute a score for each word; the larger the score, the more likely the word is a keyword of the document. The present invention proposes a keyword generation method.
The topic model based document keyword extraction method provided by the present invention is elaborated below.
The document information preprocessing module: for Chinese text, a word segmentation tool is used to split the text; for English text, the document is tokenized on whitespace and word stemming is applied to obtain word prototypes. A part-of-speech tagging tool is then used to obtain the part of speech of each word. Finally, function words and stop words are deleted from the document, keeping only nouns, adjectives, and verbs, which reduces noise when building the topic words. Stemming removes the influence of English morphology; for example, 'dog' and 'dogs' should be treated as the same word.
The document structure graph construction module: to convert the text processing problem into a graph problem, the document must be converted to a graph. To preserve as much of the document's information as possible, the present invention builds a sliding window of length W; for every word appearing in the window a graph node is built, and if two words appear in the sliding window simultaneously, an edge is added between the two corresponding nodes. The window moves from the head of the document toward its tail, continually adding nodes and edges to the graph.
The document topic distribution extraction module: first, the background words of the document are annotated. The present invention annotates them semi-automatically: we compute the TF-IDF value of every word, choose a threshold, and treat words below the threshold as carrying little information; these words are then browsed manually and the background words are selected from them. Then the present invention uses bLDA (background based Latent Dirichlet Allocation), a variant of LDA (Latent Dirichlet Allocation), to obtain the latent topic distribution of the document. On the basis of LDA, the present invention adds a background topic into which all topic-irrelevant words are gathered. Gibbs sampling is then used to extract the topics from the document collection.
The word weight extraction module: the present invention computes the information entropy of each word from its topic distribution. The larger the entropy, the more uniform the topic distribution, i.e. the higher the mixing degree; the smaller the entropy, the more concentrated the distribution and the lower the mixing degree. The entropy of the topic distribution of word w_i in the document is:
H(w_i) = −Σ_{z=1}^{K} p(z | w_i) log p(z | w_i)
The weight of each word is computed from its entropy:
weight(w_i) = 1 / H(w_i)
When the latent topic distribution of w_i is nearly uniform (the word is not representative), H(w_i) is large, so 1/H(w_i), i.e. the weight of the word, becomes small.
The keyword generation module: the present invention first randomly initializes the PageRank value of each word. Then, for each document and each topic, this method uses PageRank to score the importance of each word; finally, for each word, the scores under its different topics are accumulated with weights to obtain the word's final score. For each topic in each document, this method computes the PageRank value of each word with the formula:
TR_t(w_i) = λ Σ_{w_j ∈ In(w_i)} (e(w_j, w_i) / O(w_j)) · TR_t(w_j) + (1 − λ) · W_t(w_i)
In this formula, λ is the damping factor, with range [0, 1]. A node does not necessarily have out-links; some nodes have zero out-degree and are thus isolated from the other pages. To ensure every page can still be reached, PageRank is corrected with the damping factor: (1 − λ) is the probability that a node jumps to another node. O(w_j) = Σ_k e(w_j, w_k) is the total out-link count of page w_j, and e(w_i, w_j) is the number of links from page w_i to w_j. The main idea of the formula is that the PageRank value of a page equals the sum of the values distributed to it by the pages linking to it. Here W_t(w_i) = p(z | w), the probability that word w is assigned to topic z, which reflects the degree to which w belongs to topic z; this probability is provided by step S3b. For each document, K word graphs with different weights can be drawn according to the K topics; applying the above formula then yields K PageRank values for each word in the document.
We compute the PageRank value of each word iteratively until a termination condition is met. The termination conditions of the present invention are: 1) the difference between the weight values of each word in the current and the previous iteration is below 0.001; 2) the iteration count reaches the termination threshold of 300. Under continued iteration the PageRank value of each page tends to a stable value, because the algorithm eventually converges, and this convergence value is the PageRank value.
For each word, this method merges its K PageRank values under the K topics into a single PageRank value. Concretely, with p(z | d) the probability that document d belongs to topic z, and adding the influence of the entropy when computing the weight of each word, the K values are merged by:
TR(w_i) = (1 / (H(w_i) + δ)) Σ_{z=1}^{K} p(z | d) · TR_z(w_i)
where δ is a smoothing factor that controls how strongly the entropy affects the final PageRank value.
The words of the document are sorted by their final PageRank values, and the N words with the largest values are designated as the keywords of the document.
The present embodiment also discloses a topic model based document keyword extraction system, comprising the following modules:
a document information preprocessing module, for segmenting and part-of-speech tagging the input document, removing function words and stop words, extracting stems, and establishing semi-structured data;
a document structure graph construction module, for building the document structure graph, wherein the document structure graph describes the positional information of each word in the document, each node of the graph represents one word, and an edge linking two nodes indicates that the two words appear close together in the document;
a document topic distribution extraction module, for extracting the topic distribution of the document and the topic distribution of each word in it through the background-word based topic model technique;
a word weight extraction module, for extracting the weight of each word in the document, wherein the weight of a word represents its importance in the document;
a keyword generation module, for converting the keyword extraction problem into the graph problem of extracting key nodes, applying the PageRank algorithm to the document structure graph, combining the topic model and the word weights, computing a score for each word, and taking the highest-scoring words as the keywords of the document.
(1) The operation principle of the keyword generation module is as follows:
first, the PageRank value of each word is randomly initialized;
then, for every document and for each topic, the PageRank method is used to score the importance of each word, where the PageRank value of each word is computed by the following formula:
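The equation image itself is not reproduced in this text. A plausible reconstruction, assuming the standard topical-PageRank form and the symbols defined in the surrounding description (λ, the link counts e(w_j, w_i), the out-link total O(w_j), and Wt(w_i) = p(z|w)):

$$PR_z(w_i) = \lambda \sum_{w_j \in In(w_i)} \frac{e(w_j, w_i)}{O(w_j)} \, PR_z(w_j) + (1 - \lambda) \, Wt(w_i)$$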
In the above formula, λ is the damping factor, with value range [0, 1], which modifies PageRank; (1 - λ) is the probability that each node jumps to another node; O(w_i) is the total number of out-links of page w_i, and e(w_i, w_j) is the number of links from page w_i to page w_j; Wt(w_i) = p(z|w), i.e. the probability that the word is assigned to topic z given the word w.
The computation is iterated: the difference between each word's weight in the current iteration and in the previous iteration is calculated; if it is less than 0.001, the method terminates and proceeds to the next step; if the number of iterations reaches the termination threshold of 300, the method also terminates and proceeds to the next step; otherwise another round of iteration is carried out. By constantly repeating the above steps, the converged PageRank value of each page is taken as its final PageRank value.
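As an illustration, the per-topic scoring and iteration described above can be sketched in Python. This is a minimal sketch under stated assumptions; the function and variable names (`topical_pagerank`, `edges`, `topic_prob`) are illustrative, not from the patent:

```python
# Minimal sketch of the per-topic PageRank iteration described above,
# with the 0.001 convergence tolerance and 300-iteration cap.

def topical_pagerank(edges, topic_prob, damping=0.85, tol=1e-3, max_iter=300):
    """edges: {word: {neighbour: link count e(w_i, w_j)}}.
    topic_prob: {word: p(z | w)} for one fixed topic z."""
    words = list(topic_prob)
    pr = {w: 1.0 / len(words) for w in words}                       # initialisation
    out_total = {w: sum(edges.get(w, {}).values()) for w in words}  # O(w_i)
    for _ in range(max_iter):
        new_pr = {}
        for wj in words:
            # Sum contributions from every word wi that links to wj.
            rank_sum = sum(pr[wi] * e_ij / out_total[wi]
                           for wi in words
                           for nbr, e_ij in edges.get(wi, {}).items() if nbr == wj)
            # The (1 - damping) mass is redistributed according to Wt(wj) = p(z | wj).
            new_pr[wj] = damping * rank_sum + (1 - damping) * topic_prob[wj]
        # Terminate when every word's change falls below the tolerance.
        if all(abs(new_pr[w] - pr[w]) < tol for w in words):
            pr = new_pr
            break
        pr = new_pr
    return pr
```

This sketch is run once per topic z, producing the K per-topic values that the next step merges.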
For each word, its K PageRank values under the K different topics are consolidated into a single PageRank value; the merge formula is as follows:
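The merge formula does not survive in this text. One plausible form, consistent with the surrounding description (the K per-topic values weighted by p(z|d), with the word's information entropy H(w_i) entering through the smoothing factor δ); this is a reconstruction under stated assumptions, not the patent's verbatim equation:

$$PR(w_i) = \frac{1}{H(w_i) + \delta} \sum_{z=1}^{K} p(z \mid d) \, PR_z(w_i)$$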
where p(z|d) is the probability that document d belongs to topic z, and δ is a smoothing factor that controls how strongly the value of the information entropy affects the final PageRank value.
The final PageRank values of the words in the document are sorted, and the N words with the largest values are designated as the keywords of the document.
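The merge-and-rank step can be sketched as below. The entropy/δ weighting is folded into a per-word weight argument, since the exact merged formula is not reproduced in this text; that factoring, and all names, are assumptions of this sketch:

```python
def merge_and_rank(topic_pageranks, doc_topic_prob, word_weight, top_n):
    """topic_pageranks: list of K dicts, one per topic z, mapping word -> PR_z(word);
    doc_topic_prob: list of K values p(z | d);
    word_weight: {word: entropy-derived weight} (assumed factoring);
    returns the top_n words with the largest merged PageRank values."""
    merged = {}
    for word in topic_pageranks[0]:
        # Weight each per-topic PageRank by the document's topic probability,
        # then scale by the word's entropy-derived weight.
        merged[word] = word_weight[word] * sum(
            p_zd * pr_z[word] for p_zd, pr_z in zip(doc_topic_prob, topic_pageranks))
    return sorted(merged, key=merged.get, reverse=True)[:top_n]
```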
(2) The operation principle of the word weight extraction module is as follows:
the information entropy of each word is computed from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed as follows:
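Assuming the standard Shannon entropy over the word's topic distribution (a reconstruction, since the original equation does not survive in this text), this reads:

$$H(w_i) = -\sum_{z=1}^{K} p(z \mid w_i) \log p(z \mid w_i)$$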
The weight of each word is then calculated from the obtained information entropy; the formula is as follows:
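The weight formula is likewise not reproduced here. One plausible form, assuming that a lower entropy (a more topic-specific word) yields a higher weight, smoothed by the factor δ introduced above:

$$W(w_i) = \frac{1}{H(w_i) + \delta}$$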
(3) The operation principle of the document topic distribution extraction module is as follows:
document background words are labeled by computing the TF-IDF value of each word and selecting a threshold; words below this threshold are regarded as words carrying little information, and background words are then selected from them by manual inspection.
The TF-IDF value is computed as follows:
TF-IDF_i = TF_i · IDF_i,
where i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the word frequency of keyword t_i in all documents, Σ_k n_k is the total number of occurrences of all words in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs.
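The TF-IDF computation and thresholding step can be sketched as follows. The log-based IDF is the common convention and an assumption here, as the text above does not spell out the exact form; all names are illustrative:

```python
import math

def tfidf_scores(doc_counts, doc_freq, num_docs):
    """doc_counts: {word: occurrences n_i in this document};
    doc_freq: {word: number of documents D_w containing the word};
    num_docs: total number of documents D in the system."""
    total = sum(doc_counts.values())               # sum_k n_k
    scores = {}
    for word, n_i in doc_counts.items():
        tf = n_i / total                           # TF_i
        idf = math.log(num_docs / doc_freq[word])  # IDF_i (assumed log form)
        scores[word] = tf * idf                    # TF-IDF_i = TF_i * IDF_i
    return scores

def background_candidates(scores, threshold):
    """Words below the threshold carry little information and become
    candidates for manual background-word selection."""
    return [w for w, s in sorted(scores.items(), key=lambda kv: kv[1]) if s < threshold]
```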
The latent topic distribution of the document is obtained with bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, so that all words unrelated to any topic are gathered into it; Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z|w),
the probability that the word is assigned to topic z given the word w.
(4) The operation principle of the document structure graph construction module is as follows:
a sliding window length is selected and denoted W;
a sliding window of length W is built; a graph node is created for each word appearing in the window, and if two words appear in the sliding window at the same time, an edge is added between the two nodes representing them;
the sliding window is moved from the head of the document towards its tail, and nodes and edges are continually added to the graph during the movement.
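The sliding-window construction above can be sketched as follows (illustrative names; undirected co-occurrence edges with counts, which is an assumption since the text does not say whether edges are weighted):

```python
from collections import defaultdict
from itertools import combinations

def build_structure_graph(words, window):
    """Slide a window of length `window` from the head of the word sequence
    to its tail; words co-occurring in a window get a (counted) edge."""
    edges = defaultdict(int)   # (word_a, word_b) -> co-occurrence count
    nodes = set()
    for start in range(max(1, len(words) - window + 1)):
        span = words[start:start + window]
        nodes.update(span)
        # Add an (undirected) edge for every pair of distinct words in the window.
        for a, b in combinations(sorted(set(span)), 2):
            edges[(a, b)] += 1
    return nodes, dict(edges)
```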
(5) The operation principle of the document information preprocessing module is as follows:
for Chinese text, part-of-speech division uses a word segmentation tool to segment the text; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes;
a part-of-speech tagging tool is used to tag the part of speech of each segmented word;
according to the tagging results, the function words and stop words in the document are deleted.
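For the English branch, the preprocessing can be sketched as below. The stemmer is a deliberately naive stand-in (a real system would use e.g. a Porter stemmer, and a segmentation tool for Chinese), and the stop-word list is illustrative:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}  # illustrative

def naive_stem(word):
    """Stand-in for a real stemmer; strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Whitespace/letter tokenisation for English text, stop-word removal, stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```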
In each method embodiment of the present invention, the sequence numbers of the steps are not intended to limit the order in which the steps are performed; for those of ordinary skill in the art, changing the order of the steps without creative effort also falls within the protection scope of the present invention.
For the specific working processes of the modules or units described above, reference may be made to the corresponding processes in the preceding method embodiments, which are not repeated here.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.
Claims (10)
1. A topic-model-based document keyword extraction method, characterised in that the document keyword extraction method specifically comprises the following steps:
S1, document information preprocessing: performing word part-of-speech division on the input document, removing function words and stop words, extracting word stems, and establishing semi-structured data;
S2, document structure graph construction: building a document structure graph, wherein the document structure graph describes the position information of each word in the document, each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words the two nodes represent appear close together in the document;
S3, document topic distribution extraction: extracting, through a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
S4, word weight extraction: extracting the weight of each word in the document, wherein the weight of each word represents that word's degree of importance in the document;
S5, keyword generation: converting the keyword extraction problem into the graph-algorithm problem of extracting key nodes, applying the PageRank algorithm to the document structure graph in combination with the topic model and the word weights to compute a score for each word, and taking the words with the largest scores as the keywords of the document.
2. The topic-model-based document keyword extraction method according to claim 1, characterised in that step S5, keyword generation, is specifically as follows:
S5a, first randomly initializing the PageRank value of each word;
S5b, for every document and for each topic, using the PageRank method to score the importance of each word, wherein the PageRank value of each word is computed by the following formula:
in the above formula, λ is the damping factor, with value range [0, 1], which modifies PageRank; (1 - λ) is the probability that each node jumps to another node; O(w_i) is the total number of out-links of page w_i, and e(w_i, w_j) is the number of links from page w_i to page w_j; wherein Wt(w_i) = p(z|w), i.e. the probability that the word is assigned to topic z given the word w;
S5c, iterating, and computing the difference between each word's weight in the current iteration and in the previous iteration; if the difference is less than 0.001, terminating the method and entering the next step; if the number of iterations reaches the termination threshold of 300, also terminating the method and entering the next step; otherwise returning to S5b for another round of iteration; by constantly repeating the above steps, taking the converged PageRank value of each page as its final PageRank value;
S5d, for each word, consolidating its K PageRank values under the K different topics into a single PageRank value, the merge formula being as follows:
wherein p(z|d) is the probability that document d belongs to topic z, and δ is a smoothing factor controlling how strongly the value of the information entropy affects the final PageRank value;
S5e, sorting the final PageRank values of the words in the document, and designating the N words with the largest values as the keywords of the document.
3. The topic-model-based document keyword extraction method according to claim 1, characterised in that step S4, word weight extraction, is specifically as follows:
S4a, computing the information entropy of each word from its topic distribution, wherein for word w_i in the document the information entropy of its topic distribution is computed as follows:
S4b, calculating the weight of each word from the obtained information entropy, the formula being as follows:
4. The topic-model-based document keyword extraction method according to claim 1, characterised in that step S3, document topic distribution extraction, is specifically as follows:
S3a, labeling document background words: computing the TF-IDF value of each word and selecting a threshold, regarding words below this threshold as words carrying little information, and then selecting background words from them by manual inspection; the TF-IDF value is computed as follows:
TF-IDF_i = TF_i · IDF_i,
wherein i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the word frequency of keyword t_i in all documents, Σ_k n_k is the total number of occurrences of all words in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
S3b, obtaining the latent topic distribution of the document with bLDA (background-based latent Dirichlet allocation): setting the first topic as the background topic, so that all words unrelated to any topic are gathered into it, and using Gibbs sampling to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z|w),
the probability that the word is assigned to topic z given the word w.
5. The topic-model-based document keyword extraction method according to claim 1, characterised in that step S2, document structure graph construction, is specifically as follows:
S2a, selecting a sliding window length, denoted W;
S2b, building a sliding window of length W, creating a graph node for each word appearing in the window, and, if two words appear in the sliding window at the same time, adding an edge between the two nodes representing them;
S2c, moving the sliding window from the head of the document towards its tail, continually adding nodes and edges to the graph during the movement.
6. The topic-model-based document keyword extraction method according to claim 1, characterised in that step S1, document information preprocessing, is specifically as follows:
S1a, for Chinese text, using a word segmentation tool to segment the text for part-of-speech division; for English text, tokenizing the document on whitespace and applying word stemming to obtain word prototypes;
S1b, using a part-of-speech tagging tool to tag the part of speech of each segmented word;
S1c, deleting the function words and stop words in the document according to the tagging results.
7. A topic-model-based document keyword extraction system, characterised in that the document keyword extraction system comprises the following modules:
a document information preprocessing module, configured to perform word part-of-speech division on the input document, remove function words and stop words, extract word stems, and establish semi-structured data;
a document structure graph construction module, configured to build a document structure graph, wherein the document structure graph describes the position information of each word in the document, each node of the graph represents one word of the document, and an edge linking two nodes indicates that the words the two nodes represent appear close together in the document;
a document topic distribution extraction module, configured to extract, through a background-word-based topic model technique, the topic distribution of the document and the topic distribution of each word in the document;
a word weight extraction module, configured to extract the weight of each word in the document, wherein the weight of each word represents that word's degree of importance in the document;
a keyword generation module, configured to convert the keyword extraction problem into the graph-algorithm problem of extracting key nodes, apply the PageRank algorithm to the document structure graph in combination with the topic model and the word weights to compute a score for each word, and take the words with the largest scores as the keywords of the document.
8. The topic-model-based document keyword extraction system according to claim 7, characterised in that the operation principle of the keyword generation module is as follows:
first, the PageRank value of each word is randomly initialized;
for every document and for each topic, the PageRank method is used to score the importance of each word, wherein the PageRank value of each word is computed by the following formula:
in the above formula, λ is the damping factor, with value range [0, 1], which modifies PageRank; (1 - λ) is the probability that each node jumps to another node; O(w_i) is the total number of out-links of page w_i, and e(w_i, w_j) is the number of links from page w_i to page w_j; wherein Wt(w_i) = p(z|w), i.e. the probability that the word is assigned to topic z given the word w;
the computation is iterated, and the difference between each word's weight in the current iteration and in the previous iteration is calculated; if it is less than 0.001, the method terminates and enters the next step; if the number of iterations reaches the termination threshold of 300, the method also terminates and enters the next step; otherwise another round of iteration is carried out; by constantly repeating the above steps, the converged PageRank value of each page is taken as its final PageRank value;
for each word, its K PageRank values under the K different topics are consolidated into a single PageRank value, the merge formula being as follows:
wherein p(z|d) is the probability that document d belongs to topic z, and δ is a smoothing factor controlling how strongly the value of the information entropy affects the final PageRank value;
the final PageRank values of the words in the document are sorted, and the N words with the largest values are designated as the keywords of the document.
9. The topic-model-based document keyword extraction system according to claim 7, characterised in that the operation principle of the word weight extraction module is as follows:
the information entropy of each word is computed from its topic distribution; for word w_i in the document, the information entropy of its topic distribution is computed as follows:
the weight of each word is then calculated from the obtained information entropy, the formula being as follows:
10. The topic-model-based document keyword extraction system according to claim 7, characterised in that the operation principle of the document topic distribution extraction module is as follows:
document background words are labeled by computing the TF-IDF value of each word and selecting a threshold; words below this threshold are regarded as words carrying little information, and background words are then selected from them by manual inspection;
the TF-IDF value is computed as follows:
TF-IDF_i = TF_i · IDF_i,
wherein i indexes the i-th keyword, n_i is the number of times word t_i occurs in the document, TF_i is the word frequency of keyword t_i in all documents, Σ_k n_k is the total number of occurrences of all words in the document, IDF_i is the inverse document frequency of keyword t_i, D is the number of all documents in the system, and D_w is the number of documents in which word t_i occurs;
the latent topic distribution of the document is obtained with bLDA (background-based latent Dirichlet allocation): the first topic is set as the background topic, so that all words unrelated to any topic are gathered into it, and Gibbs sampling is used to solve the bLDA model, yielding for each word in the document the probability of each corresponding topic:
W_z(w_i) = p(z|w),
the probability that the word is assigned to topic z given the word w;
the operation principle of the document structure graph construction module is as follows:
a sliding window length is selected and denoted W;
a sliding window of length W is built; a graph node is created for each word appearing in the window, and if two words appear in the sliding window at the same time, an edge is added between the two nodes representing them;
the sliding window is moved from the head of the document towards its tail, and nodes and edges are continually added to the graph during the movement;
the operation principle of the document information preprocessing module is as follows:
for Chinese text, part-of-speech division uses a word segmentation tool to segment the text; for English text, the document is tokenized on whitespace, and word stemming is applied to obtain word prototypes;
a part-of-speech tagging tool is used to tag the part of speech of each segmented word;
according to the tagging results, the function words and stop words in the document are deleted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610162410.5A CN105843795B (en) | 2016-03-21 | 2016-03-21 | Document keyword abstraction method and its system based on topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105843795A true CN105843795A (en) | 2016-08-10 |
CN105843795B CN105843795B (en) | 2019-05-14 |
Family
ID=56587704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610162410.5A Active CN105843795B (en) | 2016-03-21 | 2016-03-21 | Document keyword abstraction method and its system based on topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105843795B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090292685A1 (en) * | 2008-05-22 | 2009-11-26 | Microsoft Corporation | Video search re-ranking via multi-graph propagation |
CN103955489A (en) * | 2014-04-15 | 2014-07-30 | 华南理工大学 | Distributed mass short text KNN (K Nearest Neighbor) classification algorithm and distributed mass short text KNN classification system based on information entropy feature weight quantification |
CN104915446A (en) * | 2015-06-29 | 2015-09-16 | 华南理工大学 | Automatic extracting method and system of event evolving relationship based on news |
Non-Patent Citations (4)
Title |
---|
XIN JIN ET AL.: "LDA based Related Word Detection in Advertising", 《2010 SEVENTH WEB INFORMATION SYSTEMS AND APPLICATIONS CONFERENCE》 * |
ZHIYUAN LIU ET AL.: "Automatic Keyphrase Extraction via Topic Decomposition", 《PROCEEDINGS OF THE 2010 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 * |
刁倩 等: "VSM中词权重的信息熵算法", 《情报学报》 * |
江雨燕 等: "基于共享背景主题的Labeled LDA模型", 《电子学报》 * |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407316A (en) * | 2016-08-30 | 2017-02-15 | 北京航空航天大学 | Topic model-based software question and answer recommendation method and device |
CN106407316B (en) * | 2016-08-30 | 2020-05-15 | 北京航空航天大学 | Software question and answer recommendation method and device based on topic model |
CN106484783A (en) * | 2016-09-19 | 2017-03-08 | 济南浪潮高新科技投资发展有限公司 | A kind of graphical representation method of report data |
CN106599076A (en) * | 2016-11-16 | 2017-04-26 | 深圳市异度信息产业有限公司 | Forum induced graph generation method and apparatus |
CN106844416A (en) * | 2016-11-17 | 2017-06-13 | 中国科学院计算技术研究所 | A kind of sub-topic method for digging |
CN107102986A (en) * | 2017-04-23 | 2017-08-29 | 四川用联信息技术有限公司 | Multi-threaded keyword extraction techniques in document |
CN107102985A (en) * | 2017-04-23 | 2017-08-29 | 四川用联信息技术有限公司 | Multi-threaded keyword extraction techniques in improved document |
CN107092595A (en) * | 2017-04-23 | 2017-08-25 | 四川用联信息技术有限公司 | New keyword extraction techniques |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | 东软集团股份有限公司 | A kind of document subject matter determines method and device |
CN107193803A (en) * | 2017-05-26 | 2017-09-22 | 北京东方科诺科技发展有限公司 | A kind of particular task text key word extracting method based on semanteme |
CN107193803B (en) * | 2017-05-26 | 2020-07-10 | 北京东方科诺科技发展有限公司 | Semantic-based specific task text keyword extraction method |
CN107665189A (en) * | 2017-06-16 | 2018-02-06 | 平安科技(深圳)有限公司 | A kind of method, terminal and equipment for extracting centre word |
CN107665189B (en) * | 2017-06-16 | 2019-12-13 | 平安科技(深圳)有限公司 | method, terminal and equipment for extracting central word |
CN107391613A (en) * | 2017-07-04 | 2017-11-24 | 北京航空航天大学 | A kind of automatic disambiguation method of more documents of industry security theme and device |
CN110019639A (en) * | 2017-07-18 | 2019-07-16 | 腾讯科技(北京)有限公司 | Data processing method, device and storage medium |
CN110019639B (en) * | 2017-07-18 | 2023-04-18 | 腾讯科技(北京)有限公司 | Data processing method, device and storage medium |
CN107797990A (en) * | 2017-10-18 | 2018-03-13 | 渡鸦科技(北京)有限责任公司 | Method and apparatus for determining text core sentence |
CN108197117B (en) * | 2018-01-31 | 2020-05-26 | 厦门大学 | Chinese text keyword extraction method based on document theme structure and semantics |
CN108197117A (en) * | 2018-01-31 | 2018-06-22 | 厦门大学 | A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme |
CN108415900A (en) * | 2018-02-05 | 2018-08-17 | 中国科学院信息工程研究所 | A kind of visualText INFORMATION DISCOVERY method and system based on multistage cooccurrence relation word figure |
CN108763390A (en) * | 2018-05-18 | 2018-11-06 | 浙江新能量科技股份有限公司 | Fine granularity subject distillation method based on sliding window technique |
CN108776653A (en) * | 2018-05-25 | 2018-11-09 | 南京大学 | A kind of text segmenting method of the judgement document based on PageRank and comentropy |
CN108763213A (en) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Theme feature text key word extracting method |
CN108920456A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of keyword Automatic method |
CN108920456B (en) * | 2018-06-13 | 2022-08-30 | 北京信息科技大学 | Automatic keyword extraction method |
CN109635081A (en) * | 2018-11-23 | 2019-04-16 | 上海大学 | A kind of text key word weighing computation method based on word frequency power-law distribution characteristic |
CN109635081B (en) * | 2018-11-23 | 2023-06-13 | 上海大学 | Text keyword weight calculation method based on word frequency power law distribution characteristics |
CN109918660B (en) * | 2019-03-04 | 2021-03-02 | 北京邮电大学 | Keyword extraction method and device based on TextRank |
CN109918660A (en) * | 2019-03-04 | 2019-06-21 | 北京邮电大学 | A kind of keyword extracting method and device based on TextRank |
CN109960724A (en) * | 2019-03-13 | 2019-07-02 | 北京工业大学 | A kind of text snippet method based on TF-IDF |
CN110162592A (en) * | 2019-05-24 | 2019-08-23 | 东北大学 | A kind of news keyword extracting method based on the improved TextRank of gravitation |
CN110472005B (en) * | 2019-06-27 | 2023-09-15 | 中山大学 | Unsupervised keyword extraction method |
CN110472005A (en) * | 2019-06-27 | 2019-11-19 | 中山大学 | A kind of unsupervised keyword extracting method |
CN110493019A (en) * | 2019-07-05 | 2019-11-22 | 深圳壹账通智能科技有限公司 | Automatic generation method, device, equipment and the storage medium of meeting summary |
CN110728136A (en) * | 2019-10-14 | 2020-01-24 | 延安大学 | Multi-factor fused textrank keyword extraction algorithm |
CN113094573A (en) * | 2020-01-09 | 2021-07-09 | 中移(上海)信息通信科技有限公司 | Multi-keyword sequencing searchable encryption method, device, equipment and storage medium |
CN114510565A (en) * | 2020-11-16 | 2022-05-17 | 威联通科技股份有限公司 | Method for automatically extracting, classifying and keyword-searching short texts and device adopting same |
CN112883171A (en) * | 2021-02-02 | 2021-06-01 | 中国科学院计算技术研究所 | Document keyword extraction method and device based on BERT model |
CN112883171B (en) * | 2021-02-02 | 2023-02-03 | 中国科学院计算技术研究所 | Document keyword extraction method and device based on BERT model |
CN114020901A (en) * | 2021-09-27 | 2022-02-08 | 南京云创大数据科技股份有限公司 | Financial public opinion analysis method combining topic mining and emotion analysis |
CN114328826A (en) * | 2021-12-20 | 2022-04-12 | 青岛檬豆网络科技有限公司 | Method for extracting key words and abstracts of technical achievements and technical requirements |
CN114328826B (en) * | 2021-12-20 | 2024-06-11 | 青岛檬豆网络科技有限公司 | Method for extracting keywords and abstracts of technical achievements and technical demands |
CN114332872A (en) * | 2022-03-14 | 2022-04-12 | 四川国路安数据技术有限公司 | Contract document fault-tolerant information extraction method based on graph attention network |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN105843795B (en) | 2019-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105843795A (en) | Topic model based document keyword extraction method and system | |
Thakkar et al. | Graph-based algorithms for text summarization | |
CN103324665B (en) | Hot spot information extraction method and device based on micro-blog | |
US9183281B2 (en) | Context-based document unit recommendation for sensemaking tasks | |
CN103678412B (en) | A kind of method and device of file retrieval | |
CN107239512B (en) | A kind of microblogging comment spam recognition methods of combination comment relational network figure | |
CN102411638A (en) | Method for generating multimedia summary of news search results
Lahiri et al. | Keyword extraction from emails | |
CN104298732B (en) | Personalized text ranking and recommendation method for network users
CN104182504A (en) | Algorithm for dynamically tracking and summarizing news events | |
Li et al. | Eos: expertise oriented search using social networks | |
Nicoletti et al. | Mining interests for user profiling in electronic conversations | |
CN106874419B (en) | Multi-granularity real-time hot spot aggregation method
Chatterjee et al. | RENT: Regular expression and NLP-based term extraction scheme for agricultural domain | |
Marujo et al. | Hourly traffic prediction of news stories | |
CN107066585A (en) | Public opinion monitoring method and system based on probabilistic topic computation and matching
Lin et al. | Combining a segmentation-like approach and a density-based approach in content extraction | |
Konagala et al. | Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method | |
Kanakaraj et al. | NLP based intelligent news search engine using information extraction from e-newspapers | |
You | Automatic summarization and keyword extraction from web page or text file | |
EP3040932A1 (en) | A method for tracking discussion in social media | |
Kannan et al. | Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm | |
Yang et al. | An Opinion-aware Approach to Contextual Suggestion. | |
Lim et al. | Generalized and lightweight algorithms for automated web forum content extraction | |
Raj et al. | Malayalam text summarization: Minimum spanning tree based graph reduction approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||