CN106844416A - Sub-topic mining method - Google Patents

Sub-topic mining method

Info

Publication number
CN106844416A
CN106844416A CN201611024146.5A CN201611024146A
Authority
CN
China
Prior art keywords
word
topic
theme
sub
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611024146.5A
Other languages
Chinese (zh)
Other versions
CN106844416B (en)
Inventor
李静远
丘志杰
刘悦
程学旗
王凤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201611024146.5A priority Critical patent/CN106844416B/en
Publication of CN106844416A publication Critical patent/CN106844416A/en
Application granted granted Critical
Publication of CN106844416B publication Critical patent/CN106844416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a sub-topic mining method, including: 1) initializing the topic value of every word of every document in a corpus; 2) based on the current topic values of the words in each document, for each word in every article, respectively computing the probability that the word comes from each sub-topic and the probability that the word comes from a background module, and then, based on the computed probabilities, reassigning a topic value to each word in every article using the Gibbs sampling algorithm; wherein the probability that a word comes from the background module is computed from a word distribution vector of the background module counted in advance, and the word distribution vector of the background module remains constant throughout the iterative process; 3) if the condition for stopping the iteration is met, deriving the LDA sub-topics from the current topic value information; otherwise, returning to step 2). The present invention can significantly improve the topic mining effect on special-topic article collections.

Description

Sub-topic mining method
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a topic mining method.
Background technology
At present, the mining and analysis of topics has always been an important research direction in the field of natural language processing, with wide applications in public opinion analysis and other fields. The information explosion triggered by the rapid development of online social networks leaves ordinary users at a loss when facing the huge amount of rapidly generated information, so the presentation of information on online social networks is generally becoming more categorized and fine-grained. Under this trend, information distribution is more detailed and compact, for example the HashTag label mechanism on microblogs and the special-topic article collection mechanism of WeChat public accounts. The demand for organizing such topical information into finer-grained article clusters keeps growing, and further mining of the sub-topics within a special-topic collection has become a hot issue of concern to both industry and academia.
Traditional topic analysis methods use strategies such as text clustering and topic models. These strategies are generally applicable, but their topic mining results on finely categorized, compact special-topic collections are not satisfactory. The most common phenomenon is that ordinary topic mining methods have low discrimination for sets of articles sharing the same background, and the resulting topics are poorly differentiated. The mainstream approach to sub-topic analysis focuses on finding the differentiated subject information inside a special topic: among articles that share a large common background, the main task is to find the differences between the articles, form each differentiated theme into a sub-topic, and find representative keywords for each sub-topic. Precisely because these articles share a common background, sub-topic analysis differs substantially from ordinary topic analysis. For example, when the LDA (Latent Dirichlet Allocation) topic model is used for sub-topic analysis, since all articles share a similar background, LDA cannot segment the information finely: articles belonging to slightly different sub-topics may have their differences submerged by the common background knowledge, and the captured subject information and topic words are merged under the same theme because their similarity is too high.
In recent years, research on topic mining algorithms has been carried out both at home and abroad, and some progress has been made. Hot-topic mining algorithms can be summarized into two classes. The first class uses classification and clustering algorithms for hot-topic mining. For example, it has been proposed to improve hierarchical clustering with the Group Average Clustering algorithm for retrospective topic discovery. Others have studied how to detect popular topics using the DBSCAN algorithm, but the final experimental results did not reach the expected level. The Single-pass clustering algorithm has also been proposed; it is especially suitable for online topic detection and can give acceptable topic clustering results at relatively low algorithmic complexity.
The second class of methods uses traditional topic models: an LDA topic model is built directly on Twitter messages to extract related topic information. For example, the semi-supervised learning model L-LDA has been proposed, which can be used to study the interest distribution of users. Others have proposed an improved LDA and the Hierarchical Dirichlet Process (HDP) on the basis of distributed algorithms, which can be used for topic analysis. A new topic model, the Correlated Topic Model (CTM), has also been proposed; it models the correlations between topics with a normal distribution. Others designed and implemented a news-oriented topic mining system called TwitterStand, which can capture currently popular Twitter news topics. There is also research on automatically generating content summaries of microblogs by analyzing microblog content, which is another research approach to topic mining. For example, one scheme uses a single sentence to summarize a microblog topic and help users quickly understand hot issues; on this basis, a method that represents a topic with multiple sentences has been proposed, mainly to overcome the shortcoming that a single sentence cannot adequately represent the topic information.
On the other hand, a domain-specific Chinese hot microblog topic mining system (BTopicMiner) has also been proposed. Its authors hold that, to solve the inherent data sparsity of microblog information, content-related Twitter messages can first be merged into microblog documents using a text clustering method; they also hold that the follow-up relation between microblogs implies topic relevance, and on this basis they extend the traditional LDA topic model to model the follow-up relations between microblogs; finally, mutual information (MI) is used to compute key phrases for the extracted topics for hot-topic recommendation.
However, none of the above research focuses on further analysis of the sub-topics inside an already categorized special topic. As noted above, when mining the sub-topics inside a special topic, all articles share a similar background; articles belonging to different sub-topics may have their differences submerged by this common background knowledge, and the captured subject information and topic words are merged under the same theme because their similarity is too high. The above prior art cannot effectively overcome this problem.
Therefore, there is an urgent need for a solution that can effectively mine further differentiated sub-topics inside a special topic.
Summary of the invention
The task of the present invention is to provide a solution that can effectively mine further differentiated sub-topics inside a special topic.
According to one aspect of the present invention, there is provided an LDA sub-topic mining method that suppresses background noise, comprising the following steps:
1) Initializing the topic value of every word of every document in a corpus, wherein the range of topic values is a set of K+1 values, one of which corresponds to a background module and the remaining K values correspond respectively to the K sub-topics to be segmented;
2) Based on the current topic values of the words in each document, for each word in every article, respectively computing the probability that the word comes from each sub-topic and the probability that the word comes from the background module, and then, based on the computed probabilities, reassigning a topic value to each word in every article using the Gibbs sampling algorithm; wherein the probability that a word comes from the background module is computed from a word distribution vector of the background module counted in advance, and the word distribution vector of the background module remains constant throughout the iterative process;
3) Judging whether the condition for stopping the iteration is met; if it is met, deriving the LDA sub-topics from the topic values of the words of each current document; if not, returning to step 2) to continue the iteration and update the topic value of each word of each document.
Wherein, step 2) includes the substeps:
21) choosing a document;
22) choosing a word from the current document;
23) for the current word, based on the current topic value assignments of the other words, i.e., excluding the topic information of this word itself, computing the probability that the word comes from each theme, the themes including the K sub-topics and 1 background module;
24) performing Gibbs sampling according to the computed probabilities, reassigning a topic value to the current word according to the sampling result, and then returning to step 22) to process the next word, until the current document has been processed;
25) if the current document has been processed, returning to step 21) and starting to process the next document, until all documents have been processed.
Wherein, in step 23), the probability value that the current word belongs to each sub-topic is calculated according to the following equation:
The probability value that the current word belongs to the background module is calculated according to the following equation:
wherein z_i denotes the topic value of the current word, k denotes the topic value of a sub-topic, b denotes the topic value of the background module, and the subscript i denotes the index of the current word; z_{-i} denotes the vector of topic values of the other words excluding the word with index i, and w denotes the vector of words; n_{m,-i}^{(k)} denotes the number of words whose topic is k in the set of other words in document m excluding word i, and n_{k,-i}^{(t)} denotes the number of occurrences of word t whose topic is k in the set of other words in the corpus excluding word i; α denotes the topic hyperparameter, β denotes the hyperparameter of the topic-word distribution, φ_t denotes the probability corresponding to word t in the word distribution vector of the background module, and λ denotes the weight regulating factor between the background module and the topic modules.
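A plausible form of these two formulas, assuming the standard collapsed Gibbs sampling update of LDA extended with the background weight λ and the fixed background distribution φ (this is a reconstruction from the symbol definitions above rather than the original formulas, and the exact normalization may differ), is:

p(z_i = k \mid \vec{z}_{-i}, \vec{w}) \propto (1-\lambda)\,\bigl(n_{m,-i}^{(k)} + \alpha\bigr)\,\frac{n_{k,-i}^{(t)} + \beta}{\sum_{t'} n_{k,-i}^{(t')} + V\beta}, \qquad k = 1, \dots, K

p(z_i = b \mid \vec{z}_{-i}, \vec{w}) \propto \lambda\,\phi_t

where V is the vocabulary size; the K+1 values are then normalized to sum to 1 before sampling.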
Wherein, in step 3), the condition for stopping the iteration is that the number of iterations reaches a preset value.
Wherein, the following step is also performed between step 2) and step 3):
30) according to the current topic value of each word in the corpus, after excluding the words belonging to the background module, computing a theme-word distribution matrix;
In step 3), the condition for stopping the iteration is: comparing the current theme-word distribution matrix with the theme-word distribution matrix computed in the previous round of iteration; if the change is less than a preset threshold, the condition for stopping the iteration is considered met, otherwise the condition for stopping the iteration is considered not met.
Wherein, in step 3), when the condition for stopping the iteration is met, the current theme-word distribution matrix is output, and the K sub-topics and their corresponding keywords are obtained from this matrix.
Wherein, in step 1), the initialization is: for every word of every document, a topic value is assigned to it in a uniformly random manner among the K+1 options corresponding to the K sub-topics and the 1 background module.
Wherein, in step 2), the word distribution vector of the background module is obtained by counting the number of occurrences of each word of the global corpus.
Wherein, in step 30), for any word, when the probability that the word belongs to the background module exceeds a preset threshold, the word is deemed to belong to the background module and is excluded when the theme-word distribution matrix is computed.
According to another aspect of the present invention, there is also provided another LDA sub-topic mining algorithm that suppresses background noise (hereinafter sometimes referred to as the BLDA algorithm), including:
Step a) a Gibbs sampling process that incorporates background noise;
Step b) a BLDA algorithm that adds a background module to the LDA algorithm model;
wherein step a) specifically comprises the following step:
Step a-1) starting from the composition structure of the corpus, incorporating the idea of removing background noise and extending the LDA algorithm;
wherein step b) specifically comprises the following steps:
Step b-1) counting the word distribution vector of the background module;
Step b-2) redesigning the keyword generation process during iteration;
wherein step b-2) specifically comprises the following steps:
Step b-2-1) setting the proportion of the background module;
Step b-2-2) respectively computing the generation probability of the word from the background corpus and the generation probability of the word from each differentiated sub-topic;
Step b-2-3) setting a rule to judge whether a word comes from the background corpus;
Step b-2-4) sampling each word using the new sampling algorithm;
Step b-2-5) after the iteration ends, considering all words whose topic value is not that of the background module, and computing the two matrices: the document-topic matrix and the topic-word matrix.
Compared with the prior art, the present invention has the following technical effects:
1. The present invention proposes a sub-topic mining algorithm, BLDA, based on an extension of the LDA algorithm, which provides a new approach for the sub-topic analysis of massive amounts of already categorized thematic information;
2. A series of experiments with BLDA on WeChat public account data demonstrates that, for topic mining on special-topic collections, it clearly outperforms the LDA algorithm in theme recall and in several clustering metrics.
Brief description of the drawings
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings, in which:
Fig. 1 shows the flow chart of the background-noise-suppressing LDA sub-topic mining method of one embodiment of the present invention;
Fig. 2 shows the document-theme-word generation path diagram of the original LDA model;
Fig. 3 shows a schematic diagram of the word generation process of the background-noise-removing LDA model of one embodiment of the present invention;
Fig. 4 shows a statistical chart comparing the theme recall of LDA and BLDA;
Fig. 5 shows a comparison of the average theme similarity of LDA and BLDA;
Fig. 6 shows an analysis chart of the Purity comparison results of the LDA and BLDA algorithms;
Fig. 7 shows the NMI comparison results of the LDA and BLDA algorithms;
Fig. 8 shows the F-value comparison results of the LDA and BLDA algorithms.
Specific embodiments
According to one embodiment of the present invention, there is provided an LDA sub-topic mining method that suppresses background noise. Compared with the original LDA algorithm, the sub-topic mining method of this embodiment adds a special theme, namely the background corpus theme. In this embodiment, a word is considered to possibly come from a differentiated topic model or from the background corpus; the word distribution of the background corpus does not change during the iterations of the topic model and can be computed in advance by statistics over the overall corpus, whereas the word probability distributions of the differentiated topic models need to be computed in the subsequent update iterations.
Fig. 1 shows the flow chart of the background-noise-suppressing LDA sub-topic mining method of this embodiment, which includes the following steps:
Step 1: Based on the global corpus, compute the probability that each word comes from the background corpus, and thereby obtain the word distribution vector φ serving as background knowledge. In one example, the number of occurrences of every word in the global corpus is counted directly, a V-dimensional word frequency vector is maintained, a smoothing term is then added, and the vector is finally normalized. In this embodiment, the word distribution vector φ of the background module only needs to be counted once and does not change afterwards.
In this embodiment, background knowledge is a statistic computed in advance, obtained by computing the generation probability of each word in the global corpus. Background knowledge is illustrated below with a simple example. Suppose there is a corpus containing the following two documents:
1. I come from Beijing, I am a Beijinger.
2. Coming to such-and-such square for the first time, I am very excited.
Counting the information above, there are 15 word tokens in total; "I" appears 3 times, so the probability that "I" comes from the background corpus is 3/15, and "such-and-such" appears 1 time in total, so the probability that "such-and-such" comes from the background corpus is 1/15.
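A minimal sketch of this background statistic, assuming plain token counts with an optional additive smoothing term (the tokenization, the smoothing constant and the function name are illustrative assumptions rather than the patent's exact implementation):

from collections import Counter

def background_distribution(documents, smoothing=0.0):
    """Count word occurrences over the whole corpus and normalize into phi.

    documents: list of token lists, one list per document.
    smoothing: optional additive smoothing term added to every count.
    Returns a dict mapping each word to its background probability phi[word].
    """
    counts = Counter(token for doc in documents for token in doc)
    vocab = list(counts)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

# With the two toy documents above (15 tokens in total, "I" appearing 3 times)
# and smoothing = 0, the probability returned for "I" is 3/15 and the
# probability for "such-and-such" is 1/15, matching the example.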
Step 2: Set the number of themes. Here, a theme refers to a sub-topic to be subdivided within the special topic, i.e., the number of themes is the number of differentiated themes. Hereinafter this number is denoted K.
Step 3: Initialize the topic value of each word of each article in the corpus. The topic values represent the K different themes and the background module respectively; the background module can be understood as a special theme. In one example, 1 to K represent the topic values of the K themes, and -1 represents the topic value of the background module.
Initialization means assigning an initial topic value to each word of each article in the corpus. In one embodiment, a topic value is randomly assigned to each word (this topic value may be the topic value of the background module); for example, for each word of each article, a topic value can be assigned in a uniformly random manner among the K+1 options (the K themes plus the background module).
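A minimal sketch of this random initialization, assuming topic values 1 to K for the sub-topics and -1 for the background module as in the example above (function and variable names are illustrative):

import random

def initialize_topics(documents, K, seed=None):
    """Assign every token a topic value drawn uniformly at random:
    1..K for the K sub-topics, -1 for the background module."""
    rng = random.Random(seed)
    options = list(range(1, K + 1)) + [-1]  # K sub-topics plus the background module
    return [[rng.choice(options) for _ in doc] for doc in documents]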
Step 4: Based on the current topic values, for each word in every article, compute the probability that the word comes from each theme, then, based on the computed probabilities, reassign a theme using the Gibbs sampling algorithm.
This step contains a two-level loop and includes the substeps:
Step 41: Choose a document.
Step 42: Choose a word from the current document.
Step 43: For the current word, based on the current topic value assignments of the other words, excluding the topic information of this word itself, compute the probability that the word comes from each theme.
In this step, the probability that the word comes from each theme is computed based on the LDA model with the background module added. Unlike the original LDA algorithm, this embodiment adds a background module, which changes the word generation process: it must be considered that a word may come from the background, or from a differentiated topic model.
Table 1 gives the meaning of each parameter in the LDA model with the background module added (which may simply be called the BLDA model). The parameter λ sets the proportion of the background module, and its value range is a real number from 0 to 1. When λ is 0, the BLDA model degenerates to the original LDA model, i.e., there is no difference between the two. As the value of λ increases, the proportion of the background module also increases.
Table 1
In the BLDA model, the weight regulating factor λ can be determined empirically, or determined experimentally with a small test corpus. When computing the probability that a word comes from each theme, the generation probability of the word from the background corpus and the generation probabilities of the word from each differentiated sub-topic are computed respectively.
For any word, if the generation probability of the word from the background corpus exceeds a preset threshold, the word is directly considered to come from the background corpus, and the word is ignored in the process of computing the document-topic probabilities and the topic-word probabilities. Otherwise, if the probability that the background corpus generates the word does not exceed the threshold, the background generation path is also added to the sampling paths. Referring to Fig. 2 and Fig. 3, Fig. 2 shows the document-theme-word generation path diagram of the original LDA model, and Fig. 3 shows the word generation process of the background-noise-removing LDA model (i.e., the BLDA model) of this embodiment. Relative to the original LDA model, the BLDA model adds a background module, and the differentiated topic modules constitute the non-background module. In the BLDA model, when a word is generated, it may come from the background module or from the non-background module: when it comes from the background module, the theme probability value of the word is derived from the weight regulating factor λ and the background knowledge obtained in step 1; when it comes from the non-background module, the theme probability value of the word is derived from the weight regulating factor λ and the data processing method of the original LDA model.
In one example, the probability value of each theme for a word and the probability value of the background module are computed as shown in formulas (1) and (2), and the normalization is computed as shown in formulas (3) and (4).
In the formulas, k denotes the topic value of a sub-topic, b denotes the topic value of the background module, and the subscript i denotes the index of the current word; z_{-i} denotes the vector of topic values of the other words excluding the word with index i, and w denotes the vector of words (the term vector). n_{m,-i}^{(k)} denotes the number of words whose topic is k in the set of other words in document m excluding word i, and n_{k,-i}^{(t)} denotes the number of occurrences of word t (word t here can be understood as the term corresponding to the current word) whose topic is k in the set of other words in the corpus excluding the current word with index i; only words whose topic value is 1 to K are counted here. Gibbs sampling is performed on each word, and every word has a topic value: when the sampling result for the word is the background module, its topic value is set to -1, otherwise the topic value is between 1 and K. Meanwhile, the statistical variables are maintained to facilitate computing the probabilities that other words belong to each topic value.
After the probability that the current word comes from each theme has been computed, proceed to step 44.
Step 44: Perform Gibbs sampling according to the computed probabilities, and reassign a theme to the current word (assign a new topic value) according to the sampling result. Then return to step 42 and process the next word, until the current document has been processed.
Step 45: If the current document has been processed, return to step 41 and start processing the next document, until all documents have been processed.
Through the two-level loop of steps 41 to 45, the topic values of all words in the corpus are updated once.
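A minimal sketch of one such sweep over all documents is given below, using the reconstructed sampling probabilities shown after the symbol definitions in the summary above; the count bookkeeping, parameter values and variable names are illustrative assumptions, and the threshold rule that assigns near-certain background words directly to the background is omitted for brevity:

import random

def build_counts(documents, topics, K):
    """Build count tables from the current topic assignments (index 0 unused).

    n_mk[m][k]: number of non-background tokens with topic k in document m
    n_kt[k][w]: number of occurrences of word w assigned topic k in the corpus
    n_k[k]    : total number of tokens assigned topic k
    """
    n_mk = [[0] * (K + 1) for _ in documents]
    n_kt = [dict() for _ in range(K + 1)]
    n_k = [0] * (K + 1)
    for m, doc in enumerate(documents):
        for i, w in enumerate(doc):
            k = topics[m][i]
            if k != -1:
                n_mk[m][k] += 1
                n_kt[k][w] = n_kt[k].get(w, 0) + 1
                n_k[k] += 1
    return n_mk, n_kt, n_k

def gibbs_sweep(documents, topics, phi, K, alpha, beta, lam, n_mk, n_kt, n_k, rng=random):
    """One BLDA Gibbs sampling pass: resample the topic value of every token."""
    V = len(phi)
    options = list(range(1, K + 1)) + [-1]  # K sub-topics, then the background module
    for m, doc in enumerate(documents):
        for i, w in enumerate(doc):
            old = topics[m][i]
            if old != -1:  # remove the current assignment from the counts
                n_mk[m][old] -= 1
                n_kt[old][w] -= 1
                n_k[old] -= 1
            # unnormalized probability of each sub-topic (non-background path)
            weights = [(1.0 - lam) * (n_mk[m][k] + alpha)
                       * (n_kt[k].get(w, 0) + beta) / (n_k[k] + V * beta)
                       for k in range(1, K + 1)]
            # unnormalized probability of the background module, from the fixed phi
            weights.append(lam * phi.get(w, 0.0))
            new = rng.choices(options, weights=weights)[0]
            topics[m][i] = new
            if new != -1:  # add the new assignment back into the counts
                n_mk[m][new] += 1
                n_kt[new][w] = n_kt[new].get(w, 0) + 1
                n_k[new] += 1

In use, the count tables would be built once from the initial assignments, e.g. n_mk, n_kt, n_k = build_counts(documents, topics, K), and then reused across repeated calls to gibbs_sweep until the stopping condition of step 6 is met; the experiments below set λ to 0.4, while α and β are ordinary LDA hyperparameters.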
Step 5: According to the current topic value of each word in the corpus, after excluding the words belonging to the background module, compute the theme-word distribution matrix.
Step 6: Compare the current theme-word distribution matrix with the theme-word distribution matrix computed in the previous round of iteration. If the change is less than a preset threshold, convergence is assumed and step 7 is executed; otherwise return to step 4 and start the next round of iteration.
In this step, the change being less than a preset threshold is in fact the condition for stopping the iteration; in other embodiments, other stopping conditions can be used, for example whether the number of iterations has reached a certain value.
Step 7: Output the current theme-word distribution matrix; the K themes and their corresponding keywords can be obtained from this matrix.
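A minimal sketch of steps 5 to 7, assuming a smoothed count-based estimate of the theme-word distribution over non-background words only and an element-wise change threshold as the convergence test (both are illustrative assumptions that reuse the count tables from the sketch above):

def topic_word_matrix(n_kt, n_k, vocab, K, beta):
    """Estimate the theme-word distribution from the counts of non-background tokens."""
    V = len(vocab)
    return [[(n_kt[k].get(w, 0) + beta) / (n_k[k] + V * beta) for w in vocab]
            for k in range(1, K + 1)]

def converged(previous, current, threshold=1e-4):
    """Stop when the largest element-wise change between successive matrices is small."""
    if previous is None:
        return False
    changes = (abs(a - b) for prev_row, curr_row in zip(previous, current)
               for a, b in zip(prev_row, curr_row))
    return max(changes) < threshold

def top_keywords(matrix, vocab, topn=20):
    """For each theme, return the topn most probable words as its keywords."""
    return [[w for _, w in sorted(zip(row, vocab), reverse=True)[:topn]]
            for row in matrix]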
Based on the above flow, the accuracy of sub-topic mining on special-topic collections can be significantly improved. The inventors conducted experiments on two aspects: the topic mining ability of the BLDA algorithm and the topic clustering ability of the BLDA algorithm. The experimental data are taken from articles published by WeChat public accounts: all articles from February 25 to February 28 containing specific related words were used as the training corpus, and the experimental data were manually labeled into 20 categories, comprising 3,487 documents in total.
Analysis of the topic mining ability of BLDA:
For the above corpus, the ordinary LDA algorithm and the BLDA algorithm were run respectively, the topic mining results of the two algorithms on the labeled data set were observed, and objective evaluation indices were given to compare the topic mining ability of the two algorithms on special-topic collections.
The subject keywords extracted by the BLDA algorithm are shown in Table 2; due to space limitations, only the first 12 subject keyword lists are given here. Direct observation of the subject keyword groups extracted by the BLDA algorithm shows that the keyword relevance between different themes is very small; compared with the subject keywords extracted by the ordinary LDA algorithm (i.e., the original LDA algorithm), the effect is clearly improved. More detailed quantitative analysis results are given below.
Table 2
In this experiment, the extracted subject keywords were analyzed by manual judgment. The criterion of judgment is theme recall: in each experiment, every model gives 20 keywords for each theme, which are regarded as the representatives of that theme; manual analysis judges whether these 20 keywords can represent the theme information, and then judges how much of the original theme information can be recalled.
Here the keyword groups extracted by the BLDA and LDA algorithms are given respectively, and the theme recall of the two models is computed at 400, 500 and 600 iterations. BLDA and LDA share the same parameter settings, and the parameter λ in BLDA is set to 0.4.
Fig. 4 shows a statistical chart comparing the theme recall of BLDA and LDA. It can be seen from Fig. 4 that, for this batch of corpus with the same background, the theme recall of BLDA is on average much higher than that of LDA, and as the number of iterations increases, the average theme recall of BLDA also improves correspondingly until convergence. It can also be seen from the experimental results that the background module is very helpful for theme discovery: the average theme recall of the algorithm is improved by 170% compared with the theme recall of LDA.
The second judgment criterion for subject keywords is the amount of differentiating information between the keyword groups of different themes, which is also the main purpose of adopting the background-removal idea. This index is difficult to test directly, so the present invention uses average similarity as the index: the similarity of keyword groups is computed as the ratio of co-occurring words between them, which is the Jaccard similarity. Fig. 5 shows the comparison of the average theme similarity of LDA and BLDA; its parameter settings are consistent with the previous group of experiments. It can be seen from Fig. 5 that the average similarity of the keyword groups extracted by BLDA is clearly lower than that of the groups extracted by LDA; the average similarity is reduced by 495% on average.
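A minimal sketch of this keyword-group similarity, assuming the plain Jaccard coefficient over keyword sets and an unweighted average over all pairs of theme keyword groups (names are illustrative):

def jaccard(group_a, group_b):
    """Jaccard similarity between two keyword groups: |A intersect B| / |A union B|."""
    a, b = set(group_a), set(group_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def average_pairwise_similarity(keyword_groups):
    """Average Jaccard similarity over all pairs of theme keyword groups."""
    n = len(keyword_groups)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(jaccard(keyword_groups[i], keyword_groups[j]) for i, j in pairs) / len(pairs)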
Analysis of the clustering ability of BLDA:
The most direct evaluation index for clustering is judging the clusters from the clustering result itself. The purity method is an extremely simple clustering evaluation method: it only needs to compute the ratio of the number of correctly clustered documents to the total number of documents.
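A minimal sketch of purity, assuming each cluster is credited with its most frequent true label, which is the usual definition of a "correctly clustered" document (names are illustrative):

from collections import Counter

def purity(clusters):
    """clusters: list of clusters, each a list of the true labels of its documents."""
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return correct / total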
The evaluation results of the two models under identical parameter settings (except λ) were computed respectively, and comparison charts of the two results at 300, 400, 500 and 600 iterations are given, as shown in Fig. 6. It can be seen that the subject information purity of the BLDA algorithm at every stage of the iteration is far higher than that of the LDA algorithm, with an average improvement of 143%; the effectiveness of the BLDA algorithm can therefore be seen from the clustering results.
On the other hand, NMI is an index that can maintain a balance between clustering quality and the number of clusters. Fig. 7 gives a detailed comparison of the NMI results of the LDA and BLDA algorithms; the clustering result index NMI of BLDA is improved by 160.2% on average.
A higher NMI value indicates a better clustering effect, and the NMI index of the BLDA algorithm is far better than that of the LDA algorithm.
Another evaluation index, different from those above, regards clustering as a series of decisions, i.e., decisions on all N(N-1)/2 pairs of documents in the document set: two documents should be placed in the same cluster if and only if they are similar. A TP decision places two similar documents in the same cluster, and a TN decision places two dissimilar documents in two different clusters. Two kinds of errors can be made in this decision process: an FP decision places two dissimilar documents in the same cluster, and an FN decision places two similar documents in different clusters. RI computes the ratio of correct decisions.
In the most basic computation of RI, FN and FP have the same weight, but sometimes splitting similar documents apart is more serious than putting dissimilar documents into the same class. Therefore the clustering result can be measured with the F value, and the penalty on FN can be adjusted by setting different regulating factors.
With the regulating factor fixed at 1, the F values (a variant of the RI family) of BLDA and LDA were computed after 300, 400, 500 and 600 iterations respectively. Fig. 8 shows the comparison of the F-value indices of the LDA and BLDA algorithms. The results show that the F values of the BLDA model are significantly better than those of the LDA model.
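A minimal sketch of this pairwise-decision evaluation, assuming the standard F_beta over pairwise TP/FP/FN counts, with the regulating factor beta fixed at 1 as in the experiment (names are illustrative):

from itertools import combinations

def pairwise_f(pred_labels, true_labels, beta=1.0):
    """F_beta over pairwise decisions: a pair is 'positive' when both documents share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred_labels)), 2):
        same_pred = pred_labels[i] == pred_labels[j]
        same_true = true_labels[i] == true_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)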
In summary, the inventors verified, from the three indices of AC (accuracy), NMI and the RI coefficient, that the actual clustering results of the BLDA algorithm are more reasonable than those of the original LDA algorithm. The experimental results also quantitatively prove and explain the effectiveness of the BLDA algorithm.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention and not restrictive. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent replacements may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention, all of which should be covered by the scope of the claims of the present invention.

Claims (9)

1. A sub-topic mining method, comprising the following steps:
1) initializing the topic value of every word of every document in a corpus, wherein the range of topic values is a set of K+1 values, one of which corresponds to a background module and the remaining K values correspond respectively to K sub-topics to be segmented;
2) based on the current topic values of the words in each document, for each word in every article, respectively computing the probability that the word comes from each sub-topic and the probability that the word comes from the background module, and then, based on the computed probabilities, reassigning a topic value to each word in every article using a Gibbs sampling algorithm; wherein the probability that a word comes from the background module is computed from a word distribution vector of the background module counted in advance, and the word distribution vector of the background module remains constant throughout the iterative process;
3) repeating step 2) until the condition for stopping the iteration is met, and after the iteration stops, deriving the sub-topics from the topic values of the words of each current document.
2. The sub-topic mining method according to claim 1, characterized in that step 2) includes the substeps:
21) choosing a document;
22) choosing a word from the current document;
23) for the current word, based on the current topic value assignments of the other words, excluding the topic information of this word itself, computing the probability that the word comes from each theme, the themes including the K sub-topics and 1 background module;
24) performing Gibbs sampling according to the computed probabilities, reassigning a topic value to the current word according to the sampling result, and then returning to step 22) to process the next word, until the current document has been processed;
25) if the current document has been processed, returning to step 21) and starting to process the next document, until all documents have been processed.
3. The sub-topic mining method according to claim 2, characterized in that in step 23), the probability value that the current word belongs to each sub-topic is calculated according to the following equation:
the probability value that the current word belongs to the background module is calculated according to the following equation:
wherein z_i denotes the topic value of the current word, k denotes the topic value of a sub-topic, b denotes the topic value of the background module, and the subscript i denotes the index of the current word; z_{-i} denotes the vector of topic values of the other words excluding the word with index i, and w denotes the vector of words; n_{m,-i}^{(k)} denotes the number of words whose topic is k in the set of other words in document m excluding word i, and n_{k,-i}^{(t)} denotes the number of occurrences of word t whose topic is k in the set of other words in the corpus excluding word i; α denotes the topic hyperparameter, β denotes the hyperparameter of the topic-word distribution, φ_t denotes the probability corresponding to word t in the word distribution vector of the background module, and λ denotes the weight regulating factor between the background module and the topic module.
4. The sub-topic mining method according to claim 1, characterized in that in step 3), the condition for stopping the iteration is that the number of iterations reaches a preset value.
5. The sub-topic mining method according to claim 1, characterized in that the following step is also performed between step 2) and step 3):
30) according to the current topic value of each word in the corpus, after excluding the words belonging to the background module, computing a theme-word distribution matrix;
in step 3), the condition for stopping the iteration is: comparing the current theme-word distribution matrix with the theme-word distribution matrix computed in the previous round of iteration; if the change is less than a preset threshold, the condition for stopping the iteration is considered met, otherwise the condition for stopping the iteration is considered not met.
6. The sub-topic mining method according to claim 5, characterized in that in step 3), when the condition for stopping the iteration is met, the current theme-word distribution matrix is output, and the K sub-topics and their corresponding keywords are obtained from this matrix.
7. The sub-topic mining method according to claim 1, characterized in that in step 1), the initialization is: for every word of every document, a topic value is assigned to it in a uniformly random manner among the K+1 options corresponding to the K sub-topics and the 1 background module.
8. The sub-topic mining method according to claim 1, characterized in that in step 2), the word distribution vector of the background module is obtained by counting the number of occurrences of each word of the global corpus.
9. The sub-topic mining method according to claim 5, characterized in that in step 30), for any word, when the probability that the word belongs to the background module exceeds a preset threshold, the word is deemed to belong to the background module and is excluded when the theme-word distribution matrix is computed.
CN201611024146.5A 2016-11-17 2016-11-17 Sub-topic mining method Active CN106844416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611024146.5A CN106844416B (en) 2016-11-17 2016-11-17 A kind of sub-topic method for digging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611024146.5A CN106844416B (en) 2016-11-17 2016-11-17 A kind of sub-topic method for digging

Publications (2)

Publication Number Publication Date
CN106844416A true CN106844416A (en) 2017-06-13
CN106844416B CN106844416B (en) 2019-11-29

Family

ID=59145811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611024146.5A Active CN106844416B (en) 2016-11-17 2016-11-17 A kind of sub-topic method for digging

Country Status (1)

Country Link
CN (1) CN106844416B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN108399228A (en) * 2018-02-12 2018-08-14 平安科技(深圳)有限公司 Article sorting technique, device, computer equipment and storage medium
CN109214454A (en) * 2018-08-31 2019-01-15 东北大学 A kind of emotion community classification method towards microblogging
CN109597875A (en) * 2018-11-02 2019-04-09 广东工业大学 A kind of Optimization Solution mode of the Gauss LDA of word-based insertion
CN111026835A (en) * 2019-12-26 2020-04-17 厦门市美亚柏科信息股份有限公司 Chat subject detection method, device and storage medium
CN111222319A (en) * 2019-11-14 2020-06-02 电子科技大学 Document information extraction method based on novel HDP model
CN111460079A (en) * 2020-03-06 2020-07-28 华南理工大学 Topic generation method based on concept information and word weight
CN112015904A (en) * 2019-05-30 2020-12-01 百度(美国)有限责任公司 Method, system, and computer-readable medium for determining latent topics for a corpus of documents
CN112069394A (en) * 2020-08-14 2020-12-11 上海风秩科技有限公司 Text information mining method and device
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN113344107A (en) * 2021-06-25 2021-09-03 清华大学深圳国际研究生院 Theme analysis method and system based on kernel principal component analysis and LDA (latent Dirichlet Allocation analysis)
CN117474703A (en) * 2023-12-26 2024-01-30 武汉荟友网络科技有限公司 Topic intelligent recommendation method based on social network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365978A (en) * 2013-07-01 2013-10-23 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model
CN104504087A (en) * 2014-12-25 2015-04-08 中国科学院电子学研究所 Low-rank decomposition based delicate topic mining method
CN105335375A (en) * 2014-06-20 2016-02-17 华为技术有限公司 Topic mining method and apparatus
CN105354244A (en) * 2015-10-13 2016-02-24 广西师范学院 Time-space LDA model for social network community mining
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365978A (en) * 2013-07-01 2013-10-23 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model
CN105335375A (en) * 2014-06-20 2016-02-17 华为技术有限公司 Topic mining method and apparatus
CN104504087A (en) * 2014-12-25 2015-04-08 中国科学院电子学研究所 Low-rank decomposition based delicate topic mining method
CN105354244A (en) * 2015-10-13 2016-02-24 广西师范学院 Time-space LDA model for social network community mining
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JEY HAN LAU et al.: "On-line Trend Analysis with Topic Models: #twitter trends detection topic model online", Proceedings of COLING 2012: Technical Papers *
张培晶 et al.: "A review of research on LDA-based topic modeling methods for microblog text", 《图书情报工作》 (Library and Information Service) *
李继云 et al.: "CGRMB-LDA: topic mining for implicit microblogs", 《计算机应用》 (Journal of Computer Applications) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391660B (en) * 2017-07-18 2021-05-11 太原理工大学 Induced division method for subtopic division
CN107391660A (en) * 2017-07-18 2017-11-24 太原理工大学 A kind of induction division methods for sub-topic division
CN108399228A (en) * 2018-02-12 2018-08-14 平安科技(深圳)有限公司 Article sorting technique, device, computer equipment and storage medium
CN108399228B (en) * 2018-02-12 2020-11-13 平安科技(深圳)有限公司 Article classification method and device, computer equipment and storage medium
CN109214454A (en) * 2018-08-31 2019-01-15 东北大学 A kind of emotion community classification method towards microblogging
CN109214454B (en) * 2018-08-31 2021-07-06 东北大学 Microblog-oriented emotion community classification method
CN109597875A (en) * 2018-11-02 2019-04-09 广东工业大学 A kind of Optimization Solution mode of the Gauss LDA of word-based insertion
CN109597875B (en) * 2018-11-02 2022-08-23 广东工业大学 Word embedding-based Gaussian LDA optimization solution mode
CN112015904A (en) * 2019-05-30 2020-12-01 百度(美国)有限责任公司 Method, system, and computer-readable medium for determining latent topics for a corpus of documents
CN111222319A (en) * 2019-11-14 2020-06-02 电子科技大学 Document information extraction method based on novel HDP model
CN111222319B (en) * 2019-11-14 2021-09-14 电子科技大学 Document information extraction method based on HDP model
CN111026835B (en) * 2019-12-26 2022-06-10 厦门市美亚柏科信息股份有限公司 Chat subject detection method, device and storage medium
CN111026835A (en) * 2019-12-26 2020-04-17 厦门市美亚柏科信息股份有限公司 Chat subject detection method, device and storage medium
CN111460079A (en) * 2020-03-06 2020-07-28 华南理工大学 Topic generation method based on concept information and word weight
CN111460079B (en) * 2020-03-06 2023-03-28 华南理工大学 Topic generation method based on concept information and word weight
CN112069394B (en) * 2020-08-14 2023-09-29 上海风秩科技有限公司 Text information mining method and device
CN112069394A (en) * 2020-08-14 2020-12-11 上海风秩科技有限公司 Text information mining method and device
CN112836507A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Method for extracting domain text theme
CN112836507B (en) * 2021-01-13 2022-12-09 哈尔滨工程大学 Method for extracting domain text theme
CN113344107A (en) * 2021-06-25 2021-09-03 清华大学深圳国际研究生院 Theme analysis method and system based on kernel principal component analysis and LDA (latent Dirichlet Allocation analysis)
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA
CN117474703A (en) * 2023-12-26 2024-01-30 武汉荟友网络科技有限公司 Topic intelligent recommendation method based on social network
CN117474703B (en) * 2023-12-26 2024-03-26 武汉荟友网络科技有限公司 Topic intelligent recommendation method based on social network

Also Published As

Publication number Publication date
CN106844416B (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN106844416A (en) A kind of sub-topic method for digging
CN107633044B (en) Public opinion knowledge graph construction method based on hot events
Ma et al. An ontology-based text-mining method to cluster proposals for research project selection
Saad Opinion mining on US Airline Twitter data using machine learning techniques
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN103678670A (en) Micro-blog hot word and hot topic mining system and method
Pang et al. A generalized cluster centroid based classifier for text categorization
CN108268470A (en) A kind of comment text classification extracting method based on the cluster that develops
Setyaningsih et al. Categorization of exam questions based on bloom taxonomy using naïve bayes and laplace smoothing
Xiao et al. Patent text classification based on naive Bayesian method
Wotaifi et al. An effective hybrid deep neural network for arabic fake news detection
Gupta et al. Fake news detection using machine learning
Amati et al. Topic modeling by community detection algorithms
CN114461879A (en) Semantic social network multi-view community discovery method based on text feature integration
Waldherr et al. Mining big data with computational methods
Idrus et al. Sentiment analysis of state officials news on online media based on public opinion using naive bayes classifier algorithm and particle swarm optimization
CN111259117B (en) Short text batch matching method and device
Jamal et al. Sentimental analysis based on hybrid approach of latent dirichlet allocation and machine learning for large-scale of imbalanced twitter data
Minab et al. A new sentiment classification method based on hybrid classification in Twitter
CN105447013A (en) News recommendation system
Wotaifi et al. Developed Models Based on Transfer Learning for Improving Fake News Predictions.
Kadhim et al. Combined chi-square with k-means for document clustering
Akyol Clustering hotels and analyzing the importance of their features by machine learning techniques
Errecalde et al. Silhouette+ attraction: A simple and effective method for text clustering
Duan et al. Unbalanced data sentiment classification method based on ensemble learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant