CN105631018A - Article feature extraction method based on topic model - Google Patents


Info

Publication number
CN105631018A
CN105631018A
Authority
CN
China
Prior art keywords
article
word
theme
represent
section
Prior art date
Legal status
Granted
Application number
CN201511016955.7A
Other languages
Chinese (zh)
Other versions
CN105631018B (en)
Inventor
沈嘉明
宋振宇
李世韬
毛宇宁
谈兆炜
朱鸿儒
王乐群
郭运奇
王彪
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201511016955.7A priority Critical patent/CN105631018B/en
Publication of CN105631018A publication Critical patent/CN105631018A/en
Application granted granted Critical
Publication of CN105631018B publication Critical patent/CN105631018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Abstract

The invention provides an article feature extraction method based on a topic model. The method comprises the following steps: building an article citation network from a raw corpus; constructing the generative model of the topic model and its joint parameter expression; constructing the inference process of the topic model from the generative model; sampling the articles of the new corpus according to the inference process; and extracting article parameters from the sampling results. By extending the traditional topic model with the article citation network, the method extracts more accurate article features.

Description

Article feature extraction method based on a topic model
Technical field
The present invention relates to the field of article feature extraction, and more specifically to an article feature extraction method based on a topic model, in particular a topic-model-based feature extraction method that incorporates a curated citation network.
Background technology
Scientific research is a strategic support for improving social productivity and overall national strength, and countries around the world attach great importance to investment in it. China places science and technology research and development at the core of its national development strategy, and government expenditure on scientific research has grown steadily. In 2012, China's research and development investment (covering both industry and academia) exceeded one trillion yuan, reaching 1,029.84 billion yuan, a level comparable to moderately developed countries.
Academic articles are among the most direct outputs of scientific research. According to statistics, from 2004 to 2014 Chinese researchers published a total of 1.3698 million scientific and technical articles worldwide, ranking second in the world; these articles were cited a total of 10.3701 million times, ranking fourth. Research practice shows that academic articles are an extremely important information resource for carrying out scientific research or pursuing further investigation. However, faced with the vast documentary resources of the information age, retrieving the academic resources one needs quickly and accurately is a genuinely important and challenging task for researchers.
To meet the demand for academic search and recommendation, Google launched a beta version of its academic search engine in 2004, providing free academic literature services to researchers worldwide; in 2006, Microsoft launched the academic search engine Microsoft Academic Search. Although these general-purpose academic search engines rely on the search technology of the commercial search companies behind them, their results are in fact unsatisfactory. For a user's query they still return results as a flat list of articles. They emphasize the accuracy of retrieval, matching articles exactly against the user's keywords, but pay no attention to an article's position within its field or to the development trend of its topic. For researchers, however, obtaining the frontier achievements and seminal contributions of a discipline is often more important than exact title matching. For example, users who are new to a research field are often unsure what kind of literature they need; their search keywords are usually rough themes or topics. With the general-purpose engines above, such users often cannot quickly and effectively discover the frontier achievements and important contributions of the relevant discipline, and the results they obtain are unsatisfactory.
Clearly, building a highly effective academic search and recommendation system is of considerable significance: it helps researchers obtain the resources they need, keep abreast of developments in their discipline, and improve their own research capability, thereby strengthening the nation's research strength. For precisely this reason, academic search and recommendation systems have gradually attracted attention in recent years. Since 2000, the number of articles on article search and recommendation systems has risen year by year; according to incomplete statistics, more than 30 related articles were published in 2013 alone. Nevertheless, research on academic search and recommendation systems is still at an early stage.
In building an academic search system, an important task is to extract article features from large-scale article datasets and citation network data, such as the topic of each article, its academic contribution, the strength of citation relationships between articles, and the feature words corresponding to each topic.
To date, the main research directions in article feature extraction, both domestically and abroad, include: analyzing the semantics of an article to recommend other articles with similar topics; and modeling and analyzing the article citation network to derive the importance of articles.
At present, article feature extraction methods based on topic analysis include: using a topic model (such as the LDA algorithm) to analyze article topics and introducing topic similarity into the collaborative filtering of a recommendation system; combining a topic model with a language model to find articles on similar topics; and modeling the topics of word groups based on the LDA algorithm. Article feature extraction methods based on the citation network include: using the HITS algorithm on a bipartite graph of articles and terms to compute article authority values; using the citation network to compute and recommend author authority values; and using the PageRank algorithm, combined with journal quality and the citation network, to compute PageRank values for articles.
However, these research results either ignore the model's scalability to large article databases, or focus only on citation network information while neglecting the extraction of article text information, or conversely consider only the text information and neglect the citation network. The practical value of their final results is therefore limited.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an article feature extraction method based on a topic model.
The article feature extraction method based on a topic model provided by the invention comprises:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
Preferably, Step A comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
Preferably, Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
Preferably, Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}.
Preferably, Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D3;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
Preferably, Step D further comprises:
For a new article d_new added to the new corpus, compute its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new, specifically comprising:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D403;
Step D403: compute the topic distribution θ_new of d_new, the proportion λ_new of words in d_new with s_{m,n} = 1, and the distribution δ_new of d_new's citations.
Preferably, Step E comprises:
obtaining the relevant parameters with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
In a preferred technical scheme, the effective keywords of the corpus are extracted and treated as abstract objects; the number of topics to extract, the strength of the topic distribution, and the strength of the citation distribution can be determined by user requirements or preset by the system. It is assumed that the topic source of each word in each article is random: the topic of a word in a given article is produced either by the article's own topic distribution or by the topic distribution of an article it cites.
The probabilistic model of text generation makes the following assumptions:
(1) In each article, the topic of each word obeys a multinomial distribution, whose prior is a Dirichlet distribution.
(2) Under each topic, the words obey a multinomial distribution, whose prior is a Dirichlet distribution.
(3) In each article, the citation source of each word obeys a multinomial distribution, whose prior is a Dirichlet distribution.
(4) In each article, the originality of each word obeys a binomial distribution, whose prior is a Beta distribution.
For these model assumptions, the parameters of the prior distributions are determined by the average article length, the number of topics, and the average number of articles cited per article.
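Written out in the notation of Step B, these four assumptions amount to the following generative specification (a compact restatement of the above, not an extension of it):

$$\varphi_k \sim \mathrm{Dir}(\vec{\beta}),\quad k=1,\dots,K$$
$$\theta_m \sim \mathrm{Dir}(\vec{\alpha}),\qquad \delta_m \sim \mathrm{Dir}(\vec{\eta}),\qquad \lambda_m \sim \mathrm{Beta}(\alpha_{\lambda_c},\alpha_{\lambda_n}),\quad m=1,\dots,M$$
$$s_{m,n} \sim \mathrm{Bernoulli}(\lambda_m)$$
$$s_{m,n}=1:\quad c_{m,n}\sim\mathrm{Mult}(\delta_m),\ \ z_{m,n}\sim\mathrm{Mult}(\theta_{c_{m,n}}),\ \ w_{m,n}\sim\mathrm{Mult}(\varphi_{z_{m,n}})$$
$$s_{m,n}=0:\quad z_{m,n}\sim\mathrm{Mult}(\theta_m),\ \ w_{m,n}\sim\mathrm{Mult}(\varphi_{z_{m,n}})$$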
Compared with the prior art, the present invention has the following beneficial effects:
1. Starting from the above problems in the prior art, the present invention approaches article feature extraction from a new perspective: it improves the accuracy of article feature extraction and extracts information from articles that traditional feature extraction systems do not consider.
2. The present invention extends the traditional topic model with the information of the citation network, so that the model extracts article features comprehensively from both kinds of information. It is applicable not only to databases with large data volumes but also to dynamically growing databases, and it can extract information that conventional topic models cannot, such as the strength of citation relationships between articles and the originality index of an article.
3. The present invention exploits the sparsity of the article topic distributions, of the word distributions within topics, and of the article citation distributions to reduce sampling complexity.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent by reading the following detailed description of non-limiting embodiments with reference to the drawings:
Fig. 1 is a sample of original article data.
Fig. 2 shows the generative process of the novel topic model.
Fig. 3 is the flow diagram of the method of the present invention.
Embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention but do not limit it in any form. It should be noted that, to a person of ordinary skill in the art, several changes and improvements can be made without departing from the concept of the present invention; these all fall within the protection scope of the present invention.
The present invention extracts article features by a novel method: it uses the article citation network to extend the traditional topic model, so that the topic model can exploit both the topic structure and the citation network at the same time and thereby extract more accurate article features. The main steps of the present invention comprise:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
The article feature extraction method designed by the present invention involves five core components: an automated scientific procedure for curating the citation network; the derivation of the generative model of the novel citation-network-aware topic model and of its joint expression; the inference process of the novel topic model; the derivation of its sampling algorithm; and its parameter estimation. The method provided by the invention comprises the following steps.
Regarding Step A: based on a large-sample original corpus, the citation network of articles (such as papers) is generated automatically and written to a file. The corpus contains two parts of information: one part concerns the articles themselves, including title, authors, abstract, and so on; the other part is the citation relationships between articles, for example that article A cites article B and article A cites article C.
Academic data on the internet is vast and grows by millions of items every year. In the present invention, the original corpus follows existing XML and JSON formats; for each article in the original corpus, the article title, abstract, and references are extracted, then an initial article set is chosen and, following the citation relationships of the academic articles, the largest connected component is obtained and exported as the new corpus.
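As an illustration of this extraction step, the following is a minimal loader sketch. It assumes a JSON-lines storage with fields `id`, `title`, `abstract`, and `references`; these field names are placeholders, since the actual storage format is the one specified in Table 1 and Fig. 1.

```python
import json

def load_corpus(path):
    """Parse a JSON-lines corpus file into article records.

    Field names here are illustrative placeholders; the real format
    follows the storage specification of Table 1 / Fig. 1.
    """
    articles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            articles[rec["id"]] = {
                "title": rec.get("title", ""),
                "abstract": rec.get("abstract", ""),
                # ids of the articles cited by this one
                "references": rec.get("references", []),
            }
    return articles
```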
The format of an existing original article corpus is shown in Table 1 and Fig. 1.
Table 1. Original article data storage format specification
In Step A, the construction of the citation network from the original corpus comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network.
In Step A, the setting of the initial article set and the derivation of the new corpus from the citation network comprise automatically obtaining the largest connected component of the citation network to form the new corpus, specifically:
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
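The following sketch illustrates Steps A1–A6, assuming the corpus has been loaded into a dictionary as in the loader above. The fixed-point loop mirrors Step A5; a citation relationship is taken here to hold in either direction (v cites a vertex of V, or is cited by one), which is an assumption of this sketch.

```python
def build_citation_network(articles):
    """Steps A1-A3: one vertex per article, one directed edge per citation."""
    V = set(articles)
    E = {(u, v) for u in articles
         for v in articles[u]["references"] if v in articles}
    return V, E

def expand_from_seed(articles, V0):
    """Steps A4-A6: grow the initial vertex set until no article outside V
    has a citation relationship with V, i.e. until V and E stop changing."""
    V = set(V0)
    changed = True
    while changed:
        changed = False
        for v, rec in articles.items():
            if v in V:
                continue
            cites_V = any(r in V for r in rec["references"])
            cited_by_V = any(v in articles[u]["references"]
                             for u in V if u in articles)
            if cites_V or cited_by_V:
                V.add(v)
                changed = True
    E = {(u, w) for u in V if u in articles
         for w in articles[u]["references"] if w in V}
    return V, E  # the articles indexed by V form the new corpus
```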
Regarding Step B: a traditional topic model uses the word frequency characteristics of each article as its topic features, whereas the topic model adopted in the present invention can also capture the relationships between articles, i.e. the article citation network. The topic model comprises two cores: the generative model (detailed in Step B) and the inference process (detailed in Step C). The generative model is the model that the article generation process is assumed to obey when the parameters are known; the graphical model corresponding to the generative model of an article is shown in Fig. 2.
Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m. Those skilled in the art will understand that the Beta distribution itself requires two hyperparameters, and that these two hyperparameters can be interchanged.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
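The generative process of Steps B1–B3 can be transcribed directly; the sketch below does so with numpy. The corpus sizes, hyperparameter values, and citation lists are placeholders — in the method itself the citations come from the network of Step A.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, V = 20, 100, 5000              # topics, articles, vocabulary (placeholders)
alpha, beta, eta = 0.1, 0.01, 0.1    # Dirichlet hyperparameters
a_lc, a_ln = 1.0, 1.0                # Beta hyperparameters (alpha_lambda_c, alpha_lambda_n)

phi = rng.dirichlet([beta] * V, size=K)      # Step B1: topic-word distributions
theta = rng.dirichlet([alpha] * K, size=M)   # Step B2: article-topic distributions
refs = [rng.choice(M, size=5, replace=False) for _ in range(M)]   # citation lists (given)
delta = [rng.dirichlet([eta] * len(r)) for r in refs]             # citation strengths
lam = rng.beta(a_lc, a_ln, size=M)           # originality indices

def generate_word(m):
    """Step B3: draw (s, c, z, w) for one word of article m."""
    s = int(rng.random() < lam[m])           # s = 1: word inherited via a citation
    if s == 1:
        i = rng.choice(len(refs[m]), p=delta[m])
        c = refs[m][i]                       # cited article
        z = rng.choice(K, p=theta[c])        # topic drawn from the cited article
    else:
        c = None
        z = rng.choice(K, p=theta[m])        # topic drawn from the article itself
    w = rng.choice(V, p=phi[z])
    return s, c, z, w
```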
Regarding Step C: the inference process estimates the parameters of the generative model. In practice, we know the words of the articles and wish to derive the hidden parameters from them, which must be done by statistical inference. For the novel topic model proposed here, conventional optimization methods cannot solve the maximum likelihood estimation problem, so we adopt a method known as Gibbs sampling for parameter estimation.
Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}. A subscript on a hyperparameter denotes the component of the corresponding prior distribution parameter.
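An illustrative sketch of one sampling site follows. For brevity it redraws (s_{m,n}, c_{m,n}, z_{m,n}) as a single block from their joint conditional, whereas the formulas above resample each variable in turn; both target the same posterior. The count-array layout (`n_kw`, `n_dk0`, `n_dk1`, `n_mc`, `N0`, `N1`) and the hyperparameter bundle `hp` are assumptions of this sketch.

```python
import numpy as np

def resample_word(m, n, w, state, cnt, refs, hp, rng):
    """One collapsed-Gibbs update. `cnt` bundles the counts: n_kw (K x V
    topic-word), n_dk0 / n_dk1 (per-article topic counts of original /
    inherited words), n_mc (per-article citation counts), N0 / N1
    (per-article totals of original / inherited words)."""
    s, c, z = state[(m, n)]
    d = m if s == 0 else c
    # Remove the current assignment: this realizes the "-1" terms above.
    cnt.n_kw[z, w] -= 1
    (cnt.n_dk0 if s == 0 else cnt.n_dk1)[d, z] -= 1
    if s == 1:
        cnt.n_mc[m][c] -= 1
        cnt.N1[m] -= 1
    else:
        cnt.N0[m] -= 1

    K, Lm = cnt.n_kw.shape[0], len(refs[m])
    phi_w = (cnt.n_kw[:, w] + hp.beta) / (cnt.n_kw.sum(axis=1) + hp.V * hp.beta)
    def theta(doc):                      # smoothed, normalized topic mix of doc
        nd = cnt.n_dk0[doc] + cnt.n_dk1[doc] + hp.alpha
        return nd / nd.sum()

    # s = 0: the topic comes from the article's own distribution.
    p0 = (cnt.N0[m] + hp.a_ln) * theta(m) * phi_w
    # s = 1: choose a cited article, then a topic from its distribution.
    p1 = (np.stack([(cnt.N1[m] + hp.a_lc)
                    * (cnt.n_mc[m][c2] + hp.eta) / (cnt.N1[m] + Lm * hp.eta)
                    * theta(c2) * phi_w
                    for c2 in refs[m]])
          if Lm else np.zeros((0, K)))

    probs = np.concatenate([p0, p1.ravel()])
    idx = rng.choice(probs.size, p=probs / probs.sum())
    if idx < K:
        s, c, z = 0, None, int(idx)
    else:
        s, c, z = 1, refs[m][(idx - K) // K], int((idx - K) % K)

    # Record the new assignment back into the counts.
    d = m if s == 0 else c
    cnt.n_kw[z, w] += 1
    (cnt.n_dk0 if s == 0 else cnt.n_dk1)[d, z] += 1
    if s == 1:
        cnt.n_mc[m][c] += 1
        cnt.N1[m] += 1
    else:
        cnt.N0[m] += 1
    state[(m, n)] = (s, c, z)
```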
Regarding Step D: according to the inference process of the novel topic model, a sampling algorithm is designed and the article database is sampled; the complete inference process can then be written out.
Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is repeated until the Gibbs sampler converges;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
For a new article d_new (i.e. an article newly added to the new corpus), its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new are computed as follows:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is repeated until the Gibbs sampler converges;
Step D403: compute the topic distribution of the current article d_new — this is θ_new; compute the proportion of words in d_new with s_{m,n} = 1 — this is λ_new; and compute the distribution of d_new's citations — this is δ_new.
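A sketch of this fold-in procedure is given below, reusing `resample_word` from the sketch above. It assumes the corpus-wide counts in `cnt` have converged and that a fresh, zeroed row `m_new` has been appended to the per-article count arrays; both are assumptions of this illustration.

```python
def infer_new_article(m_new, words, refs_new, cnt, hp, rng, sweeps=200):
    """Fold-in inference for one new article (Steps D401-D403)."""
    refs = {m_new: refs_new}
    state = {}
    K = cnt.n_kw.shape[0]
    # D401: random initial assignment, registered in the counts.
    for n, w in enumerate(words):
        s = int(rng.random() < 0.5) if refs_new else 0
        c = refs_new[rng.integers(len(refs_new))] if s == 1 else None
        z = int(rng.integers(K))
        state[(m_new, n)] = (s, c, z)
        cnt.n_kw[z, w] += 1
        (cnt.n_dk1 if s == 1 else cnt.n_dk0)[m_new, z] += 1
        if s == 1:
            cnt.n_mc[m_new][c] += 1
            cnt.N1[m_new] += 1
        else:
            cnt.N0[m_new] += 1
    # D402: repeated Gibbs sweeps over the new article only.
    for _ in range(sweeps):
        for n, w in enumerate(words):
            resample_word(m_new, n, w, state, cnt, refs, hp, rng)
    # D403: lambda_new; theta_new and delta_new follow from the Step E
    # formulas applied to row m_new of the counts in the same way.
    lam_new = (cnt.N1[m_new] + hp.a_lc) / (
        cnt.N0[m_new] + cnt.N1[m_new] + hp.a_ln + hp.a_lc)
    return lam_new, state
```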
Regarding Step E: after the topic model has converged (preferably, for example, by running the Gibbs sampling algorithm of Step D in a loop; after a sufficient number of iterations the model parameters can be considered converged), the relevant parameters are obtained with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
Here a dot in place of an index denotes summation of the corresponding quantity over that index, as in n_k^{(\cdot)} = \sum_{t=1}^{V} n_k^{(t)}.
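Transcribed into code against the count layout assumed in the earlier sketches, the read-out might look as follows; the topic-word distribution φ is computed by the standard analogous formula, which is an assumption here since Step E states only the three formulas above.

```python
def extract_parameters(m, cnt, refs, hp):
    """Step E read-out for article m, mirroring the three formulas above."""
    K = cnt.n_kw.shape[0]
    n_mk = cnt.n_dk0[m] + cnt.n_dk1[m]
    theta_m = (n_mk + hp.alpha) / (n_mk.sum() + K * hp.alpha)       # topic mix
    lam_m = (cnt.N1[m] + hp.a_lc) / (
        cnt.N0[m] + hp.a_ln + cnt.N1[m] + hp.a_lc)                  # originality index
    Lm = len(refs[m])
    delta_m = {c: (cnt.n_mc[m][c] + hp.eta) / (cnt.N1[m] + Lm * hp.eta)
               for c in refs[m]}                                    # citation strengths
    # Topic-word distribution (standard read-out, assumed here):
    phi = (cnt.n_kw + hp.beta) / (
        cnt.n_kw.sum(axis=1, keepdims=True) + hp.V * hp.beta)
    return theta_m, lam_m, delta_m, phi
```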
The specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the present invention. Where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with one another arbitrarily.

Claims (7)

1. An article feature extraction method based on a topic model, characterized in that it comprises:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
2. The article feature extraction method based on a topic model according to claim 1, characterized in that Step A comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
3. The article feature extraction method based on a topic model according to claim 1, characterized in that Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
4. The article feature extraction method based on a topic model according to claim 3, characterized in that Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}.
5. The article feature extraction method based on a topic model according to claim 4, characterized in that Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D3;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
6. The article feature extraction method based on a topic model according to claim 5, characterized in that Step D further comprises:
for a new article d_new added to the new corpus, computing its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new, specifically comprising:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D403;
Step D403: compute the topic distribution θ_new of d_new, the proportion λ_new of words in d_new with s_{m,n} = 1, and the distribution δ_new of d_new's citations.
7. The article feature extraction method based on a topic model according to claim 4, characterized in that Step E comprises:
obtaining the relevant parameters with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
CN201511016955.7A 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model Active CN105631018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Publications (2)

Publication Number Publication Date
CN105631018A true CN105631018A (en) 2016-06-01
CN105631018B CN105631018B (en) 2018-12-18

Family

ID=56045951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511016955.7A Active CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Country Status (1)

Country Link
CN (1) CN105631018B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147A (en) * 2016-08-29 2017-02-01 上海交通大学 Method for constructing and visualizing heterogeneous thematic network based on text network
CN106709520A (en) * 2016-12-23 2017-05-24 浙江大学 Topic model based medical record classification method
CN107515854A (en) * 2017-07-27 2017-12-26 上海交通大学 The detection method of sequential community and topic based on cum rights sequential text network
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster
CN109299257A (en) * 2018-09-18 2019-02-01 杭州科以才成科技有限公司 A kind of English Periodicals recommended method based on LSTM and knowledge mapping
CN109597879A (en) * 2018-11-30 2019-04-09 京华信息科技股份有限公司 One kind being based on the business conduct Relation extraction method and device of " quotation relationship " data
CN110555198A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 method, apparatus, device and computer-readable storage medium for generating article
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654A (en) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余传明 et al., "Review hotspot mining based on the LDA model: principles and implementation", Information Systems *
刘俊 et al., "Keyword extraction based on topic features", Application Research of Computers *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147B (en) * 2016-08-29 2020-09-15 上海交通大学 Heterogeneous topic network construction and visualization method based on text network
CN106372147A (en) * 2016-08-29 2017-02-01 上海交通大学 Method for constructing and visualizing heterogeneous thematic network based on text network
CN106709520A (en) * 2016-12-23 2017-05-24 浙江大学 Topic model based medical record classification method
CN107515854A (en) * 2017-07-27 2017-12-26 上海交通大学 The detection method of sequential community and topic based on cum rights sequential text network
CN107515854B (en) * 2017-07-27 2021-06-04 上海交通大学 Time sequence community and topic detection method based on right-carrying time sequence text network
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster
CN108549625B (en) * 2018-02-28 2020-11-17 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN110555198A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 method, apparatus, device and computer-readable storage medium for generating article
CN110555198B (en) * 2018-05-31 2023-05-23 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating articles
CN109299257B (en) * 2018-09-18 2020-09-15 杭州科以才成科技有限公司 English periodical recommendation method based on LSTM and knowledge graph
CN109299257A (en) * 2018-09-18 2019-02-01 杭州科以才成科技有限公司 A kind of English Periodicals recommended method based on LSTM and knowledge mapping
CN109597879A (en) * 2018-11-30 2019-04-09 京华信息科技股份有限公司 One kind being based on the business conduct Relation extraction method and device of " quotation relationship " data
CN109597879B (en) * 2018-11-30 2022-03-29 京华信息科技股份有限公司 Service behavior relation extraction method and device based on 'citation relation' data
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654A (en) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN105631018B (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN105631018A (en) Article feature extraction method based on topic model
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
Olczyk A systematic retrieval of international competitiveness literature: a bibliometric study
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN109522420B (en) Method and system for acquiring learning demand
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN105843795A (en) Topic model based document keyword extraction method and system
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN105912645B (en) A kind of intelligent answer method and device
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
CN104636424A (en) Method for building literature review framework based on atlas analysis
CN105608075A (en) Related knowledge point acquisition method and system
Vel Pre-processing techniques of text mining using computational linguistics and python libraries
CN107015965A (en) A kind of Chinese text sentiment analysis device and method
CN106407482A (en) Multi-feature fusion-based online academic report classification method
Kuntarto et al. Dwipa ontology III: Implementation of ontology method enrichment on tourism domain
CN113610626A (en) Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN106202036A (en) A kind of verb Word sense disambiguation method based on interdependent constraint and knowledge and device
CN106021413A (en) Theme model based self-extendable type feature selecting method and system
DE102018007024A1 (en) DOCUMENT BROKEN BY GRAMMATIC UNITS
CN105468780A (en) Normalization method and device of product name entity in microblog text
CN103019924B (en) The intelligent evaluating system of input method and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant