CN105631018B - Article Feature Extraction Method based on topic model - Google Patents

Article Feature Extraction Method based on topic model

Info

Publication number
CN105631018B
CN105631018B CN201511016955.7A
Authority
CN
China
Prior art keywords
article
word
theme
distribution
articles
Prior art date
Legal status
Active
Application number
CN201511016955.7A
Other languages
Chinese (zh)
Other versions
CN105631018A (en)
Inventor
沈嘉明
宋振宇
李世韬
毛宇宁
谈兆炜
朱鸿儒
王乐群
郭运奇
王彪
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201511016955.7A priority Critical patent/CN105631018B/en
Publication of CN105631018A publication Critical patent/CN105631018A/en
Application granted granted Critical
Publication of CN105631018B publication Critical patent/CN105631018B/en
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides an article feature extraction method based on a topic model, comprising: building a citation network of articles from an original corpus; constructing the generative model and joint parameter expression of the topic model; constructing the inference process of the topic model from the generative model; sampling the articles of the new corpus; and extracting article parameters from the results of sampling the articles. The present invention uses the article citation network to extend the traditional topic model, thereby extracting more accurate article features.

Description

Article Feature Extraction Method based on topic model
Technical field
The present invention relates to the technical field of article feature extraction, and in particular to an article feature extraction method based on a topic model, especially a topic-model-based feature extraction method that integrates the curation of article citation networks.
Background art
Scientific research is a strategic support for improving social productivity and overall national strength, and countries around the world attach great importance to investment in it. China places science and technology R&D at the core of its national development strategy, and state fiscal expenditure on scientific research has grown steadily. In 2012, China's R&D investment (including industry and academia) exceeded one trillion yuan, reaching 1,029.84 billion yuan, the level of a moderately developed country.
One of the most direct outputs of scientific research is the academic article. According to statistics, from 2004 to 2014 Chinese researchers published a total of 1.3698 million scientific and technical articles internationally, ranking second in the world; these articles were cited a total of 10.3701 million times, ranking fourth in the world. Research practice shows that academic articles are essential information resources for researchers carrying out research or pursuing further in-depth study. Facing the vast sea of literature of the information age, however, quickly and accurately retrieving the academic resources one needs is an extremely important and challenging task for any researcher.
To meet the demand for academic search and recommendation, Google launched a beta version of its academic search engine in 2004, providing free academic literature services to researchers worldwide; in 2006, Microsoft launched the academic search engine Microsoft Academic Search. Although these general-purpose academic search engines rely on the search technology of the commercial search companies behind them, their search results are in fact unsatisfactory. For a user's query, these existing academic search engines still return results in the form of an article list. They emphasize the accuracy of search results, i.e., exactly matching retrieved articles against the keywords of the user's query, without attending to an article's position within its field or to the development trend of the article's topic. For researchers, however, what matters more than exact title matching is often obtaining the frontier achievements and seminal contributions of the discipline concerned. For example, when users who have just entered a research field perform a search, they are not yet sure what kind of literature they need, and their search keywords are usually only a rough theme or topic; with the above general-purpose academic search engines, such users often cannot quickly and effectively learn of the frontier achievements and seminal articles in the relevant discipline, and the results obtained are unsatisfactory.
Clearly, building a highly effective academic search and recommendation system is of considerable significance for helping researchers obtain the resources they need, keep abreast of disciplinary developments in time, and improve their own research capability, thereby enhancing the research strength of the nation. For this reason, academic search and recommendation systems have gradually attracted attention in recent years. Since 2000, the number of publications on article search and recommendation systems has increased year by year; according to incomplete statistics, in 2013 alone more than 30 related articles appeared. Research on academic search and recommendation systems is nevertheless still at an early stage.
In the construction of an academic search system, one important task is to extract the features of articles from large-scale article datasets and citation network datasets, such as the theme of each article, the academic contribution of each article, the strength of citation relationships between articles, and the feature words corresponding to each theme.
To date, the main research directions in article feature extraction at home and abroad include: analyzing the semantics of articles to recommend other articles whose themes are similar to a given article; and modeling and analyzing the article citation network to obtain the importance of articles.
Currently, article feature extraction methods based on theme analysis include: analyzing article themes with a topic model (such as the LDA algorithm) and introducing topic similarity into the collaborative filtering of a recommender system; finding articles on similar topics by combining topic models with language models; and modeling the themes of word groups based on the LDA algorithm. Article feature extraction methods based on the article citation network include: computing authority values of articles with the HITS algorithm on a bipartite graph built from articles and terms; computing authority values of authors from the article citation network and making recommendations accordingly; and computing PageRank values of articles with the PageRank algorithm, combining journal quality with the citation network.
These research results, however, either fail to consider the applicability of the model to article databases of large sample size, or focus solely on the information of the citation network while ignoring the extraction of article text information, or consider only the text information of the article database while ignoring the information of the citation network. The use value of the final results is therefore not high.
Summary of the invention
In view of the above defects in the prior art, the object of the present invention is to provide an article feature extraction method based on a topic model.
An article feature extraction method based on a topic model provided according to the present invention comprises:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles.
Preferably, the step A includes:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
Preferably, the step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A.
Preferably, the step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
Preferably, the step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is executed repeatedly until the Gibbs sampling converges, and then step D3 is executed;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme.
Preferably, the step D further includes:
for a new article d_new added to the new corpus, counting the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article d_new, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is executed repeatedly until the Gibbs sampling converges, and then step D403 is executed;
Step D403: count the theme distribution θ_new of the current article d_new, count the proportion λ_new of words with s_{m,n}=1 in the article d_new, and count the occurrence distribution δ_new of the article's citations.
Preferably, the step E includes:
obtaining the relevant parameters using the following formulas:

θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
In a preferred technical solution: effective keywords are extracted from the corpus, and each effective keyword is treated as an abstract object; the number of themes to extract per article, the intensity of the theme distribution, and the intensity of the article citation distribution may be determined by user demand or preset by the system. It is assumed that the theme source of each word in every article is random: the theme is generated either by the theme distribution of the article itself or by the theme distribution of some article cited by this article.
The probabilistic model of text generation includes the following assumptions:
(1) the theme of each word in every article obeys a multinomial distribution, whose prior obeys a Dirichlet distribution;
(2) the words under each theme obey a multinomial distribution, whose prior obeys a Dirichlet distribution;
(3) the citation source of each word in every article obeys a multinomial distribution, whose prior obeys a Dirichlet distribution;
(4) the originality of each word in every article obeys a binomial distribution, whose prior obeys a Beta distribution;
wherein, regarding the assumptions of the probabilistic model, the parameters of the prior distributions are determined by the average article length, the number of themes, and the average number of articles cited per article.
Compared with the prior art, the present invention has the following beneficial effects:
1. Based on the above problems in the prior art, the present invention approaches article feature extraction from a new perspective; it can improve the accuracy of article feature extraction and can extract from articles information that traditional feature extraction systems do not consider.
2. The present invention extends the traditional topic model with the information of the citation network, allowing the model to extract article features by combining both kinds of information; it is applicable not only when the database is large, but also to dynamically growing databases, and it can extract information that previous topic models could not, such as the strength of citation relationships between articles and the originality index of an article.
3. The present invention exploits the sparsity of the article theme distributions, the sparsity of the word distributions within themes, and the sparsity of the article citation distributions to reduce the sampling complexity.
Detailed description of the invention
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a sample of the original article data.
Fig. 2 shows the generative process of the novel topic model.
Fig. 3 is a flow chart of the method of the present invention.
Specific embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be pointed out that persons of ordinary skill in the art can make several changes and improvements without departing from the inventive concept; these all belong to the protection scope of the present invention.
The present invention extracts article features by an original method. The present invention uses the article citation network to extend the traditional topic model, allowing the topic model to extract article features using the topic model and the citation network simultaneously, and thereby to extract more accurate article features. The main steps of the invention include:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles.
The article feature extraction method designed by the present invention involves five core components: the automated scientific procedure for curating the citation network; the generative model of the novel topic model combining the citation network; the derivation of its joint parameter expression; the derivation of the inference process and sampling algorithm of the novel topic model; and the parameter estimation of the novel topic model. The method provided by the invention includes the following steps:
Regarding step A: from the large-sample original corpus, the citation network of articles (e.g., papers) is generated automatically and output to a file. The corpus contains two kinds of information: one part is information about the articles themselves, including title, authors, abstract, and so on; the other part is the citation relationships between articles, for example article A cites article B, article A cites article C.
The academic data on the Internet is vast and grows by millions of items every year. Therefore, in the present invention, based on the existing original corpus in XML and JSON formats, the title, abstract and bibliography of each article in the original corpus are extracted; an initial article set is then established, and according to the citation relationships of the academic articles the maximal connected component is obtained and exported as the new corpus.
The format of the existing original article corpus is shown in Table 1 and Fig. 1.
Table 1. Storage format specification of the original article data.
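Since Table 1 is not reproduced in this text, the following is a minimal sketch of loading such a corpus, assuming a JSON-lines layout in which each record carries the fields "id", "title", "abstract" and "references"; these field names are assumptions for illustration only, standing in for the actual layout specified by Table 1.

    import json

    def load_corpus(path):
        """Load articles into a dict keyed by article id (assumed JSON-lines layout)."""
        articles = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                articles[rec["id"]] = {
                    "title": rec.get("title", ""),
                    "abstract": rec.get("abstract", ""),
                    "references": rec.get("references", []),  # ids of cited articles
                }
        return articles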
In the step A, the step of building the citation network of articles from the original corpus comprises:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network.
In the step A, the step of setting an initial article set and obtaining a new corpus according to the citation network comprises: according to the citation network, automatically obtaining the maximal connected component to obtain the new corpus; specifically:
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
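By way of illustration, the following is a minimal sketch of steps A1 through A6 in Python, under the corpus layout assumed above: it builds the citation graph G = (V, E) over all articles, then grows an initial seed set V_0 by repeatedly absorbing any article that shares a citation relationship with the current vertex set, until V and E no longer change.

    def build_citation_graph(articles):
        # Steps A1-A3: every article becomes a node; every citation becomes an edge.
        V = set(articles)
        E = {(u, v) for u, rec in articles.items() for v in rec["references"]}
        return V, E

    def expand_from_seed(articles, V0):
        # Steps A4-A5: start from the initial known set and absorb, to a fixed
        # point, any article that cites or is cited by an article already in V.
        V = set(V0)
        changed = True
        while changed:
            changed = False
            for u, rec in articles.items():
                if u in V:
                    continue
                cites_V = any(v in V for v in rec["references"])
                cited_by_V = any(u in articles[w]["references"] for w in V)
                if cites_V or cited_by_V:
                    V.add(u)
                    changed = True
        # Step A6: the citation edges induced on V form the new corpus's network.
        E = {(u, v) for u in V for v in articles[u]["references"] if v in V}
        return V, E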
Regarding step B: the traditional topic model takes the word-frequency characteristics of every article as the theme features of the article; the topic model used in the present invention can, in addition, cover the relationships between articles, i.e., the article citation network. The topic model comprises two cores: the generative model (described in detail in step B) and the inference process (described in detail in step C). The generative model is the model that the article generating process is assumed to obey under the condition that the parameters are known; the graphical model corresponding to the generative model of articles is shown in Fig. 2.
The step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys; it will be appreciated by those skilled in the art that the Beta distribution itself requires two hyperparameters, and that the two hyperparameters can be interchanged;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A.
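To make the generative process of steps B1 through B3 concrete, the following is a small numpy sketch of sampling one word under the model. The sizes, hyperparameter values and toy citation lists are illustrative assumptions, not values fixed by the invention.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V_words, M = 10, 5000, 100                  # themes, vocabulary, articles (assumed)
    alpha, beta, eta, p_q = 0.1, 0.01, 0.1, (1.0, 1.0)
    refs = {m: rng.choice(M, size=3, replace=False) for m in range(M)}  # toy citations

    phi = rng.dirichlet(np.full(V_words, beta), size=K)    # step B1: theme-to-word rows
    theta = rng.dirichlet(np.full(K, alpha), size=M)       # step B2: article-to-theme rows
    delta = {m: rng.dirichlet(np.full(len(refs[m]), eta)) for m in range(M)}
    lam = rng.beta(p_q[0], p_q[1], size=M)                 # originality index per article

    def generate_word(m):
        # Step B3: draw the originality indicator, then the theme, then the word.
        s = rng.random() < lam[m]
        if s:                                              # non-original word: the theme
            c = refs[m][rng.choice(len(refs[m]), p=delta[m])]  # comes from a cited article
            z = rng.choice(K, p=theta[c])
        else:                                              # original word: own theme mix
            z = rng.choice(K, p=theta[m])
        return rng.choice(V_words, p=phi[z])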
Regarding step C: the inference process is used to estimate the parameters of the generative model. In practice the words of the articles are known, and the latent parameters must be derived from them in reverse; this is accomplished by the methods of statistical inference. For the novel topic model proposed here, conventional optimization methods cannot solve the maximum likelihood estimation problem, so parameter estimation is carried out by means of Gibbs sampling.
The step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
Wherein the subscripts of the hyperparameters α, β and η denote the components of the corresponding prior distribution parameters.
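The following sketch computes the unnormalized weights of one Gibbs update, mirroring the two branches of the conditional above. Because the exact form of the formula is reconstructed from the count definitions, this is a sketch under stated assumptions (symmetric scalar hyperparameters, counts already excluding the current token) rather than a verbatim implementation.

    import numpy as np

    def token_weights(w, m, refs_m, n_kw, n_k, n_mk, n_mc_m, n_s0, n_s1,
                      alpha, beta, eta, p, q):
        """Unnormalized weights of every (s, c, z) choice for one token of article m.

        n_kw[z, t]: words of type t under theme z;  n_k[z] = n_kw[z].sum()
        n_mk[d, z]: theme-z words attributed to article d (its own s=0 words plus
                    the s=1 words of articles citing d)
        n_mc_m[j]:  words of article m drawn from its j-th cited article
        n_s0, n_s1: original / non-original word counts of article m
        """
        K, V = n_kw.shape
        word_term = (n_kw[:, w] + beta) / (n_k + V * beta)          # one value per theme z
        stay = (n_s0 + q) / (n_s0 + n_s1 + p + q)                   # s = 0 branch
        w_s0 = word_term * (n_mk[m] + alpha) / (n_mk[m].sum() + K * alpha) * stay
        w_s1 = []                                                    # s = 1 branch, per c
        for j, c in enumerate(refs_m):
            theme_cited = (n_mk[c] + alpha) / (n_mk[c].sum() + K * alpha)
            cite_term = (n_mc_m[j] + eta) / (n_mc_m.sum() + len(refs_m) * eta)
            w_s1.append(word_term * theme_cited * cite_term
                        * (n_s1 + p) / (n_s0 + n_s1 + p + q))
        return w_s0, w_s1   # normalize jointly, then sample one (s, c, z) triple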
Regarding step D: the sampling algorithm is designed according to the inference process of the novel topic model, and the article database is sampled with it; from this the complete inference procedure can be written out.
The step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is repeated until the Gibbs sampling converges;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme;
For a new article d_new (i.e., an article newly added to the new corpus), the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article are counted, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is repeated until the Gibbs sampling converges;
Step D403: count the theme distribution of the current article d_new, which is θ_new; count the proportion of words with s_{m,n}=1 in the article d_new, which is λ_new; and count the distribution of the article's citations, which is δ_new.
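A skeleton of the sampling loop of steps D1 through D403 is sketched below. The helper resample_token, standing for a Gibbs update built from the weights of the previous sketch, and the fixed iteration budget standing in for a convergence test are both assumptions of this illustration.

    import numpy as np

    def gibbs_run(docs, refs, K, resample_token, n_iter=200, seed=0):
        """docs[m] is the word list of article m; refs[m] its cited-article list."""
        rng = np.random.default_rng(seed)
        assign = {}
        for m, words in enumerate(docs):               # step D1 / D401: random start
            for n, _ in enumerate(words):
                s = int(rng.random() < 0.5)
                c = int(rng.choice(refs[m])) if s and len(refs[m]) else None
                z = int(rng.integers(K))
                assign[(m, n)] = (s, c, z)
        for _ in range(n_iter):                        # steps D2 / D402: rescan repeatedly
            for m, words in enumerate(docs):
                for n, w in enumerate(words):
                    assign[(m, n)] = resample_token(m, n, w, assign)
        return assign                                  # step D3 / D403: read stats off assign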
Regarding step E: after the topic model has converged (for example, cyclic sampling is performed according to the Gibbs sampling algorithm of step D, and after sufficiently many rounds of sampling the model parameters can be regarded as converged), the relevant parameters are obtained using the following formulas:
θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
Here a subscript (·) denotes summation of the corresponding count over that subscript, for example n_m^{(·)} = Σ_{k=1..K} n_m^{(k)}.
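As an illustration, after convergence the parameters can be read off the accumulated count matrices as the proportions below; the use of plain unsmoothed ratios is an assumption of this sketch.

    import numpy as np

    def extract_parameters(n_mk, n_kt, n_mc, n_s1, n_words):
        """Step E from counts: n_mk (M x K), n_kt (K x V), n_mc (M x C),
        n_s1[m] = non-original word count, n_words[m] = word count of article m."""
        theta = n_mk / n_mk.sum(axis=1, keepdims=True).clip(min=1)   # theta_{m,k}
        phi = n_kt / n_kt.sum(axis=1, keepdims=True).clip(min=1)     # phi_{k,t}
        lam = n_s1 / np.maximum(n_words, 1)                          # lambda_m
        delta = n_mc / n_mc.sum(axis=1, keepdims=True).clip(min=1)   # delta_{m,c}
        return theta, phi, lam, delta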
Specific embodiments of the present invention have been described above. It is to be understood that the present invention is not limited to the particular embodiments described, and that those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substantive content of the present invention. In the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims (5)

1. An article feature extraction method based on a topic model, characterized by comprising:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles;
the step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A;
the step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
2. The article feature extraction method based on a topic model according to claim 1, characterized in that the step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is executed repeatedly until the Gibbs sampling converges, and then step D3 is executed;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme.
3. The article feature extraction method based on a topic model according to claim 2, characterized in that the step D further includes:
for a new article d_new added to the new corpus, counting the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article d_new, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is executed repeatedly until the Gibbs sampling converges, and then step D403 is executed;
Step D403: count the theme distribution θ_new of the current article d_new, count the proportion λ_new of words with s_{m,n}=1 in the article d_new, and count the occurrence distribution δ_new of the article's citations.
4. The article feature extraction method based on a topic model according to claim 1, characterized in that the step E includes:
obtaining the relevant parameters using the following formulas:

θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the Bernoulli parameter of the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the total number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
5. The article feature extraction method based on a topic model according to claim 1, characterized in that the step A includes:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
CN201511016955.7A 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model Active CN105631018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Publications (2)

Publication Number Publication Date
CN105631018A CN105631018A (en) 2016-06-01
CN105631018B true CN105631018B (en) 2018-12-18

Family

ID=56045951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511016955.7A Active CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Country Status (1)

Country Link
CN (1) CN105631018B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147B (en) * 2016-08-29 2020-09-15 上海交通大学 Heterogeneous topic network construction and visualization method based on text network
CN106709520B (en) * 2016-12-23 2019-05-31 浙江大学 A kind of case classification method based on topic model
CN107515854B (en) * 2017-07-27 2021-06-04 上海交通大学 Time sequence community and topic detection method based on right-carrying time sequence text network
CN108549625B (en) * 2018-02-28 2020-11-17 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN110555198B (en) * 2018-05-31 2023-05-23 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating articles
CN109299257B (en) * 2018-09-18 2020-09-15 杭州科以才成科技有限公司 English periodical recommendation method based on LSTM and knowledge graph
CN109597879B (en) * 2018-11-30 2022-03-29 京华信息科技股份有限公司 Service behavior relation extraction method and device based on 'citation relation' data
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654B (en) * 2022-11-07 2023-03-24 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Review hotspot mining based on the LDA model: principle and implementation; Yu Chuanming et al.; Information Systems; 2010-05-31; Vol. 33, No. 5; pp. 103-106 *
Keyword extraction based on topic features; Liu Jun et al.; Application Research of Computers; 2012-11-30; Vol. 29, No. 11; pp. 4224-4227 *

Also Published As

Publication number Publication date
CN105631018A (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN105631018B (en) Article Feature Extraction Method based on topic model
CN104699766B (en) A kind of implicit attribute method for digging for merging word association relation and context of co-text deduction
CN106649272B (en) A kind of name entity recognition method based on mixed model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN106484664A (en) Similarity calculating method between a kind of short text
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
CN106682172A (en) Keyword-based document research hotspot recommending method
CN106776797A (en) A kind of knowledge Q-A system and its method of work based on ontology inference
CN106354708A (en) Client interaction information search engine system based on electricity information collection system
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
Zhao et al. Research on information extraction of technical documents and construction of domain knowledge graph
CN105843860A (en) Microblog attention recommendation method based on parallel item-based collaborative filtering algorithm
CN105808729B (en) Academic big data analysis method based on adduction relationship between paper
CN105468780B (en) The normalization method and device of ProductName entity in a kind of microblogging text
Liu et al. Chinese named entity recognition based on rules and conditional random field
CN105787072B (en) A kind of domain knowledge of Process-Oriented extracts and method for pushing
CN107908749A A kind of personage's searching system and method based on search engine
Sun et al. Joint topic-opinion model for implicit feature extracting
CN115730078A (en) Event knowledge graph construction method and device for class case retrieval and electronic equipment
CN110188352A (en) A kind of text subject determines method, apparatus, calculates equipment and storage medium
Chen et al. Web Evaluation Analysis of Tourism Destinations Based on Data Mining
CN108536796A (en) A kind of isomery Ontology Matching method and system based on figure
CN103294662B (en) Match judging apparatus and consistance determination methods
Fuller et al. Structuring, recording, and analyzing historical networks in the china biographical database
CN110377845A (en) Collaborative filtering recommending method based on the semi-supervised LDA in section

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant