CN105631018B - Article Feature Extraction Method based on topic model - Google Patents

Article Feature Extraction Method based on topic model

Info

Publication number
CN105631018B
CN105631018B CN201511016955.7A
Authority
CN
China
Prior art keywords
article
word
theme
distribution
articles
Prior art date
Legal status
Active
Application number
CN201511016955.7A
Other languages
Chinese (zh)
Other versions
CN105631018A (en)
Inventor
沈嘉明
宋振宇
李世韬
毛宇宁
谈兆炜
朱鸿儒
王乐群
郭运奇
王彪
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201511016955.7A priority Critical patent/CN105631018B/en
Publication of CN105631018A publication Critical patent/CN105631018A/en
Application granted granted Critical
Publication of CN105631018B publication Critical patent/CN105631018B/en
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides an article feature extraction method based on a topic model, comprising: building a citation network of articles from an original corpus; constructing the generative model and joint parameter expression of the topic model; constructing the inference process of the topic model from the generative model; sampling the articles of the new corpus; and extracting article parameters from the results of sampling the articles. The present invention uses the article citation network to extend the traditional topic model, thereby extracting more accurate article features.

Description

Article Feature Extraction Method based on topic model
Technical field
The present invention relates to the technical field of article feature extraction, and in particular to an article feature extraction method based on a topic model, especially a topic-model-based feature extraction method that integrates the curation of article citation networks.
Background art
Scientific research is a strategic support for improving social productivity and overall national strength, and countries around the world attach great importance to investment in it. China places science and technology R&D at the core of its national development strategy, and state fiscal expenditure on scientific research has grown steadily. In 2012, China's R&D investment (including industry and academia) exceeded one trillion yuan, reaching 1,029.84 billion yuan, the level of a moderately developed country.
One of the most direct outputs of scientific research is the academic article. According to statistics, from 2004 to 2014 Chinese researchers published a total of 1.3698 million scientific and technical articles internationally, ranking second in the world; these articles were cited a total of 10.3701 million times, ranking fourth in the world. Research practice shows that academic articles are essential information resources for researchers carrying out research or pursuing further in-depth study. Facing the vast sea of literature of the information age, however, quickly and accurately retrieving the academic resources one needs is an extremely important and challenging task for any researcher.
To meet the demand for academic search and recommendation, Google launched a beta version of its academic search engine in 2004, providing free academic literature services to researchers worldwide; in 2006, Microsoft launched the academic search engine Microsoft Academic Search. Although these general-purpose academic search engines rely on the search technology of the commercial search companies behind them, their search results are in fact unsatisfactory. For a user's query, these existing academic search engines still return results in the form of an article list. They emphasize the accuracy of search results, i.e., exactly matching retrieved articles against the keywords of the user's query, without attending to an article's position within its field or to the development trend of the article's topic. For researchers, however, what matters more than exact title matching is often obtaining the frontier achievements and seminal contributions of the discipline concerned. For example, when users who have just entered a research field perform a search, they are not yet sure what kind of literature they need, and their search keywords are usually only a rough theme or topic; with the above general-purpose academic search engines, such users often cannot quickly and effectively learn of the frontier achievements and seminal articles in the relevant discipline, and the results obtained are unsatisfactory.
Clearly, building a highly effective academic search and recommendation system is of considerable significance for helping researchers obtain the resources they need, keep abreast of disciplinary developments in time, and improve their own research capability, thereby enhancing the research strength of the nation. For this reason, academic search and recommendation systems have gradually attracted attention in recent years. Since 2000, the number of publications on article search and recommendation systems has increased year by year; according to incomplete statistics, in 2013 alone more than 30 related articles appeared. Research on academic search and recommendation systems is nevertheless still at an early stage.
In the construction of an academic search system, one important task is to extract the features of articles from large-scale article datasets and citation network datasets, such as the theme of each article, the academic contribution of each article, the strength of citation relationships between articles, and the feature words corresponding to each theme.
To date, the main research directions in article feature extraction at home and abroad include: analyzing the semantics of articles to recommend other articles whose themes are similar to a given article; and modeling and analyzing the article citation network to obtain the importance of articles.
Currently, article feature extraction methods based on theme analysis include: analyzing article themes with a topic model (such as the LDA algorithm) and introducing topic similarity into the collaborative filtering of a recommender system; finding articles on similar topics by combining topic models with language models; and modeling the themes of word groups based on the LDA algorithm. Article feature extraction methods based on the article citation network include: computing authority values of articles with the HITS algorithm on a bipartite graph built from articles and terms; computing authority values of authors from the article citation network and making recommendations accordingly; and computing PageRank values of articles with the PageRank algorithm, combining journal quality with the citation network.
These research results, however, either fail to consider the applicability of the model to article databases of large sample size, or focus solely on the information of the citation network while ignoring the extraction of article text information, or consider only the text information of the article database while ignoring the information of the citation network. The use value of the final results is therefore not high.
Summary of the invention
In view of the above defects in the prior art, the object of the present invention is to provide an article feature extraction method based on a topic model.
An article feature extraction method based on a topic model provided according to the present invention comprises:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles.
Preferably, the step A includes:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
Preferably, the step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A.
Preferably, the step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
Preferably, the step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is executed repeatedly until the Gibbs sampling converges, and then step D3 is executed;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme.
Preferably, the step D further includes:
for a new article d_new added to the new corpus, counting the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article d_new, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is executed repeatedly until the Gibbs sampling converges, and then step D403 is executed;
Step D403: count the theme distribution θ_new of the current article d_new, count the proportion λ_new of words with s_{m,n}=1 in the article d_new, and count the occurrence distribution δ_new of the article's citations.
Preferably, the step E includes:
obtaining the relevant parameters using the following formulas:

θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
In a preferred technical solution: effective keywords are extracted from the corpus, and each effective keyword is treated as an abstract object; the number of themes to extract per article, the intensity of the theme distribution, and the intensity of the article citation distribution may be determined by user demand or preset by the system. It is assumed that the theme source of each word in every article is random: the theme is generated either by the theme distribution of the article itself or by the theme distribution of some article cited by this article.
The probabilistic model of text generation includes the following assumptions:
(1) the theme of each word in every article obeys a multinomial distribution, whose prior obeys a Dirichlet distribution;
(2) the words under each theme obey a multinomial distribution, whose prior obeys a Dirichlet distribution;
(3) the citation source of each word in every article obeys a multinomial distribution, whose prior obeys a Dirichlet distribution;
(4) the originality of each word in every article obeys a binomial distribution, whose prior obeys a Beta distribution;
wherein, regarding the assumptions of the probabilistic model, the parameters of the prior distributions are determined by the average article length, the number of themes, and the average number of articles cited per article.
Compared with the prior art, the present invention has the following beneficial effects:
1. Based on the above problems in the prior art, the present invention approaches article feature extraction from a new perspective; it can improve the accuracy of article feature extraction and can extract from articles information that traditional feature extraction systems do not consider.
2. The present invention extends the traditional topic model with the information of the citation network, allowing the model to extract article features by combining both kinds of information; it is applicable not only when the database is large, but also to dynamically growing databases, and it can extract information that previous topic models could not, such as the strength of citation relationships between articles and the originality index of an article.
3. The present invention exploits the sparsity of the article theme distributions, the sparsity of the word distributions within themes, and the sparsity of the article citation distributions to reduce the sampling complexity.
Detailed description of the invention
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a sample of the original article data.
Fig. 2 shows the generative process of the novel topic model.
Fig. 3 is a flow chart of the method of the present invention.
Specific embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be pointed out that persons of ordinary skill in the art can make several changes and improvements without departing from the inventive concept; these all belong to the protection scope of the present invention.
The present invention extracts article features by an original method. The present invention uses the article citation network to extend the traditional topic model, allowing the topic model to extract article features using the topic model and the citation network simultaneously, and thereby to extract more accurate article features. The main steps of the invention include:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles.
The article feature extraction method designed by the present invention involves five core components: the automated scientific procedure for curating the citation network; the generative model of the novel topic model combining the citation network; the derivation of its joint parameter expression; the derivation of the inference process and sampling algorithm of the novel topic model; and the parameter estimation of the novel topic model. The method provided by the invention includes the following steps:
Regarding step A: from the large-sample original corpus, the citation network of articles (e.g., papers) is generated automatically and output to a file. The corpus contains two kinds of information: one part is information about the articles themselves, including title, authors, abstract, and so on; the other part is the citation relationships between articles, for example article A cites article B, article A cites article C.
The academic data on the Internet is vast and grows by millions of items every year. Therefore, in the present invention, based on the existing original corpus in XML and JSON formats, the title, abstract and bibliography of each article in the original corpus are extracted; an initial article set is then established, and according to the citation relationships of the academic articles the maximal connected component is obtained and exported as the new corpus.
The format of the existing original article corpus is shown in Table 1 and Fig. 1.
Table 1. Storage format specification of the original article data.
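Since Table 1 is not reproduced in this text, the following is a minimal sketch of loading such a corpus, assuming a JSON-lines layout in which each record carries the fields "id", "title", "abstract" and "references"; these field names are assumptions for illustration only, standing in for the actual layout specified by Table 1.

    import json

    def load_corpus(path):
        """Load articles into a dict keyed by article id (assumed JSON-lines layout)."""
        articles = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                articles[rec["id"]] = {
                    "title": rec.get("title", ""),
                    "abstract": rec.get("abstract", ""),
                    "references": rec.get("references", []),  # ids of cited articles
                }
        return articles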
In the step A, the step of building the citation network of articles from the original corpus comprises:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network.
In the step A, the step of setting an initial article set and obtaining a new corpus according to the citation network comprises: according to the citation network, automatically obtaining the maximal connected component to obtain the new corpus; specifically:
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
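By way of illustration, the following is a minimal sketch of steps A1 through A6 in Python, under the corpus layout assumed above: it builds the citation graph G = (V, E) over all articles, then grows an initial seed set V_0 by repeatedly absorbing any article that shares a citation relationship with the current vertex set, until V and E no longer change.

    def build_citation_graph(articles):
        # Steps A1-A3: every article becomes a node; every citation becomes an edge.
        V = set(articles)
        E = {(u, v) for u, rec in articles.items() for v in rec["references"]}
        return V, E

    def expand_from_seed(articles, V0):
        # Steps A4-A5: start from the initial known set and absorb, to a fixed
        # point, any article that cites or is cited by an article already in V.
        V = set(V0)
        changed = True
        while changed:
            changed = False
            for u, rec in articles.items():
                if u in V:
                    continue
                cites_V = any(v in V for v in rec["references"])
                cited_by_V = any(u in articles[w]["references"] for w in V)
                if cites_V or cited_by_V:
                    V.add(u)
                    changed = True
        # Step A6: the citation edges induced on V form the new corpus's network.
        E = {(u, v) for u in V for v in articles[u]["references"] if v in V}
        return V, E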
Regarding step B: the traditional topic model takes the word-frequency characteristics of every article as the theme features of the article; the topic model used in the present invention can, in addition, cover the relationships between articles, i.e., the article citation network. The topic model comprises two cores: the generative model (described in detail in step B) and the inference process (described in detail in step C). The generative model is the model that the article generating process is assumed to obey under the condition that the parameters are known; the graphical model corresponding to the generative model of articles is shown in Fig. 2.
The step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys; it will be appreciated by those skilled in the art that the Beta distribution itself requires two hyperparameters, and that the two hyperparameters can be interchanged;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A.
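To make the generative process of steps B1 through B3 concrete, the following is a small numpy sketch of sampling one word under the model. The sizes, hyperparameter values and toy citation lists are illustrative assumptions, not values fixed by the invention.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V_words, M = 10, 5000, 100                  # themes, vocabulary, articles (assumed)
    alpha, beta, eta, p_q = 0.1, 0.01, 0.1, (1.0, 1.0)
    refs = {m: rng.choice(M, size=3, replace=False) for m in range(M)}  # toy citations

    phi = rng.dirichlet(np.full(V_words, beta), size=K)    # step B1: theme-to-word rows
    theta = rng.dirichlet(np.full(K, alpha), size=M)       # step B2: article-to-theme rows
    delta = {m: rng.dirichlet(np.full(len(refs[m]), eta)) for m in range(M)}
    lam = rng.beta(p_q[0], p_q[1], size=M)                 # originality index per article

    def generate_word(m):
        # Step B3: draw the originality indicator, then the theme, then the word.
        s = rng.random() < lam[m]
        if s:                                              # non-original word: the theme
            c = refs[m][rng.choice(len(refs[m]), p=delta[m])]  # comes from a cited article
            z = rng.choice(K, p=theta[c])
        else:                                              # original word: own theme mix
            z = rng.choice(K, p=theta[m])
        return rng.choice(V_words, p=phi[z])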
Regarding step C: the inference process is used to estimate the parameters of the generative model. In practice the words of the articles are known, and the latent parameters must be derived from them in reverse; this is accomplished by the methods of statistical inference. For the novel topic model proposed here, conventional optimization methods cannot solve the maximum likelihood estimation problem, so parameter estimation is carried out by means of Gibbs sampling.
The step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
Wherein the subscripts of the hyperparameters α, β and η denote the components of the corresponding prior distribution parameters.
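The following sketch computes the unnormalized weights of one Gibbs update, mirroring the two branches of the conditional above. Because the exact form of the formula is reconstructed from the count definitions, this is a sketch under stated assumptions (symmetric scalar hyperparameters, counts already excluding the current token) rather than a verbatim implementation.

    import numpy as np

    def token_weights(w, m, refs_m, n_kw, n_k, n_mk, n_mc_m, n_s0, n_s1,
                      alpha, beta, eta, p, q):
        """Unnormalized weights of every (s, c, z) choice for one token of article m.

        n_kw[z, t]: words of type t under theme z;  n_k[z] = n_kw[z].sum()
        n_mk[d, z]: theme-z words attributed to article d (its own s=0 words plus
                    the s=1 words of articles citing d)
        n_mc_m[j]:  words of article m drawn from its j-th cited article
        n_s0, n_s1: original / non-original word counts of article m
        """
        K, V = n_kw.shape
        word_term = (n_kw[:, w] + beta) / (n_k + V * beta)          # one value per theme z
        stay = (n_s0 + q) / (n_s0 + n_s1 + p + q)                   # s = 0 branch
        w_s0 = word_term * (n_mk[m] + alpha) / (n_mk[m].sum() + K * alpha) * stay
        w_s1 = []                                                    # s = 1 branch, per c
        for j, c in enumerate(refs_m):
            theme_cited = (n_mk[c] + alpha) / (n_mk[c].sum() + K * alpha)
            cite_term = (n_mc_m[j] + eta) / (n_mc_m.sum() + len(refs_m) * eta)
            w_s1.append(word_term * theme_cited * cite_term
                        * (n_s1 + p) / (n_s0 + n_s1 + p + q))
        return w_s0, w_s1   # normalize jointly, then sample one (s, c, z) triple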
Regarding step D: the sampling algorithm is designed according to the inference process of the novel topic model, and the article database is sampled with it; from this the complete inference procedure can be written out.
The step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is repeated until the Gibbs sampling converges;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme;
For a new article d_new (i.e., an article newly added to the new corpus), the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article are counted, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is repeated until the Gibbs sampling converges;
Step D403: count the theme distribution of the current article d_new, which is θ_new; count the proportion of words with s_{m,n}=1 in the article d_new, which is λ_new; and count the distribution of the article's citations, which is δ_new.
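A skeleton of the sampling loop of steps D1 through D403 is sketched below. The helper resample_token, standing for a Gibbs update built from the weights of the previous sketch, and the fixed iteration budget standing in for a convergence test are both assumptions of this illustration.

    import numpy as np

    def gibbs_run(docs, refs, K, resample_token, n_iter=200, seed=0):
        """docs[m] is the word list of article m; refs[m] its cited-article list."""
        rng = np.random.default_rng(seed)
        assign = {}
        for m, words in enumerate(docs):               # step D1 / D401: random start
            for n, _ in enumerate(words):
                s = int(rng.random() < 0.5)
                c = int(rng.choice(refs[m])) if s and len(refs[m]) else None
                z = int(rng.integers(K))
                assign[(m, n)] = (s, c, z)
        for _ in range(n_iter):                        # steps D2 / D402: rescan repeatedly
            for m, words in enumerate(docs):
                for n, w in enumerate(words):
                    assign[(m, n)] = resample_token(m, n, w, assign)
        return assign                                  # step D3 / D403: read stats off assign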
Regarding step E: after the topic model has converged (for example, cyclic sampling is performed according to the Gibbs sampling algorithm of step D, and after sufficiently many rounds of sampling the model parameters can be regarded as converged), the relevant parameters are obtained using the following formulas:
θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
Here a subscript (·) denotes summation of the corresponding count over that subscript, for example n_m^{(·)} = Σ_{k=1..K} n_m^{(k)}.
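As an illustration, after convergence the parameters can be read off the accumulated count matrices as the proportions below; the use of plain unsmoothed ratios is an assumption of this sketch.

    import numpy as np

    def extract_parameters(n_mk, n_kt, n_mc, n_s1, n_words):
        """Step E from counts: n_mk (M x K), n_kt (K x V), n_mc (M x C),
        n_s1[m] = non-original word count, n_words[m] = word count of article m."""
        theta = n_mk / n_mk.sum(axis=1, keepdims=True).clip(min=1)   # theta_{m,k}
        phi = n_kt / n_kt.sum(axis=1, keepdims=True).clip(min=1)     # phi_{k,t}
        lam = n_s1 / np.maximum(n_words, 1)                          # lambda_m
        delta = n_mc / n_mc.sum(axis=1, keepdims=True).clip(min=1)   # delta_{m,c}
        return theta, phi, lam, delta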
Specific embodiments of the present invention have been described above. It is to be understood that the present invention is not limited to the particular embodiments described, and that those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substantive content of the present invention. In the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims (5)

1. An article feature extraction method based on a topic model, characterized by comprising:
Step A: build the citation network of articles from the original corpus, set an initial article set, and obtain a new corpus according to the citation network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the results of sampling the articles;
the step B includes:
Step B1: for each theme of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th theme over words from the Dirichlet hyperparameter β, where β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over themes from the Dirichlet hyperparameter α, where α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article from the Dirichlet hyperparameter η, where η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article from the pair of Beta-distribution hyperparameters, which are the parameters of the Beta distribution that λ_m obeys;
Step B3: for each word in each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m, where n is a positive integer;
if s_{m,n} is 1, generate a cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate a theme z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate a word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-theme distribution matrix, and φ denotes the theme-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} denotes the theme of the n-th word of the m-th article, and c_{m,n} denotes the article cited by the n-th word of the m-th article when that word is not original;
Step B4: construct the joint probability distribution of the topic model as follows:

p(w, z, c, s | α, β, η, p, q) = ∏_{k=1..K} [Δ(n_k + β) / Δ(β)] · ∏_{m=1..M} [Δ(n_m + α) / Δ(α)] · ∏_{m=1..M} [Δ(l_m + η) / Δ(η)] · ∏_{m=1..M} [B(n_m^{(1)} + p, n_m^{(0)} + q) / B(p, q)]

wherein p(A | B) denotes the probability of A under condition B, and w, z, c, s, α, β, η, n_k, n_m and l_m denote vectors; φ is the theme-to-word distribution, θ is the article-to-theme distribution, δ is the citation distribution of the articles, and λ is the distribution of word originality in the articles; n_k is the vector of word frequencies under the k-th theme, and K denotes the number of themes; n_m is the vector of theme frequencies attributed to the m-th article, and M denotes the number of articles; l_m is the vector of citation frequencies of the m-th article; n_m^{(1)} is the frequency of non-original words in the m-th article, and n_m^{(0)} is the frequency of original words in the m-th article; B(p, q) denotes the Beta function with parameters p and q;
Δ(·) is defined as:

Δ(A) = [∏_{k=1..dim A} Γ(A_k)] / Γ(Σ_{k=1..dim A} A_k)

wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A;
the step C includes:
Step C1: carry out parameter estimation using the following Gibbs sampling formulas, which jointly resample the originality indicator, the cited article and the theme of each word:

p(s_{m,n}=0, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{m,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{m,¬(m,n)}^{(k')} + α_{k'})] · [(n_{m,¬(m,n)}^{(0)} + q) / (n_{m,¬(m,n)} + p + q)]

p(s_{m,n}=1, c_{m,n}=c, z_{m,n}=k | ·) ∝ [(n_{k,¬(m,n)}^{(w_{m,n})} + β_{w_{m,n}}) / Σ_{t=1..V}(n_{k,¬(m,n)}^{(t)} + β_t)] · [(n_{c,¬(m,n)}^{(k)} + α_k) / Σ_{k'=1..K}(n_{c,¬(m,n)}^{(k')} + α_{k'})] · [(l_{m,¬(m,n)}^{(c)} + η_c) / Σ_{r=1..L_m}(l_{m,¬(m,n)}^{(r)} + η_r)] · [(n_{m,¬(m,n)}^{(1)} + p) / (n_{m,¬(m,n)} + p + q)]

wherein the subscript ¬(m,n) denotes a count from which the current word (m,n) has been removed; the symbol ∝ means "is proportional to"; n_k^{(w_{m,n})} denotes the frequency with which the word w_{m,n} appears under theme k; V denotes the total number of words; n_k^{(t)} denotes the frequency of the t-th word under theme k; β_t denotes the t-th component of β; n_d^{(k)} denotes the frequency of the k-th theme among the words attributed to an article d, counting both the original words (s_{m,n}=0) of article d itself and the non-original words (s_{m,n}=1) of articles that draw on article d through citation; α_k denotes the k-th component of α; l_m^{(c)} denotes the number of words of the m-th article drawn from the cited article c; L_m denotes the total number of articles cited by the m-th article; η_c and η_r denote the c-th and r-th components of η; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article, respectively; and n_m = n_m^{(0)} + n_m^{(1)}.
2. The article feature extraction method based on a topic model according to claim 1, characterized in that the step D includes:
Step D1: initialization: for each word w_{m,n} in every article of the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of the article currently being sampled according to a multinomial distribution; and assign the word w_{m,n} currently being sampled a random theme z_{m,n} according to a multinomial distribution;
Step D2: rescan the new corpus: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D2 is executed repeatedly until the Gibbs sampling converges, and then step D3 is executed;
Step D3: according to the statistics of the new corpus — the proportion of words with s_{m,n}=1 in every article, the frequency of occurrence of citations in every article, the frequency of occurrence of themes in every article, and the frequency with which words occur under each theme — respectively obtain the originality index λ of every article, the citation strength distribution δ, the theme distribution θ of every article, and the word distribution φ of each theme.
3. The article feature extraction method based on a topic model according to claim 2, characterized in that the step D further includes:
for a new article d_new added to the new corpus, counting the theme distribution θ_new, the citation strength distribution δ_new and the originality index λ_new of this article d_new, specifically including the steps:
Step D401: initialization: for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the cited articles of this article d_new according to a multinomial distribution; and assign the word w_{m,n} a random theme z_{m,n} according to a multinomial distribution;
Step D402: rescan the current article d_new: for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}, otherwise skip the sampling of the cited article c_{m,n}; sample the theme z_{m,n} of w_{m,n}; and update the assignments in the new corpus;
wherein step D402 is executed repeatedly until the Gibbs sampling converges, and then step D403 is executed;
Step D403: count the theme distribution θ_new of the current article d_new, count the proportion λ_new of words with s_{m,n}=1 in the article d_new, and count the occurrence distribution δ_new of the article's citations.
4. The article feature extraction method based on a topic model according to claim 1, characterized in that the step E includes:
obtaining the relevant parameters using the following formulas:

θ_{m,k} = n_m^{(k)} / Σ_{k'=1..K} n_m^{(k')},   φ_{k,t} = n_k^{(t)} / Σ_{t'=1..V} n_k^{(t')},   λ_m = n_m^{(1)} / (n_m^{(0)} + n_m^{(1)}),   δ_{m,c} = n_m^{(c)} / Σ_{c'} n_m^{(c')}

wherein θ_{m,k} is the distribution probability of the m-th article with respect to the k-th theme; φ_{k,t} is the distribution probability of the k-th theme with respect to the t-th word; λ_m is the Bernoulli parameter of the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the frequency of words of the k-th theme in the m-th article; n_k^{(t)} denotes the frequency of occurrence of the t-th word in the k-th theme, and V denotes the total number of words of the k-th theme; n_m^{(c)} denotes the frequency of words of the m-th article drawn from the cited article c; n_m^{(0)} and n_m^{(1)} denote the frequencies of original and non-original words in the m-th article.
5. The article feature extraction method based on a topic model according to claim 1, characterized in that the step A includes:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relationships of the current article node u to the edge set E;
Step A3: take the graph G obtained by step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V_0, set the edge set E to the initially known edge set E_0, and let the graph G be the pair (V, E);
Step A5: repeatedly search the original corpus for a point v not in the vertex set V; if such a point v exists and v has a citation relationship with some point in V, add v to V and add the citation relationships of v to E; continue until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained by step A5 as the new corpus.
CN201511016955.7A 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model Active CN105631018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Publications (2)

Publication Number Publication Date
CN105631018A CN105631018A (en) 2016-06-01
CN105631018B true CN105631018B (en) 2018-12-18

Family

ID=56045951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511016955.7A Active CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Country Status (1)

Country Link
CN (1) CN105631018B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147B (en) * 2016-08-29 2020-09-15 上海交通大学 Heterogeneous topic network construction and visualization method based on text network
CN106709520B (en) * 2016-12-23 2019-05-31 浙江大学 A kind of case classification method based on topic model
CN107515854B (en) * 2017-07-27 2021-06-04 上海交通大学 Time sequence community and topic detection method based on right-carrying time sequence text network
CN108549625B (en) * 2018-02-28 2020-11-17 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN110555198B (en) * 2018-05-31 2023-05-23 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating articles
CN109299257B (en) * 2018-09-18 2020-09-15 杭州科以才成科技有限公司 English periodical recommendation method based on LSTM and knowledge graph
CN109597879B (en) * 2018-11-30 2022-03-29 京华信息科技股份有限公司 Service behavior relation extraction method and device based on 'citation relation' data
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654B (en) * 2022-11-07 2023-03-24 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Review hotspot mining based on the LDA model: principle and implementation; Yu Chuanming et al.; Information Systems; 2010-05-31; Vol. 33, No. 5; pp. 103-106 *
Keyword extraction based on topic features; Liu Jun et al.; Application Research of Computers; 2012-11-30; Vol. 29, No. 11; pp. 4224-4227 *

Also Published As

Publication number Publication date
CN105631018A (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN105631018B (en) Article Feature Extraction Method based on topic model
CN104699766B (en) A kind of implicit attribute method for digging for merging word association relation and context of co-text deduction
CN106649272B (en) A kind of name entity recognition method based on mixed model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN106484664A (en) Similarity calculating method between a kind of short text
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
CN106682172A (en) Keyword-based document research hotspot recommending method
CN106776797A (en) A kind of knowledge Q-A system and its method of work based on ontology inference
CN106354708A (en) Client interaction information search engine system based on electricity information collection system
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
Zhao et al. Research on information extraction of technical documents and construction of domain knowledge graph
CN105843860A (en) Microblog attention recommendation method based on parallel item-based collaborative filtering algorithm
CN105808729B (en) Academic big data analysis method based on adduction relationship between paper
CN105468780B (en) The normalization method and device of ProductName entity in a kind of microblogging text
Liu et al. Chinese named entity recognition based on rules and conditional random field
CN105787072B (en) A kind of domain knowledge of Process-Oriented extracts and method for pushing
CN107908749A A kind of personage's searching system and method based on search engine
Sun et al. Joint topic-opinion model for implicit feature extracting
CN115730078A (en) Event knowledge graph construction method and device for class case retrieval and electronic equipment
CN110188352A (en) A kind of text subject determines method, apparatus, calculates equipment and storage medium
Chen et al. Web Evaluation Analysis of Tourism Destinations Based on Data Mining
CN108536796A (en) A kind of isomery Ontology Matching method and system based on figure
CN103294662B (en) Match judging apparatus and consistance determination methods
Fuller et al. Structuring, recording, and analyzing historical networks in the china biographical database
CN110377845A (en) Collaborative filtering recommending method based on the semi-supervised LDA in section

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant