CN105631018B - Article Feature Extraction Method based on topic model - Google Patents
- Publication number: CN105631018B
- Application number: CN201511016955.7A
- Authority: CN (China)
- Prior art keywords: article, word, theme, distribution, articles
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
Abstract
The present invention provides an article feature extraction method based on a topic model, comprising: constructing the citation relationship network of the articles based on an original corpus; constructing the generative model and the joint parameter expression of the topic model; constructing the inference process of the topic model from the generative model; sampling the articles of the new corpus; and extracting the article parameters from the results of sampling the articles. The present invention uses the article citation network to extend the traditional topic model, so as to extract more accurate article features.
Description
Technical field
The present invention relates to the technical field of article feature extraction, and in particular to an article feature extraction method based on a topic model, especially a method that integrates the collation of the citation relationship network with topic-model-based feature extraction.
Background technique
Scientific research is a strategic pillar for raising social productivity and comprehensive national strength, and countries around the world attach great importance to investment in it. China has placed science-and-technology research and development at the core of its national development strategy, and fiscal expenditure on research has risen steadily. In 2012, China's R&D investment (industry and academia combined) already exceeded one trillion yuan, reaching 1,029.84 billion yuan, the level of a moderately developed country.
One of the most direct outputs of scientific research is the academic article. According to statistics, from 2004 to 2014 Chinese researchers published a total of 1.3698 million scientific and technical articles internationally, ranking second in the world; these articles were cited a total of 10.3701 million times, ranking fourth. Research practice shows that academic articles are a vital information resource for researchers carrying out or deepening their investigations. Yet, faced with the vast sea of literature of the information age, retrieving the academic resources one needs quickly and accurately is an extremely important and challenging task for any researcher.
To meet the demand for academic search and recommendation, Google launched a beta academic search engine in 2004, providing free academic literature services to researchers worldwide; in 2006, Microsoft launched the academic search engine Microsoft Academic Search. Although these general-purpose academic search engines rely on the search technology of the commercial search companies behind them, their results are in fact unsatisfactory. For a user's query input, they still return results as a flat list of articles. They emphasize the accuracy of the search results, i.e. matching the retrieved articles exactly against the user's query keywords, while paying no attention to each article's position within its field or to the development trend of its topic. For researchers, however, what matters more than exact title matching is often obtaining the frontier achievements and landmark contributions of the discipline. For example, users who have just entered a research field usually do not know exactly which documents they need; their search keywords are only a rough theme or topic. With the general-purpose engines above, such users often cannot quickly and effectively identify the frontier achievements and landmark articles of the relevant discipline, and the results they obtain are unsatisfactory.
It is thus clear that building an effective academic search and recommendation system is of considerable significance for helping researchers obtain the resources they need, keep abreast of disciplinary developments in time, improve their own research capability, and in turn strengthen the nation's research strength. For exactly this reason, academic search and recommendation systems have gradually attracted attention in recent years. Since 2000, the number of publications on article search and recommendation systems has risen year by year; by incomplete statistics, more than 30 appeared in 2013 alone. Research on academic search and recommendation systems, however, is still at an early stage.
In building an academic search system, one important task is to extract article features from large-scale article data sets and citation-network data sets: for example, the topic of each article, the academic contribution of each article, the strength of the citation relations between articles, and the characteristic words corresponding to each topic.
So far, the main research directions in article feature extraction, both in China and abroad, include: semantic analysis of articles, to recommend other articles with topics similar to a given article; and modeling and analysis of the article citation network, to measure the importance of articles.
At present, article feature extraction methods based on topic analysis include: analyzing article topics with a topic model (e.g. the LDA algorithm) and introducing topic similarity into the collaborative filtering of a recommender system; combining topic models with language models to find articles on similar topics; and modeling the topics of word groups based on LDA. Article feature extraction methods based on the article citation network include: computing the authority values of articles with the HITS algorithm on a bipartite graph built from articles and terms; computing the authority values of authors from the citation network and using them for recommendation; and computing the PageRank values of articles with the PageRank algorithm combined with journal quality and the citation network.
These research achievements, however, either fail to consider the applicability of the model to article databases with large sample sizes, or focus solely on the information of the citation network while ignoring the extraction of the article text, or consider only the text of the article database while ignoring the information of the citation relationship network. The practical value of their final results is therefore not high.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an article feature extraction method based on a topic model.
An article feature extraction method based on a topic model, provided according to the present invention, comprises:
Step A: construct the citation relationship network of the articles based on the original corpus, set the initial article set, and obtain a new corpus according to the citation relationship network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract the article parameters from the results of sampling the articles.
Preferably, the step A comprises:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relations of the current article node u to the edge set E;
Step A3: take the graph G obtained in step A2 as the citation relationship network;
Step A4: set the vertex set V to the initial known vertex set V0, set the edge set E to the initial known edge set E0, and let the graph G be the pair (V, E);
Step A5: repeatedly search for a vertex v of the original corpus that is not in the vertex set V; whenever such a vertex v exists and v has a citation relation with some vertex in V, add v to V and add the citation relations of v to E; stop when V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in step A5 as the new corpus.
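As an illustration, steps A1 to A6 can be sketched as follows (Python; the input format, a mapping from article ids to the lists of article ids they cite, is a simplifying assumption rather than the invention's storage format):

```python
def build_citation_network(corpus):
    """Steps A1-A3: build the citation graph G = (V, E) from the original
    corpus. `corpus` is assumed to map an article id to the list of
    article ids it cites (an illustrative input format)."""
    V, E = set(), set()              # step A1: V and E start empty
    for u, refs in corpus.items():
        V.add(u)                     # step A2: add the current article node u
        for r in refs:
            E.add((u, r))            # ... and all of its citation relations
    return V, E                      # step A3: G = (V, E) is the network

def expand_from_seed(corpus, V0):
    """Steps A4-A6: starting from an initial known article set V0, repeatedly
    add any article that cites, or is cited by, an article already in V,
    together with its citation relations, until V and E no longer change."""
    V = set(V0)
    E = {(u, r) for u in V0 for r in corpus.get(u, [])}
    changed = True
    while changed:                   # step A5: iterate to a fixed point
        changed = False
        for v, refs in corpus.items():
            if v in V:
                continue
            cites_V = bool(set(refs) & V)
            cited_by_V = any(v in corpus.get(u, []) for u in V)
            if cites_V or cited_by_V:
                V.add(v)
                E.update((v, r) for r in refs)
                changed = True
    return V, E                      # step A6: the articles in V form the new corpus
```

For example, with `corpus = {'A': ['B'], 'B': [], 'C': ['A'], 'D': ['E'], 'E': []}` and seed set `{'A'}`, the expansion yields the connected set `{'A', 'B', 'C'}` together with its two citation edges, while the disconnected pair D, E is excluded.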
Preferably, the step B comprises:
Step B1: for each topic of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th topic over words based on the Dirichlet hyperparameter β; wherein β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over topics based on the Dirichlet hyperparameter α; wherein α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article based on the Dirichlet hyperparameter η; wherein η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article based on the pair of Beta hyperparameters; wherein these two hyperparameters are the parameters of the Beta distribution that λ_m obeys;
Step B3: for each word of each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m; n is a positive integer;
if s_{m,n} is 1, generate the cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate the topic z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate the topic z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate the word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-topic distribution matrix and φ the topic-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} the topic of the n-th word of the m-th article, and c_{m,n} the article cited by the n-th word of the m-th article when that word is non-original;
Step B4: construct the joint probability distribution of the topic model, in which p(A|B) denotes the probability of A under condition B, and an arrow over a symbol denotes a vector; the quantities involved are: the topic-to-word distributions; the article-to-topic distributions; the citation distributions of the articles; the distributions of original words within the articles; the word frequencies under the k-th topic, where K denotes the number of topics; the topic frequencies of the m-th article, where M is the number of articles; the citation frequencies of the m-th article; the frequency of non-original words and the frequency of original words in the m-th article; and B(p, q), which denotes the Beta distribution with parameters p and q;
Δ () is defined as:
Wherein,For vectorDimension, Γ be Gamma function, AkIndicate vectorK-th of component.
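The generative process of steps B1 to B3 can be sketched as follows (a hedged illustration: the function and argument names, the fixed Beta hyperparameter pair (2, 2), and the uniform document length are assumptions made for brevity, not part of the invention):

```python
import numpy as np

def generate_corpus(M, K, V, refs, alpha=0.5, beta=0.1, eta=0.5,
                    n_words=50, seed=0):
    """A minimal sketch of the generative process of steps B1-B3.

    M articles, K topics, a vocabulary of V words; refs[m] lists the indices
    of the articles cited by article m. All names here are illustrative."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)      # step B1: topic-word parameters
    theta = rng.dirichlet([alpha] * K, size=M)   # step B2: article-topic parameters
    docs = []
    for m in range(M):
        lam = rng.beta(2.0, 2.0)                 # Bernoulli parameter lambda_m
        n_refs = len(refs[m])
        delta = rng.dirichlet([eta] * n_refs) if n_refs else None  # citation strengths
        words = []
        for _ in range(n_words):                 # step B3: one word at a time
            s = rng.random() < lam               # originality indicator s_{m,n}
            if s and n_refs:                     # s=1: topic borrowed from a cited article
                c = refs[m][rng.choice(n_refs, p=delta)]
                z = rng.choice(K, p=theta[c])
            else:                                # s=0: topic from this article itself
                z = rng.choice(K, p=theta[m])
            words.append(int(rng.choice(V, p=phi[z])))
        docs.append(words)
    return docs
```

The key design point mirrors step B3: a word's topic is drawn either from the article's own topic distribution or from the topic distribution of one of its cited articles, which is exactly how the model couples the topic structure to the citation network.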
Preferably, the step C comprises:
Step C1: parameter estimation is carried out using the following Gibbs sampling formula:
wherein a vector carrying the subscript ¬(m,n) denotes that vector with the component corresponding to position (m,n) removed, and the symbol ∝ means "is proportional to". The quantities appearing in the formula are: the frequency with which the word w_{m,n} appears under the topic z_{m,n}; the component of the vector β corresponding to w_{m,n}, and β_t, the t-th component of β; V, the total number of words; the frequency with which the t-th word appears under z_{m,n}; the frequencies, within c_{m,n}, of the words whose topic is z_{m,n} (respectively the k-th topic) and for which s_{m,n}=0, and likewise for s_{m,n}=1; α_k, the k-th component of the vector α; the number of words in the m-th article that come from c_{m,n}; L_m, the total number of articles cited by the m-th article; the number of words in the m-th article that come from the r-th cited article, and η_r, the r-th component of the vector η; the frequencies of all non-original words and of all original words in the m-th article; and the frequencies, within the m-th article, of the words whose topic is z_{m,n} (respectively the k-th topic) and for which s_{m,n}=0, and likewise for s_{m,n}=1.
Preferably, the step D comprises:
Step D1: initialization; for each word w_{m,n} of every article in the new corpus, randomly sample the originality indicator s_{m,n} based on the Bernoulli distribution; if the sampling gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the references of the article currently being sampled, based on the multinomial distribution; randomly assign a topic z_{m,n} to the word w_{m,n} currently being sampled, based on the multinomial distribution;
Step D2: rescan the new corpus; for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formula; if the newly sampled value is s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n} as well; otherwise, skip the sampling of c_{m,n}; sample the topic z_{m,n} of w_{m,n} and update it in the new corpus;
wherein step D2 is executed repeatedly, and once the Gibbs sampling has converged, step D3 is executed;
Step D3: according to the statistics of the new corpus, namely the proportion of words with s_{m,n}=1 in each article, the frequency of occurrence of each reference in each article, the frequency of occurrence of each topic in each article, and the frequency with which each word occurs under each topic, obtain respectively the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
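As a simplified stand-in for the sampler of steps D1-D3, the sketch below implements the collapsed Gibbs update of plain LDA, dropping the citation variables s and c for brevity; the loop structure (random initialization, repeated sweeps, read-off from the counts) is the same as in the method above:

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Steps D1-D3 specialized to plain LDA: random topic initialization,
    repeated Gibbs sweeps with the standard collapsed update, then
    parameter read-off from the accumulated counts."""
    rng = np.random.default_rng(seed)
    nmk = np.zeros((len(docs), K))               # per-article topic counts
    nkt = np.zeros((K, V))                       # per-topic word counts
    nk = np.zeros(K)
    z = []
    for m, words in enumerate(docs):             # step D1: random initialization
        zm = rng.integers(K, size=len(words))
        z.append(zm)
        for w, k in zip(words, zm):
            nmk[m, k] += 1; nkt[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):                      # step D2: repeated full sweeps
        for m, words in enumerate(docs):
            for n, w in enumerate(words):
                k = z[m][n]                      # remove the word's current topic
                nmk[m, k] -= 1; nkt[k, w] -= 1; nk[k] -= 1
                p = (nmk[m] + alpha) * (nkt[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum()) # resample from the conditional
                z[m][n] = k
                nmk[m, k] += 1; nkt[k, w] += 1; nk[k] += 1
    theta = (nmk + alpha) / (nmk.sum(1, keepdims=True) + K * alpha)  # step D3
    phi = (nkt + beta) / (nkt.sum(1, keepdims=True) + V * beta)
    return theta, phi
```

In the full method of the invention, the conditional of the inner loop would additionally resample s_{m,n} and, when s_{m,n}=1, the cited article c_{m,n}, using the Gibbs formula of step C1.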
Preferably, the step D further comprises:
for a new article d_new added to the new corpus, computing the topic distribution θ_new, the citation strength distribution δ_new, and the originality index λ_new of this article d_new, which specifically includes the steps:
Step D401: initialization; for each word w_{m,n} of the current article d_new, randomly assign the originality indicator s_{m,n} based on the Bernoulli distribution; if the sampling gives s_{m,n}=1, randomly extract one cited article c_{m,n} from the references of this article d_new, based on the multinomial distribution; randomly assign a topic z_{m,n} to the word w_{m,n}, based on the multinomial distribution;
Step D402: rescan the current article d_new; for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formula; if the newly sampled value is s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n} as well; otherwise, skip the sampling of c_{m,n}; sample the topic z_{m,n} of w_{m,n} and update it in the new corpus;
wherein step D402 is executed repeatedly, and once the Gibbs sampling has converged, step D403 is executed;
Step D403: compute the topic distribution θ_new of the current article d_new, the proportion λ_new of words in article d_new with s_{m,n}=1, and the occurrence distribution δ_new of the article's references.
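Steps D401-D403 can likewise be illustrated by the plain-LDA "fold-in" procedure below, which estimates θ_new for a newly added article while holding the trained topic-word matrix fixed (again omitting the citation variables for brevity; names are illustrative):

```python
import numpy as np

def fold_in(new_doc, phi, alpha=0.1, n_iter=50, seed=0):
    """Steps D401-D403 specialized to plain LDA: estimate theta_new for a
    new article by Gibbs sampling against the already-trained topic-word
    matrix phi, which stays fixed during the sweeps."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    nk = np.zeros(K)                         # topic counts inside the new article
    z = rng.integers(K, size=len(new_doc))   # step D401: random initialization
    for k in z:
        nk[k] += 1
    for _ in range(n_iter):                  # step D402: repeated sweeps
        for n, w in enumerate(new_doc):
            nk[z[n]] -= 1
            p = (nk + alpha) * phi[:, w]     # conditional with phi held fixed
            z[n] = rng.choice(K, p=p / p.sum())
            nk[z[n]] += 1
    return (nk + alpha) / (len(new_doc) + K * alpha)   # step D403: theta_new
```

Because phi is not updated, a new article can be folded in without resampling the whole corpus, which is what makes the method applicable to dynamically growing databases.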
Preferably, the step E comprises:
obtaining the relevant parameters using the following formulas:
wherein θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relation between the m-th article and the c-th article; the counts involved are: the frequency of words in the m-th article whose topic is the k-th topic; the frequency with which the t-th word appears in the k-th topic, normalized over the V words of the k-th topic; and the frequency of words in the m-th article drawn from the cited article c, normalized over all articles cited by the m-th article.
In a preferred technical solution: the effective keywords of the corpus are extracted, and these effective keywords are treated as the objects of abstraction; the number of topics used in article extraction, the strength of the topic distribution, and the strength of the article citation distribution may be determined by user demand or preset by the system. It is assumed that the topic source of each word in every article is random: the topic is generated either by the topic distribution of the article itself or by the topic distribution of some article cited by this article.
The probabilistic model of text generation includes the following assumptions:
(1) the topic of each word in every article obeys a multinomial distribution, and its prior obeys a Dirichlet distribution;
(2) the different words under each topic obey a multinomial distribution, and its prior obeys a Dirichlet distribution;
(3) the citation source of each word in every article obeys a multinomial distribution, and its prior obeys a Dirichlet distribution;
(4) the originality of each word in every article obeys a Bernoulli distribution, and its prior obeys a Beta distribution;
wherein, for these probabilistic model assumptions, the parameters of the prior distributions are determined by the average article length, the number of topics, and the average number of articles cited per article.
Compared with the prior art, the present invention has the following beneficial effects:
1. Starting from the above problems of the prior art, the present invention approaches article feature extraction from a new angle: it improves the accuracy of article feature extraction and can extract from articles information that traditional feature extraction systems do not take into account.
2. The present invention extends the traditional topic model with the information of the citation network, allowing the model to extract article features by combining both kinds of information. It is applicable not only to databases with large data volumes but also to dynamically growing databases, and it can extract information that previous topic models could not, such as the strength of the citation relations between articles and the originality index of an article.
3. The present invention exploits the sparsity of the article topic distributions, of the word distributions within topics, and of the article citation distributions to reduce sampling complexity.
Detailed description of the invention
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of the non-limiting embodiments with reference to the following drawings:
Fig. 1 is a sample of the original article data.
Fig. 2 is the generative process of the novel topic model.
Fig. 3 is a flow chart of the method of the present invention.
Specific embodiment
The present invention is described in detail combined with specific embodiments below.Following embodiment will be helpful to the technology of this field
Personnel further understand the present invention, but the invention is not limited in any way.It should be pointed out that the ordinary skill of this field
For personnel, without departing from the inventive concept of the premise, several changes and improvements can also be made.These belong to the present invention
Protection scope.
The present invention extracts article feature using original method.Present invention uses article citation networks to extend tradition
Topic model, allow topic model to extract article feature using topic model and citation network simultaneously, to extract
More accurate article feature.Key step of the invention includes:
Step A: construct the citation relationship network of the articles based on the original corpus, set the initial article set, and obtain a new corpus according to the citation relationship network;
Step B: for the new corpus, construct the generative model and the joint parameter expression of the topic model;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract the article parameters from the results of sampling the articles.
The article feature extraction method designed by the present invention involves five core components: the automated procedure for collating the citation relationship network; the derivation of the generative model and the joint expression of the novel topic model that incorporates the citation network; the derivation of the inference process of the novel topic model; the derivation of its sampling algorithm; and the parameter estimation of the novel topic model. The method provided by the invention comprises the following steps:
Regarding step A: the citation relationship network of the articles (e.g. papers) is generated automatically from a large original corpus and written to a file. The corpus contains two kinds of information: one part is information about each article itself, including its title, authors, and abstract; the other part is the citation relations between articles, for example that article A cites article B and article A cites article C.
Academic data on the internet is vast and grows every year by millions of items. In the present invention, therefore, based on an existing original corpus in XML and JSON format, the article title, article abstract, and article references are extracted for each article in the original corpus; the initial article set is then fixed, the maximum connected component is obtained according to the citation relations of the academic articles, and it is exported as the new corpus.
The format of the existing original article corpus is shown in Table 1 and Fig. 1.
Table 1: specification of the original article data storage format.
In the step A, the step of constructing the citation relationship network of the articles based on the original corpus comprises:
Step A1: set the vertex set V to the empty set, set the edge set E to the empty set, and let the graph G be the pair (V, E);
Step A2: for each article in the original corpus, add the current article node u to the vertex set V, and add all citation relations of the current article node u to the edge set E;
Step A3: take the graph G obtained in step A2 as the citation relationship network.
In the step A, the step of setting the initial article set and obtaining the new corpus according to the citation relationship network comprises: automatically obtaining the maximum connected component according to the citation relationship network, thereby obtaining the new corpus; it specifically includes:
Step A4: set the vertex set V to the initial known vertex set V0, set the edge set E to the initial known edge set E0, and let the graph G be the pair (V, E);
Step A5: repeatedly search for a vertex v of the original corpus that is not in the vertex set V; whenever such a vertex v exists and v has a citation relation with some vertex in V, add v to V and add the citation relations of v to E; stop when V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in step A5 as the new corpus.
Regarding step B: a traditional topic model takes the word-frequency characteristics of each article as its topic features; the topic model used in the present invention additionally covers the relations between articles, i.e. the article citation relationship network. The topic model contains two cores: the generative model (described in detail in step B) and the inference process (described in detail in step C). The generative model is the model that, under the condition of known parameters, we assume the article generation process obeys; the graphical model corresponding to the generative model of the articles is shown in Fig. 2.
The step B comprises:
Step B1: for each topic of the new corpus, execute the following step:
generate the multinomial parameter φ_k of the distribution of the k-th topic over words based on the Dirichlet hyperparameter β; wherein β is the parameter of the Dirichlet distribution that φ_k obeys, and k is a positive integer;
Step B2: for each article of the new corpus, execute the following steps:
generate the multinomial parameter θ_m of the distribution of the m-th article over topics based on the Dirichlet hyperparameter α; wherein α is the parameter of the Dirichlet distribution that θ_m obeys, and m is a positive integer;
generate the multinomial parameter δ_m of the citation strength distribution of the m-th article based on the Dirichlet hyperparameter η; wherein η is the parameter of the Dirichlet distribution that δ_m obeys;
generate the Bernoulli parameter λ_m of the originality index of the m-th article based on the pair of Beta hyperparameters; wherein these two hyperparameters are the parameters of the Beta distribution that λ_m obeys; it will be appreciated by those skilled in the art that the Beta distribution itself requires two hyperparameters, and that these two hyperparameters can be interchanged;
Step B3: for each word of each article, execute the following steps:
generate the originality indicator s_{m,n} of the n-th word of the m-th article, obeying the Bernoulli distribution with parameter λ_m; n is a positive integer;
if s_{m,n} is 1, generate the cited article c_{m,n} obeying the multinomial distribution with parameter δ_m, generate the topic z_{m,n} obeying the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generate the topic z_{m,n} obeying the multinomial distribution with parameter θ_m, and generate the word w_{m,n} obeying the multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the article-to-topic distribution matrix and φ the topic-to-word distribution matrix; w_{m,n} denotes the n-th word of the m-th article, z_{m,n} the topic of the n-th word of the m-th article, and c_{m,n} the article cited by the n-th word of the m-th article when that word is non-original;
Step B4: construct the joint probability distribution of the topic model, in which p(A|B) denotes the probability of A under condition B, and an arrow over a symbol denotes a vector; the quantities involved are: the topic-to-word distributions; the article-to-topic distributions; the citation distributions of the articles; the distributions of original words within the articles; the word frequencies under the k-th topic, where K denotes the number of topics; the topic frequencies of the m-th article, where M is the number of articles; the citation frequencies of the m-th article; the frequency of non-original words and the frequency of original words in the m-th article; and B(p, q), which denotes the Beta distribution with parameters p and q;
Δ () is defined as:
Wherein,For vectorDimension, Γ be Gamma function, AkIndicate vectorK-th of component.
Regarding step C: the inference process is used to estimate the parameters of the generative model. In practice, we know the words in the articles and wish to derive the hidden parameters from them, which must be done by the methods of statistical inference. For the novel topic model proposed here, conventional optimization methods cannot solve the maximum likelihood estimation problem, so we perform parameter estimation with a method known as Gibbs sampling.
The step C comprises:
Step C1: parameter estimation is carried out using the following Gibbs sampling formula:
wherein a vector carrying the subscript ¬(m,n) denotes that vector with the component corresponding to position (m,n) removed, and the symbol ∝ means "is proportional to". The quantities appearing in the formula are: the frequency with which the word w_{m,n} appears under the topic z_{m,n}; the component of the vector β corresponding to w_{m,n}, and β_t, the t-th component of β; V, the total number of words; the frequency with which the t-th word appears under z_{m,n}; the frequencies, within c_{m,n}, of the words whose topic is z_{m,n} (respectively the k-th topic) and for which s_{m,n}=0, and likewise for s_{m,n}=1; α_k, the k-th component of the vector α; the number of words in the m-th article that come from c_{m,n}; L_m, the total number of articles cited by the m-th article; the number of words in the m-th article that come from the r-th cited article, and η_r, the r-th component of the vector η; the frequencies of all non-original words and of all original words in the m-th article; and the frequencies, within the m-th article, of the words whose topic is z_{m,n} (respectively the k-th topic) and for which s_{m,n}=0, and likewise for s_{m,n}=1.
Wherein the subscripts attached to the hyperparameter symbols index the components of the corresponding prior distribution parameters.
Regarding step D: the sampling algorithm is designed according to the inference process of the novel topic model and is used to sample the article database; from it, the complete inference procedure can be written out.
Step D comprises:
Step D1: initialization; for each word w_{m,n} in every article of the new corpus, randomly sample an originality indicator s_{m,n} based on a Bernoulli distribution; if the sample gives s_{m,n}=1, randomly draw one cited article c_{m,n} from the references of the article currently being sampled, based on a multinomial distribution; for the word w_{m,n} currently being sampled, randomly assign a topic z_{m,n} based on a multinomial distribution;
Step D2: rescan the new corpus; for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formula; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}; otherwise skip the sampling of the cited article c_{m,n}; resample the topic z_{m,n} of w_{m,n} and update it in the new corpus;
wherein step D2 is repeated until the Gibbs sampler converges;
Step D3: according to the proportion of words for which s_{m,n}=1 in each article of the new corpus, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic, respectively obtain the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic;
For one new article d_new (i.e., an article newly added to the new corpus), the topic distribution θ_new, the citation strength distribution δ_new, and the originality index λ_new of this article are counted, specifically comprising the steps of:
Step D401: initialization; for each word w_{m,n} in the current article d_new, randomly assign an originality indicator s_{m,n} based on a Bernoulli distribution; if the sample gives s_{m,n}=1, randomly draw one cited article c_{m,n} from the references of this article d_new based on a multinomial distribution; for the word w_{m,n}, randomly assign a topic z_{m,n} based on a multinomial distribution;
Step D402: rescan the current article d_new; for each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formula; if the new sample gives s_{m,n}=1, resample the cited article c_{m,n} corresponding to w_{m,n}; otherwise skip the sampling of the cited article c_{m,n}; resample the topic z_{m,n} of w_{m,n} and update it in the new corpus;
wherein step D402 is repeated until the Gibbs sampler converges;
Step D403: count the topic distribution of the current article d_new, which is θ_new; count the proportion of words in d_new for which s_{m,n}=1, which is λ_new; count the distribution of the article's citations, which is δ_new.
Regarding step E: after the topic model has converged (for example, cyclic sampling is performed according to the Gibbs sampling algorithm of step D, and after a sufficient number of sampling iterations the model parameters can be regarded as converged), the relevant parameters are obtained using the following formulas:
wherein θ_{m,k} is the distribution probability of the m-th article over the k-th topic; φ_{k,t} is the distribution probability of the k-th topic over the t-th word; λ_m is the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the number of words in the m-th article whose topic is the k-th topic; n_k^{(t)} denotes the number of times the t-th word appears in the k-th topic; n_k^{(·)} denotes Σ_{t=1}^{V} n_k^{(t)}, where V denotes the number of words of the k-th topic; n_m^{(c)} denotes the number of words in the m-th article that cite the c-th article; n_m^{(·)} denotes Σ_c n_m^{(c)}.
Here the superscript (·) means that the term is summed over that index, for example n_k^{(·)} = Σ_{t=1}^{V} n_k^{(t)}.
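The normalisations of step E can be illustrated in code. The patent's exact formulas are given as images and are not reproduced; the sketch below assumes plain smoothed-count normalisation (Dirichlet-posterior means for θ, φ, δ and a raw proportion for λ), so the function, its arguments, and the smoothing choice are all illustrative assumptions.

```python
def extract_parameters(z_counts_doc, w_counts_topic, cite_counts,
                       s1_counts, token_counts, alpha, beta, eta):
    """Hedged sketch of step E: recover θ, φ, δ, λ from Gibbs counts.
    z_counts_doc[m][k]  : words in article m assigned topic k
    w_counts_topic[k][t]: occurrences of word t under topic k
    cite_counts[m][c]   : words in article m attributed to citation c
    s1_counts[m]        : words in article m with s=1
    token_counts[m]     : total words in article m"""
    def normalise(row, prior):
        denom = sum(row) + prior * len(row)
        return [(x + prior) / denom for x in row]

    theta = [normalise(row, alpha) for row in z_counts_doc]      # article-topic
    phi = [normalise(row, beta) for row in w_counts_topic]       # topic-word
    delta = [normalise(row, eta) if row else [] for row in cite_counts]  # citation strength
    lam = [s1 / max(total, 1) for s1, total in zip(s1_counts, token_counts)]  # originality index
    return theta, phi, delta, lam
```

Each returned row is a proper probability distribution, and λ is simply the proportion of words with s_{m,n}=1 counted in step D3.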
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the above particular embodiments; those skilled in the art may make various changes or modifications within the scope of the claims, without affecting the substantive content of the invention. In the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another in any manner.
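The generative process of steps B1-B3 can be sketched end-to-end. In this sketch, Dirichlet draws are built from gamma variates, all hyperparameter values are illustrative, and the convention that λ_m gives the probability that a word is non-original (s=1) is an assumption made for illustration.

```python
import random

def generate_corpus(M, K, V, refs, n_words, seed=0):
    """Illustrative sketch of the generative model of steps B1-B3.
    refs[m] lists the indices of the articles cited by article m."""
    rng = random.Random(seed)

    def dirichlet(dim, conc):
        # Dirichlet sample via normalised gamma variates
        xs = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
        s = sum(xs)
        return [x / s for x in xs]

    def categorical(probs):
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i
        return len(probs) - 1

    phi = [dirichlet(V, 0.01) for _ in range(K)]            # B1: topic-word params
    theta = [dirichlet(K, 0.1) for _ in range(M)]           # B2: article-topic params
    delta = [dirichlet(len(refs[m]), 1.0) if refs[m] else []
             for m in range(M)]                             # B2: citation strengths
    lam = [rng.betavariate(2.0, 2.0) for _ in range(M)]     # B2: originality indices

    corpus = []
    for m in range(M):
        words = []
        for _ in range(n_words):
            # B3: originality indicator (assumed: P(s=1) = lam[m])
            s = 1 if (refs[m] and rng.random() < lam[m]) else 0
            if s == 1:
                c = refs[m][categorical(delta[m])]          # pick a cited article
                z = categorical(theta[c])                   # topic from cited article
            else:
                z = categorical(theta[m])                   # topic from own article
            words.append(categorical(phi[z]))               # word from topic
        corpus.append(words)
    return corpus
```

Running the sketch yields a synthetic corpus whose tokens mix each article's own topics with topics borrowed from its cited articles, which is exactly the structure the inference of steps C-E inverts.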
Claims (5)
1. An article feature extraction method based on a topic model, characterized by comprising:
Step A: constructing a citation network of articles based on an original corpus, setting an initial article set and expanding it according to the citation relationships of the network, to obtain a new corpus;
Step B: for the new corpus, constructing the generative model and the joint parameter expression of the topic model;
Step C: constructing the inference process of the topic model according to the generative model;
Step D: sampling the articles of the new corpus according to the inference process of the topic model;
Step E: extracting article parameters according to the results of sampling the articles;
wherein step B comprises:
Step B1: for each topic of the new corpus, executing the following step:
generating, based on the Dirichlet hyperparameter β, the multinomial parameter φ_k of the distribution of the k-th topic over words; wherein β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer;
Step B2: for each article of the new corpus, executing the following steps:
generating, based on the Dirichlet hyperparameter α, the multinomial parameter θ_m of the distribution of the m-th article over topics; wherein α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer;
generating, based on the Dirichlet hyperparameter η, the multinomial parameter δ_m of the citation strength distribution of the m-th article; wherein η is the parameter of the Dirichlet distribution obeyed by δ_m;
generating, based on the hyperparameter pair of a Beta distribution, the Bernoulli parameter λ_m of the originality index of the m-th article; wherein the hyperparameter pair comprises the parameters of the Beta distribution obeyed by λ_m;
Step B3: for each word in each article, executing the following steps:
generating the originality indicator s_{m,n} of the n-th word of the m-th article, obeying a Bernoulli distribution with parameter λ_m; n is a positive integer;
if s_{m,n} is 1, generating a cited article c_{m,n} obeying a multinomial distribution with parameter δ_m, generating a topic z_{m,n} obeying a multinomial distribution with parameter θ_{c_{m,n}}, and generating a word w_{m,n} obeying a multinomial distribution with parameter φ_{z_{m,n}};
if s_{m,n} is 0, generating a topic z_{m,n} obeying a multinomial distribution with parameter θ_m, and generating a word w_{m,n} obeying a multinomial distribution with parameter φ_{z_{m,n}};
wherein θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ denotes the distribution matrix of articles over topics, and φ denotes the distribution matrix of topics over words; w_{m,n} denotes the n-th word in the m-th article, z_{m,n} denotes the topic of the n-th word in the m-th article, and c_{m,n} denotes the article cited by the n-th word in the m-th article when that word is a non-original word;
Step B4: constructing the joint probability distribution of the topic model as follows:
wherein p(A|B) denotes the probability of A under condition B, and the symbol → denotes a vector; φ is the topic-to-word distribution, θ is the article-to-topic distribution, δ is the citation distribution of the articles, and λ is the distribution of original words in the articles; n_k is the word frequency under the k-th topic, and K denotes the number of topics; n_m is the topic frequency of the m-th article, and M is the number of articles; n_m^{cite} is the citation frequency of the m-th article; n_m^{(s=1)} is the number of non-original words in the m-th article, and n_m^{(s=0)} is the number of original words in the m-th article; B(p, q) denotes the Beta distribution with parameters p and q;
Δ(·) is defined as
Δ(A) = (∏_{k=1}^{dim A} Γ(A_k)) / Γ(∑_{k=1}^{dim A} A_k),
wherein dim A is the dimension of the vector A, Γ is the Gamma function, and A_k denotes the k-th component of A;
wherein step C comprises:
Step C1: performing parameter estimation using the following Gibbs sampling formula:
wherein z_{¬(m,n)} denotes the vector z with the component corresponding to z_{m,n} removed; the symbol ∝ means "is proportional to"; n_{z_{m,n}}^{(w_{m,n})} denotes the number of times the word w_{m,n} appears under the topic z_{m,n}; β_{w_{m,n}} denotes the component of the vector β corresponding to w_{m,n}; V denotes the total number of words; n_{z_{m,n}}^{(t)} denotes the number of times the t-th word appears under z_{m,n}; β_t denotes the t-th component of the vector β; n_{c_{m,n}}^{(z_{m,n},s=0)} denotes the number of words in c_{m,n} whose topic is z_{m,n} and for which s_{m,n}=0; n_{c_{m,n}}^{(z_{m,n},s=1)} denotes the number of words in c_{m,n} whose topic is z_{m,n} and for which s_{m,n}=1; α_{z_{m,n}} denotes the component of the vector α corresponding to z_{m,n}; n_{c_{m,n}}^{(k,s=0)} denotes the number of words in c_{m,n} whose topic is the k-th topic and for which s_{m,n}=0; n_{c_{m,n}}^{(k,s=1)} denotes the number of words in c_{m,n} whose topic is the k-th topic and for which s_{m,n}=1; α_k denotes the k-th component of the vector α; c_{¬(m,n)} denotes the vector c with the component corresponding to c_{m,n} removed; n_m^{(c_{m,n})} denotes the number of words in the m-th article drawn from c_{m,n}; η_{c_{m,n}} denotes the component of the vector η corresponding to c_{m,n}; L_m denotes the total number of articles cited by the m-th article; n_m^{(r)} denotes the number of words in the m-th article drawn from the r-th cited article; η_r denotes the r-th component of the vector η; s_{¬(m,n)} denotes the vector s with the component corresponding to s_{m,n} removed; n_m^{(z_{m,n},s=0)} denotes the number of words in the m-th article whose topic is z_{m,n} and for which s_{m,n}=0; n_m^{(z_{m,n},s=1)} denotes the number of words in the m-th article whose topic is z_{m,n} and for which s_{m,n}=1; n_m^{(k,s=0)} denotes the number of words in the m-th article whose topic is the k-th topic and for which s_{m,n}=0; n_m^{(k,s=1)} denotes the number of words in the m-th article whose topic is the k-th topic and for which s_{m,n}=1; n_m^{(s=1)} denotes the number of all non-original words in the m-th article, and n_m^{(s=0)} denotes the number of all original words in the m-th article.
2. The article feature extraction method based on a topic model according to claim 1, characterized in that step D comprises:
Step D1: initialization; for each word w_{m,n} in every article of the new corpus, randomly sampling an originality indicator s_{m,n} based on a Bernoulli distribution; if the sample gives s_{m,n}=1, randomly drawing one cited article c_{m,n} from the references of the article currently being sampled, based on a multinomial distribution; for the word w_{m,n} currently being sampled, randomly assigning a topic z_{m,n} based on a multinomial distribution;
Step D2: rescanning the new corpus; for each word w_{m,n}, resampling the originality indicator s_{m,n} according to the Gibbs sampling formula; if the new sample gives s_{m,n}=1, resampling the cited article c_{m,n} corresponding to w_{m,n}; otherwise skipping the sampling of the cited article c_{m,n}; resampling the topic z_{m,n} of w_{m,n} and updating it in the new corpus;
wherein step D2 is executed repeatedly until the Gibbs sampler converges, whereupon step D3 is executed;
Step D3: according to the proportion of words for which s_{m,n}=1 in each article of the new corpus, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic, respectively obtaining the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
3. The article feature extraction method based on a topic model according to claim 2, characterized in that step D further comprises:
for one new article d_new added to the new corpus, counting the topic distribution θ_new, the citation strength distribution δ_new, and the originality index λ_new of this article d_new, specifically comprising the steps of:
Step D401: initialization; for each word w_{m,n} in the current article d_new, randomly assigning an originality indicator s_{m,n} based on a Bernoulli distribution; if the sample gives s_{m,n}=1, randomly drawing one cited article c_{m,n} from the references of this article d_new based on a multinomial distribution; for the word w_{m,n}, randomly assigning a topic z_{m,n} based on a multinomial distribution;
Step D402: rescanning the current article d_new; for each word w_{m,n}, resampling the originality indicator s_{m,n} according to the Gibbs sampling formula; if the new sample gives s_{m,n}=1, resampling the cited article c_{m,n} corresponding to w_{m,n}; otherwise skipping the sampling of the cited article c_{m,n}; resampling the topic z_{m,n} of w_{m,n} and updating it in the new corpus;
wherein step D402 is executed repeatedly until the Gibbs sampler converges, whereupon step D403 is executed;
Step D403: counting the topic distribution θ_new of the current article d_new, counting the proportion λ_new of words in d_new for which s_{m,n}=1, and counting the appearance distribution δ_new of the article's citations.
4. The article feature extraction method based on a topic model according to claim 1, characterized in that step E comprises:
obtaining the relevant parameters using the following formulas:
wherein θ_{m,k} is the distribution probability of the m-th article over the k-th topic; φ_{k,t} is the distribution probability of the k-th topic over the t-th word; λ_m is the Bernoulli parameter of the originality index of the m-th article; δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} denotes the number of words in the m-th article whose topic is the k-th topic; n_k^{(t)} denotes the number of times the t-th word appears in the k-th topic; n_k^{(·)} denotes Σ_{t=1}^{V} n_k^{(t)}, where V denotes the total number of words of the k-th topic; n_m^{(c)} denotes the number of words in the m-th article that cite the c-th article; n_m^{(·)} denotes Σ_c n_m^{(c)}.
5. The article feature extraction method based on a topic model according to claim 1, characterized in that step A comprises:
Step A1: setting the vertex set V to the empty set, setting the edge set E to the empty set, and setting the graph G to the pair (V, E);
Step A2: for each article in the original corpus, adding the current article as a node u to the vertex set V, and adding all citation relationships of the current article node u to the edge set E;
Step A3: taking the graph G obtained by step A2 as the citation network;
Step A4: setting the vertex set V to the initial known point set V_0, setting the edge set E to the initial known edge set E_0, and setting the graph G to the pair (V, E);
Step A5: repeatedly searching the original corpus for a point v not in the vertex set V; if such a point v exists and there is a citation relationship between v and a point in the vertex set V, adding v to the vertex set V and adding the citation relationships of v to E; repeating until V and E no longer change;
Step A6: exporting the corpus corresponding to the graph G obtained by step A5 as the new corpus.
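The citation-network construction and expansion of steps A1-A6 can be sketched as follows. Whether citation edges are treated as directed is not specified above; the sketch assumes undirected adjacency (a citation in either direction connects two articles), and the function and variable names are illustrative.

```python
def build_new_corpus(all_articles, citations, initial_set):
    """Sketch of steps A1-A6: build the citation network over the whole
    corpus (A1-A3), grow the initial article set by repeatedly pulling in
    articles connected by citation edges (A4-A5), and return the expanded
    set as the new corpus (A6). `citations` maps an article id to the ids
    it cites."""
    # A1-A3: citation network as an undirected adjacency map
    adj = {u: set() for u in all_articles}
    for u, cited in citations.items():
        for v in cited:
            if v in adj:
                adj[u].add(v)
                adj[v].add(u)
    # A4-A5: expand the initial known set until it no longer changes
    V = set(initial_set)
    changed = True
    while changed:
        changed = False
        for v in all_articles:
            if v not in V and adj[v] & V:
                V.add(v)
                changed = True
    # A6: the articles in V form the new corpus
    return V
```

The fixed-point loop mirrors step A5's "until V and E no longer change" condition; the edge set of the expanded graph is implicit in `adj` restricted to V.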
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511016955.7A CN105631018B (en) | 2015-12-29 | 2015-12-29 | Article Feature Extraction Method based on topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105631018A CN105631018A (en) | 2016-06-01 |
CN105631018B true CN105631018B (en) | 2018-12-18 |