CN105631018A - Article feature extraction method based on topic model - Google Patents


Info

Publication number
CN105631018A
CN105631018A
Authority
CN
China
Prior art keywords
article
word
theme
represent
section
Prior art date
Legal status
Granted
Application number
CN201511016955.7A
Other languages
Chinese (zh)
Other versions
CN105631018B (en)
Inventor
沈嘉明
宋振宇
李世韬
毛宇宁
谈兆炜
朱鸿儒
王乐群
郭运奇
王彪
傅洛伊
王新兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201511016955.7A priority Critical patent/CN105631018B/en
Publication of CN105631018A publication Critical patent/CN105631018A/en
Application granted granted Critical
Publication of CN105631018B publication Critical patent/CN105631018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Abstract

The invention provides an article feature extraction method based on a topic model. The method comprises the following steps: building an article citation network from a raw corpus; constructing the generative model of the topic model and its joint parameter expression; constructing the inference process of the topic model from the generative model; sampling the articles of the new corpus according to the inference process; and extracting article parameters from the sampling results. By extending the traditional topic model with the article citation network, the method extracts more accurate article features.

Description

Article feature extraction method based on a topic model
Technical field
The present invention relates to the field of article feature extraction, and more specifically to an article feature extraction method based on a topic model, in particular a topic-model-based feature extraction method that incorporates a curated citation network.
Background technology
Scientific research is a strategic support for improving social productivity and overall national strength, and countries around the world attach great importance to investment in it. China places science and technology research and development at the core of its national development strategy, and government expenditure on scientific research has grown steadily. In 2012, China's research and development investment (covering both industry and academia) exceeded one trillion yuan, reaching 1,029.84 billion yuan, a level comparable to moderately developed countries.
Academic articles are among the most direct outputs of scientific research. According to statistics, from 2004 to 2014 Chinese researchers published a total of 1.3698 million scientific and technical articles worldwide, ranking second in the world; these articles were cited a total of 10.3701 million times, ranking fourth. Research practice shows that academic articles are an extremely important information resource for carrying out scientific research or pursuing further investigation. However, faced with the vast documentary resources of the information age, retrieving the academic resources one needs quickly and accurately is a genuinely important and challenging task for researchers.
To meet the demand for academic search and recommendation, Google launched a beta version of its academic search engine in 2004, providing free academic literature services to researchers worldwide; in 2006, Microsoft launched the academic search engine Microsoft Academic Search. Although these general-purpose academic search engines rely on the search technology of the commercial search companies behind them, their results are in fact unsatisfactory. For a user's query they still return results as a flat list of articles. They emphasize the accuracy of retrieval, matching articles exactly against the user's keywords, but pay no attention to an article's position within its field or to the development trend of its topic. For researchers, however, obtaining the frontier achievements and seminal contributions of a discipline is often more important than exact title matching. For example, users who are new to a research field are often unsure what kind of literature they need; their search keywords are usually rough themes or topics. With the general-purpose engines above, such users often cannot quickly and effectively discover the frontier achievements and important contributions of the relevant discipline, and the results they obtain are unsatisfactory.
Clearly, building a highly effective academic search and recommendation system is of considerable significance: it helps researchers obtain the resources they need, keep abreast of developments in their discipline, and improve their own research capability, thereby strengthening the nation's research strength. For precisely this reason, academic search and recommendation systems have gradually attracted attention in recent years. Since 2000, the number of articles on article search and recommendation systems has risen year by year; according to incomplete statistics, more than 30 related articles were published in 2013 alone. Nevertheless, research on academic search and recommendation systems is still at an early stage.
In building an academic search system, an important task is to extract article features from large-scale article datasets and citation network data, such as the topic of each article, its academic contribution, the strength of citation relationships between articles, and the feature words corresponding to each topic.
To date, the main research directions in article feature extraction, both domestically and abroad, include: analyzing the semantics of an article to recommend other articles with similar topics; and modeling and analyzing the article citation network to derive the importance of articles.
At present, article feature extraction methods based on topic analysis include: using a topic model (such as the LDA algorithm) to analyze article topics and introducing topic similarity into the collaborative filtering of a recommendation system; combining a topic model with a language model to find articles on similar topics; and modeling the topics of word groups based on the LDA algorithm. Article feature extraction methods based on the citation network include: using the HITS algorithm on a bipartite graph of articles and terms to compute article authority values; using the citation network to compute and recommend author authority values; and using the PageRank algorithm, combined with journal quality and the citation network, to compute PageRank values for articles.
However, these research results either ignore the model's scalability to large article databases, or focus only on citation network information while neglecting the extraction of article text information, or conversely consider only the text information and neglect the citation network. The practical value of their final results is therefore limited.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an article feature extraction method based on a topic model.
The article feature extraction method based on a topic model provided by the invention comprises:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
Preferably, Step A comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
Preferably, Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
Preferably, Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}.
Preferably, Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D3;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
Preferably, Step D further comprises:
For a new article d_new added to the new corpus, compute its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new, specifically comprising:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D403;
Step D403: compute the topic distribution θ_new of d_new, the proportion λ_new of words in d_new with s_{m,n} = 1, and the distribution δ_new of d_new's citations.
Preferably, Step E comprises:
obtaining the relevant parameters with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
In a preferred technical scheme, the effective keywords of the corpus are extracted and treated as abstract objects; the number of topics to extract, the strength of the topic distribution, and the strength of the citation distribution can be determined by user requirements or preset by the system. It is assumed that the topic source of each word in each article is random: the topic of a word in a given article is produced either by the article's own topic distribution or by the topic distribution of an article it cites.
The probabilistic model of text generation makes the following assumptions:
(1) In each article, the topic of each word obeys a multinomial distribution, whose prior is a Dirichlet distribution.
(2) Under each topic, the words obey a multinomial distribution, whose prior is a Dirichlet distribution.
(3) In each article, the citation source of each word obeys a multinomial distribution, whose prior is a Dirichlet distribution.
(4) In each article, the originality of each word obeys a binomial distribution, whose prior is a Beta distribution.
For these model assumptions, the parameters of the prior distributions are determined by the average article length, the number of topics, and the average number of articles cited per article.
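Written out in the notation of Step B, these four assumptions amount to the following generative specification (a compact restatement of the above, not an extension of it):

$$\varphi_k \sim \mathrm{Dir}(\vec{\beta}),\quad k=1,\dots,K$$
$$\theta_m \sim \mathrm{Dir}(\vec{\alpha}),\qquad \delta_m \sim \mathrm{Dir}(\vec{\eta}),\qquad \lambda_m \sim \mathrm{Beta}(\alpha_{\lambda_c},\alpha_{\lambda_n}),\quad m=1,\dots,M$$
$$s_{m,n} \sim \mathrm{Bernoulli}(\lambda_m)$$
$$s_{m,n}=1:\quad c_{m,n}\sim\mathrm{Mult}(\delta_m),\ \ z_{m,n}\sim\mathrm{Mult}(\theta_{c_{m,n}}),\ \ w_{m,n}\sim\mathrm{Mult}(\varphi_{z_{m,n}})$$
$$s_{m,n}=0:\quad z_{m,n}\sim\mathrm{Mult}(\theta_m),\ \ w_{m,n}\sim\mathrm{Mult}(\varphi_{z_{m,n}})$$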
Compared with the prior art, the present invention has the following beneficial effects:
1. Starting from the above problems in the prior art, the present invention approaches article feature extraction from a new perspective: it improves the accuracy of article feature extraction and extracts information from articles that traditional feature extraction systems do not consider.
2. The present invention extends the traditional topic model with the information of the citation network, so that the model extracts article features comprehensively from both kinds of information. It is applicable not only to databases with large data volumes but also to dynamically growing databases, and it can extract information that conventional topic models cannot, such as the strength of citation relationships between articles and the originality index of an article.
3. The present invention exploits the sparsity of the article topic distributions, of the word distributions within topics, and of the article citation distributions to reduce sampling complexity.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent by reading the following detailed description of non-limiting embodiments with reference to the drawings:
Fig. 1 is a sample of original article data.
Fig. 2 shows the generative process of the novel topic model.
Fig. 3 is the flow diagram of the method of the present invention.
Embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention but do not limit it in any form. It should be noted that, to a person of ordinary skill in the art, several changes and improvements can be made without departing from the concept of the present invention; these all fall within the protection scope of the present invention.
The present invention extracts article features by a novel method: it uses the article citation network to extend the traditional topic model, so that the topic model can exploit both the topic structure and the citation network at the same time and thereby extract more accurate article features. The main steps of the present invention comprise:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
The article feature extraction method designed by the present invention involves five core components: an automated scientific procedure for curating the citation network; the derivation of the generative model of the novel citation-network-aware topic model and of its joint expression; the inference process of the novel topic model; the derivation of its sampling algorithm; and its parameter estimation. The method provided by the invention comprises the following steps.
Regarding Step A: based on a large-sample original corpus, the citation network of articles (such as papers) is generated automatically and written to a file. The corpus contains two parts of information: one part concerns the articles themselves, including title, authors, abstract, and so on; the other part is the citation relationships between articles, for example that article A cites article B and article A cites article C.
Academic data on the internet is vast and grows by millions of items every year. In the present invention, the original corpus follows existing XML and JSON formats; for each article in the original corpus, the article title, abstract, and references are extracted, then an initial article set is chosen and, following the citation relationships of the academic articles, the largest connected component is obtained and exported as the new corpus.
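As an illustration of this extraction step, the following is a minimal loader sketch. It assumes a JSON-lines storage with fields `id`, `title`, `abstract`, and `references`; these field names are placeholders, since the actual storage format is the one specified in Table 1 and Fig. 1.

```python
import json

def load_corpus(path):
    """Parse a JSON-lines corpus file into article records.

    Field names here are illustrative placeholders; the real format
    follows the storage specification of Table 1 / Fig. 1.
    """
    articles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            articles[rec["id"]] = {
                "title": rec.get("title", ""),
                "abstract": rec.get("abstract", ""),
                # ids of the articles cited by this one
                "references": rec.get("references", []),
            }
    return articles
```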
The format of an existing original article corpus is shown in Table 1 and Fig. 1.
Table 1. Original article data storage format specification
In Step A, the construction of the citation network from the original corpus comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network.
In Step A, the setting of the initial article set and the derivation of the new corpus from the citation network comprise automatically obtaining the largest connected component of the citation network to form the new corpus, specifically:
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
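The following sketch illustrates Steps A1–A6, assuming the corpus has been loaded into a dictionary as in the loader above. The fixed-point loop mirrors Step A5; a citation relationship is taken here to hold in either direction (v cites a vertex of V, or is cited by one), which is an assumption of this sketch.

```python
def build_citation_network(articles):
    """Steps A1-A3: one vertex per article, one directed edge per citation."""
    V = set(articles)
    E = {(u, v) for u in articles
         for v in articles[u]["references"] if v in articles}
    return V, E

def expand_from_seed(articles, V0):
    """Steps A4-A6: grow the initial vertex set until no article outside V
    has a citation relationship with V, i.e. until V and E stop changing."""
    V = set(V0)
    changed = True
    while changed:
        changed = False
        for v, rec in articles.items():
            if v in V:
                continue
            cites_V = any(r in V for r in rec["references"])
            cited_by_V = any(v in articles[u]["references"]
                             for u in V if u in articles)
            if cites_V or cited_by_V:
                V.add(v)
                changed = True
    E = {(u, w) for u in V if u in articles
         for w in articles[u]["references"] if w in V}
    return V, E  # the articles indexed by V form the new corpus
```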
Regarding Step B: a traditional topic model uses the word frequency characteristics of each article as its topic features, whereas the topic model adopted in the present invention can also capture the relationships between articles, i.e. the article citation network. The topic model comprises two cores: the generative model (detailed in Step B) and the inference process (detailed in Step C). The generative model is the model that the article generation process is assumed to obey when the parameters are known; the graphical model corresponding to the generative model of an article is shown in Fig. 2.
Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m. Those skilled in the art will understand that the Beta distribution itself requires two hyperparameters, and that these two hyperparameters can be interchanged.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
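The generative process of Steps B1–B3 can be transcribed directly; the sketch below does so with numpy. The corpus sizes, hyperparameter values, and citation lists are placeholders — in the method itself the citations come from the network of Step A.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, V = 20, 100, 5000              # topics, articles, vocabulary (placeholders)
alpha, beta, eta = 0.1, 0.01, 0.1    # Dirichlet hyperparameters
a_lc, a_ln = 1.0, 1.0                # Beta hyperparameters (alpha_lambda_c, alpha_lambda_n)

phi = rng.dirichlet([beta] * V, size=K)      # Step B1: topic-word distributions
theta = rng.dirichlet([alpha] * K, size=M)   # Step B2: article-topic distributions
refs = [rng.choice(M, size=5, replace=False) for _ in range(M)]   # citation lists (given)
delta = [rng.dirichlet([eta] * len(r)) for r in refs]             # citation strengths
lam = rng.beta(a_lc, a_ln, size=M)           # originality indices

def generate_word(m):
    """Step B3: draw (s, c, z, w) for one word of article m."""
    s = int(rng.random() < lam[m])           # s = 1: word inherited via a citation
    if s == 1:
        i = rng.choice(len(refs[m]), p=delta[m])
        c = refs[m][i]                       # cited article
        z = rng.choice(K, p=theta[c])        # topic drawn from the cited article
    else:
        c = None
        z = rng.choice(K, p=theta[m])        # topic drawn from the article itself
    w = rng.choice(V, p=phi[z])
    return s, c, z, w
```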
Regarding Step C: the inference process estimates the parameters of the generative model. In practice, we know the words of the articles and wish to derive the hidden parameters from them, which must be done by statistical inference. For the novel topic model proposed here, conventional optimization methods cannot solve the maximum likelihood estimation problem, so we adopt a method known as Gibbs sampling for parameter estimation.
Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}. A subscript on a hyperparameter denotes the component of the corresponding prior distribution parameter.
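An illustrative sketch of one sampling site follows. For brevity it redraws (s_{m,n}, c_{m,n}, z_{m,n}) as a single block from their joint conditional, whereas the formulas above resample each variable in turn; both target the same posterior. The count-array layout (`n_kw`, `n_dk0`, `n_dk1`, `n_mc`, `N0`, `N1`) and the hyperparameter bundle `hp` are assumptions of this sketch.

```python
import numpy as np

def resample_word(m, n, w, state, cnt, refs, hp, rng):
    """One collapsed-Gibbs update. `cnt` bundles the counts: n_kw (K x V
    topic-word), n_dk0 / n_dk1 (per-article topic counts of original /
    inherited words), n_mc (per-article citation counts), N0 / N1
    (per-article totals of original / inherited words)."""
    s, c, z = state[(m, n)]
    d = m if s == 0 else c
    # Remove the current assignment: this realizes the "-1" terms above.
    cnt.n_kw[z, w] -= 1
    (cnt.n_dk0 if s == 0 else cnt.n_dk1)[d, z] -= 1
    if s == 1:
        cnt.n_mc[m][c] -= 1
        cnt.N1[m] -= 1
    else:
        cnt.N0[m] -= 1

    K, Lm = cnt.n_kw.shape[0], len(refs[m])
    phi_w = (cnt.n_kw[:, w] + hp.beta) / (cnt.n_kw.sum(axis=1) + hp.V * hp.beta)
    def theta(doc):                      # smoothed, normalized topic mix of doc
        nd = cnt.n_dk0[doc] + cnt.n_dk1[doc] + hp.alpha
        return nd / nd.sum()

    # s = 0: the topic comes from the article's own distribution.
    p0 = (cnt.N0[m] + hp.a_ln) * theta(m) * phi_w
    # s = 1: choose a cited article, then a topic from its distribution.
    p1 = (np.stack([(cnt.N1[m] + hp.a_lc)
                    * (cnt.n_mc[m][c2] + hp.eta) / (cnt.N1[m] + Lm * hp.eta)
                    * theta(c2) * phi_w
                    for c2 in refs[m]])
          if Lm else np.zeros((0, K)))

    probs = np.concatenate([p0, p1.ravel()])
    idx = rng.choice(probs.size, p=probs / probs.sum())
    if idx < K:
        s, c, z = 0, None, int(idx)
    else:
        s, c, z = 1, refs[m][(idx - K) // K], int((idx - K) % K)

    # Record the new assignment back into the counts.
    d = m if s == 0 else c
    cnt.n_kw[z, w] += 1
    (cnt.n_dk0 if s == 0 else cnt.n_dk1)[d, z] += 1
    if s == 1:
        cnt.n_mc[m][c] += 1
        cnt.N1[m] += 1
    else:
        cnt.N0[m] += 1
    state[(m, n)] = (s, c, z)
```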
Regarding Step D: according to the inference process of the novel topic model, a sampling algorithm is designed and the article database is sampled; the complete inference process can then be written out.
Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is repeated until the Gibbs sampler converges;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
For a new article d_new (i.e. an article newly added to the new corpus), its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new are computed as follows:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is repeated until the Gibbs sampler converges;
Step D403: compute the topic distribution of the current article d_new — this is θ_new; compute the proportion of words in d_new with s_{m,n} = 1 — this is λ_new; and compute the distribution of d_new's citations — this is δ_new.
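A sketch of this fold-in procedure is given below, reusing `resample_word` from the sketch above. It assumes the corpus-wide counts in `cnt` have converged and that a fresh, zeroed row `m_new` has been appended to the per-article count arrays; both are assumptions of this illustration.

```python
def infer_new_article(m_new, words, refs_new, cnt, hp, rng, sweeps=200):
    """Fold-in inference for one new article (Steps D401-D403)."""
    refs = {m_new: refs_new}
    state = {}
    K = cnt.n_kw.shape[0]
    # D401: random initial assignment, registered in the counts.
    for n, w in enumerate(words):
        s = int(rng.random() < 0.5) if refs_new else 0
        c = refs_new[rng.integers(len(refs_new))] if s == 1 else None
        z = int(rng.integers(K))
        state[(m_new, n)] = (s, c, z)
        cnt.n_kw[z, w] += 1
        (cnt.n_dk1 if s == 1 else cnt.n_dk0)[m_new, z] += 1
        if s == 1:
            cnt.n_mc[m_new][c] += 1
            cnt.N1[m_new] += 1
        else:
            cnt.N0[m_new] += 1
    # D402: repeated Gibbs sweeps over the new article only.
    for _ in range(sweeps):
        for n, w in enumerate(words):
            resample_word(m_new, n, w, state, cnt, refs, hp, rng)
    # D403: lambda_new; theta_new and delta_new follow from the Step E
    # formulas applied to row m_new of the counts in the same way.
    lam_new = (cnt.N1[m_new] + hp.a_lc) / (
        cnt.N0[m_new] + cnt.N1[m_new] + hp.a_ln + hp.a_lc)
    return lam_new, state
```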
Regarding Step E: after the topic model has converged (preferably, for example, by running the Gibbs sampling algorithm of Step D in a loop; after a sufficient number of iterations the model parameters can be considered converged), the relevant parameters are obtained with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
Here a dot in place of an index denotes summation of the corresponding quantity over that index, as in n_k^{(\cdot)} = \sum_{t=1}^{V} n_k^{(t)}.
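Transcribed into code against the count layout assumed in the earlier sketches, the read-out might look as follows; the topic-word distribution φ is computed by the standard analogous formula, which is an assumption here since Step E states only the three formulas above.

```python
def extract_parameters(m, cnt, refs, hp):
    """Step E read-out for article m, mirroring the three formulas above."""
    K = cnt.n_kw.shape[0]
    n_mk = cnt.n_dk0[m] + cnt.n_dk1[m]
    theta_m = (n_mk + hp.alpha) / (n_mk.sum() + K * hp.alpha)       # topic mix
    lam_m = (cnt.N1[m] + hp.a_lc) / (
        cnt.N0[m] + hp.a_ln + cnt.N1[m] + hp.a_lc)                  # originality index
    Lm = len(refs[m])
    delta_m = {c: (cnt.n_mc[m][c] + hp.eta) / (cnt.N1[m] + Lm * hp.eta)
               for c in refs[m]}                                    # citation strengths
    # Topic-word distribution (standard read-out, assumed here):
    phi = (cnt.n_kw + hp.beta) / (
        cnt.n_kw.sum(axis=1, keepdims=True) + hp.V * hp.beta)
    return theta_m, lam_m, delta_m, phi
```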
The specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the present invention. Where no conflict arises, the embodiments of the present application and the features in the embodiments can be combined with one another arbitrarily.

Claims (7)

1. An article feature extraction method based on a topic model, characterized in that it comprises:
Step A: build a citation network of articles from the original corpus, set an initial article set, and obtain a new corpus from the citation network;
Step B: for the new corpus, construct the generative model of the topic model and its joint parameter expression;
Step C: construct the inference process of the topic model from the generative model;
Step D: sample the articles of the new corpus according to the inference process of the topic model;
Step E: extract article parameters from the sampling results.
2. The article feature extraction method based on a topic model according to claim 1, characterized in that Step A comprises:
Step A1: set the vertex set V and the edge set E to the empty set, and let the graph G be (V, E);
Step A2: for each article in the original corpus, add the current article as a node u to the vertex set V, and add all citation relationships of u to the edge set E;
Step A3: take the graph G obtained in Step A2 as the citation network;
Step A4: set the vertex set V to an initial known vertex set V0, set the edge set E to an initial known edge set E0, and let the graph G be (V, E);
Step A5: repeatedly search the original corpus for a vertex v not in V such that v has a citation relationship with some vertex in V; add each such v to V and its citation relationships to E, until V and E no longer change;
Step A6: export the corpus corresponding to the graph G obtained in Step A5 as the new corpus.
3. The article feature extraction method based on a topic model according to claim 1, characterized in that Step B comprises:
Step B1: for each topic of the new corpus, perform the following steps:
Generate the multinomial parameter φ_k of the k-th topic's distribution over words from the Dirichlet hyperparameter β; here β is the parameter of the Dirichlet distribution obeyed by φ_k, and k is a positive integer.
Step B2: for each article of the new corpus, perform the following steps:
Generate the multinomial parameter θ_m of the m-th article's distribution over topics from the Dirichlet hyperparameter α; here α is the parameter of the Dirichlet distribution obeyed by θ_m, and m is a positive integer.
Generate the multinomial parameter δ_m of the m-th article's citation strength distribution from the Dirichlet hyperparameter η; here η is the parameter of the Dirichlet distribution obeyed by δ_m.
Generate the Bernoulli parameter λ_m of the m-th article's originality index from the Beta hyperparameter pair (α_{λ_c}, α_{λ_n}); here (α_{λ_c}, α_{λ_n}) are the parameters of the Beta distribution obeyed by λ_m.
Step B3: for each word of each article, perform the following steps:
Generate the originality indicator s_{m,n} of the n-th word of the m-th article from the Bernoulli distribution with parameter λ_m; n is a positive integer.
- If s_{m,n} is 1, generate the cited article c_{m,n} from the multinomial distribution with parameter δ_m, generate the topic z_{m,n} from the multinomial distribution with parameter θ_{c_{m,n}}, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
- If s_{m,n} is 0, generate the topic z_{m,n} from the multinomial distribution with parameter θ_m, and generate the word w_{m,n} from the multinomial distribution with parameter φ_{z_{m,n}};
where θ_{c_{m,n}} denotes the row vector of the matrix θ corresponding to c_{m,n}, and φ_{z_{m,n}} denotes the row vector of the matrix φ corresponding to z_{m,n}; θ is the article-to-topic distribution matrix, φ is the topic-to-word distribution matrix, w_{m,n} is the n-th word of the m-th article, z_{m,n} is the topic of the n-th word of the m-th article, and c_{m,n} is the article cited by the n-th word of the m-th article when that word is non-original.
Step B4: construct the joint probability distribution of the topic model as follows:

$$p(\vec{w},\vec{z},\vec{c},\vec{s}\mid\vec{\alpha},\vec{\beta},\vec{\eta},\alpha_{\lambda_n},\alpha_{\lambda_c})=p(\vec{w}\mid\vec{z},\vec{c},\vec{s},\vec{\beta})\cdot p(\vec{z}\mid\vec{c},\vec{s},\vec{\alpha})\cdot p(\vec{c}\mid\vec{s},\vec{\eta})\cdot p(\vec{s}\mid\alpha_{\lambda_n},\alpha_{\lambda_c})$$

$$=\prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})}\cdot\prod_{m=1}^{M}\frac{\Delta(\vec{R}_m+\vec{\eta})}{\Delta(\vec{\eta})}\cdot\prod_{m=1}^{M}\frac{B\!\left(N_m^{(1)}+\alpha_{\lambda_c},\,N_m^{(0)}+\alpha_{\lambda_n}\right)}{B\!\left(\alpha_{\lambda_c},\,\alpha_{\lambda_n}\right)}$$

where p(A|B) denotes the probability of A given B and an arrow denotes a vector; \vec{z} is the assignment of topics to words, \vec{c} the assignment of cited articles to words, \vec{s} the originality assignment of words, and \vec{w} the words themselves; \vec{n}_k is the word frequency vector under the k-th topic and K is the number of topics; \vec{n}_m is the topic frequency vector in the m-th article and M is the number of articles; \vec{R}_m is the citation frequency vector of the m-th article; N_m^{(1)} is the number of non-original words in the m-th article and N_m^{(0)} the number of original words; B(p, q) denotes the Beta function with parameters p and q.
Δ(·) is defined as

$$\Delta(\vec{A})=\frac{\prod_{k=1}^{\dim\vec{A}}\Gamma(A_k)}{\Gamma\!\left(\sum_{k=1}^{\dim\vec{A}}A_k\right)}$$

where dim \vec{A} is the dimension of the vector \vec{A}, Γ is the Gamma function, and A_k denotes the k-th component of \vec{A}.
4. The article feature extraction method based on a topic model according to claim 3, characterized in that Step C comprises:
Step C1: perform parameter estimation with the following Gibbs sampling formulas (the dot "·" in each conditioning set denotes all remaining assignments together with the observed words):

$$p(z_{m,n}\mid s_{m,n}=0,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{m}^{(k)}(0)+n_{m}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(z_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{z_{m,n}}^{w_{m,n}}+\beta_{w_{m,n}}-1}{\sum_{t=1}^{V}\left(n_{z_{m,n}}^{t}+\beta_{t}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(c_{m,n}\mid s_{m,n}=1,\cdot)\ \propto\ \frac{n_{m}^{c_{m,n}}-1+\eta_{c_{m,n}}}{\sum_{r=1}^{L_{m}}\left(n_{m}^{(r)}+\eta_{r}\right)-1}\cdot\frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{\sum_{k=1}^{K}\left(n_{c_{m,n}}^{(k)}(0)+n_{c_{m,n}}^{(k)}(1)+\alpha_{k}\right)-1}$$

$$p(s_{m,n}=1\mid\cdot)\ \propto\ \frac{n_{c_{m,n}}^{z_{m,n}}(0)+\left(n_{c_{m,n}}^{z_{m,n}}(1)-1\right)+\alpha_{z_{m,n}}}{n_{c_{m,n}}^{(\cdot)}(0)+n_{c_{m,n}}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(1)}-1+\alpha_{\lambda_{c}}}{\left(N_{m}^{(1)}-1\right)+N_{m}^{(0)}+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

$$p(s_{m,n}=0\mid\cdot)\ \propto\ \frac{\left(n_{m}^{z_{m,n}}(0)-1\right)+n_{m}^{z_{m,n}}(1)+\alpha_{z_{m,n}}}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha-1}\cdot\frac{N_{m}^{(0)}-1+\alpha_{\lambda_{n}}}{N_{m}^{(1)}+\left(N_{m}^{(0)}-1\right)+\alpha_{\lambda_{n}}+\alpha_{\lambda_{c}}}$$

where ∝ denotes proportionality; n_{z_{m,n}}^{w_{m,n}} is the frequency of word w_{m,n} under topic z_{m,n}, and β_{w_{m,n}} is the component of \vec{β} corresponding to w_{m,n}; V is the total number of words; n_{z_{m,n}}^{t} is the frequency of the t-th word under topic z_{m,n}, and β_t is the t-th component of \vec{β}; n_{c_{m,n}}^{z_{m,n}}(0) and n_{c_{m,n}}^{z_{m,n}}(1) are the frequencies of words in article c_{m,n} with topic z_{m,n} and s = 0 or s = 1 respectively; α_{z_{m,n}} is the component of \vec{α} corresponding to z_{m,n}; n_{c_{m,n}}^{(k)}(0) and n_{c_{m,n}}^{(k)}(1) are the frequencies of words in article c_{m,n} with topic k and s = 0 or s = 1 respectively, and α_k is the k-th component of \vec{α}; n_m^{c_{m,n}} is the number of words of the m-th article drawn from c_{m,n}, and η_{c_{m,n}} is the component of \vec{η} corresponding to c_{m,n}; L_m is the total number of articles cited by the m-th article; n_m^{(r)} is the number of words of the m-th article drawn from the r-th cited article, and η_r is the r-th component of \vec{η}; n_m^{z_{m,n}}(0) and n_m^{z_{m,n}}(1) are the frequencies of words in the m-th article with topic z_{m,n} and s = 0 or s = 1 respectively; n_m^{(·)}(0) and n_m^{(·)}(1) are the frequencies of all original and all non-original words in the m-th article respectively; N_m^{(0)} and N_m^{(1)} are the numbers of original and non-original words in the m-th article; \vec{z}_{¬(m,n)}, \vec{c}_{¬(m,n)} and \vec{s}_{¬(m,n)} denote the corresponding vectors with the (m,n)-th component removed, and the counts above exclude the current assignment of the word w_{m,n}.
5. The article feature extraction method based on a topic model according to claim 4, characterized in that Step D comprises:
Step D1: initialization. For each word w_{m,n} of each article in the new corpus, randomly sample the originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of the current article according to a multinomial distribution; randomly assign a topic z_{m,n} to the current word according to a multinomial distribution;
Step D2: rescan the new corpus. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n} of w_{m,n}; update the counts of the new corpus;
Step D2 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D3;
Step D3: from the statistics of the new corpus — the proportion of words with s_{m,n} = 1 in each article, the frequency with which each article's citations appear, the frequency of topics in each article, and the frequency of words under each topic — obtain, respectively, the originality index λ of each article, the citation strength distribution δ, the topic distribution θ of each article, and the word distribution φ of each topic.
6. The article feature extraction method based on a topic model according to claim 5, characterized in that Step D further comprises:
for a new article d_new added to the new corpus, computing its topic distribution θ_new, citation strength distribution δ_new, and originality index λ_new, specifically comprising:
Step D401: initialization. For each word w_{m,n} of the current article d_new, randomly assign an originality indicator s_{m,n} from a binomial distribution; if the sample gives s_{m,n} = 1, randomly draw one cited article c_{m,n} from the citations of d_new according to a multinomial distribution; randomly assign a topic z_{m,n} to the word according to a multinomial distribution;
Step D402: rescan the current article d_new. For each word w_{m,n}, resample the originality indicator s_{m,n} according to the Gibbs sampling formulas; if the new sample gives s_{m,n} = 1, resample the corresponding cited article c_{m,n}, otherwise skip the sampling of c_{m,n}; resample the topic z_{m,n}; update the counts of the new corpus;
Step D402 is executed repeatedly until the Gibbs sampler converges, after which execution continues with Step D403;
Step D403: compute the topic distribution θ_new of d_new, the proportion λ_new of words in d_new with s_{m,n} = 1, and the distribution δ_new of d_new's citations.
7. The article feature extraction method based on a topic model according to claim 4, characterized in that Step E comprises:
obtaining the relevant parameters with the following formulas:

$$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha}{n_{m}^{(\cdot)}(0)+n_{m}^{(\cdot)}(1)+K\alpha}$$

$$\lambda_{m}=\frac{N_{m}^{(1)}+\alpha_{\lambda_{c}}}{N_{m}^{(0)}+\alpha_{\lambda_{n}}+N_{m}^{(1)}+\alpha_{\lambda_{c}}}$$

$$\delta_{m,c}=\frac{R_{m}^{(c)}+\eta}{R_{m}^{(\cdot)}+L_{m}\eta}$$

where θ_{m,k} is the distribution probability of the m-th article over the k-th topic, φ_{k,t} is the distribution probability of the k-th topic over the t-th word, λ_m is the originality index of the m-th article, and δ_{m,c} is the strength of the citation relationship between the m-th article and the c-th article; n_m^{(k)} is the frequency of words with topic k in the m-th article; n_k^{(t)} is the frequency of the t-th word under the k-th topic, n_k^{(·)} denotes \sum_{t=1}^{V} n_k^{(t)}, and V is the number of words under the k-th topic; R_m^{(c)} is the frequency of words of the m-th article drawn from the cited article c, and R_m^{(·)} denotes \sum_c R_m^{(c)}.
CN201511016955.7A 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model Active CN105631018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511016955.7A CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Publications (2)

Publication Number Publication Date
CN105631018A true CN105631018A (en) 2016-06-01
CN105631018B CN105631018B (en) 2018-12-18

Family

ID=56045951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511016955.7A Active CN105631018B (en) 2015-12-29 2015-12-29 Article Feature Extraction Method based on topic model

Country Status (1)

Country Link
CN (1) CN105631018B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147A (en) * 2016-08-29 2017-02-01 上海交通大学 Method for constructing and visualizing heterogeneous thematic network based on text network
CN106709520A (en) * 2016-12-23 2017-05-24 浙江大学 Topic model based medical record classification method
CN107515854A (en) * 2017-07-27 2017-12-26 上海交通大学 The detection method of sequential community and topic based on cum rights sequential text network
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster
CN109299257A (en) * 2018-09-18 2019-02-01 杭州科以才成科技有限公司 A kind of English Periodicals recommended method based on LSTM and knowledge mapping
CN109597879A (en) * 2018-11-30 2019-04-09 京华信息科技股份有限公司 One kind being based on the business conduct Relation extraction method and device of " quotation relationship " data
CN110555198A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 method, apparatus, device and computer-readable storage medium for generating article
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654A (en) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (en) * 2004-10-29 2006-05-03 中国科学院研究生院 The processing method of the semi-structured data extraction of semantics of based on the context
US20100030732A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation System and method to create process reference maps from links described in a business process model
JP2011180862A (en) * 2010-03-02 2011-09-15 Nippon Telegr & Teleph Corp <Ntt> Method and device of extracting term, and program
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余传明 et al., "Review hotspot mining based on the LDA model: principles and implementation", Information Systems *
刘俊 et al., "Keyword extraction based on topic features", Application Research of Computers *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372147B (en) * 2016-08-29 2020-09-15 上海交通大学 Heterogeneous topic network construction and visualization method based on text network
CN106372147A (en) * 2016-08-29 2017-02-01 上海交通大学 Method for constructing and visualizing heterogeneous thematic network based on text network
CN106709520A (en) * 2016-12-23 2017-05-24 浙江大学 Topic model based medical record classification method
CN107515854A (en) * 2017-07-27 2017-12-26 上海交通大学 The detection method of sequential community and topic based on cum rights sequential text network
CN107515854B (en) * 2017-07-27 2021-06-04 上海交通大学 Time sequence community and topic detection method based on right-carrying time sequence text network
CN108549625A (en) * 2018-02-28 2018-09-18 首都师范大学 A kind of Chinese chapter Behaviour theme analysis method based on syntax object cluster
CN108549625B (en) * 2018-02-28 2020-11-17 首都师范大学 Chinese chapter expression theme analysis method based on syntactic object clustering
CN110555198A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 method, apparatus, device and computer-readable storage medium for generating article
CN110555198B (en) * 2018-05-31 2023-05-23 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating articles
CN109299257B (en) * 2018-09-18 2020-09-15 杭州科以才成科技有限公司 English periodical recommendation method based on LSTM and knowledge graph
CN109299257A (en) * 2018-09-18 2019-02-01 杭州科以才成科技有限公司 A kind of English Periodicals recommended method based on LSTM and knowledge mapping
CN109597879A (en) * 2018-11-30 2019-04-09 京华信息科技股份有限公司 One kind being based on the business conduct Relation extraction method and device of " quotation relationship " data
CN109597879B (en) * 2018-11-30 2022-03-29 京华信息科技股份有限公司 Service behavior relation extraction method and device based on 'citation relation' data
CN110807315A (en) * 2019-10-15 2020-02-18 上海大学 Topic model-based online comment emotion mining method
CN115438654A (en) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN105631018B (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN105631018A (en) Article feature extraction method based on topic model
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
Olczyk A systematic retrieval of international competitiveness literature: a bibliometric study
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN109522420B (en) Method and system for acquiring learning demand
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN105843795A (en) Topic model based document keyword extraction method and system
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN105912645B (en) A kind of intelligent answer method and device
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
CN104636424A (en) Method for building literature review framework based on atlas analysis
CN105608075A (en) Related knowledge point acquisition method and system
Vel Pre-processing techniques of text mining using computational linguistics and python libraries
CN107015965A (en) A kind of Chinese text sentiment analysis device and method
CN106407482A (en) Multi-feature fusion-based online academic report classification method
Kuntarto et al. Dwipa ontology III: Implementation of ontology method enrichment on tourism domain
CN113610626A (en) Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN106202036A (en) A kind of verb Word sense disambiguation method based on interdependent constraint and knowledge and device
CN106021413A (en) Theme model based self-extendable type feature selecting method and system
DE102018007024A1 (en) DOCUMENT BROKEN BY GRAMMATIC UNITS
CN105468780A (en) Normalization method and device of product name entity in microblog text
CN103019924B (en) The intelligent evaluating system of input method and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant