CN108519971A - Cross-language news topic similarity comparison method based on a parallel corpus - Google Patents

Cross-language news topic similarity comparison method based on a parallel corpus. Download PDF

Info

Publication number
CN108519971A
CN108519971A
Authority
CN
China
Prior art keywords
theme
language
document
chinese
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810245163.4A
Other languages
Chinese (zh)
Other versions
CN108519971B (en)
Inventor
王琦
于水源
曹轶臻
韩笑
戴长松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN201810245163.4A priority Critical patent/CN108519971B/en
Publication of CN108519971A publication Critical patent/CN108519971A/en
Application granted granted Critical
Publication of CN108519971B publication Critical patent/CN108519971B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-language news topic similarity comparison method based on a parallel corpus. The steps are as follows: (1) in the parallel corpus each document pair has its own topic distribution, which is shared by the two languages describing the same topics. First, a collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm. (2) Then the Chinese topic-T LDA model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus. (3) The LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T. The present invention can quickly and accurately screen out articles on a specific topic automatically, without translation.

Description

Cross-language news topic similarity comparison method based on a parallel corpus
Technical field
This patent proposes a method for comparing cross-language news topic similarity based on a parallel corpus. The method can automatically screen out foreign-language articles on a specific topic without translation. Its premise is the availability of a bilingual parallel corpus. On the basis of the LDA topic model, a bilingual LDA topic model is devised, and a parallelized implementation is realized on a parallel computing framework, so that news reports on the same event in multiple languages can be screened automatically, quickly, and efficiently. The method involves corpora, word frequency analysis, similarity computation, and related fields.
Background technology
How to automatically compare the topic similarity of news events across languages without human translation, and thereby automatically screen multilingual news reports on the same topic, is a problem to be solved: doing so reduces manual translation costs and allows Western media news opinion to be grasped promptly and accurately.
Machine translation has made great progress in recent years. Machine translation is the process of using a computer to convert one natural language into another. After statistical methods were applied to machine translation, the accuracy of machine translation results gradually increased. In practice under current technical conditions, however, machine translation and human translation are generally combined: an article is first translated into the other language by machine translation, and only after manual revision and calibration is a complete, accurate article obtained. This human-translation step increases the consumption of manpower and time, reduces efficiency, and raises costs, so traditional machine translation proves untimely and inaccurate when confronted with massive multilingual news reporting. This patent therefore proposes a corpus-based similarity comparison method for screening news articles, serving as a preprocessing step for news topic identification and screening before machine translation, with the aim of reducing irrelevant report noise, saving manpower and material resources, and improving efficiency.
In view of the above problems, this patent proposes a cross-language news topic similarity comparison method based on a parallel corpus. A corpus is a large-scale electronic text library: language data that actually occurred in practical use, carried on an electronic computer, and turned into a practically useful electronic text library through scientific sampling and reasonable analysis and processing. A parallel corpus is a set of texts in which each text is accompanied by one or more translated versions; the simplest parallel corpus design is bilingual text, an original and its translation. Based on a parallel corpus, this patent analyzes the word segmentation rules and word frequency distribution rules of the two languages. First, news reports in Chinese about some topic event are selected and, based on the general Chinese corpus, the topic model of the topic articles is generated with the LDA algorithm; then the foreign-language articles to be screened are chosen and their text features extracted based on the general foreign-language corpus; finally, the text features obtained from the foreign-language news articles are compared with the topic model of the topic articles obtained from Chinese, and if they are similar, the foreign-language article is judged to be a foreign-language article about that topic.
On the basis of the LDA topic model, this patent expands it into a bilingual LDA topic model. Unlike the traditional LDA topic model, in which each document has an independent topic distribution, the bilingual LDA model exploits the bilingual parallel corpus: the two languages describe the same topics and share the topic distribution. In addition, because the parallel corpus is described in different languages, the word frequency distributions may differ. The method of this patent establishes a bilingual LDA topic model on a generalized space on the basis of the parallel corpus; when new material arrives, a new LDA model is generated and compared with the bilingual LDA model to judge the topic class of the new material.
This patent samples the distributions with Gibbs sampling; in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is realized here on the parallel computing framework Spark. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization and adds a timestamp feature; carrying out the sampling process in a distributed environment improves the efficiency and accuracy of the whole process.
Summary of the invention
Based on a parallel corpus, the present invention provides a method for screening out foreign-language articles about a specific topic event. The premise is possession of a parallel corpus covering two languages, general Chinese and language F. Assume the target news topic is T and the foreign language is F. Without translation, the F-language articles about topic T are screened out of a large number of F-language articles of unknown topic:
1) In the parallel corpus each document pair has its own topic distribution, which the two languages, describing the same topics, share. First, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm;
2) then the Chinese topic-T LDA topic model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
3) the LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T.
The same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it is mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model and compared with the topic model of the generalized space to obtain the topic result. The steps are as follows:
For topic k:
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$, and then:
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$.
Here C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi_k^C$ and $\phi_k^F$ are the word distribution probabilities of topic $z_k$ in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th bilingual parallel document pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents of the m-th pair; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution with prior parameter $\alpha$, used to generate topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms. $\alpha$, $\beta^C$, $\beta^F$ are maximum-likelihood estimates governing the "document-topic" and "topic-term" probability distributions respectively; the probability of the entire corpus is chosen as the optimization objective, and maximizing this objective yields the values of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model. A minimal sketch of this generative process is given below.
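To make the generative process concrete, here is a minimal Python sketch of it. The vocabulary sizes, document counts, document lengths, and hyperparameter values are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                   # number of topics (assumed)
V_C, V_F = 5000, 4000    # Chinese / foreign-language vocabulary sizes (assumed)
M = 100                  # number of parallel document pairs (assumed)
alpha = np.full(K, 0.1)       # Dirichlet prior on the document-topic distribution
beta_C = np.full(V_C, 0.01)   # prior on the Chinese topic-word distributions
beta_F = np.full(V_F, 0.01)   # prior on the foreign topic-word distributions

# Step (1): per-topic word distributions, one per language, sharing topic ids.
phi_C = rng.dirichlet(beta_C, size=K)   # shape (K, V_C)
phi_F = rng.dirichlet(beta_F, size=K)   # shape (K, V_F)

def generate_pair(n_c=50, n_f=50):
    """Step (2): generate one Chinese/F document pair with a shared theta_m."""
    theta_m = rng.dirichlet(alpha)                    # shared topic mixture
    z_c = rng.choice(K, size=n_c, p=theta_m)          # latent topics (Chinese)
    w_c = [rng.choice(V_C, p=phi_C[z]) for z in z_c]  # Chinese terms
    z_f = rng.choice(K, size=n_f, p=theta_m)          # latent topics (foreign)
    w_f = [rng.choice(V_F, p=phi_F[z]) for z in z_f]  # foreign terms
    return w_c, w_f

corpus = [generate_pair() for _ in range(M)]
```

The key design point is that both sides of a pair draw their topics from the same $\theta_m$, which is what couples the two languages in one generalized topic space.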
In the parameter estimation of the LDA topic model, the parameters to be estimated are the distribution of document topics and the distribution of topic words. The concrete procedure is:
Assume the news corpus contains M documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set; then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set. Since the terms w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled. From the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$, the parameter estimates are:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

where $\hat{\theta}_{m,k}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of the k-th topic of the m-th bilingual parallel document pair, with prior parameter $\alpha_k$ representing the probability distribution of the k-th topic; $n_{m,k}$ counts the words of document m assigned to topic k; $\hat{\phi}_{k,t}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of term t under topic k, with prior $\beta_t$ representing the word distribution of the topic; the superscript $\neg i$ excludes the current word from the counts. $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates: the probability of the entire corpus is chosen as the optimization objective, and maximizing it yields their values.
The right-hand side of this formula expresses $P(\text{topic} \mid \text{document}) \cdot P(\text{word} \mid \text{topic})$, which is exactly the path probability of document → topic → term; since the K topics are chosen manually, the Gibbs sampling formula samples among these K paths. A single-threaded sketch of the core sampling update is given below.
After all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter for the next round of iteration, and sampling is carried out again. This is iterated until the parameters converge.
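As an illustration of the sampling formula above, the following is a minimal single-threaded sketch of one collapsed Gibbs sweep for one language, assuming a symmetric scalar β. It is a reconstruction for exposition, not the patent's implementation:

```python
import numpy as np

def gibbs_pass(docs, z, n_mk, n_kt, n_k, alpha, beta, rng):
    """One full Gibbs sweep. docs[m] is a list of term ids; z mirrors docs.
    n_mk: document-topic counts, n_kt: topic-term counts, n_k: topic totals."""
    K, V = n_kt.shape
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k_old = z[m][n]
            # remove the current assignment from the counts (the "not-i" counts)
            n_mk[m, k_old] -= 1; n_kt[k_old, t] -= 1; n_k[k_old] -= 1
            # P(z_i = k | z_-i, w) proportional to
            #   (n_mk + alpha_k) * (n_kt + beta_t) / (n_k + V * beta)
            p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # add the new assignment back
            z[m][n] = k_new
            n_mk[m, k_new] += 1; n_kt[k_new, t] += 1; n_k[k_new] += 1
```

The per-document denominator of the θ factor is constant in k, so it drops out of the proportionality and only the φ denominator remains in the update.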
For the method of parallel sampling, to ensure that the topics corresponding to each data set remain mutually independent, this patent adopts a partitioning strategy based on the time element. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration's sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is mainly caused by the uniform random initial partitioning, which simply divides the whole data set evenly into p parts without taking the correlation between the documents in each part into account, so that the topics within each data set tend to be averaged out.
Text similarity measurement is based on the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus. First the TF-IDF values of every document are computed; then each document is assigned a time scale according to the time it was published; based on this time scale, the similarity of two documents is solved with the following formula:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b.
The effects and benefits of this patent:
1. The invented bilingual LDA topic model can automatically screen out articles on a specific topic without any translation;
2. this patent serves as a preprocessing step for news topic identification and screening before machine translation, reducing irrelevant report noise, saving manpower and material resources, and improving efficiency;
3. when processing data, this patent implements parallelized processing on a parallel computing framework, enabling fast, high-speed screening;
4. on the basis of the parallel LDA algorithm, this patent partitions the data with a strategy based on the time element, which improves the precision of the final result.
Description of the drawings
Fig. 1 Overview of the foreign-language article screening method
Fig. 2 Processing of the Chinese articles of a topic news story
Fig. 3 Dictionary-based word segmentation process
Fig. 4 Graphical model of the bilingual LDA model
Fig. 5 Flow chart of the parallel LDA algorithm
Fig. 6 Probability computation path
Fig. 7 Algorithm parameter list
Fig. 8 Initialization pseudocode
Fig. 9 Partitioning pseudocode
Fig. 10 Gibbs sampling pseudocode
Fig. 11 F-language LDA topic model
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.
1. Fig. 2 shows the processing of the Chinese articles of a topic news story. First a target topic T is selected; Chinese articles about topic T are retrieved from the major Chinese portal websites, and several of them are filtered out as the Chinese material. This Chinese material is preprocessed by word segmentation and stop-word removal, in the following two steps:
(1) Word segmentation. Since this patent works with corpora, a dictionary-based segmentation method is used. The character strings of the Chinese news articles are compared with the words in the segmentation dictionary according to the strategy specified for the corpus; if a string to be segmented is found in the dictionary, the match succeeds. The segmentation process is illustrated in Fig. 3, and a sketch of one common matching strategy is given below.
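As an illustration, here is a minimal sketch of dictionary-based segmentation using forward maximum matching, one common realization of the dictionary comparison just described; the tiny dictionary is a made-up example:

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedily match the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                # single characters pass through even when not in the dictionary
                words.append(candidate)
                i += j
                break
    return words

dictionary = {"新闻", "主题", "相似", "相似度"}
print(forward_max_match("新闻主题相似度", dictionary))
# -> ['新闻', '主题', '相似度']
```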
(2) Stop-word removal. Stop words are words that occur frequently but carry little textual information.
2. Using the topic-T Chinese material and the general Chinese corpus, the bilingual LDA model of topic T is generated on the basis of the LDA model. This patent assumes that the same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it can be mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model and compared with the topic model of the generalized space to obtain the topic result. The steps are described as follows:
Fig. 4 is the graphical-model representation of the bilingual LDA topic model; the graphical model is based on the LDA topic model and illustrates the process of generating documents. C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi^C$ and $\phi^F$ are the word distribution probabilities of a topic z in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution, with $\alpha$ the Dirichlet prior parameter of the per-document multinomial distribution over topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms.
$\alpha$ reflects the relative relationship between the latent topics in the document collection, i.e. the "document-topic" probability distribution, and $\beta$ reflects the "topic-feature word" probability distribution. The probability of the entire corpus is chosen as the optimization objective; maximizing the objective yields estimates of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model. Suppose there are M documents; all terms in the material and their corresponding topics are denoted:

$$W = (\omega_1, \ldots, \omega_M), \qquad Z = (z_1, \ldots, z_M)$$

where $\omega_m$ are the words in document m, $z_m$ the topic numbers corresponding to those words, and $n_m^k$ the number of words in document m generated by the k-th topic.
2.1 The generative process of the LDA model. For topic k:
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$, and then:
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$.
The generating probability of the topics of the entire material, in the standard collapsed form:

$$p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}$$

The generating probability of the terms of the entire material:

$$p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}$$

Merging the two formulas gives:

$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)} \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}$$

where $\Delta(\cdot)$ is the Dirichlet normalization constant, $\mathbf{n}_m$ the vector of topic counts in document m, and $\mathbf{n}_k$ the vector of term counts under topic k.
2.2 Gibbs sampling in MCMC is the common method for obtaining LDA topic models. This patent uses Gibbs sampling to sample the distributions above; in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is realized on the parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization. The concrete procedure is:
Assume the news corpus contains M documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set (the similarity measure is discussed in detail below); then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set. Since the words w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled. From the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

with symbols as in the parameter-estimation description above; $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates obtained by taking the probability of the entire corpus as the optimization objective and maximizing it.
After all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter for the next round of iteration, and sampling is carried out again. This is iterated until the parameters converge. The flow chart of the whole process is shown in Fig. 5, and a sketch of the parallel scheme follows below.
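The following sketch illustrates the partition-sample-merge scheme just described, reusing the gibbs_pass update sketched in the parameter-estimation section. The merge by simple averaging of the local topic-word counts and the process-pool mechanics are assumptions made for exposition; the patent itself runs on the Spark framework:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# gibbs_pass is the per-sweep update sketched earlier; it must live at module
# top level so worker processes can import it.

def sample_partition(args):
    """Run one Gibbs sweep on one data partition with local counts."""
    docs, z, n_mk, global_n_kt, alpha, beta, seed = args
    rng = np.random.default_rng(seed)
    n_kt = global_n_kt.copy()        # local copy avoids access conflicts
    n_k = n_kt.sum(axis=1)
    gibbs_pass(docs, z, n_mk, n_kt, n_k, alpha, beta, rng)
    return z, n_mk, n_kt

def parallel_lda(partitions, global_n_kt, alpha, beta, iters=50):
    """partitions: list of (docs, z, n_mk) triples, one per data set."""
    for it in range(iters):
        args = [(docs, z, n_mk, global_n_kt, alpha, beta, 1000 * it + i)
                for i, (docs, z, n_mk) in enumerate(partitions)]
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(sample_partition, args))
        # carry the updated assignments and document-topic counts forward
        partitions = [(docs, z_new, n_mk_new)
                      for (docs, _, _), (z_new, n_mk_new, _)
                      in zip(partitions, results)]
        # merge: average the local topic-word counts into the global matrix,
        # which seeds every worker in the next round of iteration
        global_n_kt = np.mean([n_kt for _, _, n_kt in results], axis=0)
    return partitions, global_n_kt
```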
2.3 The partitioning strategy based on the time element. On the basis of the parallel LDA algorithm above, this patent improves the data partitioning step in order to raise the precision of the final result. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration's sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is mainly caused by the uniform random initial partitioning: the whole data set is simply divided evenly into p parts, without taking the correlation between the documents within each part into account, so the topics within each data set tend to be averaged out.
Observing the parallel LDA algorithm and comparing it with the conventional serial LDA algorithm shows that the topic-word matrix of the parallel algorithm cannot update the Gibbs sampling parameters in time, which introduces a certain deviation into the computation of document topic probabilities. Therefore, if the algorithm keeps the topics corresponding to each data set independent, uninfluenced by the other data sets, this deviation can be reduced.
For text similarity computation, this patent introduces a partitioning strategy based on the time element. The foundation of the strategy is the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus: the importance of a word is proportional to the number of times it appears in the document, but inversely proportional to its frequency of occurrence in the whole corpus.
TF-IDF consists of two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word occurs in a document; on top of the raw count, it is normalized to prevent bias toward long texts.
$tf_{i,j}$ denotes the TF value of the i-th word of the corpus in the j-th document:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word in document j and the denominator $\sum_k n_{k,j}$ is the total number of word occurrences in document j; the whole formula thus expresses the word's frequency normalized by the document length.
The inverse document frequency $idf_i$ measures the importance of the word for the whole corpus:

$$idf_i = \log \frac{D}{d(w_i) + 1}$$

where D is the total number of documents in the corpus and $d(w_i)$ is the number of documents containing word i; to prevent the denominator from being 0, one is added to it. The whole formula expresses that a word occurring in fewer documents discriminates better between documents.
With TF and IDF, the TF-IDF value is their product:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

Document similarity is then computed from the TF-IDF values. This patent represents the similarity between two documents by the cosine of their vectors, each document being represented as the vector W composed of the TF-IDF values of all its words. To embody the time element, assume the life cycle of the topic a news story discusses is one month; then, according to the time information of every document in the corpus, a time scale t is added to each news item; for example, the news items published in the earliest month get the time scale $t_i = 1$. The similarity is weighted by this time attribute: news documents published within close life cycles are more similar, and conversely, the longer the interval between the publication times of two documents, the lower their similarity. Thus the similarity $s_{a,b}$ of documents a and b is expressed as:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b. A sketch of this similarity computation follows below.
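Here is a minimal sketch of the time-weighted TF-IDF similarity, implementing the formulas above directly; the helper names and the bag-of-words input format are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    D = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequencies
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        # tf_{i,j} = n_{i,j} / sum_k n_{k,j};  idf_i = log(D / (d(w_i) + 1))
        vecs.append({w: (c / n) * math.log(D / (df[w] + 1))
                     for w, c in tf.items()})
    return vecs

def similarity(va, vb, ta, tb):
    """s_{a,b} = cosine(W_a, W_b) / (|t_a - t_b| + 1)."""
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb) / (abs(ta - tb) + 1)
```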
After the similarity matrix between documents has been computed as above, the whole corpus is divided into p parts according to the pairwise similarities, assigning the more similar documents to the same data set as far as possible.
2.4 Implementation of the parallel LDA algorithm. On the basis of the partitioned data sets, the implementation of the parallel LDA algorithm and the corresponding pseudocode are given next. The implementation comprises three parts: parameter initialization, data-set partitioning, and Gibbs sampling.
Parameter names and descriptions (cf. Fig. 7):
$X$: the corpus
$X_i$: the i-th small data set after partitioning
$p$: number of data partitions
$K$: number of topics
$M$: number of documents
$V$: number of words in the vocabulary
$N_m$: number of words in document m
$N$: total number of words in all documents
$n_m$: count array; the total number of words contained in document m of $X$
$n_k$: count array; the total number of words belonging to the k-th topic in $X$
$n_{m,k}$: count matrix; the number of words in document m of $X$ belonging to topic k
$n_{k,t}$: count matrix; the number of occurrences of term t among all words of $X$ belonging to topic k
$n_{k,t}^{(i)}$: count matrix; the number of occurrences of term t among all words of $X_i$ belonging to topic k
$n_{m,k}^{(i)}$: count matrix; the number of words in document m of $X_i$ belonging to topic k
$z_{m,n}$: the topic assignment of the n-th word in document m
$w_{m,n}$: the n-th word in document m
$t_a$, $t_b$: the time scales of documents a and b
$Sim(i, j)$: the similarity of the i-th document and the j-th document
The table above gives the parameters of the parallel LDA algorithm; the parts of the algorithm are described in detail below.
First is the initialization of the parameters: each word $w_{m,n}$ in each document is randomly assigned a topic, after which the count matrices $n_{m,k}$, $n_{k,t}$ and the count arrays $n_m$, $n_k$ are updated. The pseudocode of this step is shown in Fig. 8.
The second part is data-set partitioning. The partitioning principle was discussed in detail in Section 2.3; the pseudocode is shown in Fig. 9, and a sketch of one possible realization follows below.
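A possible greedy realization of the similarity-based partitioning is sketched here; the seed choice and the greedy pull of nearest neighbors are assumptions, since the patent's exact procedure is given in the Fig. 9 pseudocode:

```python
def partition_by_similarity(sim, p):
    """sim: M x M similarity matrix; returns at most p blocks of document ids."""
    M = len(sim)
    size = (M + p - 1) // p          # target block size, ceil(M / p)
    remaining = set(range(M))
    parts = []
    while remaining:
        seed = min(remaining)        # deterministic seed choice (assumption)
        remaining.discard(seed)
        # pull in the documents most similar to the seed, up to the block size
        closest = sorted(remaining, key=lambda j: sim[seed][j], reverse=True)
        block = [seed] + closest[:size - 1]
        remaining.difference_update(block)
        parts.append(block)
    return parts
```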
Last is the Gibbs sampling process, which divides into two parts: sampling and updating. Sampling draws from each data set $X_i$; first the count matrices $n_{k,t}^{(i)}$ and $n_{m,k}^{(i)}$ in the sampling parameters are initialized. To avoid access conflicts between different data sets, these matrices are stored under each local thread, giving local count matrices; in this way only the local count matrices need to be updated during sampling. After an iteration completes, all threads merge the contents of their count matrices into the global count matrix $n_{k,t}$, and this matrix serves as the initial count matrix of the next iteration. This iterative process repeats until the Gibbs sampling of the whole data set converges, at which point iteration stops. From the resulting count matrices $n_{m,k}$, $n_{k,t}$ and count arrays $n_m$, $n_k$, the word probability distribution under each topic and the topic probability distribution of every document are computed. The pseudocode of the concrete realization is shown in Fig. 10.
2.5 Model training. With the Gibbs sampling formula obtained above, the LDA model is trained on the material. Training amounts to drawing samples of (z, w) from the material by Gibbs sampling; all parameters of the model can then be estimated from the final samples. The training procedure is as follows:
(1) randomly initialize: assign a random topic number z to each word w of every news document in the Chinese material;
(2) rescan the corpus and, for each word w, resample its topic according to the Gibbs sampling formula, updating it in the material;
(3) repeat the resampling of the corpus above until the Gibbs sampling converges;
(4) count the topic-term co-occurrence frequency matrix of the corpus; this frequency matrix is the LDA model. A sketch of this training loop is given below.
3. Fig. 11 shows the processing of an article of unknown topic in language F: the F-language LDA topic model of the article is obtained with the same LDA topic model generation method.
4. Similarity comparison.
The similarity between the Chinese LDA model of topic T and the LDA topic model of the F-language article of unknown topic is compared using the KL distance method. The formula is:

$$D_{KL}(p, q) = \sum_j p_j \log \frac{p_j}{q_j}$$

For all j, when $p_j = q_j$, $D_{KL}(p, q) = 0$. In view of the asymmetry of the KL distance formula, there is a symmetrized KL distance formula:

$$D_\lambda(p, q) = \lambda D_{KL}(p, \lambda p + (1-\lambda) q) + (1-\lambda) D_{KL}(q, \lambda p + (1-\lambda) q)$$

When $\lambda = 0.5$, the KL distance formula converts into the JS distance formula:

$$D_{JS}(p, q) = \frac{1}{2} D_{KL}\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q, \frac{p+q}{2}\right)$$

This patent selects the JS distance formula as the similarity measurement module; a sketch follows below.
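A minimal sketch of the JS distance used to compare two topic distributions (for example, topic mixtures from the two models); the epsilon smoothing against log 0 is an assumption:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions, smoothed against log 0."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """D_JS(p, q) = 0.5 * D_KL(p || m) + 0.5 * D_KL(q || m), m = (p + q) / 2."""
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.6, 0.3, 0.1]
q = [0.5, 0.3, 0.2]
print(js(p, q))   # 0 iff p == q; smaller means more similar topic models
```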
5. Evaluation criterion. This patent uses the F-measure as the evaluation criterion. The F-measure is a balanced index in information retrieval that combines the precision and recall indices. It is expressed by the following formulas:

$$P = \frac{N_a}{N_b}, \qquad R = \frac{N_a}{N_c}, \qquad F = \frac{2PR}{P + R}$$

where $N_a$ is the total number of individuals correctly identified, $N_b$ the total number of individuals identified, and $N_c$ the total number of individuals present in the test set; P is the precision and R the recall.
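A one-function sketch of the evaluation criterion; the counts are toy values, not results from the patent:

```python
def f_measure(n_a, n_b, n_c):
    """n_a: correctly identified; n_b: identified; n_c: present in test set."""
    p = n_a / n_b          # precision
    r = n_a / n_c          # recall
    return 2 * p * r / (p + r)

print(f_measure(n_a=80, n_b=100, n_c=120))  # P=0.8, R~0.67, F~0.73
```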

Claims (4)

1. A cross-language news topic similarity comparison method based on a parallel corpus, characterized in that the method comprises three steps:
The premise is possession of a parallel corpus covering two languages, general Chinese and language F; assuming the target news topic is T and the foreign language is F, the articles about topic T are screened out of the F-language articles of unknown topic without translation:
(1) in the parallel corpus each document pair has its own topic distribution, which the two languages, describing the same topics, share; first, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm;
(2) then the Chinese topic-T LDA topic model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
(3) the LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T.
2. The method according to claim 1, characterized in that the same topic carries similar semantic information in different languages, so the texts of different languages are represented in the same generalized topic space; once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it is mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model, which is compared with the topic model of the generalized space to obtain the topic result; the steps are as follows:
For topic k,
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$;
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$;
where C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi_k^C$ and $\phi_k^F$ are the word distribution probabilities of topic $z_k$ in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th bilingual parallel document pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents of the m-th pair; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution with prior parameter $\alpha$, used to generate topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms; $\alpha$, $\beta^C$, $\beta^F$ are maximum-likelihood estimates governing the "document-topic" and "topic-term" probability distributions respectively; the probability of the entire corpus is chosen as the optimization objective, and maximizing this objective yields the values of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model.
3. The method according to claim 1, characterized in that, in the parameter estimation of the LDA topic model, the parameters to be estimated are the distribution of document topics and the distribution of topic words, and the concrete procedure is:
assume the news corpus contains M documents and the number of threads is p; the algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set; then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set; since the terms w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled; from the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$, the parameter estimates are:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

where $\hat{\theta}_{m,k}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of the k-th topic of the m-th bilingual parallel document pair, with prior parameter $\alpha_k$ representing the probability distribution of the k-th topic; $\hat{\phi}_{k,t}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of term t under topic k, with prior $\beta_t$ representing the word distribution of the topic; the superscript $\neg i$ excludes the current word from the counts; $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates obtained by taking the probability of the entire corpus as the optimization objective and maximizing it;
the right-hand side of this formula expresses $P(\text{topic} \mid \text{document}) \cdot P(\text{word} \mid \text{topic})$, which is exactly the path probability of document → topic → term; since the K topics are chosen manually, the Gibbs sampling formula samples among these K paths;
after all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter of the next round of iteration, and sampling is carried out again; this is iterated until the parameters converge.
4. The method according to claim 1, characterized in that text similarity measurement is based on the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus; the TF-IDF values of every document are computed first, then each document is assigned a time scale according to the time it was published; based on this time scale, the similarity of two documents is solved with the following formula:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b.
CN201810245163.4A 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus Active CN108519971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Publications (2)

Publication Number Publication Date
CN108519971A true CN108519971A (en) 2018-09-11
CN108519971B CN108519971B (en) 2022-02-11

Family

ID=63434170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810245163.4A Active CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Country Status (1)

Country Link
CN (1) CN108519971B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160210718A1 (en) * 2015-01-16 2016-07-21 Oracle International Corporation Data-parallel parameter estimation of the latent dirichlet allocation model by greedy gibbs sampling
US20160232155A1 (en) * 2015-02-05 2016-08-11 International Business Machines Corporation Extracting and recommending business processes from evidence in natural language systems
CN106202065A (en) * 2016-06-30 2016-12-07 中央民族大学 A kind of across language topic detecting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Xingshu (陈兴蜀) et al.: "Research on Chinese-English Cross-Language Topic Detection Based on the ICE-LDA Model", 《工程科学与技术》 (Advanced Engineering Sciences) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339286A (en) * 2020-02-14 2020-06-26 重庆邮电大学 Method for researching research condition of exploration institution based on topic visualization
CN111339286B (en) * 2020-02-14 2024-02-09 四川超易宏科技有限公司 Method for exploring mechanism research conditions based on theme visualization
CN111553168A (en) * 2020-05-09 2020-08-18 识因智能科技(北京)有限公司 Bilingual short text matching method
CN112069394A (en) * 2020-08-14 2020-12-11 上海风秩科技有限公司 Text information mining method and device
CN112069394B (en) * 2020-08-14 2023-09-29 上海风秩科技有限公司 Text information mining method and device
CN112329481A (en) * 2020-10-27 2021-02-05 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN112329481B (en) * 2020-10-27 2022-07-19 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN113076467A (en) * 2021-03-26 2021-07-06 昆明理工大学 Chinese-crossing news topic discovery method based on cross-language neural topic model
CN113344107A (en) * 2021-06-25 2021-09-03 清华大学深圳国际研究生院 Theme analysis method and system based on kernel principal component analysis and LDA (latent Dirichlet Allocation analysis)
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA
CN114742077A (en) * 2022-04-15 2022-07-12 中国电子科技集团公司第十研究所 Generation method of domain parallel corpus and training method of translation model

Also Published As

Publication number Publication date
CN108519971B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN108519971A (en) A kind of across languages theme of news similarity comparison methods based on Parallel Corpus
Abrishami et al. Predicting citation counts based on deep neural network learning techniques
Kong et al. Fake news detection using deep learning
US11995702B2 (en) Item recommendations using convolutions on weighted graphs
Chen et al. Entity embedding-based anomaly detection for heterogeneous categorical events
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Rustam et al. Classification of shopify app user reviews using novel multi text features
He et al. Time-evolving Text Classification with Deep Neural Networks.
CN105183833B (en) Microblog text recommendation method and device based on user model
CN109036577B (en) Diabetes complication analysis method and device
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN104199833B (en) The clustering method and clustering apparatus of a kind of network search words
Sundus et al. A deep learning approach for arabic text classification
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN108304479B (en) Quick density clustering double-layer network recommendation method based on graph structure filtering
CN110909125B (en) Detection method of media rumor of news-level society
CN108986907A (en) A kind of tele-medicine based on KNN algorithm divides the method for examining automatically
CN110990718B (en) Social network model building module of company image lifting system
Munna et al. Sentiment analysis and product review classification in e-commerce platform
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Saikia et al. Modelling social context for fake news detection: a graph neural network based approach
Baboo et al. Sentiment analysis and automatic emotion detection analysis of twitter using machine learning classifiers
CN112215006B (en) Organization named entity normalization method and system
CN104573003B (en) Financial Time Series Forecasting method based on theme of news information retrieval
CN110162629B (en) Text classification method based on multi-base model framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Wang Qi
Inventor after: Wang Yongbin
Inventor after: Yu Shuiyuan
Inventor after: Cao Diezhen
Inventor after: Han Xiao
Inventor after: Dai Changsong
Inventor before: Wang Qi
Inventor before: Yu Shuiyuan
Inventor before: Cao Diezhen
Inventor before: Han Xiao
Inventor before: Dai Changsong