CN108519971A - Cross-language news topic similarity comparison method based on a parallel corpus - Google Patents

Cross-language news topic similarity comparison method based on a parallel corpus. Download PDF

Info

Publication number
CN108519971A
CN108519971A
Authority
CN
China
Prior art keywords
theme
language
document
chinese
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810245163.4A
Other languages
Chinese (zh)
Other versions
CN108519971B (en)
Inventor
王琦
于水源
曹轶臻
韩笑
戴长松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN201810245163.4A priority Critical patent/CN108519971B/en
Publication of CN108519971A publication Critical patent/CN108519971A/en
Application granted granted Critical
Publication of CN108519971B publication Critical patent/CN108519971B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-language news topic similarity comparison method based on a parallel corpus. The steps are as follows: (1) in the parallel corpus each document pair has its own topic distribution, which is shared by the two languages describing the same topics. First, a collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm. (2) Then the Chinese topic-T LDA model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus. (3) The LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T. The present invention can quickly and accurately screen out articles on a specific topic automatically, without translation.

Description

Cross-language news topic similarity comparison method based on a parallel corpus
Technical field
This patent proposes a method for comparing cross-language news topic similarity based on a parallel corpus. The method can automatically screen out foreign-language articles on a specific topic without translation. Its premise is the availability of a bilingual parallel corpus. On the basis of the LDA topic model, a bilingual LDA topic model is devised, and a parallelized implementation is realized on a parallel computing framework, so that news reports on the same event in multiple languages can be screened automatically, quickly, and efficiently. The method involves corpora, word frequency analysis, similarity computation, and related fields.
Background technology
How to automatically compare the topic similarity of news events across languages without human translation, and thereby automatically screen multilingual news reports on the same topic, is a problem to be solved: doing so reduces manual translation costs and allows Western media news opinion to be grasped promptly and accurately.
Machine translation has made great progress in recent years. Machine translation is the process of using a computer to convert one natural language into another. After statistical methods were applied to machine translation, the accuracy of machine translation results gradually increased. In practice under current technical conditions, however, machine translation and human translation are generally combined: an article is first translated into the other language by machine translation, and only after manual revision and calibration is a complete, accurate article obtained. This human-translation step increases the consumption of manpower and time, reduces efficiency, and raises costs, so traditional machine translation proves untimely and inaccurate when confronted with massive multilingual news reporting. This patent therefore proposes a corpus-based similarity comparison method for screening news articles, serving as a preprocessing step for news topic identification and screening before machine translation, with the aim of reducing irrelevant report noise, saving manpower and material resources, and improving efficiency.
In view of the above problems, this patent proposes a cross-language news topic similarity comparison method based on a parallel corpus. A corpus is a large-scale electronic text library: language data that actually occurred in practical use, carried on an electronic computer, and turned into a practically useful electronic text library through scientific sampling and reasonable analysis and processing. A parallel corpus is a set of texts in which each text is accompanied by one or more translated versions; the simplest parallel corpus design is bilingual text, an original and its translation. Based on a parallel corpus, this patent analyzes the word segmentation rules and word frequency distribution rules of the two languages. First, news reports in Chinese about some topic event are selected and, based on the general Chinese corpus, the topic model of the topic articles is generated with the LDA algorithm; then the foreign-language articles to be screened are chosen and their text features extracted based on the general foreign-language corpus; finally, the text features obtained from the foreign-language news articles are compared with the topic model of the topic articles obtained from Chinese, and if they are similar, the foreign-language article is judged to be a foreign-language article about that topic.
On the basis of the LDA topic model, this patent expands it into a bilingual LDA topic model. Unlike the traditional LDA topic model, in which each document has an independent topic distribution, the bilingual LDA model exploits the bilingual parallel corpus: the two languages describe the same topics and share the topic distribution. In addition, because the parallel corpus is described in different languages, the word frequency distributions may differ. The method of this patent establishes a bilingual LDA topic model on a generalized space on the basis of the parallel corpus; when new material arrives, a new LDA model is generated and compared with the bilingual LDA model to judge the topic class of the new material.
This patent samples the distributions with Gibbs sampling; in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is realized here on the parallel computing framework Spark. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization and adds a timestamp feature; carrying out the sampling process in a distributed environment improves the efficiency and accuracy of the whole process.
Summary of the invention
Based on a parallel corpus, the present invention provides a method for screening out foreign-language articles about a specific topic event. The premise is possession of a parallel corpus covering two languages, general Chinese and language F. Assume the target news topic is T and the foreign language is F. Without translation, the F-language articles about topic T are screened out of a large number of F-language articles of unknown topic:
1) In the parallel corpus each document pair has its own topic distribution, which the two languages, describing the same topics, share. First, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm;
2) then the Chinese topic-T LDA topic model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
3) the LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T.
The same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it is mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model and compared with the topic model of the generalized space to obtain the topic result. The steps are as follows:
For topic k:
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$, and then:
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$.
Here C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi_k^C$ and $\phi_k^F$ are the word distribution probabilities of topic $z_k$ in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th bilingual parallel document pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents of the m-th pair; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution with prior parameter $\alpha$, used to generate topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms. $\alpha$, $\beta^C$, $\beta^F$ are maximum-likelihood estimates governing the "document-topic" and "topic-term" probability distributions respectively; the probability of the entire corpus is chosen as the optimization objective, and maximizing this objective yields the values of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model. A minimal sketch of this generative process is given below.
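To make the generative process concrete, here is a minimal Python sketch of it. The vocabulary sizes, document counts, document lengths, and hyperparameter values are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                   # number of topics (assumed)
V_C, V_F = 5000, 4000    # Chinese / foreign-language vocabulary sizes (assumed)
M = 100                  # number of parallel document pairs (assumed)
alpha = np.full(K, 0.1)       # Dirichlet prior on the document-topic distribution
beta_C = np.full(V_C, 0.01)   # prior on the Chinese topic-word distributions
beta_F = np.full(V_F, 0.01)   # prior on the foreign topic-word distributions

# Step (1): per-topic word distributions, one per language, sharing topic ids.
phi_C = rng.dirichlet(beta_C, size=K)   # shape (K, V_C)
phi_F = rng.dirichlet(beta_F, size=K)   # shape (K, V_F)

def generate_pair(n_c=50, n_f=50):
    """Step (2): generate one Chinese/F document pair with a shared theta_m."""
    theta_m = rng.dirichlet(alpha)                    # shared topic mixture
    z_c = rng.choice(K, size=n_c, p=theta_m)          # latent topics (Chinese)
    w_c = [rng.choice(V_C, p=phi_C[z]) for z in z_c]  # Chinese terms
    z_f = rng.choice(K, size=n_f, p=theta_m)          # latent topics (foreign)
    w_f = [rng.choice(V_F, p=phi_F[z]) for z in z_f]  # foreign terms
    return w_c, w_f

corpus = [generate_pair() for _ in range(M)]
```

The key design point is that both sides of a pair draw their topics from the same $\theta_m$, which is what couples the two languages in one generalized topic space.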
In the parameter estimation of the LDA topic model, the parameters to be estimated are the distribution of document topics and the distribution of topic words. The concrete procedure is:
Assume the news corpus contains M documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set; then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set. Since the terms w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled. From the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$, the parameter estimates are:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

where $\hat{\theta}_{m,k}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of the k-th topic of the m-th bilingual parallel document pair, with prior parameter $\alpha_k$ representing the probability distribution of the k-th topic; $n_{m,k}$ counts the words of document m assigned to topic k; $\hat{\phi}_{k,t}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of term t under topic k, with prior $\beta_t$ representing the word distribution of the topic; the superscript $\neg i$ excludes the current word from the counts. $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates: the probability of the entire corpus is chosen as the optimization objective, and maximizing it yields their values.
The right-hand side of this formula expresses $P(\text{topic} \mid \text{document}) \cdot P(\text{word} \mid \text{topic})$, which is exactly the path probability of document → topic → term; since the K topics are chosen manually, the Gibbs sampling formula samples among these K paths. A single-threaded sketch of the core sampling update is given below.
After all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter for the next round of iteration, and sampling is carried out again. This is iterated until the parameters converge.
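As an illustration of the sampling formula above, the following is a minimal single-threaded sketch of one collapsed Gibbs sweep for one language, assuming a symmetric scalar β. It is a reconstruction for exposition, not the patent's implementation:

```python
import numpy as np

def gibbs_pass(docs, z, n_mk, n_kt, n_k, alpha, beta, rng):
    """One full Gibbs sweep. docs[m] is a list of term ids; z mirrors docs.
    n_mk: document-topic counts, n_kt: topic-term counts, n_k: topic totals."""
    K, V = n_kt.shape
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k_old = z[m][n]
            # remove the current assignment from the counts (the "not-i" counts)
            n_mk[m, k_old] -= 1; n_kt[k_old, t] -= 1; n_k[k_old] -= 1
            # P(z_i = k | z_-i, w) proportional to
            #   (n_mk + alpha_k) * (n_kt + beta_t) / (n_k + V * beta)
            p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # add the new assignment back
            z[m][n] = k_new
            n_mk[m, k_new] += 1; n_kt[k_new, t] += 1; n_k[k_new] += 1
```

The per-document denominator of the θ factor is constant in k, so it drops out of the proportionality and only the φ denominator remains in the update.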
For the method of parallel sampling, to ensure that the topics corresponding to each data set remain mutually independent, this patent adopts a partitioning strategy based on the time element. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration's sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is mainly caused by the uniform random initial partitioning, which simply divides the whole data set evenly into p parts without taking the correlation between the documents in each part into account, so that the topics within each data set tend to be averaged out.
Text similarity measurement is based on the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus. First the TF-IDF values of every document are computed; then each document is assigned a time scale according to the time it was published; based on this time scale, the similarity of two documents is solved with the following formula:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b.
The effects and benefits of this patent:
1. The invented bilingual LDA topic model can automatically screen out articles on a specific topic without any translation;
2. this patent serves as a preprocessing step for news topic identification and screening before machine translation, reducing irrelevant report noise, saving manpower and material resources, and improving efficiency;
3. when processing data, this patent implements parallelized processing on a parallel computing framework, enabling fast, high-speed screening;
4. on the basis of the parallel LDA algorithm, this patent partitions the data with a strategy based on the time element, which improves the precision of the final result.
Description of the drawings
Fig. 1 Overview of the foreign-language article screening method
Fig. 2 Processing of the Chinese articles of a topic news story
Fig. 3 Dictionary-based word segmentation process
Fig. 4 Graphical model of the bilingual LDA model
Fig. 5 Flow chart of the parallel LDA algorithm
Fig. 6 Probability computation path
Fig. 7 Algorithm parameter list
Fig. 8 Initialization pseudocode
Fig. 9 Partitioning pseudocode
Fig. 10 Gibbs sampling pseudocode
Fig. 11 F-language LDA topic model
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.
1. Fig. 2 shows the processing of the Chinese articles of a topic news story. First a target topic T is selected; Chinese articles about topic T are retrieved from the major Chinese portal websites, and several of them are filtered out as the Chinese material. This Chinese material is preprocessed by word segmentation and stop-word removal, in the following two steps:
(1) Word segmentation. Since this patent works with corpora, a dictionary-based segmentation method is used. The character strings of the Chinese news articles are compared with the words in the segmentation dictionary according to the strategy specified for the corpus; if a string to be segmented is found in the dictionary, the match succeeds. The segmentation process is illustrated in Fig. 3, and a sketch of one common matching strategy is given below.
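As an illustration, here is a minimal sketch of dictionary-based segmentation using forward maximum matching, one common realization of the dictionary comparison just described; the tiny dictionary is a made-up example:

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedily match the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                # single characters pass through even when not in the dictionary
                words.append(candidate)
                i += j
                break
    return words

dictionary = {"新闻", "主题", "相似", "相似度"}
print(forward_max_match("新闻主题相似度", dictionary))
# -> ['新闻', '主题', '相似度']
```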
(2) Stop-word removal. Stop words are words that occur frequently but carry little textual information.
2. Using the topic-T Chinese material and the general Chinese corpus, the bilingual LDA model of topic T is generated on the basis of the LDA model. This patent assumes that the same topic carries similar semantic information in different languages, so texts in different languages can be represented in the same generalized topic space. Once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it can be mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model and compared with the topic model of the generalized space to obtain the topic result. The steps are described as follows:
Fig. 4 is the graphical-model representation of the bilingual LDA topic model; the graphical model is based on the LDA topic model and illustrates the process of generating documents. C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi^C$ and $\phi^F$ are the word distribution probabilities of a topic z in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution, with $\alpha$ the Dirichlet prior parameter of the per-document multinomial distribution over topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms.
$\alpha$ reflects the relative relationship between the latent topics in the document collection, i.e. the "document-topic" probability distribution, and $\beta$ reflects the "topic-feature word" probability distribution. The probability of the entire corpus is chosen as the optimization objective; maximizing the objective yields estimates of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model. Suppose there are M documents; all terms in the material and their corresponding topics are denoted:

$$W = (\omega_1, \ldots, \omega_M), \qquad Z = (z_1, \ldots, z_M)$$

where $\omega_m$ are the words in document m, $z_m$ the topic numbers corresponding to those words, and $n_m^k$ the number of words in document m generated by the k-th topic.
2.1 The generative process of the LDA model. For topic k:
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$, and then:
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$.
The generating probability of the topics of the entire material, in the standard collapsed form:

$$p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}$$

The generating probability of the terms of the entire material:

$$p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}$$

Merging the two formulas gives:

$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)} \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}$$

where $\Delta(\cdot)$ is the Dirichlet normalization constant, $\mathbf{n}_m$ the vector of topic counts in document m, and $\mathbf{n}_k$ the vector of term counts under topic k.
2.2 Gibbs sampling in MCMC is the common method for obtaining LDA topic models. This patent uses Gibbs sampling to sample the distributions above; in view of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is realized on the parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to achieve parallelization. The concrete procedure is:
Assume the news corpus contains M documents and the number of threads is p. The algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set (the similarity measure is discussed in detail below); then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set. Since the words w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled. From the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

with symbols as in the parameter-estimation description above; $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates obtained by taking the probability of the entire corpus as the optimization objective and maximizing it.
After all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter for the next round of iteration, and sampling is carried out again. This is iterated until the parameters converge. The flow chart of the whole process is shown in Fig. 5, and a sketch of the parallel scheme follows below.
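The following sketch illustrates the partition-sample-merge scheme just described, reusing the gibbs_pass update sketched in the parameter-estimation section. The merge by simple averaging of the local topic-word counts and the process-pool mechanics are assumptions made for exposition; the patent itself runs on the Spark framework:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# gibbs_pass is the per-sweep update sketched earlier; it must live at module
# top level so worker processes can import it.

def sample_partition(args):
    """Run one Gibbs sweep on one data partition with local counts."""
    docs, z, n_mk, global_n_kt, alpha, beta, seed = args
    rng = np.random.default_rng(seed)
    n_kt = global_n_kt.copy()        # local copy avoids access conflicts
    n_k = n_kt.sum(axis=1)
    gibbs_pass(docs, z, n_mk, n_kt, n_k, alpha, beta, rng)
    return z, n_mk, n_kt

def parallel_lda(partitions, global_n_kt, alpha, beta, iters=50):
    """partitions: list of (docs, z, n_mk) triples, one per data set."""
    for it in range(iters):
        args = [(docs, z, n_mk, global_n_kt, alpha, beta, 1000 * it + i)
                for i, (docs, z, n_mk) in enumerate(partitions)]
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(sample_partition, args))
        # carry the updated assignments and document-topic counts forward
        partitions = [(docs, z_new, n_mk_new)
                      for (docs, _, _), (z_new, n_mk_new, _)
                      in zip(partitions, results)]
        # merge: average the local topic-word counts into the global matrix,
        # which seeds every worker in the next round of iteration
        global_n_kt = np.mean([n_kt for _, _, n_kt in results], axis=0)
    return partitions, global_n_kt
```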
2.3 The partitioning strategy based on the time element. On the basis of the parallel LDA algorithm above, this patent improves the data partitioning step in order to raise the precision of the final result. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each iteration's sampling, the precision of the final model suffers a certain loss compared with the traditional LDA algorithm. This is mainly caused by the uniform random initial partitioning: the whole data set is simply divided evenly into p parts, without taking the correlation between the documents within each part into account, so the topics within each data set tend to be averaged out.
Observing the parallel LDA algorithm and comparing it with the conventional serial LDA algorithm shows that the topic-word matrix of the parallel algorithm cannot update the Gibbs sampling parameters in time, which introduces a certain deviation into the computation of document topic probabilities. Therefore, if the algorithm keeps the topics corresponding to each data set independent, uninfluenced by the other data sets, this deviation can be reduced.
For text similarity computation, this patent introduces a partitioning strategy based on the time element. The foundation of the strategy is the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus: the importance of a word is proportional to the number of times it appears in the document, but inversely proportional to its frequency of occurrence in the whole corpus.
TF-IDF consists of two parts: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word occurs in a document; on top of the raw count, it is normalized to prevent bias toward long texts.
$tf_{i,j}$ denotes the TF value of the i-th word of the corpus in the j-th document:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the word in document j and the denominator $\sum_k n_{k,j}$ is the total number of word occurrences in document j; the whole formula thus expresses the word's frequency normalized by the document length.
The inverse document frequency $idf_i$ measures the importance of the word for the whole corpus:

$$idf_i = \log \frac{D}{d(w_i) + 1}$$

where D is the total number of documents in the corpus and $d(w_i)$ is the number of documents containing word i; to prevent the denominator from being 0, one is added to it. The whole formula expresses that a word occurring in fewer documents discriminates better between documents.
With TF and IDF, the TF-IDF value is their product:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

Document similarity is then computed from the TF-IDF values. This patent represents the similarity between two documents by the cosine of their vectors, each document being represented as the vector W composed of the TF-IDF values of all its words. To embody the time element, assume the life cycle of the topic a news story discusses is one month; then, according to the time information of every document in the corpus, a time scale t is added to each news item; for example, the news items published in the earliest month get the time scale $t_i = 1$. The similarity is weighted by this time attribute: news documents published within close life cycles are more similar, and conversely, the longer the interval between the publication times of two documents, the lower their similarity. Thus the similarity $s_{a,b}$ of documents a and b is expressed as:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b. A sketch of this similarity computation follows below.
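Here is a minimal sketch of the time-weighted TF-IDF similarity, implementing the formulas above directly; the helper names and the bag-of-words input format are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    D = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequencies
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        # tf_{i,j} = n_{i,j} / sum_k n_{k,j};  idf_i = log(D / (d(w_i) + 1))
        vecs.append({w: (c / n) * math.log(D / (df[w] + 1))
                     for w, c in tf.items()})
    return vecs

def similarity(va, vb, ta, tb):
    """s_{a,b} = cosine(W_a, W_b) / (|t_a - t_b| + 1)."""
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb) / (abs(ta - tb) + 1)
```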
After the similarity matrix between documents has been computed as above, the whole corpus is divided into p parts according to the pairwise similarities, assigning the more similar documents to the same data set as far as possible.
2.4 Implementation of the parallel LDA algorithm. On the basis of the partitioned data sets, the implementation of the parallel LDA algorithm and the corresponding pseudocode are given next. The implementation comprises three parts: parameter initialization, data-set partitioning, and Gibbs sampling.
Parameter names and descriptions (cf. Fig. 7):
$X$: the corpus
$X_i$: the i-th small data set after partitioning
$p$: number of data partitions
$K$: number of topics
$M$: number of documents
$V$: number of words in the vocabulary
$N_m$: number of words in document m
$N$: total number of words in all documents
$n_m$: count array; the total number of words contained in document m of $X$
$n_k$: count array; the total number of words belonging to the k-th topic in $X$
$n_{m,k}$: count matrix; the number of words in document m of $X$ belonging to topic k
$n_{k,t}$: count matrix; the number of occurrences of term t among all words of $X$ belonging to topic k
$n_{k,t}^{(i)}$: count matrix; the number of occurrences of term t among all words of $X_i$ belonging to topic k
$n_{m,k}^{(i)}$: count matrix; the number of words in document m of $X_i$ belonging to topic k
$z_{m,n}$: the topic assignment of the n-th word in document m
$w_{m,n}$: the n-th word in document m
$t_a$, $t_b$: the time scales of documents a and b
$Sim(i, j)$: the similarity of the i-th document and the j-th document
The table above gives the parameters of the parallel LDA algorithm; the parts of the algorithm are described in detail below.
First is the initialization of the parameters: each word $w_{m,n}$ in each document is randomly assigned a topic, after which the count matrices $n_{m,k}$, $n_{k,t}$ and the count arrays $n_m$, $n_k$ are updated. The pseudocode of this step is shown in Fig. 8.
The second part is data-set partitioning. The partitioning principle was discussed in detail in Section 2.3; the pseudocode is shown in Fig. 9, and a sketch of one possible realization follows below.
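A possible greedy realization of the similarity-based partitioning is sketched here; the seed choice and the greedy pull of nearest neighbors are assumptions, since the patent's exact procedure is given in the Fig. 9 pseudocode:

```python
def partition_by_similarity(sim, p):
    """sim: M x M similarity matrix; returns at most p blocks of document ids."""
    M = len(sim)
    size = (M + p - 1) // p          # target block size, ceil(M / p)
    remaining = set(range(M))
    parts = []
    while remaining:
        seed = min(remaining)        # deterministic seed choice (assumption)
        remaining.discard(seed)
        # pull in the documents most similar to the seed, up to the block size
        closest = sorted(remaining, key=lambda j: sim[seed][j], reverse=True)
        block = [seed] + closest[:size - 1]
        remaining.difference_update(block)
        parts.append(block)
    return parts
```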
Last is the Gibbs sampling process, which divides into two parts: sampling and updating. Sampling draws from each data set $X_i$; first the count matrices $n_{k,t}^{(i)}$ and $n_{m,k}^{(i)}$ in the sampling parameters are initialized. To avoid access conflicts between different data sets, these matrices are stored under each local thread, giving local count matrices; in this way only the local count matrices need to be updated during sampling. After an iteration completes, all threads merge the contents of their count matrices into the global count matrix $n_{k,t}$, and this matrix serves as the initial count matrix of the next iteration. This iterative process repeats until the Gibbs sampling of the whole data set converges, at which point iteration stops. From the resulting count matrices $n_{m,k}$, $n_{k,t}$ and count arrays $n_m$, $n_k$, the word probability distribution under each topic and the topic probability distribution of every document are computed. The pseudocode of the concrete realization is shown in Fig. 10.
2.5 Model training. With the Gibbs sampling formula obtained above, the LDA model is trained on the material. Training amounts to drawing samples of (z, w) from the material by Gibbs sampling; all parameters of the model can then be estimated from the final samples. The training procedure is as follows:
(1) randomly initialize: assign a random topic number z to each word w of every news document in the Chinese material;
(2) rescan the corpus and, for each word w, resample its topic according to the Gibbs sampling formula, updating it in the material;
(3) repeat the resampling of the corpus above until the Gibbs sampling converges;
(4) count the topic-term co-occurrence frequency matrix of the corpus; this frequency matrix is the LDA model. A sketch of this training loop is given below.
3. Fig. 11 shows the processing of an article of unknown topic in language F: the F-language LDA topic model of the article is obtained with the same LDA topic model generation method.
4. Similarity comparison.
The similarity between the Chinese LDA model of topic T and the LDA topic model of the F-language article of unknown topic is compared using the KL distance method. The formula is:

$$D_{KL}(p, q) = \sum_j p_j \log \frac{p_j}{q_j}$$

For all j, when $p_j = q_j$, $D_{KL}(p, q) = 0$. In view of the asymmetry of the KL distance formula, there is a symmetrized KL distance formula:

$$D_\lambda(p, q) = \lambda D_{KL}(p, \lambda p + (1-\lambda) q) + (1-\lambda) D_{KL}(q, \lambda p + (1-\lambda) q)$$

When $\lambda = 0.5$, the KL distance formula converts into the JS distance formula:

$$D_{JS}(p, q) = \frac{1}{2} D_{KL}\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q, \frac{p+q}{2}\right)$$

This patent selects the JS distance formula as the similarity measurement module; a sketch follows below.
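A minimal sketch of the JS distance used to compare two topic distributions (for example, topic mixtures from the two models); the epsilon smoothing against log 0 is an assumption:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions, smoothed against log 0."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """D_JS(p, q) = 0.5 * D_KL(p || m) + 0.5 * D_KL(q || m), m = (p + q) / 2."""
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.6, 0.3, 0.1]
q = [0.5, 0.3, 0.2]
print(js(p, q))   # 0 iff p == q; smaller means more similar topic models
```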
5. Evaluation criterion. This patent uses the F-measure as the evaluation criterion. The F-measure is a balanced index in information retrieval that combines the precision and recall indices. It is expressed by the following formulas:

$$P = \frac{N_a}{N_b}, \qquad R = \frac{N_a}{N_c}, \qquad F = \frac{2PR}{P + R}$$

where $N_a$ is the total number of individuals correctly identified, $N_b$ the total number of individuals identified, and $N_c$ the total number of individuals present in the test set; P is the precision and R the recall.
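A one-function sketch of the evaluation criterion; the counts are toy values, not results from the patent:

```python
def f_measure(n_a, n_b, n_c):
    """n_a: correctly identified; n_b: identified; n_c: present in test set."""
    p = n_a / n_b          # precision
    r = n_a / n_c          # recall
    return 2 * p * r / (p + r)

print(f_measure(n_a=80, n_b=100, n_c=120))  # P=0.8, R~0.67, F~0.73
```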

Claims (4)

1. A cross-language news topic similarity comparison method based on a parallel corpus, characterized in that the method comprises three steps:
The premise is possession of a parallel corpus covering two languages, general Chinese and language F; assuming the target news topic is T and the foreign language is F, the articles about topic T are screened out of the F-language articles of unknown topic without translation:
(1) in the parallel corpus each document pair has its own topic distribution, which the two languages, describing the same topics, share; first, the collection of Chinese articles about topic T is retrieved and, based on the general Chinese corpus in the parallel corpus, the Chinese LDA topic model of the article collection is obtained by the LDA topic model algorithm;
(2) then the Chinese topic-T LDA topic model is mapped into the generalized topic model space, yielding an LDA topic model of topic T shared by Chinese and language F; using the LDA algorithm, the F-language LDA topic model is obtained from the F-language articles of unknown topic to be screened and the F-language corpus in the parallel corpus;
(3) the LDA topic model on this generalized space is compared with the F-language LDA topic model; if they are similar, the article to be screened is judged to be an article about topic T.
2. The method according to claim 1, characterized in that the same topic carries similar semantic information in different languages, so the texts of different languages are represented in the same generalized topic space; once the training data has been labeled in one language, that is, once the topic model of certain Chinese news articles has been generated, it is mapped into the generalized topic space; using this generalized topic class, the foreign-language news articles of unknown topic are turned into a topic model, which is compared with the topic model of the generalized space to obtain the topic result; the steps are as follows:
For topic k,
(1) sample the word probability distribution of the general Chinese corpus, $\phi_k^C \sim \mathrm{Dirichlet}(\beta^C)$, and the word probability distribution of the general corpus of language F, $\phi_k^F \sim \mathrm{Dirichlet}(\beta^F)$;
(2) for the m-th Chinese/F-language document pair in the parallel corpus, $m \in [1, M]$, sample the topic probability distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$;
1. for the $n_C$-th term of the Chinese document, $n_C \in [1, N_m^C]$, select a latent topic $z^C \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^C \sim \mathrm{Multinomial}(\phi_{z^C}^C)$;
2. for the $n_F$-th term of the foreign-language document, $n_F \in [1, N_m^F]$, select a latent topic $z^F \sim \mathrm{Multinomial}(\theta_m)$ and generate a term $\omega^F \sim \mathrm{Multinomial}(\phi_{z^F}^F)$;
where C denotes Chinese and F the foreign language; $\theta_m$ is the topic probability distribution of the m-th bilingual parallel document pair; $\phi_k^C$ and $\phi_k^F$ are the word distribution probabilities of topic $z_k$ in Chinese and the foreign language; $z^C$ and $z^F$ are the latent topics of the n-th term of the source and target language of the m-th bilingual parallel document pair; $\omega^C$ and $\omega^F$ are the n-th terms of the Chinese and foreign-language documents of the m-th pair; M is the total number of document pairs; $N_m^C$ and $N_m^F$ are the total numbers of terms of the Chinese and foreign-language documents of pair m; $\theta_m$ obeys a Dirichlet distribution with prior parameter $\alpha$, used to generate topics; $\phi^C$ and $\phi^F$ obey Dirichlet distributions with prior parameters $\beta^C$ and $\beta^F$, used to generate terms; $\alpha$, $\beta^C$, $\beta^F$ are maximum-likelihood estimates governing the "document-topic" and "topic-term" probability distributions respectively; the probability of the entire corpus is chosen as the optimization objective, and maximizing this objective yields the values of $\alpha$, $\beta^C$, $\beta^F$, and thus the LDA model.
3. The method according to claim 1, characterized in that, in the parameter estimation of the LDA topic model, the parameters to be estimated are the distribution of document topics and the distribution of topic words, and the concrete procedure is:
assume the news corpus contains M documents and the number of threads is p; the algorithm first initializes the sampling parameters; then the training corpus is divided into p data sets, assigning highly similar samples to the same data set; then the Gibbs sampling algorithm is used to sample the distributions above, each thread running the Gibbs sampling process in parallel on its assigned data set; since the terms w are observed and known, only the topic z is a latent variable in this distribution, so only the distribution $P(z \mid w)$ needs to be sampled; from the parameter estimates of the Dirichlet posterior distributions under the Bayesian framework, assuming the observed word is $\omega_i = t$, the parameter estimates are:

$$\hat{\theta}_{m,k} = \frac{n_{m,k} + \alpha_k}{\sum_{k=1}^{K} (n_{m,k} + \alpha_k)}, \qquad \hat{\phi}_{k,t} = \frac{n_{k,t} + \beta_t}{\sum_{t=1}^{V} (n_{k,t} + \beta_t)}$$

From this, the Gibbs sampling formula of the LDA model is obtained:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,k}^{\neg i} + \alpha_k}{\sum_{k'=1}^{K} (n_{m,k'}^{\neg i} + \alpha_{k'})} \cdot \frac{n_{k,t}^{\neg i} + \beta_t}{\sum_{t'=1}^{V} (n_{k,t'}^{\neg i} + \beta_{t'})}$$

where $\hat{\theta}_{m,k}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of the k-th topic of the m-th bilingual parallel document pair, with prior parameter $\alpha_k$ representing the probability distribution of the k-th topic; $\hat{\phi}_{k,t}$ is the parameter estimate under the Bayesian framework of the Dirichlet probability distribution of term t under topic k, with prior $\beta_t$ representing the word distribution of the topic; the superscript $\neg i$ excludes the current word from the counts; $\alpha_k$ and $\beta_t$ are maximum-likelihood estimates obtained by taking the probability of the entire corpus as the optimization objective and maximizing it;
the right-hand side of this formula expresses $P(\text{topic} \mid \text{document}) \cdot P(\text{word} \mid \text{topic})$, which is exactly the path probability of document → topic → term; since the K topics are chosen manually, the Gibbs sampling formula samples among these K paths;
after all threads have completed sampling, the corresponding document-topic matrix and topic-word matrix are obtained; all local parameters are averaged into a global parameter, which is then distributed to each parallel LDA algorithm as the initial sampling parameter of the next round of iteration, and sampling is carried out again; this is iterated until the parameters converge.
4. The method according to claim 1, characterized in that text similarity measurement is based on the TF-IDF algorithm, a statistical method used to assess how important a word is to a document within a document set or corpus; the TF-IDF values of every document are computed first, then each document is assigned a time scale according to the time it was published; based on this time scale, the similarity of two documents is solved with the following formula:

$$s_{a,b} = \frac{1}{\lvert t_a - t_b \rvert + 1} \cdot \frac{W_a \cdot W_b}{\lVert W_a \rVert \, \lVert W_b \rVert}$$

where $t_a$ and $t_b$ are the time scales of documents a and b, the denominator of the left fraction adds one to keep it from being 0, and $W_a \cdot W_b$ is the inner product of the TF-IDF word vectors of documents a and b.
CN201810245163.4A 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus Active CN108519971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Publications (2)

Publication Number Publication Date
CN108519971A true CN108519971A (en) 2018-09-11
CN108519971B CN108519971B (en) 2022-02-11

Family

ID=63434170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810245163.4A Active CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Country Status (1)

Country Link
CN (1) CN108519971B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160210718A1 (en) * 2015-01-16 2016-07-21 Oracle International Corporation Data-parallel parameter estimation of the latent dirichlet allocation model by greedy gibbs sampling
US20160232155A1 (en) * 2015-02-05 2016-08-11 International Business Machines Corporation Extracting and recommending business processes from evidence in natural language systems
CN106202065A (en) * 2016-06-30 2016-12-07 中央民族大学 A kind of across language topic detecting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Xingshu (陈兴蜀) et al.: "Research on Chinese-English Cross-Language Topic Detection Based on the ICE-LDA Model", 《工程科学与技术》 (Advanced Engineering Sciences) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339286A (en) * 2020-02-14 2020-06-26 重庆邮电大学 Method for researching research condition of exploration institution based on topic visualization
CN111339286B (en) * 2020-02-14 2024-02-09 四川超易宏科技有限公司 Method for exploring mechanism research conditions based on theme visualization
CN111553168A (en) * 2020-05-09 2020-08-18 识因智能科技(北京)有限公司 Bilingual short text matching method
CN112069394A (en) * 2020-08-14 2020-12-11 上海风秩科技有限公司 Text information mining method and device
CN112069394B (en) * 2020-08-14 2023-09-29 上海风秩科技有限公司 Text information mining method and device
CN112329481A (en) * 2020-10-27 2021-02-05 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN112329481B (en) * 2020-10-27 2022-07-19 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN113076467A (en) * 2021-03-26 2021-07-06 昆明理工大学 Chinese-crossing news topic discovery method based on cross-language neural topic model
CN113344107A (en) * 2021-06-25 2021-09-03 清华大学深圳国际研究生院 Theme analysis method and system based on kernel principal component analysis and LDA (latent Dirichlet Allocation analysis)
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA
CN114742077A (en) * 2022-04-15 2022-07-12 中国电子科技集团公司第十研究所 Generation method of domain parallel corpus and training method of translation model

Also Published As

Publication number Publication date
CN108519971B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN108519971A (en) A kind of across languages theme of news similarity comparison methods based on Parallel Corpus
Abrishami et al. Predicting citation counts based on deep neural network learning techniques
Kong et al. Fake news detection using deep learning
US11995702B2 (en) Item recommendations using convolutions on weighted graphs
Chen et al. Entity embedding-based anomaly detection for heterogeneous categorical events
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Rustam et al. Classification of shopify app user reviews using novel multi text features
He et al. Time-evolving Text Classification with Deep Neural Networks.
CN105183833B (en) Microblog text recommendation method and device based on user model
CN109036577B (en) Diabetes complication analysis method and device
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN104199833B (en) The clustering method and clustering apparatus of a kind of network search words
Sundus et al. A deep learning approach for arabic text classification
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN108304479B (en) Quick density clustering double-layer network recommendation method based on graph structure filtering
CN110909125B (en) Detection method of media rumor of news-level society
CN108986907A (en) A kind of tele-medicine based on KNN algorithm divides the method for examining automatically
CN110990718B (en) Social network model building module of company image lifting system
Munna et al. Sentiment analysis and product review classification in e-commerce platform
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Saikia et al. Modelling social context for fake news detection: a graph neural network based approach
Baboo et al. Sentiment analysis and automatic emotion detection analysis of twitter using machine learning classifiers
CN112215006B (en) Organization named entity normalization method and system
CN104573003B (en) Financial Time Series Forecasting method based on theme of news information retrieval
CN110162629B (en) Text classification method based on multi-base model framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Wang Qi
Inventor after: Wang Yongbin
Inventor after: Yu Shuiyuan
Inventor after: Cao Diezhen
Inventor after: Han Xiao
Inventor after: Dai Changsong
Inventor before: Wang Qi
Inventor before: Yu Shuiyuan
Inventor before: Cao Diezhen
Inventor before: Han Xiao
Inventor before: Dai Changsong