CN108519971B - Cross-language news topic similarity comparison method based on parallel corpus - Google Patents


Info

Publication number
CN108519971B
CN108519971B (application CN201810245163.4A)
Authority
CN
China
Prior art keywords
topic
lda
language
model
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810245163.4A
Other languages
Chinese (zh)
Other versions
CN108519971A (en
Inventor
王琦 (Wang Qi)
于水源 (Yu Shuiyuan)
曹轶臻 (Cao Diezhen)
韩笑 (Han Xiao)
戴长松 (Dai Changsong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN201810245163.4A priority Critical patent/CN108519971B/en
Publication of CN108519971A publication Critical patent/CN108519971A/en
Application granted granted Critical
Publication of CN108519971B publication Critical patent/CN108519971B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a cross-language news topic similarity comparison method based on a parallel corpus. The method comprises the following steps: (1) each pair of documents in the parallel corpus has its own topic distribution, and within a pair the two languages describe the same topics and share that topic distribution; first, a set of Chinese articles about a topic T is retrieved, and a Chinese LDA topic model of this article set is obtained with the LDA topic-model algorithm based on the Chinese general corpus in the parallel corpus; (2) the Chinese T-topic LDA topic model is then mapped into a generalized topic-model space to obtain a T-topic LDA topic model shared by Chinese and the language F, and an F-language LDA topic model is obtained by applying the LDA algorithm to the F-language articles of unknown topic to be screened together with the F-language corpus in the parallel corpus; (3) the LDA topic model in the generalized space is compared with the F-language LDA topic model, and if the two are similar, the article to be screened is considered an article about topic T. The method can quickly, accurately and automatically screen out articles on a specific topic without translation.

Description

Cross-language news topic similarity comparison method based on parallel corpus
Technical Field
The patent provides a cross-language news topic similarity comparison method based on a parallel corpus. The method can automatically screen foreign-language articles on a specific topic without translation. Its characteristics are that, given a bilingual parallel corpus, a bilingual LDA topic model is built on the basis of the LDA topic model, parallelization is realized with the help of a parallel computing framework, and automatic screening of multilingual news event reports can be realized quickly and efficiently. The method relates to the fields of corpora, word frequency analysis, similarity calculation and the like.
Background
How to automatically compare the topic similarity of cross-language news events without manual translation, and thereby automatically screen multilingual reports on the same news event, reduce the cost of direct manual translation, and grasp the news opinion of foreign media in a timely and accurate manner, is an urgent problem to be solved.
Machine translation has made great improvements and advances in the field of language translation in recent years. Machine translation is the process of converting one natural language into another using a computer. Since statistical methods were applied to machine translation, the accuracy of machine translation results has gradually improved. In practical application under the current technical conditions, however, machine translation is usually combined with manual translation: an article is first translated into another language by machine, and a complete and accurate article is obtained only after manual revision and proofreading. The manual translation step increases the consumption of labour and time, reduces efficiency and increases cost, so traditional machine translation appears too slow and too inaccurate when faced with the translation of vast numbers of multilingual news reports. This patent therefore proposes a corpus-based similarity comparison method for screening news articles, used as a preprocessing step for news topic identification and screening before machine translation, which reduces the noise of irrelevant reports, saves manpower and material resources, and improves efficiency.
To solve these problems, the patent proposes a cross-language news topic similarity comparison method based on a parallel corpus that works without translation. The method is based on a parallel corpus. A corpus is a large-scale electronic text collection: it uses the computer as its carrier and subjects language material that actually occurred in practical use to scientific sampling and reasonable analysis and processing, so as to obtain a genuinely useful electronic text resource. A parallel corpus is a set of texts in which each text is accompanied by one or more translations into other languages; the simplest parallel corpus pairs two language versions, an original text and its translation. Based on a parallel corpus, the method derives the word segmentation rules and word frequency distribution rules of the two languages. First, news reports about a certain topic event are selected in Chinese, and a topic model of these topic articles is generated with the LDA algorithm on the basis of a Chinese general corpus; then the foreign-language articles to be screened are selected and their text features are extracted on the basis of a foreign-language general corpus; finally, the text features obtained from the foreign-language news articles are compared with the topic model of the topic articles obtained from Chinese, and if they are similar, the foreign-language article is judged to be a foreign-language article about that topic.
The method extends the LDA topic model into a bilingual LDA topic model. Unlike the traditional LDA topic model, in which each document has its own independent topic distribution, the bilingual LDA model uses a bilingual parallel corpus to share the topic distribution within each document pair, so that the different languages describe the same topics. In addition, since the parallel corpus is written in different languages, the word frequency distributions of the two sides may differ. The method builds a bilingual LDA topic model in a generalized space on the basis of the bilingual parallel corpus; when a new corpus arrives, a new LDA model is generated and compared with the bilingual LDA model to decide the topic class of the new corpus.
In this method, Gibbs sampling is used to sample the distributions; considering the case of massive training samples, and in order to improve the efficiency of LDA model generation, a parallel LDA algorithm is implemented on the Spark parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to enable parallelization, adds the timestamp characteristic, and carries out the sampling process in a distributed environment, which improves the efficiency and accuracy of the whole process.
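For context on the framework side, the sketch below shows distributed topic-model training with Spark's built-in estimator (pyspark.ml.clustering.LDA, which by default uses online variational inference rather than the custom parallel Gibbs sampler described in this patent); the toy data, column names and parameter values are assumptions of this illustration:

# Illustration only: topic-model training on the Spark framework. The patent's
# own parallel Gibbs sampler is described in sections 2.2-2.4 of the detailed description.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("bilingual-lda-sketch").getOrCreate()

# assumed input: one row per document, "tokens" holding segmented, stop-word-filtered words
df = spark.createDataFrame(
    [(["经济", "贸易", "增长"],), (["economy", "trade", "growth"],)],
    ["tokens"])

cv_model = CountVectorizer(inputCol="tokens", outputCol="features",
                           vocabSize=50000).fit(df)
vectorized = cv_model.transform(df)

lda = LDA(k=20, maxIter=50)                    # 20 topics, assumed setting
model = lda.fit(vectorized)
topics = model.describeTopics(10)              # top-10 terms per topic
doc_topics = model.transform(vectorized)       # adds a per-document topic distribution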
Disclosure of Invention
The invention provides a method for screening foreign-language articles about a specific topic event based on a parallel corpus. The prerequisite is a general parallel corpus of the two languages Chinese and F. Assume the target news topic is topic T and the foreign language is F. Without any translation, the F-language articles about topic T are screened out from a large number of F-language articles of unknown topic as follows (a high-level sketch of the pipeline is given after the three steps):
1) Each pair of documents in the parallel corpus has its own topic distribution, and within a pair the two languages describe the same topics and share that topic distribution. First, a set of Chinese articles about topic T is retrieved, and a Chinese LDA topic model of this article set is obtained with the LDA topic-model algorithm based on the Chinese general corpus in the parallel corpus;
2) the Chinese T-topic LDA topic model is then mapped into a generalized topic-model space to obtain a T-topic LDA topic model shared by Chinese and the language F, and an F-language LDA topic model is obtained by applying the LDA algorithm to the F-language articles of unknown topic to be screened together with the F-language corpus in the parallel corpus;
3) the LDA topic model in the generalized space is compared with the F-language LDA topic model, and if the two are similar, the article to be screened is considered an article about topic T.
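A minimal end-to-end sketch of these three steps in Python; every helper name below (train_lda_model, map_to_generalized_space, js_distance) is a hypothetical placeholder for the corresponding component described in the detailed description, not an existing API, and the threshold value is arbitrary:

# Hypothetical pipeline sketch of the three screening steps above.
def screen_f_articles(chinese_t_articles, chinese_corpus, f_corpus,
                      f_unknown_articles, threshold=0.4):
    """Return the F-language articles judged to be about topic T."""
    # (1) Chinese LDA topic model of the T-topic article set
    zh_t_model = train_lda_model(chinese_t_articles, chinese_corpus)   # hypothetical helper
    # (2) map it into the generalized topic space shared by Chinese and language F
    shared_t_model = map_to_generalized_space(zh_t_model)              # hypothetical helper
    selected = []
    for article in f_unknown_articles:
        # LDA model of the unknown-topic F-language article, built on the F-language corpus
        f_model = train_lda_model([article], f_corpus)                 # hypothetical helper
        # (3) similarity comparison in the generalized space (JS distance, see section 4)
        if js_distance(shared_t_model, f_model) < threshold:           # hypothetical helper
            selected.append(article)
    return selected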
The same topic expressed in different languages contains similar semantic information, so texts in different languages can be represented in the same generalized topic space. After the labelled training data of one language has been processed, i.e. after topic models of some Chinese news articles have been generated, they are mapped into the generalized topic space; the generalized topic classes are then used to generate topic models for foreign-language news articles of unknown topic, which are compared with the topic models in the generalized topic space to obtain the topic result. The steps are as follows:
For topic k:
(1) Sample the word probability distribution of the generalized Chinese corpus, φ_k^C ~ Dirichlet(β^C), and the word probability distribution of the generalized corpus of the foreign language F, φ_k^F ~ Dirichlet(β^F).
(2) For the m-th pair of Chinese and F-language documents in the parallel corpus, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α);
① for each of the N_m^C terms of the Chinese document d_m^C, n_C ∈ [1, N_m^C], select the implicit topic Z^C ~ Dirichlet(θ_m) and generate the term ω^C from φ^C under topic Z^C;
② for each of the N_m^F terms of the foreign-language document d_m^F, n_F ∈ [1, N_m^F], select the implicit topic Z^F ~ Dirichlet(θ_m) and generate the term ω^F from φ^F under topic Z^F.
Here C denotes Chinese and F denotes the foreign language; θ_m denotes the topic probability distribution of the m-th pair of bilingual parallel documents; φ_k^C and φ_k^F respectively denote the vocabulary distribution probability of topic Z_k in Chinese and in the foreign language; Z^C and Z^F respectively denote the implicit topics of the n-th terms of the source-language and target-language sides of the m-th pair of bilingual parallel documents; ω^C and ω^F respectively denote the n-th terms of the Chinese and foreign-language sides of the m-th pair of bilingual parallel documents; M denotes the total number of document pairs; N_m^C and N_m^F respectively denote the total numbers of words of the Chinese and foreign-language documents of the m-th pair; θ_m obeys a Dirichlet distribution, α is its prior parameter and is used to generate topics; φ_k^C and φ_k^F obey Dirichlet distributions, β^C and β^F are their prior parameters and are used to generate terms. α, β^C and β^F are maximum-likelihood estimators, representing the "document-topic" probability distribution and the "topic-term" probability distributions respectively; the probability of the whole corpus is chosen as the optimization objective function, and α, β^C and β^F are obtained by maximizing this objective function, which yields the LDA model.
The parameter estimation of the LDA topic model, i.e. the estimation of the "document-topic" distributions and the "topic-term" distributions, is implemented as follows:
Suppose the news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; the training corpus is then divided evenly into p data sets, samples with high similarity being placed in the same data set. Next the distribution is sampled with the Gibbs sampling algorithm, each thread performing the Gibbs sampling process on its assigned data set in parallel. Since the terms w are observed in the known data and only the topic z in the distribution is a hidden variable, only the distribution P(z | w) needs to be sampled. Using parameter estimation of the Dirichlet posterior distribution under the Bayesian framework, and assuming the observed word ω_i = t, the parameters are estimated as:

\hat{\theta}_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}

\hat{\varphi}_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_k^{(t)} + \beta_t\right)}

From these, the Gibbs sampling formula of the LDA model is obtained:

p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}

where \hat{\theta}_{m,k} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of the k-th topic of the m-th pair of bilingual parallel documents, α_k is its prior parameter and represents the probability distribution of the k-th topic, n_m^{(k)} is the number of words of the m-th document generated by the k-th topic, \hat{\varphi}_{k,t} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of term t of topic k, and β_t is its prior probability, representing the word distribution of the t-th term. α_k and β_t are maximum-likelihood estimators: the probability of the whole corpus is chosen as the optimization objective function, and α_k and β_t are obtained by maximizing this objective function.
The right-hand side of the formula represents P(topic | document) · P(word | topic), the probability of the path document → topic → term; since the K topics are selected manually, the Gibbs sampling formula samples among these K paths.
After all threads have finished sampling, the corresponding document-topic matrices and topic-word matrices are obtained; the global parameters are computed by averaging all local parameters, distributed to each parallel LDA algorithm as the initial sampling parameters of the next iteration, and sampling is carried out again. This process is repeated until the parameters converge.
In order to keep the topics of the different data sets independent of each other, the patent adopts a partitioning strategy based on the time element. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each sampling iteration, the precision of the final model suffers some loss compared with the traditional LDA algorithm; this loss is mainly caused by the randomness of the even initial partitioning, i.e. the whole data set is divided evenly into p parts without considering the correlation between the documents in each part, so the topics inside each data set tend toward the average.
The text similarity measure is based on the TF-IDF algorithm, a statistical method for evaluating how important a word is to a document in a document collection or corpus. First the TF-IDF value of each document is computed; then a time scale is attached to each document according to its publication time, and the similarity of two documents is computed on this basis with the formula:

s_{a,b} = \frac{1}{|t_a - t_b| + 1} \times \frac{\sum_i W_{a,i} W_{b,i}}{\sqrt{\sum_i W_{a,i}^2}\,\sqrt{\sum_i W_{b,i}^2}}

where t_a and t_b denote the time scales of documents a and b, one is added to the denominator of the left factor to prevent it from being 0, and the numerator of the right factor is the inner product of the word TF-IDF vectors of documents a and b.
The effects and benefits of this patent:
1. The bilingual LDA topic model can automatically screen out articles on a given specific topic without translation;
2. The method serves as a preprocessing step for news topic identification and screening before machine translation, reducing the noise of irrelevant reports, saving manpower and material resources, and improving efficiency;
3. When the data are processed, parallelization is realized with a parallel computing framework, so the screening is fast;
4. On the basis of the parallel LDA algorithm, the data are partitioned according to the time-element-based strategy, which improves the precision of the final result.
Drawings
FIG. 1 Description of the foreign-language article screening method
FIG. 2 Processing of the Chinese articles of the topic news
FIG. 3 Dictionary-based word segmentation process
FIG. 4 Graphical model of the bilingual LDA model
FIG. 5 Flow chart of the parallel LDA algorithm
FIG. 6 Probability computation paths
FIG. 7 Algorithm parameter list
FIG. 8 Initialization algorithm pseudo code
FIG. 9 Partitioning pseudo code
FIG. 10 Gibbs sampling pseudo code
FIG. 11 F-language LDA topic model
The technical solution of the invention is described in detail below with reference to the accompanying drawings.
1. FIG. 2 shows the processing of the Chinese articles of the topic news. First the target topic T is selected, and a number of Chinese articles about topic T are retrieved from the Chinese portal websites and screened as the Chinese corpus; the corpus is then preprocessed by word segmentation and stop-word removal, in the following two steps:
(1) Word segmentation. Since this patent is based on a corpus, a dictionary-based word segmentation method using the corpus is sufficient: the character strings of the Chinese news articles are compared with the words in the segmentation dictionary according to the strategy specified by the corpus, and if a string to be segmented is found in the dictionary, the match succeeds. The word segmentation process is shown in FIG. 3.
(2) Stop-word removal. Stop words are words that occur frequently but carry little textual information; they are removed after segmentation.
2. A bilingual LDA model of topic T is generated, on the basis of the LDA model, from the T-topic Chinese corpus and the Chinese general corpus. This patent assumes that the same topic in different languages contains similar semantic information, so texts in different languages can be represented in the same generalized topic space. After the labelled training data of one language has been processed, i.e. after topic models of some Chinese news articles have been generated, they can be mapped into the generalized topic space; the generalized topic classes are used to generate topic models for foreign-language news articles of unknown topic, which are compared with the topic models in the generalized topic space to obtain the topic result. The steps are as follows:
FIG. 4 is the graphical-model representation of the bilingual LDA topic model; the graphical model illustrates the process of generating a document pair under the bilingual LDA topic model. C denotes Chinese and F the foreign language; θ_m denotes the topic probability distribution of the m-th pair of bilingual parallel documents; φ_k^C and φ_k^F respectively denote the vocabulary distribution probability of topic Z in Chinese and in the foreign language; Z^C and Z^F respectively denote the implicit topics of the n-th terms of the source-language and target-language sides of the m-th pair of bilingual parallel documents; ω^C and ω^F respectively denote the n-th terms of the Chinese and foreign-language sides of the m-th pair of bilingual parallel documents; M denotes the total number of document pairs; N_m^C and N_m^F respectively denote the total numbers of words of the Chinese and foreign-language documents of the m-th pair; θ_m obeys a Dirichlet distribution, and α is the Dirichlet prior parameter of the multinomial topic distribution under each document; φ_k^C and φ_k^F obey Dirichlet distributions, and β^C and β^F are their prior parameters, used to generate terms.
α reflects the relative relation between the implied topics in the document set, i.e. the "document-topic" probability distribution, and β reflects the "topic-feature word" probability distribution. The probability of the whole corpus is chosen as the optimization objective function, and α, β^C and β^F are obtained by maximizing this objective function, which yields the LDA model. Assuming there are M documents in the corpus, all terms, their corresponding topics and the topic counts are represented as follows:

w = (ω_1, ..., ω_M)
z = (z_1, ..., z_M)
n = (n_1, ..., n_M)

where ω_m denotes the words in the m-th document, z_m denotes the topic numbers corresponding to these words, and n_m^{(k)} denotes the number of words generated by the k-th topic in the m-th document.
2.1 Generation process of the LDA model. For topic k:
(1) Sample the word probability distribution of the generalized Chinese corpus, φ_k^C ~ Dirichlet(β^C), and the word probability distribution of the generalized corpus of the foreign language F, φ_k^F ~ Dirichlet(β^F).
(2) For the m-th pair of Chinese and F-language documents in the parallel corpus, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α);
① for each of the N_m^C terms of the Chinese document d_m^C, n_C ∈ [1, N_m^C], select the implicit topic Z^C ~ Dirichlet(θ_m) and generate the term ω^C from φ^C under topic Z^C;
② for each of the N_m^F terms of the foreign-language document d_m^F, n_F ∈ [1, N_m^F], select the implicit topic Z^F ~ Dirichlet(θ_m) and generate the term ω^F from φ^F under topic Z^F.
The generation probability of the topics in the whole corpus is:

p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)}

The generation probability of the terms in the whole corpus is:

p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}

Combining the two formulas gives:

p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)} \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)}
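As an illustration of this generative process (not the patent's implementation), a minimal NumPy sketch that samples a toy bilingual corpus from the model above; the vocabulary sizes, document lengths and hyperparameter values are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)

K, M = 4, 100                  # topics and document pairs (toy sizes)
V_C, V_F = 500, 400            # assumed vocabulary sizes for Chinese and language F
alpha = np.full(K, 0.1)        # Dirichlet prior of the per-pair topic distribution
beta_C = np.full(V_C, 0.01)    # Dirichlet prior of the Chinese topic-word distributions
beta_F = np.full(V_F, 0.01)    # Dirichlet prior of the F-language topic-word distributions

# one topic-word distribution per topic and per language (phi_k^C, phi_k^F)
phi_C = rng.dirichlet(beta_C, size=K)
phi_F = rng.dirichlet(beta_F, size=K)

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)                  # topic distribution shared by the pair
    n_C, n_F = rng.poisson(80), rng.poisson(60)     # document lengths (toy assumption)
    z_C = rng.choice(K, size=n_C, p=theta_m)        # implicit topics of the Chinese terms
    z_F = rng.choice(K, size=n_F, p=theta_m)        # implicit topics of the F-language terms
    w_C = np.array([rng.choice(V_C, p=phi_C[k]) for k in z_C])
    w_F = np.array([rng.choice(V_F, p=phi_F[k]) for k in z_F])
    corpus.append((w_C, w_F))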
2.2 Gibbs sampling in MCMC is a common way to estimate an LDA topic model. This patent uses Gibbs sampling for the distribution above; considering the case of massive training samples, and in order to improve the efficiency of LDA model generation, a parallelized LDA algorithm is implemented here with the help of a parallel computing framework. Compared with the traditional LDA algorithm, the parallel LDA algorithm in this patent makes several improvements to enable parallelization. The concrete implementation process is:
Suppose the news corpus contains m documents and the number of threads is p. The algorithm first initializes the sampling parameters; the training corpus is then divided evenly into p data sets, samples with high similarity being placed in the same data set (the similarity measure is described in detail later). The distribution is then sampled with the Gibbs sampling algorithm, each thread performing the Gibbs sampling process on its assigned data set in parallel. Since the word w is observed in the known data and only the topic z in this distribution is a hidden variable, only the distribution P(z | w) needs to be sampled. Using parameter estimation of the Dirichlet posterior distribution under the Bayesian framework, and assuming the observed word ω_i = t, the parameters are estimated as:

\hat{\theta}_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}

\hat{\varphi}_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_k^{(t)} + \beta_t\right)}

From these, the Gibbs sampling formula of the LDA model is obtained:

p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}

where \hat{\theta}_{m,k} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of the k-th topic of the m-th pair of bilingual parallel documents, α_k is its prior parameter and represents the probability distribution of the k-th topic, n_m^{(k)} is the number of words of the m-th document generated by the k-th topic, \hat{\varphi}_{k,t} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of term t of topic k, and β_t is its prior probability, representing the word distribution of the t-th term. α_k and β_t are maximum-likelihood estimators: the probability of the whole corpus is chosen as the optimization objective function, and α_k and β_t are obtained by maximizing this objective function.
After all threads have finished sampling, the corresponding document-topic matrices and topic-word matrices are obtained; the global parameters are computed by averaging all local parameters, then distributed to each parallel LDA algorithm as the initial sampling parameters of the next iteration, and sampling is carried out again. This process is repeated until the parameters converge; the flow chart of the whole process is shown in FIG. 5.
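A minimal single-thread sketch of this collapsed Gibbs update (the standard symmetric-prior LDA form of the sampling formula above; the count structures follow the parameter list of FIG. 7, and the bilingual and distributed aspects are omitted here):

import numpy as np

def gibbs_sweep(docs, z, n_mk, n_kt, n_k, alpha, beta, rng):
    """One Gibbs sweep: resample the topic of every word.

    docs: list of word-id arrays, one per document
    z:    list of topic-assignment arrays, same shapes as docs
    n_mk: (M, K) counts of words in document m assigned to topic k
    n_kt: (K, V) counts of word t assigned to topic k
    n_k:  (K,)   total words assigned to topic k
    """
    K, V = n_kt.shape
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k_old = z[m][n]
            # remove the current assignment (the "not i" counts of the formula)
            n_mk[m, k_old] -= 1
            n_kt[k_old, t] -= 1
            n_k[k_old] -= 1
            # p(z_i = k | z_-i, w) ∝ (n_mk + alpha) * (n_kt + beta) / (n_k + V * beta)
            probs = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=probs / probs.sum())
            z[m][n] = k_new
            n_mk[m, k_new] += 1
            n_kt[k_new, t] += 1
            n_k[k_new] += 1
    return z, n_mk, n_kt, n_k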
2.3 Partitioning strategy based on the time element. On the basis of the parallel LDA algorithm, this patent improves the data partitioning step in order to raise the precision of the final result. Because the parallel LDA algorithm cannot update the global sampling parameters in time during each sampling iteration, the precision of the final model suffers some loss compared with the traditional LDA algorithm; this is mainly caused by the randomness of the even initial partitioning, i.e. the whole data set is divided evenly into p parts without considering the correlation between the documents in each part, so the topics within each data set tend toward the average.
Observation of the parallel LDA algorithm and comparison with the traditional serial LDA algorithm show that the topic-word matrix of the parallel algorithm cannot update the Gibbs sampling parameters in time, so the computed document-topic probabilities deviate somewhat; the algorithm therefore makes the topics corresponding to each data set as independent as possible and unaffected by the other data sets, which reduces this deviation.
For the text similarity calculation, the patent introduces a time-element-based partitioning strategy built on the TF-IDF algorithm. TF-IDF is a statistical method for evaluating how important a word is to a document in a document collection or corpus: the importance of a word is proportional to the number of times it appears in the document but inversely proportional to its frequency in the corpus as a whole.
TF-IDF consists of two parts, term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word appears in a document; the value is normalized by the number of words in the document to prevent a bias toward long texts.
tf_{i,j} denotes the term frequency (TF) value of the i-th word in the j-th document of the corpus and is computed as:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

In this formula, n_{i,j} is the number of occurrences of the word in document j, and the denominator \sum_k n_{k,j} is the total number of occurrences of all words in document j. The meaning of the whole formula is:

TF = (number of occurrences of the word in the document) / (total number of words in the document)

The inverse document frequency idf_i measures the importance of a word to the whole corpus and is computed as:

idf_i = \log \frac{D}{D(w_i) + 1}

where D is the total number of documents in the corpus and D(w_i) is the number of documents containing word i; one is added to the denominator to prevent it from being 0. The meaning of the whole formula is:

IDF = log( total number of documents / (number of documents containing the word + 1) )

With TF and IDF, the TF-IDF value is their product, i.e.:

W_{i,j} = tf_{i,j} \times idf_i
the document similarity is calculated according to TF-IDF values, the document similarity is expressed by using cosine values of vectors, the document is expressed as a vector W formed by TF-IDF forming all words of the document, in order to embody time elements, the life cycle of a topic discussed by news is assumed to be one month, then a time scale t is added to the news according to the time information of each document in a corpus, for example, the time scale t of the ith news published in the earliest one monthiThe similarity is weighted according to the time attribute, news documents published in similar life cycles are higher in similarity, and conversely, the longer the publication time interval of the two documents is, the lower the similarity is, so that the similarity s of the document a and the document b isa,bExpressed as:
Figure BDA0001606323670000101
wherein, taAnd tbTime scales representing documents a and b, left formulaThe addition of one to the denominator also prevents the denominator from being 0,
Figure BDA0001606323670000102
representing the vector product of the word TF-IDF vectors for document a and document b.
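A minimal sketch of this time-weighted TF-IDF similarity (a plain illustration of the formulas above; assigning integer time scales per publication month is a simplifying assumption):

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists; returns one {word: TF-IDF weight} dict per document."""
    D = len(docs)
    df = Counter(w for doc in docs for w in set(doc))     # D(w_i): documents containing the word
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({w: (c / total) * math.log(D / (df[w] + 1))
                        for w, c in counts.items()})
    return vectors

def similarity(wa, wb, ta, tb):
    """Time-weighted cosine similarity s_{a,b} between two TF-IDF vectors."""
    dot = sum(wa[w] * wb.get(w, 0.0) for w in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    cosine = dot / (na * nb) if na and nb else 0.0
    return cosine / (abs(ta - tb) + 1)                    # 1 / (|t_a - t_b| + 1) time weighting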
After the similarity matrix between the documents has been computed in this way, the whole corpus is divided into p parts according to the similarity between documents, documents with higher similarity being assigned to the same data set as far as possible.
2.4 Implementation of the parallel LDA algorithm. On the basis of the partitioned data sets described above, the implementation process of the parallel LDA algorithm and the corresponding pseudo code are given next. The algorithm consists of three parts: parameter initialization, data set partitioning, and Gibbs sampling.
Parameter name: parameter description
X: corpus
X_i: partitioned sub-data set
P: number of data blocks
K: number of topics
M: number of documents
V: number of words in the vocabulary
N_m: number of words in the m-th document
N: number of words in all documents
n_m: count array, the total number of words contained in the m-th document of X
n_k: count array, the total number of words assigned to the k-th topic in X
n_m^(k): count matrix, the number of words in the m-th document of X belonging to topic k
n_k^(t): count matrix, the number of words among all words of X belonging to topic k
n_k^(t) on X_i: count matrix, the number of words among all words of X_i belonging to topic k
n_m^(k) on X_i: count matrix, the number of words in the m-th document of X_i belonging to topic k
z_{m,n}: topic assignment of the n-th word in the m-th document
w_{m,n}: the n-th word in the m-th document
t_a, t_b: time scales of documents a and b
Sim(i,j): similarity between the i-th document and the j-th document
The table above describes the parameters of the parallel LDA algorithm; each part of the algorithm is described in detail below.
The first part is the initialization of the parameters: a topic is assigned at random to each word w_{m,n} in each document, after which the count matrices n_m^(k) and n_k^(t) and the count arrays n_m and n_k are updated. The pseudo code of this step is shown in FIG. 8.
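A minimal Python sketch of this random initialization (a plain stand-in for the pseudo code of FIG. 8; documents are assumed to be arrays of integer word ids):

import numpy as np

def init_counts(docs, K, V, rng):
    """Randomly assign a topic to every word and build the count structures of FIG. 7."""
    M = len(docs)
    n_mk = np.zeros((M, K), dtype=int)      # n_m^(k): words of document m assigned to topic k
    n_kt = np.zeros((K, V), dtype=int)      # n_k^(t): occurrences of word t assigned to topic k
    n_m = np.array([len(d) for d in docs])  # n_m: words per document
    n_k = np.zeros(K, dtype=int)            # n_k: words per topic
    z = []
    for m, doc in enumerate(docs):
        z_m = rng.integers(0, K, size=len(doc))      # random topic for each word w_{m,n}
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
        z.append(z_m)
    return z, n_mk, n_kt, n_m, n_k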
The second part is data set partitioning; the partitioning principle has been described in detail in section 2.3, and the implementation pseudo code is shown in FIG. 9.
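A simple greedy sketch consistent with the principle of section 2.3, grouping documents with high pairwise similarity into the same block; the greedy seeding scheme below is an assumption of this illustration, not taken from the pseudo code of FIG. 9:

def partition_by_similarity(sim, p):
    """Split document indices into p blocks, keeping similar documents together.

    sim: M x M similarity matrix Sim(i, j), e.g. from the time-weighted TF-IDF measure.
    """
    M = len(sim)
    size = (M + p - 1) // p                          # target block size
    unassigned = set(range(M))
    blocks = []
    for _ in range(p):
        if not unassigned:
            break
        seed = unassigned.pop()                      # start a block from any remaining document
        ranked = sorted(unassigned, key=lambda j: sim[seed][j], reverse=True)
        block = [seed] + ranked[:size - 1]           # add its most similar documents
        unassigned -= set(block[1:])
        blocks.append(block)
    if unassigned:                                   # leftovers go to the last block
        blocks[-1].extend(unassigned)
    return blocks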
Finally, the Gibbs sampling process is divided into two parts, sampling and updating. Each data set X_i is sampled; in this process the count matrices n_m^(k) and n_k^(t) among the sampling parameters are first initialized. To avoid access conflicts between different data sets, these matrices are stored separately under each local thread as local count matrices (n_k^(t) on X_i and n_m^(k) on X_i), so that only the local count matrices need to be updated during sampling. When one iteration is complete, all threads update the contents of their count matrices into the global count matrix n_k^(t), and this matrix is used as the initial count matrix of the next iteration. The iteration process is repeated until Gibbs sampling over the whole data set converges; the iteration then ends, and the count matrices n_m^(k) and n_k^(t) and the count arrays n_m and n_k are used to compute the word probability distribution under each topic and the topic probability distribution of each document. The pseudo code of the concrete implementation is shown in FIG. 10.
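A thread-pool sketch of this sample-then-merge loop, reusing the init_counts and gibbs_sweep sketches above; the averaged-delta merge is one reasonable reading of "computing the global parameter by averaging all local parameters", not the patent's exact pseudo code:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_lda(doc_blocks, z_blocks, n_mk_blocks, global_n_kt, alpha, beta,
                 iterations=200, seed=0):
    """Parallel Gibbs sampling over pre-partitioned blocks X_1 .. X_p."""
    p = len(doc_blocks)
    rngs = [np.random.default_rng(seed + i) for i in range(p)]   # one generator per thread
    for _ in range(iterations):
        # each thread works on a private copy of the global topic-word counts
        local_kts = [global_n_kt.copy() for _ in range(p)]

        def sample_block(i):
            n_k = local_kts[i].sum(axis=1)
            gibbs_sweep(doc_blocks[i], z_blocks[i], n_mk_blocks[i],
                        local_kts[i], n_k, alpha, beta, rngs[i])

        with ThreadPoolExecutor(max_workers=p) as pool:
            list(pool.map(sample_block, range(p)))

        # merge: average the local changes back into the global topic-word matrix,
        # which becomes the initial sampling parameter of the next iteration
        delta = sum(local - global_n_kt for local in local_kts)
        global_n_kt = global_n_kt + delta // p
    return z_blocks, global_n_kt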
2.5 Model training. The LDA model is trained on the corpus according to the Gibbs sampling formula obtained above. The training process obtains (z, w) samples of the corpus through Gibbs sampling, and all parameters of the model can be estimated from the finally sampled values. The training process is as follows:
(1) Random initialization: a topic number z is randomly attached to each word w in each news document of the Chinese corpus;
(2) The corpus is rescanned, the topic of each word w is resampled according to the Gibbs sampling formula, and the assignment is updated in the corpus;
(3) The resampling process over the corpus is repeated until Gibbs sampling converges;
(4) The topic-term co-occurrence frequency matrix of the corpus is counted; this matrix is the LDA model.
3. FIG. 6 shows the processing of an F-language article of unknown topic: the F-language LDA topic model of the article is obtained with the same LDA topic-model generation method.
4. Similarity comparison.
The similarity between the T-topic Chinese LDA model and the LDA topic model of the F-language article of unknown topic is compared with the KL-distance method. The formula is:

D_{KL}(p, q) = \sum_{j} p_j \log \frac{p_j}{q_j}

For all j, when p_j = q_j, D_{KL}(p, q) = 0. Considering the asymmetry of the KL distance formula, there is a symmetrized KL distance formula:

D_{\lambda}(p, q) = \lambda D_{KL}(p, \lambda p + (1-\lambda) q) + (1-\lambda) D_{KL}(q, \lambda p + (1-\lambda) q)

When λ = 0.5, the KL distance formula converts into the JS distance formula:

D_{JS}(p, q) = \frac{1}{2} D_{KL}\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q, \frac{p+q}{2}\right)

The JS distance formula is selected as the similarity measurement standard.
5. Evaluation standard. This patent adopts the F-measure as the evaluation standard. The F-measure is the balanced combination of the precision and recall indicators used in information retrieval, expressed with the following formulas:

P = N_a / N_b
R = N_a / N_c

F = \frac{2PR}{P + R}

where N_a denotes the total number of correctly identified individuals, N_b denotes the total number of identified individuals, N_c denotes the total number of individuals present in the test set, P denotes the precision and R denotes the recall.
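A minimal sketch of this evaluation (counting correctly screened articles against a labelled test set):

def f_measure(identified, relevant):
    """Precision P, recall R and F-measure for a set of screened article ids."""
    identified, relevant = set(identified), set(relevant)
    n_a = len(identified & relevant)      # N_a: correctly identified individuals
    n_b = len(identified)                 # N_b: identified individuals
    n_c = len(relevant)                   # N_c: individuals (relevant articles) in the test set
    p = n_a / n_b if n_b else 0.0
    r = n_a / n_c if n_c else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f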

Claims (3)

1. A cross-language news topic similarity comparison method based on a parallel corpus, characterized in that:
a general parallel corpus of the two languages Chinese and F is available; assuming the target news topic is topic T and the foreign language is F, and without any translation, the articles about topic T in language F are screened out from the F-language articles of unknown topic as follows:
(1) each pair of documents in the parallel corpus has an independent topic distribution, and within a pair the two languages describe the same topics and share that topic distribution; first, a set of Chinese articles about topic T is retrieved, and a Chinese LDA topic model of this article set is obtained with the LDA topic-model algorithm based on the Chinese general corpus in the parallel corpus;
(2) the Chinese T-topic LDA topic model is then mapped into a generalized topic-model space to obtain a T-topic LDA topic model shared by Chinese and the language F, and an F-language LDA topic model is obtained by applying the LDA algorithm to the F-language articles of unknown topic to be screened together with the F-language corpus in the parallel corpus;
(3) the LDA topic model in the generalized space is compared with the F-language LDA topic model, and if the two are similar, the article to be screened is considered an article about topic T; the same topic in different languages contains similar semantic information, so texts in different languages are represented in the same generalized topic space; after the labelled training data of one language has been processed, i.e. after topic models of some Chinese news articles have been generated, they are mapped into the generalized topic space, the generalized topic classes are used to generate topic models of foreign-language news articles of unknown topic, and these are compared with the topic models in the generalized topic space to obtain the topic result, the steps being as follows:
for the k-th topic Z_k:
(1) sample the word probability distribution of the generalized Chinese (C) corpus, φ_k^C ~ Dirichlet(β^C), and the word probability distribution of the generalized corpus of the foreign language F, φ_k^F ~ Dirichlet(β^F);
(2) for the m-th pair of Chinese (C) and foreign-language (F) documents in the parallel corpus, m ∈ [1, M], sample the topic probability distribution θ_m ~ Dirichlet(α);
① for each of the N_m^C terms of the Chinese document d_m^C, n_C ∈ [1, N_m^C], select the implicit topic Z^C ~ Dirichlet(θ_m) and generate the term ω^C from φ^C under topic Z^C;
② for each of the N_m^F terms of the foreign-language document d_m^F, n_F ∈ [1, N_m^F], select the implicit topic Z^F ~ Dirichlet(θ_m) and generate the term ω^F from φ^F under topic Z^F;
wherein C denotes Chinese and F denotes the foreign language; θ_m denotes the topic probability distribution of the m-th pair of bilingual parallel documents; φ_k^C and φ_k^F respectively denote the vocabulary distribution probability of the k-th topic Z_k in Chinese and in the foreign language; Z^C and Z^F respectively denote the implicit topics of the n_C-th and n_F-th terms of the source-language and target-language sides of the m-th pair of bilingual parallel documents; ω^C and ω^F respectively denote the n_C-th and n_F-th terms of the Chinese and foreign-language sides of the m-th pair of bilingual parallel documents; M denotes the total number of document pairs; N_m^C and N_m^F respectively denote the total numbers of words of the Chinese and foreign-language documents of the m-th pair; θ_m obeys a Dirichlet distribution, α is its prior parameter and is used to generate topics; φ_k^C and φ_k^F obey Dirichlet distributions, β^C and β^F are their prior parameters and are used to generate terms; α, β^C and β^F are maximum-likelihood estimators, α representing the "document-topic" probability distribution, β^C the Chinese "topic-term" probability distribution and β^F the foreign-language "topic-term" probability distribution; the probability of the whole corpus is chosen as the optimization objective function, and α, β^C and β^F are obtained by maximizing this objective function, which yields the LDA model.
2. The method of claim 1, wherein, in the parallelization process of the LDA topic model, the parameters to be estimated, namely the "document-topic" distribution and the "topic-term" distribution, are estimated by dividing the corpus into p sub-data sets according to the implementation requirements of the distributed system, the implementation process being as follows:
suppose the news corpus contains m documents and the number of threads is p; the algorithm first initializes the sampling parameters; the training corpus is then divided evenly into p data sets, samples with high similarity being placed in the same data set; next the distribution is sampled with the Gibbs sampling algorithm, each thread performing the Gibbs sampling process on its assigned data set in parallel; since the term ω is obtained by observing known data and only the topic z in the distribution is a hidden variable, only the distribution P(z | ω) needs to be sampled; using parameter estimation of the Dirichlet posterior distribution under the Bayesian framework, and assuming the observed word ω_i = t, the parameters are estimated as:

\hat{\theta}_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}

\hat{\varphi}_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_k^{(t)} + \beta_t\right)}

wherein \hat{\theta}_{m,k} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of the k-th topic of the m-th pair of bilingual parallel documents, α_k is its prior parameter and represents the probability distribution of the k-th topic, K is the number of topics, n_m^{(k)} denotes the number of words produced by the k-th topic in the m-th document, \hat{\varphi}_{k,t} is the parameter estimate, under the Bayesian framework, of the Dirichlet probability distribution of term t of the k-th topic, β_t is its prior probability and represents the distribution of term t, V is the number of terms in each topic, and n_k^{(t)} denotes the count of term t in the k-th topic; α_k and β_t are maximum-likelihood estimators: the probability of the whole corpus is chosen as the optimization objective function, and α_k and β_t are obtained by maximizing this objective function;
the Gibbs sampling formula of the LDA model is thus obtained:

p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}

wherein z_i denotes the topic corresponding to the i-th word in the corpus, i = (m, n) is a two-dimensional subscript corresponding to the n-th word of the m-th document, and z_{¬i} denotes the topics corresponding to the words other than the word with subscript i; the right-hand side of the Gibbs sampling formula of the LDA model represents P(topic | document) · P(word | topic), which is the probability of the path document → topic → term; since the K topics are selected manually, the Gibbs sampling formula samples among these K paths;
after all threads have finished sampling, the corresponding document-topic matrices and topic-term matrices are obtained, the global parameters are computed by averaging all local parameters and distributed to each parallel LDA algorithm as the initial sampling parameters of the next iteration, and sampling is carried out again; this process is repeated until the parameters converge.
3. The method of claim 1, wherein the text similarity measure is based on the TF-IDF algorithm, a statistical method for evaluating how important a word is to a document in a document collection or corpus; first the TF-IDF value of each document is computed, then a time scale is attached to each document according to its publication time, and the similarity of two documents is computed on this basis with the formula:

s_{a,b} = \frac{1}{|t_a - t_b| + 1} \times \frac{\sum_i W_{a,i} W_{b,i}}{\sqrt{\sum_i W_{a,i}^2}\,\sqrt{\sum_i W_{b,i}^2}}

wherein t_a and t_b denote the time scales of documents a and b, one is added to the denominator of the left factor to prevent it from being 0, W_{a,i} denotes the term weight of the i-th word in document a, W_{b,i} denotes the term weight of the i-th word in document b, and the numerator of the right factor is the inner product of the word TF-IDF vectors of document a and document b.
CN201810245163.4A 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus Active CN108519971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810245163.4A CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Publications (2)

Publication Number Publication Date
CN108519971A CN108519971A (en) 2018-09-11
CN108519971B true CN108519971B (en) 2022-02-11

Family

ID=63434170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810245163.4A Active CN108519971B (en) 2018-03-23 2018-03-23 Cross-language news topic similarity comparison method based on parallel corpus

Country Status (1)

Country Link
CN (1) CN108519971B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339286B (en) * 2020-02-14 2024-02-09 四川超易宏科技有限公司 Method for exploring mechanism research conditions based on theme visualization
CN111553168A (en) * 2020-05-09 2020-08-18 识因智能科技(北京)有限公司 Bilingual short text matching method
CN112069394B (en) * 2020-08-14 2023-09-29 上海风秩科技有限公司 Text information mining method and device
CN112329481B (en) * 2020-10-27 2022-07-19 厦门大学 Training method of multi-language machine translation model for relieving language-to-difference conflict
CN112580355B (en) * 2020-12-30 2021-08-31 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN113076467A (en) * 2021-03-26 2021-07-06 昆明理工大学 Chinese-crossing news topic discovery method based on cross-language neural topic model
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA
CN114742077A (en) * 2022-04-15 2022-07-12 中国电子科技集团公司第十研究所 Generation method of domain parallel corpus and training method of translation model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202065A (en) * 2016-06-30 2016-12-07 中央民族大学 A kind of across language topic detecting method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10860829B2 (en) * 2015-01-16 2020-12-08 Oracle International Corporation Data-parallel parameter estimation of the Latent Dirichlet allocation model by greedy Gibbs sampling
US9514124B2 (en) * 2015-02-05 2016-12-06 International Business Machines Corporation Extracting and recommending business processes from evidence in natural language systems

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202065A (en) * 2016-06-30 2016-12-07 中央民族大学 A kind of across language topic detecting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Chinese-English cross-language topic discovery based on the ICE-LDA model; Chen Xingshu et al.; 《工程科学与技术》; 2017-03-31; Vol. 49, No. 2; pp. 100-106 *

Also Published As

Publication number Publication date
CN108519971A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
Yao et al. Efficient methods for topic model inference on streaming document collections
CN108073568B (en) Keyword extraction method and device
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN110717034A (en) Ontology construction method and device
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN104750875B (en) A kind of machine error data classification method and system
CN107329954B (en) Topic detection method based on document content and mutual relation
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112559684A (en) Keyword extraction and information retrieval method
CN110633371A (en) Log classification method and system
CN113761218A (en) Entity linking method, device, equipment and storage medium
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
Marchant et al. In search of an entity resolution OASIS: optimal asymptotic sequential importance sampling
Shi et al. A new evaluation framework for topic modeling algorithms based on synthetic corpora
CN110866102A (en) Search processing method
CN106681985A (en) Establishment system of multi-field dictionaries based on theme automatic matching
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN113032573B (en) Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
CN113486670B (en) Text classification method, device, equipment and storage medium based on target semantics
CN112215006B (en) Organization named entity normalization method and system
WO2016090625A1 (en) Scalable web data extraction
CN113139061B (en) Case feature extraction method based on word vector clustering
CN113780006A (en) Training method of medical semantic matching model, medical knowledge matching method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wang Qi

Inventor after: Wang Yongbin

Inventor after: Yu Shuiyuan

Inventor after: Cao Diezhen

Inventor after: Han Xiao

Inventor after: Dai Changsong

Inventor before: Wang Qi

Inventor before: Yu Shuiyuan

Inventor before: Cao Diezhen

Inventor before: Han Xiao

Inventor before: Dai Changsong

CB03 Change of inventor or designer information