CN109117436A - Automatic synonym discovery method and system based on topic model - Google Patents

Automatic synonym discovery method and system based on topic model

Info

Publication number
CN109117436A
Authority
CN
China
Prior art keywords
topic, word, synonym, words, clustering
Prior art date
Legal status
Withdrawn
Application number
CN201710492902.5A
Other languages
Chinese (zh)
Inventor
曲德君
李进岭
曹大军
杨冠军
郁抒思
Current Assignee
Shanghai Xinfeifan E-Commerce Co Ltd
Original Assignee
Shanghai Xinfeifan E-Commerce Co Ltd
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2019-01-01
Application filed by Shanghai Xinfeifan E-Commerce Co Ltd filed Critical Shanghai Xinfeifan E-Commerce Co Ltd
Priority to CN201710492902.5A
Publication of CN109117436A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/247: Thesauruses; Synonyms

Abstract

The invention discloses an automatic synonym discovery method based on a topic model, comprising at least the following steps: importing the data in which synonyms are to be found; performing word segmentation on the imported data according to the information in a database; constructing a topic model and performing topic-model clustering; performing minimum-correlation clustering on the topic clusters; and outputting the synonyms. The method requires neither prior knowledge nor manual labeling, realizes automatic clustering of synonyms, and improves the efficiency of synonym discovery. It also alleviates the problem of semantic similarity to a certain extent, and no manual intervention is needed during implementation except for the final screening, so the efficiency of automatic synonym discovery is greatly improved.

Description

Automatic synonym discovery method and system based on topic model
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for automatically discovering synonyms based on a topic model.
Background
With the development of the information age, the scale of web text data keeps growing, so natural language processing is becoming ever more important; new words appear constantly, and the importance of automatic semantic analysis techniques, such as automatic synonym discovery, grows by the day. Existing mainstream automatic synonym discovery algorithms require prior knowledge to construct reference text patterns for synonym discovery, which limits their efficiency; in another reference-text pattern-matching approach, the parts of speech and semantics of known words must be manually labeled in advance in order to construct the reference text patterns.
Referring to fig. 1, in existing systems synonym discovery must be assisted by manual screening; because automatic synonym discovery methods have a certain error rate, existing synonym discovery methods are all inefficient.
The patent application CN201410156107.5 claims a synonym determination method, a synonym search method, and a synonym server; however, as far as the technical solutions in that application's documents can be understood, the solutions it gives cannot improve the efficiency of synonym discovery.
Disclosure of Invention
The invention aims to provide an automatic synonym discovery method based on a topic model: a topic model is constructed by analyzing the co-occurrence probability of words, words of the same topic are gathered into the same cluster by Gibbs sampling, and the words of each cluster are further clustered by an iterative minimum-correlation method to obtain candidate synonym groups.
The technical scheme adopted by the invention for solving the technical problems is as follows:
An automatic synonym discovery method based on a topic model at least comprises the following steps:
importing data of synonyms to be found;
performing word segmentation processing on the imported data according to the information of the database;
constructing a topic model and performing topic-model clustering;
performing minimum correlation clustering on the topic clusters;
outputting the synonyms.
Wherein, a step of manually screening synonyms is further included after the step of outputting synonyms.
Wherein the topic model may be a latent Dirichlet allocation (LDA) model, and the clustering step at least comprises:
sampling from the Dirichlet distribution Dir(α) to generate the topic distribution θ_i of document i, where α is a user-preset Dirichlet parameter describing how evenly topics are distributed over documents, and θ_i is one sample of Dir(α);
sampling from the topic distribution θ_i to generate the topic z_{i,j} of the j-th word of document i;
sampling from the Dirichlet distribution Dir(β) (β being another Dirichlet parameter) to generate the word distribution φ_{z_{i,j}} of topic z_{i,j};
sampling from the word distribution φ_{z_{i,j}} to finally generate the word w_{i,j}.
Wherein topic-model clustering proceeds under the precondition that the topics of all other words are fixed; the posterior probability P that a word z_i belongs to topic j is

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

where W is the total number of words, T is the total number of latent topics, α and β are the user-set parameters above, $n_{-i,j}^{(w_i)}$ is the number of times word $w_i$ is assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(\cdot)}$ is the total number of words assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic j when $z_i$ is excluded, and $n_{-i,\cdot}^{(d_i)}$ is the total number of other words in document $d_i$.
Wherein the topic clustering is Gibbs-sampling topic clustering, comprising at least the following steps:
A. randomly assign each word in the document set to a topic;
B. for each word of the document set, tentatively assign it to each topic in turn, compute the probability P that the word belongs to that topic, and finally assign the word to the topic with the highest P;
C. iteratively execute step B until the probability change in each iteration is below a user-given threshold.
When minimum-correlation clustering is performed on the topic clusters, the co-occurrence of words in the document set is measured by the Pearson correlation coefficient. For a word $w_i$ belonging to topic T, let $r_{i,k}$ be the number of occurrences of the word in document $d_k$, and construct a vector $\vec{r}_i$ whose length equals the number of documents in the document set and whose k-th entry is $r_{i,k}$. Then, for each pair of words in a topic, the Pearson correlation coefficient ρ between $\vec{r}_i$ and $\vec{r}_j$ is

$$\rho_{i,j} = \frac{(\vec{r}_i - \bar{r}_i) \cdot (\vec{r}_j - \bar{r}_j)}{\lVert \vec{r}_i - \bar{r}_i \rVert \, \lVert \vec{r}_j - \bar{r}_j \rVert}$$

wherein $\bar{r}_i$ is the mean of the entries of $\vec{r}_i$, and $\rho_{i,j}$ is the cosine of the angle between the two centered vectors $\vec{r}_i - \bar{r}_i$ and $\vec{r}_j - \bar{r}_j$.
Wherein the minimum-correlation clustering at least comprises:
A1. randomly assign each word in topic T to a cluster;
B1. for each word, compute the Pearson correlation coefficient between the word's vector and the average vector of each cluster (excluding the word itself), and select the cluster with the lowest Pearson correlation coefficient as the cluster to which the word belongs;
C1. iteratively execute step B1 until the change in the correlation coefficient in each iteration is below the threshold.
An automatic synonym discovery system based on a topic model, which comprises a database storing natural language processing information, and is characterized by at least comprising:
the data import module is used for importing data of synonyms to be found;
the word segmentation processing module is used for carrying out word segmentation processing on the imported data according to the information of the database;
the topic model clustering module is used for constructing a topic model and clustering the topic model;
the minimum correlation clustering module is used for performing minimum correlation clustering on the theme clusters;
and the synonym output module is used for outputting synonym data.
The invention has the following beneficial effects:
the invention discloses a synonym automatic discovery method based on a topic model. And constructing a topic model by analyzing the mutual occurrence probability of the words, and gathering the words expressing the same topic. Then, the topics are further clustered into alternative synonym groups by a minimal correlation clustering method. According to the method, prior knowledge and manual labeling are not needed, automatic clustering of synonyms is achieved, and the efficiency of synonym discovery is improved; the problem of semantic similarity is solved to a certain extent, and manual intervention is not needed except for final screening in the implementation process, so that the efficiency of automatic synonym discovery is greatly improved.
Drawings
FIG. 1 is a flow diagram of a method of a synonym discovery system in the prior art;
FIG. 2 is a flowchart illustrating a method for automatically discovering synonyms based on a topic model according to the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the following embodiments and the accompanying drawings.
The invention provides an automatic synonym discovery method based on a topic model, which at least comprises the following steps:
importing data of synonyms to be found;
performing word segmentation processing on the imported data according to the information of the database;
constructing a topic model and performing topic-model clustering;
performing minimum correlation clustering on the topic clusters;
outputting the synonyms.
In the method of the present invention, a step of manually screening synonyms follows the step of outputting synonyms; manual screening can be implemented with existing techniques and is therefore not detailed in this embodiment.
In the present invention, the topic model may be a latent Dirichlet allocation (LDA) model, and the clustering step at least comprises:
sampling from the Dirichlet distribution Dir(α) to generate the topic distribution θ_i of document i, where α is a user-preset Dirichlet parameter describing how evenly topics are distributed over documents, and θ_i is one sample of Dir(α);
sampling from the topic distribution θ_i to generate the topic z_{i,j} of the j-th word of document i;
sampling from the Dirichlet distribution Dir(β) (β being another Dirichlet parameter) to generate the word distribution φ_{z_{i,j}} of topic z_{i,j};
sampling from the word distribution φ_{z_{i,j}} to finally generate the word w_{i,j}.
Topic-model clustering proceeds under the precondition that the topics of all other words are fixed; the posterior probability P that the topic z_i of a certain word equals topic j is

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

where W is the total number of words, T is the total number of latent topics, α and β are the user-set parameters above, $n_{-i,j}^{(w_i)}$ is the number of times word $w_i$ is assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(\cdot)}$ is the total number of words assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic j when $z_i$ is excluded, and $n_{-i,\cdot}^{(d_i)}$ is the total number of other words in document $d_i$.
The topic clustering is Gibbs-sampling topic clustering, comprising at least the following steps:
A. randomly assign each word in the document set to a topic;
B. for each word of the document set, tentatively assign it to each topic in turn, compute the probability P that the word belongs to that topic, and finally assign the word to the topic with the highest P;
C. iteratively execute step B until the probability change in each iteration is below a user-given threshold.
In addition, when minimum-correlation clustering is performed on the topic clusters, the co-occurrence of words in the document set is measured by the Pearson correlation coefficient. For a word $w_i$ belonging to topic T, let $r_{i,k}$ be the number of occurrences of the word in document $d_k$, and construct a vector $\vec{r}_i$ whose length equals the number of documents in the document set and whose k-th entry is $r_{i,k}$. Then, for each pair of words in a topic, the Pearson correlation coefficient ρ between $\vec{r}_i$ and $\vec{r}_j$ is

$$\rho_{i,j} = \frac{(\vec{r}_i - \bar{r}_i) \cdot (\vec{r}_j - \bar{r}_j)}{\lVert \vec{r}_i - \bar{r}_i \rVert \, \lVert \vec{r}_j - \bar{r}_j \rVert}$$

wherein $\bar{r}_i$ is the mean of the entries of $\vec{r}_i$, and $\rho_{i,j}$ is the cosine of the angle between the two centered vectors $\vec{r}_i - \bar{r}_i$ and $\vec{r}_j - \bar{r}_j$.
The minimum-correlation clustering comprises at least the following steps:
A1. randomly assign each word in topic T to a cluster;
B1. for each word, compute the Pearson correlation coefficient between the word's vector and the average vector of each cluster (excluding the word itself), and select the cluster with the lowest Pearson correlation coefficient as the cluster to which the word belongs;
C1. iteratively execute step B1 until the change in the correlation coefficient in each iteration is below the threshold.
In addition, the present invention further provides a system for automatically discovering synonyms based on a topic model using the above method; the system comprises a database storing natural language processing information and, referring to fig. 2, at least comprises:
the data import module is used for importing data of synonyms to be found;
the word segmentation processing module is used for carrying out word segmentation processing on the imported data according to the information of the database;
the topic model clustering module is used for constructing a topic model and clustering the topic model;
the minimum correlation clustering module is used for performing minimum correlation clustering on the theme clusters;
and the synonym output module is used for outputting synonym data.
The method can be applied to the analysis of massive unlabeled text material, and is particularly suitable for a cold-start corpus lacking known synonym data.
As further shown in fig. 2, the processing flow of the system of the present invention comprises data import, word segmentation, topic-model clustering, minimum-correlation clustering, synonym output, and an optional manual screening module.

Data import means that a certain amount of text is imported into the system as basic data; a text may be a news article, an advertisement, an academic paper, and so on, and is typically part of a collection of documents relevant to a particular business of an enterprise.

Word segmentation means that, because Chinese text has no separators between words, the imported text set must be segmented into words before further processing and semantic analysis. Word segmentation can be implemented with existing techniques and is therefore not detailed here (a minimal sketch appears at the end of this overview).

In addition, manual labeling is adopted in the embodiment of the invention. Manual labeling is generally divided into part-of-speech labeling and semantic labeling of the segmented result: part-of-speech labeling marks whether a word is a verb, a noun, a conjunction, and so on; semantic labeling further classifies words into preset categories, for example "husky" belongs to the animal category and "pen" belongs to the office-supply category. The occurrence patterns of words are then analyzed and frequent semantic sequences, i.e., templates, are found. Words that appear at the same position in the same semantic sequence may be synonyms; for example, "today's English exam" maps to the semantic sequence "time-course-verb", so the "exam" of "tomorrow's math exam", which occupies the same position in the same semantic sequence, may be a synonym of the "exam" of "today's English exam". Template generation can be implemented entirely with existing techniques and is not described further.

Synonym output means that, for a keyword input by the user, the system searches the template library for a template that fits the keyword, uses that template to find candidate synonyms of the keyword in the input text set, and finally determines the synonym relationships through manual screening. In the embodiment of the present invention, the data import, word segmentation, manual labeling, template generation, and synonym output of fig. 2 can all be implemented with the same techniques as in fig. 1.
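The patent names no specific segmentation tool; as a purely illustrative sketch, the open-source jieba library and the sample sentences below are assumptions.

```python
# A minimal word-segmentation sketch. The jieba library and the sample
# sentences are illustrative assumptions; the patent names no tool.
import jieba

documents = [
    "今天的英语考试",   # "today's English exam"
    "明天的数学考试",   # "tomorrow's math exam"
]

# Segment each document into a list of words.
segmented = [jieba.lcut(doc) for doc in documents]
print(segmented)   # e.g. [['今天', '的', '英语', '考试'], ...]
```

Any other segmenter with a comparable tokenize-to-list interface would serve the same role in the pipeline.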
In this embodiment, topic-model clustering is performed as follows.

The generation frequency of each word in a natural language text is studied according to the latent Dirichlet allocation (LDA) model. The model posits a latent set of topics: semantically, each document in the document set is an expression of one or more topics in that set, and each word in a document can be attributed to some topic. According to the LDA model, a document is generated as follows (a minimal sketch of this generative process follows the list):
a) sample from the Dirichlet distribution Dir(α) to generate the topic distribution θ_i of document i, where α is a Dirichlet parameter describing how evenly topics are distributed over documents and is generally preset by the user; θ_i is one sample of Dir(α);
b) sample from the topic distribution θ_i to generate the topic z_{i,j} of the j-th word of document i;
c) sample from the Dirichlet distribution Dir(β) (β being another Dirichlet parameter) to generate the word distribution φ_{z_{i,j}} of topic z_{i,j};
d) sample from the word distribution φ_{z_{i,j}} to finally generate the word w_{i,j}.
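A minimal sketch of this generative process, assuming illustrative values for the vocabulary size, topic count, and document lengths (none of which are given in the patent):

```python
# LDA generative process, steps a)-d) above, sketched with numpy.
import numpy as np

rng = np.random.default_rng(0)
W, T, n_docs, doc_len = 1000, 20, 5, 50  # assumed sizes, not from the patent
alpha, beta = 0.1, 0.01                  # user-preset Dirichlet parameters

# c) word distribution of every topic: phi_t ~ Dir(beta)
phi = rng.dirichlet(np.full(W, beta), size=T)

docs = []
for i in range(n_docs):
    theta_i = rng.dirichlet(np.full(T, alpha))      # a) theta_i ~ Dir(alpha)
    z_i = rng.choice(T, size=doc_len, p=theta_i)    # b) topic z_{i,j} ~ theta_i
    words = [rng.choice(W, p=phi[t]) for t in z_i]  # d) w_{i,j} ~ phi_{z_{i,j}}
    docs.append(words)
```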
In fact, because the distribution of the latent topics is unknown, fitting is generally performed in reverse: the topic distribution is inferred from the distribution of words in the document set so that the inferred topic distribution matches the actual word distribution as closely as possible. Common fitting methods include the maximum-likelihood method and the Gibbs sampling method, both of which have computational complexity O(N × T × i), i.e., they are on the same level. Under the precondition that the topics of all other words are fixed, the posterior probability P that a certain word z_i belongs to topic j is computed as (formula 1):

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

In formula 1, W is the total number of words, T is the total number of latent topics, and α and β are the user-set parameters above; $n_{-i,j}^{(w_i)}$ is the number of times word $w_i$ is assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(\cdot)}$ is the total number of words assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic j when $z_i$ is excluded, and $n_{-i,\cdot}^{(d_i)}$ is the total number of other words in document $d_i$.
Therefore, according to formula 1, Gibbs-sampling topic clustering is implemented as follows (a minimal sampler sketch follows the list):
a) randomly assign each word in the document set to a topic;
b) according to formula 1, for each word of the document set, compute the probability P of the word belonging to each topic in turn, and finally assign the word to the topic with the highest P;
c) iteratively execute step b) until the probability change in each iteration is below a user-given threshold.
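A minimal sketch of steps a)-c), reusing docs, W, T, alpha, and beta from the previous sketch; the fixed iteration count stands in for the patent's convergence threshold and is an assumption:

```python
# Collapsed Gibbs topic clustering per formula 1. Per step b) of the
# patent, each word is reassigned to its highest-probability topic.
import numpy as np

def gibbs_topic_clustering(docs, W, T, alpha, beta, iters=50):
    n_wt = np.zeros((W, T))          # word-topic counts  n^{(w)}_j
    n_dt = np.zeros((len(docs), T))  # doc-topic counts   n^{(d)}_j
    n_t = np.zeros(T)                # words per topic    n^{(.)}_j
    z = []                           # current topic assignment of each word
    for d, doc in enumerate(docs):   # step a): random initial assignment
        z_d = np.random.randint(T, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
    for _ in range(iters):           # step c): iterate step b)
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t_old = z[d][j]      # exclude z_i from all counts
                n_wt[w, t_old] -= 1; n_dt[d, t_old] -= 1; n_t[t_old] -= 1
                p = ((n_wt[w] + beta) / (n_t + W * beta)
                     * (n_dt[d] + alpha) / (n_dt[d].sum() + T * alpha))
                t_new = int(np.argmax(p))  # step b): highest-P topic
                z[d][j] = t_new
                n_wt[w, t_new] += 1; n_dt[d, t_new] += 1; n_t[t_new] += 1
    return z
```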
In this embodiment, the input of topic-model clustering is the word-segmented document set, and the output is the topic cluster to which each word belongs. Minimum-correlation clustering is further explained below:
1. Words belonging to the same topic are not necessarily synonyms. For example, in news about a disaster, the words "landslide" and "earthquake" describe the same event and belong to the same topic, but they are not synonyms because they belong to different semantic groups. Therefore, the words within a topic need to be clustered further into groups, such that the words in each group are more likely to be synonyms.
Judging from the context-pattern content of existing solutions, synonyms typically appear at the same pattern position in different documents; that is, synonyms typically do not appear in the same document. Therefore, the words within a topic can be further clustered by their co-occurrence: a group of words that rarely co-occur with one another is more likely to be a group of synonyms.
2. During minimum-correlation clustering, the co-occurrence of words in the document set can be measured by the Pearson correlation coefficient. For a word $w_i$ belonging to topic T, let $r_{i,k}$ be the number of occurrences of this word in document $d_k$. Thus, for each word, a vector $\vec{r}_i$ can be constructed whose length equals the number of documents in the document set and whose k-th entry is $r_{i,k}$. Then, for each pair of words in a topic, the Pearson correlation coefficient ρ between $\vec{r}_i$ and $\vec{r}_j$ is (formula 2):

$$\rho_{i,j} = \frac{(\vec{r}_i - \bar{r}_i) \cdot (\vec{r}_j - \bar{r}_j)}{\lVert \vec{r}_i - \bar{r}_i \rVert \, \lVert \vec{r}_j - \bar{r}_j \rVert}$$

In formula 2, $\bar{r}_i$ denotes the mean of the entries of $\vec{r}_i$, and $\rho_{i,j}$ is the cosine of the angle between the two centered vectors $\vec{r}_i - \bar{r}_i$ and $\vec{r}_j - \bar{r}_j$. The larger the angle, the smaller ρ and the fewer co-occurrences between $w_i$ and $w_j$; within the same topic, the less two words co-occur, the more likely they are synonyms. A short sketch of formula 2 follows.
Words within the same topic are then further clustered by the minimum-correlation clustering method to generate candidate synonym groups. Minimum-correlation clustering means generating a batch of clusters such that the pairwise Pearson correlation coefficients between the words within each cluster are minimal.
The minimum-correlation clustering is implemented as follows (a minimal sketch follows the list):
a) randomly assign each word in topic T to a cluster;
b) for each word, compute the Pearson correlation coefficient between the word's vector and the average vector of each cluster (excluding the word itself), and select the cluster with the lowest Pearson correlation coefficient as the cluster to which the word belongs;
c) iteratively execute step b) until the change in the correlation coefficient in each iteration is below the threshold.
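A minimal sketch of steps a)-c), reusing the centering of the pearson sketch; the cluster count, threshold, and iteration cap are illustrative assumptions:

```python
# Iterative minimum-correlation clustering of the words of one topic.
import numpy as np

def min_correlation_clustering(counts, n_clusters=5, tol=1e-4, max_iters=100):
    n_words = counts.shape[0]
    labels = np.random.randint(n_clusters, size=n_words)  # step a)
    prev_total = np.inf
    for _ in range(max_iters):
        total = 0.0
        for i in range(n_words):                          # step b)
            rhos = np.full(n_clusters, np.inf)
            for c in range(n_clusters):
                members = [k for k in range(n_words)
                           if labels[k] == c and k != i]  # exclude the word
                if not members:
                    continue
                avg = counts[members].mean(axis=0)        # cluster average vector
                r_i = counts[i] - counts[i].mean()
                r_c = avg - avg.mean()
                denom = np.linalg.norm(r_i) * np.linalg.norm(r_c)
                rhos[c] = r_i @ r_c / denom if denom else 0.0
            labels[i] = int(np.argmin(rhos))              # lowest correlation wins
            total += rhos[labels[i]]
        if abs(prev_total - total) < tol:                 # step c): convergence
            break
        prev_total = total
    return labels
```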
In the invention, the input of minimum-correlation clustering is the topic cluster to which each word belongs, and the output is the candidate synonym groups that further subdivide each topic, where the words in each group belong to the same semantic category and are synonyms of one another. An end-to-end sketch chaining the pieces above follows.
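A minimal end-to-end sketch; all names (docs, W, T, alpha, beta, gibbs_topic_clustering, min_correlation_clustering) come from the sketches above and are illustrative:

```python
# From segmented documents to candidate synonym groups per topic.
import numpy as np

z = gibbs_topic_clustering(docs, W, T, alpha, beta)  # topic of every word

for t in range(T):
    # distinct words assigned to topic t
    words_t = sorted({w for d, doc in enumerate(docs)
                      for w, zt in zip(doc, z[d]) if zt == t})
    if len(words_t) < 2:
        continue
    # occurrence matrix: counts[i, k] = r_{i,k} for word i in document k
    index = {w: i for i, w in enumerate(words_t)}
    counts = np.zeros((len(words_t), len(docs)))
    for k, doc in enumerate(docs):
        for w in doc:
            if w in index:
                counts[index[w], k] += 1
    groups = min_correlation_clustering(counts)      # candidate synonym groups
    print(f"topic {t}:", groups)
```

The resulting groups are candidates only; per the method, a final manual screening step confirms the synonym relationships.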
In summary, the method of the present invention constructs a topic model by analyzing the co-occurrence probability of words and gathers together the words that express the same topic; the topics are then further clustered into candidate synonym groups by the minimum-correlation clustering method. The invention alleviates the problem of semantic similarity to a certain extent, and no manual intervention is needed except for the final screening, so the efficiency of automatic synonym discovery is greatly improved.
The sequence of the above embodiments is only for convenience of description and does not imply any ranking of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. An automatic synonym discovery method based on a topic model, characterized by at least comprising the following steps:
importing data of synonyms to be found;
performing word segmentation processing on the imported data according to the information of the database;
constructing a topic model and performing topic-model clustering;
performing minimum correlation clustering on the topic clusters;
outputting the synonyms.
2. The method according to claim 1, further comprising a step of manually screening synonyms after the step of outputting synonyms.
3. The method according to claim 1, wherein the topic model is a latent Dirichlet allocation model, and the clustering step at least comprises:
sampling from the Dirichlet distribution Dir(α) to generate the topic distribution θ_i of document i, where α is a user-preset Dirichlet parameter describing how evenly topics are distributed over documents, and θ_i is one sample of Dir(α);
sampling from the topic distribution θ_i to generate the topic z_{i,j} of the j-th word of document i;
sampling from the Dirichlet distribution Dir(β) (β being another Dirichlet parameter) to generate the word distribution φ_{z_{i,j}} of topic z_{i,j};
sampling from the word distribution φ_{z_{i,j}} to finally generate the word w_{i,j}.
4. The method according to claim 1, wherein topic-model clustering proceeds under the precondition that the topics of all other words are fixed, and the posterior probability P that a word z_i belongs to topic j is

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

wherein W is the total number of words, T is the total number of latent topics, α and β are the user-set parameters above, $n_{-i,j}^{(w_i)}$ is the number of times word $w_i$ is assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(\cdot)}$ is the total number of words assigned to topic j when $z_i$ is excluded, $n_{-i,j}^{(d_i)}$ is the number of words in document $d_i$ assigned to topic j when $z_i$ is excluded, and $n_{-i,\cdot}^{(d_i)}$ is the total number of other words in document $d_i$.
5. The method according to claim 4, wherein the topic clustering is Gibbs-sampling topic clustering, comprising at least the following steps:
A. randomly assign each word in the document set to a topic;
B. for each word of the document set, tentatively assign it to each topic in turn, compute the probability P that the word belongs to that topic, and finally assign the word to the topic with the highest P;
C. iteratively execute step B until the probability change in each iteration is below a user-given threshold.
6. The method according to claim 1, wherein, when minimum-correlation clustering is performed on the topic clusters, the co-occurrence of words in the document set is measured by the Pearson correlation coefficient: for a word $w_i$ belonging to topic T, let $r_{i,k}$ be the number of occurrences of the word in document $d_k$, and construct a vector $\vec{r}_i$ whose length equals the number of documents in the document set and whose k-th entry is $r_{i,k}$; then, for each pair of words in a topic, the Pearson correlation coefficient ρ between $\vec{r}_i$ and $\vec{r}_j$ is

$$\rho_{i,j} = \frac{(\vec{r}_i - \bar{r}_i) \cdot (\vec{r}_j - \bar{r}_j)}{\lVert \vec{r}_i - \bar{r}_i \rVert \, \lVert \vec{r}_j - \bar{r}_j \rVert}$$

wherein $\bar{r}_i$ is the mean of the entries of $\vec{r}_i$, and $\rho_{i,j}$ is the cosine of the angle between the two centered vectors $\vec{r}_i - \bar{r}_i$ and $\vec{r}_j - \bar{r}_j$.
7. The method according to claim 6, wherein the minimum-correlation clustering at least comprises:
A1. randomly assign each word in topic T to a cluster;
B1. for each word, compute the Pearson correlation coefficient between the word's vector and the average vector of each cluster (excluding the word itself), and select the cluster with the lowest Pearson correlation coefficient as the cluster to which the word belongs;
C1. iteratively execute step B1 until the change in the correlation coefficient in each iteration is below the threshold.
8. A topic model-based automatic synonym discovery system using the method of claim 1, comprising a database storing natural language processing information, and further comprising at least:
the data import module is used for importing data of synonyms to be found;
the word segmentation processing module is used for carrying out word segmentation processing on the imported data according to the information of the database;
the topic model clustering module is used for constructing a topic model and clustering the topic model;
the minimum correlation clustering module is used for performing minimum correlation clustering on the theme clusters;
and the synonym output module is used for outputting synonym data.
CN201710492902.5A 2017-06-26 2017-06-26 Synonym automatic discovering method and its system based on topic model Withdrawn CN109117436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710492902.5A CN109117436A (en) 2017-06-26 2017-06-26 Synonym automatic discovering method and its system based on topic model

Publications (1)

Publication Number Publication Date
CN109117436A true CN109117436A (en) 2019-01-01

Family

ID=64733933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710492902.5A Withdrawn CN109117436A (en) 2017-06-26 2017-06-26 Synonym automatic discovering method and its system based on topic model

Country Status (1)

Country Link
CN (1) CN109117436A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991168A (en) * 2019-12-05 2020-04-10 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
WO2021109787A1 (en) * 2019-12-05 2021-06-10 京东方科技集团股份有限公司 Synonym mining method, synonym dictionary application method, medical synonym mining method, medical synonym dictionary application method, synonym mining apparatus and storage medium
US11977838B2 (en) 2019-12-05 2024-05-07 Boe Technology Group Co., Ltd. Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium
CN110991168B (en) * 2019-12-05 2024-05-17 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
CN111898366A (en) * 2020-07-29 2020-11-06 平安科技(深圳)有限公司 Document subject word aggregation method and device, computer equipment and readable storage medium
CN111898366B (en) * 2020-07-29 2022-08-09 平安科技(深圳)有限公司 Document subject word aggregation method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190101