CN111813935B - Multi-source text clustering method based on hierarchical dirichlet allocation model - Google Patents
- Publication number
- CN111813935B (application CN202010570969.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- source
- topic
- model
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-source text clustering method based on a hierarchical Dirichlet allocation model, comprising the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameters; 5. clustering the texts according to the sampling result. The invention improves the clustering effect on multi-source text by updating the prior parameter of the topic-word distribution of the multi-source text; the constructed model can automatically determine the number of clusters in the text of each data source without it being given manually in advance, which greatly improves the clustering effect on multi-source text.
Description
Technical Field
The invention relates to a text clustering method, in particular to a multi-source text clustering method based on a hierarchical Dirichlet allocation model, and belongs to the technical fields of machine learning and natural language processing.
Background
With the rapid development of information technology, people have more and more ways to obtain information, especially text information. Text information comes from different sources, and its characteristics are inconsistent across them. Topic information and text structure information can be mined from multi-source text datasets, which is highly desirable in many scenarios. For example, mining text information from sources such as news websites, forums, and social media can reveal hot topics of social interest; in addition, sudden traffic accidents can be discovered by analyzing traffic information from sources such as citizen hotlines and traffic bulletin boards. Therefore, it is necessary to develop a topic model for multi-source text datasets and mine the information they contain.
Mining the text information of a multi-source text dataset with a traditional topic model presents several difficulties: 1) The word distributions of the topics of the multiple data sources are similar but not identical. For example, articles on news websites tend to describe a topic in standard terms, while the wording of social media documents is more arbitrary. Directly applying a traditional topic model to the clustering of multi-source documents is therefore not feasible, because differences in writing style across sources seriously degrade clustering performance. 2) Estimating the cluster number K is also difficult for multi-source document clustering. Most conventional document clustering methods treat K as a parameter determined by the user in advance, but providing the correct value of K beforehand is difficult and impractical. Furthermore, K often differs between data sources, which greatly increases the difficulty of estimating it correctly. An improper K can mislead the clustering process and reduce document clustering performance, so it is useful if a multi-source document clustering method can automatically learn the number of clusters K of each data source. 3) Traditional document clustering methods do not account for the fact that the topic distribution differs between data sources. For example, most topics in "Newsweek" focus on news categories including "political news", "technology news", and "business news", while articles in the "Wall Street Journal" are more related to "economic news". The difference in topic proportions between data sources also explains why the topic number K differs between them. Thus, automatically discovering source-level topic proportions facilitates accurately discovering the document structure of multi-source documents.
Therefore, in order to solve the above three problems, a new clustering method for multi-source text data is needed to obtain a more ideal clustering effect.
Disclosure of Invention
The technical problem the invention aims to solve is: providing a multi-source text clustering method based on a hierarchical Dirichlet allocation model, in which the HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance, thereby effectively solving the above problems.
The technical scheme of the invention is as follows: a multi-source text clustering method based on a hierarchical Dirichlet allocation model, the method comprising the steps of: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result.
In the second step, the preprocessing comprises word segmentation and the removal of stop words, low-frequency words, punctuation, and numbers.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k)
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α)
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s)
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s)
The fourth step comprises the following specific steps: first, the model parameters are initialized; the parameters to be initialized comprise the hyperparameters {α, μ, σ²} and the hidden variables {β, z}. After initialization, Blocked Gibbs sampling is performed; when the sampling result stabilizes, the Dirichlet parameter β_k that generates the topic-word distribution parameter is updated, and the Blocked Gibbs sampling process is repeated.
The inference process of Blocked Gibbs sampling is as follows:
For each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s;
2) Update the topic distribution θ_s;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}.
In the fifth step, a clustering result is obtained according to the final sampling in the fourth step.
The beneficial effects of the invention are as follows: compared with the prior art, the technical scheme improves the clustering effect on multi-source text by updating the prior parameters of the topic-word distribution of the multi-source text, and the constructed model can automatically determine the number of clusters in the text of each data source without manual specification in advance. The multiple sources share a prior parameter that generates similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, namely the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
Drawings
FIG. 1 is a flow chart of the present invention;
Fig. 2 is the topic model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings of the present specification.
Example 1: As shown in Figs. 1-2, a multi-source text clustering method based on a hierarchical Dirichlet allocation model comprises the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result.
To execute the method of the invention, step one is performed first: the text sets from the multiple data sources to be clustered are obtained.
The first multi-source text set of this embodiment is the HuffAmaSet data. The dataset contains 10000 texts, of which 5000 are news articles collected from the HuffPost website (denoted HnewSet below) and the other 5000 are comment texts collected from the Amazon website (denoted ASet). The dataset contains two topics, "food" and "sport". The second text set of this embodiment is BBCTset. This dataset also contains 10000 texts, of which 5000 are news articles collected from the BBC website and the other 5000 are articles collected from Twitter. The dataset contains three topics, "business", "sport" and "politics".
Second, step two is executed: text preprocessing is performed on the acquired text set, including word segmentation, stop-word removal, and low-frequency-word removal. After preprocessing, redundant information in the text is removed, making the text set concise and tidy, saving resources, and facilitating computation.
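As an illustrative sketch (not part of the patent), the preprocessing of step two can be implemented as follows; the function name, the stop-word set, and the frequency threshold are assumptions of this example:

```python
import re
from collections import Counter

def preprocess(docs, stopwords, min_freq=2):
    """Tokenize, drop stop words, and prune low-frequency words (illustrative)."""
    # Keep alphabetic tokens only, which also removes punctuation and numbers
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    tokenized = [[w for w in doc if w not in stopwords] for doc in tokenized]
    # Count corpus-wide frequencies and drop rare words
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if freq[w] >= min_freq] for doc in tokenized]
```

For Chinese text, the tokenization line would be replaced by a word-segmentation step, since words are not whitespace-delimited.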
Third, after the text set is processed, the modeling of step three is performed: a probabilistic topic model for multi-source text clustering is established based on the hierarchical Dirichlet allocation model. The model can automatically determine the number of clusters in the text set of each data source without manual specification in advance, and the text set of each data source in the multi-source dataset has its own topic-word distribution and topic distribution.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k)
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α)
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s)
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s)
wherein α denotes the parameter of the Dirichlet distribution, a vector whose dimension equals the number of topics; μ and σ² denote the parameters of the normal distribution; β_k denotes the Dirichlet parameter that generates the word distribution of topic k, whose dimension equals the vocabulary size of the corpus; θ_s denotes the topic distribution of data source s in the multi-source text set; φ_s denotes the topic-word distribution of data source s in the multi-source text set; z_d denotes the topic sampled from θ_s for the d-th text; w_d denotes the d-th text of data source s in the multi-source text set; M_s denotes the total number of texts of data source s in the multi-source text set; and K denotes the total number of topics at initialization.
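A minimal numerical sketch of the above generation process (illustrative only, not the patent's implementation; the exp transform that maps the normally distributed β_{k,i} to positive values is an assumption of this example, since Dirichlet parameters must be positive):

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, V = 3, 2, 30                 # topics, data sources, vocabulary size (example values)
mu, sigma2 = 0.0, 1.0
alpha = np.ones(K)

# 1) For each topic k: beta_{k,i} ~ N(mu, sigma^2); exp() keeps parameters positive (assumption)
beta = np.exp(rng.normal(mu, np.sqrt(sigma2), size=(K, V)))
# For each source s: phi_{s,k} ~ Dirichlet(beta_k) -- similar but not identical across sources
phi = np.stack([[rng.dirichlet(beta[k]) for k in range(K)] for _ in range(S)])

# 2) For each source s: theta_s ~ Dirichlet(alpha)
theta = np.stack([rng.dirichlet(alpha) for _ in range(S)])

def generate_doc(s, n_words=20):
    """Draw one document-level topic z_d, then n_words words from phi_{s, z_d}."""
    z_d = rng.choice(K, p=theta[s])
    words = rng.choice(V, size=n_words, p=phi[s, z_d])
    return z_d, words
```

Because every source draws its φ_{s,k} from the same shared β_k, the per-source topic-word distributions are related but distinct, which is the property the model exploits.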
Fourth, based on the above model, step four of the invention is performed. In this step, the parameters are estimated using the Blocked Gibbs sampling method. The state of the Markov chain is denoted U = {β, θ, φ, Z}.
First, the model parameters are initialized; the parameters to be initialized comprise the hyperparameters {α, μ, σ²} and the hidden variables {β, z}. After initializing the model parameters, the inference process of Blocked Gibbs sampling is as follows:
3) Update the topic-word distribution φ_s of each data source text set in the multi-source dataset. For k = {1, 2, …, K}: if topic k is not assigned to any document, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameter β_k + n_k, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k.
4) Update the topic distribution θ_s of each data source text set in the multi-source dataset. Sample the topic distribution from a Dirichlet distribution whose k-th parameter is α_k + Σ_{l=1}^{M_s} I(z_l = k), where I(z_l = k) is an indicator function: I(z_l = k) = 1 when z_l = k, and 0 otherwise.
5) Sample and update the topic z_d of each text from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}}, where p_{d,k} ∝ θ_{s,k} ∏_w φ_{s,k,w}^{n_{d,w}} and n_{d,w} is the number of occurrences of word w in document d.
It should be noted that if the number of clusters estimated by the model is K*, this value is the number of distinct topics appearing in the vector z = (z_1, z_2, …, z_{M_s}) and is smaller than the initialized value K.
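The per-text topic update of step 5) can be sketched as follows (an illustrative sketch, not the patent's code; the log-space computation and the variable names are assumptions of this example):

```python
import numpy as np

def sample_doc_topics(doc_counts, phi_s, theta_s, rng):
    """Blocked update of z_d for every document of one source s.
    doc_counts: M_s x V word-count matrix; phi_s: K x V; theta_s: length-K."""
    log_phi = np.log(np.clip(phi_s, 1e-300, None))
    z = np.empty(doc_counts.shape[0], dtype=int)
    for d, n_d in enumerate(doc_counts):
        # log p_{d,k} = log theta_{s,k} + sum_w n_{d,w} * log phi_{s,k,w}
        logp = np.log(theta_s) + log_phi @ n_d
        p = np.exp(logp - logp.max())      # subtract max for numerical stability
        z[d] = rng.choice(len(theta_s), p=p / p.sum())
    return z
```

Working in log space avoids the underflow that the product ∏_w φ^{n_{d,w}} would otherwise cause for long documents.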
In step four, an update operation of the parameter β is required. When the sampling result stabilizes, β is updated by optimizing the posterior probability of generating the whole dataset, where n^{k}_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
Finally, step five of the invention is executed to perform text clustering: the topic assignment of each text is obtained from the final sampling result, and the texts are clustered accordingly.
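The clustering of step five reduces to grouping texts by their final sampled topic label; a minimal sketch (the function and variable names are assumptions of this example):

```python
from collections import defaultdict

def clusters_from_topics(topic_of_doc, doc_ids):
    """Group document ids by their final sampled topic label z_d."""
    groups = defaultdict(list)
    for doc_id, k in zip(doc_ids, topic_of_doc):
        groups[k].append(doc_id)
    return dict(groups)
```

The number of non-empty groups returned for a source is its estimated cluster number K*, which is how the method reports a per-source cluster count without it being specified in advance.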
The symbol descriptions in this example are shown in Table 1.
TABLE 1
The clustering effect on multi-source text is improved by updating the prior parameters of the topic-word distribution of the multi-source text, and the constructed model can automatically determine the number of clusters in the text of each data source without manual specification in advance. The multiple sources share a prior parameter that generates similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, namely the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
Matters not described in detail in this application are well known to those skilled in the art. Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.
Claims (2)
1. A multi-source text clustering method based on a hierarchical Dirichlet allocation model, characterized in that the method comprises the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result;
In the second step, the preprocessing comprises word segmentation and the removal of stop words, low-frequency words, punctuation, and numbers;
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V;
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k);
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α);
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s);
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s);
In the fourth step, based on the topic model constructed in the third step, the feature word distribution, the noise word distribution and the topic distribution of each data source in the multi-source dataset are sampled using the Blocked Gibbs sampling algorithm; when the sampling result stabilizes, the Dirichlet parameter β that generates the topic-word distribution parameters is updated, and the Blocked Gibbs sampling process is repeated;
The inference process of Blocked Gibbs sampling is as follows:
For each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s: for k = {1, 2, …, K}, if topic k is not assigned to any document, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameter β_k + n_k, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k;
2) Update the topic distribution θ_s: sample the topic distribution from a Dirichlet distribution whose k-th parameter is α_k + Σ_{l=1}^{M_s} I(z_l = k), where I(z_l = k) is an indicator function with I(z_l = k) = 1 when z_l = k and 0 otherwise;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}, by sampling from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}} with p_{d,k} ∝ θ_{s,k} ∏_w φ_{s,k,w}^{n_{d,w}}, where n_{d,w} is the number of occurrences of word w in document d;
In the fourth step, an update operation of the parameter β is required: after the sampling result stabilizes, β is updated by optimizing the posterior probability of generating the whole dataset, where n^{k}_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
2. The multi-source text clustering method based on the hierarchical Dirichlet allocation model according to claim 1, characterized in that: in the fifth step, a clustering result is obtained according to the final sampling of the fourth step.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010570969.8A CN111813935B (en) | 2020-06-22 | 2020-06-22 | Multi-source text clustering method based on hierarchical dirichlet allocation model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813935A CN111813935A (en) | 2020-10-23 |
CN111813935B true CN111813935B (en) | 2024-04-30 |
Family
ID=72846295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010570969.8A Active CN111813935B (en) | 2020-06-22 | 2020-06-22 | Multi-source text clustering method based on hierarchical dirichlet allocation model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111813935B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107798043A (en) * | 2017-06-28 | 2018-03-13 | 贵州大学 | The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays |
CN109829151A (en) * | 2018-11-27 | 2019-05-31 | 国网浙江省电力有限公司 | A kind of text segmenting method based on layering Di Li Cray model |
CN110046228A (en) * | 2019-04-18 | 2019-07-23 | 合肥工业大学 | Short text subject identifying method and system |
CN110263153A (en) * | 2019-05-15 | 2019-09-20 | 北京邮电大学 | Mixing text topic towards multi-source information finds method |
KR20200026351A (en) * | 2018-08-29 | 2020-03-11 | 동국대학교 산학협력단 | Device and method for topic analysis using an enhanced latent dirichlet allocation model |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8510257B2 (en) * | 2010-10-19 | 2013-08-13 | Xerox Corporation | Collapsed gibbs sampler for sparse topic models and discrete matrix factorization |
US8527448B2 (en) * | 2011-12-16 | 2013-09-03 | Huawei Technologies Co., Ltd. | System, method and apparatus for increasing speed of hierarchial latent dirichlet allocation model |
CN107133238A (en) * | 2016-02-29 | 2017-09-05 | 阿里巴巴集团控股有限公司 | A kind of text message clustering method and text message clustering system |
Non-Patent Citations (2)
Title |
---|
A text clustering algorithm based on the Dirichlet process mixture model; Gao Yue; Wang Wenxian; Yang Shuxian; Netinfo Security (11); full text * |
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; Xu Liyang et al.; Journal of Computer Applications; Vol. 38, No. 11; pp. 3094-3099 * |
Also Published As
Publication number | Publication date |
---|---|
CN111813935A (en) | 2020-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||