CN111813935B - Multi-source text clustering method based on hierarchical dirichlet allocation model - Google Patents


Info

Publication number
CN111813935B
CN111813935B (application CN202010570969.8A)
Authority
CN
China
Prior art keywords
text
source
topic
model
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010570969.8A
Other languages
Chinese (zh)
Other versions
CN111813935A (en
Inventor
黄瑞章
许伟佳
秦永彬
陈艳平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202010570969.8A priority Critical patent/CN111813935B/en
Publication of CN111813935A publication Critical patent/CN111813935A/en
Application granted granted Critical
Publication of CN111813935B publication Critical patent/CN111813935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source text clustering method based on a hierarchical Dirichlet allocation model, which comprises the following steps: 1. collecting a text set from a plurality of sources; 2. performing text preprocessing on the text information from the plurality of data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating parameters; 5. performing text clustering according to the sampling result. The invention improves the clustering effect on multi-source text by updating the prior parameter of the topic-word distribution of the multi-source text; the constructed model can automatically determine the number of clusters in the text of each data source without it being given manually in advance, and can greatly improve the clustering effect on multi-source text.

Description

Multi-source text clustering method based on hierarchical dirichlet allocation model
Technical Field
The invention relates to a text clustering method, in particular to a multi-source text clustering method based on a hierarchical Dirichlet allocation model, and belongs to the technical fields of machine learning and natural language processing.
Background
With the rapid development of information technology, people have more and more ways to obtain information, especially text information. Text information comes from different sources, and its characteristics are inconsistent across them. Topic information as well as text structure information can be mined from multi-source text datasets, which is highly desirable in many scenarios. For example, mining text information from sources such as news websites, forums, and social media can help us learn about hot topics of social interest; in addition, sudden traffic incidents can be detected by analyzing traffic information from sources such as citizen hotlines and traffic bulletin boards. Therefore, it is necessary to develop a topic model for multi-source text datasets and to mine the information in them.
Mining the text information of a multi-source text dataset with a traditional topic model presents many difficulties, such as: 1) The word distributions of the topics across data sources are similar but not identical. For example, articles on news websites tend to describe a topic in standard terms, while the wording of social media documents is more arbitrary. Directly applying a traditional topic model to cluster multi-source documents is therefore not feasible, because differences in the writing style of topics from different sources seriously degrade clustering performance. 2) Estimating the cluster number K is also difficult for multi-source document clustering. Most conventional document clustering methods treat K as a parameter determined by the user in advance, but providing the correct value of K before running is difficult and impractical. Furthermore, K often differs across data sources, which greatly increases the difficulty of estimating it correctly. An improper value of K can mislead the clustering process and reduce document clustering performance. It would therefore be useful if a multi-source document clustering method could automatically learn the number of clusters K of each data source. 3) The topic distribution of each data source is different, which traditional document clustering methods do not account for. For example, most topics in "Newsweek" fall into news categories such as "political news", "technology news", and "business news", while news articles in "The Wall Street Journal" are more related to "economic news". This difference in the topic proportions of each data source also explains why the topic number K differs per source. Thus, automatically discovering source-level topic proportions facilitates accurately discovering the document structure of multi-source documents.
Therefore, in order to solve the above three problems, a new clustering method for multi-source text data is needed to obtain a more ideal clustering effect.
Disclosure of Invention
The invention aims to solve the following technical problem: a multi-source text clustering method based on a hierarchical Dirichlet allocation model is provided, in which the HDMA model is developed using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance, thereby effectively solving the above problems.
The technical scheme of the invention is as follows: a multi-source text clustering method based on a hierarchical Dirichlet allocation model, the method comprising the following steps: 1. collecting a text set from a plurality of sources; 2. performing text preprocessing on the text information from the plurality of data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. performing text clustering according to the sampling result.
In the second step, the preprocessing comprises word segmentation and removal of stop words, low-frequency words, punctuation, and numbers.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Draw β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V;
B. For each data source s: draw φ_{s,k} ~ Dirichlet(β_k);
2) For each data source s:
C. Draw θ_s ~ Dirichlet(α);
D. For each document d in data source s: draw z_d ~ Multinomial(θ_s);
E. For each word w_i in document d: draw w_i ~ Multinomial(w_i | z_d, φ_s).
The fourth step specifically comprises: first, the model parameters are initialized, where the parameters to be initialized include the hyperparameters {α, μ, σ²} and the hidden variables {β, z}; after the model parameters are initialized, Blocked Gibbs sampling is performed, and when the sampling result becomes stable, the Dirichlet parameter β_k generating the topic-word distribution is updated and the Blocked Gibbs sampling process is repeated.
The inference process of the Blocked Gibbs sampling is as follows:
For each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s;
2) Update the topic distribution θ_s;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}.
In the fifth step, a clustering result is obtained according to the final sampling in the fourth step.
The beneficial effects of the invention are as follows: compared with the prior art, by adopting the above technical scheme, the clustering effect on multi-source text is improved by updating the prior parameter of the topic-word distribution of the multi-source text; the constructed model can automatically determine the number of clusters in the text of each data source without it being given manually in advance. The multi-source texts share a prior parameter that can generate similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; and since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source of the multi-source text. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
Drawings
FIG. 1 is a flow chart of the present invention;
Fig. 2 is the topic model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings of the present specification.
Example 1: as shown in FIGS. 1-2, a multi-source text clustering method based on a hierarchical Dirichlet allocation model comprises the following steps: 1. collecting a text set from a plurality of sources; 2. performing text preprocessing on the text information from the plurality of data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. performing text clustering according to the sampling result.
To carry out the method of the invention, step one is first executed to obtain the text sets from the plurality of data sources to be clustered.
The first multi-source text set of this embodiment is the HuffAmaSet data. The dataset contains 10000 texts, of which 5000 are news articles collected from the HuffPost website (denoted HnewSet below) and the other 5000 are comment texts collected from the Amazon website (denoted ASet). The dataset covers two topics, "food" and "sport". The second text set of this embodiment is BBCTset, which also contains 10000 texts: 5000 news articles collected from the BBC website and another 5000 articles collected from Twitter. The dataset covers three topics, "business", "sport", and "politics".
Second, step two is executed to perform text preprocessing on the acquired text set: word segmentation, stop-word removal, and low-frequency-word removal are applied. After preprocessing, redundant information in the text is removed, leaving the text set concise and tidy, saving resources, and making computation convenient.
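The preprocessing described above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: the function name, the stop-word list, and the `min_count` threshold are assumptions chosen for the example, and the regex-based tokenizer is shown for English text (a Chinese corpus would need a dedicated word segmenter).

```python
import re
from collections import Counter

def preprocess(docs, stopwords, min_count=2):
    """Segment into word tokens, drop stop words/punctuation/numbers,
    then drop words occurring fewer than min_count times in the corpus."""
    tokenized = []
    for doc in docs:
        # Keeping alphabetic tokens only also discards punctuation and digits.
        tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", doc)]
        tokenized.append([t for t in tokens if t not in stopwords])
    freq = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in tokenized]

docs = ["The match was great!", "A great match, 2-1.", "Stocks rose 3% today."]
clean = preprocess(docs, stopwords={"the", "a", "was"})
# clean == [["match", "great"], ["great", "match"], []]
```

Note how the third toy document empties out entirely: every word in it appears only once in the corpus, illustrating the low-frequency filter.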
Third, after the text set is processed, the modeling of step three is performed, and a probabilistic topic model for multi-source text clustering is established based on the hierarchical Dirichlet allocation model. The model can automatically determine the number of clusters in the text set of each data source without requiring it to be given manually in advance. The text set of each data source in the multi-source text dataset has its own topic-word distribution and topic distribution.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Draw β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V;
B. For each data source s: draw φ_{s,k} ~ Dirichlet(β_k);
2) For each data source s:
C. Draw θ_s ~ Dirichlet(α);
D. For each document d in data source s: draw z_d ~ Multinomial(θ_s);
E. For each word w_i in document d: draw w_i ~ Multinomial(w_i | z_d, φ_s).
Wherein α denotes the parameter of the Dirichlet distribution, a vector whose dimension equals the number of topics; μ and σ² denote the parameters of the normal distribution; β_k denotes the Dirichlet parameter generating the word distribution of topic k, whose dimension equals the vocabulary size of the corpus; θ_s denotes the topic distribution of data source s in the multi-source text set; φ_s denotes the topic-word distribution of data source s in the multi-source text set; z_d denotes the topic sampled from θ_s for text d; d_{s,m} denotes the m-th text of data source s in the multi-source text set; M_s denotes the total number of texts of data source s in the multi-source text set; K denotes the total number of topics at initialization.
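Under the definitions above, the two-step generative process can be sketched as follows. This is a hedged illustration, not the patent's code: all dimensions are toy values, and the use of `exp(beta_k)` to keep the Dirichlet parameters positive is an implementation assumption, since β_k is drawn from a normal distribution while a Dirichlet parameter must be strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, S = 6, 3, 2          # vocabulary size, topics, data sources (toy values)
M, N = 4, 8                # documents per source, words per document
mu, sigma2 = 0.0, 1.0      # normal prior on beta
alpha = np.ones(K)         # Dirichlet prior on the source-level topic distribution

# 1A) Per-topic base parameters beta_k, shared by all data sources.
beta = rng.normal(mu, np.sqrt(sigma2), size=(K, V))
# 1B) Per-source topic-word distributions phi_{s,k} ~ Dirichlet(exp(beta_k));
#     sharing beta_k makes the sources' topics similar but not identical.
phi = np.array([[rng.dirichlet(np.exp(beta[k])) for k in range(K)]
                for s in range(S)])

corpus, topics = [], []
for s in range(S):
    theta_s = rng.dirichlet(alpha)                    # 2C) topic distribution of source s
    for d in range(M):
        z_d = int(rng.choice(K, p=theta_s))           # 2D) one topic per document
        words = rng.choice(V, size=N, p=phi[s, z_d])  # 2E) words drawn from phi_{s, z_d}
        corpus.append(words)
        topics.append(z_d)
```

Because every source draws its φ_{s,k} from the same β_k, the word distributions of a topic agree across sources in their general shape while still differing locally, which is the property the two-step process is designed to capture.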
Fourth, based on the above model, step four of the present invention is performed. In this step, the parameters are estimated using the Blocked Gibbs sampling method. The state of the Markov chain is denoted U = {β, θ, φ, z}.
First, the model parameters are initialized; the parameters to be initialized include the hyperparameters {α, μ, σ²} and the hidden variables {β, z}. After the model parameters are initialized, the inference process of the Blocked Gibbs sampling is as follows:
1) Update the topic-word distribution of the text set of each data source in the multi-source dataset. For k = {1, 2, …, K}: if no document is currently assigned to topic k, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameters β_{k,w} + n_{k,w}, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k.
2) Update the topic distribution θ_s of the text set of each data source in the multi-source dataset. Sample the topic distribution from the Dirichlet distribution with parameters α_k + Σ_l I(z_l = k), where I(z_l = k) is an indicator function: I(z_l = k) = 1 when z_l = k, and 0 otherwise.
3) Sample and update the topic z_d of each text from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}}, where p_{d,k} ∝ θ_{s,k} Π_w φ_{k,w}^{n_{d,w}} and n_{d,w} is the count of word w in text d.
It should be noted that, if the number of clusters estimated by the model is denoted K*, i.e., the number of distinct topics appearing in the vector z, then K* is no larger than the initialized value of K.
In step four, an update operation on the parameter β is required. When the sampling result becomes stable, β is updated by optimizing the posterior probability of generating the whole dataset. The update formula of β_{k,w} is as follows:
wherein n^k_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
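One sweep of the Blocked Gibbs updates for a single data source can be sketched as follows. This is a toy, single-source illustration using standard Dirichlet-multinomial mixture updates; the document-count matrix, the number of sweeps, and the flat priors are assumptions chosen for the example, and the β update of step four is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K = 6, 3                          # vocabulary size, initial topic number
# Toy single-source corpus: each row is one document's word-count vector n_{d,w}.
counts = np.array([[4, 3, 0, 0, 1, 0],
                   [5, 2, 1, 0, 0, 0],
                   [0, 0, 4, 5, 1, 0],
                   [0, 1, 3, 6, 0, 0]])
M = counts.shape[0]
alpha = np.ones(K)                   # flat Dirichlet prior on the topic distribution
beta = np.ones((K, V))               # flat Dirichlet prior on topic-word distributions
z = rng.integers(K, size=M)          # random initial topic assignments

for sweep in range(50):
    # 1) phi_k ~ Dirichlet(beta_k + word counts of the documents with z_d = k);
    #    an empty topic falls back to the prior Dirichlet(beta_k), since the
    #    summed counts are then all zero.
    phi = np.array([rng.dirichlet(beta[k] + counts[z == k].sum(axis=0))
                    for k in range(K)])
    # 2) theta ~ Dirichlet(alpha_k + number of documents with z_d = k)
    theta = rng.dirichlet(alpha + np.bincount(z, minlength=K))
    # 3) z_d ~ Discrete(p_d) with log p_{d,k} = log theta_k + sum_w n_{d,w} log phi_{k,w}
    logp = np.log(theta) + counts @ np.log(phi).T
    p = np.exp(logp - logp.max(axis=1, keepdims=True))   # stabilized softmax
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p[d]) for d in range(M)])
```

In the full model these updates run once per data source s with a shared β, and β itself is re-optimized after the chain stabilizes, as described above.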
Finally, step five of the invention is executed to perform text clustering: according to the final sampling result, each text is assigned to the cluster corresponding to its sampled topic.
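The number of clusters K* then emerges from the final sample as the number of distinct topics actually in use, so no cluster count needs to be given in advance. A minimal sketch (the assignment vector below is a made-up example, not real sampler output):

```python
import numpy as np

# Final topic assignments of one data source from the sampler (made-up example).
z_final = np.array([0, 0, 2, 2, 2, 0])
K_init = 5                 # topic number K used to initialize the model

# Each distinct topic value in z_final is one cluster.
clusters = {int(k): np.flatnonzero(z_final == k).tolist()
            for k in np.unique(z_final)}
K_star = len(clusters)     # K* = number of clusters actually discovered (K* <= K_init)
```

Here the six documents fall into two clusters even though the model was initialized with five topics, illustrating how the cluster number is inferred rather than specified.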
The symbol descriptions in this example are shown in Table 1.
TABLE 1
By updating the prior parameter of the topic-word distribution of the multi-source text, the clustering effect on multi-source text is improved; the constructed model can automatically determine the number of clusters in the text of each data source without it being given manually in advance. The multi-source texts share a prior parameter that can generate similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; and since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source of the multi-source text. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
What is not described in detail in the present application is well known to those skilled in the art. Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.

Claims (2)

1. A multi-source text clustering method based on a hierarchical Dirichlet allocation model, characterized by comprising the following steps: 1. collecting a text set from a plurality of sources; 2. performing text preprocessing on the text information from the plurality of data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. performing text clustering according to the sampling result;
In the second step, the preprocessing comprises word segmentation and removal of stop words, low-frequency words, punctuation, and numbers;
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Draw β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V;
B. For each data source s: draw φ_{s,k} ~ Dirichlet(β_k);
2) For each data source s:
C. Draw θ_s ~ Dirichlet(α);
D. For each document d in data source s: draw z_d ~ Multinomial(θ_s);
E. For each word w_i in document d: draw w_i ~ Multinomial(w_i | z_d, φ_s);
In the fourth step, based on the topic model constructed in the third step, the topic-word distribution and topic distribution of each data source in the multi-source dataset are sampled using the Blocked Gibbs sampling algorithm, and when the sampling result becomes stable, the Dirichlet parameter β generating the topic-word distribution is updated and the Blocked Gibbs sampling process is repeated;
The inference process of the Blocked Gibbs sampling is as follows:
for each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s: for k = {1, 2, …, K}, if no document is currently assigned to topic k, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameters β_{k,w} + n_{k,w}, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k;
2) Update the topic distribution θ_s: sample the topic distribution from the Dirichlet distribution with parameters α_k + Σ_l I(z_l = k), where I(z_l = k) is an indicator function equal to 1 when z_l = k and 0 otherwise;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}, by sampling from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}}, in which p_{d,k} ∝ θ_{s,k} Π_w φ_{k,w}^{n_{d,w}};
in the fourth step, an update operation on the parameter β is required; after the sampling result becomes stable, β is updated by optimizing the posterior probability of generating the whole dataset; the update formula of β_{k,w} is as follows:
wherein n^k_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
2. The multi-source text clustering method based on the hierarchical dirichlet allocation model according to claim 1, wherein: in the fifth step, a clustering result is obtained according to the final sampling in the fourth step.
CN202010570969.8A 2020-06-22 2020-06-22 Multi-source text clustering method based on hierarchical dirichlet allocation model Active CN111813935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010570969.8A CN111813935B (en) 2020-06-22 2020-06-22 Multi-source text clustering method based on hierarchical dirichlet allocation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010570969.8A CN111813935B (en) 2020-06-22 2020-06-22 Multi-source text clustering method based on hierarchical dirichlet allocation model

Publications (2)

Publication Number Publication Date
CN111813935A CN111813935A (en) 2020-10-23
CN111813935B true CN111813935B (en) 2024-04-30

Family

ID=72846295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010570969.8A Active CN111813935B (en) 2020-06-22 2020-06-22 Multi-source text clustering method based on hierarchical dirichlet allocation model

Country Status (1)

Country Link
CN (1) CN111813935B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798043A (en) * 2017-06-28 2018-03-13 贵州大学 Text clustering method for short texts assisted by long texts, based on the Dirichlet multinomial mixture model
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A text segmentation method based on the hierarchical Dirichlet model
CN110046228A (en) * 2019-04-18 2019-07-23 合肥工业大学 Short text topic identification method and system
CN110263153A (en) * 2019-05-15 2019-09-20 北京邮电大学 Mixed text topic discovery method for multi-source information
KR20200026351A (en) * 2018-08-29 2020-03-11 동국대학교 산학협력단 Device and method for topic analysis using an enhanced latent dirichlet allocation model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8510257B2 (en) * 2010-10-19 2013-08-13 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
US8527448B2 (en) * 2011-12-16 2013-09-03 Huawei Technologies Co., Ltd. System, method and apparatus for increasing speed of hierarchial latent dirichlet allocation model
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of text message clustering method and text message clustering system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798043A (en) * 2017-06-28 2018-03-13 贵州大学 Text clustering method for short texts assisted by long texts, based on the Dirichlet multinomial mixture model
KR20200026351A (en) * 2018-08-29 2020-03-11 동국대학교 산학협력단 Device and method for topic analysis using an enhanced latent dirichlet allocation model
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A text segmentation method based on the hierarchical Dirichlet model
CN110046228A (en) * 2019-04-18 2019-07-23 合肥工业大学 Short text topic identification method and system
CN110263153A (en) * 2019-05-15 2019-09-20 北京邮电大学 Mixed text topic discovery method for multi-source information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A text clustering algorithm based on the Dirichlet process mixture model; 高悦, 王文贤, 杨淑贤; 信息网络安全 (11); full text *
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; 徐立洋 et al.; 《计算机应用》; Vol. 38, No. 11; 3094-3099 *

Also Published As

Publication number Publication date
CN111813935A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN106383877B (en) Social media online short text clustering and topic detection method
CN109815336B (en) Text aggregation method and system
CN102298576B (en) Method and device for generating document keywords
CN111832289B (en) Service discovery method based on clustering and Gaussian LDA
CN107798043B (en) Text clustering method for long text auxiliary short text based on Dirichlet multinomial mixed model
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN104573070B (en) A kind of Text Clustering Method for mixing length text set
CN107066555A (en) Towards the online topic detection method of professional domain
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN109902290B (en) Text information-based term extraction method, system and equipment
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
CN107391565B (en) Matching method of cross-language hierarchical classification system based on topic model
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN110705272A (en) Named entity identification method for automobile engine fault diagnosis
CN111967267A (en) XLNET-based news text region extraction method and system
CN111813935B (en) Multi-source text clustering method based on hierarchical dirichlet allocation model
CN111858860B (en) Search information processing method and system, server and computer readable medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN111339287B (en) Abstract generation method and device
CN110377845B (en) Collaborative filtering recommendation method based on interval semi-supervised LDA
CN111813934B (en) Multi-source text topic model clustering method based on DMA model and feature division
CN114528378A (en) Text classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant