CN111813935B - Multi-source text clustering method based on hierarchical dirichlet allocation model - Google Patents
- Publication number
- CN111813935B (application CN202010570969.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- source
- topic
- model
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-source text clustering method based on a hierarchical Dirichlet allocation model, comprising the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameters; 5. clustering the texts according to the sampling result. The invention improves the clustering effect on multi-source text by updating the prior parameter of the topic-word distribution of the multi-source text; the constructed model can automatically determine the number of clusters in the text of each data source without it being given manually in advance, which greatly improves the clustering effect on multi-source text.
Description
Technical Field
The invention relates to a text clustering method, in particular to a multi-source text clustering method based on a hierarchical Dirichlet allocation model, and belongs to the technical fields of machine learning and natural language processing.
Background
With the rapid development of information technology, people have more and more ways to obtain information, especially text information. Text information comes from different sources, and its characteristics are inconsistent across them. Topic information and text structure information can be mined from multi-source text datasets, which is highly desirable in many scenarios. For example, mining text information from sources such as news websites, forums, and social media can reveal hot topics of social interest; in addition, sudden traffic accidents can be discovered by analyzing traffic information from sources such as citizen hotlines and traffic bulletin boards. Therefore, it is necessary to develop a topic model for multi-source text datasets and mine the information they contain.
Mining the text information of a multi-source text dataset with a traditional topic model presents several difficulties: 1) The word distributions of the topics of the multiple data sources are similar but not identical. For example, articles on news websites tend to describe a topic in standard terms, while the wording of social media documents is more arbitrary. Directly applying a traditional topic model to the clustering of multi-source documents is therefore not feasible, because differences in writing style across sources seriously degrade clustering performance. 2) Estimating the cluster number K is also difficult for multi-source document clustering. Most conventional document clustering methods treat K as a parameter determined by the user in advance, but providing the correct value of K beforehand is difficult and impractical. Furthermore, K often differs between data sources, which greatly increases the difficulty of estimating it correctly. An improper K can mislead the clustering process and reduce document clustering performance, so it is useful if a multi-source document clustering method can automatically learn the number of clusters K of each data source. 3) Traditional document clustering methods do not account for the fact that the topic distribution differs between data sources. For example, most topics in "Newsweek" focus on news categories including "political news", "technology news", and "business news", while articles in the "Wall Street Journal" are more related to "economic news". The difference in topic proportions between data sources also explains why the topic number K differs between them. Thus, automatically discovering source-level topic proportions facilitates accurately discovering the document structure of multi-source documents.
Therefore, in order to solve the above three problems, a new clustering method for multi-source text data is needed to obtain a more ideal clustering effect.
Disclosure of Invention
The technical problem the invention aims to solve is: providing a multi-source text clustering method based on a hierarchical Dirichlet allocation model, in which the HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance, thereby effectively solving the above problems.
The technical scheme of the invention is as follows: a multi-source text clustering method based on a hierarchical Dirichlet allocation model, the method comprising the steps of: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result.
In the second step, the preprocessing comprises word segmentation and the removal of stop words, low-frequency words, punctuation, and numbers.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k)
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α)
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s)
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s)
The fourth step comprises the following specific steps: first, the model parameters are initialized; the parameters to be initialized comprise the hyperparameters {α, μ, σ²} and the hidden variables {β, z}. After initialization, Blocked Gibbs sampling is performed; when the sampling result stabilizes, the Dirichlet parameter β_k that generates the topic-word distribution parameter is updated, and the Blocked Gibbs sampling process is repeated.
The inference process of Blocked Gibbs sampling is as follows:
For each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s;
2) Update the topic distribution θ_s;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}.
In the fifth step, a clustering result is obtained according to the final sampling in the fourth step.
The beneficial effects of the invention are as follows: compared with the prior art, the technical scheme improves the clustering effect on multi-source text by updating the prior parameters of the topic-word distribution of the multi-source text, and the constructed model can automatically determine the number of clusters in the text of each data source without manual specification in advance. The multiple sources share a prior parameter that generates similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, namely the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
Drawings
FIG. 1 is a flow chart of the present invention;
Fig. 2 is the topic model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings of the present specification.
Example 1: As shown in Figs. 1-2, a multi-source text clustering method based on a hierarchical Dirichlet allocation model comprises the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result.
To execute the method of the invention, step one is performed first: the text sets from the multiple data sources to be clustered are obtained.
The first multi-source text set of this embodiment is the HuffAmaSet data. The dataset contains 10000 texts, of which 5000 are news articles collected from the HuffPost website (denoted HnewSet below) and the other 5000 are comment texts collected from the Amazon website (denoted ASet). The dataset contains two topics, "food" and "sport". The second text set of this embodiment is BBCTset. This dataset also contains 10000 texts, of which 5000 are news articles collected from the BBC website and the other 5000 are articles collected from Twitter. The dataset contains three topics, "business", "sport" and "politics".
Second, step two is executed: text preprocessing is performed on the acquired text set, including word segmentation, stop-word removal, and low-frequency-word removal. After preprocessing, redundant information in the text is removed, making the text set concise and tidy, saving resources, and facilitating computation.
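As an illustrative sketch (not part of the patent), the preprocessing of step two can be implemented as follows; the function name, the stop-word set, and the frequency threshold are assumptions of this example:

```python
import re
from collections import Counter

def preprocess(docs, stopwords, min_freq=2):
    """Tokenize, drop stop words, and prune low-frequency words (illustrative)."""
    # Keep alphabetic tokens only, which also removes punctuation and numbers
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    tokenized = [[w for w in doc if w not in stopwords] for doc in tokenized]
    # Count corpus-wide frequencies and drop rare words
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if freq[w] >= min_freq] for doc in tokenized]
```

For Chinese text, the tokenization line would be replaced by a word-segmentation step, since words are not whitespace-delimited.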
Third, after the text set is processed, the modeling of step three is performed: a probabilistic topic model for multi-source text clustering is established based on the hierarchical Dirichlet allocation model. The model can automatically determine the number of clusters in the text set of each data source without manual specification in advance, and the text set of each data source in the multi-source dataset has its own topic-word distribution and topic distribution.
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k)
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α)
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s)
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s)
wherein α denotes the parameter of the Dirichlet distribution, a vector whose dimension equals the number of topics; μ and σ² denote the parameters of the normal distribution; β_k denotes the Dirichlet parameter that generates the word distribution of topic k, whose dimension equals the vocabulary size of the corpus; θ_s denotes the topic distribution of data source s in the multi-source text set; φ_s denotes the topic-word distribution of data source s in the multi-source text set; z_d denotes the topic sampled from θ_s for the d-th text; w_d denotes the d-th text of data source s in the multi-source text set; M_s denotes the total number of texts of data source s in the multi-source text set; and K denotes the total number of topics at initialization.
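A minimal numerical sketch of the above generation process (illustrative only, not the patent's implementation; the exp transform that maps the normally distributed β_{k,i} to positive values is an assumption of this example, since Dirichlet parameters must be positive):

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, V = 3, 2, 30                 # topics, data sources, vocabulary size (example values)
mu, sigma2 = 0.0, 1.0
alpha = np.ones(K)

# 1) For each topic k: beta_{k,i} ~ N(mu, sigma^2); exp() keeps parameters positive (assumption)
beta = np.exp(rng.normal(mu, np.sqrt(sigma2), size=(K, V)))
# For each source s: phi_{s,k} ~ Dirichlet(beta_k) -- similar but not identical across sources
phi = np.stack([[rng.dirichlet(beta[k]) for k in range(K)] for _ in range(S)])

# 2) For each source s: theta_s ~ Dirichlet(alpha)
theta = np.stack([rng.dirichlet(alpha) for _ in range(S)])

def generate_doc(s, n_words=20):
    """Draw one document-level topic z_d, then n_words words from phi_{s, z_d}."""
    z_d = rng.choice(K, p=theta[s])
    words = rng.choice(V, size=n_words, p=phi[s, z_d])
    return z_d, words
```

Because every source draws its φ_{s,k} from the same shared β_k, the per-source topic-word distributions are related but distinct, which is the property the model exploits.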
Fourth, based on the above model, step four of the invention is performed. In this step, the parameters are estimated using the Blocked Gibbs sampling method. The state of the Markov chain is denoted U = {β, θ, φ, Z}.
First, the model parameters are initialized; the parameters to be initialized comprise the hyperparameters {α, μ, σ²} and the hidden variables {β, z}. After initializing the model parameters, the inference process of Blocked Gibbs sampling is as follows:
3) Update the topic-word distribution φ_s of each data source text set in the multi-source dataset. For k = {1, 2, …, K}: if topic k is not assigned to any document, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameter β_k + n_k, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k.
4) Update the topic distribution θ_s of each data source text set in the multi-source dataset. Sample the topic distribution from a Dirichlet distribution whose k-th parameter is α_k + Σ_{l=1}^{M_s} I(z_l = k), where I(z_l = k) is an indicator function: I(z_l = k) = 1 when z_l = k, and 0 otherwise.
5) Sample and update the topic z_d of each text from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}}, where p_{d,k} ∝ θ_{s,k} ∏_w φ_{s,k,w}^{n_{d,w}} and n_{d,w} is the number of occurrences of word w in document d.
It should be noted that if the number of clusters estimated by the model is K*, this value is the number of distinct topics appearing in the vector z = (z_1, z_2, …, z_{M_s}) and is smaller than the initialized value K.
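The per-text topic update of step 5) can be sketched as follows (an illustrative sketch, not the patent's code; the log-space computation and the variable names are assumptions of this example):

```python
import numpy as np

def sample_doc_topics(doc_counts, phi_s, theta_s, rng):
    """Blocked update of z_d for every document of one source s.
    doc_counts: M_s x V word-count matrix; phi_s: K x V; theta_s: length-K."""
    log_phi = np.log(np.clip(phi_s, 1e-300, None))
    z = np.empty(doc_counts.shape[0], dtype=int)
    for d, n_d in enumerate(doc_counts):
        # log p_{d,k} = log theta_{s,k} + sum_w n_{d,w} * log phi_{s,k,w}
        logp = np.log(theta_s) + log_phi @ n_d
        p = np.exp(logp - logp.max())      # subtract max for numerical stability
        z[d] = rng.choice(len(theta_s), p=p / p.sum())
    return z
```

Working in log space avoids the underflow that the product ∏_w φ^{n_{d,w}} would otherwise cause for long documents.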
In step four, an update operation of the parameter β is required. When the sampling result stabilizes, β is updated by optimizing the posterior probability of generating the whole dataset, where n^{k}_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
Finally, step five of the invention is executed to perform text clustering: the topic assignment of each text is obtained from the final sampling result, and the texts are clustered accordingly.
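The clustering of step five reduces to grouping texts by their final sampled topic label; a minimal sketch (the function and variable names are assumptions of this example):

```python
from collections import defaultdict

def clusters_from_topics(topic_of_doc, doc_ids):
    """Group document ids by their final sampled topic label z_d."""
    groups = defaultdict(list)
    for doc_id, k in zip(doc_ids, topic_of_doc):
        groups[k].append(doc_id)
    return dict(groups)
```

The number of non-empty groups returned for a source is its estimated cluster number K*, which is how the method reports a per-source cluster count without it being specified in advance.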
The symbol descriptions in this example are shown in Table 1.
TABLE 1
The clustering effect on multi-source text is improved by updating the prior parameters of the topic-word distribution of the multi-source text, and the constructed model can automatically determine the number of clusters in the text of each data source without manual specification in advance. The multiple sources share a prior parameter that generates similar but different topic-word distributions, which is the basis for improving the clustering effect on multi-source text data; since each data source has its own topic distribution and topic-word distribution parameters, the invention can automatically infer the topic number and word characteristics of each data source. The method can thus improve the multi-source text clustering effect to a great extent.
The invention provides a novel multi-source document clustering model, namely the HDMA model. The HDMA model is built using a two-step hierarchical topic generation process. The learned topics share their general characteristics among the data sources while preserving the local characteristics of each data source. Each data source applies an exclusive topic partition to learn the topic emphasis at the source level. In addition, the invention can automatically identify the number of text clusters of each dataset in the multi-source dataset without manual setting in advance.
Matters not described in detail in this application are well known to those skilled in the art. Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, all of which are intended to be covered by the scope of the claims of the present invention.
Claims (2)
1. A multi-source text clustering method based on a hierarchical Dirichlet allocation model, characterized in that the method comprises the following steps: 1. collecting a text set from multiple sources; 2. preprocessing the text information from the multiple data sources; 3. constructing a topic model based on the hierarchical Dirichlet allocation model; 4. performing Blocked Gibbs sampling and updating the parameter β; 5. clustering the texts according to the sampling result;
In the second step, the preprocessing comprises word segmentation and the removal of stop words, low-frequency words, punctuation, and numbers;
In the third step, the text generation process of the constructed multi-source topic model is as follows:
1) For each topic k:
A. Sample β_{k,i} ~ N(μ, σ²), i = 1, 2, …, V;
B. For each data source s:
Sample φ_{s,k} ~ Dirichlet(β_k);
2) For each data source s:
C. Sample θ_s ~ Dirichlet(α);
D. For each document d in data source s:
Sample z_d ~ Multinomial(θ_s);
E. For each word w_i in document d:
Sample w_i ~ Multinomial(w_i | z_d, φ_s);
In the fourth step, based on the topic model constructed in the third step, the feature word distribution, the noise word distribution and the topic distribution of each data source in the multi-source dataset are sampled using the Blocked Gibbs sampling algorithm; when the sampling result stabilizes, the Dirichlet parameter β that generates the topic-word distribution parameters is updated, and the Blocked Gibbs sampling process is repeated;
The inference process of Blocked Gibbs sampling is as follows:
For each data source s in the multi-source dataset:
1) Update the topic-word distribution φ_s: for k = {1, 2, …, K}, if topic k is not assigned to any document, sample φ_k from the Dirichlet distribution with parameter β_k; otherwise, sample φ_k from the Dirichlet distribution with parameter β_k + n_k, where n_{k,w} denotes the number of occurrences of word w in the documents assigned to topic k;
2) Update the topic distribution θ_s: sample the topic distribution from a Dirichlet distribution whose k-th parameter is α_k + Σ_{l=1}^{M_s} I(z_l = k), where I(z_l = k) is an indicator function with I(z_l = k) = 1 when z_l = k and 0 otherwise;
3) Update the topic z_d of each text, where d = {1, 2, …, M_s}, by sampling from the discrete distribution {p_{d,1}, p_{d,2}, …, p_{d,K}} with p_{d,k} ∝ θ_{s,k} ∏_w φ_{s,k,w}^{n_{d,w}}, where n_{d,w} is the number of occurrences of word w in document d;
In the fourth step, an update operation of the parameter β is required: after the sampling result stabilizes, β is updated by optimizing the posterior probability of generating the whole dataset, where n^{k}_{m,w} denotes the number of occurrences of word w in the m-th document under topic k.
2. The multi-source text clustering method based on the hierarchical Dirichlet allocation model according to claim 1, characterized in that: in the fifth step, a clustering result is obtained according to the final sampling of the fourth step.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010570969.8A CN111813935B (en) | 2020-06-22 | 2020-06-22 | Multi-source text clustering method based on hierarchical dirichlet allocation model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813935A CN111813935A (en) | 2020-10-23 |
CN111813935B true CN111813935B (en) | 2024-04-30 |
Family
ID=72846295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010570969.8A Active CN111813935B (en) | 2020-06-22 | 2020-06-22 | Multi-source text clustering method based on hierarchical dirichlet allocation model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111813935B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107798043A (en) * | 2017-06-28 | 2018-03-13 | 贵州大学 | The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays |
CN109829151A (en) * | 2018-11-27 | 2019-05-31 | 国网浙江省电力有限公司 | A kind of text segmenting method based on layering Di Li Cray model |
CN110046228A (en) * | 2019-04-18 | 2019-07-23 | 合肥工业大学 | Short text subject identifying method and system |
CN110263153A (en) * | 2019-05-15 | 2019-09-20 | 北京邮电大学 | Mixing text topic towards multi-source information finds method |
KR20200026351A (en) * | 2018-08-29 | 2020-03-11 | 동국대학교 산학협력단 | Device and method for topic analysis using an enhanced latent dirichlet allocation model |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8510257B2 (en) * | 2010-10-19 | 2013-08-13 | Xerox Corporation | Collapsed gibbs sampler for sparse topic models and discrete matrix factorization |
US8527448B2 (en) * | 2011-12-16 | 2013-09-03 | Huawei Technologies Co., Ltd. | System, method and apparatus for increasing speed of hierarchial latent dirichlet allocation model |
CN107133238A (en) * | 2016-02-29 | 2017-09-05 | 阿里巴巴集团控股有限公司 | A kind of text message clustering method and text message clustering system |
Non-Patent Citations (2)
Title |
---|
A text clustering algorithm based on the Dirichlet process mixture model; Gao Yue; Wang Wenxian; Yang Shuxian; Netinfo Security (11); full text * |
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; Xu Liyang et al.; Journal of Computer Applications; Vol. 38, No. 11; pp. 3094-3099 * |
Also Published As
Publication number | Publication date |
---|---|
CN111813935A (en) | 2020-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||