CN106096014A

CN106096014A - Text clustering method for mixed-length text sets based on DMR

Info

Publication number: CN106096014A
Application number: CN201610469360.5A
Authority: CN
Inventors: 黄瑞章; 闫盈盈; 王瑞; 钟文良; 黄庭; 李晶; 陈功; 刘博伟; 朱坤; 王振军
Original assignee: Guizhou Farming Technology Co Ltd; Guizhou University
Current assignee: Guizhou Farming Technology Co Ltd; Guizhou University
Priority date: 2016-06-25
Filing date: 2016-06-25
Publication date: 2016-11-09

Abstract

The invention discloses a text clustering method, based on Dirichlet-multinomial regression (DMR), for mixed-length text sets, that is, collections containing both long and short texts. Because it operates on mixed-length text sets, the method applies more generally than methods designed for traditional long-text sets. It uses the DMR method to determine the prior parameters of the model, improving on traditional clustering methods in which prior values are set manually. Unlike conventional treatments of mixed text sets, the long and short text sets share the same topics. The method is simple and performs well.

Description

Text clustering method for mixed-length text sets based on DMR

Technical field

The present invention relates to the field of computer technology, and in particular to a text clustering method for mixed-length text sets based on DMR.

Background technology

With the arrival of the big data era, mining latent topic information from massive text data has become increasingly important.

To find content in massive data that is valuable and meets user needs, the text mining field generally uses text clustering methods. Text clustering divides a given text data set into multiple classes such that the texts within each class are semantically highly similar, while the semantic similarity between classes is very low. Text clustering methods are now widely used in text mining, especially in fields such as information retrieval and intelligent search engines.

Texts are divided by length into two classes: long texts and short texts. Existing techniques for clustering pure long-text sets are quite mature, and some progress has also been made on clustering pure short-text sets. Short texts, however, have two major characteristics: highly sparse features and strong context dependence. Clustering methods for short texts therefore still need exploration and improvement. For the same reason, clustering of mixed-length text sets still cannot achieve satisfactory results.

At present, text clustering algorithms based on probabilistic topic models emerge one after another, and they often cluster long texts (news, blogs, e-mail, and so on) well. With the explosive growth of social media such as microblogs, mining the semantics hidden in such short texts has become very important. Because short texts have sparse features and strong context dependence, however, directly applying long-text clustering methods to them gives unsatisfactory results. Real-world text collections contain both long and short texts, yet clustering methods for mixed-length text sets are still immature and leave much room for improvement.

Summary of the invention

The technical problem to be solved by the present invention is to provide a text clustering method for mixed-length text sets based on DMR that achieves better clustering results than the prior art.

The present invention is realized as follows: a text clustering method for mixed-length text sets based on DMR, comprising the following steps:

1) perform text pre-processing on the original mixed-length text set;

2) divide the pre-processed text set into a long text set and a short text set;

3) model the text set using the DMR method;

4) from the model, obtain the topic-word distribution of the whole corpus and the separate document-topic distributions of the long texts and the short texts;

5) cluster the mixed-length texts according to these distributions.

In the text pre-processing of step 1), the text set is a collection of papers related to databases, graphics and images, or computer networks, and pre-processing comprises word segmentation and stop-word removal.

In step 2), the pre-processed text set is divided into a long text set and a short text set: texts with fewer than 140 characters belong to the short text set, and all other texts belong to the long text set. The abstract of each paper in the text set is placed in the long text set, and the title of each paper is placed in the short text set.
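The length rule of step 2) can be illustrated with a minimal Python sketch. The function name `split_by_length` and the constant `SHORT_TEXT_LIMIT` are hypothetical names introduced here for illustration; only the 140-character threshold comes from the text.

```python
from typing import List, Tuple

# Character threshold stated in step 2); the constant name is illustrative.
SHORT_TEXT_LIMIT = 140


def split_by_length(
    texts: List[str], limit: int = SHORT_TEXT_LIMIT
) -> Tuple[List[str], List[str]]:
    """Return (long_texts, short_texts).

    A text with fewer than `limit` characters is treated as short;
    every other text is treated as long.
    """
    long_texts: List[str] = []
    short_texts: List[str] = []
    for text in texts:
        if len(text) < limit:
            short_texts.append(text)
        else:
            long_texts.append(text)
    return long_texts, short_texts
```

For a paper collection, the abstracts would normally land in the long list and the titles in the short list, which matches the division described above.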

The modeling in step 3) uses the long text set to assist the modeling of the short text set; the two share the same topic-word distribution.

During modeling, Dirichlet-multinomial regression, i.e. the DMR method, is used as a log-linear prior on the document-topic distributions. Because the long and short text sets carry different prior information, long and short texts yield different prior parameters. The prior information used by this model is whether the current document is a long text or a short text: a long text is labeled 1 and a short text is labeled 0.
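A log-linear prior of this kind can be sketched numerically as follows. This is an illustration under assumptions, not the patent's own formula: it uses the common DMR form in which the Dirichlet parameter of topic k for document d is the exponential of a linear function of the document's features. The intercept row of `lam`, the function name `dmr_alpha`, and the shape conventions are all assumptions made here; the text specifies only the 0/1 long-or-short indicator.

```python
import numpy as np


def dmr_alpha(is_long: np.ndarray, lam: np.ndarray) -> np.ndarray:
    """Compute document-specific Dirichlet parameters from a log-linear prior.

    is_long: shape (D,), 1 for a long text and 0 for a short text.
    lam:     shape (2, K); row 0 is an assumed intercept, row 1 is the
             weight of the long/short indicator for each of the K topics.
    Returns alpha with shape (D, K), where
        alpha[d, k] = exp(lam[0, k] + is_long[d] * lam[1, k]).
    """
    # Feature matrix: a constant 1 (intercept) plus the 0/1 length label.
    x = np.column_stack([np.ones(len(is_long)), is_long.astype(float)])
    return np.exp(x @ lam)
```

With this form, all short texts share one prior vector and all long texts share another, which is one way the two kinds of document can end up with different prior parameters.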

Because short texts suffer from sparse features and strong context dependence, using long-text information to assist the short texts during the modeling of the mixed-length text set helps produce better results. In this modeling process the long and short text sets share the same topic-word distribution, which is how the long-text information comes to assist the short texts.

Compared with the prior art, the present invention operates on mixed-length text sets and therefore applies more generally than methods designed for traditional long-text sets. It uses the DMR method to determine the prior parameters of the model, improving on traditional clustering methods in which prior values are set manually. Unlike conventional treatments of mixed text sets, the long and short text sets share the same topics. The method is simple and performs well.

Brief description of the drawings

Fig. 1 is the execution flow chart of an embodiment of the present invention;

Fig. 2 is the model of an embodiment of the present invention.

Detailed description of the embodiments

Embodiment 1 of the present invention: a text clustering method for mixed-length text sets based on DMR. The flow of this embodiment is shown in Fig. 1:

Step s1 is performed first to obtain the mixed text set to be clustered; this embodiment uses a data set derived from Twitter;

Next, step s2 is performed to pre-process the mixed-length text set. English text requires word segmentation, stop-word removal, stemming, and similar processing. Pre-processing removes redundant information from the texts, so that the text set becomes concise and tidy, which saves resources and simplifies computation;
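A minimal Python sketch of the pre-processing in step s2 is given below, covering word segmentation and stop-word removal only. The stop-word list `STOP_WORDS` is a tiny illustrative sample and the function name `preprocess` is hypothetical; stemming is deliberately left out, since it would normally be delegated to a dedicated library.

```python
import re
from typing import Iterable, List

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "and", "or", "in", "on", "for", "to", "is", "are", "with"}
)


def preprocess(text: str, stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Lower-case the text, split it into alphanumeric tokens, drop stop words."""
    stop = set(stop_words)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop]
```

Applying `preprocess` to every document yields the token lists that the later modeling steps consume.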

So that the long texts can better assist the short texts, step s3 is performed: the abstract of each paper in the paper collection is extracted and added to the long text set, forming the auxiliary text set, while the title of each paper is extracted and added to the short text set, forming the text set to be assisted;

After the long and short texts have been divided, step s4 is performed to build the model. In this model, the long and short text sets share a single topic-word distribution matrix, which is the essence of using long texts to assist short texts. On the other hand, because their prior parameters are determined separately by Dirichlet-multinomial regression, their document-topic distributions differ, as shown in Fig. 2.

First, the symbols in the model are explained. The main symbols used in this example are listed in Table 1.

Table 1

The generative process of the model in this example is described below:

After the model is built, step s5 of the present invention is performed: each word in every document is assigned a random topic, which serves as the initial state of the Markov chain.
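The random initialization of step s5 can be sketched as follows. The representation is an assumption made for illustration: each document is a list of integer word ids, and the function `init_topics` (a hypothetical name) also builds the document-topic and topic-word count tables that a Gibbs sampler typically maintains.

```python
import numpy as np


def init_topics(docs, num_topics, vocab_size, seed=0):
    """Assign a uniformly random topic to every word occurrence.

    docs: list of documents, each a list of integer word ids.
    Returns (z, n_dk, n_kw): per-token topic assignments, the
    document-topic count table, and the topic-word count table.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), num_topics), dtype=np.int64)
    n_kw = np.zeros((num_topics, vocab_size), dtype=np.int64)
    z = []
    for d, doc in enumerate(docs):
        # One random topic in [0, num_topics) per token of this document.
        z_d = rng.integers(num_topics, size=len(doc))
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
        z.append(z_d)
    return z, n_dk, n_kw
```

The returned assignments play the role of the initial state of the Markov chain; the single `n_kw` table covers long and short documents alike, reflecting the shared topic-word distribution.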

Because this embodiment uses a mixed-length text set, the topic update depends on the document type: for a long text, step s6 is performed; for a short text, step s7 is performed. Both steps update topics by Gibbs sampling, with the following update rule:

Once Gibbs sampling has converged, the parameter estimates are obtained by counting over the sampling results.
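As a hedged illustration of sampling followed by count-based estimation, the sketch below uses the standard collapsed Gibbs conditional for a topic model with document-specific Dirichlet priors; it is not taken from the patent, whose own update formula is not reproduced in this text. The symmetric topic-word smoothing value `beta`, the function names `gibbs_sweep` and `estimate`, and the count-table layout are assumptions.

```python
import numpy as np


def gibbs_sweep(docs, z, n_dk, n_kw, alpha, beta, rng):
    """Resample the topic of every token once, updating the counts in place.

    alpha: shape (D, K) document-specific priors (e.g. from a DMR prior).
    beta:  assumed symmetric smoothing for the shared topic-word table.
    """
    vocab_size = n_kw.shape[1]
    n_k = n_kw.sum(axis=1)  # total number of tokens assigned to each topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current token from all counts.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Unnormalized conditional probability of each topic.
            weights = (n_dk[d] + alpha[d]) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
            k_new = rng.choice(len(weights), p=weights / weights.sum())
            # Add the token back under its new topic.
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1


def estimate(n_dk, n_kw, alpha, beta):
    """Turn the final counts into document-topic and topic-word estimates."""
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

In this sketch long and short documents differ only through their rows of `alpha`, while `n_kw` is shared, so the rows of `theta` for short texts and for long texts would correspond to the distributions obtained in steps s8 and s10, and `phi` to the corpus-wide distribution of step s9.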

Step s8 is performed to obtain the document-topic distribution of the short texts, step s9 to obtain the topic-word distribution of the whole corpus, and step s10 to obtain the document-topic distribution of the long texts.

Step s11 of the present invention is performed to cluster the texts.
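The text does not state how the distributions are turned into clusters. One common choice, shown here purely as an assumption, is to assign each document to its most probable topic; the function name `assign_clusters` is hypothetical.

```python
import numpy as np


def assign_clusters(theta: np.ndarray) -> np.ndarray:
    """Assign each document to the topic with the highest probability.

    theta: shape (D, K) document-topic distribution, one row per document.
    Returns an array of D cluster labels in [0, K).
    """
    return np.argmax(theta, axis=1)
```

Because long and short texts are modeled over the same topics, labels produced this way are directly comparable across the two sets.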

The above are embodiments of the present invention. It should be noted that those skilled in the art can make a number of improvements without departing from the principle of the invention, and such improvements are also regarded as falling within the protection scope of the present invention.

Claims

1. A text clustering method for mixed-length text sets based on DMR, characterized by comprising the following steps:

1) perform text pre-processing on the original mixed-length text set;

3) model the text set using the DMR method;

2. The text clustering method for mixed-length text sets based on DMR according to claim 1, characterized in that: in the text pre-processing of step 1), the text set is a collection of papers related to databases, graphics and images, or computer networks, and pre-processing comprises word segmentation and stop-word removal.

3. The text clustering method for mixed-length text sets based on DMR according to claim 1, characterized in that: in step 2), the pre-processed text set is divided into a long text set and a short text set; texts with fewer than 140 characters belong to the short text set, and all other texts belong to the long text set; the abstract of each paper in the text set is placed in the long text set, and the title of each paper is placed in the short text set.

4. The text clustering method for mixed-length text sets based on DMR according to claim 3, characterized in that: the modeling in step 3) uses the long text set to assist the modeling of the short text set, and the two share the same topic-word distribution.