CN100495408C

CN100495408C - Text clustering element study method and device

Info

Publication number: CN100495408C
Application number: CN 200710117752
Authority: CN
Inventors: 向继; 夏鲁宁; 荆继武; 冯登国
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences; Institute of Information Engineering of CAS
Priority date: 2007-06-22
Filing date: 2007-06-22
Publication date: 2009-06-03
Anticipated expiration: 2027-06-22
Also published as: CN101079072A

Abstract

The invention discloses a text clustering meta-learning method comprising the following steps: analyzing a text collection with text analysis methods to obtain at least two text analysis results; synthesizing a text vector matrix from the text analysis results; and performing meta-learning on the text vector matrix to obtain the final clustering result. The invention also discloses a text clustering meta-learning device comprising a text analysis module, a matrix synthesis module and a meta-learning module. The invention reduces the deviation introduced by any single clustering run and thereby improves the accuracy and stability of the clustering result.

Description

Text clustering meta-learning method and device

Technical field

The present invention relates to text clustering methods, and in particular to a text clustering meta-learning method and device.

Background technology

Text clustering is the application of cluster analysis to text processing. A text clustering method automatically discovers clusters in a text collection and assigns every text to a cluster, so that texts within the same cluster are highly similar in content while texts in different clusters differ substantially. Text clustering has many applications. For example, the Topic Detection and Tracking (TDT) project of the U.S. Department of Defense uses text clustering to find hot topics automatically in a stream of news texts. Text clustering can also be applied to the result pages returned by a search engine, giving the user more structured and more intelligible search results, and it can be used to generate automatically a taxonomy of web texts similar to the Yahoo Directory.

Current text clustering methods are usually based on the vector space model. In this model each text is represented as a text vector in a multidimensional Euclidean space. Each dimension corresponds to one feature word, and the value of a text vector on a dimension is commonly defined as the number of times the corresponding feature word occurs in the text. For any text collection, the vector space model yields a feature-word-based text vector matrix V (n*k), where n is the number of texts in the collection, k is the dimension of each text vector, and each row of the matrix is one text vector. Once the vector matrix is available, classical clustering algorithms such as K-means or hierarchical agglomerative clustering (HAC) can be applied to it to produce a text clustering result.

Existing clustering algorithms fall broadly into hierarchical clustering, partitional clustering, density-based clustering, grid-based clustering and model-based clustering. Partitional algorithms, and K-means in particular, have long been among the most widely used. K-means assigns each data sample to a class by comparing its distance to each class center and, through repeated iteration, divides the data set into K parts, where K is the desired number of clusters and must be specified in advance. Specifically, K-means comprises three steps. First, K initial class centers are chosen from the data set, each representing one cluster. Second, each data sample is assigned to the cluster whose center is nearest to it. Third, the center of each cluster formed so far is computed and replaces the previous center, and the algorithm returns to the second step. The second and third steps are repeated until the result converges, that is, until the cluster membership of every data sample no longer changes.
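The three K-means steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `kmeans` and `squared_distance`, the random choice of initial centers, the `max_iter` cap and the toy data are assumptions made for the example and are not taken from the patent.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Hard K-means: returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 1: choose K initial class centers from the data set.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every sample to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Converged: no sample changed its cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, k=2)
```

On this toy data the two nearby pairs of points end up in separate clusters; which cluster receives label 0 depends on the random initial centers.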

Besides text clustering, text classification is another method of text analysis. Unlike text clustering, text classification requires manual training: the categories must be specified by hand and training data must be supplied for each category, after which the category of a text under test is decided from the differences between that text and the training data. A commonly used text classification method is the K-nearest-neighbor (KNN) algorithm.

General text clustering and classification methods assign each text in the collection to exactly one cluster or category. Soft clustering and soft classification extend these methods: instead of assigning a text to a single cluster or category, they assign it to several clusters or categories with different probabilities. In general, the results obtained by soft clustering and soft classification are better founded.

A major problem of current text clustering methods is poor stability: applied to different text collections, the same text clustering method may give good results on some and poor results on others. It can also happen that on one text collection method A gives better results than method B, while on another text collection method A gives worse results than method B.

To address this problem, some researchers have proposed text clustering meta-learning. This approach combines the clustering results produced by several text clustering algorithms in order to improve stability. Meta-learning here means learning again from learning results, that is, learning from the clustering results produced by those algorithms. Text clustering meta-learning can be realized by different strategies, of which the most intuitive is the voting (consensus) method, which is very easy to implement. In the voting method several text clustering algorithms are used; each gives its own judgment of which cluster a given object belongs to, and the cluster receiving the most votes is taken as the cluster to which the object finally belongs.

In addition, Alexander Strehl et al. have proposed three voting-based meta-learning methods. The first is the Cluster-based Similarity Partitioning Algorithm (CSPA), which estimates the similarity between every two objects from whether they fall in the same cluster and then repartitions the objects according to these similarities to obtain a better clustering result. The second is the HyperGraph Partitioning Algorithm (HGPA), which treats text clustering meta-learning as a hypergraph partitioning problem in which the hyperedges are the generated clusters. The third treats meta-learning as a cluster correspondence process: similar clusters in two different groups of clustering results are mapped to each other and then combined to produce the final clusters.

The third method must solve the cluster correspondence problem before it can be applied, but because text clustering has no fixed category system, mapping the clusters of two different groups of results one to one is very difficult. A looser alternative is to look for clusters that occur in both groups of results and ignore the rest. Even so, as the number of clustering runs grows, such repeated clusters become very rare. The criterion can be relaxed further so that clusters in different results are mapped to each other whenever they are sufficiently similar, but then questions such as how to choose the similarity threshold themselves introduce instability into the meta-learning, which makes the goal of using meta-learning to improve stability hard to achieve.
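The pairwise similarity used by CSPA, as described above, can be illustrated with a small Python sketch. It computes, for every pair of objects, the fraction of clustering results in which the two objects share a cluster. The function name `co_association` and the toy label lists are assumptions for the example, and the repartitioning step that follows in CSPA is not shown.

```python
def co_association(partitions):
    """Fraction of clustering results in which two objects share a cluster.

    partitions: list of label lists, one per clustering result, all of
    the same length (one label per object).
    """
    n_objects = len(partitions[0])
    n_runs = len(partitions)
    counts = [[0] * n_objects for _ in range(n_objects)]
    for labels in partitions:
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]

# Three hard clusterings of four objects. The first two are the same
# partition with the cluster labels swapped.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
sim = co_association(partitions)
```

Because only co-membership is counted, the label swap between the first two results has no effect on the similarity, which is why this kind of measure does not depend on cluster correspondence.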

In summary, although prior-art text clustering meta-learning methods combine the clustering results of several text clustering algorithms, the meta-learning they perform on those results takes the form of voting, so clusters cannot be divided accurately on the basis of similarity, and the accuracy and stability of the clustering result are therefore not high.

Summary of the invention

In view of this, the invention provides a text clustering meta-learning method and device that improve the accuracy and stability of the clustering result.

To achieve the above object, the technical solution of the present invention is realized as follows:

An embodiment of the invention provides a text clustering meta-learning method comprising the following steps:

A. performing soft clustering or soft classification on a text collection with a text analysis method to obtain at least two clustering or classification results;

B. expressing each clustering or classification result as a result matrix and splicing the result matrices into a text vector matrix;

C. performing meta-learning on the text vector matrix to obtain the final clustering result.

An embodiment of the invention also provides a text clustering meta-learning device comprising a text analysis module, a matrix synthesis module and a meta-learning module.

The text analysis module performs soft clustering or soft classification on the text collection and sends the resulting clustering or classification results to the matrix synthesis module.

The matrix synthesis module converts the received clustering or classification results into matrices, splices the converted matrices into a text vector matrix, and sends the text vector matrix to the meta-learning module.

The meta-learning module performs meta-learning on the received text vector matrix and outputs the final clustering result.

As the above technical solution shows, the text clustering meta-learning method and device of the present invention synthesize a text vector matrix from several text analysis results and perform meta-learning on that matrix. Because the results of multiple soft clustering or soft classification runs are all taken into account, a more accurate text clustering result is obtained, the deviation introduced by a single clustering run is effectively reduced, and the accuracy and stability of the clustering result are improved.

Description of drawings

Fig. 1 is a schematic diagram of the text clustering meta-learning method in an embodiment of the invention.

Fig. 2 is a flow chart of the text clustering meta-learning method in an embodiment of the invention.

Fig. 3 is a structural diagram of the text clustering meta-learning device in an embodiment of the invention.

Embodiment

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.

The invention provides a text clustering meta-learning method. The method first processes a text collection several times with soft clustering and soft classification methods to obtain results, then constructs a text vector matrix from the results of these soft clustering and soft classification runs, and finally performs meta-learning on the text vector matrix with a clustering algorithm to obtain the final clustering result.

Fig. 1 is a schematic diagram of the text clustering meta-learning method in an embodiment of the invention. As shown in Fig. 1, the method first processes the text collection separately with several soft text clustering methods and soft text classification methods, obtaining several clustering results. A text vector matrix is then constructed from these results. Next, the text vector matrix is normalized. Finally, a meta-learning method based on a hard clustering algorithm performs meta-learning on the text vector matrix to obtain the final text clustering result.

Fig. 2 is a flow chart of the text clustering meta-learning method in an embodiment of the invention. As shown in Fig. 2, the method comprises the following steps:

Step 201: process the text collection m times with text analysis methods.

The text analysis methods include soft clustering methods and soft classification methods. These are two different kinds of data mining methods, each with its own field of application. The result of either is a number of classes together with the probability that each text belongs to each class. For soft clustering a class is a cluster found by the method; for soft classification a class is a specific category. The embodiments of the invention combine the two kinds of methods and thereby obtain more accurate and stable clustering results.

Processing the text collection m times with text analysis methods may mean using one or more soft clustering methods alone, using one or more soft classification methods alone, or using one or more soft clustering methods together with one or more soft classification methods. For convenience, the following description takes as its example the case in which soft clustering and soft classification methods are used together to process the text collection m times.

In the above step, published soft clustering and soft classification methods can be used to process the text collection, including algorithms such as soft K-means and soft KNN. In practice, n algorithms can first be selected from each of the soft clustering and soft classification methods, and each selected algorithm is then applied to the text collection m times. Both n and m are integers that can be set in advance and satisfy n ≥ 1, m ≥ 1. The same method is applied m times because even the same method applied to the same text collection may give different results. With soft K-means, for example, each run on the same text collection can give a different result. To improve the accuracy of the result, the same method is therefore applied m times, giving m results.
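As an illustration of how one soft clustering method can be run m times, the following Python sketch implements one common formulation of soft K-means, in which each text receives a membership probability proportional to exp(-beta * squared distance) for every cluster. The patent does not specify a particular formulation; the function name `soft_kmeans`, the stiffness parameter `beta`, the fixed iteration count and the use of a different random seed per run are all assumptions made for the example.

```python
import math
import random

def soft_kmeans(points, k, beta=1.0, iters=20, seed=0):
    """Return an n-by-k matrix of membership probabilities."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dims = len(points[0])
    resp = []
    for _ in range(iters):
        # Soft assignment: probability of each cluster for each point.
        resp = []
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers]
            d_min = min(d)  # subtract the minimum for numerical stability
            w = [math.exp(-beta * (di - d_min)) for di in d]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # Update: each center becomes the probability-weighted mean.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass > 0:
                centers[j] = [
                    sum(r[j] * p[dim] for r, p in zip(resp, points)) / mass
                    for dim in range(dims)
                ]
    return resp

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
m = 3
# m runs of the same method; only the random initialization differs.
results = [soft_kmeans(points, k=2, seed=run) for run in range(m)]
```

Each element of `results` is one result matrix with one row per text and one column per cluster. Different seeds can number the clusters differently, so the columns of two runs need not correspond.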

In addition, before the soft clustering and soft classification methods are applied, the text collection must be preprocessed: the texts in the collection are converted into feature-word-based text vectors so that the soft clustering and soft classification methods can process them. The preprocessing consists mainly of three steps, namely word segmentation, feature selection and text vectorization, through which every text in the collection is converted into a text vector in a multidimensional Euclidean space. The steps are as follows:

1) First, word segmentation is performed.

Word segmentation divides each text in the collection into individual words and counts the number of times each word occurs in the text; these counts serve as the basis for text vectorization.

2) Next, feature selection is performed.

A text collection contains a very large number of words, many of which do not help to distinguish clusters and may even harm the clustering, so feature selection must be applied to the words of the collection. The purpose of feature selection is to keep the words that help to distinguish clusters and to remove most of the words that do not, thereby reducing the dimension of the text vectors without affecting the clustering. Various feature selection methods can be used in the present invention, including removing stop words, removing words that occur in too many or too few texts, and removing words that occur too few times within a single text. By combining these methods, the number of feature words in a text collection can be reduced from more than ten thousand to about one thousand with almost no effect on the clustering.

3) Finally, text vectorization is performed.

The purpose of text vectorization is to convert each text in the collection into a text vector in a multidimensional Euclidean space. Each dimension of the space corresponds to one feature word, and the value of a text vector on a dimension is the weight of the corresponding feature word in the text. Commonly used weighting methods include term frequency (TF), term frequency-inverse document frequency (TF-IDF), tfc and ltc. Compared with the other weighting methods, ltc reduces the influence of differences in text length on text vectorization and clustering, so the present invention uses ltc. The ltc weight is computed as:

\mathrm{ltc}(i, k) = \frac{\log(f_{ik} + 1.0) \cdot \log(N / n_i)}{\sqrt{\sum_{j=1}^{M} \left[ \log(f_{jk} + 1.0) \cdot \log(N / n_j) \right]^2}} \qquad (1)

where N is the number of texts in the collection, M is the number of selected feature words, f_{ik} is the number of times the i-th feature word occurs in the k-th text, and n_i is the number of texts containing the i-th feature word.
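The preprocessing steps above can be sketched in Python as follows. The sketch is illustrative only: whitespace splitting stands in for a real word segmenter, the thresholds `min_df` and `max_df_ratio` are hypothetical parameters, the per-text minimum count filter mentioned in step 2) is omitted for brevity, and the function names are invented for the example. The weighting follows formula (1).

```python
import math

def select_features(tokenized_texts, stop_words, min_df=2, max_df_ratio=0.8):
    """Keep words that are not stop words and occur in neither too few
    nor too many texts (hypothetical thresholds)."""
    n_texts = len(tokenized_texts)
    doc_freq = {}
    for tokens in tokenized_texts:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        w for w, df in doc_freq.items()
        if w not in stop_words and df >= min_df and df / n_texts <= max_df_ratio
    )

def ltc_vectors(tokenized_texts, features):
    """One ltc-weighted vector per text, following formula (1)."""
    N = len(tokenized_texts)
    # counts[k][i] plays the role of f_ik: occurrences of word i in text k.
    counts = [[tokens.count(w) for w in features] for tokens in tokenized_texts]
    # n[i]: number of texts that contain feature word i.
    n = [sum(1 for row in counts if row[i] > 0) for i in range(len(features))]
    idf = [math.log(N / n_i) if n_i else 0.0 for n_i in n]
    vectors = []
    for row in counts:
        raw = [math.log(f + 1.0) * idf[i] for i, f in enumerate(row)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors

texts = ["the cat sat", "the cat ran", "the dog ran", "the dog barked"]
tokenized = [t.split() for t in texts]  # stand-in for word segmentation
features = select_features(tokenized, stop_words={"the"})
vectors = ltc_vectors(tokenized, features)
```

Because of the denominator in formula (1), every text that contains at least one weighted feature word is mapped to a vector of unit length, which is how ltc reduces the effect of differing text lengths.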

After the text collection has been preprocessed in this way, the soft clustering and soft classification methods are applied to it several times to obtain the results.

The above describes processing the text collection m times with soft clustering and soft classification methods used together. The steps for processing the text collection with soft clustering methods alone or with soft classification methods alone are similar and are therefore not repeated.

Step 202: construct the text vector matrix from the m results.

In this step, the text vector matrix is constructed as follows.

First, each result obtained by soft clustering or soft classification is expressed as a matrix. The columns of the matrix represent the clusters produced by the soft clustering method or the categories of the soft classification method, the rows represent the texts in the collection, and each element is the probability that a given text belongs to a given cluster or category. For example, for a text collection Corpus: (d_1, d_2, d_3, ..., d_n) containing n texts, suppose soft clustering and soft classification are performed m times in total. This yields m partition results for the collection, written Partition: (P_1, P_2, P_3, ..., P_m), where P_i is the result of the i-th soft clustering or soft classification and can be expressed as

P_i = \{ C_i^1, C_i^2, \ldots, C_i^{k_i} \},

where k_i is the number of clusters or categories in the i-th soft clustering or soft classification result, that is, in result P_i the collection is divided into k_i clusters or categories, and C_i^j is the j-th cluster or category in the i-th result. The i-th soft clustering or soft classification result can therefore be expressed as a matrix:

M_i = \begin{bmatrix} v_{1i} \\ v_{2i} \\ \vdots \\ v_{ni} \end{bmatrix} \qquad (2)

where v_{li} is the result for the l-th text of the collection in the i-th soft clustering or soft classification and can be expressed as

v_{li} = \{ \mathrm{Prob}_{li}^1, \mathrm{Prob}_{li}^2, \ldots, \mathrm{Prob}_{li}^{k_i} \},

where \mathrm{Prob}_{li}^j is the probability that the l-th text belongs to the j-th cluster or category in the i-th soft clustering or soft classification result.

In this way all soft classification and soft clustering result matrices are obtained. They are then spliced into a block matrix M, as shown below; M is the text vector matrix.

M = [M_1 \; M_2 \; \ldots \; M_m] \qquad (3)

where M_i is a result matrix obtained by soft clustering or soft classification and m is the number of soft clustering and soft classification runs performed.

Because different soft clustering and soft classification methods differ in accuracy, the user can, according to the actual situation, assign different weights k in advance to the results obtained by the different methods. Each result matrix is then multiplied by its corresponding weight before all the matrices are spliced into one text vector matrix, as follows:

M = [k_1 M_1 \; k_2 M_2 \; \ldots \; k_m M_m] \qquad (4)

where k_j is the weight set in advance for the j-th soft clustering or soft classification method and m is the number of different soft clustering or soft classification methods used.
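The splicing in formulas (3) and (4) amounts to placing the result matrices side by side, row by row, after scaling each one by its weight. A minimal Python sketch follows; the function name `splice` and the toy matrices are assumptions for the example, and omitting the weights reproduces the unweighted formula (3).

```python
def splice(result_matrices, weights=None):
    """Concatenate result matrices column-wise into one text vector matrix.

    result_matrices: list of matrices (lists of rows), each with one row
    per text. weights: optional list with one weight per matrix.
    """
    if weights is None:
        weights = [1.0] * len(result_matrices)  # formula (3)
    if len(weights) != len(result_matrices):
        raise ValueError("one weight per result matrix is required")
    n_texts = len(result_matrices[0])
    if any(len(matrix) != n_texts for matrix in result_matrices):
        raise ValueError("all result matrices must have one row per text")
    return [
        [w * p for matrix, w in zip(result_matrices, weights) for p in matrix[t]]
        for t in range(n_texts)
    ]

# Two results for the same two texts: one with 2 clusters, one with 3.
M1 = [[0.5, 0.5], [1.0, 0.0]]
M2 = [[0.25, 0.25, 0.5], [0.0, 1.0, 0.0]]
M_plain = splice([M1, M2])
M_weighted = splice([M1, M2], weights=[2.0, 0.5])
```

The spliced matrix has n rows and k_1 + k_2 + ... + k_m columns, so every text is described by its membership probabilities in all results at once, and no correspondence between the clusters of different results has to be established.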

Step 203: perform meta-learning to obtain the final clustering result.

In this step, the embodiment of the invention first normalizes the text vector matrix and then performs meta-learning on it with a clustering method to obtain the final text clustering result. Meta-learning here means learning again from the learning results of the soft clustering and soft classification of the previous step. Clustering algorithms commonly used in the art, such as K-means, hierarchical agglomerative clustering or density-based clustering, can be applied to the normalized text vector matrix for this purpose. The final clustering result is P = {C_1, C_2, ..., C_K}, where K is the number of clusters produced and C_i is the i-th cluster.
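A minimal Python sketch of this step is given below. The patent does not state which normalization is used, so scaling each row to unit Euclidean length is an assumption, as is the choice of average-linkage hierarchical agglomerative clustering among the algorithms named above. The function names and the toy matrix are invented for the example.

```python
import math

def l2_normalize(matrix):
    """Scale every row to unit Euclidean length (assumed normalization)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm else 0.0 for x in row])
    return out

def hac(rows, k):
    """Average-linkage agglomerative clustering down to k clusters.
    Returns a list of clusters, each a list of row indices."""
    clusters = [[i] for i in range(len(rows))]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(rows[a], rows[b])))

    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Text vector matrix spliced from a 2-cluster and a 3-cluster soft result.
M = [
    [0.9, 0.1, 0.8, 0.1, 0.1],
    [0.8, 0.2, 0.7, 0.2, 0.1],
    [0.1, 0.9, 0.1, 0.2, 0.7],
    [0.2, 0.8, 0.1, 0.1, 0.8],
]
final_clusters = hac(l2_normalize(M), k=2)
```

Here the first two texts and the last two texts have similar membership profiles across both soft results, so they are grouped together in the final hard clustering.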

In summary, the invention provides a text clustering meta-learning method. The method first vectorizes the text collection and processes it several times with soft clustering and soft classification algorithms. It then constructs a text vector matrix from the soft clustering and soft classification results, in which each element is the probability that a given text belongs to a cluster produced by one of the soft clustering methods or to a category of one of the soft classification methods. Finally, it performs meta-learning on the text vector matrix with a general clustering algorithm to obtain the final text clustering result. Because the method takes the results of multiple clustering and classification runs into account, it obtains a more accurate text clustering result, effectively reduces the deviation introduced by a single clustering run, and improves the accuracy and stability of the clustering result.

In addition, an embodiment of the invention provides a text clustering meta-learning device. Fig. 3 is a structural diagram of the device. As shown in Fig. 3, the device comprises a preprocessing module 300, a text analysis module 301, a matrix synthesis module 302 and a meta-learning module 303. The preprocessing module 300 performs text vectorization on the texts in the text collection and sends the vectorized text collection to the text analysis module 301. The text analysis module 301 performs text analysis on the text collection to be processed and sends the text analysis results to the matrix synthesis module 302; the text analysis methods it uses include soft clustering methods and soft classification methods. The matrix synthesis module 302 synthesizes a text vector matrix from the received text analysis results and sends it to the meta-learning module 303. The meta-learning module 303 performs meta-learning on the received text vector matrix and outputs the final clustering result.

In the above device the preprocessing module 300 is optional and is therefore drawn with a dotted line. As shown in Fig. 3, the preprocessing module 300 further comprises a word segmentation unit 304, a feature selection unit 305 and a text vectorization unit 306. The word segmentation unit 304 divides the texts in the collection into individual words, counts the number of times each word occurs in the collection, and sends the segmentation result and the counts to the feature selection unit 305. The feature selection unit 305 selects feature words from the collection according to the received segmentation result and counts and sends the selected feature words to the text vectorization unit 306. The text vectorization unit 306 converts the texts in the collection into text vectors according to the received feature words and sends the vectorized text collection to the text analysis module 301.

As shown in Fig. 3, the matrix synthesis module 302 further comprises a matrix conversion unit 307 and a synthesis unit 308. The matrix conversion unit 307 converts the received text analysis results into matrices and sends the converted matrices to the synthesis unit 308. The synthesis unit 308 synthesizes all the received matrices into one text vector matrix and sends it to the meta-learning module 303.

As shown in Fig. 3, the meta-learning module 303 further comprises a normalization unit 309 and a learning unit 310. The normalization unit 309 normalizes the received text vector matrix and sends the normalized matrix to the learning unit 310. The learning unit 310 performs meta-learning on the normalized text vector matrix and outputs the final clustering result.
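The data flow between the modules of Fig. 3 can be illustrated with a short Python sketch. The class name `MetaLearningDevice`, its constructor parameters and the toy stand-in functions are all hypothetical; in particular `toy_meta_learn` is a placeholder and not a real clustering algorithm.

```python
class MetaLearningDevice:
    """Wires the modules together: optional preprocessing, text analysis,
    matrix synthesis and meta-learning."""

    def __init__(self, analyzers, synthesize, meta_learn, preprocess=None):
        self.analyzers = analyzers      # soft clustering / classification runs
        self.synthesize = synthesize    # matrix synthesis module
        self.meta_learn = meta_learn    # meta-learning module
        self.preprocess = preprocess    # optional preprocessing module

    def run(self, texts):
        data = self.preprocess(texts) if self.preprocess is not None else texts
        results = [analyze(data) for analyze in self.analyzers]
        if len(results) < 2:
            raise ValueError("at least two analysis results are required")
        matrix = self.synthesize(results)
        return self.meta_learn(matrix)

def concat_rows(results):
    """Splice result matrices side by side, one row per text."""
    return [[p for m in results for p in m[i]] for i in range(len(results[0]))]

def toy_meta_learn(matrix):
    """Placeholder for a hard clustering algorithm: the label is the
    position of the largest entry in each row."""
    return [row.index(max(row)) for row in matrix]

device = MetaLearningDevice(
    analyzers=[
        lambda data: [[0.9, 0.1], [0.2, 0.8]],  # fixed toy soft result 1
        lambda data: [[0.7, 0.3], [0.4, 0.6]],  # fixed toy soft result 2
    ],
    synthesize=concat_rows,
    meta_learn=toy_meta_learn,
)
labels = device.run(["text one", "text two"])
```

Passing the modules in as callables mirrors the description above, in which each module only receives the output of the previous one and the preprocessing module may be left out.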

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A text clustering meta-learning method, characterized in that the method comprises the following steps:

2. The method according to claim 1, characterized in that the text analysis method in step A is a soft clustering method and/or a soft classification method.

3. The method according to claim 1, characterized in that splicing the result matrices into the text vector matrix in step B comprises: multiplying each result matrix by a preset weight and then splicing all the result matrices into the text vector matrix.

4. The method according to claim 1, characterized in that before step A the method further comprises: preprocessing the text collection, the preprocessing comprising word segmentation, feature selection and text vectorization.

5. A text clustering meta-learning device, characterized in that the device comprises: a text analysis module, a matrix synthesis module and a meta-learning module;

6. The device according to claim 5, characterized in that the device further comprises a preprocessing module;

the preprocessing module is configured to perform text vectorization on the texts in the text collection and to send the vectorized text collection to the text analysis module.

7. The device according to claim 6, characterized in that the preprocessing module comprises a word segmentation unit, a feature selection unit and a text vectorization unit;

the word segmentation unit is configured to divide the texts in the text collection into individual words, to count the number of times each word occurs in the text collection, and to send the segmentation result and the counts to the feature selection unit;

the feature selection unit is configured to select feature words from the text collection according to the received segmentation result and counts and to send the selected feature words to the text vectorization unit;

the text vectorization unit is configured to convert the texts in the text collection into text vectors according to the received feature words and to send the vectorized text collection to the text analysis module.

8. The device according to claim 5, characterized in that the matrix synthesis module comprises a matrix conversion unit and a synthesis unit;

the matrix conversion unit is configured to convert the received clustering or classification results into matrices and to send the converted matrices to the synthesis unit;

the synthesis unit is configured to splice all the received matrices into one text vector matrix and to send the text vector matrix to the meta-learning module.

9. The device according to claim 5, characterized in that the meta-learning module comprises a normalization unit and a learning unit;

the normalization unit is configured to normalize the received text vector matrix and to send the normalized text vector matrix to the learning unit;

the learning unit is configured to perform meta-learning on the normalized text vector matrix and to output the final clustering result.