CN101408893A

CN101408893A - Method for rapidly clustering documents

Info

Publication number: CN101408893A
Application number: CNA2008102095246A
Authority: CN
Inventors: 刘远超; 刘铭; 王晓龙
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2008-11-26
Filing date: 2008-11-26
Publication date: 2009-04-15

Abstract

The invention provides a fast document clustering method, implemented by the following steps: 1. a group of keywords is extracted from each document by word frequency statistics; 2. each document is represented as the set of index values of the dimensions that its keywords occupy in the feature space; 3. each neuron in a self-organizing map model is represented as a vector in the feature space; 4. the documents are input in sequence, and the similarity between each document and all neurons is calculated; 5. the neuron with the largest accumulated value is the winner, and the winner and its neighboring neurons adjust their weights toward the current document; 6. while the individual dimensions on which the neuron matches the input document are adjusted, the weights of the other dimensions are weakened; 7. after all documents have been input, the method ends. The invention uses a self-organizing map clustering model and reworks the document quantization and similarity calculation stages, so that computational efficiency is greatly improved for the same number of documents while clustering quality is maintained.

Description

A Fast Document Clustering Method

(1) Technical field

The invention relates to document clustering technology, and in particular to a fast document clustering method.

(2) Background technology

With the growing popularity of the Internet and the notable progress of information infrastructure, people routinely face an enormous number of natural language documents. The pressing problem is how to organize, condense and fuse the rich information and knowledge they contain quickly and effectively, so as to improve people's ability to grasp this mass of information and raise their level of understanding. In particular, research topics that have drawn wide attention in recent years, such as automatic organization of users' personal documents, large-scale monitoring of online public opinion, topic detection and tracking, tracking of online opinion trends, and automatic classification of large numbers of forum documents, all depend on fast, high-quality text clustering.

Clustering algorithms are generally considered to have high complexity, and representing natural language documents with the well-known Salton vector space model easily leads to the curse of dimensionality. As a result, the high computational cost incurred when the number of documents becomes large is widely recognized as one of the important problems that text clustering must solve in practical applications.

(3) Contents of the invention

The invention provides a fast document clustering method that overcomes the very low efficiency of existing clustering methods caused by high-dimensional feature quantization and frequent similarity calculation.

The object of the invention is achieved by the following steps: 1. a group of keywords (for example, 10) is extracted from each document by word frequency statistics to represent the main content of the document; 2. in a single scan, the keywords extracted from all documents are used to construct the feature vector space, and each document is represented as the set of index values of the dimensions that its keywords occupy in the feature space; 3. each neuron in the self-organizing map model is represented as a vector in the feature space; 4. the documents are input in sequence, and the similarity between each document and all neurons is calculated; 5. the neuron with the largest accumulated value is the winning neuron, and it and the neurons in its neighborhood adjust their weights toward the current document; 6. while the individual dimensions on which the neuron matches the input document are adjusted, the weights of the other dimensions are weakened, to prevent documents on other topics from being wrongly mapped to this neuron; 7. after all documents have been input, the method ends.

The invention also has the following technical feature:

1. The similarity is calculated as the accumulated value of the neuron node's weights on the dimensions indexed by the document's keywords.

The invention addresses the low efficiency of current text clustering caused by high dimensionality and frequent similarity calculation. Using a self-organizing map clustering model, it reworks the document quantization and similarity calculation stages, so that computational efficiency is greatly improved for the same number of documents while clustering quality is maintained.

Steps of the method: unlike the traditional practice of representing a document as a vector in a high-dimensional space, the method first extracts a number of keywords (for example, 10) from each document; the keywords are the important content words obtained from high-frequency word statistics on the text. A single scan then both constructs the feature space dynamically from the keywords of all documents and represents each document directly as the indexes of its keywords in the vector space. The neuron nodes of the self-organizing map model are represented as vectors in this space. Although the neuron nodes remain high-dimensional vectors, each of the many documents holds only a few keyword indexes (for example, 10) rather than a vector with the same high dimensionality as the neuron nodes (for example, several thousand dimensions), as is done traditionally. The similarity calculation between documents and neuron nodes, which is performed frequently during clustering, is therefore simplified.

The similarity between a document and a neuron node is the accumulated value of the node's weights on the dimensions indexed by the document's keywords. The neuron with the largest accumulated value is the winning neuron, and it and the neurons in its neighborhood get the chance to adjust their weights toward the current document. To prevent documents from being mapped to the wrong neuron, the other dimensions are suppressed: while the individual dimensions on which the neuron matches the input document are adjusted, the weights of the other dimensions are weakened, so that documents on other topics are not wrongly mapped to this neuron.

The essence of the method is that it avoids the large amount of redundant computation in traditional methods without compressing the features, so clustering efficiency is greatly improved without affecting clustering quality.

Potential users of the invention include: 1. government departments that need to dynamically analyze and monitor large-scale text information streams on the network; 2. the many enterprise users, library and information institutions, research institutes and similar organizations engaged in applications of and research on document information retrieval and information management; 3. the large number of individual users who need to organize and manage documents and to retrieve and browse them quickly (for example, clustering personal e-mail and various natural language documents).

The method represents a document as a set of keyword indexes whose number is far smaller than the dimensionality of the feature space, which is generally several thousand. Neurons are still represented in the traditional way. Because the self-organizing map model frequently computes the similarity between documents and neuron nodes, and the number of documents is generally far larger than the number of neuron nodes (which can usually be set to the number of document clusters to be generated), the computation saved is considerable. Note that the method does not compress the features; the features it uses are exactly the same as in the traditional method. Its distinguishing characteristic is that improvements to the feature quantization and similarity calculation stages eliminate a large amount of redundant computation, so that clustering quality is maintained while efficiency is greatly improved. Clustering quality can be measured with the clustering F value.
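To make the claimed saving concrete, the following rough operation count compares one pass of similarity calculations under the two representations. The collection size, neuron count and dimensionality are hypothetical values chosen only for illustration; the sketch ignores weight updates and any sparse-vector optimizations a traditional implementation might use.

```python
def dense_ops(num_docs, num_neurons, dims):
    """Weight accesses when every document is a full vector of `dims` entries."""
    return num_docs * num_neurons * dims


def indexed_ops(num_docs, num_neurons, keywords_per_doc):
    """Weight accesses when every document is a short list of keyword indexes."""
    return num_docs * num_neurons * keywords_per_doc


# Hypothetical sizes: 100,000 documents, 20 neurons,
# a 5,000-dimensional feature space, 10 keywords per document.
num_docs, num_neurons, dims, keywords_per_doc = 100_000, 20, 5_000, 10

dense = dense_ops(num_docs, num_neurons, dims)
indexed = indexed_ops(num_docs, num_neurons, keywords_per_doc)
ratio = dense // indexed  # equals dims // keywords_per_doc

print(f"dense: {dense:,}  indexed: {indexed:,}  ratio: {ratio}x")
```

Under these assumptions the ratio is simply the feature-space dimensionality divided by the number of keywords per document, which is why the saving grows with the size of the vocabulary.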

Calculation of the clustering F value: the clustering F value is used to evaluate the overall quality of document clustering. For a cluster r generated by clustering and an original predefined category s, recall and precision are defined as:

recall(r, s) = n(r, s) / n_s    (1)

precision(r, s) = n(r, s) / n_r    (2)

where n(r, s) is the number of documents common to cluster r and predefined category s, n_r is the number of documents in cluster r, and n_s is the number of documents in predefined category s. F(r, s) is defined as

F(r, s) = (2 * recall(r, s) * precision(r, s)) / (precision(r, s) + recall(r, s))    (3)

The overall evaluation function of the clustering result is then

$F = \sum_{i} \frac{n_i}{n} \max_{j} \{F(i, j)\} \quad (4)$

Here n is the number of input documents to be clustered, n_i is the number of documents in predefined category i, and the maximum is taken over the generated clusters j.
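The evaluation measure in equations (1) to (4) can be sketched as follows. The function name and the label-list input format are assumptions made for this illustration; the patent does not prescribe an implementation.

```python
from collections import Counter


def clustering_f_value(cluster_labels, class_labels):
    """Overall clustering F value, following equations (1)-(4).

    cluster_labels[k] is the cluster assigned to document k,
    class_labels[k] is its predefined category.
    """
    n = len(class_labels)
    n_r = Counter(cluster_labels)                      # documents per cluster
    n_s = Counter(class_labels)                        # documents per category
    n_rs = Counter(zip(cluster_labels, class_labels))  # common documents

    total = 0.0
    for s, size_s in n_s.items():
        best = 0.0
        for r, size_r in n_r.items():
            common = n_rs.get((r, s), 0)
            if common == 0:
                continue  # F is zero when nothing is shared
            recall = common / size_s        # equation (1)
            precision = common / size_r     # equation (2)
            f = 2 * recall * precision / (precision + recall)  # equation (3)
            best = max(best, f)
        total += size_s / n * best          # equation (4)
    return total


perfect = clustering_f_value([0, 0, 1, 1], ["a", "a", "b", "b"])
merged = clustering_f_value([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A clustering that reproduces the predefined categories exactly scores 1.0, while collapsing two equal-sized categories into a single cluster lowers the score because precision drops to one half.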

(4) Description of drawings

Fig. 1 is a schematic diagram of the overall principle of the method of the invention;

Fig. 2 is a schematic diagram of the keyword extraction principle of the method of the invention;

Fig. 3 is a schematic diagram of the similarity calculation principle of the method of the invention.

(5) Specific implementation methods

The invention is further described below with reference to Figs. 1 to 3 and a specific embodiment:

In the invention, a document is represented as a set of representative words rather than, as is widely done, as a vector in the same high-dimensional space as the model nodes. In large-scale text clustering this greatly reduces the memory needed for the documents' feature representation. Under this scheme two issues must be handled properly: one is the construction of the vector space in which the model's node vectors lie; the other is how to compute similarity effectively given that documents and node vectors differ in representation and dimensionality.

For constructing the vector space, two approaches are possible. The first applies when the documents to be clustered are open-domain: the vector space is generated dynamically from the samples actually being clustered, because even a large document collection generally cannot cover most of the vocabulary. If instead the whole vocabulary (for example, a Chinese lexicon) were used to construct the vector space directly, its dimensionality would be very large, which increases the computational cost. Constructing the vector space from feature words extracted from the sample documents greatly reduces the number of sparse elements in the model nodes and lowers the redundancy of the vector representation. Under this scheme only the model nodes need to be represented as high-dimensional vectors, while documents are represented as sets of a few representative words. Because the number of documents is generally far larger than the number of nodes, and the operation performed most frequently in the self-organizing map model is the similarity calculation between documents and nodes, the computational cost is greatly reduced. A further benefit of this representation is that the vector space itself is constructed more efficiently.

The invention takes the following steps (see Fig. 1 for the overall schematic):

1) Segment each document into words, filter out stop words, and compute word frequency statistics; finally, retain a number of high-frequency words as the document's keyword set (see Fig. 2);
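Step 1 can be sketched as below. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the stop-word list is a small hypothetical one.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def extract_keywords(text, top_k=10, stop_words=STOP_WORDS):
    """Return up to top_k most frequent non-stop-word tokens of a document."""
    # Whitespace splitting is an assumed stand-in for word segmentation.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter(tokens)
    # most_common sorts by frequency; ties keep first-seen order.
    return [word for word, _ in counts.most_common(top_k)]


keywords = extract_keywords("the cat and the cat and a dog", top_k=2)
```

The default of 10 keywords follows the example figure given in the description; the cutoff is a tunable parameter rather than a fixed part of the method.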

2) After the keyword set of every document has been obtained, first initialize a zero-length vector space, then repeat the following operation: for each keyword w read from a document, look the word up in the vector space (which can be a vector, vec). If it is found, record the position i of w in vec, i.e. vec[i]; if it is not found, append an element to the end of vec. In either case, the position is recorded in the index set of the document currently being processed. In this way a single scan builds both the vector space and the keyword index of every document. Comparable work first constructs the space and only then represents each document as a vector in the high-dimensional space.
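The single-scan construction in step 2 might look like the following sketch. One deliberate deviation is labeled in the code: a dictionary is used to find a keyword's position instead of searching the vector linearly, which is an implementation choice of this example and not something the patent specifies.

```python
def build_feature_space(doc_keywords):
    """Build the feature space and per-document keyword indexes in one scan.

    doc_keywords: list of keyword lists, one per document.
    Returns (vocab, doc_indexes) where vocab[i] is the keyword of dimension i
    and doc_indexes[k] lists the dimensions occupied by document k.
    """
    vocab = []        # plays the role of `vec`: position -> keyword
    position = {}     # keyword -> position (assumed dict lookup, not a linear search)
    doc_indexes = []

    for keywords in doc_keywords:
        indexes = []
        for w in keywords:
            if w not in position:
                # Unseen keyword: append a new dimension at the end.
                position[w] = len(vocab)
                vocab.append(w)
            indexes.append(position[w])
        doc_indexes.append(indexes)
    return vocab, doc_indexes


vocab, doc_indexes = build_feature_space([["cat", "dog"], ["dog", "fish"]])
```

In this small example the second document reuses the existing dimension for "dog" and adds one new dimension for "fish", so the space ends up with three dimensions.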

3) Initialize the neurons of the self-organizing map network as vectors in the constructed feature space;

4) Input a document and calculate its similarity to every neuron. For a document d and any node n, it is only necessary to sum the node's weights on certain dimensions, for example:

vec[0001]+vec[0008]+vec[0009]+vec[0023];
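The summation in step 4 and the selection of the winning node can be sketched as follows. The function names and the list-of-lists layout for node weights are assumptions of this example.

```python
def similarity(node_weights, doc_index):
    """Sum the node's weights on the dimensions indexed by the document."""
    return sum(node_weights[i] for i in doc_index)


def find_winner(nodes, doc_index):
    """Return the position of the node with the largest accumulated value."""
    scores = [similarity(weights, doc_index) for weights in nodes]
    return max(range(len(nodes)), key=scores.__getitem__)


# Two nodes over a 4-dimensional feature space; the document holds
# keyword indexes 1 and 3.
nodes = [
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.5, 0.0, 0.4],
]
doc_index = [1, 3]
winner = find_winner(nodes, doc_index)
```

Each similarity touches only as many weights as the document has keywords, regardless of how many dimensions the node vector has.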

5) Assuming that node N has the greatest similarity to document d, each dimension of node N that corresponds to a keyword of document d is adjusted by an increment. This increment is an empirical value whose specific setting can be obtained through extensive practice. (Following the principle of the self-organizing map model, it can take a larger value at the start, such as 0.03, decrease afterwards, and end at 0.01.) Through this processing, documents on the same topic as document d become more similar to node N, and the components of node N's vector that relate to this topic are strengthened.

6) While the dimensions that match the document (at most 10) are increased by the adjustment increment, the other dimensions are reduced by a number (alternatively, all other dimensions are cleared to zero the first time each node adjusts its weights) (see Fig. 3).
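Steps 5 and 6 can be sketched as a single update routine. Several details here are assumptions made for illustration: the decrement value is hypothetical (the source's example value is not available), weights are clamped at zero, and the neighborhood is modeled as a simple one-dimensional range around the winner.

```python
def update_node(weights, doc_index, increment=0.03, decrement=0.001):
    """Strengthen matched dimensions and weaken all others, in place."""
    matched = set(doc_index)
    for i in range(len(weights)):
        if i in matched:
            weights[i] += increment
        else:
            # Assumed clamp so that weakened weights never go negative.
            weights[i] = max(0.0, weights[i] - decrement)


def update_winner_and_neighbors(nodes, winner, doc_index, radius=1,
                                increment=0.03, decrement=0.001):
    """Apply the update to the winner and its neighbors.

    The neighborhood is an assumed 1-D range of `radius` nodes on each side.
    """
    first = max(0, winner - radius)
    last = min(len(nodes) - 1, winner + radius)
    for n in range(first, last + 1):
        update_node(nodes[n], doc_index, increment, decrement)


weights = [0.5, 0.5, 0.0]
update_node(weights, [0], increment=0.03, decrement=0.01)

grid = [[0.0, 0.0] for _ in range(3)]
update_winner_and_neighbors(grid, winner=0, doc_index=[1], radius=1)
```

The increment would normally be lowered over the course of training, for instance from 0.03 toward 0.01 as the description suggests; the schedule itself is left to the caller in this sketch.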

7) After all documents have been input, the procedure ends.

Application process of the invention: a user submits search terms to a search engine, which retrieves and returns the matching results. The returned documents serve as input to the clustering method of the invention; fast clustering groups the returned results into categories, which improves their visual presentation and thus greatly improves retrieval efficiency.

Claims

1. A method for rapidly clustering documents, characterized in that it is implemented by the following steps: 1) a group of keywords is extracted from each document by word frequency statistics to represent the main content of the document; 2) in a single scan, the keywords extracted from all documents are used to construct the feature vector space, and each document is represented as the set of index values of the dimensions that its keywords occupy in the feature space; 3) each neuron in the self-organizing map model is represented as a vector in the feature space; 4) the documents are input in sequence, and the similarity between each document and all neurons is calculated; 5) the neuron with the largest accumulated value is the winning neuron, and it and the neurons in its neighborhood adjust their weights toward the current document; 6) while the individual dimensions on which the neuron matches the input document are adjusted, the weights of the other dimensions are weakened, to prevent documents on other topics from being wrongly mapped to this neuron; 7) after all documents have been input, the method ends.

2. The method for rapidly clustering documents according to claim 1, characterized in that the similarity is calculated as the accumulated value of the neuron node's weights on the dimensions indexed by the document's keywords.