CN101408893A - Method for rapidly clustering documents - Google Patents

Info

Publication number
CN101408893A
Authority
CN
China
Prior art keywords
document
neuron
documents
keyword
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102095246A
Other languages
Chinese (zh)
Inventor
刘远超
刘铭
王晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CNA2008102095246A
Publication of CN101408893A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fast document clustering method comprising the following steps: (1) a group of keywords is extracted from each document by word-frequency statistics; (2) each document is represented as the set of index values of the feature-space dimensions corresponding to the keywords it contains; (3) each neuron in a self-organizing map model is represented as a vector in the feature space; (4) the documents are input in sequence, and the similarity between each document and every neuron is calculated; (5) the neuron with the largest accumulated value is the winner, and the winner and its neighboring neurons adjust their weights toward the current document; (6) while the dimensions of a neuron that match the input document are adjusted, the weights of the other dimensions are weakened; (7) the method ends once all documents have been input. By building on the self-organizing map clustering model and reworking document quantization and similarity calculation, the invention greatly improves computational efficiency for the same number of documents while maintaining clustering quality.

Description

A method for rapidly clustering documents
(1) Technical field
The present invention relates to document clustering technology, and in particular to a method for rapidly clustering documents.
(2) Background art
With the growing popularity of networks and the marked progress of informatization, people frequently face an enormous number of natural-language documents. A prominent problem is how to organize, condense, and fuse the rich information and knowledge they contain quickly and effectively, so as to improve people's ability to grasp this mass of information and raise their level of understanding. In particular, research topics that have attracted wide attention in recent years, such as automatic organization of personal documents, monitoring of large-scale online public opinion, topic detection and tracking, tracking of online public-opinion trends, and automatic classification of large volumes of forum documents, all depend on fast, high-quality text clustering.
Clustering algorithms are generally considered to have high complexity, and representing natural-language documents with the well-known Salton vector space model easily leads to the curse of dimensionality. The resulting high computational cost is widely recognized as one of the major problems that text clustering must solve in practical applications once the number of documents becomes large.
(3) Summary of the invention
The invention provides a method for rapidly clustering documents that overcomes the low efficiency of existing clustering methods caused by high-dimensional feature quantization and frequent similarity calculations.
The object of the invention is achieved through the following steps: 1. use word-frequency statistics to extract a group of keywords (for example, 10) from each document to represent its main content; 2. in a single pass, construct the feature vector space from the extracted keywords of all documents, and represent each document as the set of index values of the dimensions corresponding to its keywords in the feature space; 3. represent each neuron in the self-organizing map model as a vector in the feature space; 4. input the documents in sequence and calculate the similarity between each document and every neuron; 5. take the neuron with the largest accumulated value as the winning neuron, and let it and the neurons in its neighborhood adjust their weights toward the current document; 6. while adjusting the dimensions of a neuron that match the input document, weaken the weights of the other dimensions, to prevent documents on other topics from being wrongly mapped to this neuron; 7. end once all documents have been input.
The invention also has the following technical feature:
1. The similarity is calculated as the accumulated value of the neuron node's weights on the dimensions indexed by the document's keywords.
Current text clustering is inefficient because of high dimensionality and frequent similarity calculations. To address this, the invention uses the self-organizing map clustering model and reworks document quantization and similarity calculation, so that computational efficiency improves significantly for the same number of documents while clustering quality is maintained.
Steps of the method: unlike the traditional practice of representing a document as a vector in a high-dimensional space, this method first extracts several keywords (for example, 10) from each document; the keywords are the important content words obtained from high-frequency word statistics on the text. A single pass then both dynamically constructs the feature space from the keywords of all documents and represents each document directly as the indices of its keywords in the vector space. Each neuron node in the self-organizing map model is represented as a vector in this space. Although a neuron node is still a high-dimensional vector, each of the many documents holds only the indices of a few keywords (for example, 10) rather than being represented, as is traditional, as a vector of the same high dimension as the neuron nodes (for example, several thousand dimensions). The similarity calculations between documents and neuron nodes, which are performed frequently during clustering, are therefore simplified.
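The keyword extraction and single-pass index construction described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the toy documents, and the stop-word set are hypothetical, word segmentation is assumed to have been done already (here by whitespace splitting), and a dictionary replaces the linear search through a vector that the text describes.

```python
from collections import Counter

def extract_keywords(tokens, stop_words, k=10):
    """Return the k most frequent tokens of one document, ignoring stop words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(k)]

def build_index(keyword_sets):
    """Single pass: grow the feature space and record each document's indices."""
    position = {}      # keyword -> dimension index in the feature space
    doc_indices = []   # one list of dimension indices per document
    for keywords in keyword_sets:
        indices = []
        for w in keywords:
            if w not in position:
                # Unseen keyword: append a new dimension at the end.
                position[w] = len(position)
            indices.append(position[w])
        doc_indices.append(indices)
    return position, doc_indices

# Hypothetical toy input; k=2 keeps the example small.
docs = [
    "the cat sat on the mat with the cat".split(),
    "the dog chased the cat and the cat ran".split(),
]
stop = {"the", "on", "with", "and"}
keyword_sets = [extract_keywords(d, stop, k=2) for d in docs]
position, doc_indices = build_index(keyword_sets)
```

Each document ends up as a short list of integers, while the feature space grows only as new keywords appear.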
The similarity between a document and a neuron node is the accumulated value of the neuron node's weights on the dimensions indexed by the document's keywords. The neuron with the largest accumulated value is the winning neuron, and it and the neurons in its neighborhood get the chance to adjust their weights toward the current document. To prevent documents from being mapped to the wrong neuron, the other dimensions are suppressed: while the dimensions of a neuron that match the input document are adjusted, the weights of the other dimensions are weakened, which prevents documents on other topics from being wrongly mapped to this neuron.
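A minimal sketch of this similarity calculation and winner selection is shown below. The neuron weights and the document's index list are made-up values, and the function names are hypothetical; the point is only that each comparison touches as many weights as the document has keywords, not every dimension.

```python
def similarity(neuron, doc_indices):
    """Sum the neuron's weights on the dimensions indexed by the document."""
    return sum(neuron[i] for i in doc_indices)

def find_winner(neurons, doc_indices):
    """Return the position of the neuron with the largest accumulated value."""
    scores = [similarity(n, doc_indices) for n in neurons]
    return max(range(len(neurons)), key=scores.__getitem__)

# Two hypothetical neurons over a five-dimensional feature space.
neurons = [
    [0.1, 0.5, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.4, 0.2, 0.0],
]
doc = [0, 2]                      # the document's keyword indices
winner = find_winner(neurons, doc)
```

The accumulated value equals the dot product of the neuron with a 0/1 vector marking the document's keywords, so the sparse form gives the same ranking as that dense computation.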
The essence of the method is that it avoids the large amount of redundant computation in traditional methods without compressing the features, so it can significantly improve clustering efficiency without affecting clustering quality.
Potential users of the invention include: 1. government departments that need to dynamically analyze and monitor large-scale text streams on the network; 2. the many enterprise users, library and information institutions, scientific research institutions, and other organizations engaged in the application and study of document information retrieval and communication management; 3. individual users who need to organize and manage large numbers of files and to retrieve and browse them quickly (for example, clustering personal e-mail and other natural-language documents).
The method represents a document as a set of keyword indices whose number is far smaller than the dimensionality of the feature space, which is typically several thousand. Neurons are still represented in the traditional way. Because similarity calculations between documents and neuron nodes are performed frequently in the self-organizing map model, and the number of documents is generally far larger than the number of neuron nodes (which can typically be set to the number of document clusters to be generated), the computational savings are considerable. Note that the method does not compress features: the features it uses are identical to those of the traditional method. Its distinguishing characteristic is that, by improving feature quantization and similarity calculation, it eliminates a large amount of redundant computation, thereby maintaining clustering quality while significantly improving efficiency. Clustering quality can be measured by the clustering F value.
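As a rough illustration of these savings, consider the back-of-the-envelope comparison below. The symbols are assumptions of this sketch and do not appear in the original text: N documents, M neuron nodes, k keywords per document, and D feature-space dimensions. It assumes one pass over the documents and counts one arithmetic operation per dimension touched in each document-to-neuron comparison. The value k = 10 is the example given in the text, while D = 3000 is only an illustrative stand-in for "several thousand".

```latex
\begin{align*}
C_{\text{dense}}  &\approx N \cdot M \cdot D
  && \text{(document as a full $D$-dimensional vector)} \\
C_{\text{sparse}} &\approx N \cdot M \cdot k
  && \text{(document as $k$ keyword indices)} \\
\frac{C_{\text{dense}}}{C_{\text{sparse}}} &= \frac{D}{k} = \frac{3000}{10} = 300
\end{align*}
```

Under these assumptions the similarity step does about 300 times less arithmetic, independent of N and M.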
Computing the clustering F value: the overall quality of document clustering is evaluated with the clustering F value. For a cluster $r$ produced by clustering and an original predefined class $s$, recall and precision are defined as:
$$\mathrm{recall}(r,s) = \frac{n(r,s)}{n_s} \tag{1}$$
$$\mathrm{precision}(r,s) = \frac{n(r,s)}{n_r} \tag{2}$$
where $n(r,s)$ is the number of documents common to cluster $r$ and predefined class $s$, $n_r$ is the number of documents in cluster $r$, and $n_s$ is the number of documents in predefined class $s$. $F(r,s)$ is defined as
$$F(r,s) = \frac{2 \cdot \mathrm{recall}(r,s) \cdot \mathrm{precision}(r,s)}{\mathrm{precision}(r,s) + \mathrm{recall}(r,s)} \tag{3}$$
The overall evaluation function of the clustering result is then
$$F = \sum_{i} \frac{n_i}{n} \max_{j} \{F(i,j)\} \tag{4}$$
Here $n$ is the number of input documents, and $n_i$ is the number of documents in predefined class $i$.
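The clustering F value can be computed along the following lines. This is a sketch under stated assumptions: the function name and the toy labels are hypothetical, and formula (4) is read as summing over predefined classes i and taking the maximum over clusters j, which is how the definition of n_i suggests it should be read.

```python
from collections import Counter

def clustering_f(true_labels, cluster_labels):
    """Overall clustering F value following formulas (1) to (4)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)            # n_s for each predefined class
    cluster_sizes = Counter(cluster_labels)       # n_r for each cluster
    overlap = Counter(zip(true_labels, cluster_labels))   # n(r, s)
    total = 0.0
    for s, n_s in class_sizes.items():
        best = 0.0
        for r, n_r in cluster_sizes.items():
            n_rs = overlap[(s, r)]
            if n_rs == 0:
                continue                          # F(r, s) is 0 with no overlap
            recall = n_rs / n_s
            precision = n_rs / n_r
            f = 2 * recall * precision / (precision + recall)
            best = max(best, f)
        total += (n_s / n) * best                 # weight by class size
    return total

# Hypothetical example: five documents, two predefined classes, two clusters.
true_labels = ["a", "a", "a", "b", "b"]
cluster_labels = [0, 0, 1, 1, 1]
score = clustering_f(true_labels, cluster_labels)
```

A clustering that reproduces the predefined classes exactly scores 1.0, and the score falls as clusters mix classes or split them.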
(4) Description of the drawings
Fig. 1 is an overall schematic diagram of the method;
Fig. 2 is a schematic diagram of keyword extraction in the method;
Fig. 3 is a schematic diagram of similarity calculation in the method.
(5) Embodiment
The invention is further described below with reference to Fig. 1 to Fig. 3 and a specific embodiment:
In the invention a document is represented as a set of a few representative words rather than, as is widely done, as a vector in the same high-dimensional space as the model nodes, so the memory needed to represent document features is greatly reduced in large-scale text clustering. Under this scheme, two problems must be handled properly: first, how to construct the vector space in which the model's node vectors live; second, how to compute similarity effectively given that documents and node vectors differ in representation and dimensionality.
There are two ways to construct the vector space. The first, for clustering open-domain documents, generates the vector space dynamically from the samples to be clustered, because even a large document collection generally does not cover most of the vocabulary. The second constructs the vector space directly from an entire vocabulary (such as the Chinese lexicon); its dimensionality would be very large, which increases computational cost. Constructing the vector space from feature words extracted from the sample documents greatly reduces the number of sparse elements in the model nodes and reduces the redundancy of the vector representation. Under this scheme, only the model nodes need to be represented as high-dimensional vectors, while each document is represented as a set of a few representative words. Because the number of documents is generally far larger than the number of nodes, and the operation performed most often in the self-organizing map model is the similarity calculation between documents and nodes, the computational cost is greatly reduced. A further benefit of this representation is that the vector space itself is constructed more efficiently.
The invention takes the following steps (see Fig. 1 for the overall schematic):
1) Segment each document into words, filter out stop words, and compute word frequencies; finally keep several high-frequency words as the document's keyword set (see Fig. 2);
2) After the keyword set of every document has been obtained, first initialize a vector space of length zero, then repeat the following operation: for each keyword w read from a document, look the word up in the vector space (which can be a vector vec). If it is found, record its position i in vec; if it is not found, append it as a new element at the end of vec. In either case, record the position in the vector of the document currently being processed. In this way a single pass constructs the vector space and each document's keyword indices at the same time. Comparable approaches instead construct the space first and then represent each document as a vector in the high-dimensional space.
3) Initialize each neuron in the self-organizing map network as a vector in the constructed feature space;
4) Input a document and calculate the similarity between it and every neuron. For document d and any node n, it suffices to add up the node's weights on the dimensions indexed by the document's keywords, for example:
vec[0001]+vec[0008]+vec[0009]+vec[0023];
5) Suppose node N has the greatest similarity to document d; then each dimension of node N that corresponds to document d is adjusted accordingly, for example:
Figure A20081020952400061
Figure A20081020952400063
Figure A20081020952400064
Here
Figure A20081020952400065
is an empirical value whose concrete value can be obtained through extensive practice. (Following the principle of the self-organizing map model, it can start at a larger value, such as 0.03, and later be reduced to a final 0.01.) This processing makes documents on the same topic as document d more similar to node N and strengthens the components of node N's vector that belong to that topic.
6) While the dimensions matching the document (at most 10) are increased by
Figure A20081020952400066
the other dimensions are each reduced by a small amount (alternatively, the first time each node's weights are adjusted, all other dimensions can be set to zero) (see Fig. 3).
7) The method ends after all documents have been input.
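Steps 3) to 7) can be sketched as a single training pass. Several details below are assumptions of this sketch, not taken from the patent, because the update formulas are not reproduced in the text: matched dimensions are increased by adding the learning rate, unmatched dimensions are reduced by a fixed `penalty` and clamped at zero, the learning rate decays linearly from 0.03 to 0.01 (the two example values the text gives), neurons are initialized with small random weights, and the neighborhood is a window of `radius` positions on a one-dimensional line of neurons. All names and default values are hypothetical.

```python
import random

def train(doc_indices, dim, n_neurons, alpha_start=0.03, alpha_end=0.01,
          penalty=0.01, radius=1, init_scale=0.01, seed=0):
    """One pass over the documents; returns the neurons and each document's winner."""
    rng = random.Random(seed)
    # Step 3: neurons are dense vectors in the feature space (random init assumed).
    neurons = [[rng.random() * init_scale for _ in range(dim)]
               for _ in range(n_neurons)]
    assignments = []
    n_docs = len(doc_indices)
    for t, doc in enumerate(doc_indices):
        # Assumed schedule: linear decay from alpha_start to alpha_end.
        frac = t / max(n_docs - 1, 1)
        alpha = alpha_start + (alpha_end - alpha_start) * frac
        # Step 4: similarity is the sum of weights on the document's indices.
        scores = [sum(w[i] for i in doc) for w in neurons]
        # Step 5: the neuron with the largest accumulated value wins.
        winner = max(range(n_neurons), key=scores.__getitem__)
        assignments.append(winner)
        matched = set(doc)
        lo = max(0, winner - radius)
        hi = min(n_neurons - 1, winner + radius)
        for j in range(lo, hi + 1):          # winner and its neighbors
            w = neurons[j]
            for i in range(dim):
                if i in matched:
                    w[i] += alpha            # strengthen matched dimensions
                else:
                    # Step 6: weaken the other dimensions (clamp is an assumption).
                    w[i] = max(0.0, w[i] - penalty)
    return neurons, assignments

# Hypothetical toy run: two topics over four dimensions, no neighborhood.
docs = [[0, 1], [2, 3], [0, 1], [2, 3]]
neurons, assignments = train(docs, dim=4, n_neurons=2, radius=0)
```

In this toy run the suppression in step 6 is what keeps the second topic from landing on the first topic's neuron: once a neuron wins a document, its weights on the unmatched dimensions drop to zero, so the other neuron scores higher for documents on the other topic.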
The invention is applied as follows: the user submits query terms to a search engine, and the documents that the search engine returns serve as input to the clustering method of the invention. Fast clustering groups the returned results into categories, which improves their visual presentation and thus significantly improves retrieval efficiency.

Claims (2)

1. A method for rapidly clustering documents, characterized in that it is implemented by the following steps: 1. use word-frequency statistics to extract a group of keywords from each document to represent its main content; 2. in a single pass, construct the feature vector space from the extracted keywords of all documents, and represent each document as the set of index values of the dimensions corresponding to its keywords in the feature space; 3. represent each neuron in the self-organizing map model as a vector in the feature space; 4. input the documents in sequence and calculate the similarity between each document and every neuron; 5. take the neuron with the largest accumulated value as the winning neuron, and let it and the neurons in its neighborhood adjust their weights toward the current document; 6. while adjusting the dimensions of a neuron that match the input document, weaken the weights of the other dimensions, to prevent documents on other topics from being wrongly mapped to this neuron; 7. end once all documents have been input.
2. The method for rapidly clustering documents according to claim 1, characterized in that the similarity is calculated as the accumulated value of the neuron node's weights on the dimensions indexed by the document's keywords.
CNA2008102095246A 2008-11-26 2008-11-26 Method for rapidly clustering documents Pending CN101408893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102095246A CN101408893A (en) 2008-11-26 2008-11-26 Method for rapidly clustering documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102095246A CN101408893A (en) 2008-11-26 2008-11-26 Method for rapidly clustering documents

Publications (1)

Publication Number Publication Date
CN101408893A true CN101408893A (en) 2009-04-15

Family

ID=40571905

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102095246A Pending CN101408893A (en) 2008-11-26 2008-11-26 Method for rapidly clustering documents

Country Status (1)

Country Link
CN (1) CN101408893A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012748A (en) * 2010-11-30 2011-04-13 哈尔滨工业大学 Statement-level Chinese and English mixed input method
CN102081598A (en) * 2011-01-27 2011-06-01 北京邮电大学 Method for detecting duplicated texts
CN101694668B (en) * 2009-09-29 2012-04-18 北京百度网讯科技有限公司 Method and device for confirming web structure similarity
CN102629272A (en) * 2012-03-14 2012-08-08 北京邮电大学 Clustering based optimization method for examination system database
CN104731811A (en) * 2013-12-20 2015-06-24 北京师范大学珠海分校 Cluster information evolution analysis method for large-scale dynamic short texts
CN108536753A (en) * 2018-03-13 2018-09-14 腾讯科技(深圳)有限公司 The determination method and relevant apparatus of duplicate message

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694668B (en) * 2009-09-29 2012-04-18 北京百度网讯科技有限公司 Method and device for confirming web structure similarity
CN102012748A (en) * 2010-11-30 2011-04-13 哈尔滨工业大学 Statement-level Chinese and English mixed input method
CN102081598A (en) * 2011-01-27 2011-06-01 北京邮电大学 Method for detecting duplicated texts
CN102081598B (en) * 2011-01-27 2012-07-04 北京邮电大学 Method for detecting duplicated texts
CN102629272A (en) * 2012-03-14 2012-08-08 北京邮电大学 Clustering based optimization method for examination system database
CN104731811A (en) * 2013-12-20 2015-06-24 北京师范大学珠海分校 Cluster information evolution analysis method for large-scale dynamic short texts
CN104731811B (en) * 2013-12-20 2018-10-09 北京师范大学珠海分校 A kind of clustering information evolution analysis method towards extensive dynamic short text
CN108536753A (en) * 2018-03-13 2018-09-14 腾讯科技(深圳)有限公司 The determination method and relevant apparatus of duplicate message

Similar Documents

Publication Publication Date Title
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
WO2020108430A1 (en) Weibo sentiment analysis method and system
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN107729919A (en) In-depth based on big data technology is complained and penetrates analysis method
CN103617290B (en) Chinese machine-reading system
CN105139237A (en) Information push method and apparatus
CN101408893A (en) Method for rapidly clustering documents
CN109558492A (en) A kind of listed company's knowledge mapping construction method and device suitable for event attribution
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN106126619A (en) A kind of video retrieval method based on video content and system
CN112000801A (en) Government affair text classification and hot spot problem mining method and system based on machine learning
WO2021217772A1 (en) Ai-based interview corpus classification method and apparatus, computer device and medium
CN113672718B (en) Dialogue intention recognition method and system based on feature matching and field self-adaption
CN106909946A (en) A kind of picking system of multi-modal fusion
CN110955776A (en) Construction method of government affair text classification model
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN109753602A (en) A kind of across social network user personal identification method and system based on machine learning
CN111782759B (en) Question-answering processing method and device and computer readable storage medium
CN110457562A (en) A kind of food safety affair classification method and device based on neural network model
CN110991218A (en) Network public opinion early warning system and method based on images
CN111104975B (en) Credit evaluation method based on breadth learning
CN109299266A (en) A kind of text classification and abstracting method for Chinese news emergency event
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN106611016B (en) A kind of image search method based on decomposable word packet model
CN104504406A (en) Rapid and high-efficiency near-duplicate image matching method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20090415