CN108399213A

CN108399213A - A clustering method and system for users' personal files

Info

Publication number: CN108399213A
Application number: CN201810112624.0A
Authority: CN
Inventors: 李鹏; 王斌; 齐保元; 周美林; 郭莉; 梅钰
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-02-05
Filing date: 2018-02-05
Publication date: 2018-08-14
Anticipated expiration: 2038-02-05
Also published as: CN108399213B

Abstract

The present invention provides a clustering method for users' personal files, comprising the steps of: grouping the user's files by exploiting the user's habits in saving similar files, to obtain multiple file groups; clustering the files within each file group to obtain one or more local clusters, the files in each local cluster having similar content; and treating each local cluster as a single file and clustering all local clusters to generate global clusters. The present invention also provides a clustering system for users' personal files, comprising a cluster calculation unit, a cluster result storage unit and a cluster result searching unit, wherein the cluster calculation unit comprises a batch file cluster calculation unit and an incremental file cluster calculation unit.

Description

A clustering method and system for users' personal files

Technical field

The present invention relates to the field of personal file management. It enables a user's files to be clustered by content, and relates in particular to file clustering based on the user's usage habits.

Background

With the spread of computers in office work, users create and edit large numbers of files (such as WORD, WPS and PDF documents). Existing file systems store these files in a tree structure, which differs markedly from the way people remember them: people tend to organize their memory of files by task (or topic). One consequence of this mismatch is that users must spend considerable effort when they later search for and manage files; the search becomes harder still when multiple versions of a user's files on the same topic are stored in different locations.

At present, research on personal information management (Personal Information Management) studies how to retrieve the information that users generate, and various companies have developed search systems for user files, such as Google Desktop Search and Baidu Desktop Search. However, these tools all perform keyword-based file retrieval and do not address converting a "tree-structured file organization" into a "topic-structured file organization".

Files can be organized by topic with content-based clustering: all files are fed directly into a clustering algorithm, and each resulting cluster (a set of files) is treated as one document topic. Processing the files directly in this way, however, ignores the user's habits in saving similar files (for example, files in the same directory are more likely to belong to the same cluster) and significantly increases the amount of unnecessary computation.

Summary of the invention

The present invention proposes a clustering method and system for users' personal files. The method clusters a user's personal files by exploiting the user's habits in saving and browsing files, and clusters efficiently.

To achieve the above objective, the present invention adopts the following technical solution:

A clustering method for users' personal files, comprising the steps of:

(1) File grouping: group the user's files by exploiting the user's habits in saving similar files, obtaining multiple file groups;

(2) Local cluster generation: cluster the files in each file group to obtain one or more local clusters, the files in each local cluster having similar content;

(3) Global cluster generation: treat each local cluster as a single file and cluster all local clusters to generate global clusters.
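The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: grouping is done by parent directory only, files are plain text represented as word sets, and `greedy_cluster` is a hypothetical stand-in for whatever clustering algorithm is actually used. All function names and the threshold value are illustrative.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def greedy_cluster(items: list, threshold: float) -> list:
    """Put each token set into the first cluster whose seed is similar enough.

    Hypothetical stand-in for any clustering algorithm; returns index lists.
    """
    clusters = []  # each entry: (seed token set, [member indices])
    for idx, tokens in enumerate(items):
        for seed, members in clusters:
            if jaccard(seed, tokens) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((tokens, [idx]))
    return [members for _, members in clusters]


def cluster_personal_files(files: dict, threshold: float = 0.5) -> list:
    """files maps a path to its text content; returns global clusters of paths."""
    # Step 1: file grouping, here simply by parent directory (an assumption).
    groups = defaultdict(list)
    for path in files:
        groups[str(PurePosixPath(path).parent)].append(path)

    # Step 2: local cluster generation inside each file group.
    local_clusters = []
    for paths in groups.values():
        token_sets = [set(files[p].lower().split()) for p in paths]
        for members in greedy_cluster(token_sets, threshold):
            local_clusters.append([paths[i] for i in members])

    # Step 3: global cluster generation; each local cluster is treated as one
    # file, represented here by the union of its members' tokens.
    reps = [set().union(*(set(files[p].lower().split()) for p in lc))
            for lc in local_clusters]
    return [sorted(p for i in members for p in local_clusters[i])
            for members in greedy_cluster(reps, threshold)]
```

In this sketch two drafts of a report saved in one directory first form a local cluster, which is then merged with a copy of the report stored elsewhere during the global pass.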

Further, the user's files are grouped according to the file directory tree structure, the file type, the file name, and the like.

Further, the method of grouping the user's files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node; set a distance threshold parameter δ_X; the direct files contained in each directory node satisfying X_i > δ_X form a separate file group.

Further, the method of grouping the user's files according to the file directory tree structure is: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δ_Y; the method comprises the steps of:

(1) based on the Y values of the directory nodes, select the directory node whose Y value is greater than δ_Y and closest to δ_Y;

(2) take the direct files contained in the selected directory node as a separate file group;

(3) delete the subtree rooted at the selected directory node from the directory tree;

(4) update the Y values of all directory nodes in the directory tree;

(5) repeat steps (1)-(4) until no directory node has a Y value greater than δ_Y;

(6) take all files contained in the remaining directory nodes as a separate file group.

Further, a direct file is a file located under a directory node whose depth in the directory tree differs from the depth of that directory node by 1.

Further, the method of grouping the user's files according to file type and file name is: if the clustering goal requires the files in a cluster to be of the same type, the files may first be grouped by file type and then clustered by file name within each group, each resulting cluster serving as a file group; if the clustering goal does not require the files in a cluster to be of the same type, the files may be clustered directly by file name, each resulting cluster serving as a file group.

Further, the clustering method based on file names is: compute the edit distance between file name strings, and place files whose edit distance is less than a set threshold in one file group.

Further, the clustering method based on file names is: segment each file name into words, represent the file name as a term vector, compute the text distance between file names, and place files whose text distance is greater than a set threshold in one file group, the text distance being the Jaccard distance or the cosine distance.

Further, clustering files comprises first computing a file representation, then computing file similarity, and finally running a clustering algorithm.

Further, in the global cluster generation stage, each local cluster is treated as a single file with a unique file representation: either the representation of a single file in the local cluster is chosen (for example, the representation of the longest file in the local cluster is used as the representation of the local cluster), or the representation is computed from those of all files in the local cluster (for example, the representations of all files in the local cluster are concatenated to form the representation of the local cluster).

Further, the file representation method is: vectorize the file, each feature dimension of the vector corresponding to a part of the file, wherein text files may be represented with the vector space model and image files may use image features or a neural-network-based feature extraction method.

Further, the file representation method is: compute a hash value of the file using a fuzzy hashing method, and represent the file by the hash value.

Further, the file similarity calculation method is: compute the Jaccard distance or the cosine distance between files.

Further, the file similarity calculation method is: compare the hash values of two files and convert the hash values into a similarity.

Further, the clustering algorithm includes the k-means clustering algorithm and density-based clustering algorithms.

A clustering system for users' personal files, comprising:

a cluster calculation unit for clustering the user's files;

a cluster result storage unit for storing the clustering result, that is, the mapping between files and clusters;

a cluster result searching unit for returning the files that belong to a given cluster and the cluster to which a given file belongs.
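The storage and searching units described above amount to a two-way mapping between files and clusters. A minimal in-memory sketch, assuming string file paths and cluster identifiers; the class name `ClusterIndex` and its methods are hypothetical and not taken from the patent:

```python
from collections import defaultdict


class ClusterIndex:
    """Keeps the file <-> cluster mapping and answers lookups in both directions."""

    def __init__(self):
        self._cluster_of = {}               # file path -> cluster id
        self._files_in = defaultdict(set)   # cluster id -> set of file paths

    def store(self, cluster_id, paths):
        """Record that the given files belong to cluster_id (moving them if needed)."""
        for path in paths:
            old = self._cluster_of.get(path)
            if old is not None:
                self._files_in[old].discard(path)
            self._cluster_of[path] = cluster_id
            self._files_in[cluster_id].add(path)

    def files_in(self, cluster_id):
        """Return the files of a cluster (lookup by cluster)."""
        return sorted(self._files_in.get(cluster_id, ()))

    def cluster_of(self, path):
        """Return the cluster of a file, or None if unknown (lookup by file)."""
        return self._cluster_of.get(path)
```

A persistent implementation would more likely back the same two lookups with a database table; the dictionary pair is used here only to keep the sketch self-contained.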

Further, the cluster calculation unit comprises:

a batch file cluster calculation unit for performing the initial clustering of the user's personal files;

an incremental file cluster calculation unit for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.

Further, merging the incremental clustering result with the existing clustering result means applying the global cluster generation process once to the clusters obtained from the increment together with the existing clusters, to obtain the merged result.
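The merge step can be pictured as one more global pass in which existing clusters and newly obtained clusters are all treated as single files. The sketch below assumes each file is described by a token set, each cluster is represented by the union of its members' tokens, and a simple greedy threshold rule stands in for the clustering algorithm; `merge_clusters`, `features` and the threshold are illustrative names and values.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def merge_clusters(existing, incremental, features, threshold=0.5):
    """Run one global clustering pass over existing and incremental clusters.

    existing, incremental: lists of clusters, each a list of file names.
    features: maps a file name to its token set (an assumed representation).
    """
    merged = []  # each entry: [representation token set, member file names]
    for cluster in existing + incremental:
        # Treat the whole cluster as one file: union of its members' tokens.
        rep = set().union(*(features[name] for name in cluster))
        for entry in merged:
            if jaccard(entry[0], rep) >= threshold:
                entry[0] |= rep
                entry[1].extend(cluster)
                break
        else:
            merged.append([rep, list(cluster)])
    return [members for _, members in merged]
```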

The input of the method of the present invention is a tree-structured file directory, and the output is a list of document clusters organized by topic, the files in each document cluster having similar content, as shown in Fig. 1.

The present invention groups files using the file directory tree structure, file type and file name, first clusters within each file group, and then performs global clustering. Compared with applying a clustering algorithm directly to all files, this can significantly improve the computational efficiency of clustering. Let the total number of files be N, the number of file groups be D (so that each file group contains N/D files on average), the number of local clusters generated per file group be K, and the number of final clusters be M. Let the time complexity of the clustering algorithm be f(x), where x is the number of files input to the clustering. The upper bound on the time complexity of clustering files with the method of the present invention is then D·f(N/D) + f(DK), where D·f(N/D) is the time complexity of local cluster generation and f(DK) is an upper bound on the time complexity of global cluster generation. It is an upper bound because, during global cluster generation, the K local clusters obtained from the same file group need no similarity computation among themselves (they certainly do not belong to the same cluster). By contrast, clustering all files directly with a conventional method has time complexity f(N). In practice, a suitable file grouping strategy can make K << N/D and K << M, which accelerates the clustering. For example, with the k-means clustering algorithm the time complexity is f(x) = λTxM, where x is the number of files input to the clustering, T is the number of k-means iterations, λ is the time needed to compare two files, and M is the number of clusters produced (K in the local stage). Clustering all files directly then costs λTNM, whereas the method of the present invention costs D·λT(N/D)K + λT(DK)M = λTNK + λTDKM. Dividing by λTNM shows that λTNM > λTNK + λTDKM holds whenever 1 > K/M + DK/N, in which case file clustering is accelerated and the efficiency improvement ratio is 1 - (K/M + DK/N).
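The k-means cost comparison above can be checked numerically. The helper below simply evaluates the three expressions from the text (λTNM, λTNK + λTDKM and 1 - (K/M + DK/N)); the function name and the sample values of N, D, K and M are illustrative assumptions.

```python
def kmeans_costs(n, d, k, m, t=1, lam=1.0):
    """Return (direct cost, grouped cost, saving ratio) for the k-means example.

    n: total files, d: file groups, k: local clusters per group,
    m: final clusters, t: k-means iterations, lam: time per file comparison.
    """
    direct = lam * t * n * m                 # cluster all N files into M clusters
    local = d * (lam * t * (n / d) * k)      # D groups of N/D files, K clusters each
    global_ = lam * t * (d * k) * m          # D*K local clusters into M clusters
    grouped = local + global_
    saving = 1 - (k / m + d * k / n)         # ratio stated in the text
    return direct, grouped, saving


direct, grouped, saving = kmeans_costs(n=10_000, d=100, k=2, m=50)
print(f"direct={direct:.0f} grouped={grouped:.0f} saving={saving:.2%}")
```

With these sample values the grouped cost is a small fraction of the direct cost, and the saving ratio computed from the closed form agrees with 1 - grouped/direct.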

After file clustering is complete, the user can browse files using the document clusters (that is, the global clusters) generated by this method. There are two browsing modes. The first provides a file organization based on a topic directory structure: the user first browses the document clusters and then selects one to browse the related files in it. The second provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in the cluster to which that file belongs, thereby implementing file recommendation. In both browsing modes, when the files of a cluster are displayed the user can sort them by file size, file creation time or file modification time to browse the most relevant ones first.

Brief description of the drawings

Fig. 1 is a schematic diagram of a tree-structured file organization and the topic-structured file organization obtained from it using the method of the present invention.

Fig. 2 is a flowchart of the clustering method for users' personal files according to an embodiment.

Fig. 3 is an example of a file directory tree structure.

Fig. 4 is a structural diagram of the clustering system for users' personal files according to an embodiment.

Detailed description

To make the above features and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

This embodiment provides a clustering method for users' personal files which, as shown in Fig. 2, comprises the following three steps:

1. File grouping

Files are grouped by exploiting the user's habits in saving similar files. The goal is to ensure that the files in each file group fall into as few clusters as possible and that no file appears in more than one file group.

The following grouping strategies may be used:

1) Grouping by the file directory tree structure:

Let NodeSet denote all directory nodes in the tree. For each directory node, compute its distance to the root node and its distance to the farthest leaf node, denoted X and Y respectively; for each directory node i, the two distances can then be written <X_i, Y_i>.

Files are then grouped using X and Y:

Method 1: group files using X. The user sets a distance parameter δ_X; the files contained in each directory node satisfying X_i > δ_X, i ∈ NodeSet, form a file group, which completes the grouping.

Taking Fig. 3 as an example, if the distance parameter δ_X is set to 2, the resulting file groups are {A0}, {A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
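Method 1 can be sketched with path arithmetic alone. The sketch assumes POSIX-style paths, takes X_i to be the number of directory levels between a directory and the root, and groups only the direct files of each qualifying directory. The text does not say what happens to files in directories with X_i ≤ δ_X, so the hypothetical `group_by_depth` returns them separately instead of guessing.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_depth(paths, root, delta_x):
    """Return (groups, leftover); groups maps a directory to its direct files."""
    root = PurePosixPath(root)
    groups, leftover = defaultdict(list), []
    for path in map(PurePosixPath, paths):
        directory = path.parent
        # X_i: distance from the file's directory to the root node.
        depth = len(directory.relative_to(root).parts)
        if depth > delta_x:
            groups[str(directory)].append(str(path))
        else:
            # Assumption: files in shallower directories are left ungrouped.
            leftover.append(str(path))
    return dict(groups), leftover
```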

Method 2: group files using Y. The user sets a distance parameter δ_Y. The procedure is as follows:

The algorithm consists of six steps: (1) based on the Y values of the directory nodes, select the directory node whose Y value is greater than δ_Y and closest to δ_Y; (2) take the direct files contained in the selected directory node as a file group; (3) delete the subtree rooted at the selected directory node from the directory tree; (4) update the Y values of all directory nodes in the directory tree; (5) repeat steps (1)-(4) until no directory node has a Y value greater than δ_Y; (6) take all files contained in the remaining directory nodes as a separate file group.

Again taking Fig. 3 as an example, if the distance parameter δ_Y is set to 2, the resulting file groups are {A0, A1.1}, {A2.1, A3.1, A3.2}, {B1.1} and {C1.1, C1.2}.
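The six steps of Method 2 can be sketched as a loop over a small in-memory tree. Several points are assumptions rather than statements from the text: Y is measured in directory levels (a directory without subdirectories has Y = 0), directory names are unique, and ties between equally close nodes are broken by name. The text groups the direct files of the selected node and then deletes its subtree without saying what becomes of files in nested subdirectories; the hypothetical `include_nested` flag keeps them in the same group by default, and setting it to False follows the literal wording.

```python
def group_by_height(children, files, root, delta_y, include_nested=True):
    """children: dir -> list of subdirs; files: dir -> list of direct files."""
    children = {d: list(subs) for d, subs in children.items()}  # private copy

    def height(node):
        # Y value: distance from node to its farthest descendant directory.
        subs = children.get(node, [])
        return 1 + max(map(height, subs)) if subs else 0

    def subtree(node):
        yield node
        for sub in children.get(node, []):
            yield from subtree(sub)

    groups = []
    while True:
        # Steps (1) and (4): recompute Y for every remaining node and pick the
        # node whose Y exceeds delta_y by the smallest margin.
        candidates = [(height(n), n) for n in subtree(root)]
        candidates = [(y, n) for y, n in candidates if y > delta_y]
        if not candidates:          # step (5): stop when no node qualifies
            break
        _, chosen = min(candidates)
        # Step (2): form a group (nested files included only by assumption).
        members = list(subtree(chosen)) if include_nested else [chosen]
        groups.append([f for n in members for f in files.get(n, [])])
        # Step (3): delete the chosen subtree from the tree.
        if chosen == root:
            return groups
        for subs in children.values():
            if chosen in subs:
                subs.remove(chosen)
    # Step (6): all files of the remaining nodes form one final group.
    rest = [f for n in subtree(root) for f in files.get(n, [])]
    return groups + ([rest] if rest else [])
```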

Compared with Method 1, Method 2 avoids producing file groups that contain too many files. The user may choose either of the two methods, or combine them, to group files.

2) Grouping by file name and file type:

If the clustering goal requires the files in a cluster to be of the same type, the files can first be grouped by file type and then clustered by file name within each type, each resulting cluster serving as a file group. If the clustering goal does not require the files in a cluster to be of the same type, the files can be clustered directly by file name. Because file names are short, this clustering is fast.

There are two methods of clustering by file name:

Method 1: compute the edit distance between file names and place files whose edit distance is less than a specified threshold in one group; the threshold is set by the user as needed.
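A minimal sketch of the edit-distance variant follows. It uses the standard Levenshtein distance and, as an assumption, compares each name only with the first member of every existing group; the text does not specify how names are assigned to groups, and the threshold value is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def group_by_name(names, threshold):
    """Greedily group names whose distance to a group's first member is below threshold."""
    groups = []
    for name in names:
        for group in groups:
            if edit_distance(group[0], name) < threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


print(group_by_name(["report_v1.docx", "report_v2.docx", "budget.xlsx"], threshold=3))
```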

Method 2: segment each file name into words, represent the file name as a term vector, and compute the text distance between file names, such as the Jaccard distance (see https://baike.baidu.com/item/%E6%9D%B0%E5%8D%A1%E5%BE%B7%E8%B7%9D%E7%A6%BB/15416212Fr=aladdin) or the cosine distance (see https://zh.wikipedia.org/zh-cn/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7); files whose similarity exceeds a set threshold form one file group, the threshold being set by the user as needed.
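The token-based variant can be sketched in the same way. Splitting on non-alphanumeric characters is an assumption that suits Latin-script names; names written in Chinese would need a real word segmenter. The greedy comparison against each group's first member and the threshold value are likewise illustrative.

```python
import re


def tokenize(name: str) -> set:
    """Split a file name into lowercase alphanumeric tokens (assumed segmentation)."""
    return {t for t in re.split(r"[^0-9a-z]+", name.lower()) if t}


def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; the Jaccard distance is 1 minus this value."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def group_by_tokens(names, threshold):
    """Group names whose token similarity to a group's first member exceeds threshold."""
    groups = []
    for name in names:
        for group in groups:
            if jaccard_similarity(tokenize(group[0]), tokenize(name)) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```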

3) Grouping by combining the two strategies above:

Files may first be grouped with strategy 1) and each resulting group then regrouped with strategy 2); or files may first be grouped with strategy 2) and each resulting group then regrouped with strategy 1).

2. Local cluster generation

All files in each file group obtained in step 1 are clustered to obtain local clusters, the files in each local cluster having similar content. One file group may produce multiple local clusters.

3. Global cluster generation

Each local cluster obtained in step 2 is treated as a single file, and all local clusters are clustered to generate global clusters.

Specifically, each local cluster is treated as a single file with a unique representation. A local cluster may be represented by a single file in the cluster, for example by selecting the longest file as the representative of the cluster; or its representation may be computed from all files in the cluster, for example by concatenating the representations of all files in the cluster.
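Both options for representing a local cluster are simple to express. The sketch assumes each file is already available as a text string and that "longest" means the largest number of characters; the function names are hypothetical.

```python
def represent_by_longest(cluster_texts):
    """Use the longest file in the local cluster as the cluster's representative."""
    return max(cluster_texts, key=len)


def represent_by_concatenation(cluster_texts, separator="\n"):
    """Concatenate all files in the local cluster into one representative text."""
    return separator.join(cluster_texts)


drafts = ["budget draft", "budget draft with revised totals", "budget"]
print(represent_by_longest(drafts))
print(represent_by_concatenation(drafts, separator=" | "))
```

Choosing a single file keeps the representation as small as one file, while concatenation retains content from every member at the cost of a larger input to the global clustering step.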

The clustering in steps 2 and 3 involves file representation, file similarity calculation and running a clustering algorithm. The clustering algorithm may be the k-means clustering algorithm or a density-based clustering algorithm. The file representation and the file similarity calculation are chosen according to the requirement; the two common requirements are as follows:

1) The clustering goal is to bring together files on the same topic, and the contents of the files in a cluster may differ considerably. For example, the goal may be to bring together articles on "economy"; the articles in the cluster may be about "international economy" or about "local economic development", even though two such articles do not discuss the same thing. For this requirement, the file representation and file similarity calculation used for clustering are as follows:

File representation: vectorize the file, each feature dimension corresponding to a piece of the file's content. Text files can be represented with the vector space model (see https://en.wikipedia.org/wiki/Vector_space_model); image files can use an image feature extraction method or a neural-network-based feature extraction method.

File similarity calculation: compute the distance between files, such as the Jaccard distance or the cosine distance.
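For text files, the vector space representation and the cosine distance can be sketched with raw term counts. Whitespace tokenization and unweighted counts are simplifying assumptions; a practical system would more likely use proper word segmentation and a weighting scheme such as TF-IDF.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Represent a text file as a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def cosine_distance(u: Counter, v: Counter) -> float:
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


a = term_vector("international economy outlook")
b = term_vector("local economy development outlook")
print(round(cosine_distance(a, b), 3))
```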

2) The clustering goal is to bring together files of common origin. Files of common origin are files that can be transformed into one another by insertions, deletions and modifications; they are often the successive drafts a user produces on the way to a final file. For this requirement, the file representation and file similarity calculation used for clustering are as follows:

File representation: compute a hash value of the file using fuzzy hashing (see http://dx.doi.org/10.1016/j.diin.2006.06.015). Fuzzy hashing is a context triggered piecewise hashing (CTPH) algorithm, that is, a piecewise hash based on content-defined segmentation, which judges how similar files are by splitting each file into pieces.

File similarity calculation: compare the hash values of two files and convert the hash values into a similarity (see the fuzzy hashing algorithm at http://dx.doi.org/10.1016/j.diin.2006.06.015).

This embodiment also provides a clustering system for users' personal files which, as shown in Fig. 4, comprises:

a cluster calculation unit for clustering the user's files. The cluster calculation unit comprises: a batch file cluster calculation unit for performing the initial clustering of the user's personal files; and an incremental file cluster calculation unit for clustering subsequently added files, which applies the global cluster generation process once to the clusters obtained from the increment together with the existing clusters to obtain a merged result, thereby merging the incremental clustering result with the existing clustering result.

The user's personal files are clustered by the above steps. After file clustering is complete, the user can browse files using the global clusters generated by this method. There are two browsing modes. The first provides a file organization based on a topic directory structure: the user first browses the document clusters and then selects one to browse the related files in it. The second provides a file association function: the user browses files using the traditional tree-structured directory, and clicking a file displays the related files in the cluster to which that file belongs, thereby implementing file recommendation. In both browsing modes, when the files of a cluster are displayed the user can sort them by file size, file creation time or file modification time to browse the most relevant ones first.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as set out in the claims.

Claims

1. A clustering method for users' personal files, comprising the steps of:

grouping the user's files to obtain multiple file groups;

clustering the files in each file group to obtain one or more local clusters, the files in each local cluster having similar content;

treating each local cluster as a single file and clustering all local clusters to generate global clusters.

2. The method according to claim 1, characterized in that the user's files are grouped according to the file directory tree structure, the file type and the file name.

3. The method according to claim 2, characterized in that the method of grouping the user's files according to the file directory tree structure comprises:

Method one: let NodeSet denote all directory nodes in the file directory tree and X_i denote the distance from directory node i to the root node; set a distance threshold parameter δ_X; the direct files contained in each directory node satisfying X_i > δ_X form a separate file group;

Method two: let NodeSet denote all directory nodes in the file directory tree and Y_i denote the distance from directory node i to its farthest leaf node; set a distance threshold parameter δ_Y; the method comprises the steps of: based on the Y values of the directory nodes, selecting the directory node whose Y value is greater than δ_Y and closest to δ_Y; taking the direct files contained in that directory node as a separate file group; deleting the subtree rooted at that directory node from the directory tree; updating the Y values of all directory nodes in the directory tree; repeating the above steps until no directory node has a Y value greater than δ_Y; and taking all files contained in the remaining directory nodes as a separate file group.

4. The method according to claim 2, characterized in that the method of grouping the user's files according to file type and file name comprises: if the clustering goal requires the files in a cluster to be of the same type, first grouping the files by file type and then clustering by file name within each group, each resulting cluster serving as a file group; if the clustering goal does not require the files in a cluster to be of the same type, clustering the files directly by file name, each resulting cluster serving as a file group.

5. The method according to claim 3, characterized in that the method of clustering by file name comprises:

Method one: computing the edit distance between file name strings, and placing files whose edit distance is less than a set threshold in one file group;

Method two: segmenting each file name into words, representing the file name as a term vector, computing the text distance between file names, and placing files whose text distance is greater than a set threshold in one file group, the text distance being the Jaccard distance or the cosine distance.

6. The method according to claim 1, characterized in that clustering files or local clusters comprises: first computing a file representation, then computing file similarity, and finally running a clustering algorithm.

7. The method according to claim 6, characterized in that the file representation method comprises:

Method one: vectorizing the file, each feature dimension of the vector corresponding to a part of the file, wherein text files may be represented with the vector space model and image files may use image features or a neural-network-based feature extraction method;

Method two: computing a hash value of the file using a fuzzy hashing method, and representing the file by the hash value.

8. The method according to claim 6, characterized in that the file similarity calculation method comprises:

Method one: computing the Jaccard distance or the cosine distance between files;

Method two: comparing the hash values of two files and converting the hash values into a similarity.

9. The method according to claim 6, characterized in that the clustering algorithm includes the k-means clustering algorithm and density-based clustering algorithms.

10. A clustering system for users' personal files, comprising:

a cluster result searching unit for returning the files that belong to a given cluster and the cluster to which a given file belongs;

a cluster calculation unit for clustering the user's files, comprising:

an incremental file cluster calculation unit for clustering subsequently added files and merging the incremental clustering result with the existing clustering result.