CN108399213B

CN108399213B - User-oriented personal file clustering method and system

Info

Publication number: CN108399213B
Application number: CN201810112624.0A
Authority: CN
Inventors: 李鹏; 王斌; 齐保元; 周美林; 郭莉; 梅钰
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-02-05
Filing date: 2018-02-05
Publication date: 2022-04-01
Anticipated expiration: 2038-02-05
Also published as: CN108399213A

Abstract

The invention provides a user-oriented personal file clustering method, which comprises the following steps: grouping the user's files according to the user's habits in storing similar files to obtain a plurality of file groups; clustering the files in each file group to obtain one or more local clusters, wherein the file contents in each local cluster are similar; and regarding each local cluster as a file, clustering all the local clusters to generate a global cluster. The invention also provides a user-oriented personal file clustering system which comprises a clustering calculation unit, a clustering result storage unit and a clustering result searching unit, wherein the clustering calculation unit comprises a batch file clustering calculation unit and an incremental file clustering calculation unit.

Description

User-oriented personal file clustering method and system

Technical Field

The invention relates to the field of personal file management and enables user files to be clustered by content; in particular, it relates to file clustering based on the usage habits of users.

Background

With the popularization of computer office applications, a user edits and generates a large number of files (such as WORD/WPS/PDF files) at work. In existing file systems, files are stored in the computer in a tree structure, which differs markedly from the way people remember files: human memory of documents is usually organized by task (or topic). One consequence of this difference is that the user must devote a great deal of effort to subsequently searching for and managing documents; searching becomes even more difficult when multiple versions on the same topic are stored in different locations.

At present, the field of Personal Information Management studies how to search information generated by a user, and various companies have developed search systems for user files, such as Google Desktop Search and Baidu Desktop Search. However, these tools all perform keyword-based file search and do not solve the problem of converting "file organization based on a tree structure" into "file organization based on a topic structure".

Organizing files by topic may use content-based clustering, where all files are directly input to a clustering algorithm, and each cluster (set of files) resulting from clustering is used as a file topic. Directly processing files ignores the user's saving habits of similar files (e.g., files under the same directory are more likely to belong to the same cluster), while significantly increasing the amount of unnecessary computation.

Disclosure of Invention

The invention provides a user-oriented personal file clustering method and a user-oriented personal file clustering system.

In order to achieve the purpose, the invention adopts the following technical scheme:

a user-oriented personal file clustering method comprises the following steps:

(1) grouping files: grouping the user's files according to the user's habits in storing similar files to obtain a plurality of file groups;

(2) local cluster generation: clustering the files in the file group to obtain one or more local clusters, wherein the file content in each local cluster is similar;

(3) global cluster generation: regarding each local cluster as a file, clustering all the local clusters to generate a global cluster.

Further, the user files are grouped by file tree directory structure, file type, file name, etc.

Further, the method for grouping the user files according to the file tree directory structure comprises the following steps: let NodeSet denote all directory nodes in the file directory tree, let X_i denote the distance from directory node i to the root node, and set a distance threshold parameter δ_X; the files contained in each directory node i ∈ NodeSet satisfying the condition X_i > δ_X are taken as a single file group.

Further, the method for grouping the user files according to the file tree directory structure comprises the following steps: let NodeSet denote all directory nodes in the file directory tree, let Y_i denote the distance from directory node i to the farthest leaf node, and set a distance threshold parameter δ_Y; then:

(1) based on the Y values of the directory nodes, selecting the directory node whose Y value is greater than δ_Y and closest to δ_Y;

(2) taking the direct file contained in the selected directory node as a single file group;

(3) deleting the subtree corresponding to the selected directory node from the whole directory tree;

(4) updating Y values of all directory nodes in the directory tree;

(5) repeating steps (1) - (4) until no directory node has a Y value greater than δ_Y;

(6) All files included in the remaining directory nodes are treated as a single file group.

Further, a direct file of a directory node is a file under that node whose depth in the directory tree differs from the depth of the node by 1.

Further, the method for grouping the user files according to the file types and the file names comprises the following steps: if the file clustering target requires that the file types within a cluster be the same, the files can first be grouped according to the file types and then clustered within each group based on the file names, each generated cluster being used as a file group; if the file clustering target does not require the file types within a cluster to be the same, the files can be directly clustered based on the file names, each generated cluster being used as a file group.

Further, the file name-based clustering method comprises the following steps: calculating the edit distance between the file name strings, and placing files whose edit distance is smaller than a set threshold value in one file group.

Further, the file name-based clustering method comprises the following steps: segmenting the file names into words, representing each file name as a term vector, calculating the text distances between the file names, and placing files whose text distance is larger than a set threshold value in one file group, wherein the text distance is a Jaccard distance or a cosine distance.

Furthermore, the method for clustering the files comprises the steps of file representation, file similarity calculation and operation of a clustering algorithm.

Further, in the file representation of the global cluster generation stage, each local cluster is regarded as a file with a single file representation: either a single file in the local cluster is selected to represent it (e.g., the representation of the longest file in the local cluster is used as the local cluster's file representation), or a representation is computed from the representations of all files in the local cluster (e.g., the representations of all files in the local cluster are concatenated to form the local cluster's file representation).

Further, the file representation method comprises the following steps: vectorizing the file, wherein each dimension feature of the vector corresponds to a part of the file; a text file can be represented using a vector space model, and an image file can be represented using an image feature extraction method or a neural-network-based feature extraction method.

Further, the file representation method comprises the following steps: calculating the hash value of the file using a fuzzy hash method, the file being represented by its hash value.

Further, the file similarity calculation method comprises the following steps: calculating the Jaccard distance or cosine distance between files.

Further, the file similarity calculation method comprises the following steps: comparing the hash values of the two files and converting the comparison result into a similarity.

Further, the clustering algorithm comprises a kmeans clustering algorithm and a density-based clustering algorithm.

A user-oriented personal file clustering system, comprising:

the cluster calculating unit is used for clustering the user files;

the clustering result storage unit is used for storing clustering results and reserving the mapping relation between the files and the class clusters;

and the clustering result searching unit is used for realizing the functions of returning the file according to the class cluster and returning the class cluster according to the file.
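The storage and searching units described above keep a mapping between files and clusters that can be queried in both directions. A minimal sketch of such a mapping follows; the class name `ClusterIndex`, its method names, and the sample paths and cluster labels are hypothetical and serve only as an illustration, not as the patented implementation.

```python
from collections import defaultdict


class ClusterIndex:
    """Keeps the file <-> cluster mapping in both directions (illustrative only)."""

    def __init__(self):
        self._cluster_of = {}              # file path -> cluster id
        self._files_in = defaultdict(set)  # cluster id -> set of file paths

    def assign(self, path, cluster_id):
        # If the file already belongs to a cluster, remove it from that cluster first.
        old = self._cluster_of.get(path)
        if old is not None:
            self._files_in[old].discard(path)
            if not self._files_in[old]:
                del self._files_in[old]
        self._cluster_of[path] = cluster_id
        self._files_in[cluster_id].add(path)

    def files_of(self, cluster_id):
        """Return the files of a cluster (lookup by cluster)."""
        return sorted(self._files_in.get(cluster_id, ()))

    def cluster_of(self, path):
        """Return the cluster of a file, or None if unknown (lookup by file)."""
        return self._cluster_of.get(path)


# Hypothetical sample data.
index = ClusterIndex()
index.assign("docs/report_v1.doc", "budget")
index.assign("mail/report_v2.doc", "budget")
index.assign("docs/trip.pdf", "travel")
```

Keeping two dictionaries costs a little extra memory but makes both lookups direct, which suits the two query functions the searching unit is said to provide.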

Further, the cluster calculating unit includes:

the batch file clustering calculation unit is used for clustering the personal files for the first time;

and the incremental file clustering calculation unit is used for clustering subsequent incremental files and combining the incremental clustering result with the existing clustering result.

Further, merging the incremental clustering result with the existing clustering result means that the clusters obtained by incremental clustering and the existing clusters are clustered once using the global cluster generation process to obtain a merged result.

The input of the method is a file directory with a tree structure, the output is a file cluster list organized according to the subject, and the file contents in each file cluster are similar, as shown in fig. 1.

The invention groups files using the file tree directory structure, the file type and the file name, first clusters the files within each file group, and then performs global clustering. Compared with directly applying a clustering algorithm to all files, this can markedly improve the efficiency of the clustering computation. Assume that the total number of files is N, the number of file groups is D, the average number of files in each file group is N/D, the number of local clusters generated by each file group is K, and the number of clusters in the final clustering is M. Assume that the time complexity of the clustering algorithm used is f(x), where x is the number of files input for clustering. The upper limit of the time complexity of file clustering using the method of the invention is then D·f(N/D) + f(DK), where D·f(N/D) is the time complexity of local cluster generation and f(DK) is the upper limit of the time complexity of global cluster generation. The latter is called an upper limit because the K local clusters obtained from the same file group are not subjected to similarity calculation with one another during global cluster generation (they do not necessarily belong to the same cluster). The time complexity of directly clustering all files using the conventional method is f(N). In practical applications, K << N/D and K << M can be achieved by reasonably setting the file grouping strategy, so that the clustering is accelerated.
For example, using the Kmeans clustering algorithm, the time complexity is f(x) = λTxM, where x is the number of files input for clustering, M is the number of clusters produced (K in the local stage), T is the number of Kmeans iterations, and λ is the time of one file comparison. The time complexity of directly clustering all files is λTNM, while that of the method of the invention is λTNK + λTDKM. λTNM is greater than λTNK + λTDKM whenever 1 > K/M + DK/N, so the file clustering is accelerated, and the efficiency improvement rate is 1 - (K/M + DK/N).
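The cost comparison above can be checked numerically. The sketch below plugs hypothetical values of N, D, K, M and T (chosen only for illustration, not taken from the patent) into the Kmeans cost model λ·T·x·(number of clusters) and compares the result with the closed-form improvement rate 1 - (K/M + DK/N).

```python
def kmeans_cost(x, clusters, iterations, compare_time=1.0):
    """Cost model from the text: lambda * T * x * (number of clusters)."""
    return compare_time * iterations * x * clusters


def two_stage_cost(n, d, k, m, iterations, compare_time=1.0):
    # Local stage: D groups of N/D files each, K clusters per group -> lambda*T*N*K
    local_cost = d * kmeans_cost(n / d, k, iterations, compare_time)
    # Global stage: D*K local clusters, M final clusters -> lambda*T*D*K*M
    global_cost = kmeans_cost(d * k, m, iterations, compare_time)
    return local_cost + global_cost


# Hypothetical parameter values, chosen only for illustration.
N, D, K, M, T = 10_000, 100, 5, 50, 20

direct = kmeans_cost(N, M, T)                  # lambda*T*N*M
staged = two_stage_cost(N, D, K, M, T)         # lambda*T*N*K + lambda*T*D*K*M
improvement = 1 - staged / direct              # measured improvement rate
closed_form = 1 - (K / M + D * K / N)          # formula from the text

print(f"direct={direct:.0f} staged={staged:.0f} improvement={improvement:.2f}")
```

With these values K/M = 0.1 and DK/N = 0.05, so the two-stage procedure costs 15% of direct clustering under this model.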

After the file clustering is completed, the user can browse files using the file clusters (namely, the global clusters) generated by the method. There are two browsing modes. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects a cluster to browse the related files in it. The other provides a file association function: the user browses files through the traditional tree-structured directory and clicks a file to display the related files of the cluster to which it belongs, thereby realizing file recommendation. When cluster files are displayed in either browsing mode, the user can sort the files by file size, creation time or modification time, thereby realizing prioritized browsing.

Drawings

FIG. 1 is a schematic diagram of document organization based on a subject structure obtained by the method of the present invention.

FIG. 2 is a flowchart of an embodiment of a user-oriented personal file clustering method.

FIG. 3 is a sample diagram of a file tree directory structure.

FIG. 4 is a diagram of an embodiment of a clustering system for user-oriented personal files.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The embodiment provides a user-oriented personal file clustering method, as shown in fig. 2, which includes the following 3 steps:

1. grouping of files

Files are grouped according to the user's habits in storing similar files. The aim is for the files in each file group to fall into as few clusters as possible, with no file appearing in more than one file group.

The following grouping strategy may be used:

1) grouping using a file tree directory structure:

Let NodeSet denote all directory nodes in the tree. For each directory node, calculate the "distance to the root node" and the "distance to the farthest leaf node", denoted X and Y respectively; for each directory node i, the two distances can then be written as <X_i, Y_i>.
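The pair <X_i, Y_i> can be computed in a single traversal of the directory tree. The sketch below assumes that every edge has length 1 and that a leaf is a directory node with no subdirectories; the tree and its node names are hypothetical and are not the tree of FIG. 3.

```python
def node_distances(children, root):
    """Return {node: (X, Y)} with X = distance to the root and
    Y = distance to the farthest leaf below the node.

    Assumptions: unit-length edges; a leaf is a directory without subdirectories.
    """
    distances = {}

    def visit(node, depth):
        height = 0
        for child in children.get(node, []):
            # The farthest leaf below `node` is one edge further than below `child`.
            height = max(height, 1 + visit(child, depth + 1))
        distances[node] = (depth, height)
        return height

    visit(root, 0)
    return distances


# Hypothetical directory tree: directory -> list of subdirectories.
tree = {"root": ["A", "B"], "A": ["A1"], "A1": ["A2"], "B": []}
distances = node_distances(tree, "root")
```

After a subtree is deleted, calling the function again on the reduced tree gives the updated Y values.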

The following grouping of files is done using X, Y:

Method 1: grouping of files is done using X. The user sets a distance parameter δ_X, and the files contained in each directory node i ∈ NodeSet satisfying the condition X_i > δ_X are taken as a file group, which completes the grouping.

Taking FIG. 3 as an example, assume that the distance parameter δ_X is set to 2; the resulting file groups are {A0}, {A1.1}, {A2.1, A3.1, A3.2}, {B1.1}, and {C1.1, C1.2}.

Method 2: grouping of files is performed using Y. The user sets the distance parameter δ_Y.

The algorithm consists of the following steps: (1) based on the Y values of the directory nodes, select the directory node whose Y value is greater than δ_Y and closest to δ_Y; (2) take the direct files contained in the selected directory node as a file group; (3) delete the subtree corresponding to the selected directory node from the whole directory tree; (4) update the Y values of all directory nodes in the directory tree; (5) repeat steps (1) - (4) until no directory node has a Y value greater than δ_Y; (6) take all files included in the remaining directory nodes as a single file group.

Still taking FIG. 3 as an example, assume that the distance parameter δ_Y is set to 2; the resulting file groups are {A0, A1.1}, {A2.1, A3.1, A3.2}, {B1.1}, and {C1.1, C1.2}.

Compared to method 1, method 2 can avoid some file groups that contain too many files. The user can select either one of the above two methods or combine the two methods for grouping files.

2) Grouping using file name, file type:

If the clustering target requires the file types within a cluster to be the same, the files can first be grouped by file type and then clustered within each type based on the file names, each generated cluster being used as a file group. If the clustering target does not require the file types within a cluster to be the same, the files can be directly clustered based on the file names. Because file names are short, this clustering is very fast.

There are 2 methods for file name based clustering:

Method 1: calculate the edit distance between file names and take files whose edit distance is smaller than a specified threshold as one group; the threshold is set by the user as needed.

Method 2: segment the file names into words, represent each file name as a term vector, and calculate the text similarity between file names from a text distance such as the Jaccard distance (see https://baike.baidu.com/item/%E6%9D%B0%E5%8D%A1%E5%BE%B7%E8%B7%9D%E7%A6%BB/15416212?fr=aladdin) or the cosine distance (see https://zh.wikipedia.org/zh-cn/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7); files whose similarity is greater than a set threshold are taken as one group, the threshold being set by the user as needed.
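The two file-name methods can be sketched as follows. Several details are assumptions made for illustration: the greedy rule that compares each name with the first member of an existing group, the regular-expression tokenizer (Chinese file names would need a real word segmenter), and the sample file names and threshold.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]


def group_by_edit_distance(names, threshold):
    """Method 1 (assumed greedy variant): a name joins the first group whose
    first member is closer than `threshold`; otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if edit_distance(name, group[0]) < threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def tokens(name):
    """Method 2, step 1: split a file name into lowercase word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard_distance(a, b):
    """Method 2, step 2: Jaccard distance between two token sets."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


# Hypothetical file names.
names = ["report_v1.doc", "report_v2.doc", "report_final.doc", "budget.xls"]
groups = group_by_edit_distance(names, threshold=3)
distance = jaccard_distance(tokens("Project_Plan_2018.docx"),
                            tokens("project-plan-draft.docx"))
```

In the second example the two names share three of five distinct tokens, so the Jaccard similarity is 0.6 and the distance is 0.4.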

3) Grouping is carried out by combining the 2 grouping strategies:

the strategy 1) can be used for grouping, and the strategy 2) can be used for grouping again for each obtained group; or grouping by using the strategy 2) and then grouping by using the strategy 1) for each obtained group.

2. Local cluster generation

All the files in each file group obtained in step 1 are clustered to obtain local clusters, the file contents within each local cluster being similar. One file group may generate a plurality of local clusters.

3. Global cluster generation

Each local cluster obtained in step 2 is regarded as a file, and all the local clusters are clustered to generate a global cluster.

Specifically, each local cluster is treated as a file with a single representation. A local cluster can be represented by a single file in the cluster, for example by selecting the longest file as the cluster's representation; alternatively, the representation can be computed from the representations of all files in the cluster, for example by concatenating them.
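The local and global stages can be sketched end to end as follows. This is an illustration under stated assumptions rather than the patented procedure: files are represented as token sets, a simple greedy threshold clustering stands in for the kmeans or density-based algorithm, and a local cluster is represented by the union of its members' tokens (a set analogue of concatenating their representations). For simplicity the global stage also compares local clusters that come from the same group. All file names and tokens are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(items, threshold):
    """Stand-in clustering: items are (label, token_set) pairs; an item joins the
    first cluster whose first member is at least `threshold` similar to it."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if jaccard(item[1], cluster[0][1]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def two_stage(groups, threshold):
    # Stage 1: local clusters inside each file group.
    local = []
    for group in groups:
        local.extend(greedy_cluster(group, threshold))
    # Treat each local cluster as one pseudo-file: the label is the tuple of
    # member names, the representation is the union of the member token sets.
    pseudo = [(tuple(name for name, _ in cluster),
               frozenset().union(*(toks for _, toks in cluster)))
              for cluster in local]
    # Stage 2: cluster the pseudo-files to obtain the global clusters.
    merged = greedy_cluster(pseudo, threshold)
    return [sorted(name for names, _ in cluster for name in names)
            for cluster in merged]


# Hypothetical file groups (for example, one per directory).
groups = [
    [("a/plan1", {"budget", "plan", "2018"}),
     ("a/plan2", {"budget", "plan", "2018", "draft"}),
     ("a/trip", {"travel", "photo"})],
    [("b/plan3", {"budget", "plan", "2018"}),
     ("b/pics", {"travel", "photo", "beach"})],
]
global_clusters = two_stage(groups, threshold=0.5)
```

Here the two planning files in the first group merge locally, and the global stage then joins them with the planning file from the second group, so files stored in different directories end up under the same topic.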

The clustering processes of step 2 and step 3 involve file representation, file similarity calculation, and running a clustering algorithm. The clustering algorithm can be a kmeans clustering algorithm or a density-based clustering algorithm. The file representation and the file similarity calculation are chosen according to requirements, which commonly fall into the following two types:

1) The clustering target is to cluster files of the same topic together, and the contents of the files within a cluster may differ considerably. For example, if it is desired to bring together articles of the "economy" class, the articles in the cluster may be of the "international economy" class or of the "local economic development" class, even though these two kinds of articles do not discuss the same content. For this requirement, the file representation and file similarity calculation used for clustering are as follows:

File representation: the file is vectorized, with each dimension feature corresponding to one piece of file content; a text file can be represented using a vector space model (refer to https://en.wikipedia.org/wiki/Vector_space_model), and an image file can use an image feature extraction method or a neural-network-based feature extraction method.

File similarity calculation: the distance between files is calculated, such as the Jaccard distance or the cosine distance.

2) The clustering target is to cluster homologous files together. Homologous files are files that can be converted into one another by additions, deletions and modifications; they are often the successive drafts a user produces in the course of completing a final file. For this requirement, the file representation and file similarity calculation used for clustering are as follows:

File representation: the file hash value is calculated using a fuzzy hash method (refer to http://dx.doi.org/10.1016/j.diin.2006.06.015). Fuzzy hashing is based on context triggered piecewise hashing (CTPH), which judges the degree of similarity between files by splitting each file into pieces.

File similarity calculation: the hash values of the two files are compared and converted into a similarity (refer to the fuzzy hash algorithm, http://dx.doi.org/10.1016/j.diin.2006.06.015).
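For the first requirement above (topic clustering), the vector space representation and the cosine measure can be sketched with the standard library alone. The whitespace tokenizer, the raw term-frequency weights and the two sample texts are assumptions made for illustration; a real system would typically add word segmentation and a weighting scheme such as TF-IDF.

```python
import math
from collections import Counter


def term_vector(text):
    """Vector space model with raw term frequencies (assumed whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical file contents.
a = term_vector("international economy report economy")
b = term_vector("local economy development report")

similarity = cosine_similarity(a, b)
cosine_distance = 1.0 - similarity
```

The two texts share the terms "economy" and "report", so they receive a moderate similarity even though their remaining terms differ, which matches the topic-level grouping described in requirement 1).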

The embodiment also provides a user-oriented personal file clustering system, as shown in fig. 4, including:

a cluster calculating unit, used for clustering the user files; the cluster calculating unit includes: a batch file clustering calculation unit, used for clustering the personal files for the first time; and an incremental file clustering calculation unit, used for clustering subsequent incremental files and for clustering the clusters obtained from the incremental files together with the existing clusters once, using the global cluster generation process, so as to merge the incremental clustering result with the existing clustering result;

The user's personal files are clustered through the above steps. After the clustering is completed, the user can browse files using the global clusters generated by the method. There are two browsing modes. One provides a file organization based on a topic directory structure: the user first browses the file clusters and then selects a cluster to browse the related files in it. The other provides a file association function: the user browses files through the traditional tree-structured directory and clicks a file to display the related files of the cluster to which it belongs, thereby realizing file recommendation. When cluster files are displayed in either browsing mode, the user can sort the files by file size, creation time or modification time, thereby realizing prioritized browsing.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A user-oriented personal file clustering method comprises the following steps:

grouping user files according to a file tree directory structure, grouping the user files according to file types and file names, and combining the two grouping strategies to obtain a plurality of file groups; the method for grouping the user files according to the file tree directory structure comprises: method one: let NodeSet denote all directory nodes in the file directory tree, let X_i denote the distance from directory node i to the root node, and set a distance threshold parameter δ_X; the direct files contained in each directory node satisfying the condition X_i > δ_X are used as a single file group; or method two: let NodeSet denote all directory nodes in the file directory tree, let Y_i denote the distance from directory node i to the farthest leaf node under that directory node, and set a distance threshold parameter δ_Y; then: based on the Y values of the directory nodes, selecting the directory node whose Y value is greater than δ_Y and closest to δ_Y; taking the direct files contained in the directory node as a single file group; deleting the subtree corresponding to the directory node from the whole directory tree; updating the Y values of all directory nodes in the directory tree; repeating the above steps until no directory node has a Y value greater than δ_Y; and using all files included in the remaining directory nodes as a single file group; the method for grouping the user files according to the file types and the file names comprises: if the file clustering target requires that the file types within a cluster be the same, the files can first be grouped according to the file types and then clustered within each group based on the file names, each generated cluster being used as a file group; if the file clustering target does not require the file types within a cluster to be the same, the files can be directly clustered based on the file names, and each generated cluster is used as a file group;

clustering the files in the file group to obtain one or more local clusters, wherein the file content in each local cluster is similar;

and regarding each local cluster as a file, clustering all the local clusters to generate a global cluster.

2. The method of claim 1, wherein the method for file name based clustering comprises:

method one: calculating the edit distance between the file name strings, and placing files whose edit distance is smaller than a set threshold in one file group;

method two: segmenting the file names into words, representing each file name as a term vector, calculating the text distances between the file names, and placing files whose text distance is larger than a set threshold value in one file group, wherein the text distance is a Jaccard distance or a cosine distance.

3. The method of claim 1, wherein the step of clustering files or local clusters comprises: first representing the files, then calculating the similarity of the files, and finally running a clustering algorithm.

4. The method of claim 3, wherein the method of file representation comprises:

method one: vectorizing the file, wherein each dimension feature of the vector corresponds to a part of the file; a text file can be represented using a vector space model, and an image file can be represented using an image feature extraction method or a neural-network-based feature extraction method;

method two: calculating the hash value of the file using a fuzzy hash method, the file being represented by its hash value.

5. The method according to claim 3, wherein the document similarity calculation method includes:

method one: calculating the Jaccard distance or cosine distance between the files;

method two: comparing the hash values of the two files and converting the comparison result into a similarity.

6. The method of claim 3, wherein the clustering algorithm comprises a kmeans clustering algorithm and a density-based clustering algorithm.

7. A user-oriented personal file clustering system for implementing the method of any one of claims 1 to 6, the system comprising:

the clustering result searching unit is used for realizing the functions of returning files according to the clusters and returning the clusters according to the files;

the cluster calculating unit is used for clustering the user files and comprises: